Choosing the Right Data Quality Monitoring Solution
May 18, 2022
I am starting this article with an opening you might roll your eyes at, but bear with me because it helps drive home a point: Unity recently reported an estimated $110 million loss in 2022 due to data quality issues. You can have a number of teams executing well and be on the path to profitability, yet one miss in your ML systems can spoil all of that positive momentum.
The whole point of this article series on data quality monitoring is to help you understand that bad data is bad business: it makes it practically impossible to make good business judgments.
As I discussed in the previous article in this series, the best way to ensure high-quality (or “good”) data is to monitor the data for issues throughout the data system. To do this, you need to choose a tool to monitor the quality of your data, which is a crucial part of your data quality monitoring approach.
The goal of any optimal data quality monitoring solution should be to help you, a data whiz, find and fix data quality issues quickly.
In this article, we will take a look at:
- What an ideal data quality monitoring solution is,
- Several open source and SaaS data quality monitoring tools,
- Making a decision: Should you build or buy a data quality monitoring tool?
Given the array of data quality tools available in the market, this article focuses on tools that are either specifically built for data quality monitoring or have monitoring components as one of their features.
An ideal data quality monitoring solution
What should an ideal monitoring solution have? First, let’s take a page from the DevOps criteria for good observability tools.
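A good data quality monitoring solution should have the following properties, which the tool breakdowns later in this article label as self-service, dynamic, collaborative, holistic, automatable, and privacy-preserving: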
It should not be hard for anyone on the team to use; this includes the visualization component and setting up monitors. The tool should take the onboarding configuration, data source integration, and maintenance burden off the team or engineer working with it. It is also valuable for the tool to have smart defaults and be configurable programmatically.
Ideally, the tool should automatically handle tasks that make monitoring more manageable, such as finding suitable baselines and suggesting data quality indicators for common data quality issues. Some solutions also offer ML-assisted services for monitoring hard-to-detect issues, making the process easier.
When a tool is self-service, it becomes much easier for the people on the team to adopt a culture of monitoring.
The tool should be scalable to meet growing data volume and increased cardinality. It should also be generalizable to different use cases and work with various architectures for your data stack. A generalizable tool allows you to write custom data quality indicators (DQIs) for business-specific use cases.
The tool should be collaborative because it serves entire data teams and organizations, not just data engineers. You should easily be able to share data, insights, dashboards, and reports with others via Slack channels (or other team chats), accurate alerts, emails, and other information radiators.
The tool should provide a holistic view of your data, including end-to-end data lineage tracking, from the data consumer to the data transformation process and the data producer. You should also have the ability to see the health of the overall data stack to get a picture of failures that occur outside your data that may affect data quality.
The tool should allow you to automate the monitoring process and workflow programmatically. This includes writing automation scripts, YAML configuration, and queries, and performing real-time analysis to turn insights into action without manual intervention.
It is important that the tool offers some form of protection against external attacks to keep your data safe. In most cases, these tools hold one or more global or regional security and compliance certifications.
The tool should take snapshots of statistical information about your data before and after changes; in other words, it should offer something like source control management for your data. For example, what was the schema before a transformation change at a specific time? Did the rate at which data streamed into the system change?
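To make the snapshot idea concrete, here is a minimal sketch in plain Python. It is not taken from any of the tools discussed in this article; the `Snapshot` class, the `diff_snapshots` function, and the sample values are all hypothetical and only illustrate how two point-in-time summaries could answer the two questions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical point-in-time summary of one dataset."""
    schema: dict            # column name -> type name
    rows_per_minute: float  # observed ingestion rate


def diff_snapshots(before: Snapshot, after: Snapshot) -> dict:
    """Report schema and ingestion-rate changes between two snapshots."""
    added = sorted(set(after.schema) - set(before.schema))
    removed = sorted(set(before.schema) - set(after.schema))
    retyped = sorted(
        col for col in set(before.schema) & set(after.schema)
        if before.schema[col] != after.schema[col]
    )
    # Relative change in ingestion rate; undefined if the old rate was zero.
    if before.rows_per_minute == 0:
        rate_change = None
    else:
        rate_change = (
            after.rows_per_minute - before.rows_per_minute
        ) / before.rows_per_minute
    return {
        "added": added,
        "removed": removed,
        "retyped": retyped,
        "rate_change": rate_change,
    }


# Made-up snapshots taken before and after a transformation change.
before = Snapshot({"user_id": "int", "amount": "float"}, rows_per_minute=1200.0)
after = Snapshot(
    {"user_id": "str", "amount": "float", "currency": "str"},
    rows_per_minute=900.0,
)
report = diff_snapshots(before, after)
print(report)
```

In a real system the snapshots would be stored with a timestamp so that any two points in time can be compared, much like two commits in source control.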
Data quality monitoring tools
In this breakdown of available data quality tools, we focus on solutions that are either specifically built for data quality monitoring or have monitoring components as one of their features. Some of these tools also focus on other aspects of data quality management beyond monitoring (such as profiling, logging, and testing).
Open source tooling
Several open source tools are available for general data quality management, some of which Sam Bail discussed in this video. Of course, you can choose to host these solutions on your infrastructure, but some of them have Cloud offerings available. Let’s look at a few of them:
whylogs
whylogs is the only library that enables logging, testing, and monitoring of a data or ML application without the need for raw data to leave the user’s environment. It automatically creates statistical summaries of datasets, called profiles, which have similar properties to the logs produced by other software applications.
whylogs profiles have three properties that make them ideal for data logging and monitoring: they are descriptive, lightweight, and mergeable.
Some of the features of whylogs include:
- Accurate data profiling: It calculates statistics from 100% of the data, never requiring sampling, ensuring an accurate representation of data distributions.
- Lightweight runtime: It utilizes approximate statistical methods to achieve a minimal memory footprint that scales with the number of features in the data.
- Works with any architecture and data stack: It scales with your system, from local environments to live production systems in multi-node clusters. It also works well with batch and streaming architectures and your data stack.
- Configuration-free: It infers the data schema, requiring zero manual configuration to get started.
- Small storage footprint: It turns data batches and streams into statistical fingerprints, 10-100MB uncompressed.
- Unlimited metrics: It collects all possible statistical metrics about structured or unstructured data.
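The "mergeable" property is worth a short illustration. The sketch below is not the whylogs API; `ColumnProfile`, `track`, `merge`, and `profile` are hypothetical names, and the statistics are exact rather than approximate. It only shows the general idea that per-batch summaries can be combined into the same summary you would get from profiling all the data at once.

```python
import math
from dataclasses import dataclass


@dataclass
class ColumnProfile:
    """Hypothetical summary of one numeric column (not a whylogs class)."""
    count: int = 0
    nulls: int = 0
    total: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def track(self, value):
        """Update the summary with one value; the raw value is not stored."""
        self.count += 1
        if value is None:
            self.nulls += 1
            return
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other):
        """Combine two summaries without revisiting the underlying data."""
        return ColumnProfile(
            count=self.count + other.count,
            nulls=self.nulls + other.nulls,
            total=self.total + other.total,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )

    @property
    def mean(self):
        non_null = self.count - self.nulls
        return self.total / non_null if non_null else None


def profile(values):
    """Build a summary from an iterable of values."""
    result = ColumnProfile()
    for value in values:
        result.track(value)
    return result


# Two batches profiled separately (for example, on different nodes).
batch_1 = profile([10.0, 12.5, None])
batch_2 = profile([7.5, 20.0])
merged = batch_1.merge(batch_2)

# Merging the batch summaries matches profiling the full data in one pass.
whole = profile([10.0, 12.5, None, 7.5, 20.0])
print(merged == whole, merged.mean)
```

Because only the summary is kept, its size depends on the number of columns rather than the number of rows, which is the intuition behind the lightweight and privacy-friendly claims above.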
whylogs has Python and Spark APIs that can work anywhere your team works. In addition, you can check out examples of how to get started with whylogs.
Pandera
Pandera is a dataframe validation and testing tool that is lightweight and adaptable for projects of any scale. It allows you to keep track of the quality of your data by monitoring it on a regular basis and running statistical validation tests.
Pandera offers a flexible and expressive API for doing data validation on various dataframe types, resulting in more understandable and robust data processing pipelines. If you have an understanding of what your data should look like, this tool can help you catch bad data.
Some of the features of Pandera include:
- Self-service: It is lightweight and easy to get started with.
- The API is flexible and can scale to large projects with complex dataframe types such as Dask and Koalas.
- Integration with a rich ecosystem of Python tools like pydantic, FastAPI, and mypy.
- It comes with common data validation checks, and you can also write your own custom checks.
- Automatable: Seamless integration with existing data analysis/processing pipelines via function decorators.
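To show why decorator-based validation fits neatly into an existing pipeline, here is a rough sketch in plain Python. It deliberately does not use Pandera; `validate_rows` is a hypothetical decorator that works on lists of dictionaries instead of dataframes, and the column names and checks are made up for illustration.

```python
from functools import wraps


def validate_rows(checks):
    """Hypothetical decorator: validate every row a function returns.

    `checks` maps a column name to a predicate that must hold for its value.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            rows = func(*args, **kwargs)
            for index, row in enumerate(rows):
                for column, predicate in checks.items():
                    if column not in row:
                        raise ValueError(f"row {index}: missing column {column!r}")
                    if not predicate(row[column]):
                        raise ValueError(f"row {index}: check failed for {column!r}")
            return rows
        return wrapper
    return decorator


@validate_rows({
    "price": lambda value: isinstance(value, float) and value >= 0,
    "currency": lambda value: value in {"USD", "EUR"},
})
def add_currency(rows, currency):
    """An ordinary pipeline step; validation happens on its output."""
    return [{**row, "currency": currency} for row in rows]


clean = add_currency([{"price": 9.99}], "USD")
print(clean)

try:
    add_currency([{"price": -1.0}], "USD")
except ValueError as error:
    print("rejected:", error)
```

The pipeline function itself stays unchanged; the expectations about its output live next to it in the decorator, which is the design idea the bullet above refers to.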
You can see more features on the documentation page.
Great Expectations (GE)
Great Expectations is a tool for validating, documenting, and profiling your data to maintain quality and improve communication between teams. It is similar to Pandera but offers a more robust ecosystem of features and is more scalable to larger projects and data systems.
GE allows you to write declarative data tests based on what you expect from the data, get validation results from those tests, and output a report that documents the current state of your data.
Some of the features of Great Expectations include:
- Self-service: Smart defaults and production-ready validation in your data pipelines.
- Easy to integrate with a large variety of data tools and designed to be interoperable with your data stacks.
- You can write assertions (known as Expectations) to validate the quality of your data and detect when there are issues.
- Collaboration: With GE, you can render your data quality validation tests into documentation for your data.
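The flow described above (declarative tests, then validation results, then a readable report) can be sketched in a few lines of plain Python. This is not the Great Expectations API; the expectation types `not_null` and `between`, the `validate` and `render_report` functions, and the sample rows are assumptions made only for illustration.

```python
# Declarative description of what we expect from the data (hypothetical format).
EXPECTATIONS = [
    {"type": "not_null", "column": "order_id"},
    {"type": "between", "column": "quantity", "min": 1, "max": 100},
]


def validate(rows, expectations):
    """Evaluate each expectation and return one result per expectation."""
    results = []
    for expectation in expectations:
        values = [row.get(expectation["column"]) for row in rows]
        if expectation["type"] == "not_null":
            unexpected = [v for v in values if v is None]
        elif expectation["type"] == "between":
            low, high = expectation["min"], expectation["max"]
            unexpected = [v for v in values if v is None or not low <= v <= high]
        else:
            raise ValueError(f"unknown expectation type: {expectation['type']}")
        results.append({
            "expectation": expectation,
            "success": not unexpected,
            "unexpected_count": len(unexpected),
        })
    return results


def render_report(results):
    """Turn validation results into a short human-readable report."""
    lines = []
    for result in results:
        status = "PASS" if result["success"] else "FAIL"
        expectation = result["expectation"]
        lines.append(
            f"{status} {expectation['type']}({expectation['column']}): "
            f"{result['unexpected_count']} unexpected value(s)"
        )
    return "\n".join(lines)


rows = [
    {"order_id": 1, "quantity": 2},
    {"order_id": None, "quantity": 500},
]
results = validate(rows, EXPECTATIONS)
report_text = render_report(results)
print(report_text)
```

Keeping the expectations as data rather than code is what makes it possible to render the same definitions as both tests and documentation.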
You can learn more about Great Expectations by visiting the documentation.
Deequ
Deequ, originally built and previously maintained by AWS Labs, is a library written in Scala and built on Apache Spark to define "unit tests for data," which measure data quality in large datasets. It is currently at end of life, and in fact, whylogs is based on a lot of the work done by the Deequ team at AWS.
Deequ lets you generate data quality metrics on your Spark dataset, specify and verify data quality constraints, and keep track of data distribution changes. You can focus on describing how your data should look rather than implementing checks and verification logic on your own.
To use Deequ with Python, PyDeequ provides an open source Python wrapper over Deequ.
Some of the features of Deequ include:
- Self-service: Deequ provides a wide range of data quality indicators that you can choose yourself or that it can automatically suggest for you to implement in your Spark and Python pipelines. These indicators are helpful for computing data quality metrics.
- Dynamic: Deequ is often used to test and monitor data quality issues at scale.
- Holistic: As a user, you're primarily concerned with creating a set of data quality constraints that need to be verified. Deequ generates a data quality report that includes the constraint verification results.
- Automatable: PyDeequ supports a set of core APIs that make automation possible.
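The idea of automatically suggested constraints can be illustrated with a small heuristic. The sketch below is plain Python and does not use Deequ or PyDeequ; `suggest_constraints` and the constraint labels (`is_complete`, `is_unique`, `is_non_negative`) are illustrative names, and the rules are far simpler than what a real library would apply.

```python
def suggest_constraints(rows):
    """Propose simple constraints that already hold on a sample of rows."""
    columns = sorted({column for row in rows for column in row})
    suggestions = []
    for column in columns:
        values = [row.get(column) for row in rows]
        present = [value for value in values if value is not None]
        complete = len(present) == len(values)
        if complete:
            suggestions.append((column, "is_complete"))
        # Only suggest uniqueness for columns with no missing values.
        if complete and len(set(present)) == len(present):
            suggestions.append((column, "is_unique"))
        if present and all(
            isinstance(value, (int, float)) and value >= 0 for value in present
        ):
            suggestions.append((column, "is_non_negative"))
    return suggestions


sample = [
    {"id": 1, "price": 9.5, "coupon": None},
    {"id": 2, "price": 0.0, "coupon": "SPRING"},
    {"id": 3, "price": 9.5, "coupon": None},
]
suggested = suggest_constraints(sample)
for column, constraint in suggested:
    print(column, constraint)
```

Suggestions like these are a starting point only: a constraint that happens to hold on a sample still needs a human to confirm that it reflects a real business rule before it is enforced.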
SaaS tooling
There are various SaaS solutions to choose from that can ensure data observability and reliability for your organization. These solutions vary in pricing and in how they instrument the data system, that is, how they get data for monitoring and which metrics they can monitor.
WhyLabs
WhyLabs is an observability platform that monitors data pipelines and ML applications for data quality regressions, data drift, and model performance degradation.
WhyLabs is built on top of whylogs, which we discussed earlier. Once whylogs profiles your data, the library outputs can be used to test, monitor, and debug data on the WhyLabs data health monitoring platform.
Some of the features of WhyLabs include:
- Self-service: It is easy to use, with the set-up only taking minutes.
- Data profiling is a necessary process before model building. You can upload data profiles to WhyLabs for centralized and customizable monitoring/alerting of model inputs, outputs, and performance.
- Dynamic: It can scale to terabytes, handle your large-scale data, keep compute requirements low, and integrate with either batch or streaming data pipelines.
- It provides some integrations for data pipelines.
- With a rich dashboard and collaborative features, you can provide authorized and controlled access to what teammates should view and work with.
- WhyLabs can notify you of data quality issues in real-time through popular channels like Slack and email.
- Holistic: WhyLabs provides a holistic insight into your data and helps track the data lineage across your data system.
- Automatable: You can interact with WhyLabs programmatically through an API.
- Privacy-preserving: WhyLabs maintains data privacy and does not capture sensitive raw data but rather the statistical properties of the data. Your raw data never leaves your environment. It is also SOC 2 Type 2 compliant.
One major advantage WhyLabs has over other tools is that it leverages whylogs, which enables it to monitor data in motion, including when data is moving through data pipelines in Python, Spark, or Kafka.
This is in contrast to other platforms, such as Monte Carlo and Soda Cloud (which we will review below), that operate on static data in the data warehouse or data source.
With sample datasets already loaded through the sandbox, you can get up and running with the WhyLabs platform for free. It will help you get a feel for what the solution you choose should offer.
Metaplane
Metaplane is an end-to-end observability platform for monitoring the data quality in your system. It continuously monitors the data flowing through your data stack so you can track the quality of your data.
Metaplane collects metrics, metadata, lineage, and logs on your data, trains anomaly detection models on historical values, and then sends you alerts for outliers with the opportunity to offer model feedback in cases of false positives.
Some of the features of Metaplane include:
- It’s one of the easiest tools on this list to get started with, including a seamless set-up process.
- You can add data validation tests without writing code.
- Dynamic: Metaplane integrates with every part of the modern data stack.
- Collaborative: Metaplane enables team collaboration and, like most of the SaaS tools in this category, can notify you of issues through popular channels like Slack, PagerDuty, and email.
- Holistic: Metaplane provides an end-to-end view of your data system. It creates a link between the data in your warehouse and the dashboards that your stakeholders utilize.
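Anomaly detection on historical values, as described above, can be approximated with a very simple statistical rule. The sketch below is not how Metaplane works internally (its models are not described in this article); it assumes a plain z-score check on a hypothetical history of daily row counts, with `is_anomalous` and the threshold of 3 chosen only for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it is more than `threshold` standard deviations
    away from the mean of the historical values."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    center = mean(history)
    spread = stdev(history)
    if spread == 0:
        return latest != center
    return abs(latest - center) / spread > threshold


# Hypothetical daily row counts for one table.
daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

normal_day = is_anomalous(daily_row_counts, 1015)
broken_load = is_anomalous(daily_row_counts, 400)
print(normal_day, broken_load)
```

Feedback on false positives could be folded into a rule like this by raising the threshold for a noisy metric, although production tools typically use models that also account for trend and seasonality.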
You can get started with Metaplane from the documentation page.
Monte Carlo
Monte Carlo is an end-to-end data observability platform that provides monitoring and alerting solutions for data quality issues affecting your data system. It’s one of the most feature-rich data observability tools on the market.
Some of the features of Monte Carlo, per the documentation, include:
- Self-service: It is easy to configure and get started with, especially with smart defaults and ML-powered incident monitoring and resolution, making troubleshooting easier.
- Dynamic: It is flexible and scales with your data stack.
- Collaboration: It has code-free implementation with your existing data stack for out-of-the-box coverage and smooth cooperation with your coworkers.
- Holistic: It gives you complete visibility into all of your data assets.
- Automatable: Interact with Monte Carlo programmatically through the APIs, SDKs, and the CLI. Also, write YAML code to add monitors from your CI/CD workflow.
- Privacy-preserving: It maps your data assets at rest without requiring data extraction from your environment; therefore, it doesn't store or process your data.
To get started with Monte Carlo, you can request a demo (and learn about the pricing options). Check out the developer hub and their blog.
Soda Cloud
Soda Cloud is a platform you can use to automatically detect issues with your data and notify your team of those issues.
Soda Cloud leverages Soda SQL, a free, open-source command-line program and Python module that uses user-defined inputs to generate SQL queries that analyze datasets in a data source for data quality issues including incorrect, missing, or unexpected data.
Through Soda Cloud, you can visualize the results of your data validation tests, see historical measurements, and set up alerts that help you effectively monitor your data quality.
Some of the features of Soda Cloud include:
- Self-service: Setting up and integrating your stack with Soda Cloud is pretty easy. What you might find challenging is working with the YAML configs when using Soda SQL.
- Dynamic: Soda Cloud is scalable to large datasets and integrates with your data stack.
- Collaborative: You can work together with your team to monitor and manage data quality, assign roles, and set up alerts for the proper individuals to deal with problems using Soda Cloud.
- Holistic: Soda Cloud gives your team a holistic view of your data health.
- Automatable: Soda Cloud can automatically help you detect anomalies in your data, and with YAML configs, you can automate data validation with code.
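The approach of turning user-defined inputs into SQL queries, mentioned above for Soda SQL, can be sketched with Python's built-in `sqlite3` module. This is not Soda SQL's configuration format or API; the `CHECKS` structure, the metric names, and the `build_query` and `run_checks` functions are hypothetical and exist only to show the pattern.

```python
import sqlite3

# Hypothetical check definitions, standing in for a user-written config file.
CHECKS = [
    {"name": "no_missing_email", "metric": "missing_count", "column": "email", "max": 0},
    {"name": "has_rows", "metric": "row_count", "min": 1},
]


def build_query(table, check):
    """Generate the SQL for one check.

    Table and column names are interpolated directly, so they must come
    from trusted configuration, never from end-user input.
    """
    if check["metric"] == "row_count":
        return f"SELECT COUNT(*) FROM {table}"
    if check["metric"] == "missing_count":
        return f"SELECT COUNT(*) FROM {table} WHERE {check['column']} IS NULL"
    raise ValueError(f"unknown metric: {check['metric']}")


def run_checks(conn, table, checks):
    """Run every check and return {name: (measured value, passed)}."""
    results = {}
    for check in checks:
        value = conn.execute(build_query(table, check)).fetchone()[0]
        # A missing bound defaults to the measured value, so it always passes.
        passed = check.get("min", value) <= value <= check.get("max", value)
        results[check["name"]] = (value, passed)
    return results


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None)],
)
check_results = run_checks(conn, "customers", CHECKS)
print(check_results)
```

Because the checks run as queries inside the data source, only the measured values need to leave it, which is one reason this pattern suits data at rest in a warehouse.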
Databand
According to their product page, Databand helps you monitor and control your data’s quality, even when you can’t control your sources. You can also orchestrate your data pipelines with this tool and self-host it.
Some of the features of Databand include:
- Self-service and collaboration: Databand is easy to configure with no-code and programmatic configurations to track your assets (such as the metadata) and makes it easy to collaborate with teammates.
- Dynamic: It provides a variety of native integrations to common data tools, making it easy to plug them into your existing data stack.
- Holistic: It can help you track data lineage end-to-end, giving you a holistic view of your data.
- Privacy-preserving: You can control which data assets get monitored by Databand to protect sensitive data.
Databand has a Cloud and a self-hosted offering for on-premise usage.
If you want to learn about other monitoring and observability solutions available, this article goes into some other solutions.
Making a decision: Should you build or buy a data quality monitoring tool?
As with any software solution, there are different approaches you could take in choosing one. Deciding whether to build a solution, buy one, or go with a combination of both can be influenced by the organization’s:
- Business requirements,
- Technical requirements,
- Size and resources (talent and budget),
- Implementation timeline.
Business requirements
When weighing the business requirements, consider the business problem and the deliverables.
What is the business use case? Is it complex enough to justify building a custom solution, or can existing solutions handle it well?
Specifically, you want to look at your defined data SLAs, SLOs, and SLIs. For example, if the data uptime requirement for the use case is 96%, are there solutions that can deliver that level of availability?
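It can help to translate an uptime target like 96% into a concrete downtime budget before comparing solutions. The short sketch below does that arithmetic; the function names, the 30-day window, and the hourly measurement granularity are assumptions made for illustration.

```python
def allowed_downtime_hours(slo, period_hours):
    """Downtime budget implied by an availability target over a period."""
    return (1.0 - slo) * period_hours


def measured_availability(good_intervals, total_intervals):
    """A simple SLI: the fraction of intervals in which data was available."""
    return good_intervals / total_intervals


SLO = 0.96               # the 96% data uptime requirement from the example
PERIOD_HOURS = 30 * 24   # assumed 30-day evaluation window

budget = allowed_downtime_hours(SLO, PERIOD_HOURS)

# Suppose data was available in 700 of the 720 hourly checks this month.
sli = measured_availability(700, 720)
meets_slo = sli >= SLO

print(f"downtime budget: {budget:.1f} hours")
print(f"measured availability: {sli:.4f}, meets SLO: {meets_slo}")
```

A 96% target over 30 days leaves roughly 28.8 hours of allowable downtime, which is a useful number to hold against both a vendor's stated availability and what an in-house team could realistically support.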
For example, Uber built its own data quality monitoring solution (DQM) due to the diversity of business problems it solves at scale. Uber required a tool that was flexible enough for its use cases and could scale, and, among other factors, this led the company to build internally.
Technical requirements
On the technical side, you probably have Ops, software, and data teams, so you need to gauge how compatible your existing resources are with the solutions you want to consider.
You may want to consider building a custom solution if the solutions you are considering cannot leverage or interoperate with your existing resources (be it talent or infrastructure). The same applies if such solutions do not comply with your business rules, logic, and policies.
Size and resources
Large organizations
These organizations are most likely solving many data problems with many use cases. They may:
- Have valuable intellectual property (IP) they want to keep in-house and would not want to compromise,
- Have internal data, infrastructure, and software teams to build in-house software tools.
They will likely have the budget to build an in-house solution and therefore have the option of choosing to develop in-house or purchasing an enterprise solution that can meet their requirements.
Medium-sized organizations
These organizations may have many data problems, but perhaps not quite at the level of larger corporations. They may:
- Have internal data and software teams,
- Have new monitoring needs and requirements.
They may not have all the resources and budget to build an in-house solution, but they still need to customize the solution to their business cases, perhaps with some IP to protect. One could argue that a hybrid solution might be helpful, but this depends on other factors, such as whether an existing solution can run on-premise.
Smaller organizations and startups
For these organizations, the data problems and use cases they are solving may not be at the level of the two previous categories. In addition, they would likely not have the talent to build and maintain custom solutions, and might not have the budget either.
It might seem logical to buy the license for an affordable solution. Still, this can be subjective because some startups (like Blinkist) are solving data problems at an unprecedented scale.
Implementation timeline
If you have a short timeline to start monitoring data quality or to implement a solution in your data stack, you might consider using an existing solution. However, if you plan on building a custom solution, you might want to use an open source solution, or purchase the license of a full-featured tool, as a placeholder while you develop the in-house solution.
Conclusion
Choosing a data quality monitoring tool is not a standalone effort; it depends on your data quality strategy. Therefore, consider the factors you learned in the previous section before you decide.
Also, ensure you know your most vital monitors so that you understand the capabilities and limits of the solution you are settling for and whether it is enough to solve your challenges.
- Try WhyLabs’ Free Self-Service Data Monitoring Platform — https://whylabs.ai/free.
- Join the WhyLabs Slack community to discuss ideas and share data logging and observability feedback.