A Comprehensive Overview Of Data Quality Monitoring
- ML Monitoring
- Data Quality
Feb 2, 2024
As data becomes increasingly vital in the digital age, the role of data quality monitoring in building effective data and machine learning systems has grown immensely. Data quality monitoring involves the ongoing assessment and validation of data to ensure accuracy, consistency, and reliability. Forrester's Online Global Survey on Data Quality and Trust reveals that 42% of data teams spend over 40% of their time validating data, highlighting its critical role. Moreover, Gartner research suggests that poor data quality results in an average annual loss of $15 million for organizations.
But how do you ensure the reliability of your data pipelines? Are you effectively checking and validating the right data? And when it comes to machine learning, the integrity and quality of the data are not just beneficial but essential for the success of your models. Neglecting continuous data quality monitoring can lead to significant business damages, such as poor application performance, compliance issues, customer churn, and revenue loss.
This article, the first in a series, offers a comprehensive overview of data quality monitoring. It will guide you through the strategies for continuous quality assessment, ensuring your data remains an asset rather than a liability.
What is data quality?
Data quality refers to the degree to which data meets the specific needs and expectations of its usage in various stages of the data lifecycle, including collection, processing, storage, and analysis. In this guide, we understand data quality to mean the health of your data as it flows through the entire data system.
From a technical standpoint, high-quality data should be valuable to consumers - whether they are services, applications, or users - and beneficial to the business's objectives. It involves ensuring that the data ingested into your system is accurate, consistent, and relevant to the task at hand.
But what exactly makes data 'good' or 'healthy'? Let’s understand the dimensions that define good data quality.
Dimensions of data quality
High-quality data is characterized by six key attributes. You will learn these attributes and their corresponding applications in a use case. SolarTech Innovations, a fictional company specializing in solar energy solutions, uses a data-driven approach to optimize its operations and customer service.
The company relies on various data types, including customer data, operational data, and market analysis. Let’s learn how the data quality dimensions apply to 'SolarTech Innovations' data.
What is data quality monitoring?
Monitoring data quality is about measuring, analyzing, and improving data quality to fit the business purposes and meet business expectations. With the explosion of real-time machine learning and business intelligence, the only approach to successfully validate dynamic data is to monitor its quality continually and evaluate it using a set of relevant quality metrics.
For example, data quality monitoring in a real-time analytics system might involve real-time checks for accuracy and consistency, ensuring that incoming data streams are correct, up-to-date, and synchronized across various platforms.
The necessity of monitoring the quality of your data
To understand why you need to monitor the quality of your data, you need to know where data quality issues can stem from in the data lifecycle and the types of issues you’ll likely find at each stage. Poor data quality can lead to financial losses, inaccurate analyses, misguided business decisions, and ultimately reputational damage.
Where data quality issues could stem from
There are three significant areas where issues degrade data quality:
- Directly from the data source(s).
- Inside the data systems or pipelines (during transformation and manipulation).
- When downstream systems consume them.
Data quality issues directly from the data source(s)
Okay, so you have probably encountered this scenario: It’s Friday, and you are excited about the weekend. Everything seems to have worked well during the week, and you are about to log off from Slack. Then, suddenly, your data analyst calls to tell you that the report dashboard for a particular segment of customers is broken and has been leading to wrong conclusions for a week now.
You start panicking because, of course, those tests you wrote should have caught these issues in the pipeline, right? If you are lucky, it takes you anywhere from a few hours to days of searching to find the root cause of the problem. Alas, you find out that the mobile application developers changed the schema of the Firebase table that collects data from the app.
Since you were not informed and did not write validation tests to cover such edge cases, the bad data made its way through to your reporting layer. While you cannot account for every issue that stems directly from data sources, common ones include:
- Duplicate and delayed events,
- Stale or inaccurate data ingested into the data system,
- An outlier in a crucial field that goes undetected until it shows up in the data reporting layer or is ingested by a machine learning training pipeline,
- Data distribution drift that you only notice when a production ML application outputs strange results because the production data distribution differs significantly from the training data distribution,
- Missing events (events not coming in because of broken integrations),
- Wrong data type and format (wrong dollar amount; incorrect product code),
- Incorrect data syntax and semantics.
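Several of the source-level issues listed above (schema changes, wrong data types, duplicate events) can be caught with a lightweight check at ingestion time. The following is a minimal sketch, not a production validator: the expected schema, the field names, and the check_record function are all hypothetical and chosen only for illustration.

```python
from typing import Any

# Hypothetical expected schema for an app event; field names are illustrative.
EXPECTED_SCHEMA = {"event_id": str, "user_id": str, "amount": float}


def check_record(record: dict[str, Any], seen_ids: set[str]) -> list[str]:
    """Return a list of human-readable issues found in one incoming record."""
    issues = []
    # Schema drift: fields removed from or added to the source.
    missing = EXPECTED_SCHEMA.keys() - record.keys()
    extra = record.keys() - EXPECTED_SCHEMA.keys()
    if missing:
        issues.append(f"missing fields: {sorted(missing)}")
    if extra:
        issues.append(f"unexpected fields: {sorted(extra)}")
    # Wrong data type for fields that are present.
    for name, expected_type in EXPECTED_SCHEMA.items():
        if name in record and not isinstance(record[name], expected_type):
            issues.append(f"wrong type for {name}: {type(record[name]).__name__}")
    # Duplicate events, keyed on event_id (only checked for string IDs).
    event_id = record.get("event_id")
    if isinstance(event_id, str):
        if event_id in seen_ids:
            issues.append(f"duplicate event: {event_id}")
        else:
            seen_ids.add(event_id)
    return issues
```

In the Firebase scenario above, a renamed column would surface as one "missing fields" and one "unexpected fields" issue on the first affected record, rather than a week later in a dashboard.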
Data quality issues inside the data systems or pipelines
Good data pipelines mean good data hygiene and ultimately healthy data. But here’s the catch: Building good data pipelines is challenging, whether it’s an ETL, ELT, or even an rETL-based pipeline. It isn’t easy! There is a reason you might find bad or buggy transformations that lead to data quality issues.
For example, you may miswrite your transformation steps, causing your pipeline steps to execute in the wrong order. Or you may write the data validation tests in your pipeline in the wrong order.
The transformations within the data pipeline can cause data downtime, corrupt the data, and even cause problems for downstream consumers.
Data quality issues when downstream systems consume them
Data quality issues are least common at this stage, but they do occur. For example, a code modification in your ML pipeline could prevent an API from gathering data for a live or offline model. Or your BI analysis tool may stop receiving updates from your data source after a software upgrade or dependency change, producing stale reports.
Overall, these issues affect the quality of the dataset fed to downstream consumers, such as a machine learning training pipeline or analytics software. They inevitably affect the results of such systems, potentially rendering the entire project futile.
Continuously monitoring (and, going forward, observing) your data quality helps you:
- Detect most, if not all, data quality issues,
- Troubleshoot these issues before they can cause silent or non-silent errors,
- Continuously report on, and improve, the quality of the data you are using to solve business problems.
This would not just increase the trust you have in your data; it would likely improve the reliability of your data systems too.
Now that you have an idea of where these issues might occur, let’s look at what makes monitoring data quality challenging.
Challenges with monitoring data quality
The core challenges with monitoring data quality are:
1. Diversity in business use cases
Each business use case, such as time-series analysis versus recommendation engines, has unique data requirements and standards. This diversity makes it challenging to define universal standards for data quality. Continuous engagement with stakeholders to understand and define appropriate metrics (data SLAs) for each case is crucial.
2. Challenges stemming from the data entity
Data scale
With the exponential growth in the volume of data comes the need to validate the dataset in real-time to ensure businesses can make the most of their data.
Monitoring data quality with increasing data volume can be challenging. The increased data volume can lead to more data quality issues that must be monitored, triaged, and troubleshot.
Another problem with the data scale is its ephemeral nature. Monitoring may also be challenging if you do not persist high-volume data in storage or compute memory.
Data cardinality
High data volume often leads to high data cardinality (the number of unique values in a column). This is not necessarily bad (high-cardinality data can also mean more valuable data), but it can cause your data structure to become inconsistent. So you often need to adapt your data quality monitoring techniques to account for structural changes.
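As a rough illustration of what measuring cardinality looks like, the sketch below counts distinct non-null values per column and flags columns whose distinct count approaches the row count. The function names and the 0.9 threshold are assumptions made for this example, and exact counting with sets only suits data that fits in memory; profilers for large datasets typically rely on approximate counts instead.

```python
from collections import defaultdict


def column_cardinality(rows):
    """Count distinct non-null values per column across a list of dict rows."""
    distinct = defaultdict(set)
    for row in rows:
        for column, value in row.items():
            if value is not None:
                distinct[column].add(value)
    return {column: len(values) for column, values in distinct.items()}


def high_cardinality_columns(rows, max_ratio=0.9):
    """Flag columns whose distinct-value count is close to the row count."""
    total = len(rows)
    if total == 0:
        return []
    return sorted(
        column
        for column, count in column_cardinality(rows).items()
        if count / total > max_ratio
    )
```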
Data fragmentation
Data assets (such as metadata) and dependencies are scattered across systems and organizational use cases, each containing a small amount of information, making it difficult to centralize quality monitoring and management efforts.
Data legalities
In most situations, data security is crucial, and it’s often vital that whatever solution or technique you use to monitor the quality of the data does not compromise its security.
Other legal issues revolve around monitoring data with personally identifiable information (PII) and ensuring the data values are protected while using the monitoring technique or solution.
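One common way to keep raw PII out of a monitoring tool is to replace sensitive values with keyed hashes before the data is profiled. The sketch below assumes a hypothetical set of PII field names and a pseudonymize helper; it uses Python's standard hmac and hashlib modules. Note that this is pseudonymization rather than anonymization, so whether it satisfies a given regulation is a question for your legal and security teams.

```python
import hashlib
import hmac

# Hypothetical list of fields treated as PII in this example.
PII_FIELDS = {"email", "phone"}


def pseudonymize(record, secret_key: bytes):
    """Replace PII values with keyed hashes so monitors never see raw values."""
    masked = {}
    for field, value in record.items():
        if field in PII_FIELDS and value is not None:
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            masked[field] = digest.hexdigest()
        else:
            masked[field] = value
    return masked
```

Because the same input always maps to the same hash under a given key, metrics such as cardinality and duplicate counts still work on the masked column.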
3. Challenges due to data infrastructure
The complexity of the data architecture
You can link this challenge to the use case diversity challenge mentioned earlier. Most business use cases require data to be continuously streamed into data pipelines in real-time and not just at scheduled intervals (batch mode).
The high data throughput makes monitoring the quality of the data complex because users and systems have to be proactive in triaging and troubleshooting data quality issues in real-time.
Data pipeline workflow complexity
Workflows have grown increasingly complex, especially with the data scale, the explosion of use cases, and the advent of workflow orchestration tools. It is hard to monitor data pipelines with a dozen stages and many branches, and it is even more challenging when someone makes an unanticipated change to the pipeline.
Data source orchestration
When monitoring data quality, ingesting data from various sources adds a lot of overhead. The larger the number of sources, the more likely you are to deal with both similar and varying data quality issues.
Looking at these challenges, you will realize that monitoring data quality requires a thoughtful and continuous approach. Therefore, you need to consider these challenges to develop your continuous monitoring approach to data quality.
What should you monitor in your data systems?
Knowing what to monitor in your data system to ensure high-quality data is tricky because it is often problem-dependent and has no set standards. As with monitoring regular software systems, there are several custom and standard metrics. However, we can take a page from software monitoring and observability.
Software observability has three pillars:
- Logs (record of events)
- Metrics (quantitative measures of certain aspects)
- Traces (record of the path taken through the software)
Adapting these to data systems, we arrive at five key pillars of data observability:
1. Volume
- Monitors whether all expected data has arrived in the system. For example, check for significant deviations in data volume that might indicate missing or excessive data.
2. Freshness
- Monitors the time the data arrived in the data system, its recency, and whether it is up-to-date with its dependencies.
3. Schema
- Monitors the data structure and dimensions, including how they have changed, the source of the change, who made it, and its impact.
4. Distribution
- Evaluates whether the data distribution falls within an acceptable range and if transformations within the data pipeline are accurate. This could involve analyzing the statistical properties of the data.
5. Lineage
- Traces the flow of data from upstream sources to downstream consumers. You can also trace who interacts with your data and at what stages.
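The first four pillars can each be expressed as a simple check. The sketch below is illustrative only: the function names, tolerances, and the mean-shift test for distribution are assumptions for this example rather than a standard, and lineage is left out because it needs metadata about pipeline dependencies rather than the data itself.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev


def volume_ok(row_count: int, expected: int, tolerance: float = 0.2) -> bool:
    """Volume: row count is within +/- tolerance of the expected count."""
    return abs(row_count - expected) <= tolerance * expected


def freshness_ok(last_arrival: datetime, now: datetime,
                 max_age: timedelta = timedelta(hours=1)) -> bool:
    """Freshness: the newest record arrived no longer than max_age ago."""
    return now - last_arrival <= max_age


def schema_changes(expected_columns, actual_columns):
    """Schema: columns added or removed relative to expectations."""
    expected, actual = set(expected_columns), set(actual_columns)
    return {"added": sorted(actual - expected), "removed": sorted(expected - actual)}


def distribution_ok(baseline, current, max_z: float = 3.0) -> bool:
    """Distribution: current mean is within max_z baseline standard deviations."""
    spread = stdev(baseline)  # baseline needs at least two values
    if spread == 0:
        return mean(current) == mean(baseline)
    return abs(mean(current) - mean(baseline)) / spread <= max_z
```

A mean-shift test is a deliberately crude stand-in for distribution monitoring; real tools usually compare full distributions with a statistical distance measure.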
In most solutions, a data profiler (such as whylogs) generates the metadata about the data from the upstream source (or data producer). The metadata provides the information required to observe the data system appropriately with the five data observability pillars listed above. It can be a good source for the data quality metric.
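To make the idea of profiler-generated metadata concrete, here is a minimal sketch in plain Python. It does not use or reproduce the whylogs API; profile_column and profile_rows are hypothetical stand-ins that show the kind of summary statistics a profiler might emit for each column.

```python
def profile_column(values):
    """Summarize one column: counts, nulls, distinct values, numeric range."""
    non_null = [v for v in values if v is not None]
    # Treat ints and floats as numeric, but not booleans.
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    summary = {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
    }
    if numeric:
        summary.update(
            min=min(numeric), max=max(numeric), mean=sum(numeric) / len(numeric)
        )
    return summary


def profile_rows(rows):
    """Profile every column found in a list of dict rows."""
    columns = {name for row in rows for name in row}
    return {
        name: profile_column([row.get(name) for row in rows])
        for name in sorted(columns)
    }
```

The monitoring layer then works on these small summaries instead of the raw rows, which is what makes profiling attractive for high-volume data.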
Before we continue, you should know a subtle but crucial difference between data observability and monitoring. Monitoring tells you when something is wrong with the quality of your data, whereas observability enables you to understand why you encounter data quality issues and what to do about them.
Choosing what to monitor in your data system
Choosing what to monitor depends on the type of business problem you are solving, so you need to make some considerations before deciding what to monitor. These considerations are based on your data profile, problem, resources, and the operational impact of the data.
Here are vital considerations, adapted from Frank Dravis's whitepaper on data quality strategy:
1. Frequency of data consumption
Assess how often downstream systems use the data (hourly, daily, weekly, etc.). This frequency guides the monitoring approach and highlights critical issues to prioritize.
2. Importance of the downstream system
Is the downstream system an ML model used for mission-critical, life-dependent, or routine operations? Or is it a BI tool used for end-of-the-month reporting? Sometimes, ML engineers might want to deploy ML models for healthcare applications, highly regulated applications, or those prone to biases.
Considering this, you will know the most important issues to monitor—especially when you talk to domain experts.
3. Cost of monitoring
After establishing the monitoring system and process, the direct costs are labor and system resources. Ideally, the better the monitoring technology, the lower the labor costs. You may want to log many issues, but knowing the cost of monitoring, you can home in on the most critical issues, including those you don’t yet know about.
4. Operational impact
Can you estimate the cost of poor-quality data entering these systems, especially where a complex dataset feeds crucial downstream consumers? How much do the data quality issues you encounter affect the department’s or organization’s targets?
There are two aspects to consider here:
- Impact of assessing operational (production) data during live operations.
- Impact of the process on all stakeholders involved.
This will inform whether you manually audit the quality assessment process or partially or fully automate it.
Data quality indicators (DQI)
Data Quality Indicators (DQIs) serve as key metrics for assessing data quality within a system, akin to Key Performance Indicators (KPIs) in business analytics. To be effective, DQIs should meet the following criteria:
- Well-defined and measurable: Each DQI should have a clear definition and a quantifiable measure. For example, a DQI for data accuracy might measure the percentage of records that pass a set of accuracy checks.
- Relevant to the business use case: The DQI should directly apply to the specific business context. For instance, in a retail business, a DQI might focus on the accuracy and completeness of inventory data.
- Aligned with business requirements and data SLAs: DQIs should reflect the business's needs and the agreed-upon data service level agreements (SLAs).
You can define your DQIs around the five pillars of data observability discussed earlier in this section.
Data quality monitoring solutions usually have built-in DQIs dependent on your data. Of course, you may need to make some maneuvers to monitor unstructured data, but most tools would allow you to use a query (like SQL) or programming language (like Python) to define your indicators.
Since business use cases differ, you may also need to define DQIs specific to business use cases to ensure your bases are covered.
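As a sketch of what a custom DQI defined in Python might look like, the example below models an indicator as a named check plus an SLA threshold, following the "percentage of records that pass a set of checks" idea above. The DataQualityIndicator class, the inventory fields, and the 0.95 threshold are all hypothetical; a monitoring tool would have its own way of declaring such indicators.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DataQualityIndicator:
    name: str
    check: Callable[[dict], bool]  # returns True when a record passes
    sla_threshold: float           # minimum acceptable pass rate (0.0 to 1.0)

    def score(self, records: list[dict]) -> float:
        """Fraction of records that pass the check (1.0 for an empty batch)."""
        if not records:
            return 1.0
        return sum(1 for r in records if self.check(r)) / len(records)

    def meets_sla(self, records: list[dict]) -> bool:
        return self.score(records) >= self.sla_threshold


# Example for a retail use case: inventory records must carry a SKU and a quantity.
inventory_completeness = DataQualityIndicator(
    name="inventory_completeness",
    check=lambda r: r.get("sku") is not None and r.get("quantity") is not None,
    sla_threshold=0.95,
)
```

Keeping the threshold next to the check makes the link between the indicator and the data SLA explicit.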
Key metrics to assess data quality
You can use several key metrics to monitor and maintain high data quality. These metrics provide insights into various aspects of data quality and are instrumental in identifying areas for improvement:
- Ratio of data to errors: This metric compares the number of errors to the size of the dataset ingested into your pipeline. For instance, a high proportion of errors may indicate systemic data collection or entry issues.
- Number of empty values: Empty fields in a dataset can signal incomplete data. Tracking this metric helps in assessing the completeness of the data.
- Data transformation error rate: This metric measures the frequency of errors occurring due to transformations in the data pipeline. A high error rate could indicate data processing rules or logic issues.
- Amount of dark data: 'Dark data' refers to data collected but not used, often due to quality issues. By performing data discovery and profiling, you can quantify how much of your data falls into this category, indicating potential areas where data quality improvements could unlock value.
When selecting metrics, it’s crucial to consider the specific needs and context of your business and data systems. You can use tools such as data profiling software and analytical queries to measure these metrics effectively. Regular monitoring, coupled with an understanding of your system's typical performance benchmarks, will guide you in managing the quality of your data.
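Three of the metrics above are straightforward to compute once you decide what counts as an error or an empty value. The sketch below makes those decisions explicit as assumptions: an empty value is None or a blank string, and a transformation error is any exception raised by the transform function. The function names are hypothetical. Dark data is omitted because quantifying it depends on usage metadata rather than on the records themselves.

```python
def error_ratio(error_count, total_records):
    """Errors relative to dataset size (0.0 when the dataset is empty)."""
    return error_count / total_records if total_records else 0.0


def count_empty_values(rows):
    """Number of fields that are None or an empty/whitespace-only string."""
    return sum(
        1
        for row in rows
        for value in row.values()
        if value is None or (isinstance(value, str) and not value.strip())
    )


def transformation_error_rate(rows, transform):
    """Fraction of rows for which the transformation raises an exception."""
    if not rows:
        return 0.0
    failures = 0
    for row in rows:
        try:
            transform(row)
        except Exception:
            failures += 1
    return failures / len(rows)
```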
How to implement data quality monitoring
Implementing data quality monitoring involves several crucial steps, whether you're addressing existing issues or planning for a new data quality approach.
Addressing data quality issues in your data systems
Perform root-cause analysis of the data with issues
Identify the stage(s) in your data system where data quality issues have consistently plagued you. Essentially, get an idea of these issues—it does not have to be elaborate.
Profile the problematic dataset
Familiarize yourself with the data, including the schema and other statistical information on the data. How you do this depends on the complexity of the dataset you need to profile.
After that, explore all the dependencies of the complex dataset:
- What workflows are your datasets dependent on?
- How do your most critical data assets flow through your system, including the pipelines that ingest them?
- Where is the data transformed?
- What are the downstream systems that consume them?
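If you can write down which asset feeds which, answering the dependency questions above becomes a graph traversal. The sketch below uses a hypothetical lineage map (the asset names are invented for illustration) and a breadth-first walk to list everything downstream of a given dataset.

```python
from collections import deque

# Hypothetical lineage: each key feeds the assets listed as its value.
LINEAGE = {
    "firebase_events": ["raw_events"],
    "raw_events": ["cleaned_events"],
    "cleaned_events": ["orders_daily", "ml_features"],
    "orders_daily": ["bi_dashboard"],
    "ml_features": ["churn_model"],
}


def downstream_of(asset, lineage):
    """Breadth-first walk returning every asset reachable from `asset`."""
    seen, queue, order = set(), deque([asset]), []
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order
```

Running downstream_of on the problematic dataset gives you the list of consumers to notify and the places where a fix needs to be verified.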
Make considerations on what to monitor
Based on the issues identified, determine what aspects of your data need the most attention in terms of monitoring.
Choose a continuous monitoring approach
Choose an approach that can, at the bare minimum, help you track those issues—these are the must-haves. Depending on the organization’s processes and existing tool stack, that could be an open-source or enterprise solution. The most important thing is that the tool checks the boxes to monitor issues that consistently affect your system.
We’ll dive deeper into your continuous monitoring approach in the section below.
Implement the continuous monitoring solution
Deploying the solution you end up using will involve configuring and fine-tuning the monitoring software for your business use case, based on the considerations you made earlier about what to monitor.
Test the continuous monitoring solution
Test the entire solution to make sure it works as expected. For example, check that you are getting accurate alerts (few false positives and false negatives) and that the solution is scalable, generalizable, and easy for others to use.
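One way to check alert accuracy is to replay a period for which you already know which batches had real problems and compare that list with the alerts the solution fired. The sketch below assumes both are available as sets of batch identifiers; the alert_quality function and its output keys are hypothetical.

```python
def alert_quality(alerts, incidents):
    """Compare fired alerts against known incidents (both are sets of batch IDs)."""
    true_positives = len(alerts & incidents)
    false_positives = len(alerts - incidents)  # alerted, but nothing was wrong
    false_negatives = len(incidents - alerts)  # something was wrong, no alert
    # Precision: share of alerts that were real. Recall: share of incidents caught.
    precision = true_positives / len(alerts) if alerts else 1.0
    recall = true_positives / len(incidents) if incidents else 1.0
    return {
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "precision": precision,
        "recall": recall,
    }
```

Low precision points to noisy thresholds that will cause alert fatigue, while low recall means real issues are slipping through.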
Managing reports
Determine the format and recipients of the reports generated by your monitoring solution, and establish protocols for action based on these reports.
Planning for a new problem
Understand business, data, and technical requirements
What requirements must be met to consider your data suitable for the business problem in question? Understanding the business requirements for your use case would help you develop the data and technical requirements needed to monitor the quality of your data successfully.
Define key metrics
You also want to define the key metrics relevant to the specific business use case. For example, a key metric may be for a data pipeline to output an average customer order daily at 8 p.m. to the data presentation layer (such as a Tableau dashboard).
The goal here is to align the business requirements with the technical requirements for the data system. What SLAs, SLOs, and SLIs would be required to run the system successfully? The output from this step should give you the insights necessary to set metrics for the five pillars of data observability we discussed earlier.
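To show how a requirement such as "delivered daily by 8 p.m." turns into an SLI and an SLO, here is a minimal sketch. The function names, the 99% target, and the simplification that each delivery timestamp falls on the calendar day it belongs to are all assumptions for this example.

```python
from datetime import datetime, time


def delivered_on_time(delivered_at: datetime, deadline: time = time(20, 0)) -> bool:
    """True if the daily output landed at or before the deadline.

    Assumes the timestamp falls on the day the delivery was due; a delivery
    that slips past midnight would need a date-aware comparison instead.
    """
    return delivered_at.time() <= deadline


def sli_on_time_rate(deliveries, deadline: time = time(20, 0)) -> float:
    """SLI: fraction of daily deliveries that met the deadline."""
    if not deliveries:
        return 0.0
    return sum(delivered_on_time(d, deadline) for d in deliveries) / len(deliveries)


def meets_slo(deliveries, target: float = 0.99, deadline: time = time(20, 0)) -> bool:
    """SLO: the on-time rate is at or above the agreed target."""
    return sli_on_time_rate(deliveries, deadline) >= target
```

Here the SLI is the measured on-time rate, the SLO is the target that rate must meet, and the SLA is the agreement with stakeholders about what happens when it does not.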
Relate metrics to data assets and entities
Once you identify the key metrics, you can relate them to the datasets, data assets, and data dependencies that would help achieve such metrics. This way, you identify the most crucial data flows you want to monitor and manage.
The goal is to help you understand how your data flows end-to-end. From here, you can also use the key metrics to write data validation tests based on the issues you anticipate (your "known unknowns") and choose what to monitor.
Choosing a data quality monitoring solution
The goal of any good data quality monitoring solution should be to help you, as an engineer, find and fix data quality issues quickly. Such a solution should also help you overcome the data quality monitoring challenges we discussed earlier.
In the second article in this series, you will learn how to choose an ideal data quality monitoring solution for your use case.
Conclusion
In summary, data quality monitoring answers the question of trust and reliability with your data: How reliable is the data your pipeline is ingesting across your entire data system? As an engineer, you want to understand the quality of the product (in this case, data) you are working on to ensure the systems you are building are reliable and won’t fail and cause business harm.
A lack of control or visibility into data quality can lead to inaccurate insights and poor decisions, resulting in lost income or a poor customer experience.
In the second article in this series, we will look at:
- What an ideal data quality monitoring solution is.
- Several open-source and SaaS data quality monitoring tools.
- Making a decision: Should you build or buy a data quality monitoring tool?
Other links
- Try WhyLabs’s Free Self-Service Data Monitoring Platform — https://whylabs.ai/whylabs-free-sign-up
- Join the Slack Community to discuss ideas and share feedback on data logging and observability.