WhyLabs Team

Feb 2, 2024

A Comprehensive Overview Of Data Quality Monitoring

ML Monitoring
Data Quality

As data becomes increasingly vital in the digital age, the role of data quality monitoring in building effective data and machine learning systems has grown immensely. Data quality monitoring involves the ongoing assessment and validation of data to ensure accuracy, consistency, and reliability. Forrester's Online Global Survey on Data Quality and Trust reveals that 42% of data teams spend over 40% of their time validating data, highlighting its critical role. Moreover, Gartner research suggests that poor data quality results in an average annual loss of $15 million for organizations.

But how do you ensure the reliability of your data pipelines? Are you effectively checking and validating the right data? And when it comes to machine learning, the integrity and quality of the data are not just beneficial but essential for the success of your models. Neglecting continuous data quality monitoring can lead to significant business damages, such as poor application performance, compliance issues, customer churn, and revenue loss.

This article, the first in a series, offers a comprehensive overview of data quality monitoring. It will guide you through the strategies for continuous quality assessment, ensuring your data remains an asset rather than a liability.

To implement data quality monitoring in your own applications - sign up for a free WhyLabs starter account or request a demo!

What is data quality?

Data quality refers to the degree to which data meets the specific needs and expectations of its usage in various stages of the data lifecycle, including collection, processing, storage, and analysis. In this guide, we understand data quality to mean the health of your data as it flows through the entire data system.

From a technical standpoint, high-quality data should be valuable to consumers - whether they are services, applications, or users - and beneficial to the business's objectives. It involves ensuring that the data ingested into your system is accurate, consistent, and relevant to the task at hand.

But what exactly makes data 'good' or 'healthy'? Let’s understand the dimensions that define good data quality.

Dimensions of data quality

High-quality data is characterized by six key attributes. The table below defines each attribute and illustrates it with a use case from SolarTech Innovations, a fictional company specializing in solar energy solutions that uses a data-driven approach to optimize its operations and customer service.

The company relies on various data types, including customer data, operational data, and market analysis. Let’s learn how the data quality dimensions apply to 'SolarTech Innovations' data.

Dimension	Definition	Use Case (SolarTech)
Accuracy	The extent to which data correctly represents the real-world entities or events it is supposed to depict.	Accurate measurements of solar panel performance data ensure correct billing and service quality.
Consistency	Data should be consistent across different systems and over time.	Consistency in customer data across CRM and service databases is crucial for effective customer management.
Validity (or Relevance)	The data should be pertinent and suitable for the business purpose it is intended for.	Market analysis data must be relevant to the current trends in renewable energy to inform strategic decisions.
Integrity	This involves the completeness and wholeness of the data, ensuring no critical parts are missing or misrepresented.	Complete and comprehensive data on supply chain logistics is necessary for efficient operations.
Timeliness	The data should be up-to-date and available when needed to reflect current information.	Timely data on energy production and usage patterns is vital for real-time monitoring and response.
Uniqueness	Each data element should be unique without duplication, ensuring clarity and reducing redundancy.	Unique customer records are essential to avoid duplications that could lead to service delivery or marketing errors.
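Several of these dimensions can be expressed as simple ratios. The following is a minimal sketch, not a definitive implementation: the record fields (customer_id, email, updated_at) and the function names are hypothetical, and integrity is approximated here as completeness of a single field.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SolarTech customer records; field names are illustrative only.
records = [
    {"customer_id": "C1", "email": "a@example.com",
     "updated_at": datetime(2024, 2, 1, tzinfo=timezone.utc)},
    {"customer_id": "C2", "email": None,
     "updated_at": datetime(2024, 2, 1, tzinfo=timezone.utc)},
    {"customer_id": "C1", "email": "a@example.com",
     "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
]


def uniqueness(rows, key):
    """Share of distinct key values (1.0 means no duplicates)."""
    values = [row[key] for row in rows]
    return len(set(values)) / len(values)


def completeness(rows, field):
    """Share of rows where the field is present and not None."""
    return sum(row.get(field) is not None for row in rows) / len(rows)


def timeliness(rows, field, as_of, max_age):
    """Share of rows updated within max_age of the reference time."""
    return sum(as_of - row[field] <= max_age for row in rows) / len(rows)
```

Each function returns a value between 0 and 1, which makes the scores easy to compare against a threshold agreed with the data's consumers.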

What is data quality monitoring?

Monitoring data quality is about measuring, analyzing, and improving data quality to fit the business purposes and meet business expectations. With the explosion of real-time machine learning and business intelligence, the only approach to successfully validate dynamic data is to monitor its quality continually and evaluate it using a set of relevant quality metrics.

For example, data quality monitoring in a real-time analytics system might involve real-time checks for accuracy and consistency, ensuring that incoming data streams are correct, up-to-date, and synchronized across various platforms.

The necessity of monitoring the quality of your data

To understand why you need to monitor the quality of your data, you need to know where data quality issues could stem from in the entire data lifecycle and the types of issues you’ll likely find at each stage. Poor data quality can lead to financial losses, inaccurate analyses, misguided business decisions, and, ultimately, damage to reputation.

Where data quality issues could stem from

There are three significant areas where issues degrade data quality:

Directly from the data source(s).
Inside the data systems or pipelines (during transformation and manipulation).
When downstream systems consume the data.

An ETL scenario for extracting data assets from sources, transforming them through a data pipeline, and loading them into the data warehouse for consumption by downstream systems. | Image source: Author.

Data quality issues directly from the data source(s)

You have probably encountered this scenario: It’s Friday, and you are excited about the weekend. Everything seemed to work well during the week, and you are about to log off from Slack. Then, suddenly, your data analyst calls to tell you that the report dashboard for a particular segment of customers is broken and has been leading to wrong conclusions for a week now.

You start panicking because, of course, those tests you wrote should have caught these issues in the pipeline, right? If you are lucky, it takes you anywhere from a few hours to a few days of searching to find the root cause of the problem. Alas, you find out that the mobile application developers changed the schema of the Firebase table that collects data from the app.

Since you were not informed and did not write validation tests to cover such edge cases, the bad data made its way through to your reporting layer. While you cannot anticipate every issue that stems directly from data sources, common ones include:

Duplicate and delayed events,
Stale or inaccurate data ingested into the data system,
An outlier in a crucial field that goes undetected until it shows up in the data reporting layer or is ingested by a machine learning training pipeline,
Data distribution drift, which you may only notice when a production ML application outputs strange results because the training data distribution differs significantly from the production data distribution,
Missing events (events not coming in because of broken integrations),
Wrong data type and format (wrong dollar amount; incorrect product code),
Incorrect data syntax and semantics.
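Two of the issues above, wrong types or formats and duplicate events, can be caught with lightweight checks at ingestion time. The sketch below is illustrative only: the event fields (event_id, amount, product_code) and the product-code pattern are assumptions, not a real schema.

```python
import re

# Assumed product-code format for illustration: three letters, a dash, four digits.
PRODUCT_CODE = re.compile(r"^[A-Z]{3}-\d{4}$")


def validate_event(event):
    """Return a list of human-readable problems found in one event."""
    problems = []
    amount = event.get("amount")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        problems.append("amount is not numeric")
    elif amount < 0:
        problems.append("amount is negative")
    code = event.get("product_code")
    if not isinstance(code, str) or not PRODUCT_CODE.match(code):
        problems.append("product_code has an unexpected format")
    return problems


def find_duplicates(events, key="event_id"):
    """Return the ids that appear more than once, in first-seen order."""
    seen, dupes = set(), []
    for event in events:
        event_id = event[key]
        if event_id in seen and event_id not in dupes:
            dupes.append(event_id)
        seen.add(event_id)
    return dupes
```

Returning a list of problems rather than raising on the first one lets you count and report every issue per event, which is more useful for monitoring than a hard failure.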

An ELT scenario for extracting data assets from sources in which another team has altered one of the many sources. The data gets loaded into the warehouse and, if tests do not catch the change, passed on to downstream systems for consumption. | Source: Author.

Data quality issues inside the data systems or pipelines

Good data pipelines mean good data hygiene and ultimately healthy data. But here’s the catch: building good data pipelines is challenging, whether it’s an ETL, ELT, or even a reverse ETL (rETL) pipeline. That difficulty is why you might find bad or buggy transformations that lead to data quality issues.

For example, you may write your transformation steps incorrectly, causing your pipeline steps to execute in the wrong order. Or you may place the data validation tests in your pipeline in the wrong order.

The transformations within the data pipeline can cause data downtime, corrupt the data, and even cause problems for downstream consumers.

An ELT scenario for extracting data assets from sources and loading them into the data warehouse. Buggy transformations that do not cause a pipeline execution failure end up sending corrupt data back to the warehouse and to downstream consumers. | Source: Author.
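One way to catch transformations that corrupt data without failing is to compare cheap invariants before and after each step. The following is a sketch under stated assumptions: the cents-to-dollars transformation, the field names, and the tolerance are all hypothetical.

```python
def to_dollars(rows):
    """Example transformation: convert integer cents to dollar amounts."""
    return [{**row, "amount": row["amount_cents"] / 100} for row in rows]


def check_transformation(before, after, tolerance=0.005):
    """Compare simple invariants before and after a transformation step."""
    issues = []
    # Invariant 1: the step should not add or drop rows.
    if len(before) != len(after):
        issues.append(f"row count changed: {len(before)} -> {len(after)}")
    # Invariant 2: the total amount should be preserved (within rounding).
    expected_total = sum(row["amount_cents"] for row in before) / 100
    actual_total = sum(row["amount"] for row in after)
    if abs(expected_total - actual_total) > tolerance:
        issues.append("total amount changed")
    return issues
```

An empty list means both invariants held. A step that silently drops rows would return one entry per violated invariant instead of letting the corrupt output reach the warehouse unnoticed.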

Data quality issues when downstream systems consume them

Data quality issues are least common at this stage, but they do occur. For example, a code modification in your ML pipeline could prevent an API from gathering data for a live or offline model. Similarly, after a software upgrade or dependency change, your BI analysis tool may stop receiving updates from your data source and produce stale reports.

These issues affect the quality of the dataset fed to downstream consumers, such as a machine learning training pipeline or analytics software. That, in turn, affects the results of such a system, potentially rendering the entire project futile.

Continuously monitoring your data quality (and, going forward, observing it) helps you:

Detect most, if not all, data quality issues,
Troubleshoot these issues before they can cause silent or non-silent errors,
Continuously report on, and improve, the quality of the data you are using to solve business problems.

This increases both the trust you have in your data and its reliability.

Now that you have an idea of where these issues might occur, let’s look at what makes monitoring data quality challenging.

Challenges with monitoring data quality

The core challenges with monitoring data quality are:

1. Diversity in business use cases

Each business use case, such as time-series analysis versus recommendation engines, has unique data requirements and standards. This diversity makes it challenging to define universal standards for data quality. Continuous engagement with stakeholders to understand and define appropriate metrics (data SLAs) for each case is crucial.

2. Challenges stemming from the data entity

Data scale

With the exponential growth in the volume of data comes the need to validate the dataset in real-time to ensure businesses can make the most of their data.

Monitoring data quality with increasing data volume can be challenging. The increased data volume can lead to more data quality issues that must be monitored, triaged, and troubleshot.

Another problem with the data scale is its ephemeral nature. Monitoring may also be challenging if you do not persist high-volume data in storage or compute memory.

Data cardinality

High data volume often leads to high data cardinality (the number of unique values in a column). This is not necessarily bad (high-cardinality data can also mean more valuable data), but it can make your data structure inconsistent. So you often need to adapt your data quality monitoring techniques to account for structural changes.

Data fragmentation

Data assets (such as metadata) and dependencies are scattered across systems and organizational use cases, each containing a small amount of information, making it difficult to centralize quality monitoring and management efforts.

Data legalities

In most situations, data security is crucial, and it’s often vital that whatever solution or technique you use to monitor the quality of the data does not compromise its security.

Other legal issues revolve around monitoring data with personally identifiable information (PII) and ensuring the data values are protected while using the monitoring technique or solution.

3. Challenges due to data infrastructure

The complexity of the data architecture

You can link this challenge to the use case diversity challenge mentioned earlier. Most business use cases require data to be continuously streamed into data pipelines in real-time and not just at scheduled intervals (batch mode).

The high data throughput makes monitoring the quality of the data complex because users and systems have to be proactive in triaging and troubleshooting data quality issues in real-time.

Data pipeline workflow complexity

Workflows have grown increasingly complex, especially with the data scale, the explosion of use cases, and the advent of workflow orchestration tools. It is hard to monitor data pipelines with a dozen stages and many branches, and it is even more challenging when someone makes an unanticipated change to the pipeline.

Data source orchestration

When monitoring data quality, ingesting data from various sources adds a lot of overhead. The larger the number of sources, the more likely you are to deal with both similar and varying data quality issues.

Looking at these challenges, you will realize that monitoring data quality requires a thoughtful and continuous approach. Therefore, you need to consider these challenges to develop your continuous monitoring approach to data quality.

What should you monitor in your data systems?

Knowing what to monitor in your data system to ensure high-quality data is tricky because it is often problem-dependent and has no set standards. Like monitoring regular software systems, there are several custom and standard metrics. However, we can take a page from software monitoring and observability.

Software observability has three pillars:

Logs (record of events)
Metrics (quantitative measures of certain aspects)
Traces (record of the path taken through the software)

Adapting these to data systems, we arrive at five key pillars of data observability:

1. Volume

Monitors whether all expected data has arrived in the system. For example, check for significant deviations in data volume that might indicate missing or excessive data.
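A volume check can be as simple as comparing the latest row count with a historical baseline. This is a minimal sketch: the function name, the use of a plain mean as the baseline, and the 50% tolerance are assumptions you would tune for your own data.

```python
from statistics import mean


def volume_alert(history, today, tolerance=0.5):
    """Flag today's row count if it deviates from the historical mean
    by more than `tolerance`, expressed as a fraction of the mean."""
    baseline = mean(history)
    if baseline == 0:
        # With an empty baseline, any arriving data is a deviation.
        return today != 0
    return abs(today - baseline) / baseline > tolerance
```

For example, with a history of roughly 1,000 rows per day, a day with 400 rows would be flagged while a day with 950 rows would not. The check is symmetric, so unexpectedly large volumes are flagged as well.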

2. Freshness

Monitors the time the data arrived in the data system, its recency, and whether it is up-to-date with its dependencies.
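Freshness can be measured as the lag between the most recent arrival and the current time. The sketch below assumes timezone-aware timestamps and a maximum acceptable lag that you define; both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def freshness_lag(last_arrival, now=None):
    """Time elapsed since the most recent record arrived."""
    now = now or datetime.now(timezone.utc)
    return now - last_arrival


def is_stale(last_arrival, max_lag, now=None):
    """True when the data is older than the acceptable lag."""
    return freshness_lag(last_arrival, now) > max_lag
```

Passing `now` explicitly keeps the check deterministic in tests. In production you would typically omit it and let the function use the current UTC time.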

3. Schema

Monitors the data structure and dimensions. It includes how they have changed, including the source of the change, who made the changes, and their impact.
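A schema check compares the columns and types you expect with those you observe. The following sketch represents a schema as a plain mapping of column name to type name, which is an assumption for illustration; real systems usually read this from a catalog or from the table metadata.

```python
def diff_schema(expected, observed):
    """Compare two {column: type_name} mappings and report the differences."""
    shared = set(expected) & set(observed)
    return {
        "added": sorted(set(observed) - set(expected)),
        "removed": sorted(set(expected) - set(observed)),
        "retyped": sorted(c for c in shared if expected[c] != observed[c]),
    }
```

A renamed column shows up as one removed and one added column, which is the kind of upstream change described in the Firebase scenario earlier in this article.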

4. Distribution

Evaluates whether the data distribution falls within an acceptable range and if transformations within the data pipeline are accurate. This could involve analyzing the statistical properties of the data.
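One common way to quantify a distribution change is the Population Stability Index (PSI), which compares the share of values falling into each bin for a reference sample and a current sample. PSI is one option among many and is not prescribed by this article. The sketch assumes non-empty numeric samples, equal-width bins derived from the reference sample, and a small epsilon to avoid taking the logarithm of zero.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back when all reference values are equal

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range go into the first or last bin.
            counts[min(max(idx, 0), bins - 1)] += 1
        return [max(c / len(values), eps) for c in counts]

    ref, cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

Identical samples give a PSI of zero, and the value grows as the samples diverge. A frequently cited rule of thumb treats values above roughly 0.25 as a significant shift, but the right threshold depends on your data and should be validated against known incidents.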

5. Lineage

Traces the flow of data from upstream sources to downstream consumers. You can also trace who interacts with your data and at what stages.
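Lineage can be modeled as a directed graph and traversed to find everything affected by a change. The asset names and the graph below are hypothetical; in practice, the graph would come from your orchestration tool or data catalog.

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to the assets that read from it.
LINEAGE = {
    "firebase.events": ["warehouse.raw_events"],
    "warehouse.raw_events": ["warehouse.orders"],
    "warehouse.orders": ["dashboard.revenue", "ml.churn_training_set"],
}


def downstream(graph, source):
    """Breadth-first list of every asset affected by a change to `source`."""
    seen, queue, order = {source}, deque([source]), []
    while queue:
        for child in graph.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order
```

Calling `downstream(LINEAGE, "firebase.events")` lists the warehouse tables, the dashboard, and the training set that a schema change at the source could affect, which tells you whom to notify when an upstream issue is detected.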

The five key pillars of data observability are volume, freshness, schema, distribution, and lineage. | Source: Author.

In most solutions, a data profiler (such as whylogs) generates the metadata about the data from the upstream source (or data producer). The metadata provides the information required to observe the data system appropriately with the five data observability pillars listed above. It can be a good source for the data quality metric.

Before we continue, you should know a subtle but crucial difference between data observability and monitoring. Monitoring tells you when something is wrong with the quality of your data. Observability, in turn, enables you to understand why you encounter data quality issues and what to do about them.

Choosing what to monitor in your data system

Choosing what to monitor depends on the type of business problem you are solving, so you need to make some considerations before deciding what to monitor. These considerations are based on your data profile, problem, resources, and the operational impact of the data.

Monitoring a data system involves tracking functional metrics like data quality metrics, pipeline metrics, and operational metrics (like compute) with a solution. | Source: Author.

Here are vital considerations, adapted from Frank Dravis's whitepaper on data quality strategy:

1. Frequency of data consumption

Assess how often downstream systems use the data (hourly, daily, weekly, etc.). This frequency guides the monitoring approach and highlights critical issues to prioritize.

2. Importance of the downstream system

Is the downstream system an ML model used for mission-critical, life-dependent, or routine operations? Or is it a BI tool used for end-of-the-month reporting? Sometimes, ML engineers deploy models for healthcare applications, highly regulated applications, or applications prone to bias.

Considering this, you will know the most important issues to monitor—especially when you talk to domain experts.

3. Cost of monitoring

After establishing the monitoring system and process, the direct costs are labor and system resources. Ideally, the better the monitoring technology, the lower the labor costs. You may want to log many issues, but knowing the cost of monitoring lets you home in on the most critical ones, while leaving room for issues you have not anticipated.

4. Operational impact

Can you estimate the cost of feeding poor-quality data into these systems, especially where a complex dataset feeds crucial downstream consumers? How much do the data quality issues you encounter affect the department’s or organization’s targets?

There are two aspects to consider here:

Impact of assessing operational (production) data during live operations.
Impact of the process on all stakeholders involved.

This will inform whether you audit the quality assessment process manually, partially automate it, or fully automate it.

Data quality indicators (DQI)

Data Quality Indicators (DQIs) serve as key metrics for assessing data quality within a system, akin to Key Performance Indicators (KPIs) in business analytics. To be effective, DQIs should meet the following criteria:

Well-defined and measurable: Each DQI should have a clear definition and a quantifiable measure. For example, a DQI for data accuracy might measure the percentage of records that pass a set of accuracy checks.
Relevant to the business use case: The DQI should directly apply to the specific business context. For instance, in a retail business, a DQI might focus on the accuracy and completeness of inventory data.
Aligned with business requirements and data SLAs: DQIs should reflect the business's needs and the agreed-upon data service level agreements (SLAs).

You can define your DQIs around the five pillars of data observability discussed earlier in this section.

Data quality monitoring solutions usually have built-in DQIs dependent on your data. Of course, you may need to make some maneuvers to monitor unstructured data, but most tools would allow you to use a query (like SQL) or programming language (like Python) to define your indicators.

Since business use cases differ, you may also need to define DQIs specific to business use cases to ensure your bases are covered.
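A DQI of the kind described above can be sketched as a named check, a pass rate, and a target taken from the data SLA. Everything in this example is illustrative: the function name, the inventory fields, and the 95% target are assumptions, not values from a real SLA.

```python
def evaluate_dqi(name, records, check, target):
    """Measure the share of records that pass `check` and compare it
    with the minimum pass rate agreed in the data SLA."""
    passed = sum(1 for record in records if check(record))
    pass_rate = passed / len(records) if records else 0.0
    return {"name": name, "pass_rate": pass_rate, "meets_sla": pass_rate >= target}


def quantity_is_valid(record):
    """Hypothetical retail check: quantity must be a non-negative integer."""
    quantity = record.get("quantity")
    return isinstance(quantity, int) and quantity >= 0


inventory = [
    {"sku": "A", "quantity": 12},
    {"sku": "B", "quantity": 0},
    {"sku": "C", "quantity": -3},
    {"sku": "D", "quantity": 7},
]
result = evaluate_dqi("inventory_quantity_valid", inventory, quantity_is_valid, target=0.95)
```

Keeping the check as a plain function makes it easy to define business-specific DQIs alongside the built-in ones your monitoring tool provides.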

Key metrics to assess data quality

You can use several key metrics to monitor and maintain high data quality. These metrics provide insights into various aspects of data quality and are instrumental in identifying areas for improvement:

Ratio of data to errors: This metric compares the number of errors with the size of the dataset ingested into your pipeline. For instance, a high number of errors relative to the dataset size may indicate systemic data collection or entry issues.
Number of empty values: Empty fields in a dataset can signal incomplete data. Tracking this metric helps in assessing the completeness of the data.
Data transformation error rate: This metric measures the frequency of errors occurring due to transformations in the data pipeline. A high error rate could indicate data processing rules or logic issues.
Amount of dark data: 'Dark data' refers to data collected but not used, often due to quality issues. By performing data discovery and profiling, you can quantify how much of your data falls into this category, indicating potential areas where data quality improvements could unlock value.

When selecting metrics, it’s crucial to consider the specific needs and context of your business and data systems. You can use tools such as data profiling software and analytical queries to measure these metrics effectively. Regular monitoring, coupled with an understanding of your system's typical performance benchmarks, will guide you in managing the quality of your data.
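Three of the metrics listed above can be computed with a few lines of code. This is a rough sketch with hypothetical function names. It treats None and blank strings as empty values, and counts a transformation error whenever the transformation raises a KeyError, TypeError, or ValueError; both are assumptions you may need to adjust.

```python
def error_ratio(error_count, total_records):
    """Number of errors relative to the size of the ingested dataset."""
    return error_count / total_records if total_records else 0.0


def count_empty(rows, fields):
    """Count field values that are None or blank strings."""
    def is_empty(value):
        return value is None or (isinstance(value, str) and not value.strip())
    return sum(is_empty(row.get(field)) for row in rows for field in fields)


def transformation_error_rate(rows, transform):
    """Share of rows for which the transformation raises an error."""
    failures = 0
    for row in rows:
        try:
            transform(row)
        except (KeyError, TypeError, ValueError):
            failures += 1
    return failures / len(rows) if rows else 0.0
```

Tracking these values per pipeline run, rather than as one-off numbers, gives you the baseline needed to tell a normal day from an incident.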

How to implement data quality monitoring

Implementing data quality monitoring involves several crucial steps, whether you're addressing existing issues or planning for a new data quality approach.

Addressing data quality issues in your data systems

Perform root-cause analysis of the data with issues

Identify the stage(s) in your data system where data quality issues have consistently plagued you. Essentially, get an idea of these issues—it does not have to be elaborate.

Profile the problematic dataset

Familiarize yourself with the data, including the schema and other statistical information on the data. How you do this depends on the complexity of the dataset you need to profile.

After that, explore all the dependencies of the complex dataset:

What workflows are your datasets dependent on?
Understand how your most critical data assets flow through your system, including the pipelines that ingest them.
Where is the data transformed?
What are the downstream systems that consume them?

Make considerations on what to monitor

Based on the issues identified, determine what aspects of your data need the most attention in terms of monitoring.

Choose a continuous monitoring approach

Choose an approach that can, at the bare minimum, help you track those issues—these are the must-haves. Depending on the organization’s processes and existing tool stack, that could be an open-source or enterprise solution. The most important thing is that the tool checks the boxes to monitor issues that consistently affect your system.

We’ll dive deeper into your continuous monitoring approach in the section below.

Implement the continuous monitoring solution

Deploying the solution you choose involves configuring and fine-tuning the monitoring software for your business use case, based on the considerations you made earlier about what to monitor.

Test the continuous monitoring solution

Test the entire solution to make sure it works as expected. For example, check that alerting is accurate (few false positives and false negatives) and that the solution is scalable, generalizable, and easy for others to use.
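Alert accuracy can be estimated by comparing the alerts the solution raised with the incidents you know actually happened, for example from a backfill or a replay of historical data. The sketch below assumes both are available as collections of comparable identifiers (such as dates or batch ids); the function name is hypothetical.

```python
def alert_quality(alerted, actual):
    """Compare raised alerts with known incidents."""
    alerted, actual = set(alerted), set(actual)
    true_positives = len(alerted & actual)
    return {
        # Share of alerts that pointed at a real incident.
        "precision": true_positives / len(alerted) if alerted else 1.0,
        # Share of real incidents that triggered an alert.
        "recall": true_positives / len(actual) if actual else 1.0,
        "false_positives": sorted(alerted - actual),
        "false_negatives": sorted(actual - alerted),
    }
```

Low precision means the team will learn to ignore alerts, while low recall means issues still reach downstream consumers unnoticed, so both numbers are worth tracking while you tune thresholds.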

Managing reports

Determine the format and recipients of the reports generated by your monitoring solution, and establish protocols for action based on these reports.

Planning for a new problem

Understand business, data, and technical requirements

What requirements must be met to consider your data suitable for the business problem in question? Understanding the business requirements for your use case would help you develop the data and technical requirements needed to monitor the quality of your data successfully.

Define key metrics

You also want to define the key metrics relevant to the specific business use case. For example, a key metric may be that a data pipeline delivers the average customer order to the data presentation layer (such as a Tableau dashboard) daily at 8 p.m.

The goal here is to align the business requirements with the technical requirements for the data system. What SLAs, SLOs, and SLIs would be required to run the system successfully? The output from this step should give you the insights necessary to set metrics for the five pillars of data observability we discussed earlier.
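To make the relationship between SLIs and SLOs concrete, the sketch below treats the 8 p.m. delivery time from the example above as a deadline, measures the share of on-time deliveries as the SLI, and compares it with a target SLO. The function names, the sample timestamps, and the 99% target are assumptions for illustration.

```python
from datetime import datetime, time


def on_time_sli(delivery_times, deadline=time(20, 0)):
    """SLI: share of daily deliveries that landed at or before the deadline."""
    on_time = sum(1 for ts in delivery_times if ts.time() <= deadline)
    return on_time / len(delivery_times)


def meets_slo(sli, slo=0.99):
    """True when the measured SLI reaches the target SLO."""
    return sli >= slo


# Hypothetical delivery timestamps for four consecutive days.
deliveries = [
    datetime(2024, 2, 1, 19, 45),
    datetime(2024, 2, 2, 20, 0),
    datetime(2024, 2, 3, 21, 10),
    datetime(2024, 2, 4, 18, 0),
]
sli = on_time_sli(deliveries)
```

The SLA is then the agreement with stakeholders about what happens when the SLO is missed, which is a business decision rather than something the code can express.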

Relate metrics to data assets and entities

Once you identify the key metrics, you can relate them to the datasets, data assets, and data dependencies that would help achieve such metrics. This way, you identify the most crucial data flows you want to monitor and manage.

The goal is to help you understand how your data flows end-to-end. From here, you can also use the key metrics to write data validation tests based on the issues you anticipate (your "known unknowns") and choose what to monitor.

Choosing a data quality monitoring solution

The goal of any optimal data quality monitoring solution should be to help you, as an engineer, find and fix data quality issues quickly. In addition, such a solution should also help you overcome the data quality monitoring challenges we discussed earlier.

In the second article in this series, you will learn how to choose an ideal data quality monitoring solution for your use case.

Conclusion

In summary, data quality monitoring answers the question of trust and reliability with your data: How reliable is the data your pipeline is ingesting across your entire data system? As an engineer, you want to understand the quality of the product (in this case, data) you are working on to ensure the systems you are building are reliable and won’t fail and cause business harm.

A lack of control or visibility into data quality can lead to inaccurate insights and poor decisions, resulting in lost income or a poor customer experience.

In the second article in this series, we will look at:

What an ideal data quality monitoring solution is.
Several open-source and SaaS data quality monitoring tools.
Making a decision: Should you build or buy a data quality monitoring tool?
