Re-imagine Data Monitoring with whylogs and Apache Spark
- Open Source
- Data Quality
Nov 23, 2022
In the era of microservices, decentralized ML architectures, and complex data pipelines, data quality has become a more significant challenge than ever. When data is involved in complex business processes and decisions, bad data can and will affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is costly and cumbersome, while data monitoring is often fragmented and performed ad hoc.
whylogs is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform-agnostic approach to data quality and monitoring, working with different modes of data operations, including streaming, batch, and IoT data.
This blog post provides an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and how users can apply this integration into existing data and ML pipelines.
Complex ML model pipelines
The ML lifecycle as pictured by Google Cloud AI contains many components - a slight evolution from where we came from in the DevOps world. The anatomy of an ML model pipeline has become much more complex due to data pipelines, infrastructure, and so on, creating a gap in the tooling for deploying models effectively, or what we like to call “the data tooling gap.”
There are solutions available to address these gaps, including feature stores, model serving, and model registries. When it comes to monitoring, however, we still apply a DevOps monitoring framework to machine learning pipelines. Monitoring individual metric points often fails to provide adequate insight into machine learning data because the data is large and high-dimensional. Complex data types, such as images, audio, and text, complicate things further.
Data issues are overlooked in production
In working with various data science teams, we’ve seen a common trend in issues encountered in ML pipelines in production. Most of the issues faced in production don’t actually originate in the production stage; they start long before the model is deployed, during pre-production. Data quality issues can occur at any step before production, and not unlike code in DevOps, bad data can lead to broken systems.
However, we often don’t view data quality control with the same weight as we view code review. This is something in the field of machine learning and data engineering that we need to change. To shift our mindset about ML pipelines, we need to enable data observability, which requires a different paradigm for tracking these telemetry signals. At WhyLabs, we have been building the open standard for data logging that addresses these use cases.
Monitoring starts with data logging
So, how do we address this gap without reinventing the wheel? There are many lessons from DevOps that we can carry into the data world. Just as programmers start developing programs with if statements, we can do something similar with data logging, provided it’s lightweight regardless of where your model runs. Models now run on very complex infrastructure, from a Spark ML pipeline to things like your smartwatch or LTE devices. It’s critical to have a framework that works across these deployment infrastructures, with a common language underneath and processes and tooling around it.
Data profiling vs. sampling
A common approach we see in production is sampling, where a sample of your production dataset is taken and analysis runs on top of it. Deployed data is big and messy, so for many companies running analysis on the full dataset is either not worthwhile or very expensive.
Data profiling is a powerful technique for monitoring data and is applied as a data logging solution. If done right, you can achieve both scalability and lightweightness, and detect outlier events independent of the distribution of data.
"Sampling is basically rolling the dice to get the right sampling strategy with your data, you may get a representative distribution, but otherwise, the sample just turns into noise"
When you sample data, you’re storing individual data points, and that can come with privacy and security concerns if you don’t have adequate protection for that purpose. In contrast, profiling works in an aggregate manner: you look at the overall statistics of the whole collection of data points rather than at individual data points. It also comes with the benefit of not actually storing your users' or models' data, making it friendlier to the privacy, security, and compliance requirements of the modern ML business.
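To make the contrast concrete, here is a minimal sketch in plain Python. It is not whylogs code: the `RunningStats` class and the data are hypothetical and chosen only to show that an aggregate profile sees every value without retaining any of them, while a random sample keeps raw data points and can easily miss a rare outlier.

```python
import random
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Aggregate statistics only; no individual value is retained."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def track(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else float("nan")


# Hypothetical data: 9,999 ordinary values and a single extreme outlier.
values = [10.0] * 9_999 + [1_000_000.0]

# Profiling: one pass over everything, constant memory.
stats = RunningStats()
for value in values:
    stats.track(value)

# Sampling: a 1% sample stores raw points and only sees the outlier by luck.
rng = random.Random(0)
sample = rng.sample(values, k=100)

print("profile max:", stats.maximum)
print("sample max: ", max(sample))
```

The profile always reports the outlier in its maximum because every value passes through `track`; whether the sample contains it depends on the draw.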
Logging ML data at scale with whylogs
Let’s talk about how we think about this problem at scale, when the data runs to terabytes.
There are four key paradigms we want to call out, and they overlap but touch on different angles of this problem:
- Approximations rather than exact results
We model this problem as a stochastic process and use Apache DataSketches, an open source implementation of these algorithms. Among the available statistics, we focus on histograms, frequent items, and cardinality. They provide high fidelity in terms of the nature of your data.
In the traditional approach, when you want to process metrics you have to store the data in a data lake or data warehouse. Whether you do it by sampling or through an ETL or ELT data pipeline, you have to store it somewhere and then run a query on top of it.
Because you’re running it in a distributed manner, whether in a Spark job or in a distributed real-time inference model, you can add these results together directly without expensive IO operations.
- Batch and streaming support
You can log your data regardless of whether it’s a batch dataset, a streaming dataset, or data on IoT infrastructure. Combined with a general-purpose distributed processing engine like Spark, this approach can easily scale to terabytes of data.
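The paradigms above can be illustrated with a toy cardinality sketch. This is an assumption-laden illustration, not what whylogs or Apache DataSketches actually run: it uses a simple K-Minimum-Values (KMV) estimator written from scratch, and the `KMVSketch` class name and the `user-…` data are hypothetical. It shows an approximate answer in bounded memory, one item at a time, with sketches that can be added together.

```python
import hashlib
import heapq


class KMVSketch:
    """K-Minimum-Values sketch: approximate distinct count in O(k) memory."""

    def __init__(self, k=256):
        self.k = k
        self.mins = set()  # the k smallest hash values seen so far

    @staticmethod
    def _hash(item):
        # Map any item to a pseudo-uniform number in [0, 1].
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, item):
        """Streaming update: one item at a time, no raw data stored."""
        h = self._hash(item)
        if h in self.mins:
            return
        if len(self.mins) < self.k:
            self.mins.add(h)
            return
        largest = max(self.mins)  # O(k); fine for a sketch of a sketch
        if h < largest:
            self.mins.remove(largest)
            self.mins.add(h)

    def merge(self, other):
        """Combine two sketches; equivalent to sketching the union of inputs."""
        merged = KMVSketch(self.k)
        merged.mins = set(heapq.nsmallest(self.k, self.mins | other.mins))
        return merged

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))  # fewer than k distinct items: exact
        return (self.k - 1) / max(self.mins)


# Two "partitions" sketched independently, plus a single sketch over all data.
part_a, part_b, full = KMVSketch(), KMVSketch(), KMVSketch()
for i in range(10_000):
    (part_a if i % 2 == 0 else part_b).update(f"user-{i}")
    full.update(f"user-{i}")

merged = part_a.merge(part_b)
print("estimated distinct users:", round(merged.estimate()))
```

Merging the two partition sketches gives exactly the same state as sketching all the data in one place, which is what lets results be combined without moving the underlying rows.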
whylogs implements these requirements as an open source data logging framework, and has Apache Spark as a first class integration. Since whylogs requires only a single pass of data, the integration is highly efficient: no shuffling is required to build whylogs profiles with Spark. Users can start in Java, Python, Scala or Kotlin and even R when using Spark in these languages.
Apache Spark integration
When you scan your data in a Spark cluster, it is spread across partitions in a distributed manner. whylogs implements a custom aggregator on top of this that hooks into your data frame, looking at your schema and at any grouping such as group by key. We then build a collection of metadata and, within that metadata, sketches and other metrics. Each partition yields a single profile. In the end, the custom aggregator reduces these partition profiles by merging them into a global profile without causing data to shuffle.
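The partition-then-merge flow can be simulated without Spark. The sketch below is plain Python standing in for the aggregator, not the whylogs implementation: `ColumnProfile`, `profile_partition`, and `merge_profiles` are hypothetical names, and the rows are made up. Each partition is profiled in a single pass, and only the small profiles are combined.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class ColumnProfile:
    count: int = 0
    nulls: int = 0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def merge(self, other):
        return ColumnProfile(
            self.count + other.count,
            self.nulls + other.nulls,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )


def profile_partition(rows):
    """Single pass over one partition; returns {column: ColumnProfile}."""
    profiles = {}
    for row in rows:
        for column, value in row.items():
            p = profiles.get(column, ColumnProfile())
            if value is None:
                profiles[column] = ColumnProfile(
                    p.count + 1, p.nulls + 1, p.minimum, p.maximum)
            else:
                profiles[column] = ColumnProfile(
                    p.count + 1, p.nulls,
                    min(p.minimum, value), max(p.maximum, value))
    return profiles


def merge_profiles(left, right):
    """Combine two per-partition results column by column."""
    merged = dict(left)
    for column, profile in right.items():
        merged[column] = merged[column].merge(profile) if column in merged else profile
    return merged


# Three hypothetical partitions of the same dataset.
partitions = [
    [{"age": 31, "income": 52_000.0}, {"age": 45, "income": None}],
    [{"age": 19, "income": 18_500.0}],
    [{"age": None, "income": 97_000.0}, {"age": 62, "income": 40_000.0}],
]

partition_profiles = [profile_partition(p) for p in partitions]  # one per partition
global_profile = reduce(merge_profiles, partition_profiles, {})  # only profiles move

print(global_profile["age"])
```

Because `merge` is associative and commutative, the order in which partition profiles are reduced does not affect the global profile.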
In the case of Spark, it's highly efficient - the final result contains information about your data such as metadata and schema, and you can even break it down by group by key so that you can slice and dice your data as well. The result can then be stored either directly in whylogs binary form or, because the output is a Spark Dataset/DataFrame object, in any format Spark can write.
Simple Spark API
The API takes advantage of Spark’s power: with a couple of lines of code you can import the whylogs utilities, which extend the Dataset API to run the metadata collection and aggregation and output whylogs profiles. You can then choose to store them in AWS S3 or other storage.
Then you can run it in Python:
In some cases, users want to use PySpark directly without going through Scala, so we built a PySpark bridge that lets you run the analysis directly in a Jupyter notebook, either against a backing Spark cluster or with a local Spark instance using a PySpark setup.
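As a rough illustration of the PySpark route, the sketch below uses the experimental PySpark module that ships with whylogs v1 (`whylogs.api.pyspark.experimental`). Treat it as an assumption rather than the exact code from the original post: it presumes whylogs v1 and PySpark are installed, the module is marked experimental and may change between releases, and the function names `profile_spark_dataframe` and `demo` as well as the sample rows are hypothetical.

```python
def profile_spark_dataframe(df):
    """Profile a PySpark DataFrame with whylogs and return a pandas summary.

    Assumes whylogs v1 and PySpark are installed and a Spark session is active.
    """
    # Imported lazily so the sketch can be read and loaded without Spark.
    from whylogs.api.pyspark.experimental import collect_dataset_profile_view

    # One distributed pass over the DataFrame; returns a merged profile view.
    profile_view = collect_dataset_profile_view(input_df=df)
    return profile_view.to_pandas()


def demo():
    """Hypothetical local run, e.g. from a Jupyter notebook."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("whylogs-sketch")
        .getOrCreate()
    )
    df = spark.createDataFrame(
        [(1, 3.5, "a"), (2, 4.0, "b"), (3, None, "a")],
        ["id", "score", "label"],
    )
    print(profile_spark_dataframe(df))
    spark.stop()
```

Calling `demo()` in a notebook with a local Spark instance should print one summary row per column; pointing the session at a cluster instead only changes how the `SparkSession` is built.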
And finally, because of the streaming nature of whylogs, it can also run in accumulator mode if you don’t want to scan your data twice. Imagine you are about to write data to AWS S3 as Parquet: you can put the whylogs accumulator in before the write, whylogs collects the profile as the data passes through, and the API then extracts the profile from the accumulator. Once the data is collected and stored, you can apply various analyses with Jupyter notebooks.
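The single-scan idea can be sketched with an ordinary Python generator. This is not the Spark accumulator API: `CountingProfile`, `profiled`, and `write_csv` are hypothetical helpers, and an in-memory CSV buffer stands in for the Parquet write to S3. The point is only that the profile is filled as a side effect of the write, so the data is read once.

```python
import csv
import io


class CountingProfile:
    """Tiny stand-in for a profile: row count and missing values per column."""

    def __init__(self):
        self.rows = 0
        self.nulls = {}

    def track(self, record):
        self.rows += 1
        for column, value in record.items():
            if value is None or value == "":
                self.nulls[column] = self.nulls.get(column, 0) + 1


def profiled(records, profile):
    """Yield records unchanged while updating the profile as they pass by."""
    for record in records:
        profile.track(record)
        yield record


def write_csv(records, sink, fieldnames):
    """The 'real' job: write every record to the storage sink."""
    writer = csv.DictWriter(sink, fieldnames=fieldnames)
    writer.writeheader()
    for record in records:
        writer.writerow(record)


records = [
    {"user": "a", "amount": 12.5},
    {"user": "b", "amount": None},
    {"user": "", "amount": 7.0},
]

profile = CountingProfile()
sink = io.StringIO()  # stands in for the actual storage target

# The writer pulls records through the profiler: one scan, two outcomes.
write_csv(profiled(iter(records), profile), sink, ["user", "amount"])

print(profile.rows, profile.nulls)
```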
Catch distribution drift in a few lines of code
For example, here is how a user might catch distribution drift: pass the data to whylogs and visualize how the quantile distribution changes over time.
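The notebook itself is not reproduced here, so the following is a stand-in written with the standard library only, not with whylogs. It assumes a simple rule of thumb (flag drift when the median moves by more than a quarter of the baseline interquartile range); the `quartiles` and `drifted` helpers, the threshold, and the daily batches are all hypothetical.

```python
from statistics import quantiles


def quartiles(values):
    """Summarize a batch by its three quartile cut points."""
    q1, q2, q3 = quantiles(values, n=4)
    return {"q1": q1, "median": q2, "q3": q3}


def drifted(baseline, current, tolerance=0.25):
    """Flag drift when the median moves more than `tolerance` baseline IQRs."""
    iqr = baseline["q3"] - baseline["q1"]
    return abs(current["median"] - baseline["median"]) > tolerance * iqr


# Hypothetical daily batches of one numeric feature.
daily_batches = {
    "day 1": list(range(0, 100)),
    "day 2": list(range(2, 102)),    # small, harmless shift
    "day 3": list(range(50, 150)),   # large shift
}

baseline = quartiles(daily_batches["day 1"])
flags = {day: drifted(baseline, quartiles(batch))
         for day, batch in daily_batches.items()}

for day, flag in flags.items():
    print(day, "drift" if flag else "ok")
```

Only the quartile summaries are compared, so the same check works on stored profiles long after the raw batches are gone.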
You can see the example notebook here.
Scalable monitoring at input feature granularity
Because whylogs knows about the data schema in Spark, Python, or Java, we perform type detection. Because each feature’s profile is lightweight, we can store profiles for all features across a dataset, so you don’t have to manually configure which metrics to collect as you would in traditional DevOps monitoring. With its lightweight approach, whylogs collects all the metrics you want, and you can then store them and run more advanced analysis in a Jupyter notebook or with libraries for data drift detection.
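A minimal sketch of configuration-free collection, assuming nothing about how whylogs does it internally: the `infer_type` and `profile_all_features` helpers and the type labels are hypothetical. Every column that appears in the data gets tracked, and the inferred types are counted per feature.

```python
from collections import Counter


def infer_type(value):
    """Very small type detector; bool is checked before int on purpose."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integral"
    if isinstance(value, float):
        return "fractional"
    if isinstance(value, str):
        return "string"
    return "unknown"


def profile_all_features(rows):
    """Count inferred types for every column, with no per-column setup."""
    type_counts = {}
    for row in rows:
        for column, value in row.items():
            type_counts.setdefault(column, Counter())[infer_type(value)] += 1
    return type_counts


rows = [
    {"age": 31, "score": 0.82, "plan": "pro", "active": True},
    {"age": None, "score": 0.40, "plan": "free", "active": False},
    {"age": 27, "score": 1, "plan": "pro", "active": True},
]

for feature, counts in profile_all_features(rows).items():
    print(feature, dict(counts))
```

A mixed count such as `score` being both fractional and integral is itself a useful signal, since it often points to a schema or serialization change upstream.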
Monitoring layer for ML applications
To achieve a true monitoring layer for ML applications, profiling needs to be deployed everywhere. You’ll need a robust system to store and run analysis on that. And finally, you should be able to use the signal as a feedback loop for your pipeline.
In this high-level architecture diagram, you can see what the monitoring layer of an ML application looks like. It starts with ML telemetry data collection: you collect whylogs profiles wherever you run your ML or data pipeline. That can be when you deploy your model with SageMaker or MLflow, when you run batch inference with SageMaker and Spark, when you build the model during offline training and testing, or in a Jupyter notebook. You can also cover more advanced deployment modes such as IoT or sensor data, where memory is much more limited. For example, you can store whylogs profiles on an SD card and ship them back once you have an internet connection, a common use case we’ve seen when deploying whylogs to performance- and network-sensitive devices.
Once collected, you would want to keep the profiles in a storage system; by tagging and partitioning them in the right way, you can run monitoring, alerting, and visualization layers on top.
With alerting and monitoring, you can then set up a trigger that fires when data in production looks different from the data in training, or hook it into a Slack webhook so that your team is alerted to abnormal data behavior in prod and can diagnose it further.
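Such a trigger might look like the sketch below. It is illustrative only: the comparison rule (relative shift of per-feature means), the threshold, the `find_anomalies` and `build_slack_request` helpers, and the webhook URL are all assumptions, and the request is built but deliberately not sent. The only external convention relied on is that Slack incoming webhooks accept a JSON body with a `text` field.

```python
import json
import urllib.request


def find_anomalies(training, production, max_relative_shift=0.2):
    """Compare per-feature means and return human-readable alert messages."""
    alerts = []
    for feature, train_mean in training.items():
        prod_mean = production.get(feature)
        if prod_mean is None:
            alerts.append(f"{feature}: missing in production")
            continue
        shift = abs(prod_mean - train_mean) / max(abs(train_mean), 1e-12)
        if shift > max_relative_shift:
            alerts.append(f"{feature}: mean moved {shift:.0%} from training")
    return alerts


def build_slack_request(webhook_url, alerts):
    """Prepare (but do not send) a POST to a Slack incoming webhook."""
    payload = json.dumps({"text": "Data drift detected:\n" + "\n".join(alerts)})
    return urllib.request.Request(
        webhook_url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


# Hypothetical summary statistics taken from stored profiles.
training = {"age": 40.0, "income": 50_000.0, "tenure": 3.0}
production = {"age": 41.0, "income": 80_000.0}

alerts = find_anomalies(training, production)
if alerts:
    # Placeholder URL; a real webhook URL comes from your Slack workspace.
    request = build_slack_request("https://hooks.slack.com/services/EXAMPLE", alerts)
    # urllib.request.urlopen(request) would deliver the alert.
    print(request.get_method(), request.full_url)
    print("\n".join(alerts))
```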
At WhyLabs, our vision is to make it seamless to drop whylogs anywhere in your pipeline, enabling practitioners to have complete visibility over data quality. With the Apache Spark and whylogs integration, users can achieve large scale data profiling, and easily apply this integration into existing data and ML pipelines.
If you’re interested in trying whylogs or getting involved with our community of AI builders, here are some steps you can take:
- Check out the whylogs GitHub repository (don’t forget to give us a ⭐)
- Try out the Example Notebooks
- Join the Robust & Responsible AI Community Slack workspace