Data Logging with whylogs
May 31, 2022
tl;dr: Users can detect data drift, prevent ML model performance degradation, validate the quality of their data, and more in a single, lightning-fast, easy-to-use package. The v1 release brings a simpler API, new data constraints, new profile visualizations, faster performance, and a usability refresh.
What's data logging?
Logs are an essential part of monitoring and observability for classic software applications. They enable users to track changes in their application over time by capturing key metrics such as uptime, bounce rate, and load time. But what about monitoring and observability for data pipelines and machine learning applications? We built whylogs as the open standard for data logging because the metrics tracked by classic telemetry libraries simply do not cover the breadth of needs for these new types of applications.
With whylogs, users can generate statistical summaries (termed whylogs profiles) from data as it flows through their data pipelines and into their machine learning models. With these statistical summaries, users can track changes in their data over time, picking up on data drift or data quality problems. whylogs profiles can be generated on both tabular and complex data. Plus whylogs runs natively in both Python and JVM environments, and supports both batch processing (e.g. Apache Spark) and streaming (e.g. Apache Kafka).
Getting whylogs installed in Python is as easy as `pip install whylogs`. Then, to log your data and generate a profile, simply run `results = why.log(pandas_df)` in your Python environment (after `import whylogs as why`).
To see how easy it is and try it out for yourself right away, check out the Getting Started notebook; you can even skip setting it up in your own environment by running the example in Google Colab! To read more about whylogs, you can check out our GitHub readme.
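To make the idea of a statistical summary concrete, here is a minimal sketch in plain Python of per-column statistics accumulated as rows stream by. It is not the whylogs implementation: `ColumnSummary` and `profile_rows` are hypothetical names, and the chosen statistics (count, nulls, min, max, mean) are an assumption made for illustration.

```python
from dataclasses import dataclass
import math


@dataclass
class ColumnSummary:
    """Hypothetical running summary for one numeric column (not a whylogs class)."""
    count: int = 0
    nulls: int = 0
    minimum: float = math.inf
    maximum: float = -math.inf
    total: float = 0.0

    def track(self, value):
        # Every value is counted; nulls are tallied but excluded from the stats.
        self.count += 1
        if value is None:
            self.nulls += 1
            return
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)
        self.total += value

    @property
    def mean(self):
        seen = self.count - self.nulls
        return self.total / seen if seen else float("nan")


def profile_rows(rows):
    """Build one ColumnSummary per column from an iterable of dict rows."""
    profile = {}
    for row in rows:
        for name, value in row.items():
            profile.setdefault(name, ColumnSummary()).track(value)
    return profile


rows = [
    {"credit_score": 640, "income": 52000.0},
    {"credit_score": 710, "income": None},
    {"credit_score": 580, "income": 48000.0},
]
profile = profile_rows(rows)
for name, summary in profile.items():
    print(name, summary, "mean =", summary.mean)
```

The summary stays small no matter how many rows pass through it, which is what makes it practical to store one per batch and compare them over time.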
What can I use whylogs for?
After generating whylogs profiles, users are able to:
- Track changes in their dataset
- Generate data constraints to know whether their data looks the way it should
- Quickly visualize key summary statistics about their datasets
These three functionalities enable a variety of use cases for data scientists, data engineers, and machine learning engineers:
- Detect data drift in model input features
- Detect training-serving skew, concept drift, and model performance degradation
- Validate data quality in model inputs or in a data pipeline
- Perform exploratory data analysis of massive datasets
- Track data distributions & data quality for ML experiments
- Enable data auditing and governance across the organization
- Standardize data documentation practices across the organization
- And more
What's new with v1?
With the launch of whylogs v1, we are proud to announce general availability of a host of new features and functionalities. To improve both the usefulness and the usability of the whylogs library, we focused on five key development areas based on top use cases and feedback from users:
- API simplification
- Profile constraints
- Profile visualizer
- Performance improvements
- Usability refresh
API simplification
Generating whylogs profiles has never been easier. It now takes a single line of code: `results = why.log(pandas_df)`.
In the original whylogs v0 API, if you wanted to log a dataframe, you would need to start by initializing a session. Within that session, you would need to create a logger, and then, finally, within that logger call a `log_dataframe()` function. We heard from our users that these concepts were often confusing and slowed them down. The new simplified API enables users to easily create whylogs profiles as artifacts to represent their datasets.
Easy data logging helps data scientists and machine learning engineers ensure data quality, deploy more ML models reliably in production, and follow more best practices.
Profile constraints
Teams that use whylogs save hours of extra work by preventing data bugs before they have an opportunity to propagate throughout the entire data pipeline. With profile constraints, users can define tests for their data and get alerted if data doesn’t look the way they expect it to. This enables data unit testing and data quality validation.
Constraints can be set on features such as the values in the dataset (e.g. I want to get notified if a value in my “credit score” column is less than 300 or more than 850), the data type (e.g. I want to check whether my “name” column has integers in it), the cardinality (e.g. I want to ensure that my “user_id” column has no duplicate values), and more.
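The three kinds of checks described above (value range, data type, cardinality) can be sketched in plain Python as predicates evaluated against a column summary. This is an illustration only and does not use the whylogs constraints API: `summarize`, `check`, and the summary fields are hypothetical, and the distinct count here is exact rather than estimated.

```python
def summarize(values):
    """Hypothetical column summary: numeric range, observed types, distinct count."""
    present = [v for v in values if v is not None]
    numeric = [v for v in present if isinstance(v, (int, float))]
    return {
        "count": len(present),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
        "types": {type(v).__name__ for v in present},
        "distinct": len(set(present)),
    }


# Each constraint is a (description, predicate) pair evaluated on a summary.
constraints = {
    "credit_score": [
        ("values within 300-850", lambda s: s["min"] >= 300 and s["max"] <= 850),
    ],
    "name": [
        ("only strings", lambda s: s["types"] <= {"str"}),
    ],
    "user_id": [
        ("no duplicates", lambda s: s["distinct"] == s["count"]),
    ],
}


def check(data, constraints):
    """Return {(column, description): passed} for every constraint."""
    report = {}
    for column, checks in constraints.items():
        summary = summarize(data[column])
        for description, predicate in checks:
            report[(column, description)] = predicate(summary)
    return report


data = {
    "credit_score": [640, 710, 905],   # 905 is above the allowed maximum
    "name": ["Ada", "Grace", 42],      # 42 is an integer in a string column
    "user_id": [1, 2, 3],              # all unique
}
report = check(data, constraints)
failed = [key for key, passed in report.items() if not passed]
print(failed)
```

Because the predicates read only the summary, the same checks can run on any batch without revisiting the raw rows.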
Better yet, setting up constraints is even easier with the `generate_constraints()` function. By calling this function on a whylogs profile, users can automatically generate a suite of constraints based on the data in that profile, and then check every future profile against that suite of constraints.
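As a rough sketch of the idea of deriving constraints from a reference profile, the plain-Python example below turns the observed range of each column into a bound and then checks a later batch against it. The rule used here (stay within the reference min and max) is an assumption chosen for simplicity; it is not a description of what `generate_constraints()` actually produces, and all function names are hypothetical.

```python
def summarize(values):
    """Hypothetical minimal profile entry for one numeric column."""
    return {"min": min(values), "max": max(values)}


def generate_range_constraints(reference_profile):
    """Derive one (low, high) bound per column from a reference profile."""
    return {
        column: (summary["min"], summary["max"])
        for column, summary in reference_profile.items()
    }


def violations(profile, constraints):
    """List the columns whose observed range escapes the generated bounds."""
    flagged = []
    for column, (low, high) in constraints.items():
        summary = profile[column]
        if summary["min"] < low or summary["max"] > high:
            flagged.append(column)
    return flagged


# Reference profile built from data known to be good.
reference = {
    "credit_score": summarize([580, 640, 710]),
    "age": summarize([23, 41, 67]),
}
constraints = generate_range_constraints(reference)

# A later batch: credit_score now contains an out-of-range value.
new_batch = {
    "credit_score": summarize([600, 905]),
    "age": summarize([30, 55]),
}
flagged = violations(new_batch, constraints)
print(constraints, flagged)
```

In practice a generated suite would likely need looser bounds than the exact reference range, otherwise ordinary variation in new data would trigger alerts.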
To get started with constraints, check out this example notebook. Also, stay tuned for an upcoming blog post dedicated to using profile constraints.
Profile visualizer
In addition to getting automatically notified about potential issues in data, it’s also useful to be able to inspect your data manually. With the profile visualizer, users can generate interactive reports about their profiles (either a single profile or comparing profiles against each other) directly in a Jupyter notebook environment. This enables exploratory data analysis, data drift detection, and data observability.
The profile visualizer lets a user create visualizations such as:
- a summary drift report - an interactive feature-by-feature breakdown of two profiles with matching schemas
- a double histogram - overlaying the distributions of a single feature from two profiles
- feature statistics - a detailed summary of a single feature from a single profile.
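The data behind a double histogram can be sketched in plain Python by binning two samples of the same feature on shared bin edges, and a single drift number can then be read off the two histograms. This is not the visualizer's code: the function names are hypothetical, and total variation distance is used here only as a simple stand-in for whatever drift metric a real report computes.

```python
def shared_histograms(reference, target, bins=5):
    """Bin two samples on the same equal-width edges so they can be overlaid."""
    low = min(min(reference), min(target))
    high = max(max(reference), max(target))
    width = (high - low) / bins or 1.0  # fall back to 1.0 if all values are equal

    def counts(values):
        hist = [0] * bins
        for v in values:
            # The maximum value lands in the last bin rather than past it.
            index = min(int((v - low) / width), bins - 1)
            hist[index] += 1
        return hist

    return counts(reference), counts(target)


def total_variation(hist_a, hist_b):
    """Half the L1 distance between the normalized histograms (0 = identical)."""
    total_a, total_b = sum(hist_a), sum(hist_b)
    return 0.5 * sum(abs(a / total_a - b / total_b) for a, b in zip(hist_a, hist_b))


reference = [1, 2, 2, 3, 3, 3, 4, 4, 5]
target = [3, 4, 4, 5, 5, 5, 6, 6, 7]   # same shape, shifted upward by 2

ref_hist, tgt_hist = shared_histograms(reference, target, bins=3)
drift = total_variation(ref_hist, tgt_hist)
print(ref_hist, tgt_hist, round(drift, 3))
```

Sharing the bin edges is the important design choice: histograms built on separate edges cannot be compared bin by bin.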
To learn more about the profile visualizer, check out this example notebook and stay tuned for an upcoming blog post dedicated to the profile visualizer.
Performance improvements
whylogs v1 is built for scale and optimized for massive data sets. With whylogs v1 users can profile massive amounts of data faster than ever before. In our tests, we saw a more than 500x improvement in the speed of whylogs generating profiles for large datasets. Wow!
One of the major causes for performance improvements with whylogs v1 is a change from row-level operations to columnar operations. Columnar operations allow us to take advantage of vectorization built into the NumPy and pandas packages, which significantly speeds up the process of generating profiles by pushing summarization from slow Python code to lightning-fast C code. Importantly, we can take advantage of vectorization without changes to the end user experience.
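The difference between row-level and columnar summarization can be illustrated with a small plain-Python sketch. It uses the built-in `min`, `max`, and `sum` (whose loops run in C in CPython) as a stand-in for NumPy and pandas vectorization; that substitution, the function names, and the chosen statistics are all assumptions for illustration and do not reflect whylogs internals.

```python
def profile_row_wise(rows):
    """Update running statistics one value at a time in interpreted Python."""
    stats = {}
    for row in rows:
        for name, value in row.items():
            s = stats.setdefault(name, {"min": value, "max": value, "sum": 0})
            if value < s["min"]:
                s["min"] = value
            if value > s["max"]:
                s["max"] = value
            s["sum"] += value
    return stats


def profile_column_wise(columns):
    """Summarize each column with one call per statistic over the whole column."""
    return {
        name: {"min": min(values), "max": max(values), "sum": sum(values)}
        for name, values in columns.items()
    }


# The same data in two layouts: a list of rows and a dict of columns.
rows = [{"a": i, "b": i * 2} for i in range(1000)]
columns = {"a": [r["a"] for r in rows], "b": [r["b"] for r in rows]}

row_stats = profile_row_wise(rows)
column_stats = profile_column_wise(columns)
print(row_stats == column_stats)
```

Both functions return the same summary, which mirrors the point above: the caller sees no difference, while the columnar version does far less work per value in the interpreter.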
By utilizing vectorization and other backend performance fixes, we are able to log one million rows of data per second. Profiling time grows sub-linearly with dataset size, so larger datasets take less time per row than smaller ones, while smaller datasets are still easily profiled in under a second. With these performance improvements, everybody from startups monitoring just one or two models to Fortune 500 enterprises logging billions of records can benefit from whylogs.
Usability refresh
Not all whylogs v1 improvements are changes to the codebase. In addition to making the library more useful (by adding new features) and more usable (by simplifying the API and speeding it up), we ensured that it’s easy for users to get started with the library. To do that, we updated our documentation and examples.
Documentation is at the core of the user experience of an open source library. That’s why we focused on readable, comprehensive, and informative documentation for the entire whylogs project. We retooled our documentation generation to use Sphinx and set up our deployment pipeline so that new merges to whylogs don’t get approved if they don’t have associated documentation.
In addition to reading the docs, we also love seeing examples of how to use our favorite libraries. That’s why we refreshed our example library with examples that both demonstrate particular functionalities of the product, and show how to use whylogs end-to-end for specific use cases. The examples are organized so that it’s easy for users to find and peruse the examples that are most relevant to them.
What does the community say about whylogs v1?
whylogs is a community-driven initiative that’s supported by data professionals, developers, and scientists across the industry. As part of our v1 release, we gathered feedback from our community members and supporters, including from users, partners, and thought leaders in the data and ML space. Here’s what some of them had to say:
The whylogs project is the open standard for data logging, enabling applications spanning from data quality validation to ML model monitoring. whylogs enables everybody who relies on data to run their data and ML applications robustly and responsibly. With the release of whylogs v1, new features and functionality enable our users to get more value with the library than ever before.
If you are interested in trying out whylogs or getting involved with our community of AI builders, here are some steps you can take: