blog bg left
Back to Blog

Data Logging with whylogs

tl;dr: Users can detect data drift, prevent ML model performance degradation, validate the quality of their data, and more in a single, lightning-fast, easy-to-use package. The v1 release brings a simpler API, new data constraints, new profile visualizations, faster performance, and a usability refresh.

What's data logging?

Logs are an essential part of monitoring and observability for classic software applications. They enable users to track changes in their application over time by capturing key metrics such as uptime, bounce rate, and load time. But what about monitoring and observability for data pipelines and machine learning applications? We built whylogs as the open standard for data logging because the metrics tracked by classic telemetry libraries simply do not cover the breadth of needs for these new types of applications.

Andrew Ng says: "...[whylogs] makes it easy for developers to maintain real time logs and monitor ML deployments"

What's whylogs?

With whylogs, users can generate statistical summaries (termed whylogs profiles) from data as it flows through their data pipelines and into their machine learning models. With these statistical summaries, users can track changes in their data over time, picking up on data drift or data quality problems. whylogs profiles can be generated on both tabular and complex data. Plus whylogs runs natively in both Python and JVM environments, and supports both batch processing (e.g. Apache Spark) and streaming (e.g. Apache Kafka).

whylogs offers a host of integrations

Getting whylogs installed in Python is as easy as pip install whylogs . Then, to log your data and generate a profile, simply run results = why.log(pandas_df) in your Python environment.

To see how easy it is and try it out for yourself right away, check out the Getting Started notebook; you can even skip setting it up in your own environment by running the example in Google Colab! To read more about whylogs, you can check out our GitHub readme.

What can I use whylogs for?

After generating whylogs profiles, users are able to:

  1. Track changes in their dataset
  2. Generate data constraints to know whether their data looks the way it should
  3. Quickly visualize key summary statistics about their datasets

These three functionalities enable a variety of use cases for data scientists, data engineers, and machine learning engineers:

  • Detect data drift in model input features
  • Detect training-serving skew, concept drift, and model performance degradation
  • Validate data quality in model inputs or in a data pipeline
  • Perform exploratory data analysis of massive datasets
  • Track data distributions & data quality for ML experiments
  • Enable data auditing and governance across the organization
  • Standardize data documentation practices across the organization
  • And more
Yahoo! Japan and Stitch Fix are two prominent whylogs users

What's new with v1?

With the launch of whylogs v1, we are proud to announce general availability of a host of new features and functionalities. We focused our efforts on improving both the usefulness and the usability of the whylogs library and our work has primarily been focused on five key development areas based on top use cases and feedback from users:

  1. API simplification
  2. Profile constraints
  3. Profile visualizer
  4. Performance improvements
  5. Usability refresh

API Simplification

Generating whylogs profiles has never been easier. It now takes a single line of code: results = why.log(pandas_df).

In the original whylogs v0 API, if you wanted to log a dataframe, you would need to start by initializing a session. Within that session, you would need to create a logger, and then, finally, within that logger call a log_dataframe() function. We heard from our users that these concepts were often confusing and slowed them down. The new simplified API enables users to  easily create whylogs profiles as artifacts to represent their datasets.

The ability to log data easily helps data scientists and machine learning engineers ensure data quality, with more ML models reliably deployed in production, and more best practices followed.

Profile Constraints

A constraint report

Teams that use whylogs save hours of extra work by preventing data bugs before they have an opportunity to propagate throughout the entire data pipeline. With profile constraints, users can define tests for their data and get alerted if data doesn’t look the way they expect it to. This enables data unit testing and data quality validation.

Constraints can be set on features such as the values in the dataset (e.g. I want to get notified if a value in my “credit score” column is less than 300 or more than 850), the data type (e.g. I want to check whether my “name” column has integers in it), the cardinality (e.g. I want to ensure that my “user_id” column has no duplicate values), and more.

Better yet, setting up constraints is even easier with the `generate_constraints()` function. By calling this function on a whylogs profile, users can automatically generate a suite of constraints based on the data in that profile, and then check every future profile against that suite of constraints.

To get started with constraints, check out this example notebook. Also, stay tuned for an upcoming blog post dedicated to using profile constraints.

Profile Visualizer

An screenshot of the drift report visualization

In addition to getting automatically notified about potential issues in data, it’s also useful to be able to inspect your data manually. With the profile visualizer, users can generate interactive reports about their profiles (either a single profile or comparing profiles against each other) directly in a Jupyter notebook environment. This enables exploratory data analysis, data drift detection, and data observability.

The profile visualizer lets a user create visualizations such as:

  • a summary drift report - an interactive feature-by-feature breakdown of two profiles with matching schemas
  • a double histogram - overlaying the distributions of a single feature from two profiles
  • feature statistics - a detailed summary of a single feature from a single profile.

To learn more about the profile visualizer, check out this example notebook and stay tuned for an upcoming blog post dedicated to the profile visualizer.

Performance Improvements

whylogs v1 is built for scale and optimized for massive data sets. With whylogs v1 users can profile massive amounts of data faster than ever before. In our tests, we saw a more than 500x improvement in the speed of whylogs generating profiles for large datasets. Wow!

Benchmarks show the performance improvements of whylogs v1

One of the major causes for performance improvements with whylogs v1 is a change from row-level operations to columnar operations. Columnar operations allow us to take advantage of vectorization built into the NumPy and pandas packages, which significantly speeds up the process of generating profiles by pushing summarization from slow Python code to lightning-fast C code. Importantly, we can take advantage of vectorization without changes to the end user experience.

By utilizing vectorization and other performance improving fixes on the backend, we are able to log one million rows of data per second. This number grows sub-linearly, so profiling larger datasets takes less time per row than profiling smaller ones, while smaller datasets are still easily profiled in under a second. With these performance improvements, everybody, from startups monitoring just one or two models to Fortune 500 enterprises logging billions of records, can all benefit from whylogs.

Usability Refresh

Not all whylogs v1 improvements are changes to the codebase. In addition to making the library more useful (by adding new features) and more usable (by simplifying the API and speeding it up), we ensured that it’s easy for users to get started with the library. To do that, we updated our documentation and examples.

New documentation!

Documentation is at the core of the user experience of an open source library. That’s why we focused on  readable, comprehensive, and informative documentation for the entire whylogs project. We retooled our documentation generation to use Sphinx and set up our deployment pipeline so that new merges to whylogs don’t get approved if they don’t have associated documentation.

New examples!

In addition to reading the docs, we also love seeing examples of how to use our favorite libraries. That’s why we refreshed our example library with examples that both demonstrate particular functionalities of the product, and show how to use whylogs end-to-end for specific use cases. The examples are organized so that it’s easy for users to find and peruse the examples that are most relevant to them.

What does the community say about whylogs v1?

whylogs is a community-driven initiative that’s supported by data professionals, developers, and scientists across the industry. As part of our v1 release, we gathered feedback from our community members and supporters, including from users, partners, and thought leaders in the data and ML space. Here’s what some of them had to say:

Quotes from the community about whylogs

Conclusion

The whylogs project is the open standard for data logging, enabling applications spanning from data quality validation to ML model monitoring. whylogs enables everybody who relies on data to run their data and ML applications robustly and responsibly. With the release of whylogs v1, new features and functionality enable our users to get more value with the library than ever before.

If you are interested in trying out whylogs or getting involved with our community of AI builders, here are some steps you can take:

Other posts

Re-imagine Data Monitoring with whylogs and Apache Spark

An overview of how the whylogs integration with Apache Spark achieves large scale data profiling, and how users can apply this integration into existing data and ML pipelines.

ML Monitoring in Under 5 Minutes

A quick guide to using whylogs and WhyLabs to monitor common issues with your ML models to surface data drift, concept drift, data quality, and performance issues.

AIShield and WhyLabs: Threat Detection and Monitoring for AI

The seamless integration of AIShield’s security insights on WhyLabs AI observability platform delivers comprehensive insights into ML workloads and brings security hardening to AI-powered enterprises.

Large Scale Data Profiling with whylogs and Fugue on Spark, Ray or Dask

Profiling large-scale data for use cases such as anomaly detection, drift detection, and data validation with Fugue on Spark, Ray or Dask.

Monitoring Image Data with whylogs v1

When operating computer vision systems, data quality and data drift issues always pose the risk of model performance degradation. Whylabs provides a simple yet highly customizable solution for maintaining observability into data to detect issues and take action sooner.

WhyLabs Private Beta: Real-time, No-code, Cloud Storage Data Profiling

We’re excited to announce our Private Beta release for a no-code integration option for WhyLabs, allowing users to bypass the need to integrate whylogs into their data pipeline.

Data and ML Monitoring is Easier with whylogs v1.1

The release of whylogs v1.1 brings many features to the whylogs data logging API, making it even easier to monitor your data and ML models!

Model Monitoring for Financial Fraud Classification

Model monitoring is helping the financial services industry avoid huge losses caused by performance degradation in their fraud transaction models.

Robust & Responsible AI Newsletter - Issue #3

Every quarter we send out a roundup of the hottest MLOps and Data-Centric AI news including industry highlights, what’s brewing at WhyLabs, and more.
pre footer decoration
pre footer decoration
pre footer decoration

Run AI With Certainty

Book a demo
loading...