
Data Logging With whylogs

tl;dr: Users can detect data drift, prevent ML model performance degradation, validate the quality of their data, and more in a single, lightning-fast, easy-to-use package. The v1 release brings a simpler API, new data constraints, new profile visualizations, faster performance, and a usability refresh.

What's data logging?

Logs are an essential part of monitoring and observability for classic software applications. They enable users to track changes in their application over time by capturing key metrics such as uptime, bounce rate, and load time. But what about monitoring and observability for data pipelines and machine learning applications? We built whylogs as the open standard for data logging because the metrics tracked by classic telemetry libraries simply do not cover the breadth of needs for these new types of applications.

Andrew Ng says: "...[whylogs] makes it easy for developers to maintain real time logs and monitor ML deployments"

What's whylogs?

With whylogs, users can generate statistical summaries (termed whylogs profiles) from data as it flows through their data pipelines and into their machine learning models. With these statistical summaries, users can track changes in their data over time, picking up on data drift or data quality problems. whylogs profiles can be generated on both tabular and complex data. Plus whylogs runs natively in both Python and JVM environments, and supports both batch processing (e.g. Apache Spark) and streaming (e.g. Apache Kafka).

whylogs offers a host of integrations

Getting whylogs installed in Python is as easy as `pip install whylogs`. Then, to log your data and generate a profile, simply run `results = why.log(pandas_df)` in your Python environment.

To see how easy it is and try it out for yourself right away, check out the Getting Started notebook; you can even skip setting it up in your own environment by running the example in Google Colab! To read more about whylogs, you can check out our GitHub readme.

What can I use whylogs for?

After generating whylogs profiles, users are able to:

  1. Track changes in their dataset
  2. Generate data constraints to know whether their data looks the way it should
  3. Quickly visualize key summary statistics about their datasets

These three functionalities enable a variety of use cases for data scientists, data engineers, and machine learning engineers:

  • Detect data drift in model input features
  • Detect training-serving skew, concept drift, and model performance degradation
  • Validate data quality in model inputs or in a data pipeline
  • Perform exploratory data analysis of massive datasets
  • Track data distributions & data quality for ML experiments
  • Enable data auditing and governance across the organization
  • Standardize data documentation practices across the organization
  • And more
Yahoo! Japan and Stitch Fix are two prominent whylogs users

What's new with v1?

With the launch of whylogs v1, we are proud to announce the general availability of a host of new features and functionality. We focused on improving both the usefulness and the usability of the whylogs library, concentrating on five key development areas based on top use cases and feedback from users:

  1. API simplification
  2. Profile constraints
  3. Profile visualizer
  4. Performance improvements
  5. Usability refresh

API Simplification

Generating whylogs profiles has never been easier. It now takes a single line of code: results = why.log(pandas_df).

In the original whylogs v0 API, logging a dataframe required initializing a session, creating a logger within that session, and finally calling a `log_dataframe()` function on that logger. We heard from our users that these concepts were often confusing and slowed them down. The new simplified API lets users easily create whylogs profiles as artifacts that represent their datasets.

The ability to log data easily helps data scientists and machine learning engineers ensure data quality, deploy more ML models reliably in production, and follow best practices more consistently.

Profile Constraints

A constraint report

Teams that use whylogs save hours of extra work by preventing data bugs before they have an opportunity to propagate throughout the entire data pipeline. With profile constraints, users can define tests for their data and get alerted if data doesn’t look the way they expect it to. This enables data unit testing and data quality validation.

Constraints can be set on properties such as the values in the dataset (e.g. I want to get notified if a value in my “credit score” column is less than 300 or more than 850), the data type (e.g. I want to check whether my “name” column has integers in it), the cardinality (e.g. I want to ensure that my “user_id” column has no duplicate values), and more.

Better yet, setting up constraints is even easier with the `generate_constraints()` function. By calling this function on a whylogs profile, users can automatically generate a suite of constraints based on the data in that profile, and then check every future profile against that suite of constraints.

To get started with constraints, check out this example notebook. Also, stay tuned for an upcoming blog post dedicated to using profile constraints.

Profile Visualizer

A screenshot of the drift report visualization

In addition to getting automatically notified about potential issues in data, it’s also useful to be able to inspect your data manually. With the profile visualizer, users can generate interactive reports about their profiles (either a single profile or comparing profiles against each other) directly in a Jupyter notebook environment. This enables exploratory data analysis, data drift detection, and data observability.

The profile visualizer lets a user create visualizations such as:

  • a summary drift report - an interactive feature-by-feature breakdown of two profiles with matching schemas
  • a double histogram - overlaying the distributions of a single feature from two profiles
  • feature statistics - a detailed summary of a single feature from a single profile

To learn more about the profile visualizer, check out this example notebook and stay tuned for an upcoming blog post dedicated to the profile visualizer.

Performance Improvements

whylogs v1 is built for scale and optimized for massive datasets, letting users profile more data faster than ever before. In our tests, we saw a more than 500x improvement in the speed of generating profiles for large datasets. Wow!

Benchmarks show the performance improvements of whylogs v1

One of the major causes for performance improvements with whylogs v1 is a change from row-level operations to columnar operations. Columnar operations allow us to take advantage of vectorization built into the NumPy and pandas packages, which significantly speeds up the process of generating profiles by pushing summarization from slow Python code to lightning-fast C code. Importantly, we can take advantage of vectorization without changes to the end user experience.
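The effect is easy to reproduce with plain NumPy (an illustrative benchmark of row-level vs. columnar summarization, not whylogs' actual internals):

```python
import time
import numpy as np

data = np.random.default_rng(0).normal(size=1_000_000)

# row-level style: visit every value in interpreted Python code
start = time.perf_counter()
total = 0.0
for value in data:
    total += value
loop_mean = total / data.size
loop_seconds = time.perf_counter() - start

# columnar style: one vectorized call, the loop runs in compiled C
start = time.perf_counter()
vector_mean = data.mean()
vector_seconds = time.perf_counter() - start

print(f"python loop: {loop_seconds:.3f}s, vectorized: {vector_seconds:.4f}s")
```

Both approaches compute the same mean, but the vectorized call is typically orders of magnitude faster, which is exactly the headroom whylogs gains by summarizing whole columns at once.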

By utilizing vectorization and other performance-improving fixes on the backend, we are able to log one million rows of data per second. Profiling time grows sub-linearly with dataset size, so profiling larger datasets takes less time per row than profiling smaller ones, while smaller datasets are still easily profiled in under a second. With these performance improvements, everybody from startups monitoring just one or two models to Fortune 500 enterprises logging billions of records can benefit from whylogs.

Usability Refresh

Not all whylogs v1 improvements are changes to the codebase. In addition to making the library more useful (by adding new features) and more usable (by simplifying the API and speeding it up), we ensured that it’s easy for users to get started with the library. To do that, we updated our documentation and examples.

New documentation!

Documentation is at the core of the user experience of an open source library. That’s why we focused on readable, comprehensive, and informative documentation for the entire whylogs project. We retooled our documentation generation to use Sphinx and set up our deployment pipeline so that changes to whylogs aren’t approved without associated documentation.

New examples!

In addition to reading the docs, we also love seeing examples of how to use our favorite libraries. That’s why we refreshed our example library with examples that both demonstrate particular functionalities of the product and show how to use whylogs end-to-end for specific use cases. The examples are organized so that it’s easy for users to find and peruse the ones most relevant to them.

What does the community say about whylogs v1?

whylogs is a community-driven initiative that’s supported by data professionals, developers, and scientists across the industry. As part of our v1 release, we gathered feedback from our community members and supporters, including from users, partners, and thought leaders in the data and ML space. Here’s what some of them had to say:

Quotes from the community about whylogs

Conclusion

The whylogs project is the open standard for data logging, enabling applications spanning from data quality validation to ML model monitoring. whylogs enables everybody who relies on data to run their data and ML applications robustly and responsibly. With the release of whylogs v1, new features and functionality enable our users to get more value with the library than ever before.

If you are interested in trying out whylogs or getting involved with our community of AI builders, check out the Getting Started notebook and the GitHub readme mentioned above.
