Show your love for whylogs with a star!
The open standard for data logging
Get started in seconds:
pip install whylogs
Then, run this code:
import whylogs as why
results = why.log(pandas_df)
whylogs lets you:
Track data for ML experiments
Enable data auditing and governance
Detect data drift and resultant ML model performance degradation
Validate data quality
Perform exploratory data analysis
whylogs is an open source library for any kind of data logging.
With whylogs, you are able to generate summaries of your datasets, called whylogs profiles.
“ML engineers need better tools to ensure high-quality data through all stages of an ML project’s lifecycle… [whylogs] makes it easy for developers to maintain real time logs and monitor ML deployments.”
Andrew Ng
Founder and CEO of Landing AI; Founder of DeepLearning.AI
profiles are...
Efficient
whylogs profiles efficiently describe the dataset that they represent. This high fidelity representation of datasets is what enables whylogs profiles to be effective snapshots of the data. They are better at capturing the characteristics of a dataset than a sample would be (as discussed in our Data Logging: Sampling versus Profiling blog post) and are very compact.
Customizable
The statistics that whylogs profiles collect are easily configurable and customizable. This is important because different data types and use cases require different metrics, and whylogs users need to be able to easily define custom trackers for those metrics. It’s the customizability of whylogs that enables our text, image, and other complex data trackers.
Mergeable
One of the most powerful features of whylogs profiles is their mergeability. Mergeability means that whylogs profiles can be combined to form new profiles which represent the aggregate of their constituent profiles. This enables logging for distributed and streaming systems, and allows users to view aggregated data across any time granularity.
whylogs can be run in Python or Apache Spark environments (both PySpark and Scala) on a variety of data types.
We integrate with lots of other tools including Pandas, AWS Sagemaker, MLflow, Flask, Ray, RAPIDS, Apache Kafka, and more.
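As a rough illustration of the mergeability property described above, here is a minimal sketch in plain Python. It does not use whylogs and is not how whylogs implements its profiles; `ToyProfile`, `profile_of`, and the sample values are hypothetical names and data invented for this example. It only shows the idea that two summaries can be combined into the summary of the combined data without revisiting the raw rows.

```python
import math
from dataclasses import dataclass


@dataclass
class ToyProfile:
    """Hypothetical stand-in for a profile of one numeric column."""

    count: int = 0
    total: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def track(self, value: float) -> None:
        # Update the summary with a single data point.
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other: "ToyProfile") -> "ToyProfile":
        # Combine two summaries without touching the raw data again.
        return ToyProfile(
            count=self.count + other.count,
            total=self.total + other.total,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else math.nan


def profile_of(values):
    # Build a summary by tracking values one at a time.
    profile = ToyProfile()
    for value in values:
        profile.track(value)
    return profile


monday = [3.0, 5.0, 4.0]
tuesday = [10.0, 1.0]

merged = profile_of(monday).merge(profile_of(tuesday))
combined = profile_of(monday + tuesday)

# Merging the two daily summaries matches profiling both days at once.
assert merged == combined
print(merged)
```

The same pattern is what makes per-worker or per-hour summaries useful in distributed and streaming settings: each piece is summarized independently and the summaries are merged afterwards.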
Data logging and profiling
whylogs is designed to be extremely flexible. The library can capture profiles from structured and unstructured data such as images, text, audio, bounding boxes, etc. In addition, the library supports custom metrics, log rotation, and tagging. whylogs can be deployed as a container or be invoked directly from various ML tools.
Unlike other open source data quality solutions, whylogs separates the activity of capturing profiles from the activity of acting upon them. This gives users a powerful and extensible foundation for a wide range of MLOps tools and processes.
whylogs outputs statistical profiles, available in the following formats:
- Protobuf - a lightweight and efficient binary format that maps one-to-one with the memory representation of a whylogs object
- JSON - displays the protobuf data in JSON format
- Flat - outputs multiple files with both CSV and JSON content to represent different views of the data, including histograms, upper bound, lower bound, and frequent values
To take advantage of whylogs features, we recommend always enabling the Protobuf format.
Supports batch and streaming data
- Batch mode - whylogs processes a dataset in batches
- Streaming mode - whylogs processes individual data points
How do I generate whylogs profiles?
First, install whylogs:
pip install 'whylogs[whylabs]'
Then, start logging statistical properties of features, model inputs, and model outputs to enable exploratory analysis, data unit testing, and monitoring.
Getting whylogs up and running is easy: follow the integration example shown below, or one of the integration examples provided in the WhyLabs documentation.
whylogs Integration
PYTHON
# First, install whylogs with the whylabs extra
# pip install -q 'whylogs[whylabs]'
import pandas as pd
import os
import whylogs as why
os.environ["WHYLABS_API_KEY"] = "YOUR-API-KEY"
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "YOUR-ORG-ID"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "model-1" # Note: the 'model-id' is provided when setting-up a model in WhyLabs
# Point to your local CSV if you have your own data
df = pd.read_csv("https://whylabs-public.s3.us-west-2.amazonaws.com/datasets/tour/current.csv")
# Run whylogs on current data and upload to the WhyLabs Platform
results = why.log(df)
results.writer("whylabs").write()
Brainstorm ideas and share feedback with the whylogs community members on Slack!
What can I do with my whylogs profiles?
whylogs profiles can be used in a variety of ways. They can be viewed directly with the built-in Python profile viewer or a data visualization framework such as matplotlib or Plotly. They can also be sent to the WhyLabs Platform for monitoring and observability.
The more whylogs profiles you generate for a particular model or dataset, the more value they provide. Here’s a breakdown of what can be done with whylogs profiles, depending on how many you have:
Use case | Single profile | Two profiles | Three or more
---|---|---|---
Data documentation | | |
Exploratory data analysis | | |
Data unit testing | | |
Ad-hoc comparison to baseline | | |
Continuous monitoring | | |
Where do I find other whylogs users and get help?
Join the WhyLabs Community on Slack!
The WhyLabs Community is a forum for you to connect with other practitioners, share ideas, and learn about exciting new techniques.