How Observability Uncovers the Effects of ML Technical Debt

One of the most alarming aspects of machine learning is that many teams don’t yet have the tools and processes to measure the negative effects of technical debt in their production systems. Many teams test their machine learning models offline but conduct little to no online evaluation after initial deployment. These teams are flying blind, running production systems with no insight into their ongoing performance. Observability is the first step in measuring how this technical debt has come to bear on your business. Without it, your customers will be the first to notify you of your system’s faults.

Technical debt and performance degradation can be silent and insidious. Unlike in traditional software, small and valid changes in the data can still cause catastrophic failures for ML models. Changes in the distribution, or relative proportion, of data values can not only cripple your model but do so without triggering any of the typical DevOps alarms on service and data availability. These problems come on top of the deployment and software issues that plague all production software, and they are made worse because machine learning code is data-dependent and particularly hard to debug. Visibility into data and ML-specific technical debt requires approaches that are purpose-built for datasets, their distributions and dependencies, as well as the unique characteristics of the machine learning models that rely on them.
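
To make the point about distribution shifts concrete, here is a minimal sketch of one common drift score, the Population Stability Index (PSI), computed over a categorical feature. It is not taken from whylogs or WhyLabs; the function name `psi`, the hypothetical `device` values, and the smoothing constant `eps` are assumptions chosen for illustration. Every value in the current sample is valid and present, so availability checks would pass, yet the proportions have clearly moved.

```python
import math
from collections import Counter


def psi(baseline, current, eps=1e-6):
    """Population Stability Index between two samples of categorical values."""
    base_counts, curr_counts = Counter(baseline), Counter(current)
    score = 0.0
    for category in set(base_counts) | set(curr_counts):
        # Relative proportion of this category in each sample; eps avoids log(0).
        p = max(base_counts[category] / len(baseline), eps)
        q = max(curr_counts[category] / len(current), eps)
        score += (q - p) * math.log(q / p)
    return score


# Hypothetical "device" feature: same valid values, different proportions.
baseline = ["mobile"] * 50 + ["desktop"] * 50
current = ["mobile"] * 90 + ["desktop"] * 10

drift_score = psi(baseline, current)
print(f"PSI = {drift_score:.3f}")
```

A frequently quoted rule of thumb treats a PSI above roughly 0.2 as a shift worth investigating, though the right threshold depends on the feature and should be tuned on your own data.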

With observability and monitoring, you can proactively detect and respond to changes in model performance before your customers or stakeholders even notice an issue. While the accumulation of hidden technical debt, such as brittle data pipelines and inadequate data collection practices, can cripple ML systems, observability into the dynamics of your data and models allows you to nurse your production systems back to health faster.

AI Observability with whylogs and WhyLabs Observatory

The WhyLabs team has collectively spent decades putting machine learning into production and battling massive amounts of ML technical debt. In that time, we repeatedly needed tools, purpose-built for large datasets, that provide observability into the inner workings of our AI systems. Over the past two years, we have built an observability platform and an open-source data logging library, whylogs. The logging library gathers metrics about a dataset’s distribution, missing values, distinct values, data type schema, and additional information, all of which are stored in a dataset profile. These profiles can then be passed to WhyLabs Observatory to enable powerful monitoring and drift detection that give insight into the growing effects of machine learning technical debt.

Profiling your data using whylogs for Python is as simple as installing with `pip install whylogs` and running the following script:

```python
import pandas as pd
from whylogs import get_or_create_session  # session API from whylogs v0.x

df = pd.read_csv("YOUR-DATASET.csv")  # Load a dataset in one of many ways

session = get_or_create_session()  # Get a whylogs session

with session.logger() as ylog:  # Run the logger in a context manager
    ylog.log_dataframe(df)  # Profile and log your data
```

Uncovering the sources of ML technical debt effects

Some of the most frequent model failures are caused not by natural data drift, but by programmatic data transformations and by changes in how the organization uses data features. These cause undetected surges in missing values, schema changes, and data distribution changes downstream. Monitoring the inputs means monitoring the statistical properties of each feature individually as well as the collective distribution of features.
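
As a rough illustration of per-feature monitoring, the sketch below computes the missing-value rate of each feature in a batch and flags features whose rate has surged relative to a baseline. The helper names `missing_rates` and `flag_surges`, the sample records, and the 0.05 tolerance are all hypothetical; a library such as whylogs tracks this kind of statistic for you.

```python
def missing_rates(rows, features):
    """Fraction of rows where each feature is absent, None, or an empty string."""
    if not rows:
        return {feature: 0.0 for feature in features}
    return {
        feature: sum(1 for row in rows if row.get(feature) in (None, "")) / len(rows)
        for feature in features
    }


def flag_surges(baseline_rates, current_rates, tolerance=0.05):
    """Features whose missing rate rose by more than `tolerance` over baseline."""
    return sorted(
        feature
        for feature, rate in current_rates.items()
        if rate - baseline_rates.get(feature, 0.0) > tolerance
    )


# Hypothetical batches: an upstream transformation starts dropping "country".
features = ["age", "country"]
baseline_batch = [
    {"age": 34, "country": "US"},
    {"age": 51, "country": "DE"},
    {"age": 27, "country": "US"},
    {"age": 45, "country": "FR"},
]
current_batch = [
    {"age": 38, "country": None},
    {"age": 29, "country": "US"},
    {"age": 62},
    {"age": 41, "country": ""},
]

baseline_rates = missing_rates(baseline_batch, features)
current_rates = missing_rates(current_batch, features)
print(flag_surges(baseline_rates, current_rates))
```

Comparing against a stored baseline, rather than a fixed absolute limit, keeps the check meaningful for features that are legitimately sparse.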

As an example, imagine your team has deployed a machine learning model that makes predictions using user information entered manually on your website. Zip code is a common and problematic feature because of its complexity and structure. Consider an update that collects a shortened five-digit US zip code for new customer accounts instead of the previously required nine-digit code. Not only would your model see entirely new values for this feature that it was not trained on, but your data pipeline may also interpret the new values as five-digit integers instead of the longer strings. This will cause poor model performance for all new customers, which is difficult to catch using performance metrics alone. By monitoring the zip code feature for missing values and schema changes, we can catch the problem and save our model as soon as the change is deployed.
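
The zip code scenario can be sketched as a simple check on type, length, and previously seen values. This is an illustrative stand-in rather than how whylogs implements schema tracking; `check_zip_feature`, the sample codes, and the nine-digit string format without a hyphen are assumptions.

```python
def check_zip_feature(values, known_zips, expected_length=9):
    """Count schema and vocabulary violations for a zip code column."""
    report = {"not_string": 0, "wrong_length": 0, "unseen": 0}
    for value in values:
        if not isinstance(value, str):
            report["not_string"] += 1
        text = str(value)
        if len(text) != expected_length:
            report["wrong_length"] += 1
        if text not in known_zips:
            report["unseen"] += 1
    return report


# Hypothetical nine-digit codes seen during training.
training_zips = {"981011234", "100011234", "021391234"}

# After the website update: five-digit codes, some parsed as integers.
# Note that the integer 2139 has lost the leading zero of "02139".
incoming = ["981011234", 98101, 2139]

report = check_zip_feature(incoming, training_zips)
print(report)
```

Any nonzero count here is a signal to investigate before the affected rows reach the model.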

Negative consequences of data and machine learning technical debt can arise from more than just changes in the input data. Changes in code, model parameters, random seeds, infrastructure, inputs, outputs, or performance metrics can all have deleterious effects on your system. This is why it’s important to use a solution that can be instrumented at multiple points in your team’s data and model pipeline and that is built specifically for machine learning and large datasets.

While monitoring model inputs and outputs can help to detect the effects of technical debt in real time, full observability of your data and model pipeline is needed to uncover the true root cause of an issue. The high dependency across machine learning pipelines also means that a single source can cause multiple issues. For example, a system that only monitors model outputs or performance may also detect the zip code issue we described above. But observability into the input data before and after feature transformation could help us to locate where in the pipeline the problem originates. Full AI observability might reveal the high proportion of unseen zip code values at inference time, the schema change from string to integer, and importantly, the root cause: a change in character length in the raw input for zip code. Without the true root cause, we might retrain the model to temporarily fix the poor performance, but the technical debt from the website change would persist.
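
One way to picture root-cause localization is to profile the same feature at several pipeline stages and find the earliest stage that departs from its baseline. The sketch below is a simplified, assumed design: the stage names `raw_input` and `after_parsing`, the helper functions, and the sample values are hypothetical, and a real profile would carry far richer statistics.

```python
def profile(values):
    """Summarize a column by the Python types and string lengths observed."""
    return {
        "types": {type(value).__name__ for value in values},
        "lengths": {len(str(value)) for value in values},
    }


def changed_properties(baseline_values, current_values):
    """Names of profile properties that differ between baseline and current."""
    base, curr = profile(baseline_values), profile(current_values)
    return sorted(name for name in base if base[name] != curr[name])


def diff_by_stage(baseline_stages, current_stages):
    """Map each pipeline stage, in order, to the properties that changed there."""
    return {
        stage: changed_properties(values, current_stages[stage])
        for stage, values in baseline_stages.items()
    }


# Hypothetical zip code column captured at two points in the pipeline.
baseline_stages = {
    "raw_input": ["981011234", "100011234"],
    "after_parsing": ["981011234", "100011234"],
}
current_stages = {
    "raw_input": ["98101", "10001"],  # website now collects five digits
    "after_parsing": [98101, 10001],  # parser infers integers
}

diffs = diff_by_stage(baseline_stages, current_stages)
earliest = next((stage for stage, changed in diffs.items() if changed), None)
print(diffs)
print("earliest divergent stage:", earliest)
```

Here the type change only appears after parsing, while the length change is already visible in the raw input, which points to the website update rather than the parser as the origin.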

Let us know how you uncover the effects of ML technical debt and how the WhyLabs team can help. The whylogs library is available on GitHub for both Python and Java, with added wrappers for Spark. And for observability and monitoring of production systems, check out our always-free, fully self-serve Starter edition of WhyLabs AI Observatory.

“How Observability Uncovers the Effects of ML Technical Debt” was originally published on the DCAI Resource Hub, the premier location for learning how to leverage DCAI best practices for improving ML model performance.
