
How Observability Uncovers the Effects of ML Technical Debt

One of the most alarming aspects of machine learning is that many teams don’t yet have the tools and processes to measure the negative effects of technical debt in their production systems. Many teams test their machine learning models offline but conduct little to no online evaluation after the initial deployment. These teams are flying blind, running production systems with no insight into their ongoing performance. Observability is the first step in measuring how this technical debt has come to bear on your business; without it, your customers are the first to notify you of your system’s faults.

Technical debt and performance degradation can be silent and insidious. Unlike in traditional software, small but valid changes in the data can still cause catastrophic failures for ML models. Changes in the distribution, or relative proportion, of data values can not only cripple your model but do so without triggering any of the typical DevOps alarms on service and data availability. These issues come on top of the deployment and software problems that plague all production software, and they are made worse because machine learning code is data-dependent and particularly hard to debug. Visibility into data and ML-specific technical debt requires approaches that are purpose-built for datasets, their distributions and dependencies, as well as the unique characteristics of the machine learning models that rely on them.

“Visibility into data and ML-specific technical debt requires approaches that are purpose-built for datasets, their distributions and dependencies, as well as the unique characteristics of the machine learning models that rely on them.”

With observability and monitoring, you can proactively detect and respond to changes in model performance before your customers or stakeholders even notice an issue. While the accumulation of hidden technical debt like brittle data pipelines and inadequate data collection practices can cripple ML systems, observability into the dynamics of your data and models allows you to nurse your production systems back to health faster.

AI Observability with whylogs and WhyLabs Observatory

The WhyLabs team has collectively spent decades putting machine learning into production and battling massive amounts of ML technical debt. In that time, we repeatedly needed tools, purpose-built for large datasets, that could provide observability into the inner workings of our AI systems. Over the past two years, we have built an observability platform and an open-source data logging library, whylogs. The logging library gathers metrics about a dataset’s distribution, missing values, distinct values, data type schema, and more, and stores them in a dataset profile. These profiles can then be passed to WhyLabs Observatory to enable powerful monitoring and drift detection, giving insight into the growing effects of machine learning technical debt.

Profiling your data using whylogs for Python is as simple as installing with `pip install whylogs` and running the following script:

import pandas as pd
from whylogs import get_or_create_session

df = pd.read_csv("YOUR-DATASET.csv")  # Load a dataset in one of many ways

session = get_or_create_session()     # Get a whylogs session

with session.logger() as ylog:        # Open a logger in a context manager
    ylog.log_dataframe(df)            # Profile and log your data

Uncovering the sources of ML technical debt effects

Some of the most frequent model failures are caused not by natural data drift but by programmatic data transformations and changes in how the organization uses data features. These cause undetected surges in missing values, schema changes, and shifts in data distributions downstream. Monitoring the inputs means tracking the statistical properties of each feature individually as well as the collective distribution of the features.

As an example, imagine your team has deployed a machine learning model that makes predictions using user information entered manually on your website. Zip code is a common feature, and a problematic one, given its complexity and structure. Consider an update that collects a shortened five-digit US zip code for new customer accounts instead of the previously required nine-digit code. Not only would your model see entirely new values it was never trained on, but your data pipeline may also interpret the new values as five-digit integers instead of the longer strings. This degrades the model’s predictions for all new customers, something that is difficult to catch using performance metrics alone. By monitoring the zip code feature for missing values and schema changes, we can catch the problem and save our model as soon as the change is deployed.
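To make the kind of check a monitor performs concrete, here is a minimal sketch in plain pandas (not the whylogs API) that compares a hypothetical `zip_code` column in a new batch against a baseline captured at training time; the column name, baseline numbers, and alert thresholds are all made up for illustration:

import pandas as pd

# Hypothetical baseline captured at training time for the zip_code feature.
BASELINE = {
    "dtype": "object",            # zip codes were logged as strings
    "null_rate": 0.01,            # roughly 1% missing values in training data
    "pattern": r"^\d{5}-\d{4}$",  # nine-digit ZIP+4 format, e.g. 98101-1234
}

def check_zip_code(batch: pd.DataFrame, column: str = "zip_code") -> list:
    """Return human-readable alerts for the zip_code feature in a new batch."""
    alerts = []
    series = batch[column]

    # Schema change: pandas now infers integers instead of strings.
    if str(series.dtype) != BASELINE["dtype"]:
        alerts.append(f"schema change: {BASELINE['dtype']} -> {series.dtype}")

    # Surge in missing values relative to the training baseline.
    null_rate = series.isna().mean()
    if null_rate > 5 * BASELINE["null_rate"]:
        alerts.append(f"null rate jumped to {null_rate:.1%}")

    # Distribution change: values that no longer match the expected format.
    mismatch_rate = (~series.astype(str).str.match(BASELINE["pattern"])).mean()
    if mismatch_rate > 0.10:
        alerts.append(f"{mismatch_rate:.1%} of values no longer match the expected format")

    return alerts

A profiling tool like whylogs captures these same signals (inferred types, null counts, value distributions) automatically for every feature, so you don’t have to hand-write a check per column.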

Negative consequences of data and machine learning technical debt can arise from more than just changes in the input data. Changes in code, model parameters, random seeds, infrastructure, inputs, outputs, or performance metrics can all have deleterious effects on your system. This is why it’s important to use a solution that can be instrumented at multiple points in your team’s data and model pipeline and that is built specifically for machine learning and large datasets.

“It’s important to use a solution that can be instrumented at multiple points in your team’s data and model pipeline and that is built specifically for machine learning and large datasets.”

While monitoring model inputs and outputs can help detect the effects of technical debt in real time, full observability of your data and model pipeline is needed to uncover the true root cause of an issue. The tight coupling across machine learning pipelines also means that a single source can cause multiple issues. For example, a system that only monitors model outputs or performance may also detect the zip code issue described above, but observability into the input data before and after feature transformation could help us locate where in the pipeline the problem originates. Full AI observability might reveal the high proportion of unseen zip code values at inference time, the schema change from string to integer, and, importantly, the root cause: a change in character length in the raw input for zip code. Without the true root cause, we might retrain the model to temporarily fix the poor performance, but the technical debt from the website change would persist.
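As a rough sketch of what that multi-point instrumentation could look like, the snippet below extends the earlier example to profile the raw website input and the transformed model features under separate dataset names, so their profiles can be compared side by side. The `dataset_name` argument, the stage names, the file name, and the `transform_features` placeholder are assumptions for illustration, not a prescribed setup:

import pandas as pd
from whylogs import get_or_create_session

session = get_or_create_session()

def transform_features(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for your existing feature engineering step.
    return df.copy()

def profile_stage(df: pd.DataFrame, stage: str) -> None:
    """Profile one pipeline stage under its own dataset name."""
    with session.logger(dataset_name=stage) as ylog:
        ylog.log_dataframe(df)

raw_df = pd.read_csv("YOUR-RAW-INPUT.csv")    # raw data as collected from the website
features_df = transform_features(raw_df)      # features after transformation

profile_stage(raw_df, "raw-input")            # would surface the character-length change at the source
profile_stage(features_df, "model-features")  # would surface the string-to-integer schema change

With profiles from both stages, a drift in the raw input that only shows up as a schema change after transformation can be traced back to where it actually entered the pipeline.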

Let us know how you uncover the effects of ML technical debt and how the WhyLabs team can help. The whylogs library is available on GitHub for both Python and Java, with added wrappers for Spark. And for observability and monitoring of production systems, check out our always-free, fully self-serve Starter edition of WhyLabs AI Observatory.

“How Observability Uncovers the Effects of ML Technical Debt” was originally published on the DCAI Resource Hub, the premier location for learning how to leverage DCAI best practices for improving ML model performance.
