
How Observability Uncovers the Effects of ML Technical Debt

One of the most alarming aspects of machine learning is that many teams don’t yet have tools and processes to measure the negative effects of technical debt in their production systems. Many teams test their machine learning models offline but conduct little to no online evaluation after initial deployment. These teams are flying blind, running production systems with no insight into their ongoing performance. Observability is the first step in measuring how this technical debt affects your business. Without it, your customers will be the first to notify you of your system’s faults.

Technical debt and performance degradation can be silent and insidious. Unlike in traditional software, small and valid changes in the data can still cause catastrophic failures for ML models. Changes in the distribution, or relative proportion, of data values can cripple your model without triggering any of the typical DevOps alarms on service and data availability. These failures come on top of the deployment and software issues that plague all production software, and they are harder to diagnose because machine learning code is data-dependent and particularly hard to debug. Visibility into data and ML-specific technical debt requires approaches that are purpose-built for datasets, their distributions and dependencies, as well as the unique characteristics of the machine learning models that rely on them.
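To make "a change in the relative proportion of data values" concrete, here is a minimal sketch that compares two batches of a categorical feature using total variation distance. The feature name, the sample values, and the alert threshold are all assumptions for illustration. This is not how whylogs or WhyLabs compute drift.

```python
from collections import Counter


def proportions(values):
    """Return the relative proportion of each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


def total_variation_distance(baseline, current):
    """Half the summed absolute difference in proportions (0 = identical, 1 = disjoint)."""
    p, q = proportions(baseline), proportions(current)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


# Hypothetical feature: every value is valid in both batches, only the mix changes,
# so availability and null checks would stay green.
baseline_plans = ["free"] * 80 + ["pro"] * 20
current_plans = ["free"] * 40 + ["pro"] * 60

DRIFT_THRESHOLD = 0.2  # assumed value; tune per feature
drift = total_variation_distance(baseline_plans, current_plans)
drift_detected = drift > DRIFT_THRESHOLD
```

Neither batch contains a missing or malformed value, so only a check on the distribution itself would flag the shift.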


With observability and monitoring, you can proactively detect and respond to changes in the model performance before your customers or stakeholders even notice an issue. While the accumulation of hidden technical debt like brittle data pipelines and inadequate data collection practices can cripple ML systems, observability into the dynamics of your data and models allows you to nurse your production systems back to health faster.

AI Observability with whylogs and WhyLabs Observatory

The WhyLabs team has collectively spent decades putting machine learning into production and battling massive amounts of ML technical debt. In that time, we repeatedly needed tools, purpose-built for large datasets, that provide observability into the inner workings of our AI systems. Over the past two years, we have built an observability platform and an open-source data logging library, whylogs. The logging library gathers metrics about a dataset’s distribution, missing values, distinct values, data type schema, and additional information, all of which are stored in a dataset profile. These profiles can then be passed to WhyLabs Observatory to enable monitoring and drift detection that give insight into the growing effects of machine learning technical debt.

Profiling your data using whylogs for Python is as simple as installing with `pip install whylogs` and running the following script:

```python
import pandas as pd
from whylogs import get_or_create_session

df = pd.read_csv("YOUR-DATASET.csv")  # Load a dataset in one of many ways

session = get_or_create_session()  # Get a whylogs session

with session.logger() as ylog:  # Run session in a context manager
    ylog.log_dataframe(df)  # Profile and log your data
```

Uncovering the sources of ML technical debt effects

Some of the most frequent model failures are caused not by natural data drift, but by programmatic data transformations and by changes in how the organization uses data features. These cause undetected surges in missing values, schema changes, and data distribution changes downstream. Monitoring the inputs means monitoring the statistical properties of each feature individually as well as the collective distribution of features.
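As a rough illustration of per-feature monitoring, the sketch below summarizes each feature in a batch by its missing ratio, distinct count, and observed types. The function names and sample records are hypothetical, and the summary is far cruder than a whylogs dataset profile. It also covers individual features only, not their collective distribution.

```python
from collections import Counter


def profile_column(values):
    """Summarize one feature: missing ratio, distinct count, and observed types."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing_ratio": 1 - len(present) / len(values) if values else 0.0,
        "distinct": len(set(present)),
        "types": dict(Counter(type(v).__name__ for v in present)),
    }


def profile_batch(rows):
    """Profile every feature in a batch of dict-shaped records."""
    columns = sorted({name for row in rows for name in row})
    return {name: profile_column([row.get(name) for row in rows]) for name in columns}


# Hypothetical batches: an upstream change stops populating "age".
last_week = [{"age": 31, "plan": "free"}, {"age": 45, "plan": "pro"}]
today = [{"age": None, "plan": "free"}, {"plan": "pro"}]

before, after = profile_batch(last_week), profile_batch(today)
```

Comparing `before["age"]` with `after["age"]` shows the missing ratio jumping from 0.0 to 1.0, while `plan` looks unchanged.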

As an example, imagine your team has deployed a machine learning model that makes predictions using user information entered manually on your website. Zip code is a common but problematic feature because of its complexity and structure. Consider an update that collects a shortened five-digit US zip code for new customer accounts instead of the previously required nine-digit code. Not only would your model see entirely new values for this feature that it was not trained on, but your data pipeline may also interpret the new values as five-digit integers instead of the longer strings. This will cause poor model performance for all new customers, something that is difficult to catch using performance metrics alone. By monitoring the zip code feature for missing values and schema changes, we’re able to catch the problem and save our model as soon as the changes are deployed.
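The zip code scenario can be sketched as a small check that compares live values against the values seen in training. The function name, the sample zip codes, and the 5% tolerance are assumptions made up for this example.

```python
def zip_code_alerts(training_values, live_values, max_unseen_ratio=0.05):
    """Compare live zip codes against training values and list what changed."""
    alerts = []

    # Schema check: did a type appear that training never saw (e.g. int vs. str)?
    training_types = {type(v).__name__ for v in training_values}
    live_types = {type(v).__name__ for v in live_values}
    new_types = live_types - training_types
    if new_types:
        alerts.append(f"schema change: new types {sorted(new_types)}")

    # Vocabulary check: what share of live values was never seen in training?
    known = set(training_values)
    unseen = sum(1 for v in live_values if v not in known)
    unseen_ratio = unseen / len(live_values) if live_values else 0.0
    if unseen_ratio > max_unseen_ratio:  # assumed tolerance
        alerts.append(f"unseen values: {unseen_ratio:.0%} of live records")

    return alerts


# Hypothetical data: nine-digit strings in training, five-digit integers after the update.
training_zips = ["98101-1234", "10001-0001", "60601-4321"]
live_zips = [98101, 10001, "60601-4321"]

alerts = zip_code_alerts(training_zips, live_zips)
```

Here both alerts fire as soon as the first post-update records arrive, well before enough labeled outcomes exist to measure a drop in model performance.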

Negative consequences of data and machine learning technical debt can arise from more than just changes in the input data. Changes in code, model parameters, random seeds, infrastructure, inputs, outputs, or performance metrics can all have deleterious effects on your system. This is why it’s important to use a solution that can be instrumented at multiple points in your team’s data and model pipeline and built specifically for machine learning and large datasets.


While monitoring model inputs and outputs can help to detect the effects of technical debt in real time, full observability of your data and model pipeline is needed to uncover the true root cause of an issue. The high dependency across machine learning pipelines also means that a single source can cause multiple issues. For example, a system that only monitors model outputs or performance may also detect the zip code issue we described above. But observability into the input data before and after feature transformation could help us locate where in the pipeline the problem originates. Full AI observability might reveal the high proportion of unseen zip code values at inference time, the schema change from string to integer, and, importantly, the root cause: a change in character length in the raw input for zip code. Without the true root cause, we might retrain the model to temporarily fix the poor performance, but the technical debt from the website change would persist.
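One way to picture instrumenting several points in the pipeline is to summarize the same feature at each stage and report the earliest stage that differs from a baseline. This is a simplified sketch: the stage names, the sample values, and the `summarize` stand-in for a real profile are all hypothetical.

```python
def summarize(values):
    """Crude stand-in for a profile: observed types and character lengths."""
    return {
        "types": sorted({type(v).__name__ for v in values}),
        "lengths": sorted({len(str(v)) for v in values}),
    }


def first_divergent_stage(baseline_by_stage, current_by_stage):
    """Return the earliest stage whose summary differs from baseline, or None."""
    # Dicts preserve insertion order, so stages are checked in pipeline order.
    for stage, baseline_values in baseline_by_stage.items():
        if summarize(baseline_values) != summarize(current_by_stage[stage]):
            return stage
    return None


# Hypothetical stages, listed in pipeline order.
baseline = {
    "raw_form_input": ["98101-1234", "10001-0001"],
    "after_parsing": ["98101-1234", "10001-0001"],
    "zip_feature_index": [17, 42],
}
current = {
    "raw_form_input": ["98101", "10001"],
    "after_parsing": [98101, 10001],
    "zip_feature_index": [0, 0],
}

origin = first_divergent_stage(baseline, current)
```

All three stages look different from baseline, but the earliest divergence is in the raw input, which points at the website change rather than at the parser or the model.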

Let us know how you uncover the effects of ML technical debt and how the WhyLabs team can help. The whylogs library is available on GitHub for both Python and Java, with added wrappers for Spark. And for observability and monitoring of production systems, check out our always-free, fully self-serve Starter edition of WhyLabs AI Observatory.

“How Observability Uncovers the Effects of ML Technical Debt” was originally published on the DCAI Resource Hub, the premier location for learning how to leverage DCAI best practices for improving ML model performance.
