
How Observability Uncovers the Effects of ML Technical Debt

One of the most alarming aspects of machine learning is that many teams don’t yet have the tools and processes to measure the negative effects of technical debt in their production systems. Many teams test their machine learning models offline but conduct little to no online evaluation after the initial deployment. These teams are flying blind, running production systems with no insight into their ongoing performance. Observability is the first step in measuring how this technical debt has come to bear on your business; without it, your customers are the first to notify you of your system’s faults.

Technical debt and performance degradation can be silent and insidious. Unlike in traditional software, small but valid changes in the data can still cause catastrophic failures for ML models. Changes in the distribution, or relative proportion, of data values can not only cripple your model but do so without triggering any of the typical DevOps alarms on service and data availability. These issues come on top of the deployment and software problems that plague all production software, and they are made worse because machine learning code is data-dependent and particularly hard to debug. Visibility into data and ML-specific technical debt requires approaches that are purpose-built for datasets, their distributions and dependencies, as well as the unique characteristics of the machine learning models that rely on them.

“Visibility into data and ML-specific technical debt requires approaches that are purpose-built for datasets, their distributions and dependencies, as well as the unique characteristics of the machine learning models that rely on them.”

With observability and monitoring, you can proactively detect and respond to changes in model performance before your customers or stakeholders even notice an issue. While the accumulation of hidden technical debt like brittle data pipelines and inadequate data collection practices can cripple ML systems, observability into the dynamics of your data and models allows you to nurse your production systems back to health faster.

AI Observability with whylogs and WhyLabs Observatory

The WhyLabs team has collectively spent decades putting machine learning into production and battling massive amounts of ML technical debt. In that time, we repeatedly needed tools, purpose-built for large datasets, that could provide observability into the inner workings of our AI systems. Over the past two years, we have built an observability platform and an open-source data logging library, whylogs. The logging library gathers metrics about a dataset’s distribution, missing values, distinct values, data type schema, and more, and stores them in a dataset profile. These profiles can then be passed to WhyLabs Observatory to enable powerful monitoring and drift detection, giving insight into the growing effects of machine learning technical debt.

Profiling your data using whylogs for Python is as simple as installing with `pip install whylogs` and running the following script:

import pandas as pd
from whylogs import get_or_create_session

df = pd.read_csv("YOUR-DATASET.csv")  # Load a dataset in one of many ways

session = get_or_create_session()     # Get a whylogs session

with session.logger() as ylog:        # Open a logger in a context manager
    ylog.log_dataframe(df)            # Profile and log your data

Uncovering the sources of ML technical debt effects

Some of the most frequent model failures are caused not by natural data drift but by programmatic data transformations and changes in how the organization uses data features. These cause undetected surges in missing values, schema changes, and shifts in data distributions downstream. Monitoring the inputs means tracking the statistical properties of each feature individually as well as the collective distribution of the features.

As an example, imagine your team has deployed a machine learning model that makes predictions using user information entered manually on your website. Zip code is a common feature, and a problematic one, given its complexity and structure. Consider an update that collects a shortened five-digit US zip code for new customer accounts instead of the previously required nine-digit code. Not only would your model see entirely new values it was never trained on, but your data pipeline may also interpret the new values as five-digit integers instead of the longer strings. This degrades the model’s predictions for all new customers, something that is difficult to catch using performance metrics alone. By monitoring the zip code feature for missing values and schema changes, we can catch the problem and save our model as soon as the change is deployed.
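To make the kind of check a monitor performs concrete, here is a minimal sketch in plain pandas (not the whylogs API) that compares a hypothetical `zip_code` column in a new batch against a baseline captured at training time; the column name, baseline numbers, and alert thresholds are all made up for illustration:

import pandas as pd

# Hypothetical baseline captured at training time for the zip_code feature.
BASELINE = {
    "dtype": "object",            # zip codes were logged as strings
    "null_rate": 0.01,            # roughly 1% missing values in training data
    "pattern": r"^\d{5}-\d{4}$",  # nine-digit ZIP+4 format, e.g. 98101-1234
}

def check_zip_code(batch: pd.DataFrame, column: str = "zip_code") -> list:
    """Return human-readable alerts for the zip_code feature in a new batch."""
    alerts = []
    series = batch[column]

    # Schema change: pandas now infers integers instead of strings.
    if str(series.dtype) != BASELINE["dtype"]:
        alerts.append(f"schema change: {BASELINE['dtype']} -> {series.dtype}")

    # Surge in missing values relative to the training baseline.
    null_rate = series.isna().mean()
    if null_rate > 5 * BASELINE["null_rate"]:
        alerts.append(f"null rate jumped to {null_rate:.1%}")

    # Distribution change: values that no longer match the expected format.
    mismatch_rate = (~series.astype(str).str.match(BASELINE["pattern"])).mean()
    if mismatch_rate > 0.10:
        alerts.append(f"{mismatch_rate:.1%} of values no longer match the expected format")

    return alerts

A profiling tool like whylogs captures these same signals (inferred types, null counts, value distributions) automatically for every feature, so you don’t have to hand-write a check per column.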

Negative consequences of data and machine learning technical debt can arise from more than just changes in the input data. Changes in code, model parameters, random seeds, infrastructure, inputs, outputs, or performance metrics can all have deleterious effects on your system. This is why it’s important to use a solution that can be instrumented at multiple points in your team’s data and model pipeline and that is built specifically for machine learning and large datasets.

“It’s important to use a solution that can be instrumented at multiple points in your team’s data and model pipeline and that is built specifically for machine learning and large datasets.”

While monitoring model inputs and outputs can help detect the effects of technical debt in real time, full observability of your data and model pipeline is needed to uncover the true root cause of an issue. The tight coupling across machine learning pipelines also means that a single source can cause multiple issues. For example, a system that only monitors model outputs or performance may also detect the zip code issue described above, but observability into the input data before and after feature transformation could help us locate where in the pipeline the problem originates. Full AI observability might reveal the high proportion of unseen zip code values at inference time, the schema change from string to integer, and, importantly, the root cause: a change in character length in the raw input for zip code. Without the true root cause, we might retrain the model to temporarily fix the poor performance, but the technical debt from the website change would persist.
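As a rough sketch of what that multi-point instrumentation could look like, the snippet below extends the earlier example to profile the raw website input and the transformed model features under separate dataset names, so their profiles can be compared side by side. The `dataset_name` argument, the stage names, the file name, and the `transform_features` placeholder are assumptions for illustration, not a prescribed setup:

import pandas as pd
from whylogs import get_or_create_session

session = get_or_create_session()

def transform_features(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for your existing feature engineering step.
    return df.copy()

def profile_stage(df: pd.DataFrame, stage: str) -> None:
    """Profile one pipeline stage under its own dataset name."""
    with session.logger(dataset_name=stage) as ylog:
        ylog.log_dataframe(df)

raw_df = pd.read_csv("YOUR-RAW-INPUT.csv")    # raw data as collected from the website
features_df = transform_features(raw_df)      # features after transformation

profile_stage(raw_df, "raw-input")            # would surface the character-length change at the source
profile_stage(features_df, "model-features")  # would surface the string-to-integer schema change

With profiles from both stages, a drift in the raw input that only shows up as a schema change after transformation can be traced back to where it actually entered the pipeline.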

Let us know how you uncover the effects of ML technical debt and how the WhyLabs team can help. The whylogs library is available on GitHub for both Python and Java, with added wrappers for Spark. And for observability and monitoring of production systems, check out our always-free, fully self-serve Starter edition of WhyLabs AI Observatory.

“How Observability Uncovers the Effects of ML Technical Debt” was originally published on the DCAI Resource Hub, the premier location for learning how to leverage DCAI best practices for improving ML model performance.
