
How to Troubleshoot Embeddings Without Eye-balling t-SNE or UMAP Plots

Production data systems are often centered around structured tabular data - that is, data that can be largely organized into rows and columns of primitive data types. Columns in tabular data may be related to one another, but each column can largely be understood independently. This makes it easy to create and interpret relevant metrics that describe the data in aggregate, such as means, quantiles, and cardinality.

Treating columns independently isn't ideal for categorical encodings, GPS coordinates, personally identifiable information (PII), or more complex data types such as images, text, and audio. For this kind of data, you need to manipulate and aggregate multiple columns at once to make sense of it, so we turn to vectors and embeddings. High-dimensional embeddings are also what power advanced machine learning models such as ChatGPT and Stable Diffusion.

Embeddings are heavily used in machine learning for a variety of data types and tasks as inputs, intermediate products, and outputs. Here are just a few:

Natural language understanding and text analysis

  • Sentiment analysis
  • Document classification
  • Text generation

Computer vision and image processing

  • Manufacturing quality assurance
  • Autonomous driving

Audio processing

  • Text-to-speech models
  • Speaker identification

Tabular machine learning

  • Product recommendation
  • Anonymization and privacy

WhyLabs recently released features that make it even easier to profile and monitor high-dimensional embeddings in both whylogs and the WhyLabs Observability Platform. And it doesn't require you to explore data by hand.

Want to try it out? Sign up for a free WhyLabs account or check out our Jupyter notebook for profiling embeddings in Python.

What do embeddings typically look like?

In short, your data should look like a numerical array of some fixed size (or dimension). For example, [0, 4.5, -1.2, 7.9] is a four-dimensional vector that could be treated as a dense embedding. Embeddings need to be structure-preserving, meaning that relationships and distances between the embeddings should be meaningful. In machine learning, we often choose embeddings in vector space to satisfy our requirements.

Embeddings are either sparse or dense. Sparse embeddings often represent booleans or counts. One-hot encoding, for example, translates a categorical feature into a vector of 0s and 1s. Most ML algorithms are designed for numerical data, so this is a critical step even in tabular settings.
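
As a quick sketch of one-hot encoding (plain Python and numpy, with a made-up categorical feature for illustration):

import numpy as np

# Made-up categorical feature with three possible values
categories = ["red", "green", "blue"]
values = ["blue", "red", "green", "blue"]

# Each value becomes a 0/1 vector with a single 1 at its category's index
index = {cat: i for i, cat in enumerate(categories)}
one_hot = np.zeros((len(values), len(categories)), dtype=int)
for row, value in enumerate(values):
    one_hot[row, index[value]] = 1

print(one_hot)
# [[0 0 1]
#  [1 0 0]
#  [0 1 0]
#  [0 0 1]]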

Another popular sparse embedding, common in text analysis, is the bag of words. Here, the dimensionality of the embedding matches the size of the vocabulary of all words used, and each entry counts the occurrences of one word within the document.
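
A minimal sketch of a bag-of-words encoding, using a toy vocabulary and documents purely for illustration:

from collections import Counter

# Toy vocabulary and documents, purely for illustration
vocabulary = ["the", "cat", "sat", "mat", "dog"]
documents = ["the cat sat on the mat", "the dog sat"]

# Each document becomes a count vector with one entry per vocabulary word
def bag_of_words(doc):
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

print([bag_of_words(doc) for doc in documents])
# [[2, 1, 1, 1, 0], [1, 0, 1, 0, 1]]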

But most commonly discussed today are dense embeddings, which are often trained using statistical or machine learning models themselves. Dense embeddings solve a shortcoming of sparse embeddings, which treat all concepts as completely separate. For example, the words bicycle, banana, and car may be encoded as [1, 0, 0], [0, 1, 0], and [0, 0, 1] respectively, which doesn't capture the closer relationship between the concepts of cars and bicycles as compared to bananas. Dense embeddings do. The same words might be described by the two-dimensional embeddings [4.53, 7.5], [-0.91, 2.3], and [5.17, 7.12]. If we use a distance measure, for example Euclidean distance, we can see that the similarity between bicycle and car is preserved in the dense embedding.
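
To make the comparison concrete, here is a minimal sketch (plain numpy, using the example vectors above) that computes those Euclidean distances:

import numpy as np

# Two-dimensional dense embeddings from the example above
embeddings = {
    "bicycle": np.array([4.53, 7.5]),
    "banana": np.array([-0.91, 2.3]),
    "car": np.array([5.17, 7.12]),
}

# Euclidean distance between a pair of embeddings
def distance(a, b):
    return float(np.linalg.norm(embeddings[a] - embeddings[b]))

print(distance("bicycle", "car"))     # ~0.74, close together
print(distance("bicycle", "banana"))  # ~7.53, far apart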

Exploring embeddings by hand

So how do data scientists typically explore a set of embeddings? With lots of manual effort. Most often, embeddings analysis tools are essentially visualization tools that translate high-dimensional embeddings to 2D or 3D to be plotted. But this still requires data scientists to hover over thousands of data points and manually find trends in the data.

Caption: Two-dimensional t-SNE plot of 784-dimensional MNIST, colored by label.

This visualization looks very cool, but it's difficult to extract insights from it just by eyeballing it. While you may spot a pattern or a change, this is neither a scalable nor a precise way for data scientists and stakeholders to detect issues in your organization's data.
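
For context, a plot like the one captioned above is typically produced with something like scikit-learn's t-SNE. A minimal sketch (using the small 8 x 8 digits dataset bundled with scikit-learn rather than full MNIST, so the details differ from the figure above):

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Project high-dimensional digit images down to 2D purely for visualization
digits = load_digits()  # 8 x 8 digits here; full MNIST would be 28 x 28 = 784 dimensions
projected = TSNE(n_components=2, random_state=0).fit_transform(digits.data)

# Color each point by its label and look for structure by eye
plt.scatter(projected[:, 0], projected[:, 1], c=digits.target, cmap="tab10", s=4)
plt.colorbar()
plt.show()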

Exploring embeddings at scale

At WhyLabs, we’ve developed techniques to explore embeddings at scale without the manual work of exploring individual data points. This is done by comparing each data point to several meaningful reference points within your embeddings vector space.

For example, take the popular MNIST dataset. This data is high-dimensional: 28 x 28 pixels = 784 dimensions for our vectors. While it is common to reduce such high-dimensional data into dense vectors, we work with the full sparse embedding in this post.
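
To make the shape concrete, here is a minimal sketch (plain numpy, with randomly generated stand-ins for the real images) of flattening a batch of 28 x 28 images into 784-dimensional vectors, which is the kind of `data` array the logging example below expects:

import numpy as np

# Hypothetical stand-in for a batch of MNIST images: n images of 28 x 28 grayscale pixels
images = np.random.randint(0, 256, size=(100, 28, 28), dtype=np.uint8)

# Flatten each image into a 784-dimensional vector (28 * 28 = 784)
data = images.reshape(len(images), -1).astype(np.float64)
print(data.shape)  # (100, 784)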

There are several key references that you may want to compare your data to. We'd like a canonical example of each of the ten digits in our dataset for comparison with incoming images. We've packaged functions in the `preprocess` module of whylogs for finding references for both supervised and unsupervised data. But for some applications, you may have specific references of interest, such as popular variations of the digits, non-numeric characters, and more.

Comparing production embeddings to predetermined references allows us to better measure how incoming data shifts relative to the data points that matter most to you. This approach is highly flexible and customizable, capturing distributional information that detects real issues. References give actionable insights into which types of data are most related to issues seen in production, without hovering over individual data points by hand.

Some useful measurements and metrics we’ve found for embeddings include:

  • Cosine similarity across batches of embeddings
  • Minimum, maximum, and distribution across the individual values within the embeddings
  • Distribution of distance to reference embeddings (pictured below)
  • Distribution of the closest reference embeddings
  • Cluster analysis measures of the clusters formed by reference embeddings, e.g., Silhouette score, Calinski-Harabasz index

By utilizing many of these measures, we are able to programmatically measure and diagnose drift. Using only one or two metrics may help you to detect drift, but it is difficult to distinguish between potential causes.
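
As a rough illustration of how a couple of these measures can be computed directly (plain numpy, separate from the whylogs implementation; `batch` and `references` are hypothetical arrays of embeddings):

import numpy as np

# Hypothetical arrays: a batch of production embeddings and a set of reference embeddings
batch = np.random.rand(1000, 784)      # incoming embeddings
references = np.random.rand(10, 784)   # e.g., one canonical embedding per digit

# Pairwise Euclidean distance from every batch embedding to every reference embedding
distances = np.linalg.norm(batch[:, None, :] - references[None, :, :], axis=-1)

# Distribution of distance to the closest reference, and which reference is closest
closest_distance = distances.min(axis=1)
closest_reference = distances.argmin(axis=1)
print(np.percentile(closest_distance, [5, 50, 95]))               # distance distribution summary
print(np.bincount(closest_reference, minlength=len(references)))  # counts per closest reference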

Getting started with embeddings in whylogs

You can get started with this feature in our open source Python library, whylogs. Check out our Jupyter notebook for profiling embeddings in Python for an end-to-end example using MNIST as well as our notebook for submitting data profiles to the WhyLabs Observatory Platform to monitor your production data over time.

Assuming you have some embeddings in a numpy array, `data`, you can log embeddings with the following:

import whylogs as why
from whylogs.core.resolvers import MetricSpec, ResolverSpec
from whylogs.core.schema import DeclarativeSchema
from whylogs.experimental.extras.embedding_metric import (
    DistanceFunction,
    EmbeddingConfig,
    EmbeddingMetric,
)
from whylogs.experimental.preprocess.embeddings.selectors import PCAKMeansSelector

# Automatically choose reference embeddings (and labels for them) via PCA + k-means
references, labels = PCAKMeansSelector(n_components=20).calculate_references(data)

# Configure the embedding metric: which references to compare against and which distance to use
config = EmbeddingConfig(
    references=references,
    labels=labels,
    distance_fn=DistanceFunction.euclidean,
)

# Attach the embedding metric to the "embeddings" column
schema = DeclarativeSchema(
    [ResolverSpec(column_name="embeddings", metrics=[MetricSpec(EmbeddingMetric, config)])]
)

# Profile the embeddings; the resulting profile can be inspected locally or written to WhyLabs
results = why.log(row={"embeddings": data}, schema=schema)

To enable drift detection, send your data profiles (not the raw data) to the WhyLabs Observatory Platform. It's as easy as calling `.write("whylabs")` on your profile after you've signed up and set up your credentials.

import os

# Credentials from your WhyLabs account
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "YOUR ORG_ID HERE"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "YOUR DATASET_ID HERE"
os.environ["WHYLABS_API_KEY"] = "YOUR API_KEY HERE"

# Upload the profile (not the raw data) to WhyLabs
results.write("whylabs")

Caption: Distance to the closest reference is collected for each profiled data point, shown here on the Profiles tab. This shows that there's a drift between our reference and production data.

Feedback

This feature has been released in beta and we’d love feedback on metrics and use cases for your organization. Join our Slack channel or start a GitHub issue to discuss improvements to embeddings with the WhyLabs team.

Get started by creating a free WhyLabs account or contact us to learn more about embeddings.
