
How to Troubleshoot Embeddings Without Eye-balling t-SNE or UMAP Plots

Production data systems are often centered around structured tabular data - that is, data that can be largely organized into rows and columns of primitive data types. Columns in tabular data may be related to one another, but each column can largely be understood independently. This makes it easy to create and interpret relevant metrics that describe the data in aggregate, such as means, quantiles, and cardinality.

Tabular formats aren’t ideal for categorical encodings, GPS coordinates, personally identifiable information (PII), or more complex data types such as images, text, and audio. To make sense of this data, you need to manipulate and aggregate multiple columns at the same time, so we turn to vectors and embeddings. High-dimensional embeddings are also what facilitate advanced machine learning models such as ChatGPT and Stable Diffusion.

Embeddings are heavily used in machine learning for a variety of data types and tasks as inputs, intermediate products, and outputs. Here are just a few:

Natural language understanding and text analysis

  • Sentiment analysis
  • Document classification
  • Text generation

Computer vision and image processing

  • Manufacturing quality assurance
  • Autonomous driving

Audio processing

  • Text-to-speech models
  • Speaker identification

Tabular machine learning

  • Product recommendation
  • Anonymization and privacy

WhyLabs recently released features that make it even easier to profile and monitor high-dimensional embeddings in both whylogs and the WhyLabs Observability Platform. And it doesn’t require you to explore data by hand.

Want to try it out? Sign up for a free WhyLabs account or check out our Jupyter notebook for profiling embeddings in Python.

What do embeddings typically look like?

In short, your data should look like a numerical array of some fixed size (or dimension). For example, [0, 4.5, -1.2, 7.9] is a four-dimensional vector that could be treated as a dense embedding. Embeddings need to be structure-preserving, meaning that relationships and distances between the embeddings should be meaningful. In machine learning, we often choose embeddings in vector space to satisfy our requirements.

Embeddings are either sparse or dense. Sparse embeddings often represent booleans or counts. One hot encoding is the practice of translating a categorical feature into a vector of 0s and 1s. Most ML algorithms are designed for numerical data, so this is a critical step even in tabular settings.
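As a concrete illustration, one-hot encoding can be sketched in a few lines of plain Python (the category list here is a made-up example):

```python
def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the matching category's index."""
    return [1 if c == value else 0 for c in categories]

categories = ["red", "green", "blue"]
print(one_hot("green", categories))  # [0, 1, 0]
```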

Another popular sparse embedding is common in text analysis, referred to as bag of words. Here, the dimensionality of the embedding matches the size of the vocabulary of all words used and the entries represent counts of each word within the document.
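A bag-of-words embedding can be sketched the same way; the vocabulary and document below are hypothetical stand-ins:

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word appears in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["the", "cat", "sat", "mat"]
print(bag_of_words("The cat sat on the mat", vocabulary))  # [2, 1, 1, 1]
```

In practice the vocabulary spans thousands of words, so most entries are zero, which is what makes the embedding sparse.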

But most commonly discussed today are dense embeddings, which are often trained using statistical or machine learning models themselves. Dense embeddings solve a problem with sparse embeddings, which treat all concepts as completely separate. For example, the words bicycle, banana, and car may be encoded as [1, 0, 0], [0, 1, 0], and [0, 0, 1] respectively, which doesn’t capture that the concepts of cars and bicycles are closer to each other than to bananas. Dense embeddings do. The same words might be described by the two-dimensional embeddings [4.53, 7.5], [-0.91, 2.3], and [5.17, 7.12]. Using a distance measure such as Euclidean distance, we can see that the similarity between bicycle and car is preserved in the dense embeddings.
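We can verify this with the example numbers above, using plain-Python Euclidean distance:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

bicycle, banana, car = [4.53, 7.5], [-0.91, 2.3], [5.17, 7.12]
print(round(euclidean(bicycle, car), 2))     # 0.74
print(round(euclidean(bicycle, banana), 2))  # 7.53
```

Bicycle and car sit far closer to each other than either does to banana, exactly the relationship the one-hot encoding fails to express.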

Exploring embeddings by hand

So how do data scientists typically explore a set of embeddings? With lots of manual effort. Most often, embeddings analysis tools are essentially visualization tools that translate high dimensional embeddings to 2D or 3D to be plotted. But this still requires data scientists to hover over thousands of data points and manually find trends in the data.

Caption: Two dimensional t-SNE plot for 784-dimensional MNIST colored by label.

This visualization looks very cool, but it’s difficult to extract insights from it just by eyeballing. While you may spot a pattern or a change, this is neither a scalable nor a precise way for data scientists and stakeholders to detect issues in your organization’s data.

Exploring embeddings at scale

At WhyLabs, we’ve developed techniques to explore embeddings at scale without the manual work of exploring individual data points. This is done by comparing each data point to several meaningful reference points within your embeddings vector space.

For example, take the popular MNIST dataset. This data is high-dimensional: 28 x 28 pixels = 784 dimensions for our vectors. While it is common to reduce such high-dimensional data into dense vectors, we work with the full sparse embedding in this post.
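Flattening such an image into a 784-dimensional vector can be sketched with plain Python (the image below is a hypothetical all-black 28 x 28 frame with one bright pixel):

```python
# Hypothetical 28x28 grayscale image as a nested list of pixel intensities.
image = [[0] * 28 for _ in range(28)]
image[14][14] = 255  # a single bright pixel near the center

# Flatten row by row into a single 784-dimensional vector.
vector = [pixel for row in image for pixel in row]
print(len(vector))  # 784
```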

There are several key references that you may want to compare your data to. We’d like some canonical example for each of the ten digits in our dataset for comparison with incoming images. We’ve packaged functions in our `preprocess` module of whylogs for finding references for both supervised and unsupervised data. But for some applications, you may have specific references of interest, such as popular variations of the digits, non-numeric characters, and more.

Comparing production embeddings data to predetermined references allows us to better measure how incoming data shifts relative to the data points that matter most to you. The distributional information captured by this approach is highly flexible and customizable, letting it detect real issues. References give actionable insights on which types of data are most related to issues seen in production without hovering over individual data points by hand.

Some useful measurements and metrics we’ve found for embeddings include:

  • Cosine similarity across batches of embeddings
  • Minimum, maximum, and distribution across the individual values within the embeddings
  • Distribution of distance to reference embeddings (pictured below)
  • Distribution of the closest reference embeddings
  • Cluster analysis measures of the clusters formed by reference embeddings, e.g., Silhouette score, Calinski-Harabasz index

By utilizing many of these measures, we are able to programmatically measure and diagnose drift. Using only one or two metrics may help you to detect drift, but it is difficult to distinguish between potential causes.
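As a rough sketch of how two of these measures work, here is plain-Python cosine similarity and closest-reference assignment; the two-dimensional reference embeddings below are made-up stand-ins for the per-digit references discussed above:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def closest_reference(embedding, references):
    """Return the label of the reference most similar to the embedding."""
    return max(references, key=lambda label: cosine_similarity(embedding, references[label]))

# Hypothetical reference embeddings for two digit classes.
references = {"0": [1.0, 0.1], "1": [0.1, 1.0]}
print(closest_reference([0.9, 0.2], references))  # prints 0 (the closest reference label)
```

Tracking the distribution of these closest-reference assignments over batches is one way shifts in production data become visible without plotting anything.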

Getting started with embeddings in whylogs

You can get started with this feature in our open source Python library, whylogs. Check out our Jupyter notebook for profiling embeddings in Python for an end-to-end example using MNIST as well as our notebook for submitting data profiles to the WhyLabs Observatory Platform to monitor your production data over time.

Assuming you have some embeddings in a numpy array, `data`, you can log embeddings with the following:

import whylogs as why
from whylogs.core.resolvers import MetricSpec, ResolverSpec
from whylogs.core.schema import DeclarativeSchema
from whylogs.experimental.extras.embedding_metric import (
    DistanceFunction,
    EmbeddingConfig,
    EmbeddingMetric,
)
from whylogs.experimental.preprocess.embeddings.selectors import PCAKMeansSelector

# Choose reference embeddings (and their labels) from the data itself
references, labels = PCAKMeansSelector(n_components=20).calculate_references(data)

# Track distances from each embedding to the reference embeddings
config = EmbeddingConfig(
    references=references,
    labels=labels,
    distance_fn=DistanceFunction.euclidean,
)
schema = DeclarativeSchema([ResolverSpec(column_name="embeddings", metrics=[MetricSpec(EmbeddingMetric, config)])])

results = why.log(row={"embeddings": data}, schema=schema)

To enable drift detection, send your data profiles (not the raw data) to the WhyLabs Observatory Platform. It’s as easy as calling `.write("whylabs")` on your profile after you’ve signed up and set up your credentials.

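For instance, credentials can be supplied through environment variables; the variable names below follow whylogs’ WhyLabs writer convention, and the values are placeholders to replace with your own:

```python
import os

# Placeholder credentials -- substitute the values from your WhyLabs account.
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "org-0"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "model-0"
os.environ["WHYLABS_API_KEY"] = "your-api-key"

# With credentials in place, a logged result set can be uploaded with:
# results.writer("whylabs").write()
```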

Distance to closest reference is collected for each profiled data point, shown here on the Profiles tab. This shows that there’s drift between our reference and production data.


This feature has been released in beta and we’d love feedback on metrics and use cases for your organization. Join our Slack channel or start a GitHub issue to discuss improvements to embeddings with the WhyLabs team.

Get started by creating a free WhyLabs account or contact us to learn more about embeddings.

