
Understanding and Monitoring Embeddings in Amazon SageMaker with WhyLabs

This is a summary of ‘Understanding and Monitoring Embeddings in Amazon SageMaker with WhyLabs AI Observatory Platform’, a blog post written in collaboration with AWS. To read the full article, visit the AWS Partner Network (APN) blog.

This article explores the various ways embeddings are used in Machine Learning (ML) and identifies potential problems that may arise. It also explains how to use WhyLabs to identify these issues and set up monitoring systems to stop them from happening again in the future.

What are embeddings?

Embeddings are a way to represent complex data types as numerical representations that preserve context and relationships. They can be sparse or dense to represent different types of data, and embeddings are heavily used in machine learning for a variety of data types and tasks as inputs, intermediate products, and outputs.
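To make the sparse-versus-dense distinction concrete, here is a toy sketch (not tied to any particular embedding model): a sparse TF-IDF matrix built with scikit-learn, and a small dense representation produced by a random projection standing in for a learned embedding.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cats purr softly", "dogs bark loudly", "cats and dogs play"]

# Sparse representation: one dimension per vocabulary term, mostly zeros.
sparse = TfidfVectorizer().fit_transform(docs)
print(sparse.shape)  # one row per document, one column per vocabulary term

# Dense representation: a fixed low dimension. A random projection stands
# in here for a trained embedding model.
rng = np.random.default_rng(0)
projection = rng.normal(size=(sparse.shape[1], 8))
dense = sparse.toarray() @ projection
print(dense.shape)  # one row per document, 8 dense dimensions
```

Real embedding models learn the dense mapping from data; the point of the sketch is only the shape of each representation.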

How are embeddings used?

Although there are numerous approaches to creating embeddings, we won't go over them in this post. Instead, we'll go over how embeddings can be used to measure meaningful drift in the transformed inputs, by identifying clusters and their centroids and measuring the distances between individual points and those centroids.

Typically, for debugging, data scientists use lower-dimensional representations such as UMAP or t-SNE. These are helpful for visually identifying clusters, but they aren't a scalable way to understand your embeddings over time in production.

To handle this in a scalable way, whylogs, the open-source library for logging any kind of data, creates a lightweight statistical profile of your data that can be used to extract meaningful insights and characteristics, letting you measure quality and drift over time.

Using whylogs, customers can identify centroids in their embeddings and measure distances within and between clusters. This is useful for understanding how your embeddings evolve gradually over time or shift suddenly in response to an upstream data change. Read more in this blog post.

Figure 1 – Visualization of embedding space.
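The distance-to-centroid idea can be sketched in plain NumPy. This is only an illustration of the concept on toy 2-D "embeddings", not the whylogs API itself:

```python
import numpy as np

# Toy embeddings: two well-separated clusters in 2-D.
embeddings = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])

# Centroid of each cluster is the mean of its members.
centroids = np.stack([embeddings[labels == k].mean(axis=0) for k in (0, 1)])

# Distance from every embedding to every centroid; the closest centroid
# per row is a cheap, scalable summary to track for drift.
dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=-1)
closest = dists.argmin(axis=1)
print(closest)  # → [0 0 1 1]
```

Tracking the distribution of `closest` (and the distances themselves) per batch is what makes this approach scale where t-SNE plots do not.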

Train and deploy a classification model

Amazon SageMaker lets you build, train, and deploy ML models using fully managed infrastructure, tools, and workflows. We'll use it to set up and train a simple classification model.

Check out the full article for detailed steps on:

  • Using the newsgroup data source to create vectors and training our model on those vectors.
  • Creating an entrypoint script that defines how to load our model and make predictions.
  • Deploying our model to an endpoint so we can make batched predictions and compare our results.
  • Using a pretrained model on this same dataset to streamline the process of defining the endpoint in SageMaker.
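As a rough local stand-in for the training step (the actual SageMaker setup, newsgroup dataset, and entrypoint script are in the full article), a vectorize-and-classify pipeline over a few hypothetical documents might look like:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny in-line corpus so the sketch is self-contained; the real model
# trains on 20 Newsgroups documents inside SageMaker.
train_docs = [
    "for sale cheap bike",
    "bike for sale",
    "god and religion",
    "faith and religion",
]
train_labels = ["forsale", "religion"][0:1] * 2 + ["religion"] * 2

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_docs, train_labels)

print(model.predict(["bike sale today"]))  # → ['forsale']
```

The pipeline bundles vectorization with the classifier, which mirrors how an entrypoint script would load one artifact and run both steps at inference time.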

Measuring embedding distances with whylogs

Once we have our model trained and our entrypoint defined, we'll capture a set of reference points to identify the centroids of the embeddings our model was trained on. These will let us compare distances to each centroid during inference, which we'll revisit a bit later in this post.

To capture our reference points, we have a few different options in whylogs. We can manually define relationships, or let whylogs automatically identify centroids based on corresponding labels or by utilizing an unsupervised clustering approach.
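Both automatic options can be sketched on toy data with NumPy and scikit-learn. This illustrates the idea only; whylogs' actual centroid selectors are covered in the full article.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy embedding matrix standing in for the reference embeddings:
# two clusters of 20 points each in 4 dimensions.
rng = np.random.default_rng(42)
embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 4)),
    rng.normal(loc=3.0, scale=0.1, size=(20, 4)),
])
labels = np.array([0] * 20 + [1] * 20)

# Option 1: centroids from known labels (mean of each labeled group).
by_label = np.stack([embeddings[labels == k].mean(axis=0) for k in (0, 1)])

# Option 2: centroids from unsupervised clustering.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
by_kmeans = kmeans.cluster_centers_

print(by_label.round(1))
print(by_kmeans.round(1))
```

With well-separated clusters the two options agree; the label-based route is preferable when reliable labels exist, while clustering covers the unlabeled case.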

To follow along with the example, visit the full blog post.

Monitoring embedding drift with WhyLabs Observatory

If you've followed along with the original article, at this point we have a trained model, reference embeddings, and a whylogs resolver defined to extract the information we want from our embeddings. To see the power of measuring embedding distances, we'll create a scenario where we use our classifier to predict the classes of documents it learned from our training set.

Here’s a high-level architecture of what we’ve done:

Figure 2 – Architecture of the integration in this post.

When we open our project in WhyLabs, we see that our profiles were successfully generated for each batch and submitted to the platform. We won’t cover every feature and output created by our resolver but will highlight three of them below.

Observe introduced drift in WhyLabs

Now you’ll have access to a number of different features in your dashboard that represent the different aspects of the pipeline we monitored:

  • news_centroids: Relative distance of each document to the centroids of each reference topic cluster, and frequent items for the closest centroid for each document.
  • document_tokens: Distribution of tokens (term length, document length and frequent items) in each document.
  • output_prediction and output_target: The output (predictions and targets) of the classifier that will also be used to compute metrics on the “Performance” tab.

With the monitored information, we should be able to correlate the anomalies and reach a conclusion about what happened.


In the chart below, we can see the distribution of the closest centroid for each document. For the first four days, the distributions are similar to one another. The language perturbations injected over the last three days skew the distribution toward the “forsale” topic.

Figure 3 – Visualization in WhyLabs for ‘news_centroids.closest’ input.


Since we removed the English stopwords in our tokenization process but didn’t remove the Spanish stopwords, we can see that most of the frequent terms in the selected period are the Spanish stopwords, and those stopwords don’t appear in the first four days.

Figure 4 – Visualization in WhyLabs for ‘document_tokens.frequent_terms.’


In the “Performance” tab, there is plenty of information telling us our performance is degrading. For example, the F1 chart below shows the model getting steadily worse starting from the fifth day.

Figure 5 – F1 performance metric visualization in WhyLabs.
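The per-day F1 score behind a chart like this can be sketched with scikit-learn (toy labels here, not the article's actual predictions):

```python
from sklearn.metrics import f1_score

# Hypothetical (target, prediction) pairs for a healthy day and a
# degraded day where the model collapses onto one class.
daily = {
    "day1": (["a", "b", "a", "b"], ["a", "b", "a", "b"]),
    "day5": (["a", "b", "a", "b"], ["a", "a", "a", "a"]),
}
for day, (y_true, y_pred) in daily.items():
    score = f1_score(y_true, y_pred, average="macro", zero_division=0)
    print(day, round(score, 3))  # day1 → 1.0, day5 → 0.333
```

WhyLabs computes these metrics from the logged predictions and targets (the `output_prediction` and `output_target` columns above) and plots them per batch.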

For now, we’ll focus on how to use WhyLabs to monitor these signals and be notified in the future when our dataset changes and impacts our model’s performance. We cover the steps below in more detail in this blog post.

  1. Navigate to the Monitor Manager and select the “Presets” tab.
  2. Next, create a drift monitor on discrete inputs using the “Configure” option on the “Data drift in model inputs” for “All discrete inputs.” Click through to modify the drift distance threshold under section 2 and leave everything else the same.
  3. Lastly, use the save button at the bottom to complete creating our monitor.

Now, we’ll test our monitor on the “news_centroids.closest” feature to show the drift in categorical distribution when we changed our language to Spanish, causing the “forsale” cluster to become the closest centroid cluster more consistently.

Figure 8 – Monitor failure preview in WhyLabs for ‘news_centroids.closest’ input.

We can see that WhyLabs identified the drift in closest clusters, which would have triggered an alert to our downstream notification endpoint. This can help us catch a sudden change like this early in the future.

Start your WhyLabs and Amazon SageMaker journey

Embarking on your journey with WhyLabs and Amazon SageMaker is simple. Take a look at the sample notebook that the example in this post is built from, then head over to WhyLabs Observatory to create a free account and begin monitoring your SageMaker models.

You can also learn more about the WhyLabs AI Observability Platform in AWS Marketplace.
