
Deploy and Monitor your ML Application with Flask and WhyLabs

Improving Observability for your AI System

This article was written by Felipe de Pontes Adachi, and first appeared on Towards Data Science on November 3rd, 2021. It's republished with the author's permission.

One of the best milestones in every AI builder’s journey is the day when a model is ready to graduate from training and get deployed into production. According to a recent survey, most organizations already have more than 25 models in production. This underscores how much enterprises now depend on ML models performing as intended outside of the lab. However, the post-deployment phase can be a difficult one for your ML model. Data scientists may think that the hard part’s over once a model is deployed, but the truth is that the problems have only just begun. Data errors, broken pipelines, and model performance degradation will show up sooner or later and, when that day comes, we must be prepared to debug and troubleshoot these problems efficiently. In order to have a reliable and robust ML system, observability is an absolute must.

In this article, I’d like to share an approach to improving the observability of your ML application by efficiently logging and monitoring your models. To demonstrate, we’ll deploy a Flask application for pattern recognition based on the well-known Iris dataset. For the monitoring part, we’ll explore the free starter edition of the WhyLabs Observability Platform in order to set up our own model monitoring dashboard. From the dashboard, we’ll have access to all the statistics, metrics, and performance data gathered from every part of our ML pipeline. To interact with the platform, we’ll use whylogs, an open-source data logging library developed by WhyLabs.

One of the key use cases for a monitoring dashboard is when there is an unexpected change in our data quality and/or consistency. On that account, we simulate one of the most common model failure scenarios, that of feature drift, meaning that there’s a change in our input’s distribution. We then use a monitoring dashboard to see how this kind of problem can be detected and debugged. While feature drift may or may not affect your model’s performance, observing this kind of change in production should always be a reason for further inspection and, possibly, model retraining.

In summary, we’ll cover the step-by-step path to:

  • Deploy the WhyLabs-integrated Flask API
  • Test the application
  • Inject Feature Drift and Explore the model dashboard

If this is a problem that resonates with you, we’ve made the complete code for this tutorial available here for you to reuse. You’ll need to copy it onto your machine in order to follow this example. You’ll also need Docker and, preferably, a Python environment management tool, such as Conda or pipenv. For those interested in the details, we’ve included a very thorough description and a Jupyter Notebook as a guideline.

1. Overview

Let’s take a look at how the different components of our system interact with each other.

Image by author

We’ll deploy a Flask application locally, which is responsible for serving the user with the requested predictions through a REST endpoint. Our application will also use the whylogs library to create statistical profiles of both the input and output features of our application during production. These statistical profiles will then be sent in microbatches to WhyLabs at fixed intervals. WhyLabs merges them automatically, creating statistical profiles on a daily basis.

There are various ways to upload your logged metrics into the platform. In this example, we’re uploading them periodically in microbatches, as shown in the diagram. Another option would be to upload them at each log step.
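
To get a feel for why microbatch profiles can be merged into a daily profile without keeping the raw records, here is a minimal sketch in plain Python. It is not how whylogs builds its profiles: the `ColumnSummary` class and the `summarize` helper are hypothetical names, and the summary tracks only a count, a sum, a minimum, and a maximum for one numeric feature.

```python
from dataclasses import dataclass


@dataclass
class ColumnSummary:
    """Hypothetical mergeable summary of one numeric feature (not the whylogs API)."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def track(self, value: float) -> None:
        # Update the running statistics with a single observed value.
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other: "ColumnSummary") -> "ColumnSummary":
        # Combining two summaries needs only their statistics, not the raw rows.
        return ColumnSummary(
            count=self.count + other.count,
            total=self.total + other.total,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else float("nan")


def summarize(values):
    """Build a summary from an iterable of numbers."""
    summary = ColumnSummary()
    for value in values:
        summary.track(value)
    return summary


# Two microbatches of one feature, profiled separately and merged afterwards.
batch_a = summarize([5.1, 4.9, 4.7])
batch_b = summarize([6.3, 5.8])
daily = batch_a.merge(batch_b)
print(daily.count, daily.minimum, daily.maximum, round(daily.mean, 2))
```

Because each statistic can be combined from the two partial summaries, the merged result matches what we would get from profiling all five values at once.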

2. Setting up your WhyLabs Account

In order to monitor our application, let’s first set up a WhyLabs account. Specifically, we’ll need two pieces of information:

  • API token
  • Organization ID

Sign up for a free WhyLabs account. You can follow along with the quick start examples if you wish, but if you’re only interested in following this demonstration, you can go ahead and skip the quick start instructions.

After that, you’ll be prompted to create an API token, which you’ll use to interact with your dashboard. Once you create the token, copy it and store it in a safe place. The second important piece of information here is your org ID. Take note of it as well. WhyLabs gives you example code showing how to create a session and send data to your dashboard. You can run it to check that data is getting through. Otherwise, once you have your API token and org ID, you can go straight to the WhyLabs platform to see your shiny new model’s dashboard. To get to this step, we used the WhyLabs API Documentation, which also provides additional information about token creation and basic examples of how to use it.

3. Deploying the Flask Application

Once the dashboard is up and running, we can get started on deploying the application itself. We’ll serve a Flask application with Gunicorn, packaged into a Docker container to facilitate deployment.

WhyLabs configuration

The first step will be configuring the connection to WhyLabs. In this example, we do that through the .whylogs_flask.yaml file. The writer can be set to output data locally or to other destinations, such as S3, an MLflow path, or directly to WhyLabs. In this file, we’ll also set the project’s name and other additional information.

project: example-project
pipeline: example-pipeline
verbose: false
writers:
-   type: whylabs

Environment variables

The application assumes the existence of a set of environment variables, so we’ll define them in a .env file. These are loaded later using the dotenv library. Copy the content below and replace the WHYLABS_API_KEY and WHYLABS_DEFAULT_ORG_ID values with the ones you got when creating your WhyLabs account. You can keep the other variables as is.

# This is an example of what the .env file should look like
# Flask

# Swagger Documentation

# WhyLabs

# WhyLabs session
# ROTATION_TIME is how often we create the microbatch

Let’s talk about some of the other existing variables.

  • WHYLABS_DEFAULT_DATASET_ID: The ID of your dataset. model-1 is the default value, created automatically once you create your account. Leave it unchanged if you haven’t sent anything to WhyLabs yet. If it’s already populated, set up a new model in WhyLabs first; a new model ID (for example, model-2) will be assigned to it, and you can use that ID in your .env file.
  • ROTATION_TIME: Period used to send data to WhyLabs. In real applications, we might not need such frequent updates, but for this example, we set it to a low value so we don’t have to wait long to make sure it’s working.
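
Since a missing or empty variable only tends to surface once the application is already running, it can help to check the configuration up front. The sketch below is a hypothetical helper rather than part of the tutorial’s code: it assumes the variables have already been loaded into the process environment, and it only checks the variable names mentioned in this article.

```python
import os

# Variable names taken from this article; the helper itself is hypothetical.
REQUIRED_VARIABLES = (
    "WHYLABS_API_KEY",
    "WHYLABS_DEFAULT_ORG_ID",
    "WHYLABS_DEFAULT_DATASET_ID",
    "ROTATION_TIME",
    "MODEL_PATH",
)


def missing_variables(environ=None, required=REQUIRED_VARIABLES):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name, "").strip()]


# Example with an explicit mapping instead of the real process environment.
example_env = {
    "WHYLABS_API_KEY": "replace-me",
    "WHYLABS_DEFAULT_ORG_ID": "replace-me",
    "WHYLABS_DEFAULT_DATASET_ID": "model-1",
    "MODEL_PATH": "model.joblib",
}
print(missing_variables(example_env))
```

Calling `missing_variables()` with no arguments checks the real environment, so it could run once at startup, before the first request is served.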

Creating the environment

The application will be run from inside a Docker container, but we’ll also run some pre-scripts (to download the data and train the model) and post-scripts (to send the requests), so let’s create an environment for this tutorial. If you’re using Conda, you can create it from the environment.yml configuration file:

conda env create -f environment.yml

If that doesn’t work out as expected, you can also create an environment from scratch:

conda create -n whylogs-flask python=3.7.11
conda activate whylogs-flask

And then install directly from the requirements:

python -m pip install -r requirements.txt

Training the Model

If you look carefully, one of our environment variables is called MODEL_PATH. We don’t have a model yet, so let’s create it. We’ll use sklearn to train a simple SVC classification model on the Iris dataset and save it as model.joblib. From the root of our example’s folder, we can simply run the training routine provided with the code.
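
For readers who want a rough idea of what such a training routine involves, here is a minimal sketch. It is not the tutorial’s actual training script: it assumes the copy of the Iris data bundled with scikit-learn (whose labels are integers rather than names such as Iris-virginica), default SVC hyperparameters, and an arbitrary 80/20 split.

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the Iris data bundled with scikit-learn (the tutorial reads its own CSV).
features, labels = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

# Fit a support vector classifier with default hyperparameters.
model = SVC()
model.fit(x_train, y_train)
print(f"Held-out accuracy: {model.score(x_test, y_test):.2f}")

# Persist the fitted model under the file name mentioned in this article.
joblib.dump(model, "model.joblib")
```

The saved file can later be restored with `joblib.load("model.joblib")`, which is presumably what the application does with the path in MODEL_PATH.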


Building the Docker Image

Now we have everything to actually run our application. To build the image, we define the Dockerfile below:


ARG PYTHON_VERSION
FROM python:${PYTHON_VERSION}
# Create a working directory.
RUN mkdir /app
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the code
COPY ./ /app

CMD [ "gunicorn", "--workers=2", "--bind=0.0.0.0:5000", "--threads=1", "--reload", "autoapp:app"]

And then build the image and run our container:

docker build --build-arg PYTHON_VERSION=3.7 -t whylabs-flask:latest .
docker run --rm -p 5000:5000 -v $(pwd):/app whylabs-flask

This will start our application locally on port 5000. In this example, we’re serving our Flask application with Gunicorn, a WSGI HTTP server. By setting the --reload flag, we’re able to debug more effectively, since the workers are automatically reloaded whenever the code changes.

You should see a message from gunicorn starting the server:

[2021-10-12 17:53:01 +0000] [1] [INFO] Starting gunicorn 20.1.0
[2021-10-12 17:53:01 +0000] [1] [INFO] Listening at: (1)
[2021-10-12 17:53:01 +0000] [1] [INFO] Using worker: sync
[2021-10-12 17:53:01 +0000] [8] [INFO] Booting worker with pid: 8
[2021-10-12 17:53:01 +0000] [20] [INFO] Booting worker with pid: 20

4. Testing the API

Once the Docker container is running, we should check if the API is functional. The application has two endpoints:

  • /api/v1/health: Returns a 200 status response if the API is up and running.
  • /api/v1/predict: Returns a predicted class given an input feature vector.

The API is documented with Swagger, so you can head to the /apidocs page of the running application to explore the documentation:

Image by author

From /apidocs you’ll be able to see request examples to try out both endpoints, along with curl snippets, like:

curl -X GET "http://localhost:5000/api/v1/health" -H "accept: application/json"

curl -X POST "http://localhost:5000/api/v1/predict" -H "accept: application/json" -H "Content-Type: application/json" -d "{ \"petal_length_cm\": 0, \"petal_width_cm\": 0, \"sepal_length_cm\": 0, \"sepal_width_cm\": 0}"
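
As an alternative to curl, the prediction request can also be scripted with Python’s standard library. The sketch below is only an illustration: `build_predict_request` and `extract_class` are hypothetical helper names, the base URL is the one used by the request script later in this article, and the part that actually sends the request is left commented out because it needs the container to be running.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000/api/v1"  # base URL used by the request script in this article


def build_predict_request(features: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /predict endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(features).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def extract_class(response_body: str) -> str:
    """Pull the predicted class out of a JSON body with a data.class field."""
    return json.loads(response_body)["data"]["class"]


request = build_predict_request(
    {"petal_length_cm": 0, "petal_width_cm": 0, "sepal_length_cm": 0, "sepal_width_cm": 0}
)
print(request.get_method(), request.full_url)

# Sending the request requires the container to be running:
# with urllib.request.urlopen(request) as response:
#     print(extract_class(response.read().decode("utf-8")))
```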

You should get a “healthy” response from the first command. Likewise, the response from the prediction request should be:

{
  "data": {
    "class": "Iris-virginica"
  },
  "message": "Success"
}

Ok, our application is serving our requests appropriately. Now, we can just check if data is reaching our dashboard safe and sound.

It’s important to remember that data is not sent immediately with each request. Instead, it’s sent periodically, according to the ROTATION_TIME environment variable defined in the .env file. In this example, it was set to 5 minutes, which means that we’ll have to wait 5 minutes before the information is sent to our dashboard. Then, we should see a message from whylogs indicating that the upload is complete.
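
The idea of buffering records and sending them once per rotation period can be sketched in a few lines. This is not how whylogs schedules its uploads: `RotatingBatch` is a hypothetical class that only checks the elapsed time when a new record is logged, and it takes an injectable clock so that the example can simulate 5 minutes passing.

```python
import time


class RotatingBatch:
    """Hypothetical time-based microbatcher (not the whylogs implementation)."""

    def __init__(self, rotation_seconds, upload, clock=time.monotonic):
        self.rotation_seconds = rotation_seconds
        self.upload = upload  # callable that receives the list of buffered records
        self.clock = clock    # injectable so the example can fake the passage of time
        self.records = []
        self.window_start = clock()

    def log(self, record):
        # Flush the previous window first if the rotation interval has elapsed.
        now = self.clock()
        if now - self.window_start >= self.rotation_seconds and self.records:
            self.upload(self.records)
            self.records = []
            self.window_start = now
        self.records.append(record)


# Simulate a 5-minute rotation with a fake clock measured in seconds.
fake_now = [0.0]
uploaded = []
batch = RotatingBatch(300, uploaded.append, clock=lambda: fake_now[0])

batch.log({"sepal_width_cm": 3.5})  # t = 0 s, buffered
fake_now[0] = 120.0
batch.log({"sepal_width_cm": 3.0})  # t = 120 s, still buffered
fake_now[0] = 301.0
batch.log({"sepal_width_cm": 2.9})  # t = 301 s, the first two records are uploaded
print(len(uploaded), len(batch.records))
```

A real logger would flush on a timer rather than waiting for the next record, but the effect on the dashboard is the same: nothing shows up until a rotation period has passed.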

Once the upload is complete, the received information should be available in your model’s dashboard within minutes. From your model’s dashboard, you can check if any profile was logged, and access your summary’s dashboard:

Image by author

So far, we have 1 profile logged and 5 features, which means data is getting through. Now, let’s explore the platform a bit more with a use-case: We’ll add some outliers to our data distribution and see how that is reflected in our dashboard.

5. Detecting Feature Drift

The real value of any observability platform surfaces when things go wrong. Things can go wrong in a lot of different ways, but for this demonstration let’s pick a specific one: feature drift.

The drift phenomenon in the context of Machine Learning can be a lengthy subject, so in this case, we’ll limit ourselves to a basic definition of feature drift: when there is a change of distribution of the model’s input.

We’ll use a collection of 150 unchanged samples from the Iris dataset to represent the typical distribution of a daily batch. Then, to represent the anomalous batch, we’ll take the same set of samples, except that in 30 randomly chosen rows we’ll replace the value of one randomly chosen input feature with a random outlier. We’ll proceed to request predictions with input features as follows:

  • Day 1 — Normal data
  • Day 2 — Modified Data (Changed Distribution)
  • Day 3 — Normal Data

The code snippet below will make 150 requests with unchanged input features. For the second day, we simply change the code to pass data_mod instead of data when building the payload and repeat the process. For this demonstration, the outliers were added to the sepal_width feature.

import numpy as np
import pandas as pd
import requests


def add_random_column_outliers(data: pd.DataFrame, number_outliers: int = 10) -> pd.DataFrame:
    """Return a copy of data with outliers written into one randomly chosen column."""
    random_column = None
    data_mod = data.copy(deep=True)
    try:
        number_of_columns = len(data_mod.columns) - 2  # Index and label eliminated
        number_of_rows = data_mod.shape[0]
        random_column = data_mod.columns[np.random.randint(number_of_columns) + 1]
        for _ in range(number_outliers):
            random_row = np.random.randint(0, number_of_rows)
            data_mod.loc[random_row, random_column] = round(np.random.uniform(low=20.0, high=50.0), 2)
    except Exception as ex:
        # Raise a real exception: raising a plain string is itself a TypeError in Python 3
        raise RuntimeError(f"Error adding outliers in random column: {random_column}") from ex
    return data_mod


data = pd.read_csv('dataset/Iris.csv', header=None)

data_mod = add_random_column_outliers(data, 30)
print("Dataset distribution modified!")

url = "http://localhost:5000/api/v1"

labels = ["sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm"]
for k in range(data_mod.shape[0]):
    try:
        # Build a payload from the four feature columns of row k (use data_mod for the drifted day)
        payload = dict(zip(labels, data.iloc[:, 0:4].values[k]))
        response = requests.post(f"{url}/predict", json=payload)
        if response.ok:
            print(response.json())
    except requests.RequestException as ex:
        print(f"Request {k} failed: {ex}")

Let’s take a brief look at some sections from our dashboard. From the inputs page, there’s a lot of useful information, such as count for the number of records in each batch, null fraction, and inferred feature type.

Gif by author

One particularly useful piece of information is the Distribution Distance. This is a time series that shows the feature’s mean statistical distance to previous batches of data over the selected date range, using the Hellinger Distance to estimate the distances between distributions. By adding the outliers in the sepal_width feature, we see a change in the distribution distance not only for the sepal_width feature itself but also on the prediction results.
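
For reference, the Hellinger distance between two discrete distributions is the square root of half the sum of squared differences between the square roots of their probabilities, which keeps it between 0 (identical) and 1 (no overlap). The sketch below computes it over histogram counts with shared bins. The counts are made up for illustration, and the platform computes its distances from its own statistical profiles, so the numbers here are not comparable to the dashboard’s.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two histograms defined over the same bins."""
    if len(p) != len(q):
        raise ValueError("Both distributions must use the same bins")
    p_total, q_total = sum(p), sum(q)
    # Normalize the counts to probabilities, then compare their square roots.
    squared = sum(
        (math.sqrt(a / p_total) - math.sqrt(b / q_total)) ** 2 for a, b in zip(p, q)
    )
    return math.sqrt(squared / 2)


# Hypothetical histogram counts over four shared bins (not data from this article).
baseline = [10, 60, 70, 10]
same_shape = [20, 120, 140, 20]    # twice the records, same proportions
with_outliers = [10, 50, 60, 30]   # more mass in the last bin

print(hellinger_distance(baseline, same_shape))
print(round(hellinger_distance(baseline, with_outliers), 3))
```

Because the counts are normalized first, a batch with more records but the same proportions yields a distance of zero, while shifting mass into one bin produces a positive distance.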

Note: For the remaining features, the distance seems to increase on the last day. That was actually a mistake on my part, as I sent 1 additional record, for a total of 151 records. Since the graphs don’t share a common scale across features, the effect of this change looks large, but when inspecting the values themselves, the additional record yielded a distribution distance of only 0.02.

We can further inspect the individual features by clicking on them. From the distribution graphs, we can see the drift’s effects on both class and sepal_width features.

In the profiles section, we can select up to 3 dataset profiles for comparison, which is a very cool feature. In the image below, I selected a few features in each profile for comparison. As previously, the changes caused by adding the outliers are also very apparent.

Image by author

While we won’t be able to discuss every available feature here, some of my favorites are:

  • Manual thresholds configuration for each feature monitor
  • Notification scheduling for email/Slack
  • Uploading training set as a baseline reference set
  • Segment data during profiling

Feel free to explore them on your own!

6. Conclusion

The post-deployment phase of ML applications presents us with countless challenges and the process of troubleshooting these applications can still be very manual and cumbersome. In this post, we covered an example of how to ease this burden by improving our system’s observability with WhyLabs.

We covered a simple example of deploying a Flask application for classification, based on the well-known Iris dataset, and integrated it with the WhyLabs platform using the whylogs data logging library. We also simulated a Feature Drift scenario by injecting outliers to our input in order to see how that is reflected in our monitoring dashboard. Even though it’s a very simple use case, this example can be easily adapted to cover more tailored scenarios.

Thank you for reading, and feel free to reach out if you have any questions/suggestions! If you try it out and have questions, I’ve found the WhyLabs community on Slack very helpful and responsive.
