
WhyLabs Team

Nov 9, 2021


Deploy and Monitor your ML Application with Flask and WhyLabs

AI Observability
Whylogs
ML Monitoring
Open Source
Integrations


Improving Observability for your AI System

This article was written by Felipe de Pontes Adachi, and first appeared on Towards Data Science on November 3rd, 2021. It's republished with the author's permission.

One of the best milestones in every AI builder’s journey is the day a model graduates from training and is deployed into production. According to a recent survey, most organizations already have more than 25 models in production, which underscores how much enterprises now rely on ML to perform as intended outside of the lab. However, the post-deployment phase can be a difficult one for your ML model. Data scientists may think that the hard part is over once a model is deployed, but the truth is that the problems have only just begun. Data errors, broken pipelines, and model performance degradation are bound to appear sooner or later, and when that day comes, we must be prepared to debug and troubleshoot them efficiently. For an ML system to be reliable and robust, observability is a must.

In this article, I’d like to share an approach on how to improve the observability of your ML application by efficiently logging and monitoring your models. To demonstrate, we’ll deploy a Flask application for pattern recognition based on the well-known Iris Dataset. On the monitoring part, we’ll explore the free, starter edition of the WhyLabs Observability Platform in order to set up our own model monitoring dashboard. From the dashboard, we’ll have access to all the statistics, metrics, and performance data gathered from every part of our ML pipeline. To interact with the platform, we’ll use whylogs, an open-source data logging library developed by WhyLabs.

One of the key use cases for a monitoring dashboard is when there is an unexpected change in our data quality and/or consistency. On that account, we simulate one of the most common model failure scenarios, that of feature drift, meaning that there’s a change in our input’s distribution. We then use a monitoring dashboard to see how this kind of problem can be detected and debugged. While feature drift may or may not affect your model’s performance, observing this kind of change in production should always be a reason for further inspection and, possibly, model retraining.

In summary, we’ll cover the step-by-step path to:

Deploy the WhyLabs-integrated Flask API
Test the application
Inject Feature Drift and Explore the model dashboard

If this is a problem that resonates with you, we’ve made the complete code for this tutorial available here for you to reuse. You’ll need to copy it onto your machine in order to follow this example. You’ll also need Docker and, preferably, a Python environment management tool, such as Conda or pipenv. For those interested in the details, we’ve included a very thorough description and a Jupyter Notebook as a guideline.

1. Overview

Let’s take a look at how the different components of our system interact with each other.

We’ll deploy a Flask application locally, which serves the requested predictions to the user through a REST endpoint. Our application will also use the whylogs library to create statistical profiles of both the input and output features of our application during production. These statistical profiles will then be sent in microbatches to WhyLabs at fixed intervals. WhyLabs merges them automatically, creating statistical profiles on a daily basis.

There are various ways to upload your logged metrics into the platform. In this example, we’re uploading them periodically in microbatches, as shown in the diagram. Another option would be to upload them at each log step.
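To make the idea of merging microbatch profiles concrete, here is a minimal sketch of a summary that can be built per microbatch and combined afterwards. It is not how whylogs or WhyLabs implement profiles; the RunningStats class, its fields, and the sample values are hypothetical and only illustrate why a profile can be merged without keeping the raw data.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Hypothetical mergeable summary of one numeric feature."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    def update(self, value: float) -> None:
        # Welford's online update: one pass, no raw values stored.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def merge(self, other: "RunningStats") -> "RunningStats":
        # Combine two summaries as if one logger had seen every value.
        if other.count == 0:
            return RunningStats(self.count, self.mean, self.m2)
        if self.count == 0:
            return RunningStats(other.count, other.mean, other.m2)
        total = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / total
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / total
        return RunningStats(total, mean, m2)

    @property
    def variance(self) -> float:
        return self.m2 / self.count if self.count else 0.0


def summarize(values):
    stats = RunningStats()
    for v in values:
        stats.update(v)
    return stats


# Two microbatches of made-up sepal widths, profiled separately and merged later.
batch_a = summarize([3.5, 3.0, 3.2])
batch_b = summarize([3.1, 3.6])
daily = batch_a.merge(batch_b)
print(daily.count, round(daily.mean, 3), round(daily.variance, 4))
```

The merged summary matches what a single pass over all five values would produce, which is the property that lets profiles be uploaded in small pieces and still add up to a daily view.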

2. Setting up your WhyLabs Account

In order to monitor our application, let’s first set up a WhyLabs account. Specifically, we’ll need two pieces of information:

API token
Organization ID

Go to https://whylabs.ai/whylabs-free-sign-up and grab a free account. You can follow along with the examples if you wish, but if you’re interested in only following this demonstration, you can go ahead and skip the quick start instructions.

After that, you’ll be prompted to create an API token, which you’ll use to interact with your dashboard. Once you create the token, copy it and store it in a safe place. The second important piece of information here is your org ID. Take note of it as well. WhyLabs gives you example code showing how to create a session and send data to your dashboard. You can run it to check whether data is getting through. Otherwise, once you have your API token and org ID, you can just go to https://hub.whylabsapp.com/models to see your shiny new model’s dashboard. To get to this step, we used the WhyLabs API Documentation, which also provides additional information about token creation and basic examples of how to use it.

3. Deploying the Flask Application

Once the dashboard is up and running, we can get started on deploying the application itself. We’ll serve a Flask application with Gunicorn, packaged into a Docker container to facilitate deployment.

WhyLabs configuration

The first step will be configuring the connection to WhyLabs. In this example, we do that through the .whylogs_flask.yaml file. The writer can be set to output data locally or to other destinations, such as S3, an MLFlow path, or WhyLabs itself. In this file, we’ll also set the project’s name and other additional information.

project: example-project
pipeline: example-pipeline
verbose: false
writers:
-   type: whylabs

Environment variables

The application assumes the existence of a set of variables, so we’ll define them in a .env file. These are loaded later using the dotenv library. Copy the content below and replace the WHYLABS_API_KEY and WHYLABS_DEFAULT_ORG_ID values with the ones you got when creating your WhyLabs account. You can keep the other variables as is.

# This is an example of what the .env file should look like
# Flask
FLASK_ENV=development
FLASK_DEBUG=1
FLASK_APP=autoapp.py
MODEL_PATH=model.joblib

# Swagger Documentation
SWAGGER_HOST=0.0.0.0:5000
SWAGGER_BASEPATH=/api/v1
SWAGGER_SCHEMES={"http"}

# WhyLabs
WHYLABS_CONFIG=.whylogs_flask.yaml
WHYLABS_API_KEY=<your-api-key>
WHYLABS_DEFAULT_ORG_ID=<your-org-id>
WHYLABS_DEFAULT_DATASET_ID=model-1
WHYLABS_API_ENDPOINT=https://api.whylabsapp.com

# WhyLabs session
DATASET_NAME=this_is_my_dataset
# ROTATION_TIME is how often we create the microbatch
ROTATION_TIME=1m
DATASET_URL=dataset/Iris.csv

Let’s talk about some of the other existing variables.

WHYLABS_DEFAULT_DATASET_ID — The ID of your Dataset. Model-1 is the default value, created automatically once you create your account. Leave it unchanged if you haven’t sent anything to WhyLabs yet. But if it’s already populated, you can change it to model-2. Just remember to set up the new model in https://hub.whylabsapp.com/models. A new model-id will be assigned to the newly-created model, and you can use it in your .env file.
ROTATION_TIME — Period used to send data to WhyLabs. In real applications, we might not need such frequent updates, but for this example, we set it to a low value so we don’t have to wait that long to make sure it’s working.
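Before starting the container, it can help to check that the settings are sane. The sketch below is a hypothetical helper, not part of the tutorial's code or of whylogs: it converts a value such as 1m into seconds and lists required variables that are unset or still hold a placeholder. The set of accepted unit suffixes (s, m, h, d) is an assumption made for this example; the formats whylogs itself accepts may differ.

```python
import os
import re

REQUIRED_VARS = (
    "WHYLABS_API_KEY",
    "WHYLABS_DEFAULT_ORG_ID",
    "WHYLABS_DEFAULT_DATASET_ID",
)
# Assumed unit suffixes for this sketch only.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_rotation_time(value: str) -> int:
    """Convert a string such as '1m' into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if match is None:
        raise ValueError(f"Unsupported rotation time: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


def missing_settings(env=os.environ):
    """Return required variables that are unset or still placeholders."""
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        # Placeholders in the example file look like <your-api-key>.
        if not value or value.startswith("<"):
            missing.append(name)
    return missing


print(parse_rotation_time("1m"))
print(missing_settings({"WHYLABS_API_KEY": "<your-api-key>"}))
```

Passing a plain dictionary instead of os.environ, as in the last line, makes the check easy to try without touching the real environment.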

Creating the environment

The application will be run from inside a Docker container, but we’ll also run some pre-scripts (to download the data and train the model) and post-scripts (to send the requests), so let’s create an environment for this tutorial. If you’re using Conda, you can create it from the environment.yml configuration file:

conda env create -f environment.yml

If that doesn’t work out as expected, you can also create an environment from scratch:

conda create -n whylogs-flask python=3.7.11
conda activate whylogs-flask

And then install directly from the requirements:

python -m pip install -r requirements.txt

Training the Model

If you look carefully, one of our environment variables is called MODEL_PATH. We don’t have a model yet, so let’s create it. We’ll use sklearn to train a simple SVC classification model from the Iris dataset and save it as model.joblib. From the root of our example’s folder, we can simply call the training routine:

python train.py

Building the Docker Image

Now we have everything to actually run our application. To build the image, we define the Dockerfile below:

ARG PYTHON_VERSION
FROM python:${PYTHON_VERSION}

# Create a working directory.
RUN mkdir /app
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the code
COPY ./ /app

CMD [ "gunicorn", "--workers=2", "--bind=0.0.0.0:5000", "--threads=1", "--reload", "autoapp:app"]

And then build the image and run our container:

docker build --build-arg PYTHON_VERSION=3.7 -t whylabs-flask:latest .
docker run --rm -p 5000:5000 -v $(pwd):/app whylabs-flask

This will start our application locally on port 5000. In this example, Gunicorn acts as the WSGI HTTP server for our Flask application. Setting the --reload flag makes debugging easier, since the workers are automatically reloaded whenever the code changes.

You should see a message from gunicorn starting the server:

[2021-10-12 17:53:01 +0000] [1] [INFO] Starting gunicorn 20.1.0
[2021-10-12 17:53:01 +0000] [1] [INFO] Listening at: http://0.0.0.0:5000 (1)
[2021-10-12 17:53:01 +0000] [1] [INFO] Using worker: sync
[2021-10-12 17:53:01 +0000] [8] [INFO] Booting worker with pid: 8
[2021-10-12 17:53:01 +0000] [20] [INFO] Booting worker with pid: 20

4. Testing the API

Once the Docker container is running, we should check if the API is functional. The application has two endpoints:

/api/v1/health: Returns a 200 status response if the API is up and running.
/api/v1/predict: Returns a predicted class given an input feature vector.

The API is documented with Swagger, so you can head to http://127.0.0.1:5000/apidocs to explore the documentation:

From /apidocs you’ll be able to see request examples to try out both endpoints, along with curl snippets, like:

curl -X GET "http://127.0.0.1:5000/api/v1/health" -H "accept: application/json"

curl -X POST "http://127.0.0.1:5000/api/v1/predict" -H "accept: application/json" -H "Content-Type: application/json" -d "{ \"petal_length_cm\": 0, \"petal_width_cm\": 0, \"sepal_length_cm\": 0, \"sepal_width_cm\": 0}"

You should get a “healthy” response from the first command. Likewise, the response from the prediction request should be:

{
  "data": {
    "class": "Iris-virginica"
  }, 
  "message": "Success"
}
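The same prediction call can also be made from Python. The following is a minimal sketch using only the standard library; the helper names build_predict_request and predict are hypothetical, and the sample values are illustrative. Building the request is separated from sending it, so the request can be inspected without the container running.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:5000/api/v1"


def build_predict_request(features: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Prepare the POST request for the /predict endpoint."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def predict(features: dict, base_url: str = BASE_URL, timeout: float = 5.0) -> dict:
    """Send the request; this needs the container to be up and listening."""
    request = build_predict_request(features, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


sample = {
    "sepal_length_cm": 5.1,
    "sepal_width_cm": 3.5,
    "petal_length_cm": 1.4,
    "petal_width_cm": 0.2,
}
request = build_predict_request(sample)
print(request.get_method(), request.full_url)
```

With the application running, predict(sample) should return a dictionary shaped like the JSON response above.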

Ok, our application is serving our requests appropriately. Now, we can just check if data is reaching our dashboard safe and sound.

It’s important to remember that data is not sent immediately with each request. Instead, it’s sent periodically, according to the ROTATION_TIME environment variable defined in the .env file. With the .env file shown earlier, it’s set to 1m, which means we’ll have to wait about a minute before the information is sent to our dashboard. Then, we should see a message from whylogs indicating that the upload is complete.
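The periodic upload can be pictured as a buffer that collects records and hands them off once the rotation interval has passed. The sketch below is a hypothetical illustration of that idea, not whylogs' implementation: the MicrobatchBuffer class and its upload callback are assumptions, and it only rotates when the next record arrives, whereas a real logger would more likely rotate on a timer.

```python
import time


class MicrobatchBuffer:
    """Hypothetical buffer that flushes records once per rotation interval."""

    def __init__(self, rotation_seconds, upload, clock=time.monotonic):
        self.rotation_seconds = rotation_seconds
        self.upload = upload    # callable that receives a list of records
        self.clock = clock      # injectable so the logic can be exercised quickly
        self.records = []
        self.window_start = clock()

    def log(self, record):
        # Rotate first, so a record never lands in an expired window.
        if self.clock() - self.window_start >= self.rotation_seconds:
            self.flush()
        self.records.append(record)

    def flush(self):
        if self.records:
            self.upload(self.records)
        self.records = []
        self.window_start = self.clock()


# Simulated clock, so the example does not have to wait a real minute.
uploads = []
now = [0.0]
buffer = MicrobatchBuffer(60, uploads.append, clock=lambda: now[0])

buffer.log({"sepal_width_cm": 3.5})
now[0] = 30.0
buffer.log({"sepal_width_cm": 3.0})  # same window, nothing uploaded yet
now[0] = 61.0
buffer.log({"sepal_width_cm": 3.2})  # window expired, first two records go out
print(len(uploads), len(uploads[0]), len(buffer.records))
```

The third record stays in the buffer until the next rotation, which is why a request made just now may take up to one full interval to show up.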

Once the upload is complete, the received information should be available in your model’s dashboard within minutes. From your model’s dashboard, you can check if any profile was logged, and access your summary’s dashboard:

So far, we have 1 profile logged and 5 features, which means data is getting through. Now, let’s explore the platform a bit more with a use-case: We’ll add some outliers to our data distribution and see how that is reflected in our dashboard.

5. Detecting Feature Drift

The real value of any observability platform surfaces when things go wrong. Things can go wrong in a lot of different ways, but for this demonstration let’s pick a specific one: feature drift.

The drift phenomenon in the context of Machine Learning can be a lengthy subject, so in this case, we’ll limit ourselves to a basic definition of feature drift: when there is a change of distribution of the model’s input.

We’ll use a collection of 150 unchanged samples from the Iris dataset to represent the normal distribution of a daily batch. Then, to represent the anomalous batch, we’ll take the same set of samples but overwrite one randomly chosen input feature with random outliers in 30 of them. We’ll proceed to request predictions with input features as follows:

Day 1 — Normal data
Day 2 — Modified Data (Changed Distribution)
Day 3 — Normal Data

The code snippet below will make 150 requests with unchanged input features. For the second day, we simply change the line that builds payload to read from data_mod instead of data and repeat the process. For this demonstration, the outliers were added to the sepal_width feature.

import time

import numpy as np
import pandas as pd
import requests


def add_random_column_outliers(data: pd.DataFrame, number_outliers: int = 10) -> pd.DataFrame:
    """Return a copy of data with outliers written into one random feature column."""
    random_column = None
    data_mod = data.copy(deep=True)
    try:
        # Outliers go into one of the first four columns, the same columns
        # the payload below sends as input features (an assumption about
        # the CSV layout: four features first, label last, no header row).
        feature_columns = data_mod.columns[:4]
        number_of_rows = data_mod.shape[0]
        random_column = feature_columns[np.random.randint(len(feature_columns))]
        for _ in range(number_outliers):
            # Rows are drawn with replacement, so a row can be picked twice.
            random_row = np.random.randint(0, number_of_rows)
            data_mod.loc[random_row, random_column] = round(np.random.uniform(low=20.0, high=50.0), 2)
    except Exception as ex:
        raise RuntimeError(f"Error adding outliers in random column: {random_column}") from ex
    return data_mod


data = pd.read_csv('dataset/Iris.csv', header=None)

data_mod = add_random_column_outliers(data, 30)
print("Dataset distribution modified!")

url = "http://localhost:5000/api/v1"

labels = ["sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm"]
for k in range(data.shape[0]):
    # Build the payload from row k; read from data_mod to send the drifted batch
    payload = dict(zip(labels, data.iloc[k, 0:4].tolist()))
    print(payload)
    response = requests.post(f"{url}/predict", json=payload)
    if response.ok:
        print(response.json())
    else:
        print(f"Request failed with status {response.status_code}")
    # Pace the requests so they spread over several rotation intervals
    time.sleep(5)

Let’s take a brief look at some sections from our dashboard. From the inputs page, there’s a lot of useful information, such as count for the number of records in each batch, null fraction, and inferred feature type.

One particularly useful piece of information is the Distribution Distance. This is a time series that shows the feature’s mean statistical distance to previous batches of data over the selected date range, using the Hellinger Distance to estimate the distances between distributions. By adding the outliers in the sepal_width feature, we see a change in the distribution distance not only for the sepal_width feature itself but also on the prediction results.
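For intuition about what the Hellinger distance measures, here is a minimal sketch that compares two batches through simple histograms. WhyLabs computes its distance from the uploaded profiles, and how it bins the data is not described here, so the bin edges, the sample values, and both helper functions are assumptions made for illustration.

```python
import math


def histogram(values, edges):
    """Return the fraction of values in each bin; the last bin includes its upper edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("No values fall inside the given bin edges")
    return [c / total for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


# Assumed bins; the wide last bin catches outliers in the 20-50 range.
edges = [0, 2, 3, 4, 5, 50]
# Made-up sepal widths: a normal batch and one with two injected outliers.
normal = [3.5, 3.0, 3.2, 3.1, 3.6, 2.9, 3.4, 3.0]
drifted = [3.5, 3.0, 3.2, 3.1, 35.2, 41.7, 3.4, 3.0]

baseline = histogram(normal, edges)
print(hellinger(baseline, histogram(normal, edges)))
print(hellinger(baseline, histogram(drifted, edges)))
```

Comparing the normal batch with itself gives zero, while the batch with outliers gives a clearly positive distance, which is the kind of jump the Distribution Distance chart makes visible.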

Note: For the remaining features, the distance seems to increase for the last day. That was actually a mistake on my part, as I sent 1 additional record, for a total of 151 records. Since the graphs are not in scale between features, the effect of this change seems high, but when inspecting the values themselves, the additional record yielded a distribution distance of 0.02.

We can further inspect the individual features by clicking on them. From the distribution graphs, we can see the drift’s effects on both class and sepal_width features.

In the profiles section, we can select up to 3 dataset profiles for comparison, which is a very cool feature. In the image below, I selected a few features in each profile for comparison. As before, the changes caused by adding the outliers are very apparent.

While we won’t be able to discuss every available feature here, some of my favorites are:

Manual thresholds configuration for each feature monitor
Notification scheduling for email/Slack
Uploading training set as a baseline reference set
Segment data during profiling

Feel free to explore them on your own!

6. Conclusion

The post-deployment phase of ML applications presents us with countless challenges and the process of troubleshooting these applications can still be very manual and cumbersome. In this post, we covered an example of how to ease this burden by improving our system’s observability with WhyLabs.

We covered a simple example of deploying a Flask application for classification, based on the well-known Iris dataset, and integrated it with the WhyLabs platform using the whylogs data logging library. We also simulated a feature drift scenario by injecting outliers into our input in order to see how that is reflected in our monitoring dashboard. Even though it’s a very simple use case, this example can be easily adapted to cover more tailored scenarios.

Thank you for reading, and feel free to reach out if you have any questions/suggestions! If you try it out and have questions, I’ve found the WhyLabs community on Slack very helpful and responsive.
