Deploy and Monitor your ML Application with Flask and WhyLabs
- AI Observability
- Whylogs
- ML Monitoring
- Open Source
- Integrations
Nov 9, 2021
Improving Observability for your AI System
This article was written by Felipe de Pontes Adachi, and first appeared on Towards Data Science on November 3rd, 2021. It's republished with the author's permission.
One of the best milestones in every AI builder’s journey is the day when a model is ready to graduate from training and get deployed into production. According to a recent survey, most organizations already have more than 25 models in production, which underscores how heavily enterprises rely on ML and how important it is that these models keep performing as intended outside of the lab. However, the post-deployment phase can be a difficult one for your ML model. Data scientists may think that the hard part’s over once a model is deployed, but the truth is the problems have only just begun. Data errors, broken pipelines, and model performance degradation will come sooner or later and, when that day arrives, we must be prepared to debug and troubleshoot these problems efficiently. In order to have a reliable and robust ML system, observability is an absolute must.
In this article, I’d like to share an approach on how to improve the observability of your ML application by efficiently logging and monitoring your models. To demonstrate, we’ll deploy a Flask application for pattern recognition based on the well-known Iris Dataset. On the monitoring part, we’ll explore the free, starter edition of the WhyLabs Observability Platform in order to set up our own model monitoring dashboard. From the dashboard, we’ll have access to all the statistics, metrics, and performance data gathered from every part of our ML pipeline. To interact with the platform, we’ll use whylogs, an open-source data logging library developed by WhyLabs.
One of the key use cases for a monitoring dashboard is when there is an unexpected change in our data quality and/or consistency. On that account, we simulate one of the most common model failure scenarios, that of feature drift, meaning that there’s a change in our input’s distribution. We then use a monitoring dashboard to see how this kind of problem can be detected and debugged. While feature drift may or may not affect your model’s performance, observing this kind of change in production should always be a reason for further inspection and, possibly, model retraining.
In summary, we’ll cover the step-by-step path to:
- Deploy the WhyLabs-integrated Flask API
- Test the application
- Inject feature drift and explore the model dashboard
If this is a problem that resonates with you, we’ve made the complete code for this tutorial available here for you to reuse. You’ll need to copy it onto your machine in order to follow this example. You’ll also need Docker and, preferably, a Python environment management tool, such as Conda or pipenv. For those interested in the details, we’ve included a very thorough description and a Jupyter Notebook as a guideline.
1. Overview
Let’s take a look at how the different components of our system interact with each other.
We’ll deploy a Flask application locally, which is responsible for serving the requested predictions to the user through a REST endpoint. Our application will also use the whylogs library to create statistical profiles of our application’s input and output features during production. These statistical profiles will then be sent in microbatches to WhyLabs at fixed intervals. WhyLabs merges them automatically, creating statistical profiles on a daily basis.
There are various ways to upload your logged metrics into the platform. In this example, we’re uploading them periodically in microbatches, as shown in the diagram. Another option would be to upload them at each log step.
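To make this flow more concrete, here is a rough, illustrative sketch of rotation-based logging with the whylogs v0 session API that was current when this article was written (get_or_create_session, session.logger with with_rotation_time, log_dataframe). The example repository wires this into the Flask application, so the exact code there will differ:
import os
import pandas as pd
from whylogs import get_or_create_session
# Illustrative only: create a whylogs session; configuration (e.g. the WhyLabs writer)
# comes from the project's YAML configuration file.
session = get_or_create_session()
# A rotation-based logger accumulates statistics and writes a profile (the microbatch)
# every rotation interval instead of on every single request.
logger = session.logger(
    dataset_name=os.environ.get("DATASET_NAME", "this_is_my_dataset"),
    with_rotation_time=os.environ.get("ROTATION_TIME", "1m"),
)
# For each prediction request, log the input features (and, optionally, the model output).
features = pd.DataFrame([{
    "sepal_length_cm": 5.1,
    "sepal_width_cm": 3.5,
    "petal_length_cm": 1.4,
    "petal_width_cm": 0.2,
}])
logger.log_dataframe(features)
# On shutdown, close the logger so any remaining data is written.
logger.close()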
2. Setting up your WhyLabs Account
In order to monitor our application, let’s first set up a WhyLabs account. Specifically, we’ll need two pieces of information:
- API token
- Organization ID
Go to https://whylabs.ai/whylabs-free-sign-up and grab a free account. You can follow along with WhyLabs’ quick-start examples if you wish, but if you’re only interested in this demonstration, you can go ahead and skip the quick-start instructions.
After that, you’ll be prompted to create an API token, which you’ll use to interact with your dashboard. Once you create the token, copy it and store it in a safe place. The second piece of information you need is your org ID; take note of it as well. WhyLabs gives you example code showing how to create a session and send data to your dashboard. You can test it as well and check that data is getting through. Otherwise, once you have your API token and org ID, you can go straight to https://hub.whylabsapp.com/models to see your shiny new model’s dashboard. To get to this step, we used the WhyLabs API Documentation, which also provides additional information about token creation and basic examples of how to use it.
3. Deploying the Flask Application
Once the dashboard is up and running, we can get started on deploying the application itself. We’ll serve a Flask application with Gunicorn, packaged into a Docker container to facilitate deployment.
WhyLabs configuration
The first step will be configuring the connection to WhyLabs. In this example, we do that through the .whylogs_flask.yaml file. The writer can be set to output data locally or to other destinations, such as S3, an MLflow path, or directly to WhyLabs. In this file, we’ll also set the project’s name and other additional information.
project: example-project
pipeline: example-pipeline
verbose: false
writers:
- type: whylabs
Environment variables
The application assumes the existence of a set of variables, so we’ll define them in a .env file. These are loaded later using the dotenv library. Copy the content below and replace the WHYLABS_API_KEY and WHYLABS_DEFAULT_ORG_ID values with the ones you got when creating your WhyLabs account. You can keep the other variables as they are.
# This is an example of what the .env file should look like
# Flask
FLASK_ENV=development
FLASK_DEBUG=1
FLASK_APP=autoapp.py
MODEL_PATH=model.joblib
# Swagger Documentation
SWAGGER_HOST=0.0.0.0:5000
SWAGGER_BASEPATH=/api/v1
SWAGGER_SCHEMES={"http"}
# WhyLabs
WHYLABS_CONFIG=.whylogs_flask.yaml
WHYLABS_API_KEY=<your-api-key>
WHYLABS_DEFAULT_ORG_ID=<your-org-id>
WHYLABS_DEFAULT_DATASET_ID=model-1
WHYLABS_API_ENDPOINT=https://api.whylabsapp.com
# WhyLabs session
DATASET_NAME=this_is_my_dataset
# ROTATION_TIME is how often we create the microbatch
ROTATION_TIME=1m
DATASET_URL=dataset/Iris.csv
Let’s talk about some of the other existing variables.
- WHYLABS_DEFAULT_DATASET_ID — The ID of your dataset. model-1 is the default value, created automatically once you create your account. Leave it unchanged if you haven’t sent anything to WhyLabs yet. But if it’s already populated, you can change it to model-2. Just remember to set up the new model at https://hub.whylabsapp.com/models; a new model ID will be assigned to the newly created model, and you can use it in your .env file.
- ROTATION_TIME — Period used to send data to WhyLabs. In real applications, we might not need such frequent updates, but for this example, we set it to a low value so we don’t have to wait that long to make sure it’s working.
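For reference, here is roughly how an application can pick these values up at startup with the python-dotenv library. The variable names match the .env file above, but the exact loading code in the repository may differ:
import os
from dotenv import load_dotenv
# Read the .env file from the current working directory and populate os.environ.
load_dotenv()
api_key = os.environ["WHYLABS_API_KEY"]
org_id = os.environ["WHYLABS_DEFAULT_ORG_ID"]
dataset_id = os.getenv("WHYLABS_DEFAULT_DATASET_ID", "model-1")
rotation_time = os.getenv("ROTATION_TIME", "1m")
model_path = os.getenv("MODEL_PATH", "model.joblib")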
Creating the environment
The application will be run from inside a Docker container, but we’ll also run some pre-scripts (to download the data and train the model) and post-scripts (to send the requests), so let’s create an environment for this tutorial. If you’re using Conda, you can create it from the environment.yml configuration file:
conda env create -f environment.yml
If that doesn’t work out as expected, you can also create an environment from scratch:
conda create -n whylogs-flask python=3.7.11
conda activate whylogs-flask
And then install directly from the requirements:
python -m pip install -r requirements.txt
Training the Model
If you look carefully, one of our environment variables is called MODEL_PATH. We don’t have a model yet, so let’s create it. We’ll use sklearn to train a simple SVC classification model on the Iris dataset and save it as model.joblib. From the root of our example’s folder, we can simply call the training routine:
python train.py
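If you’re curious what such a training routine might look like, here is a minimal sketch. It uses sklearn’s built-in copy of the Iris data instead of the CSV shipped with the repository, so the actual train.py will differ in its details:
import joblib
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
# Load the Iris data and hold out a small test split to sanity-check the model.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = SVC()
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
# Persist the trained model where the Flask app expects it (see MODEL_PATH in .env).
joblib.dump(model, "model.joblib")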
Building the Docker Image
Now we have everything to actually run our application. To build the image, we define the Dockerfile below:
ARG PYTHON_VERSION
FROM python:${PYTHON_VERSION}
# Create a working directory.
RUN mkdir /app
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the code
COPY ./ /app
CMD [ "gunicorn", "--workers=2", "--bind=0.0.0.0:5000", "--threads=1", "--reload", "autoapp:app"]
And then build the image and run our container:
docker build --build-arg PYTHON_VERSION=3.7 -t whylabs-flask:latest .
docker run --rm -p 5000:5000 -v $(pwd):/app whylabs-flask
This will start our application locally on port 5000. In this example, we’re running our Flask application as a WSGI HTTP server with Gunicorn. By setting the --reload flag, we’re able to debug more effectively, since the workers are automatically reloaded whenever the code changes.
You should see a message from gunicorn starting the server:
[2021-10-12 17:53:01 +0000] [1] [INFO] Starting gunicorn 20.1.0
[2021-10-12 17:53:01 +0000] [1] [INFO] Listening at: http://0.0.0.0:5000 (1)
[2021-10-12 17:53:01 +0000] [1] [INFO] Using worker: sync
[2021-10-12 17:53:01 +0000] [8] [INFO] Booting worker with pid: 8
[2021-10-12 17:53:01 +0000] [20] [INFO] Booting worker with pid: 20
4. Testing the API
Once the Docker container is running, we should check if the API is functional. The application has two endpoints:
- /api/v1/health: Returns a 200 status response if the API is up and running.
- /api/v1/predict: Returns a predicted class given an input feature vector.
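To give a sense of what sits behind the /predict endpoint, here is a simplified, hypothetical sketch of such a handler (names like FEATURE_NAMES are illustrative, not taken from the repository). The real application additionally validates the payload, documents the endpoint with Swagger, and logs the request and prediction with whylogs:
import os
import joblib
import pandas as pd
from flask import Flask, jsonify, request
app = Flask(__name__)
# Load the model trained by train.py (path comes from the MODEL_PATH variable in .env).
model = joblib.load(os.getenv("MODEL_PATH", "model.joblib"))
FEATURE_NAMES = ["sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm"]
@app.route("/api/v1/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    # Keep the feature order consistent with the one used during training.
    features = pd.DataFrame([[payload[name] for name in FEATURE_NAMES]], columns=FEATURE_NAMES)
    predicted_class = model.predict(features)[0]
    return jsonify({"data": {"class": str(predicted_class)}, "message": "Success"})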
The API is properly documented with Swagger, so you can head to http://127.0.0.1:5000/apidocs to explore the documentation:
From /apidocs you’ll be able to see request examples to try out both endpoints, along with curl snippets, like:
curl -X GET "http://127.0.0.1:5000/api/v1/health" -H "accept: application/json"
curl -X POST "http://127.0.0.1:5000/api/v1/predict" -H "accept: application/json" -H "Content-Type: application/json" -d "{ \"petal_length_cm\": 0, \"petal_width_cm\": 0, \"sepal_length_cm\": 0, \"sepal_width_cm\": 0}"
You should get a “healthy” response from the first command. Likewise, the response from the prediction request should be:
{
"data": {
"class": "Iris-virginica"
},
"message": "Success"
}
Ok, our application is serving our requests appropriately. Now, we can just check if data is reaching our dashboard safe and sound.
It’s important to remember that data is not sent immediately with each request. Rather, it’s sent periodically, according to the ROTATION_TIME environment variable defined in the .env file. In this example, it was set to 1 minute, which means we have to wait about a minute before the information is sent to our dashboard. Then, we should see a message from whylogs indicating that the upload is complete.
Once the upload is complete, the received information should be available in your model’s dashboard within minutes. From your model’s dashboard, you can check if any profile was logged, and access your summary’s dashboard:
So far, we have 1 profile logged and 5 features, which means data is getting through. Now, let’s explore the platform a bit more with a use-case: We’ll add some outliers to our data distribution and see how that is reflected in our dashboard.
5. Detecting Feature Drift
The real value of any observability platform surfaces when things go wrong. Things can go wrong in a lot of different ways, but for this demonstration let’s pick a specific one: feature drift.
The drift phenomenon in the context of Machine Learning can be a lengthy subject, so in this case, we’ll limit ourselves to a basic definition of feature drift: a change in the distribution of the model’s input features.
We’ll use a collection of 150 unchanged samples from the Iris dataset to represent the normal distribution of a daily batch. Then, to represent the anomalous batch, we’ll take the same set of samples, but inject random outliers into one randomly chosen input feature in 30 of the rows. We’ll proceed to request predictions with input features as follows:
- Day 1 — Normal data
- Day 2 — Modified data (changed distribution)
- Day 3 — Normal data
The code snippet below will make 150 requests with unchanged input features. For the second day, we simply change the code to pass data_mod as the payload instead of data (in the payload construction inside the loop) and repeat the process. For this demonstration, the outliers were added to the sepal_width feature.
import pandas as pd
import requests
import time
import numpy as np


def add_random_column_outliers(data: pd.DataFrame, number_outliers: int = 10) -> pd.DataFrame:
    random_column = None
    data_mod = data.copy(deep=True)
    try:
        number_of_columns = len(data_mod.columns) - 2  # Index and label eliminated
        number_of_rows = data_mod.shape[0]
        # Pick one random feature column (skipping the index column)
        random_column = data_mod.columns[np.random.randint(number_of_columns) + 1]
        for _ in range(number_outliers):
            random_row = np.random.randint(0, number_of_rows)
            data_mod.loc[random_row, random_column] = round(np.random.uniform(low=20.0, high=50.0), 2)
    except Exception as ex:
        raise RuntimeError(f"Error adding outliers in random column: {random_column}") from ex
    return data_mod


data = pd.read_csv("dataset/Iris.csv", header=None)
data_mod = add_random_column_outliers(data, 30)
print("Dataset distribution modified!")

url = "http://localhost:5000/api/v1"
labels = ["sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm"]

for k in range(data_mod.shape[0]):
    # Build the payload from the current row (use data_mod here for the drifted day)
    payload = dict(zip(labels, data.iloc[:, 0:4].values[k]))
    print(payload)
    response = requests.post(f"{url}/predict", json=payload)
    if response.ok:
        print(response.json())
    time.sleep(5)
Let’s take a brief look at some sections from our dashboard. The inputs page shows a lot of useful information, such as the count of records in each batch, the null fraction, and the inferred feature type.
One particularly useful piece of information is the Distribution Distance. This is a time series that shows each feature’s mean statistical distance to previous batches of data over the selected date range, using the Hellinger distance to estimate the distance between distributions. By adding the outliers to the sepal_width feature, we see a change in the distribution distance not only for sepal_width itself but also in the prediction results.
Note: For the remaining features, the distance seems to increase on the last day. That was actually a mistake on my part, as I sent 1 additional record, for a total of 151 records. Since the graphs are not on the same scale across features, the effect of this change looks large, but when inspecting the values themselves, the additional record yielded a distribution distance of only 0.02.
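For intuition, the Hellinger distance between two normalized histograms p and q is sqrt(0.5 * Σ(√p_i − √q_i)²), ranging from 0 (identical distributions) to 1 (completely disjoint ones). The sketch below shows one straightforward way to compute it from binned counts; WhyLabs’ exact binning and implementation details may differ:
import numpy as np

def hellinger_distance(p, q):
    """Hellinger distance between two histograms defined over the same bins."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # normalize counts into probabilities
    q = q / q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Example: the same feature on a "normal" day vs. a day with injected outliers.
normal_day = [5, 30, 70, 35, 10, 0]
drifted_day = [5, 25, 55, 30, 10, 25]
print(hellinger_distance(normal_day, drifted_day))  # larger values mean a bigger shift (max 1.0)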
We can further inspect the individual features by clicking on them. From the distribution graphs, we can see the drift’s effects on both the class and sepal_width features.
In the profiles section, we can select up to 3 dataset profiles for comparison, which is a very cool feature. In the image below, I selected a few features in each profile for comparison. As before, the changes caused by adding the outliers are very apparent.
While we won’t be able to discuss every available feature here, some of my favorites are:
- Configuring manual thresholds for each feature monitor
- Scheduling notifications for email/Slack
- Uploading the training set as a baseline reference set
- Segmenting data during profiling
Feel free to explore them on your own!
6. Conclusion
The post-deployment phase of ML applications presents us with countless challenges and the process of troubleshooting these applications can still be very manual and cumbersome. In this post, we covered an example of how to ease this burden by improving our system’s observability with WhyLabs.
We covered a simple example of deploying a Flask application for classification, based on the well-known Iris dataset, and integrated it with the WhyLabs platform using the whylogs data logging library. We also simulated a feature drift scenario by injecting outliers into our input in order to see how that is reflected in our monitoring dashboard. Even though it’s a very simple use case, this example can be easily adapted to cover more tailored scenarios.
Thank you for reading, and feel free to reach out if you have any questions/suggestions! If you try it out and have questions, I’ve found the WhyLabs community on Slack very helpful and responsive.