Deploying and Monitoring Made Easy with TeachableHub and WhyLabs
Mar 16, 2022
Deploying a model into production and maintaining its performance can be harrowing for many Data Scientists, especially without specialized expertise and equipment. Fortunately, TeachableHub and WhyLabs make it easy to get models out of the sandbox and into a production-ready environment.
TeachableHub - a fully managed platform that brings ML teams together and streamlines their efforts to deploy, serve, and share impactful models as public or private serverless APIs with zero downtime.
WhyLabs - the premier AI Observability platform enabling you to achieve healthy models, fewer incidents, and happy customers.
To give you an idea of how smoothly these monitoring and deployment solutions pair, we decided to use an open dataset. And what better way to pick the perfect one than to ask the people in the most active communities out there - the MLOps community, the Data Talks Club, and the Locally Optimistic community? Here are some of the great dataset suggestions we received, if you would like to explore them:
- 21 Repositories of free/open datasets for your next Data Science Project
- Top NLP Code Repositories
- Datasets from the free course based on the Machine Learning Bookcamp book
- Google BigQuery datasets
- Datasets from the 2021 series of Tidy Tuesday events
- Stop using Iris dataset
- MovieLens dataset
- Jump rope dataset
- Messy dataset
We decided to take the example of a Credit Card Fraud Detection problem because this dataset is large, time-series, and includes a heavy class imbalance, which makes it challenging for most ML monitoring solutions to detect issues.
In the following lines, we’ll show how well WhyLabs and TeachableHub integrate. For this project, we decided to work and collaborate in Deepnote - a Jupyter-compatible data science notebook with real-time collaboration in the cloud. We will start by setting up our environment and splitting our dataset into 3 batches: training, testing, and production. From there, we train a simple logistic regression model on the training dataset, evaluate its performance on test data, and deploy it into production using TeachableHub. Once the model is in production, we feed it drifted data that doesn’t match the training and test data. By causing the model to perform poorly, we can observe the change and the resulting performance degradation using WhyLabs.
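The three-way split described above can be sketched as follows. This is a minimal, standard-library-only illustration rather than the notebook's code: the `split_batches` helper, the 60/20/20 fractions, the chronological ordering, and the toy rows are all assumptions. The `Time`, `Amount`, and `Class` column names follow the public credit card fraud dataset.

```python
from typing import Dict, List, Tuple

Row = Dict[str, float]


def split_batches(
    rows: List[Row], train_frac: float = 0.6, test_frac: float = 0.2
) -> Tuple[List[Row], List[Row], List[Row]]:
    """Split rows into training, testing, and production batches by time order."""
    if train_frac <= 0 or test_frac <= 0 or train_frac + test_frac >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")
    # Sort by time so that the production batch holds the most recent rows.
    ordered = sorted(rows, key=lambda r: r["Time"])
    train_end = int(len(ordered) * train_frac)
    test_end = train_end + int(len(ordered) * test_frac)
    return ordered[:train_end], ordered[train_end:test_end], ordered[test_end:]


# Toy rows standing in for the real dataset (values are made up).
rows = [{"Time": float(t), "Amount": 10.0 * t, "Class": 0.0} for t in range(10)]
train, test, production = split_batches(rows)
```

Keeping the production batch as the latest slice mimics how a deployed model only ever sees data that arrives after training.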
The Notebook with the detailed version of this example is available and can be directly replicated at your leisure.
To kick off, install all the packages required for your project.
imbalanced-learn (imported as imblearn) is an open-source library that builds on scikit-learn. It provides tools for classification problems with imbalanced classes, such as the one we have chosen for this example.
The TeachableHub SDK can be used to make deployments, serve predictions, manage API interactions, and more. It can be integrated into any training logic, into a notebook during experimentation, or into a CI/CD system in a production environment.
The whylogs SDK is used to generate statistical summaries of the data (at the training, testing, and production stages) which can then be sent to WhyLabs to provide monitoring and observability for a machine learning system in production.
Import whylogs and configure the variables
The next step is to import the whylogs packages and configure the variables that you'll be using in the whole environment.
Use the API key that you’ve created with WhyLabs’ free self-serve offering and the ORG ID associated with it to ensure that you’re writing data profiles to the correct model in WhyLabs.
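One way to keep these credentials out of the notebook itself is to read them from the environment and fail early when one is missing. The sketch below is an illustration only: the `load_whylabs_config` helper is hypothetical, and the three variable names are assumed from common whylogs examples, so confirm them against the whylogs version you install.

```python
from typing import Dict, Mapping

# Assumed variable names; confirm them against your whylogs version.
REQUIRED_KEYS = (
    "WHYLABS_API_KEY",
    "WHYLABS_DEFAULT_ORG_ID",
    "WHYLABS_DEFAULT_DATASET_ID",
)


def load_whylabs_config(env: Mapping[str, str]) -> Dict[str, str]:
    """Return the WhyLabs settings from env, failing early if any is missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"missing configuration: {', '.join(missing)}")
    return {key: env[key] for key in REQUIRED_KEYS}


# Placeholder values for illustration; in a notebook, pass os.environ instead.
example_env = {
    "WHYLABS_API_KEY": "<your-api-key>",
    "WHYLABS_DEFAULT_ORG_ID": "<your-org-id>",
    "WHYLABS_DEFAULT_DATASET_ID": "<your-model-id>",
}
config = load_whylabs_config(example_env)
```

Checking all three values up front avoids profiles silently landing in the wrong organization or model.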
Monitoring in experimentation
Logging the Training Dataset - Input features
Next, we need to log our training data to WhyLabs, creating the first baseline.
Logging the Test Dataset - Input and Predictions
We'll log the testing dataset with a timestamp of three days ago to create our baseline profile in WhyLabs.
For the testing dataset, we'll also log the metrics to track our model's performance. We'll do that by calling `log_metrics`, passing a list of targets (true classes) and a list of predictions, in addition to the prediction scores, which represent our probability estimates.
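Backdating a profile only requires a timestamp shifted into the past. A minimal sketch using the standard library is shown here; the variable names are illustrative, and how the timestamp is passed to the logger depends on the whylogs version in use.

```python
from datetime import datetime, timedelta, timezone

# Use timezone-aware UTC timestamps so batches line up regardless of locale.
now = datetime.now(timezone.utc)

# The test batch is logged as if it had been observed three days ago.
test_timestamp = now - timedelta(days=3)
```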
Deploying your Experiment
Import TeachableHub and configure the variables
Your next step is to import the TeachableDeployment package and configure the variables related to the deployment that you'll be using in the whole environment.
- teachable - copy and paste the user name and teachable name from the upper left corner of your TeachableHub account.
- environment - enter the environment to which you want to deploy the model (e.g. Production, Staging).
- deploy_key - copy and paste your one-time generated deploy key.
TeachableHub supports multiple environments by default, making it easy to create new ones and design custom processes for releasing new candidates to production. Also, with the security keys for every deployment, users can ensure full control over what goes to production and reaches the end-user. The whole setup can be done with a couple of clicks via the intuitive UI of the platform.
Making the model more self-explanatory in terms of accepted features and returned predictions allows engineers, product owners, and other stakeholders to better understand and work with it. Making the inputs and outputs of the model human-readable is therefore recommended, and it takes only a couple of seconds.
Request Schema Validation
To enable the TeachableHub Serving API to accept human-readable features (e.g. Float, String, Integer), users must specify a Feature Schema Validation. In addition to making the model more understandable, the schema filters out requests that are wrongly formatted, are missing features, or list features in the wrong order. This eliminates many mistakes and errors in your Teachable API and makes the model less likely to behave unexpectedly or break the end-user application outright.
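To illustrate what such a check does in principle, here is a minimal sketch written with the standard library only. It is not TeachableHub's schema format or API: the `SCHEMA` layout and the `validate_request` helper are hypothetical, and the feature names are borrowed from the credit card fraud dataset.

```python
from typing import Any, Dict, List, Sequence, Tuple

# Hypothetical schema: (feature name, expected type) pairs in model input order.
SCHEMA: List[Tuple[str, type]] = [("Time", float), ("V1", float), ("Amount", float)]


def validate_request(
    features: Dict[str, Any], schema: Sequence[Tuple[str, type]] = SCHEMA
) -> List[Any]:
    """Check a request against the schema and return values in schema order."""
    expected_names = [name for name, _ in schema]
    missing = [name for name in expected_names if name not in features]
    if missing:
        raise ValueError(f"missing features: {missing}")
    unknown = [name for name in features if name not in expected_names]
    if unknown:
        raise ValueError(f"unexpected features: {unknown}")
    ordered = []
    for name, expected_type in schema:
        value = features[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise TypeError(
                f"{name} must be {expected_type.__name__}, got {type(value).__name__}"
            )
        ordered.append(value)
    return ordered


# The caller may send keys in any order; the model receives schema order.
model_input = validate_request({"Amount": 120.5, "Time": 0.0, "V1": -1.3})
```

Because the request is keyed by name, the caller no longer has to remember the positional order the model was trained with.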
For more information on how to structure the schema, please take a look at your TeachableHub's Docs Advanced Deployment Guide section.
Moreover, if someone changes the Validation Schema or the Classes, or installs a new package, the change is made clear in the documentation, so engineers can take appropriate action and avoid unwanted mistakes when releasing the new version.
But the sweetest thing here is that all of this documentation is fully auto-generated and comes out of the box with every single Teachable version. Yes, you read that right - unique documentation for each deployment version, ready in seconds in a user-friendly UI, without anyone on the team spending tedious hours writing or updating it.
To ensure that all deployed models work as expected and have better usage examples for everyone who will integrate the Teachable across different software, deployment samples must be provided. They are used for the Auto-Generated Documentation, Model Validation as well as many other UI helpers and guides. They can be provided manually or taken automatically from your training data, as it’s done in the code block below.
The deployment verification process begins after the model is uploaded to TeachableHub. Each version goes through a series of checks ensuring that predictions can be made with the samples provided above, that serving will work as expected, that the code and the API are compatible, and more. If the verification fails, the particular version is marked in red and an info bar explains the reasons for the failure, making it easier to debug.
Monitoring in production
To simulate a drift scenario, we'll consider the following: After the model's deployment, there’s a sudden influx of fraudsters who purchase more than $2,000 worth of goods. To highlight how WhyLabs can be used to detect these types of changes in real-world behaviors, we’ll simulate this change by modifying our data.
This can be considered as an example of concept drift, as the relationship between input and output changes. We'll split the production dataset into two parts. In the first part, unchanged data will be sent, and for the second part, all transactions above $2,000 will be changed to be marked as fraudulent.
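The relabeling step can be sketched as follows. This is a standard-library illustration under stated assumptions, not the notebook's code: the `simulate_drift` helper is hypothetical, the split is assumed to be an even half-and-half, and the toy rows are made up.

```python
from typing import Dict, List, Tuple

Row = Dict[str, float]


def simulate_drift(
    rows: List[Row], threshold: float = 2000.0
) -> Tuple[List[Row], List[Row]]:
    """Split rows in two; in the second part, mark Amount > threshold as fraud."""
    midpoint = len(rows) // 2
    # Copy every row so the caller's data is never mutated.
    unchanged = [dict(row) for row in rows[:midpoint]]
    drifted = []
    for row in rows[midpoint:]:
        row = dict(row)
        if row["Amount"] > threshold:
            row["Class"] = 1.0  # relabel as fraudulent
        drifted.append(row)
    return unchanged, drifted


# Toy production rows (values are made up).
production = [
    {"Amount": 2500.0, "Class": 0.0},
    {"Amount": 40.0, "Class": 0.0},
    {"Amount": 3100.0, "Class": 0.0},
    {"Amount": 15.0, "Class": 0.0},
]
unchanged, drifted = simulate_drift(production)
```

Only the labels change while the input features stay the same, which is why the model keeps predicting as before and its measured performance falls.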
Before integrating our Teachable into the end-user application and letting external users make requests to the model, we want to serve a few internal predictions. To quickly check that everything is working properly, users can test each version and make a prediction with the provided samples straight from the TeachableHub UI, without a single line of code, using the PredictMan feature.
Once we are certain everything is working properly, we can integrate the model. The next step is to import the packages and configure all the variables, following the same logic as for the deployment. The cool thing is that TeachableHub auto-generates integration code snippets for commonly used languages, containing the samples provided above. That way, the person responsible for integrating the API into the end-user app saves time writing code.
Log - Unchanged Production Dataset
Now that we’ve set up our prediction environment, we can log our dataset (representing day one of the model being in production) to WhyLabs, and see that its performance is similar to the performance of the model on the test dataset.
To do so, we'll iterate through the data frame and request the prediction for each transaction. In addition to the input features, we'll also log the model's performance results, based on the prediction's results (prediction and score) and also the ground truth (target), which is the Class feature from the df.
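The shape of that loop can be sketched without any serving dependency. In the illustration below, `collect_results` and `stub_predict` are hypothetical: `stub_predict` merely stands in for the call to the deployed model and uses a made-up rule, and plain dictionaries stand in for data frame rows.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
Prediction = Tuple[int, float]  # (predicted class, probability score)


def collect_results(
    rows: List[Row], predict: Callable[[Row], Prediction]
) -> Tuple[List[int], List[int], List[float]]:
    """Request a prediction per row and gather targets, predictions, and scores."""
    targets: List[int] = []
    predictions: List[int] = []
    scores: List[float] = []
    for row in rows:
        # Never send the ground-truth label along with the input features.
        features = {name: value for name, value in row.items() if name != "Class"}
        label, score = predict(features)
        targets.append(int(row["Class"]))
        predictions.append(label)
        scores.append(score)
    return targets, predictions, scores


def stub_predict(features: Row) -> Prediction:
    """Hypothetical stand-in for the serving call, using a made-up rule."""
    score = 0.9 if features["Amount"] < 5.0 else 0.1
    return int(score >= 0.5), score


rows = [{"Amount": 1.0, "Class": 1.0}, {"Amount": 50.0, "Class": 0.0}]
targets, predictions, scores = collect_results(rows, stub_predict)
```

The three parallel lists are the inputs a performance-metrics logging call needs.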
Log - The Drifted Portion of the Production Dataset
We'll do the same thing for the drifted portion of the production dataset. Now, we'll log with today's timestamp to be able to compare profiles. To inspect the results in WhyLabs, we can go to our model's dashboards and check the performance.
We can see in the dashboard that, although our model’s precision (the number of true positives divided by the number of predicted positives) has remained more or less constant, our recall (the number of true positives divided by the number of true positives plus the number of false negatives) dropped significantly (from around 90% to around 40%), resulting in a degradation in our F1 score.
This is a direct consequence of the drift that we introduced for transactions over $2,000. Because our model now misses so many of those fraudulent transactions, recall has dropped, and overall model performance with it. Fortunately, once we’ve detected this degradation with WhyLabs, we can go back, retrain our model, and redeploy it with TeachableHub.
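The definitions above translate directly into code. The sketch below is a standard-library illustration with a hypothetical `classification_metrics` helper; the label lists are made-up toy data chosen to show how recall can fall while precision stays put, and they are not the results from the dashboard.

```python
from typing import Dict, List


def classification_metrics(targets: List[int], predictions: List[int]) -> Dict[str, float]:
    """Compute precision, recall, and F1 for the positive (fraud) class."""
    pairs = list(zip(targets, predictions))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy baseline: 10 frauds, 9 caught, 1 false alarm.
baseline = classification_metrics(
    [1] * 10 + [0] * 5,
    [1] * 9 + [0] * 1 + [1] * 1 + [0] * 4,
)

# Toy drifted batch: 20 frauds, still only 9 caught, 1 false alarm.
drifted = classification_metrics(
    [1] * 20 + [0] * 5,
    [1] * 9 + [0] * 11 + [1] * 1 + [0] * 4,
)
```

In the toy drifted batch, the newly relabeled frauds all become false negatives, so they lower recall without touching precision, and F1 follows recall down.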
With the complementary solutions of TeachableHub and WhyLabs, deploying a model into production and maintaining its performance can be done quickly and simply. This means that teams of all sizes can step on the same solid foundation of established MLOps standards, built-in workflows and best practices.
Are you struggling to streamline your deployment process and automate repetitive tasks? Releasing new models or updated versions daily can be hard, but it doesn't have to be. Book a demo of TeachableHub and let us show you how, in just a few clicks, you can increase your iteration speed at scale in a safe and secure manner.
And once you’ve got your models deployed, monitoring them is a breeze with WhyLabs. Try it for yourself and monitor one of your ML models with the completely free, self-service Starter edition from WhyLabs.