Model Monitoring for Financial Fraud Classification
- AI Observability
- Whylogs
- ML Monitoring
- WhyLabs
Sep 19, 2022
Every $1 of fraud loss costs financial services firms roughly $4 in total [1]. These losses stem from incurred interest, fines and legal fees, labor and investigation costs, and external recovery expenses. To limit these losses, financial services firms deploy machine learning models that predict whether a transaction is fraudulent based on historical data. But how can these firms be confident that their models remain effective after they’ve been deployed in production? Issues that can affect a model’s effectiveness include:
- The economic environment changing transaction patterns
- The definition of a fraudulent transaction changing
- Schema changes in how data is being recorded
- Models receiving new values for the first time, causing faulty predictions or outright model failure
When no safeguards are in place, financial services firms suffer more losses because they fail to catch model performance degradation. This is where model monitoring comes in: machine learning engineers can monitor their models across different performance metrics and KPIs, get alerted when there are anomalies, and take immediate corrective action.
Financial institutions can use the WhyLabs Observatory to monitor their machine learning models. With its unique approach to monitoring, WhyLabs ensures financial data remains secure and private, and scales with the volume of data being processed by the models. Since data never leaves its environment, there is no risk of any data leaks or misuse of data.
Start monitoring your data and models now - sign up for a free starter account or request a demo!
We’ll use a fraud transaction classification model example to show how the WhyLabs Observatory can be used in the financial services industry. The data will be logged with whylogs, and monitored with the WhyLabs Observatory.
A little bit about WhyLabs
WhyLabs Observatory is the solution for detecting and alerting on data issues as data is fed into a machine learning model, including data drift, new unique values, missing values, and more. Financial services firms can use the Observatory to minimize losses from fraud. The platform can identify data quality issues and changes in a dataset’s distribution, detect anomalies, and send notifications. It can also show which aspects of the data have issues, speeding up time to resolution. This reduces time spent debugging, so data scientists and machine learning engineers can spend more time developing and deploying models that provide value for the business.
The WhyLabs platform monitors data, whether it is being transformed in a feature store, moving through a data pipeline (batch or real-time), or feeding into AI/ML systems or applications. The platform has two components: the open-source whylogs logging library and the WhyLabs Observatory. The whylogs library fits into existing tech stacks through a simple Python or Java integration and supports both structured and unstructured data. No data is copied, duplicated, or moved out of the environment, eliminating the risk of data leaks. whylogs analyzes the whole dataset and creates a statistical profile of the different aspects of the data. By creating statistical profiles, whylogs captures rare events, seasonality, and outliers that might otherwise be missed with sampling, while keeping sensitive financial data private.
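To make this concrete, here is a minimal sketch (assuming a pandas dataframe read from a hypothetical file path) of how whylogs turns a dataset into a lightweight statistical profile that can be inspected locally:
import pandas as pd
import whylogs as why

# Hypothetical path; any pandas dataframe works
df = pd.read_csv("path/to/your/data.csv")

# Profile the dataframe; the profile stores summary statistics, not the raw rows
results = why.log(df)
profile_view = results.view()

# Inspect the captured statistics (counts, types, distributions) as a pandas dataframe
print(profile_view.to_pandas().head())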
Once whylogs profiles are ingested into the WhyLabs Observatory, monitors are enabled and anomaly detection is run on the profiles. Pre-built monitors can be enabled with a single click to look for data drift, null values, data type changes, new unique values, and model performance metrics (e.g. Accuracy, Precision, Recall, and F1 Score). If there isn’t a pre-built monitor for a particular data issue or model metric, a guided wizard is available for creating a custom monitor. If anomalies are detected, notifications are generated showing which aspects of the data or model have issues. For more on data and model monitoring, go here.
Discussion of the dataset and data dictionary
The dataset used for this example is a modified version of this Kaggle dataset and has 9 features (model inputs) and 1 target (model output, isFraud). For this example, the “nameDest” column has been replaced with a column called “country”, and the target imbalance (isFraud) was changed to 91.7% not fraudulent (0) and 8.3% fraudulent (1).
The data dictionary is:
- step: Maps a unit of time in the real world. In this case, 1 step is 1 hour of time
- type: Type of transaction with 5 unique types: CASH-IN, CASH-OUT, DEBIT, PAYMENT, and TRANSFER (later two additional types are added, PAYPAL and VENMO)
- amount: Amount of the transaction in US Dollars
- nameOrig: Customer who started the transaction
- oldbalanceOrig: Initial balance before the transaction
- newbalanceOrig: Customer’s balance after the transaction
- country: Country where the transaction occurred
- oldbalanceDest: Initial recipient balance before the transaction
- newbalanceDest: Recipient’s balance after the transaction
- isFraud (Target): If transaction is fraudulent. 0 (not fraudulent), 1 (fraudulent)
NOTE: Although the Financial Fraud example has 9 inputs, WhyLabs is capable of monitoring models with 1000s of model inputs.
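For illustration only, a few synthetic rows matching this schema might look like the following (all values below are made up):
import pandas as pd

# Synthetic, made-up rows that follow the data dictionary above
sample = pd.DataFrame({
    "step": [1, 1, 2],
    "type": ["PAYMENT", "CASH-OUT", "TRANSFER"],
    "amount": [9839.64, 181.00, 215310.30],
    "nameOrig": ["C1231006815", "C840083671", "C1670993182"],
    "oldbalanceOrig": [170136.0, 181.0, 705.0],
    "newbalanceOrig": [160296.36, 0.0, 0.0],
    "country": ["US", "US", "CA"],
    "oldbalanceDest": [0.0, 21182.0, 22425.0],
    "newbalanceDest": [0.0, 0.0, 0.0],
    "isFraud": [0, 1, 1],
})
print(sample.dtypes)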
Getting started with WhyLabs
To start using WhyLabs, sign up for a free account! After creating an account and logging in to the Observatory, the Project Dashboard will appear.
Once in the Project Dashboard, click “Create a project,” and a new box will appear. Under “Model project”, click on “Set up model.” A new screen will appear, where you’ll be prompted to enter a project name and type (if applicable). The below .gif shows this process.
After creating a new project, click on WhyLabs in the top left to get back to the Observatory. Once in the Observatory, find the project card that was just created. Then, at the bottom of that project card, click on “Set up a whylogs integration.” A new page will appear. Locate the dropdown menu underneath “Select a model project for whylogs integration,” click on the dropdown, and select the newly created model. Note the Model ID, which in the example below is model-42. Copy it, as it will be needed when logging profiles with whylogs.
Once the model has been selected, copy the Organization ID underneath (it would be similar to org-xxxxxx) as that will be needed for whylogs. Then click the orange button underneath that says “Create Access Token.” This will generate an API Access Token. Copy the API Access Token. The Model ID, Organization ID, and API Access Token will be used by whylogs to send the profiles to the correct model in the WhyLabs Observatory.
The below example discusses how production data can be profiled with whylogs using a pipelined XGBoost model in Python. An example notebook that shows how to use whylogs with Python can be found here. whylogs creates profiles of data by taking in the data as a pandas dataframe (df).
To use whylogs, a few imports and configurations need to be done. The code below shows how to set up whylogs for use in Python.
# if whylogs isn't installed, it can be installed via pip:
# pip install "whylogs[whylabs]"
# after importing, run print(why.__version__) to confirm you are using at least whylogs v1.1
# importing whylogs
import whylogs as why
import os
os.environ["WHYLABS_API_KEY"] = "The WhyLabs API Generated Access Token Here"
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "The WhyLabs Organization ID Here"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "The WhyLabs Model ID Here That Will Receive the whylogs Profiles"
from whylogs.api.writer.whylabs import WhyLabsWriter
# importing pandas to read data into a dataframe. whylogs takes in dataframes.
import pandas as pd
# read in data as a dataframe, dataframe will be used for profiling with whylogs
df = pd.read_csv("path/to/your/data.csv")
After whylogs is imported, import all necessary libraries needed for building and training an XGBoost model. Then load the training data, train the XGBoost model on the training data, create a pipelined XGBoost model that is fitted on the training data, and feed production data to the pipeline model. For an idea of what this looks like, see the below starter code.
Importing additional necessary libraries:
# For data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# For building and training an XGBoost model
from xgboost import XGBClassifier
# For getting different metric scores, and splitting the data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    classification_report,
)
# For data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# For creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# For defining maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# For suppressing scientific notation in dataframes
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# For suppressing warnings
import warnings
warnings.filterwarnings("ignore")
After reading the data into a dataframe and verifying it is acceptable for training, build and train an XGBoost model:
# Making a copy of the dataframe
training = df.copy()
# Separating the target variable ("Target" is a placeholder for your target column, e.g. "isFraud")
X = training.drop(columns = "Target")
Y = training["Target"]
# One-hot encoding the non-numeric columns while keeping the numeric columns
non_numeric_cols = X.select_dtypes(include = "object").columns.tolist()
ohe = OneHotEncoder()
encoded = pd.DataFrame(
    ohe.fit_transform(X[non_numeric_cols]).toarray(),
    columns = ohe.get_feature_names_out(non_numeric_cols),
    index = X.index,
)
X = pd.concat([X.drop(columns = non_numeric_cols), encoded], axis = 1)
# Splitting data into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 1, stratify = Y)
print(X_train.shape, X_test.shape)
# Building and training an XGBoost model
model = XGBClassifier()
model.fit(X_train, Y_train)
preds = model.predict(X_test)
print(f'Accuracy = {accuracy_score(Y_test, preds):.2f}\nRecall = {recall_score(Y_test, preds):.2f}\n')
# Evaluating the model with a classification report and confusion matrix
cm = confusion_matrix(Y_test, preds)
report = classification_report(Y_test, preds)
print(report)
plt.figure(figsize=(8, 6))
plt.title('XGBoost Classifier Confusion Matrix (with original data)', size=16)
sns.heatmap(cm, annot=True, cmap='Blues');
Once the model is trained and its performance is acceptable, pipeline it:
# Pipelining XGBoost with default parameters
Model = Pipeline(
    steps=[
        (
            "XGBoost",
            XGBClassifier(),
        ),
    ]
)
# Making a copy of the dataframe
training = df.copy()
# Separating the target variable and other variables
X = training.drop(columns = "Target")
Y = training["Target"]
# One-hot encoding the non-numeric columns with the encoder fitted above
encoded = pd.DataFrame(
    ohe.transform(X[non_numeric_cols]).toarray(),
    columns = ohe.get_feature_names_out(non_numeric_cols),
    index = X.index,
)
X = pd.concat([X.drop(columns = non_numeric_cols), encoded], axis = 1)
# Fitting the pipelined model on the full training data
Model.fit(X, Y)
Feed production data into the pipeline model:
# Importing production data
production = pd.read_csv("path/to/your/production.csv")
# Making a copy of the dataframe
prod = production.copy()
# Separating the target variable and other variables
prod_x = prod.drop(columns = "Target")
prod_y = prod[["Target"]].copy()
# Encoding the non-numeric columns with the encoder fitted on the training data
encoded = pd.DataFrame(
    ohe.transform(prod_x[non_numeric_cols]).toarray(),
    columns = ohe.get_feature_names_out(non_numeric_cols),
    index = prod_x.index,
)
prod_x = pd.concat([prod_x.drop(columns = non_numeric_cols), encoded], axis = 1)
# Predicting on production data
model_predictions = Model.predict(prod_x)
# Predicting probabilities
predict_proba = Model.predict_proba(prod_x)
# Determining the model's confidence score for each prediction
scores = [max(p) for p in predict_proba]
# Adding a new column called prediction_output with the model's predictions
prod_y["prediction_output"] = model_predictions
# Adding a new column called output_scores with the model's prediction confidence scores
prod_y["output_scores"] = scores
# Combining the two dataframes before logging with whylogs (if any outputs are going to be logged)
prod = pd.concat([prod_x, prod_y], axis = 1)
# Renaming the target column to include "output" so WhyLabs will recognize it as the actual output
prod = prod.rename(columns = {"Target": "output_Target"})
Once production data is available, profiling with whylogs can start; it only takes a few lines of code. If ground truth (the actual outcome) is available, the data and ground truth can be logged with the code below:
results = why.log_classification_metrics(
    prod,
    # The target column must contain the word "output" as part of its name, e.g. "output_Target"
    target_column = "output_Target",
    prediction_column = "prediction_output",
    score_column = "output_scores"
)
profile = results.profile()
results.writer("whylabs").write()
If only model inputs are available, the data can be logged with the code below.
profile_results = why.log(prod)
profile_results.writer("whylabs").write()
WhyLabs AI Observatory platform
Day 1 (7/21)
Now that profiles are logged, the first day of using the WhyLabs Observatory is focused on getting familiar with the data and setting up appropriate monitors for the fraud classification model. When first logging into the WhyLabs Observatory, the Project Dashboard appears, which is a central view of all pipelines/models being monitored. For every project being monitored, there is a high-level overview of the number of detected anomalies as well as the categories the alerts fall into. The Financial Fraud model is the one investigated here.
Looking at the first day (7/21) of profiles in Profile view, the WhyLabs Observatory automatically displays the statistics captured from the whylogs profiles and automatically generates visualizations from those statistics showing the distribution of the data.
After understanding what the data looks like, appropriate monitors can be enabled. If you’re unsure what kinds of monitors are available, the WhyLabs Observatory has preset monitors in the Model Manager. Multiple monitors can be enabled, and a monitor’s anomaly threshold can be configured to whatever is appropriate for the model. Custom monitors can be created through a no-code wizard or a code editor. Actions can also be specified in response to a monitor’s alert. For example, WhyLabs supports using a Webhook to trigger a model re-training pipeline if a model’s F1 Score falls below 10% (a sketch of such a webhook receiver is shown below).
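As a rough illustration of the re-training Webhook idea, the sketch below assumes a small Flask service, a hypothetical retrain_model() job, and generic JSON payload fields; the actual fields sent by a WhyLabs notification action may differ:
from flask import Flask, request

app = Flask(__name__)

def retrain_model():
    # Placeholder for kicking off a re-training pipeline (e.g. an Airflow DAG or a CI job)
    print("Re-training job triggered")

@app.route("/whylabs-alert", methods=["POST"])
def handle_alert():
    alert = request.get_json(force=True)
    # Hypothetical field names -- only re-train when a performance metric drops below the threshold
    if alert.get("metric") == "classification.f1" and alert.get("value", 1.0) < 0.10:
        retrain_model()
    return {"status": "received"}, 200

if __name__ == "__main__":
    app.run(port=8000)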
Through WhyLabs -> Settings -> Notifications panel, users can specify how they would like to receive anomaly notifications.
Day 2 (7/22)
Now that profiles are in the platform and monitors are enabled and configured, data comparison can begin. By going to the Profiles UI, the Day 2 (7/22) profile can be compared with the Day 1 (7/21) profile. The WhyLabs Observatory generates a side-by-side comparison of the captured statistics as well as the generated visualizations. The profile comparison shows that the “step” input has somewhat different data compared to the previous day; however, no alert has been triggered.
To ensure there are no anomalies, go into the Anomalies Feed section of the Monitor Manager to confirm if there are anomalies.
For this example, ground truth is available. In the Performance section of the platform, classification metrics (Accuracy, Precision, Recall, F1 Score) are automatically calculated and displayed, showing how the model performs over time.
If there are additional model-specific metrics or business KPIs that need to be tracked, custom code can be written with whylogs to define these metrics and KPIs, and they will be tracked and displayed in the Performance section of WhyLabs (see the sketch below).
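One simple way to do this, sketched below under the assumption that the prod dataframe from the earlier code is available and using a hypothetical KPI column named estimated_fraud_loss, is to compute the KPI as an extra column before profiling so its statistics are tracked alongside the model’s inputs and outputs:
# Hypothetical business KPI: the transaction amount at risk on predicted-fraud transactions
prod["estimated_fraud_loss"] = prod["amount"] * (prod["prediction_output"] == 1)

# The KPI column is profiled like any other feature and can then be monitored in WhyLabs
why.log(prod).writer("whylabs").write()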
Day 3 (7/23)
On Day 3, a notification appears for the first time. Going to the Anomalies Feed, information is provided on what the anomaly is: the “amount” feature had a significant distribution shift.
Going to the Performance metrics section, all the model metrics show a significant drop-off. What’s concerning is that the False Negatives (lower left corner of the Confusion Matrix) increased by almost 6X, meaning the model is missing almost 6X more fraudulent transactions, classifying them as legitimate when they actually were fraudulent. If not fixed, the model could lead the business to spend more money handling fraudulent transactions than allocated.
To figure out what’s changed with “amount”, a good place to start would be to do a profile comparison comparing today’s profile (7/23) to the previous day’s profile (7/22). Looking at the statistics between the two profiles as well as the visualizations, it looks like on 7/23 the values for “amount” came in as cents as opposed to the usual dollars, which means a schema change happened somewhere upstream of the model. Through this data comparison, the WhyLabs Observatory was able to help debug what happened with “amount”.
If users want to compare the data distribution to previous days, they can go to Inputs and click on “amount”. Different distributions are seen, and looking at the Estimated Quantiles Distribution, there is a data distribution alert, since the distribution on this day exceeded the anomaly threshold. With this information, users can talk to the data engineering team to address the schema change and make sure “amount” is recorded in dollars again.
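Independently of WhyLabs, a quick sanity check like the sketch below (reusing the df and production dataframes from the earlier code; the 50x threshold is an arbitrary assumption) can flag this kind of unit change before data reaches the model:
# If "amount" suddenly arrives in cents instead of dollars, its median jumps by roughly 100x
reference_median = df["amount"].median()        # median from the training data
todays_median = production["amount"].median()   # median from today's production batch

if todays_median > 50 * reference_median:
    print("Possible schema change: 'amount' may now be recorded in cents instead of dollars")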
Day 6 (7/26)
After addressing the “amount” schema change with the data engineering team on Day 3, Days 4 and 5 were normal days with no WhyLabs Observatory alerts. However, on Day 6 (7/26), multiple alerts went off, and the Anomalies Feed confirms them.
After reviewing the alerts in the Anomalies Feed, going over to the Performance tab shows how catastrophic they are: all the model performance metrics have plummeted to 0.
Seeing that the “type” input was labeled a high priority alert in the Anomalies Feed, a good place to look is Profiles to see what happened to “type”.
Doing a profile comparison between Day 6 (7/26) and Day 5 (7/25) shows that two new transaction types, PAYPAL and VENMO, appeared on 7/26.
This was further confirmed by going to Inputs, clicking on “type” and seeing the visualization below.
Checking the Output section and clicking on the “output - isFraud” output confirms that the number of missing values went up. PAYPAL and VENMO weren’t transaction types included during model training, so when the classification model saw them for the first time, it didn’t know how to predict on those transactions and broke.
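A lightweight check like the sketch below (assuming the training and production dataframes from the earlier code are available) can surface unseen categories before predictions are made:
# Compare the transaction types seen in training with those arriving in production
known_types = set(training["type"].unique())
incoming_types = set(production["type"].unique())

new_types = incoming_types - known_types
if new_types:
    print(f"Unseen transaction types in production: {sorted(new_types)}")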
After confirming that PAYPAL and VENMO are two new transaction types introduced into the production pipeline without the data scientists’ knowledge, it’s clear a miscommunication led to model failure. Connecting a Webhook to a model performance metric monitor can limit the damage: once the model performance metric drops below the set threshold, the Webhook triggers a re-training job on new data, so the model can predict on new incoming data without causing costly, unplanned downtime.
Day 7 (7/27)
After what happened on Day 6, the Webhook-connected monitor triggered an automated pipeline to re-train the model on the new data. On Day 7, the model recognized PAYPAL and VENMO, and the metrics returned to their expected levels.
Conclusion
As you can see, without monitoring in place, some of the anomalies the WhyLabs Observatory detected might have gone unnoticed for long periods. Even when they were noticed, data scientists and MLOps engineers might struggle to debug why a model’s performance degraded without knowing where to look. The WhyLabs Observatory acts as a safety mechanism for models, ensuring they provide the results the business needs and that issues are addressed as soon as they arise, rather than after losses have piled up.
Sign up to try WhyLabs Observatory for free and start monitoring your data and models today!
Please check out the Resources section below to get started with whylogs and WhyLabs. If you’re interested in learning how you can apply data and/or model monitoring to your organization, please contact us, and we would be happy to talk!
Resources
Sources