
Production-Ready Models with Databricks and WhyLabs

Every AI practitioner turns to Databricks when they need to build massively scalable AI and data pipelines. But once in production, these pipelines are prone to silent failures caused by missing values, schema changes, and broken feature transformations. Troubleshooting these failures in terabyte-sized pipelines is like looking for a needle in a needlestack: it takes weeks and often causes significant financial losses through degraded customer experience. That's where WhyLabs comes in, enabling AI and data monitoring that works seamlessly in the Databricks environment. The WhyLabs AI Observatory is built to monitor data at any scale, in a distributed environment, without the need to sample or move data around.

Unlocking capabilities that are almost too good to be true

WhyLabs has partnered with Databricks to enable a unique integration that computes all of the key telemetry needed for AI monitoring directly in Apache Spark. The telemetry is collected in parallel with the AI pipeline using the WhyLabs-MLflow integration. With this integration, every data science and machine learning team can monitor data quality at a rate of 1M rows per second, while the computation remains fully distributed.

With WhyLabs, teams can answer questions about data of any size:

  • How do I know if the data my model depends on is always high quality? A spike in null values or a sudden change in schema can cause your model to produce unexpected results; continuously monitoring your data is essential for a production-ready model.
  • How can I identify data drift before it causes my model to degrade? Data drift can be gradual or abrupt, and either way it causes your model to produce less accurate results. Without a way to continually compare datasets against a baseline over time, identifying drift in your data is a very manual process.
  • How can I gauge the performance of my models over time and know when it degrades? Understanding whether a model is healthy requires more than what application or infrastructure tools can show. You need visibility into predictions as well as ground truth to understand how your model is performing over time.
  • How can I identify bias in my predictions and correlate it to specific segments of my dataset? Bias in your predictions can be challenging to identify unless you monitor predictions at scale and look for segments that are performing poorly.

One solution for observability across the entire Databricks ecosystem

WhyLabs AI Observatory enables out-of-the-box monitoring for your data pipelines and models running in Databricks across all data types: structured, semi-structured, unstructured, and streaming. The Lakehouse architecture is incredibly flexible, covering use cases across business intelligence, data streaming, machine learning, and generative AI, and WhyLabs provides an integration for every one of them. We briefly introduce each of these use cases below; if you are looking for advice on the best place to plug in WhyLabs for your unique setup, ping us on the community Slack channel.

Big data and streaming data pipelines with Spark and Delta Live Tables

Data is the lifeblood of any AI and BI application, so ensuring that this data is high quality and observable is critical to the health of these applications. Across our customers, the most common performance degradation in AI models is caused by data quality bugs: missing values, changes in distribution, or the introduction of new categories. WhyLabs provides an easy-to-integrate and cost-effective solution for monitoring all key data quality metrics, alerting team members about issues, and helping with root cause analysis. WhyLabs integrates with Delta Live Tables and Spark, making it easy to set up observability for any data pipeline, as shown in the examples below.

from pyspark.sql import SparkSession
from pyspark import SparkFiles
from whylogs.api.pyspark.experimental import collect_dataset_profile_view
import whylogs as why
from whylogs.api.writer.whylabs import WhyLabsWriter

spark = SparkSession.builder.appName('whylogs-testing').getOrCreate()
# Enable Apache Arrow for efficient data exchange between Spark and the
# pandas-based profiling that whylogs performs on each partition.
arrow_config_key = "spark.sql.execution.arrow.pyspark.enabled"
spark.conf.set(arrow_config_key, "true")

# Example dataset: UCI wine quality (red wine).
data_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
spark.sparkContext.addFile(data_url)

spark_dataframe = spark.read.option("delimiter", ";") \
  .option("inferSchema", "true") \
  .csv(SparkFiles.get("winequality-red.csv"), header=True)

# Profile the Spark DataFrame in a distributed fashion; only the resulting
# statistical profile is collected back to the driver.
dataset_profile_view = collect_dataset_profile_view(input_df=spark_dataframe)

# Send the profile to the WhyLabs AI Observatory (assumes WhyLabs credentials
# are configured, e.g. through environment variables).
writer = WhyLabsWriter()
writer.write(file=dataset_profile_view)
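
The same profiling call can run inside a Delta Live Tables pipeline. Below is a minimal sketch, assuming the Delta Live Tables Python API (import dlt) and a hypothetical upstream table named raw_wine_quality; only the whylogs profile leaves the pipeline.

import dlt
from whylogs.api.pyspark.experimental import collect_dataset_profile_view
from whylogs.api.writer.whylabs import WhyLabsWriter

@dlt.table(comment="Wine quality records, profiled with whylogs")
def wine_quality_profiled():
    # Hypothetical upstream table; replace with your own source.
    df = spark.read.table("raw_wine_quality")

    # Profile the batch where it lives and ship only the telemetry to WhyLabs.
    profile_view = collect_dataset_profile_view(input_df=df)
    WhyLabsWriter().write(file=profile_view)

    return df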

ML model experimentation and serving with MLflow

ML models power the most important applications across enterprises, from marketing to core product experiences. The MLflow toolchain makes it simple to track experiments during model development as well as to serve models for inference in production. Once a model is in production, monitoring its performance and the health of its inputs is crucial to ensuring ROI; without monitoring, ML models fail silently because of data drift, bias, and data quality issues. WhyLabs integrates with MLflow experiment tracking to build a training data baseline and assess training data for quality issues. Once in production, WhyLabs integrates with MLflow serving to continuously monitor model performance and ensure that model inputs are not drifting and causing training-serving skew.
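
As an illustration, here is a minimal sketch of capturing a training data baseline inside an MLflow run. It assumes the whylogs v1 API, WhyLabs credentials configured through environment variables, and a hypothetical training_features.csv file; adapt the names to your own pipeline.

import mlflow
import pandas as pd
import whylogs as why
from whylogs.api.writer.whylabs import WhyLabsWriter

# Hypothetical training data; substitute your own feature set.
training_df = pd.read_csv("training_features.csv")

with mlflow.start_run():
    # ... train and log your model with MLflow as usual ...

    # Profile the training data to establish the baseline that production
    # inputs will be compared against.
    baseline = why.log(training_df)

    # Send the baseline profile to the WhyLabs AI Observatory.
    WhyLabsWriter().write(file=baseline.view())

    # Tag the run so the baseline profile can be traced back to this model.
    mlflow.set_tag("whylogs_baseline_logged", "true")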

AI applications built on Dolly

Databricks' Dolly is an instruction-following large language model trained on the Databricks machine learning platform. Large language models (LLMs) like Dolly are transforming the landscape of AI applications, from genuinely helpful chatbots to nearly autonomous code generation tools. But this incredible technology doesn't come without deployment challenges: these models are prone to hallucinations, biases, and privacy and security loopholes. WhyLabs makes it easy to monitor and safeguard Dolly and other LLMs hosted on Databricks using LangKit, our industry standard for LLM monitoring. LangKit detects and prevents malicious prompts, toxicity, hallucinations, and jailbreak attempts. Here is an example of integrating LangKit with LangChain to build AI applications.

from langkit import llm_metrics  # LangKit's out-of-the-box text quality and safety metrics
from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFacePipeline
from langchain.callbacks.whylabs_callback import WhyLabsCallbackHandler
# InstructionTextGenerationPipeline and load_model_tokenizer_for_generate
# come from the databricks/dolly repository (training/generate.py).
from training.generate import InstructionTextGenerationPipeline, load_model_tokenizer_for_generate

model, tokenizer = load_model_tokenizer_for_generate("databricks/dolly-v2-3b")

prompt = PromptTemplate(input_variables=["instruction"], template="{instruction}")

# The WhyLabs callback sends prompt and response telemetry to your WhyLabs project.
whylabs = WhyLabsCallbackHandler.from_params(
    org_id="<your-org>", api_key="<your-key>", dataset_id="<your-dataset>"
)

hf_pipeline = HuggingFacePipeline(
    pipeline=InstructionTextGenerationPipeline(
        model=model, tokenizer=tokenizer, return_full_text=True, task="text-generation"
    ),
    callbacks=[whylabs],
)

llm_chain = LLMChain(llm=hf_pipeline, prompt=prompt)

response = llm_chain.run(instruction="Explain to me the difference between nuclear fission and fusion.")

Privacy-preserving and cost-effective integration

WhyLabs integrates in a unique privacy-preserving way. The integration does not require users to move data outside of their existing environment: all data processing happens within the Databricks Lakehouse and involves no data duplication or sampling. This approach significantly reduces security risks and ensures that data is handled with the highest level of privacy and confidentiality. The processing is done by highly optimized telemetry agents (whylogs), which enable telemetry collection in a fully distributed manner. Telemetry agents are also very cost-effective, since computing the telemetry adds minimal overhead to existing pipelines (see benchmarks). The WhyLabs AI telemetry agents have been battle-tested at massive data scale by organizations like Lyft, StitchFix, Square, and Yahoo Japan.
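
To make this concrete, here is a minimal sketch of what a telemetry agent produces, assuming the whylogs v1 API and the same wine quality file used above: the profile contains only aggregate statistics, never raw rows.

import pandas as pd
import whylogs as why

# Load data where it already lives; nothing is copied elsewhere.
df = pd.read_csv("winequality-red.csv", sep=";")

# The telemetry agent condenses the data into a statistical profile
# (counts, types, distribution sketches) without retaining raw records.
profile_view = why.log(df).view()

# Inspect the aggregate telemetry locally before (or instead of) sending it anywhere.
print(profile_view.to_pandas())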

Take the guesswork out of AI/ML and data health today

Enabling WhyLabs on any Databricks pipeline is quick and simple. Get started with this easy example and see the power of WhyLabs observability in a matter of minutes. If you’d like to talk about your specific use case, reach out over Slack or schedule time with our solution architects - we’d be happy to help!
