WhyLabs' Data Geeks Unleashed

Hello WhyLabs followers! This month three members of the WhyLabs team are speaking at the Data and AI Summit. In this post we've included descriptions and links to all their talks.

Data and AI Summit 2021 (Virtual)

Alessya Visnjic will be presenting on The Critical Missing Component in the Production ML Stack on May 26, 2021 at 03:50 pm PT:


The day the ML application is deployed to production and begins facing the real world is the best and the worst day in the life of the model builder. The joy of seeing accurate predictions is quickly overshadowed by a myriad of operational challenges. Debugging, troubleshooting & monitoring take over the majority of their day, leaving little time for model building. In DevOps, software operations are taken to the level of an art. Sophisticated tools enable engineers to quickly identify and resolve issues, continuously improving software stability and robustness. In the ML world, operations are still largely a manual process that involves Jupyter notebooks and shell scripts. One of the cornerstones of the DevOps toolchain is logging. Traces and metrics are built on top of logs, enabling monitoring and feedback loops. What does logging look like in an ML system?
In this talk we will demonstrate how to enable data logging for an AI application using MLflow in a matter of minutes. We will discuss how something so simple enables testing, monitoring and debugging in an AI application that handles TBs of data and runs in real-time. Attendees will leave the talk equipped with tools and best practices to supercharge MLOps in their team.

Leandro Almeida will be presenting on Semantic Image Logging Using Approximate Statistics & MLflow on May 27, 2021 at 11:00 am PT:


As organizations launch complex multi-modal models into human-facing applications, data governance becomes both increasingly important, and difficult. Specifically, monitoring the underlying ML models for accuracy and reliability becomes a critical component of any data governance system. When complex data, such as image, text and video, is involved, monitoring model performance is particularly problematic given the lack of semantic information. In industries such as health care and automotive, fail-safes are needed for compliant performance and safety but access to validation data is in short supply, or in some cases, completely absent. However, to date, there have been no widely accessible approaches for monitoring semantic information in a performant manner.
In this talk, we will provide an overview of approximate statistical methods, how they can be used for monitoring, along with debugging data pipelines for detecting concept drift and out-of-distribution data in semantic-full data, such as images. We will walk through an open source library, whylogs, which combines Apache Spark and novel approaches to semantic data sketching. We will conclude with practical examples equipping ML practitioners with monitoring tools for computer vision, and semantic-full models.

Andy Dang will be presenting on Re-imagine Data Monitoring with whylogs and Spark on May 27, 2021 at 05:00 pm PT:


In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language- and platform-agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.

Past talks

You might also be interested in checking out a couple of other talks that the WhyLabs team has given recently:

Bernease Herman was at csv,conf presenting on Static datasets aren't enough: where deployed systems differ from research on May 5, 2021 12:00pm PT. The video and slides are online.


The focus on static datasets in machine learning and AI training fails to translate to how these systems are being deployed in industry. As a result, data scientists and engineers aren't considering how these systems perform in changing, real-world environments, or the feedback mechanisms and societal implications that these systems can cause. In the session, we will highlight existing tools that work with dynamic (and perhaps streaming) data. We will suggest some preliminary studies of activities and lessons that may bridge the gap in data science training for realistic data.

Alessya Visnjic was at the Databricks Community Lightning Talks presenting on The Critical Missing Component in the Production ML Stack on May 20, 2021. The video starts at minute 16.

Other posts

Detecting Semantic Drift within Image Data: Monitoring Context-Full Data with whylogs

Concept drifts can originate in different stages of your data pipeline, even before the data collection itself. In this article, we’ll show how whylogs can help you monitor your machine learning system’s data ingestion pipeline by enabling concept drift detection, specifically for image data.
Don’t Let Your Data Fail You; Continuous Data Validation with whylogs and Github Actions

Ensuring data quality should be among your top priorities when developing an ML pipeline. In this article we’ll show how whylogs constraints, combined with GitHub Actions, can help with data validation as a key component in ensuring data quality.
Integrating whylogs into your Kafka ML Pipeline

Evaluating the quality of data in the Kafka stream is a non-trivial task due to large volumes of data and latency requirements. This is an ideal job for whylogs, an open-source package for Python or Java that uses Apache DataSketches to monitor and detect statistical anomalies in streaming data.
