WhyLabs' Data Geeks Unleashed
May 21, 2021
Hello WhyLabs followers! This month three members of the WhyLabs team are speaking at the Data and AI Summit. In this post we've included descriptions and links to all their talks.
Data and AI Summit 2021 (Virtual)
Alessya Visnjic will be presenting on The Critical Missing Component in the Production ML Stack on May 26, 2021 03:50 pm PT:
The day the ML application is deployed to production and begins facing the real world is the best and the worst day in the life of the model builder. The joy of seeing accurate predictions is quickly overshadowed by a myriad of operational challenges. Debugging, troubleshooting and monitoring take over the majority of their day, leaving little time for model building. In DevOps, software operations are taken to the level of an art. Sophisticated tools enable engineers to quickly identify and resolve issues, continuously improving software stability and robustness. In the ML world, operations are still largely a manual process that involves Jupyter notebooks and shell scripts. One of the cornerstones of the DevOps toolchain is logging. Traces and metrics are built on top of logs, enabling monitoring and feedback loops. What does logging look like in an ML system?
In this talk we will demonstrate how to enable data logging for an AI application using MLflow in a matter of minutes. We will discuss how something so simple enables testing, monitoring and debugging in an AI application that handles TBs of data and runs in real-time. Attendees will leave the talk equipped with tools and best practices to supercharge MLOps in their team.
Leandro Almeida will be presenting on Semantic Image Logging Using Approximate Statistics & MLflow on May 27, 2021 11:00 am PT:
As organizations launch complex multi-modal models into human-facing applications, data governance becomes both increasingly important and increasingly difficult. Specifically, monitoring the underlying ML models for accuracy and reliability becomes a critical component of any data governance system. When complex data, such as image, text and video, is involved, monitoring model performance is particularly problematic given the lack of semantic information. In industries such as health care and automotive, fail-safes are needed for compliant performance and safety, but access to validation data is in short supply or, in some cases, completely absent. However, to date, there have been no widely accessible approaches for monitoring semantic information in a performant manner.
In this talk, we will provide an overview of approximate statistical methods and how they can be used to monitor and debug data pipelines and to detect concept drift and out-of-distribution data in semantic-full data, such as images. We will walk through an open source library, whylogs, which combines Apache Spark and novel approaches to semantic data sketching. We will conclude with practical examples equipping ML practitioners with monitoring tools for computer vision and semantic-full models.
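To make the idea of approximate statistics for image monitoring concrete, here is a minimal sketch in plain Python. It is not the whylogs API and not the method presented in the talk: the helper names `brightness_histogram` and `total_variation` are hypothetical, mean brightness is only a stand-in for a richer semantic feature, and images are assumed to be nested lists of grayscale pixel values between 0 and 255.

```python
def brightness_histogram(images, bins=8):
    """Summarize a batch of images as a fixed-size histogram of mean brightness.

    Assumes each image is a list of rows of grayscale pixels in [0, 255].
    """
    counts = [0] * bins
    for image in images:
        pixels = [p for row in image for p in row]
        mean = sum(pixels) / len(pixels)
        # Map the mean brightness to a bin; clamp so 255 lands in the last bin.
        index = min(int(mean / 256 * bins), bins - 1)
        counts[index] += 1
    return counts


def total_variation(hist_a, hist_b):
    """Total variation distance between two non-empty histograms, in [0, 1]."""
    total_a, total_b = sum(hist_a), sum(hist_b)
    return 0.5 * sum(
        abs(a / total_a - b / total_b) for a, b in zip(hist_a, hist_b)
    )


# Hypothetical reference batch (dark images) and production batch (bright images).
reference = brightness_histogram([[[10, 20], [30, 40]]] * 4)
production = brightness_histogram([[[200, 220], [240, 250]]] * 4)
drift_score = total_variation(reference, production)
```

The histogram has a fixed size no matter how many images are logged, so only the summary needs to be stored and compared. A score near 0 suggests the two batches look alike on this feature, and a score near 1 suggests a shift worth investigating.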
Andy Dang will be presenting on Re-imagine Data Monitoring with whylogs and Spark on May 27, 2021 05:00 pm PT:
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language- and platform-agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration to existing data and ML pipelines.
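As a rough illustration of why lightweight statistical profiles suit distributed engines such as Spark, the sketch below builds a small summary per data partition and merges the summaries afterwards. It is an assumption-laden toy in plain Python, not the whylogs API or its Spark integration: the `ColumnProfile` class and `profile` helper are hypothetical, and the statistics are limited to count, mean, variance, minimum and maximum.

```python
import math
from dataclasses import dataclass


@dataclass
class ColumnProfile:
    """Hypothetical mergeable summary of one numeric column (not whylogs)."""

    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def track(self, value: float) -> None:
        # Welford's online update: single pass, constant memory.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other: "ColumnProfile") -> "ColumnProfile":
        # Combine two partial profiles without revisiting the raw data.
        total = self.count + other.count
        if total == 0:
            return ColumnProfile()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / total
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / total
        return ColumnProfile(
            total,
            mean,
            m2,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )

    @property
    def variance(self) -> float:
        # Population variance; 0.0 for an empty profile.
        return self.m2 / self.count if self.count else 0.0


def profile(values) -> ColumnProfile:
    """Profile one partition of values in a single pass."""
    result = ColumnProfile()
    for value in values:
        result.track(value)
    return result


# Two hypothetical partitions profiled independently, then merged.
merged = profile([1.0, 2.0, 3.0]).merge(profile([4.0, 5.0]))
```

Because `merge` needs only the two summaries, each worker can profile its own partition and send back a few numbers instead of the data itself, which is the property that makes this style of profiling cheap at scale.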
You might also be interested in a couple of other talks that the WhyLabs team has given recently:
Bernease Herman was at csv,conf presenting on Static datasets aren't enough: where deployed systems differ from research on May 5, 2021 12:00 pm PT. The video and slides are online.
The focus on static datasets in machine learning and AI training fails to translate to how these systems are being deployed in industry. As a result, data scientists and engineers aren't considering how these systems perform in changing, real-world environments, or the feedback mechanisms and societal implications that these systems can cause. In the session, we will highlight existing tools that work with dynamic (and perhaps streaming) data. We will suggest some preliminary studies of activities and lessons that may bridge the gap in data science training for realistic data.
Alessya Visnjic was at the Databricks Community Lightning Talks presenting on The Critical Missing Component in the Production ML Stack on May 20, 2021. The video starts at minute 16.