
Introducing WhyLabs, a Leap Forward in AI Reliability

Pandora’s box, according to Greek mythology, was a consequence of mankind’s first technological revolution. It was because humans harnessed the power of fire that the gods sent them a package of misfortunes. The moral of the story seems to be that together with the advantages created by every technological leap forward, there also come unexpected difficulties.

Over the past decade, my co-founders and I have experienced firsthand the tremendous advantages as well as the difficulties of AI adoption. Today, we are excited to announce WhyLabs, a company that empowers AI practitioners to reap the benefits of AI without the spectacular failures that so often make the news.

In this post, I’d like to tell you the story of why we built what we call an AI observability platform and why, if you’re an AI practitioner, you should start using WhyLabs today to supercharge your AI applications.

The WhyLabs team, with co-founders Alessya Visnjic, Sam Gracie, Maria Karaivanova, and Andy Dang in the top row, from left to right.

Early (mis)adventures with AI

For five years at Amazon, I carried a pager to be able to immediately respond to system failures. First it was software failures. Then, I led one of Amazon’s first AI forecasting teams, and it became AI model failures. I responded to so many different types of model failures that you could say I hold a PhD in AI application resuscitation. And because I was part of a great team and we always found a way to fix things, the outside world never noticed.

Software failures are an unavoidable fact of life in any modern enterprise. But the weird thing about AI failures specifically is that most issues originate in the data that one’s models consume. “The data is broken.” “The data changed.” “We’ve never seen this kind of data.” That was the soundtrack in our office for a while. It quickly became apparent to us that the kinds of tools people rely on in DevOps are not suitable for AI applications. AI needed its own tooling ecosystem.

This realization propelled me to help start the development of Amazon’s internal data science platform. Soon after that, my current co-founders Andy Dang and Sam Gracie joined the project, and we worked together on building tools to make the lives of AI builders across the company easier. Eventually, I left Amazon with a vision to build a toolset that would enable every enterprise to operate AI reliably. But it wasn’t just about “democratizing” AI. It was about building tools that are sorely needed but don’t yet exist.

Every AI application comes with a Pandora’s box

I have since spoken with close to a hundred data science teams that are running AI in production and I’ve heard the same story from almost all of them. Taking a model out of the lab and deploying it to production initially feels like a huge accomplishment. But then you realize it’s only the tip of the iceberg. Once in production, the model requires constant attention, adjustment, troubleshooting, and maintenance. Organizations that don’t have adequate tools for these tasks (i.e., the vast majority of them) experience model failures that cost them anywhere from days of debugging to big PR disasters.

As of today, over 1,000 AI failures have been recorded by Partnership on AI. To make things worse, the pandemic and its manifold aftereffects are creating a massive drift in data that powers AI models across industries, causing huge drops in model accuracy, loss of trust in model reliability, and the sounding of alarms by risk analysts. A few years ago, AI reliability and robustness were mostly topics of academic conferences. Today, it’s clear to every practitioner that operating AI in production without continuous monitoring and evaluation tools is practically engineering malpractice.

What is missing?

Operating an AI application should be like conducting a symphony, not like putting out a fire. Unfortunately, due to the lack of mature tools, the firefighting mode of AI operations is the prevailing one. While countless enterprises have adopted AI to boost productivity or gain a competitive edge, fewer than 2 out of 5 have seen a return on their investment. Now, a wave of startups is trying to fill the glaring gaps in the AI tooling ecosystem. At WhyLabs, we are building a platform that enables practitioners to realize the full potential of their AI operations by preventing costly model failures and enabling cross-functional collaboration between all stakeholders.

We incubated WhyLabs at the Allen Institute for AI (AI2), a fundamental research institute. While there, my team and I had the chance to work with some of the greatest scientists in the field to break down the challenges of AI monitoring and solve them from first principles. At the same time, we spoke with hundreds of enterprise AI practitioners to develop a product that actually solves their biggest challenges in an elegant way. With help from AI2, I founded RsqrdAI, an AI community where practitioners from hundreds of enterprises gather to discuss the goal of Robust and Responsible AI systems.

An early gathering of RsqrdAI. Today, RsqrdAI is 1,000+ members strong, we hold monthly meetups, and I invite you to join the conversation.

These explorations culminated in us identifying three core functions that we believe are fundamental to any AI operations solution. For an AI solution to be robust and responsible, it must:

  1. Increase the observability of AI applications
  2. Perform continuous data quality monitoring
  3. Keep all stakeholders informed about the behavior of the application

These are the three pillars that define our approach at WhyLabs.

1. The importance of observability

Explainability has been all the rage in the AI community for the past few years. But explainability techniques have a significant shortcoming. They can only explain why an application made this or that decision. They tell you little about how well your model is behaving over time. While explainability tools can be very useful for, say, performing a post mortem after a costly AI failure, they don’t help you keep tabs on the health and reliability of your models to prevent such failures from occurring in the first place.

To overcome this limitation, we have adopted and adapted a concept from control theory and DevOps called observability. Observability refers to the practice of acquiring actionable insights from information that is emitted by a system continuously. If an explainability tool is like a medical exam that can tell you why you are feeling ill on a particular day, an observability tool is like a futuristic device that simultaneously measures your heart rate, temperature, oxygen levels and a thousand other things, and keeps a log of all these measurements over time. While a medical exam is necessary once in a while, the observability tool is more useful for day-to-day health monitoring and, if used right, can save you some trips to the hospital.

As we built observability tools into our platform, we benefited greatly from conversations with Madrona’s Ted Kummert (now at UIPath) and Honeycomb’s Charity Majors. Ted helped refine our thinking on the parallels between best practices in DevOps and AI monitoring. Charity’s insights into how observability is indispensable for maintaining any complex software system gave us clarity on why the observability paradigm was the right one.

WhyLabs enables observability in AI applications by instrumenting the end-to-end AI lifecycle with a lightweight logging agent that continuously and intelligently collects vital information about the data that flows through each step. This distilled information empowers our customers to understand not only why their system is behaving a certain way, but also when it started behaving that way and what caused the behavior to change.
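
To give a feel for what "collecting vital information about the data" can look like, here is a minimal sketch in Python. It is not the WhyLabs agent; the names ColumnProfile and profile_batch are hypothetical and exist only for this illustration. The idea is that each value is folded into a small, fixed-size summary and then discarded, so the summary can be stored per time window and compared later.

```python
import math
from dataclasses import dataclass


@dataclass
class ColumnProfile:
    """Constant-memory summary of one numeric column (hypothetical illustration)."""
    count: int = 0
    missing: int = 0
    minimum: float = math.inf
    maximum: float = -math.inf
    mean: float = 0.0
    _m2: float = 0.0  # running sum of squared deviations (Welford's algorithm)

    def track(self, value):
        """Fold one observed value into the summary without storing it."""
        if value is None or (isinstance(value, float) and math.isnan(value)):
            self.missing += 1
            return
        self.count += 1
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def stddev(self):
        """Sample standard deviation; 0.0 until at least two values are seen."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))


def profile_batch(values):
    """Profile one batch (for example, one hour of a single model input)."""
    profile = ColumnProfile()
    for value in values:
        profile.track(value)
    return profile
```

Keeping one such profile per feature per time window is what makes it possible to ask when a statistic started to move, not just whether it looks odd today.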

2. Keeping tabs on your data quality

If there is one thing that’s reliable about AI applications, it’s the old adage: Garbage in, garbage out. In other words, most AI failures are caused by the data that models consume. Even slight deviations and irregularities in input data can derail predictions.

In order to minimize data-caused model failures, continuous monitoring of data quality must be a feature of any AI operations toolset. However, that is easier said than done, since monitoring at scale remains a major challenge for data science teams. As we developed our platform, we benefited from conversations with Sebastian Schelter, now at the University of Amsterdam. Our discussions with Sebastian inspired us to make use of approximate statistical methods and advanced techniques from the field of data quality management. Now, after much hard work refining our architecture, our platform is able to operate at terabyte scale, much to the delight of our early users.
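
As an illustration of what an approximate statistical method buys you, here is a small Python sketch that estimates quantiles of an arbitrarily long stream from a fixed-size random sample (reservoir sampling, Algorithm R). This is a simple stand-in chosen for readability, not a description of the technique our platform uses, and the class name ReservoirQuantiles is hypothetical.

```python
import random


class ReservoirQuantiles:
    """Approximate quantiles from a fixed-size uniform sample of a stream."""

    def __init__(self, capacity=1000, seed=0):
        self.capacity = capacity
        self.seen = 0
        self.sample = []
        self._rng = random.Random(seed)

    def update(self, value):
        """Offer one value; memory use never exceeds `capacity` items."""
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(value)
            return
        # Algorithm R: the new value replaces a random slot
        # with probability capacity / seen.
        slot = self._rng.randrange(self.seen)
        if slot < self.capacity:
            self.sample[slot] = value

    def quantile(self, q):
        """Estimate the q-th quantile (0 <= q <= 1) from the current sample."""
        if not self.sample:
            raise ValueError("no data seen yet")
        ordered = sorted(self.sample)
        index = min(int(q * len(ordered)), len(ordered) - 1)
        return ordered[index]
```

The trade-off is the usual one: the answer is an estimate whose error shrinks as the sample capacity grows, in exchange for memory that stays constant no matter how much data flows through.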

Continuous data quality tracking is a cornerstone of reliable AI operations. Without it, the data that flows into a model is like water under a bridge — it’s gone forever without any meaningful records left behind that the AI practitioner can use to understand changes in the model’s behavior. We believe so strongly in the importance of continuous data monitoring and logging that we created a tool for it — WhyLogs — and made it available to all AI builders for free. We released WhyLogs as an open source library that empowers every AI builder to incorporate best practices of data quality into their project and pipeline. Check out the technical deep dive on WhyLogs and get the library on GitHub!
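
Once records of past data exist, one thing they enable is a direct comparison between a baseline and what the model is seeing now. The Python sketch below uses the population stability index (PSI), a common drift score, as one example. It is not the WhyLogs API; the function names and the choice of bin edges are assumptions made for this illustration.

```python
import math


def histogram_shares(values, edges, floor=1e-6):
    """Share of values per bin; `edges` are sorted cut points.

    Empty bins are floored at a small value so the logarithm below is defined.
    """
    if not values:
        raise ValueError("cannot build a histogram from no values")
    counts = [0] * (len(edges) + 1)
    for value in values:
        # The bin index is the number of cut points at or below the value.
        index = sum(1 for edge in edges if value >= edge)
        counts[index] += 1
    return [max(count / len(values), floor) for count in counts]


def population_stability_index(baseline, current, edges):
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    expected = histogram_shares(baseline, edges)
    actual = histogram_shares(current, edges)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))
```

A score of zero means the two distributions fall into the bins identically, and larger scores mean a bigger shift. Where to draw the alerting threshold is a judgment call that depends on the feature and the application.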

3. Putting humans in the loop

The thing about AI applications that makes them both terrific and terrifying is their endless applicability. Organizations use AI to automate decision making about anything and everything, from deciding how many black socks to stock a warehouse with, to diagnosing lung cancer based on X-rays. That means that the kinds of things about which an AI model makes decisions are usually beyond the purview of those who built the model. Thus, AI builders often cannot troubleshoot problems in their models by themselves. They need to collaborate with experts in whatever domain the model is used for. Furthermore, running an AI application usually requires coordination with data scientists, ML engineers, product managers, and business unit leaders.

In order to enable reliability, an AI operating platform should be accessible and useful for all stakeholders and collaborators on a project. In thinking through how one could build such a collaborative framework, we greatly benefited from conversations with Dan Weld of the Lab for Human-AI Interaction at the University of Washington. Our conversations underscored the importance of building interfaces that are inclusive to all the stakeholders who need to understand and control models. Our platform offers an intuitive user interface that immediately starts surfacing the right insights to the right stakeholders upon installation. For a more in-depth discussion of the platform, check out our product launch.

The path that lies ahead

We are on the brink of an AI-powered revolution which, just like mankind’s first technological leap forward, is already yielding great benefits as well as new challenges. WhyLabs is committed to building tools that enable teams to harness the power of AI while minimizing the difficulties of adoption.

WhyLabs Platform screenshots. Experience it today at

We have come this far thanks to our phenomenal team and great investors. My co-founders Sam Gracie and Andy Dang have extensive experience building AI tools and a deep familiarity with the challenges encountered by AI practitioners. My co-founder and long-time thought-partner Maria Karaivanova brings business acumen to our engineering powerhouse, leveraging her experience from building Cloudflare and from venture investing. Our extended team of engineers brings decades of experience from companies like Amazon, Microsoft, Cloudflare, and Lyft. And our investors — Madrona Venture Group, Bezos Expeditions, Defy Partners, and Ascend VC — are thrilled to propel our vision forward.

We believe that the future of reliable AI will be enabled by an intuitive interface between AI and human operators that marries best practices from Observability, Data Quality and Human-Computer Interaction. The WhyLabs platform is the first major leap in that direction. For more information, or to try it out today, check out
