What is AI observability?
AI observability is the collection of statistics, performance data, and metrics from every part of an ML system. The observability system, in turn, delivers actionable insights to all stakeholders.
Operating ML is a collaborative effort
The hard work begins once the ML application is deployed to production. Operating the ML application means creating feedback loops that inform various stakeholders of the health and quality of the data, model, and predictions. These feedback loops collect the data necessary to ensure that the model's performance and customer experience do not degrade over time. Without proper tools, monitoring and analyzing feedback data takes about 40% of the team's daily effort.
Monitoring an ML application poses unique challenges
Ultimately, monitoring is the mechanism for creating the feedback loop between the ML pipeline and all human operators, enabling trust and transparency. To enable monitoring in an ML pipeline, a systematic collection of logs, traces, and metadata must be carried out at each of its stages. In addition to infrastructure health and model performance, logs need to describe statistical properties of the data that runs through the pipeline.
Collecting such statistics enables monitoring for data quality and data drift, the most common sources of model failures. Observability is enabled by reconciling and analyzing data collected from each pipeline step.
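To make this concrete, here is a minimal sketch of profiling one feature in a batch and checking it for drift against a reference sample. The summary fields, the thresholds, and the choice of a two-sample Kolmogorov-Smirnov test are illustrative assumptions, not a prescription from any particular platform.

```python
# A minimal sketch of per-batch data profiling and drift detection.
# Feature data and thresholds below are illustrative stand-ins.
import numpy as np
from scipy.stats import ks_2samp

def profile_batch(values: np.ndarray) -> dict:
    """Summarize the statistical properties of one feature in a batch."""
    return {
        "count": int(values.size),
        "missing": int(np.isnan(values).sum()),
        "mean": float(np.nanmean(values)),
        "std": float(np.nanstd(values)),
        "p95": float(np.nanpercentile(values, 95)),
    }

def detect_drift(reference: np.ndarray, current: np.ndarray, alpha: float = 0.01) -> bool:
    """Flag drift when the two samples are unlikely to share a distribution."""
    statistic, p_value = ks_2samp(reference, current)
    return p_value < alpha

# Usage: profile each production batch and compare it to the training data.
reference = np.random.normal(0.0, 1.0, 10_000)   # stand-in for training data
current = np.random.normal(0.4, 1.2, 5_000)      # stand-in for a production batch
print(profile_batch(current))
print("drift detected:", detect_drift(reference, current))
```

In practice, a profile like this would be computed for every feature and collected alongside infrastructure and model-performance telemetry from each pipeline step.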
The missing link in an ecosystem
The DevOps ecosystem evolved with many tools and standards that enable end-to-end observability for traditional software. Today's ML pipelines require the same degree of observability. However, the data involved is unique in its volume, statistical nature, and structure, with an unbounded number of features and dimensions.
These properties make it impossible to adapt traditional DevOps monitoring tools to ML monitoring. A proper solution must treat the data's cardinality, statistical properties, and semantic meaning as first-class citizens. Furthermore, the model itself should log the data required to explain predictions and accuracy deviations.
Figure: some of the popular DevOps and MLOps tools available to practitioners (not an exhaustive list of all solutions).
The AI Observability System
An AI Observability System collects statistics, performance data, and metrics from every single step of your ML lifecycle and delivers actionable insights to stakeholders. This system needs a view into each stage of the data pipeline, and thus should be relatively infrastructure agnostic while scaling to your data volume. By automating insight extraction, teams can collaborate to deliver models and respond to issues more effectively.
The result of an end-to-end observability pipeline is that the organization gets timely insights about changes to data and model behavior in production. This is especially useful for surfacing common ML issues such as data drift, stale models, and data quality changes. These signals can be fed back into the ML processes to accelerate the model development lifecycle.
Anatomy of an enterprise AI Observability Platform
Telemetry collection
Your model rarely starts out in a production-like environment, and models might be deployed to different infrastructure with different characteristics. Collecting telemetry data (statistics, metrics, and performance) should therefore happen across your infrastructure in a lightweight manner. Whether you're deploying a batch inference model or a live prediction service, timely telemetry data reduces the time it takes to respond to model issues.
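As an illustration, a lightweight way to capture telemetry around an inference call might look like the sketch below. The wrapper, field names, and JSON-lines sink are hypothetical; a real deployment would ship these records to whatever collector the platform provides.

```python
# A minimal sketch of lightweight telemetry around a prediction call.
# The telemetry sink (stdout as JSON lines) and field names are illustrative.
import json
import time
from typing import Callable, Sequence

def with_telemetry(predict: Callable[[Sequence[float]], float], model_name: str):
    def wrapped(features: Sequence[float]) -> float:
        start = time.perf_counter()
        prediction = predict(features)
        record = {
            "model": model_name,
            "timestamp": time.time(),
            "latency_ms": (time.perf_counter() - start) * 1000.0,
            "num_features": len(features),
            "prediction": prediction,
        }
        print(json.dumps(record))  # emit one JSON line per inference
        return prediction
    return wrapped

# The same wrapper works for a live service handler or inside a batch loop.
score = with_telemetry(lambda x: sum(x) / len(x), model_name="demo-model")
score([0.2, 0.5, 0.9])
```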
Monitoring and anomaly detection
An AI observability platform applies monitoring and anomaly detection algorithms to the telemetry data to detect model performance issues and data issues.
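One simple form such detection can take is a rolling z-score over a telemetry metric, sketched below. The window size, warm-up length, and threshold are illustrative defaults rather than values from any specific platform.

```python
# A minimal sketch of anomaly detection over a telemetry metric using a
# rolling z-score. Window size and threshold are illustrative defaults.
from collections import deque
from statistics import mean, pstdev

class ZScoreDetector:
    def __init__(self, window: int = 100, threshold: float = 3.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        """Return True when the new value deviates strongly from recent history."""
        is_anomaly = False
        if len(self.history) >= 10:  # wait for a small warm-up window
            mu, sigma = mean(self.history), pstdev(self.history)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        self.history.append(value)
        return is_anomaly

# Usage: feed in a metric such as per-batch latency or prediction mean.
detector = ZScoreDetector()
for latency in [12, 11, 13, 12, 14, 12, 13, 11, 12, 13, 95]:
    if detector.observe(latency):
        print("anomalous metric value:", latency)
```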
Time series database (TSDB)
Because telemetry data is collected from multiple sources, a TSDB enables organizations to query it along the time dimension and extract up-to-date insights. This, in turn, allows downstream systems to respond quickly to model issues.
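Conceptually, the TSDB's role can be sketched as a store of points keyed by metric name and timestamp that supports time-window queries. The in-memory class below is only a stand-in for illustration; a production platform would use a dedicated time series database.

```python
# A minimal stand-in for a time series store: points keyed by metric name,
# kept sorted by timestamp, with time-window queries.
import bisect
import time
from collections import defaultdict
from typing import List, Optional, Tuple

class MetricStore:
    def __init__(self):
        self._series = defaultdict(list)  # metric name -> sorted (timestamp, value)

    def write(self, metric: str, value: float, timestamp: Optional[float] = None):
        ts = time.time() if timestamp is None else timestamp
        bisect.insort(self._series[metric], (ts, value))

    def query(self, metric: str, start: float, end: float) -> List[Tuple[float, float]]:
        """Return the (timestamp, value) points for `metric` within [start, end]."""
        points = self._series[metric]
        lo = bisect.bisect_left(points, (start, float("-inf")))
        hi = bisect.bisect_right(points, (end, float("inf")))
        return points[lo:hi]

store = MetricStore()
store.write("model.accuracy", 0.93, timestamp=1_700_000_000)
store.write("model.accuracy", 0.88, timestamp=1_700_003_600)
print(store.query("model.accuracy", 1_700_000_000, 1_700_007_200))
```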
Debugging
ML datasets can be very large, so the platform should be intelligent about how it collects debug data: instead of sampling at random, it can use known statistics of the features to decide when to collect and store debug records, giving data scientists a detailed view of data issues without sacrificing scalability or performance.
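A simple sketch of this idea: keep a full debug record only when a feature falls outside the range observed during training, instead of sampling at random. The feature names and bounds below are hypothetical.

```python
# A minimal sketch of statistics-driven debug sampling: full records are kept
# only when a feature leaves its training-time range. Bounds are illustrative.
from typing import Dict, List, Tuple

class DebugSampler:
    def __init__(self, expected_ranges: Dict[str, Tuple[float, float]]):
        self.expected_ranges = expected_ranges  # feature -> (low, high) from training stats
        self.captured: List[dict] = []

    def maybe_capture(self, record: Dict[str, float]) -> bool:
        """Store the full record when any feature is outside its expected range."""
        for feature, (low, high) in self.expected_ranges.items():
            value = record.get(feature)
            if value is not None and not (low <= value <= high):
                self.captured.append(record)
                return True
        return False

sampler = DebugSampler({"age": (18, 90), "income": (0, 500_000)})
sampler.maybe_capture({"age": 34, "income": 72_000})    # within range: skipped
sampler.maybe_capture({"age": 210, "income": 55_000})   # out of range: captured
print(len(sampler.captured), "record(s) captured for debugging")
```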
Visualization layer
MLOps is a team effort. The visualization layer enables teams to extract different insights from the ML system. Collaboration across job families is the key to successful model deployments.