MLOps, ML Monitoring and Data Science Glossary
To help you better understand MLOps, Machine Learning (ML) Monitoring and Data Science, we have created a glossary of common terms.
AI explainability
AI explainability refers to the ability of artificial intelligence (AI) models to provide understandable and interpretable explanations for their predictions or decisions. It involves techniques and methods that aim to make AI models transparent and accountable, allowing humans to understand how and why a particular prediction or decision was made. AI explainability is crucial for building trust in AI systems, ensuring fairness, accountability, and transparency, and enabling humans to understand, validate, and interpret the outcomes of AI models in a meaningful way.
Bias
Bias refers to a systematic error in a model's predictions or decisions, caused by the model's inability to capture the true underlying relationship between the input variables and the output variable. This can lead to inaccuracies or discrimination against certain groups or individuals. Bias can be caused by various factors such as a biased training dataset, inadequate feature selection, or an inappropriate choice of algorithm. To mitigate bias, it is important to carefully select and preprocess data, use appropriate evaluation metrics, and regularly monitor the model's performance on diverse data.
Data logging
Data logging is the capture, storage, and presentation of one or more datasets for analysis. The logged data is then used to identify trends and correlations and to make predictions about future behavior.
Data quality
Data quality refers to the consistency, accuracy, and relevancy of a data set. As data pipelines handle larger volumes of data from a variety of sources and increase in complexity, data quality becomes one of the most important factors to overall model health.
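The kinds of checks described above can be sketched in a few lines. This is a minimal, illustrative example (the column name and valid range are made up for the demo), not a full data-quality framework:

```python
# Minimal data-quality checks for one numeric column (illustrative sketch):
# count missing entries and values outside an expected valid range.
def quality_report(values, lo, hi):
    """Summarize missing (None) and out-of-range entries in a column."""
    missing = sum(1 for v in values if v is None)
    present = [v for v in values if v is not None]
    out_of_range = sum(1 for v in present if not (lo <= v <= hi))
    return {
        "rows": len(values),
        "missing": missing,
        "out_of_range": out_of_range,
        "completeness": 1 - missing / len(values),
    }

# A hypothetical "age" column with two missing and two impossible values.
ages = [34, 29, None, 41, -3, 57, None, 120, 38, 45]
report = quality_report(ages, lo=0, hi=110)
print(report)  # {'rows': 10, 'missing': 2, 'out_of_range': 2, 'completeness': 0.8}
```

Checks like these, run on every batch of a pipeline, catch many quality regressions before they reach a model.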
Data-centric AI
Data-centric AI refers to an approach in artificial intelligence (AI) where the focus is on leveraging data as a primary driver for model development and decision-making. In data-centric AI, the quality, quantity, and diversity of data are prioritized, and models are trained to learn from data in an autonomous and adaptive manner.
Deep Learning
Deep Learning is a subfield of machine learning that uses artificial neural networks to train models that can learn from and make predictions or decisions based on large amounts of data. These neural networks, organized in multiple layers, can extract complex features from raw data, allowing for highly accurate and sophisticated pattern recognition.
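The layered structure can be sketched with a forward pass through a tiny two-layer network (weights here are random and untrained, purely to show the mechanics):

```python
import numpy as np

# Forward pass of a tiny two-layer network: each layer applies a linear
# transform followed by a nonlinearity, so later layers build on features
# extracted by earlier ones. Weights are random, untrained placeholders.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))       # batch of 4 inputs, 8 raw features

W1 = rng.normal(size=(8, 16))     # layer 1: 8 raw -> 16 hidden features
W2 = rng.normal(size=(16, 3))     # layer 2: 16 hidden -> 3 output scores

h = np.maximum(0, x @ W1)         # ReLU hidden activations
scores = h @ W2                   # final layer output
print(scores.shape)               # (4, 3): one score vector per input
```

Real deep learning frameworks add many more layers plus a training loop that adjusts the weights from data, but the per-layer computation is the same shape.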
Distribution shifts
Distribution shift refers to a change in the statistical distribution of the input data used to train a model compared to the distribution of the input data used in the real world application. This can occur when the data used to train the model is collected from a different source or time period than the data it will be applied to. As a result, the model may fail to generalize well to new data, leading to decreased accuracy and performance.
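One simple (and deliberately crude) way to spot such a shift is to compare the mean of a feature in production against its training baseline. This sketch flags a shift when the production mean moves more than a few standard errors from the training mean; the threshold and data are illustrative assumptions:

```python
import numpy as np

# Toy drift check (illustrative): flag a feature whose production mean has
# moved more than `k` standard errors away from the training mean.
def mean_shift_flag(train, prod, k=4.0):
    mu, sigma = train.mean(), train.std()
    z = abs(prod.mean() - mu) / (sigma / np.sqrt(len(prod)))
    return z > k

rng = np.random.default_rng(42)
train = rng.normal(loc=0.0, scale=1.0, size=5_000)          # training baseline
prod_same = rng.normal(loc=0.0, scale=1.0, size=1_000)      # same distribution
prod_shifted = rng.normal(loc=0.5, scale=1.0, size=1_000)   # mean shifted

print(mean_shift_flag(train, prod_same))     # expected: no flag
print(mean_shift_flag(train, prod_shifted))  # expected: flagged
```

Production monitoring systems typically use richer tests (KS tests, Hellinger distance, KL divergence) that also catch changes in variance and shape, not just the mean.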
Embeddings
Embeddings are a way to represent data in a lower-dimensional space, while preserving the relationships between the different data points. Embeddings are widely used in natural language processing (NLP), computer vision, and other areas of machine learning.
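The "preserved relationships" show up as geometric closeness. In this toy sketch the 3-dimensional vectors are hand-made for illustration (real embeddings are learned and have hundreds of dimensions), but the cosine-similarity comparison is the standard operation:

```python
import numpy as np

# Hand-made toy "embeddings": related concepts get nearby vectors,
# so their cosine similarity is close to 1.
emb = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

def cosine(u, v):
    """Cosine similarity: 1 = same direction, 0 = orthogonal."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(emb["cat"], emb["dog"]))  # high: related concepts
print(cosine(emb["cat"], emb["car"]))  # much lower: unrelated concepts
```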
Feature Store
A feature store is a centralized repository for storing, managing, and sharing machine learning features across an organization. Features are the inputs to a machine learning model that the model uses to make predictions. A feature store makes it easy for data scientists and machine learning engineers to discover, share, and reuse features, allowing them to build models more efficiently and effectively.
Hellinger distance
Hellinger distance, also known as Hellinger divergence, is a measure of similarity or dissimilarity between probability distributions, bounded between 0 and 1. It is commonly used in machine learning and statistics to compare and quantify the differences between two probability distributions.
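For discrete distributions the definition is compact enough to sketch directly; this is a minimal illustration with made-up distributions:

```python
import numpy as np

# Hellinger distance between two discrete distributions:
# H(P, Q) = (1 / sqrt(2)) * || sqrt(P) - sqrt(Q) ||_2, bounded in [0, 1].
def hellinger(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(hellinger(p, p))          # 0.0: identical distributions
print(hellinger(p, q))          # small positive value: similar distributions
print(hellinger([1, 0], [0, 1]))  # 1.0: completely disjoint support
```

Unlike KL divergence, Hellinger distance is symmetric and always finite, which makes it convenient for drift monitoring.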
Kullback-Leibler (KL) Divergence
Kullback-Leibler (KL) Divergence, also known as relative entropy, is a measure of the difference between two probability distributions. It measures how much information is lost when approximating one distribution with another. KL Divergence is asymmetric, meaning that the divergence from distribution A to distribution B is not necessarily the same as the divergence from distribution B to distribution A.
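A short sketch for discrete distributions makes the asymmetry concrete (the distributions are made up for illustration; natural log is used here):

```python
import numpy as np

# KL divergence D(P || Q) for discrete distributions:
# D(P || Q) = sum_i p_i * log(p_i / q_i). Note the asymmetry in P and Q.
def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                  # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(kl(p, q))   # D(P || Q)
print(kl(q, p))   # D(Q || P): a different value, showing asymmetry
print(kl(p, p))   # 0.0 when the distributions are identical
```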
Kolmogorov-Smirnov (KS) Tests
The Kolmogorov-Smirnov (KS) test is a statistical test used to compare the similarity or difference between two probability distributions. It measures the maximum difference between the cumulative distribution functions (CDFs) of the two distributions, quantifying their level of similarity or dissimilarity.
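The KS statistic itself (the maximum gap between the two empirical CDFs) is simple to compute directly; this sketch omits the p-value that a full test such as SciPy's `ks_2samp` would also report:

```python
import numpy as np

# Two-sample KS statistic: the maximum vertical gap between the two
# empirical CDFs, evaluated over the pooled sample points.
def ks_statistic(a, b):
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(7)
a = rng.normal(0.0, 1.0, size=2_000)
b = rng.normal(0.0, 1.0, size=2_000)   # drawn from the same distribution
c = rng.normal(1.0, 1.0, size=2_000)   # mean shifted by 1

print(ks_statistic(a, b))  # small gap: similar distributions
print(ks_statistic(a, c))  # large gap, reflecting the shift
```

Because it compares whole CDFs, the KS test catches changes in shape and spread as well as in the mean, which makes it a common drift-detection tool.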
MLOps
MLOps (Machine Learning Operations) is a set of best practices, concepts, and development culture that facilitates the machine learning lifecycle. MLOps aims to enable faster experimentation and development of models, faster deployment of models into production, quality assurance, and end-to-end lineage tracking.
Model drift
Model drift, also known as concept drift, refers to the phenomenon where the statistical properties of the target variable or the input data distribution change over time, resulting in a degradation of the model's performance. This means that the relationships between the input variables and the output variable may change, which can make previously trained models less accurate and relevant for new data.
Model performance
Model performance is an assessment of how well an ML model is doing, not only with training data but also in real-time once the model has been deployed to production. It describes the accuracy of the model's predictions, and how effectively it can perform its tasks with the data it has been trained on.
High-performing models mean accurate and trustworthy predictions for your respective use cases.
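A performance assessment ultimately reduces to a metric computed over predictions and ground-truth labels. A minimal sketch with accuracy on toy data:

```python
# Computing a simple performance metric (accuracy) from a model's
# predictions against ground-truth labels (toy data for illustration).
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # two mistakes out of eight
print(accuracy(y_true, y_pred))     # 0.75
```

In production, the same metric is tracked over time as labels arrive, so a drop relative to the training baseline can be caught early.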
Model performance degradation
Model performance degradation refers to the decline in the performance of a machine learning model over time. It can occur due to various factors, such as changes in the data distribution, feature drift, or model aging.
Degradation can result from data quality issues, from real-world data drifting away from the baseline data the model was trained on, and from a myriad of other factors such as statistical anomalies and an accumulation of unseen errors within the system.
Open source software
Open Source Software (OSS) is any software whose full program, including the source code, is made freely available and can be modified by any independent party.
Profiling
Profiling collects statistical measurements of a dataset rather than retaining the raw records (in contrast to sampling, which keeps a subset of the raw data). In the case of whylogs, the metrics produced come with mathematically derived uncertainty bounds. These profiles are scalable, lightweight, flexible, and configurable, and rare events and outlier-dependent metrics can be accurately captured. The results are statistical, directly interpretable, and stored in a standard, portable data format.
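The core idea can be sketched as replacing raw rows with summary metrics. This toy profile is purely illustrative (it is not the whylogs API, and the column and data are made up):

```python
import numpy as np

# A minimal statistical profile of one numeric column (illustrative only,
# not the whylogs API): keep summary metrics instead of the raw rows.
def profile(values):
    v = np.asarray(values, float)
    return {
        "count": int(v.size),
        "mean": float(v.mean()),
        "std": float(v.std()),
        "min": float(v.min()),
        "max": float(v.max()),
        "p50": float(np.percentile(v, 50)),
    }

# Hypothetical request latencies; the outlier survives in min/max metrics.
latencies_ms = [12.0, 15.5, 11.2, 240.0, 13.1, 14.8]
print(profile(latencies_ms))
```

A real profiler uses mergeable sketch data structures so that profiles from many batches or machines can be combined without revisiting the raw data.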
Regression models
Regression models are algorithms that are used to predict a continuous numerical output variable based on one or more input variables. The goal of a regression model is to find the relationship between the input variables and the output variable, and to use that relationship to make predictions about the output variable for new input data. Common types of regression models include linear regression, polynomial regression, and ridge or lasso regression; despite its name, logistic regression is used for classification rather than regression.
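The simplest case, linear regression, can be sketched with an ordinary least-squares fit on toy data with a known slope and intercept:

```python
import numpy as np

# Fitting a linear regression y = w * x + b by least squares (sketch).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                  # toy data with an exact linear relationship

A = np.column_stack([x, np.ones_like(x)])      # design matrix [x, 1]
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(w, b)                        # recovers slope 2.0 and intercept 1.0

x_new = 10.0
print(w * x_new + b)               # prediction for an unseen input: 21.0
```

Polynomial regression follows the same pattern with extra columns (x², x³, ...) in the design matrix.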
Responsible AI
Responsible AI is the idea that AI should be developed, designed, and implemented with good intentions. Its core principles are that AI should be developed in a fair, transparent, accountable, and, most importantly, non-discriminatory fashion.
Tracing
Tracing is the process of tracking the flow of data through a machine learning system, including the input data, the models used, and the output results. Tracing can be used to identify bottlenecks or errors in the system, and to debug issues that may arise during development or deployment. It can also help with performance analysis and optimization, by identifying which parts of the system are taking the most time or resources. Tracing is typically done through the use of specialized software tools that can monitor and log the flow of data through the system in real-time.