July 18, 2023

Making sense of machine learning and artificial intelligence models by monitoring the training data

When you're training ML/AI models, the input data selection process matters. Build solid ML/AI model performance with data observability.

Henry Li

In today's information age, we find ourselves constantly bombarded with an overwhelming amount of data. Much of this data is processed and delivered to us through machine learning and artificial intelligence (ML/AI) technologies. While these automations have undoubtedly enhanced our lives, it is often challenging to understand the inner workings of these ML/AI models.

Training data plays a crucial role in shaping the outcomes of these models, a fact that is often overlooked. Here, I will delve into the significance of data in ML/AI, emphasizing the impact of input data selection and how data practitioners can build a solid understanding of their ML/AI model performance with data observability.

Understanding the data

As a data practitioner, I have encountered numerous instances where the decision about how to process input data significantly affected the performance of ML/AI models. For example, normalizing the input data helps manage data source effects so that more data points can be used for training a machine learning model. In more complex scenarios, there will be data batch effects that are hard either to remove or to model. Ben Shabad’s illustration encapsulates this data-model duality very well. Great data practitioners and scientists demand the highest quality of data, and they decide what data will be used in model training with statistical tolerance in mind.
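
As a rough illustration of normalizing away data source effects, the sketch below z-scores each value within its own source so that sources on different scales become comparable. The record layout, the source names, and the choice of a z-score are assumptions made for this example, not a description of any particular pipeline.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_by_source(records):
    """Z-score each value within its own source.

    `records` is a list of (source, value) pairs (a hypothetical layout).
    """
    by_source = defaultdict(list)
    for source, value in records:
        by_source[source].append(value)

    # Per-source mean and population standard deviation.
    stats = {}
    for source, values in by_source.items():
        sigma = pstdev(values)
        # Guard against a constant source to avoid dividing by zero.
        stats[source] = (mean(values), sigma if sigma > 0 else 1.0)

    return [
        (source, (value - stats[source][0]) / stats[source][1])
        for source, value in records
    ]


# Two hypothetical sources that measure the same thing on different scales.
records = [
    ("sensor_a", 10.0), ("sensor_a", 12.0), ("sensor_a", 14.0),
    ("sensor_b", 100.0), ("sensor_b", 120.0), ("sensor_b", 140.0),
]
normalized = normalize_by_source(records)
```

After normalization, the lowest reading from each source maps to the same standardized value, so rows from both sources can be pooled for training. Real batch effects are often harder than a shift and a rescale, which is the point made above about effects that are hard to remove or model.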


For any automated problem-solving work, it is important to gain a comprehensive understanding of the data landscape, including its sources, characteristics, and limitations. By knowing the intricacies of the data, practitioners can make informed decisions about data preprocessing, feature engineering, and model selection, leading to more accurate and robust ML/AI pipelines. See my previous blog post on data anomaly detection for more details.

Making informed choices based on the availability of data, the problem context, and the solution space is crucial in achieving desirable outcomes, like discovering a new scientific phenomenon or making a business more competitive.

The importance of high-quality data

Zooming in, one of the easier, more manageable tasks is to assess the quality of the available data. ML and AI technologies are not infallible. One of the primary reasons for their occasional shortcomings is the selection of input data. Incomplete, imbalanced, or noisy data presents significant challenges in the development of ML/AI algorithms. Furthermore, after a set of models is trained and deployed, subpar or invalid data arriving later for prediction will erode model performance.

The quality of the data used for training directly influences the performance and accuracy of the resulting models. Subpar data can introduce biases, misleading patterns, or inconsistencies that adversely affect the algorithms' ability to make accurate predictions. Therefore, data practitioners must prioritize the collection of high-quality training data. This occurs upstream of model training.
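
A minimal sketch of such an upstream check might report how incomplete and how imbalanced a candidate training set is before any model sees it. The row layout, the `label` column, and the function name are hypothetical, and noise is not measured here because it usually needs domain-specific rules.

```python
from collections import Counter


def quality_report(rows, label_key):
    """Summarize missing values per column and class balance of the label."""
    if not rows:
        raise ValueError("rows must not be empty")

    total = len(rows)
    columns = sorted({key for row in rows for key in row})

    # Share of rows where a column is absent or None.
    missing_rate = {
        col: sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }

    # Class counts, ignoring rows with a missing label.
    label_counts = Counter(
        row[label_key] for row in rows if row.get(label_key) is not None
    )
    imbalance_ratio = (
        max(label_counts.values()) / min(label_counts.values())
        if label_counts else None
    )
    return {
        "missing_rate": missing_rate,
        "label_counts": dict(label_counts),
        "imbalance_ratio": imbalance_ratio,
    }


# Hypothetical training rows.
rows = [
    {"amount": 12.5, "label": "ok"},
    {"amount": None, "label": "ok"},
    {"amount": 40.0, "label": "fraud"},
    {"amount": 7.25, "label": "ok"},
]
report = quality_report(rows, label_key="label")
```

What counts as an acceptable missing rate or imbalance ratio depends on the problem, so any thresholds would need to be set per dataset.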

Data processing goes hand-in-hand with data quality. It is not trivial to do correctly. For example, Crux Informatics is a business that ingests and processes data from multiple sources and delivers the organized data to other businesses. Crux utilizes Bigeye to monitor this intense data pipeline at scale to ensure their customers receive the best possible data for their business needs.

Harnessing the power of features

How should data practitioners go about collecting high-quality training data? The solution lies in understanding the underlying data over time. In the ML/AI world, the data can be structured, unstructured, or both. Structured data means that the values are tabulated and formatted in a relational database; unstructured data, such as audio and video files, is not organized into tables. Data observability on structured data is straightforward, whereas unstructured data requires more work to monitor. In either case, creating data features goes a long way toward establishing data observability.

Features are derived attributes that encapsulate relevant information from the input data, transforming unstructured information into a structured format. By extracting and incorporating features into ML/AI pipelines, practitioners gain insights that facilitate monitoring, analysis, and decision-making processes. These features can include file size, number of words, locational information, statistical measures, domain-specific indicators, or engineered attributes that capture specific patterns or relationships within the data. The utilization of features not only enhances model performance but also enables practitioners to gain a deeper understanding of the underlying data and its impact on ML/AI outcomes.
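
To make the idea concrete, here is a small sketch that turns an unstructured text document into one structured row of features (size, word count, and a simple vocabulary ratio). The feature names and the `text_features` helper are illustrative assumptions; audio or video would need different extractors.

```python
def text_features(doc_id, text):
    """Derive a structured feature row from one unstructured text document."""
    words = text.split()
    unique_words = {word.lower() for word in words}
    return {
        "doc_id": doc_id,
        # Size of the document when encoded as UTF-8.
        "size_bytes": len(text.encode("utf-8")),
        "word_count": len(words),
        # Share of distinct words; 0.0 for an empty document.
        "unique_word_ratio": len(unique_words) / len(words) if words else 0.0,
    }


# Hypothetical documents; each becomes one row that can be stored in a table.
documents = {
    "doc-1": "the cat saw the dog",
    "doc-2": "",
}
feature_rows = [text_features(doc_id, text) for doc_id, text in documents.items()]
```

Once the rows live in a table, the same monitoring that works for structured data can be pointed at them.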

For example, in my own work, I set up an experiment for a new type of software automation. To accomplish this, I built the appropriate data pipeline, collected the necessary data, and developed the algorithm. I then created a set of data features and set up daily monitoring for them. One feature metric is shown here; its purpose is to track the number of duplicate values across all of the data available over time. Overall, I expected this feature metric, and the other feature metrics, to be stable so that the modeling outcome behaves similarly over time.
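
A duplicate-count metric of this kind could be computed along the following lines. This is only a sketch: the daily snapshot layout and the definition of a duplicate as every extra copy beyond the first are assumptions, not the exact metric used in the experiment.

```python
from collections import Counter


def duplicate_count(values):
    """Count extra copies: a value seen 3 times contributes 2 duplicates."""
    counts = Counter(values)
    return sum(count - 1 for count in counts.values() if count > 1)


def daily_duplicate_metric(snapshots):
    """Map each day to the duplicate count over all data available that day."""
    return {day: duplicate_count(values) for day, values in snapshots.items()}


# Hypothetical snapshots: each day holds every key available on that day.
snapshots = {
    "2023-06-01": ["a", "b", "c"],
    "2023-06-02": ["a", "b", "c", "d"],
    "2023-06-03": ["a", "b", "c", "d", "a", "b", "a"],
}
metric = daily_duplicate_metric(snapshots)
```

In this made-up series the metric stays at zero for two days and then jumps on the third, which is the kind of change daily monitoring is meant to surface.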

However, for a couple of weeks, the tables upstream of my data pipeline had an issue. At first, I did not know the root cause; the problem could have come from many things, including a data table change, a business shift, or a software outage. In all of these scenarios, the feature values start to look different. If these values are very different from what they were historically, it is very likely that the trained model will behave differently, and model performance will suffer as a result. By triaging and confirming with the owners of the upstream tables, I learned that there was indeed an upstream data pipeline issue; fortunately, the issue was mitigated and my feature metric eventually recovered to its historical baseline.
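
One simple way to flag a feature metric that has drifted from its historical baseline is a z-score rule, sketched below. The threshold of 3 standard deviations and the sample history are assumptions chosen for illustration; production monitoring typically uses more robust methods that account for trend and seasonality.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Return True if `latest` is far from the historical mean.

    `history` needs at least two points so a sample standard
    deviation can be computed.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is a deviation.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical daily values of one feature metric.
history = [100, 102, 98, 101, 99, 100, 103]
normal_day = is_anomalous(history, 101)
incident_day = is_anomalous(history, 160)
```

A flag like this only says that something changed; finding out whether the cause is a table change, a business shift, or an outage still requires triage with the upstream owners.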

In much modern causal inference and network modeling work, the question arises of how much each ML/AI feature contributes to the prediction. One of many approaches to answering this question is to use an implementation of SHapley Additive exPlanations, or SHAP values [Lundberg and Lee 2017]. If a model is retrained repeatedly over time, these feature importances will shift as the underlying data shifts. This means that ML/AI practitioners should periodically reassess SHAP values or other measures of their models. In other words, in the real world, data is expected to change due to business dynamics, data pipeline adjustments, instrument upgrades, and so on. Thus, it is essential to understand how feature values behave over time and to map any changes to shifting model components and model performance.
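
The sketch below computes exact Shapley values by brute force for a tiny model and compares them across two hypothetical retrains. It is a simplified stand-in, not the SHAP library: it uses a single baseline point instead of averaging over a background dataset, it enumerates every feature subset (so it only scales to a handful of features), and both models and all names are invented for the example.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x` relative to one baseline point."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their real value; the rest use the baseline.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Standard Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Two hypothetical versions of a model, e.g. before and after retraining.
def model_v1(v):
    return 2.0 * v[0] - 1.0 * v[1] + 0.5 * v[2] + 3.0


def model_v2(v):
    return 0.5 * v[0] - 1.0 * v[1] + 2.0 * v[2] + 3.0


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi_v1 = shapley_values(model_v1, x, baseline)
phi_v2 = shapley_values(model_v2, x, baseline)

# Per-feature change in contribution between the two retrains.
shift = [after - before for before, after in zip(phi_v1, phi_v2)]
```

For these linear models each contribution reduces to the weight times the feature's distance from the baseline, which makes the result easy to check by hand. Tracking `shift` alongside the feature metrics is one way to connect a change in the data to a change in what the model relies on.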

Final thoughts

In the ever-expanding world of ML/AI, the role of data cannot be overstated. The selection of input data and the framework for decision-making significantly influence the performance and success of ML/AI models.

Data practitioners must prioritize the acquisition of high-quality data and develop a solid understanding of the data landscape. By leveraging features derived from the input data, practitioners can unlock valuable data observability for structured and unstructured data. Through a meticulous approach to data, ML/AI models and their performance can become more manageable and interpretable.
