Jim Barker
Thought leadership
January 1, 2026

The House Of Data Series: Data Quality

50 min read

This paper focuses on the five dimensions of data quality, how to measure and monitor them, the role of data stewardship, and what data quality means for AI Trust. It does not cover pipeline architecture or data observability tooling in depth — those are addressed in the DataOps whitepaper.


House of Data Series

Every strong data program is built like a house. Data Architecture forms the foundation — the platforms, pipelines, and operating model that everything else depends on. Seven domain pillars rise from that foundation, each one essential to a complete data program: Data Quality, Privacy, Data Security, DataOps, Compliance, Data Enablement, and Data Consumption. Data Literacy runs across all seven as a connecting beam, ensuring people at every level can read, interpret, and act on data. At the top, People & Leadership sets the direction, accountability, and culture that holds the whole structure together.

This series of whitepapers covers each component of the House of Data in depth. Each paper was written by a practitioner with direct experience in that domain. Together, they form a practical guide to building data programs that earn — and keep — trust.

This paper covers Data Quality — the first pillar of the House of Data, and the one that underpins every other capability. Without accurate, complete, and consistent data, pipelines can't be trusted, analytics mislead rather than inform, and AI systems encode problems rather than solve them. Data quality isn't a cleanup task. It's a continuous discipline.

Introduction

Data has become foundational to how organizations operate, compete, and scale. It powers financial reporting, customer engagement, supply chains, regulatory compliance, analytics, and increasingly, automated decision-making through AI. As reliance on data has grown, so has the cost of getting it wrong.

Despite modern data platforms and significant investment, many organizations still struggle with a basic question: can we trust the data we are using to run the business? Conflicting reports, unexpected pipeline failures, and late-stage remediation remain common. In AI-driven use cases, these challenges are amplified. Poor data quality no longer results only in flawed insights, but in automated outcomes that are difficult to detect, explain, or reverse.

Data quality is not about making all data perfect. It is about ensuring that data is fit for business use. That means defining what "good" looks like for the data that matters most, detecting when data falls below acceptable thresholds, addressing issues efficiently, and preventing the same failures from recurring. When data meets those expectations, it should be explicitly recognized as suitable for use.

This paper examines eleven focus areas that, in practice, have the greatest impact on improving data quality outcomes. These areas span organizational roles, identification of critical data, definition of quality expectations, visibility into data health, operational response to issues, and transparency for leadership. Together, they form a practical path from reactive data cleanup toward proactive, scalable data quality management.

Data quality

Data quality defined

Data quality is a core component of successful data programs. To get value out of data, it must be "fit for business" or "fit for purpose." The goal is to understand whether data meets business readiness thresholds. When it is not fit for purpose, fix it. When it does meet the business need, certify it as "good for use."

DAMA definition: The planning, implementation, and control of activities that apply data quality management techniques to data in order to assure it is fit for consumption and meeting the needs of data consumers. (Source: DMBOK, 2017)

Gartner definition: Data quality refers to the usability and applicability of data used for an organization's priority use cases, including AI and machine learning initiatives. Data quality is usually one of the goals of effective data management and data governance. Yet too often organizations treat it like an afterthought.

McGilvray definition: The degree to which information and data can be a trusted source for any and/or all required uses.

Practical definition: Data quality is the level at which data can be trusted to be used for the purpose in which it was collected. It needs to be addressed while in-flight and at-rest, and all users of the data should follow the mantra of "See It, Say It, and Sort It."

Data quality terms of interest

The following vocabulary forms the working foundation for any serious data quality program. Each term represents a concept teams will encounter when building, running, or improving quality capabilities.

Term Definition
At-rest data quality Reviewing data already resident in your systems and determining whether it meets "business fit" or needs special attention.
In-flight data quality As data quality programs mature, fixing data only after a problem occurs becomes unacceptable. Rules are added to prevent low-quality data from being loaded, rejecting those rows or stopping processes entirely. Addressing data inside data pipelines is known as in-flight data quality.
Data quality firewall The concept of putting strong edits at the entry point of data as it is created, whether on screens, in apps, or in the import of data from external sources. The firewall says: "We will stop bad data from ever entering our systems by whatever means necessary."
Circuit breakers A technical concept: when a process encounters a failure (an exception record or too many errors), it is stopped from continuing. Think of this like a circuit breaker in a home — if someone overloads a circuit, the breaker trips and cuts power until the situation is remedied.
Data quality rules The formal business definition of what "business fit" means for a given data element. You won't have rules on all fields, but the critical ones. The business rule is then used to build out additional data quality checks or pipeline logic to meet the expectations of the business.
Data quality profiles The output of a process that shows aggregate results of a data set. Can include database data type, inferred data type, number of nulls, number of duplicates, average length, patterns, and sample values. Provides operational metadata to make business and technical decisions.
De-duplication The act of reviewing a data set and determining values that are replicated and should not be. De-duplication removes those duplicates to generate a more usable and reliable set of data.
Householding The idea of de-duplicating customer records for a given address. Allows more efficient correction of data, saves money on direct mail, and demonstrates greater understanding of your customers. A specialized form of de-duplication and data cleansing.
Data quality dimensions A way of classifying data quality checks to demonstrate the health of an organization's data, report on improvement, and build more solid data quality rules. Consider completeness, conformity, consistency, accuracy, timeliness, and uniqueness.
Data quality dashboards Brings together details for executive reporting that shows the overall health of datasets. Illustrates how the investment in data quality is being used and the level of improvement. A key device used to bring people together and increase teaming across the organization.
Data quality reports A wide range of reports used to improve data quality. These can be tactical (showing records with bad data), summary at the functional level, or built for executive illustration. One of the most impactful tools data quality professionals have to show where things are, where they're going, and how they're progressing.
Data quality alerts A signal on data quality challenges that have been encountered. Typically integrated with direct messaging, and can fire at a single record or on groups of records. They alert someone about something bad and provide the opportunity to fix it.
Critical Data Elements (CDEs) Data fields that are very important for a business process. If a CDE field has a data quality issue, very bad things happen. Think of shipping address: if you don't have street address, city, and postal code, a product can't be received. Those address fields are CDEs.
Critical Data Objects Structures, such as a file or table, that hold one or more CDEs. It's helpful to know what tables hold CDEs so firms can collect, save, and use the metadata that brings these together.
Data observability checks Data observability software can execute data quality rules, but more than that, it can find unknown unknowns — using machine learning to find anomalies in data without having to define rules upfront. A more recent device that helps expedite finding and addressing data quality issues.
Data quality at its source The idea of fixing data quality at its source — an ERP application, a SaaS business application, or a raw data file. Rather than fixing data inside a pipeline or at the target, edits and controls are used to stop data from entering the data ecosystem in the first place.
Data quality edits at its source A particular tool for data quality at its source: putting in an edit that requires data to be complete before a record is saved. The idea is to help the person entering the data do it right in the first place. Closely related to data quality firewalls.
Reference data for data quality The set of data used to assist in data entry, find corresponding values, and generate data that is aligned across records, systems, and processes.
Data quality policies The formal rules established on how data will work inside your organization. Can be relative to systems, tables, columns, or classes of data. Examples: all tax IDs need one of three formats; all data is retained for seven years; data destruction rules apply when data is archived or deleted.
Data drift Over time, data will become less reliable. Data drift is that notion — for example, needing to verify financial or address data every 18 months to make sure your data stays current. Also referred to as data decay.
Data cleansing The act of taking data in your system and making it right. This can be done through rules, through software, or through manual efforts such as data triage.
Root cause analysis Identifying the real reason something fails. It's fine to determine that something is wrong, but identifying why it's wrong allows you to improve overall data quality and learn from mistakes for the future.
Data validation The review of data from any given process, not just at the end but at each step of the journey. Heavily used in data migration efforts. Typically explores number of records passed, number failed, matching data between steps, and expected calculations.
Data verification The idea of reviewing data at its final source and verifying that it's fit for purpose and correct. Often used in data migration projects. While data validation reviews the outcome of a process step by step, data verification focuses on simply the final outcome.
VOC (Voice of the Customer) The idea of asking the people using data what works, what doesn't work, and how it can be made better. Rather than a "they'll take what we give them" posture, we ask how to be better. It's critical in data quality to get the consuming public to provide constructive criticism to find places to improve.
Five why's The idea that if you ask why a data quality challenge exists five times, you get to the heart of the answer. Based on six sigma and continuous improvement, it provides a great way to consider the driving factors of data quality challenges.
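
To make "de-duplication" and "householding" concrete, here is a minimal sketch in Python. All names and the normalization rule are illustrative assumptions, not taken from any specific tool: records sharing a normalized street and postal code are grouped into one household, and one surviving record is kept per group.

```python
from collections import defaultdict

def normalize_address(record):
    """Build a simple householding key: normalized street + postal code."""
    street = record["street"].strip().lower().replace(".", "")
    return (street, record["postal_code"].strip())

def household(records):
    """Group customer records that share an address into one household each."""
    households = defaultdict(list)
    for rec in records:
        households[normalize_address(rec)].append(rec)
    # Keep the first record per household as the surviving "golden" record.
    return [members[0] for members in households.values()]

customers = [
    {"name": "Ann Lee",  "street": "12 Oak St.", "postal_code": "02140"},
    {"name": "Bob Lee",  "street": "12 oak st",  "postal_code": "02140"},
    {"name": "Cara Day", "street": "9 Elm Ave",  "postal_code": "02139"},
]

deduped = household(customers)  # two households survive from three records
```

Real householding tools apply far richer matching (fuzzy names, address standardization, survivorship rules); the sketch only shows the core grouping idea.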

Focus areas of data quality

Each of those terms is important to data quality, and each could take several pages to fully explore. This paper dives into eleven of them -- the eleven most critical to supporting data quality progress in an organization.

  1. People
  2. Critical Data Objects and Critical Data Elements (CDEs)
  3. Defining data quality dimensions
  4. Defining a data quality process
  5. Developing DQ rules for CDEs
  6. Profiling CDEs
  7. Building DQ reports (data at-rest)
  8. Rolling out a data triage process for DQ
  9. Piloting in-flight DQ checks
  10. Implementing a business process for data circuit breakers
  11. Rolling out a "State of Data" report for leadership

1. People

People are the key to data programs, and the data quality focus area is no different. Three main points about people should be understood.

Everyone has a role to play in data quality.

  • Leaders (business, data, and technical) need to understand the importance of data in their organization and how critical good data is for operations, AI, digital transformation, analytics, and execution. Leaders need to pay attention to data quality efforts, set expectations, support funding requests, and have the ability to align resources to prevent "garbage in, garbage out" challenges.
  • Data analysts need to understand the importance of good data for the things they do. Rather than simply complaining that data is bad, they should provide specifics, speak up when they see a problem, and do their part to support those trying to improve data.
  • Data engineers should think about data quality in everything they do when designing, building, and maintaining solutions. Create solutions that help with data quality, ask for business input related to quality, and do what is necessary for solutions with data quality at their core.
  • Data quality specialists work to bring people together, listen to input, build solutions, and promote the help they receive. They should not be afraid to ask for help, to teach others, and to raise the profile and benefits of high-quality data.
  • Data stewards bring it all together and promote data quality activities across the organization. They build out teams across the organization that can walk and talk the benefits of data quality and genuinely believe in a community of practice working together for better data to run a better business.

See It, Say It, Sort It.

Establish a system that provides a mechanism to record issues as you find them, brings people together, and gives people the help they need. Build out the notion that if you see something, learn something, or need something, you write it down. The worst problems encountered in business occur when someone knows something is a problem but does not take the time to share it with others. As you see something wrong, write it down ("say it") and then get it fixed ("sort it"). This makes all the difference.

Who's who in data quality.

Provide a register that is easy to find and tells people who the experts are for data quality, and for all of data. This would include references to functional expertise, technical capabilities, business process, audit concerns, and overall execution. Having the organization set up so you know who people are, recording challenges, and understanding that everyone has a role to play in data will help you progress forward. Get the people right and everything is easier.

2. Critical Data Objects and Critical Data Elements

For years, data governance practitioners have spoken of CDEs, or Critical Data Elements. This concept is of particular importance for data quality, but it helps to take it a step further. The concept of CDEs is actually two different things:

  • Critical Data Elements (CDEs) are the fields in files or tables that are critical to business processes and analytics. If they are wrong, the business is in serious trouble.
  • Critical Data Objects are the tables or files that hold CDEs. It is only with an understanding of both that firms can manage data quality work effectively.

As part of your data program, record what your CDEs are. Use this to manage the work of getting and keeping data of high quality and report on the level of data quality for these critical objects. Firms can have millions of columns in their data, but a smaller number will be critical. Know the difference. Focusing first on CDEs can be a game changer for all organizations.

3. Defining data quality dimensions

Data quality dimensions have been at the core of data management for the last 25 years. It is important to define what your data quality dimensions are and use them for building out better data quality rules, reporting on the quality of your data, and building improvement plans based on those details.

The typical six dimensions most data experts start with are listed below in the recommended sequence to address.

Dimension Description Example
Completeness* A verification that all necessary data elements are populated and all mandatory checks have passed baseline criteria. Completeness can also include results of other DQ dimensions. An address record has the street, city, state, country, and postal code so that a product or document can be sent via post.
Conformity Data that is maintained in the correct format or pattern required for its purpose. All phone numbers are in the correct format, e.g. ###-###-####.
Consistency Data is consistent across columns, conforming when considering the values in other related fields. The postal code for a given address is correct based on the country of the address.
Uniqueness Data is unique within this system for a given entity. There are no inappropriate duplicate records. The given records for a company are free of duplicates and redundancies — you have one Beta Corp in Belmont, CA.
Timeliness Data is available and maintained in the system within the time necessary for use. For a product to be shipped, all aspects of the order — customer, product, units, and regulatory information — are available within 72 hours of order completion.
Accuracy Data is correct, appropriate, verified, and up to date. It works for its needs and meets business context. These are the toughest rules to implement. The record is reviewed and common sense is applied to say "yes, that is correct" based on deep understanding of the data set.

*Completeness is often addressed first and last -- first for critical elements to make a complete record, and last to bring together aspects of other DQ dimension details.

While this is the most common list to work with, there are many other ways to look at this. The DAMA organization in the Netherlands performed a project that categorized a wider variety of data quality dimensions. The conceptual groupings below can provide insight when a firm works to define its own dimensions. (Source: DAMA-NL, Dimensions of Data Quality, 2020)

Category Dimensions
Accessibility Accessibility, Ease of Access, Security
Contextual Appropriate Amount of Data, Completeness, Relevancy, Timeliness, Value-Added
Intrinsic Accuracy, Believability, Objectivity, Reputation
Representational Concise Representation, Consistent Representation, Ease of Understanding, Interpretability

When defining data quality dimensions for your organization, right-size the list. Find a set comprehensive enough to meet your needs, but small enough to socialize across the organization. The goal is to use these dimensions to drive improvement.
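
As a concrete illustration of dimensions as measurements, scores can be computed as simple pass rates over a set of records. This is a sketch under assumed field names and rules (phone format, email presence), not a prescribed implementation:

```python
import re

records = [
    {"phone": "617-555-0101", "email": "a@x.com"},
    {"phone": "6175550102",   "email": "b@x.com"},
    {"phone": "617-555-0103", "email": None},
]

def completeness(records, field):
    """Completeness dimension: share of records where the field is populated."""
    return sum(1 for r in records if r.get(field)) / len(records)

def conformity(records, field, pattern):
    """Conformity dimension: share of populated values matching the format."""
    values = [r[field] for r in records if r.get(field)]
    return sum(1 for v in values if re.fullmatch(pattern, v)) / len(values)

email_completeness = completeness(records, "email")                     # 2/3
phone_conformity = conformity(records, "phone", r"\d{3}-\d{3}-\d{4}")   # 2/3
```

Scores like these, computed per column and aggregated per table, are exactly what feeds the dimension reports and dashboards discussed later in this paper.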

4. Defining a data quality process

Data quality often gets a bad reputation and can be difficult to move forward. Experience has shown that by understanding a common process and getting people to work together and follow it, data quality can improve in an efficient manner.

Danette McGilvray wrote the seminal book on data quality process, Executing Data Quality Projects. The great gift to the data industry from that work was "The Ten Step Process."

[Diagram: the original Ten Step Process]

These ten steps have evolved in recent years with more focus on data catalogs and data observability. The evolved version is more prescriptive and leads to more proactive action: automated data observability checks are implemented first and expanded to custom checks when necessary.

5. Developing DQ rules for CDEs

Documenting the business definition of what "business fit" means for a given field is very important. That does not mean a firm should invest time in defining the data quality rule for every field in their systems. Common sense must rule. Most firms will start by documenting their CDEs and potentially expand from there.

This task involves meeting with subject matter experts for a field, reviewing the data and data profile when available, and documenting the specifics of a rule: what makes a field meet the definition of "business fit" for your organization. The DQ rule should include:

  1. Field name -- the technical name of the field in the file or database
  2. Business name (if different) -- the business common name of the field
  3. Data type -- the technical detail of the field (string, number, date, etc.)
  4. Data size/scale -- the length of the data field
  5. Business definition -- describes what the field holds from a business perspective
  6. Business specification -- one or two sentences describing what this data field needs to contain

An example business DQ rule:

Field attribute Value
Field name Address1
Business name Address Line 1
Data type String (Varchar2)
Data size/scale 80
Business definition This field holds the first/main address line for a customer address.
Business specification This field must not start with a 0, should be longer than 5 characters, and should include a house number and a street. Note: some addresses deviate slightly, so warnings are more appropriate than hard errors.
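
The Address1 specification above translates directly into an executable check. This is a hedged sketch — warning-level rather than hard-failing, per the note in the rule, and the regular expression is one assumed interpretation of "house number and a street":

```python
import re

def check_address1(value):
    """Evaluate the Address1 business rule; return a list of warnings."""
    warnings = []
    if not value:
        return ["missing value"]
    if value.startswith("0"):
        warnings.append("must not start with 0")
    if len(value) <= 5:
        warnings.append("should be longer than 5 characters")
    # A house number followed by a street name, e.g. "12 Oak St".
    if not re.match(r"\d+\s+\S+", value):
        warnings.append("should include a house number and a street")
    return warnings

clean = check_address1("12 Oak Street")   # no warnings
ws = check_address1("0 Elm")              # two warnings fire
```

Returning warnings rather than raising errors keeps legitimate edge-case addresses flowing while still surfacing them for review.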

6. Profiling CDEs

Profiling is technology that, with the click of a button, lets an end user read a table or file and get a report summarizing the characteristics of that data source. This typically includes:

  1. Sample data and frequency
  2. Number of nulls
  3. Number of probable duplicates
  4. System data type, size, and scale
  5. Inferred data type, size, and scale
  6. Patterns and frequency
  7. Minimum and maximum of data value
  8. Minimum and maximum length of data value
  9. Average length of data value
  10. Average value of data field

Use profiling technology to profile each table or file that holds CDEs. By running the profile and paying particular attention to data patterns, nulls and duplicates, and sample values, the writing of data quality rules can be expedited.

This profiling activity helps answer key questions: Is this data really populated? What is the general hygiene of this data? What are the formats of the data, and does it need more cleanup? By increasing the understanding of CDEs, teams are set up for a more robust definition of CDE DQ rules.
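
A minimal at-rest profile covering several of the attributes listed above can be computed with nothing but the standard library. This is a sketch — real profiling tools add type inference, sampling, and scale — and the digit/letter pattern convention (digits become 9, letters become A) is a common but assumed notation:

```python
from collections import Counter
import re

def profile(values):
    """Compute a basic column profile: nulls, duplicates, lengths, patterns."""
    populated = [v for v in values if v is not None]
    lengths = [len(str(v)) for v in populated]
    # Generalize each value into a pattern: digits -> 9, letters -> A.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v))) for v in populated
    )
    return {
        "count": len(values),
        "nulls": len(values) - len(populated),
        "duplicates": len(populated) - len(set(populated)),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "top_pattern": patterns.most_common(1)[0][0],
    }

p = profile(["02140", "02139", "02140", None, "2139A"])
# The outlier pattern "9999A" stands out against the dominant "99999".
```

A profile like this immediately answers the hygiene questions above: the null count flags population gaps, and any value whose pattern deviates from the dominant one is a candidate for a conformity rule.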

7. Building DQ reports (data at-rest)

There is an important need that is often missed: the development of reports that support data quality initiatives. This does not need to be done in a dedicated data quality tool -- it is most effective when done with resident BI and analytics software packages. These reports tend to fall into three categories:

  1. Tactical spreadsheet-level reports that illustrate which rows of data have issues and what those issues are. These reports are for data experts and act like audit reports that say what is wrong and what needs to be done to fix it.
  2. DQ dimension reports -- graphical reports that show where data stands across tables or functional boundaries. They show overall trending and help illustrate where more work is needed.
  3. Executive-level reports that show the current state and the historical reference, designed to show leadership where things were, where they are now, and what the focus is moving forward.

These reports give data teams what they need to action DQ situations and are often the basis of capabilities inside DQ dashboards.

[Screenshots: executive-level report, DQ dimension report, and tactical audit report examples]

8. Rolling out a data triage process

A critical part of data quality is the build-out of data stewardship and functional support for data. As your data quality program grows, there will be needs to fix data in operational systems. Some practitioners refer to this as "shifting left." The idea is to fix data at the source when possible so it does not need to be fixed and patched across multiple data pipelines.

The roll-out of a data triage process typically includes:

  1. A help desk notification or workflow that captures requests and allows assignment to people who can address those challenges
  2. Aging reports that show outstanding requests approaching or exceeding service level agreement timeframes
  3. Tooling so that business users can make the necessary updates in a timely manner, reducing the lift required to make changes
  4. Reporting that generates volumetrics to track requests, completions, and timeframe to do the work
  5. A program to recognize top performers who provide support for data triage activities
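
The aging report in step 2 can be sketched as simple SLA bucketing over open requests. The SLA length, the 80% "approaching" threshold, and the request IDs are all hypothetical:

```python
from datetime import date

SLA_DAYS = 10  # hypothetical service level agreement

def aging_report(requests, today):
    """Bucket open triage requests by how close they are to breaching SLA."""
    report = {"within_sla": [], "approaching": [], "breached": []}
    for req in requests:
        age = (today - req["opened"]).days
        if age > SLA_DAYS:
            report["breached"].append(req["id"])
        elif age > SLA_DAYS * 0.8:
            report["approaching"].append(req["id"])
        else:
            report["within_sla"].append(req["id"])
    return report

requests = [
    {"id": "DQ-101", "opened": date(2026, 1, 2)},
    {"id": "DQ-102", "opened": date(2026, 1, 5)},
    {"id": "DQ-103", "opened": date(2026, 1, 12)},
]
report = aging_report(requests, today=date(2026, 1, 14))
```

The "approaching" bucket is what makes the report proactive: it gives stewards a chance to act before a request becomes an SLA breach rather than after.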

Note: some firms get caught in a cycle of pushing these requests into help desk software like ServiceNow or JIRA. It is worth asking whether those platforms are the right ones for this type of activity, or whether the work should happen in software packages closer to the users of data.

9. Piloting in-flight DQ checks

In-flight DQ checks are the technical processes built inside data pipelines that use a variety of techniques to identify when data being processed fails to meet the established DQ business rules. These processes are built with capabilities including reference data and data quality or data observability software, and they generate defects and restrict loading if data is not fit for purpose.

The general process: data is read from a source system, a DQ check runs for validity, data that passes is transformed and loaded, while discards are flagged separately before the data reaches the target system.
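
That flow can be sketched as a pipeline step that routes each row either onward to the target or into a discard stream, based on a DQ rule. The order rule and row shapes here are illustrative assumptions:

```python
def in_flight_check(rows, rule):
    """Split incoming rows into loadable rows and flagged discards."""
    loadable, discards = [], []
    for row in rows:
        problem = rule(row)
        if problem:
            discards.append({"row": row, "reason": problem})
        else:
            loadable.append(row)
    return loadable, discards

def order_rule(row):
    """Hypothetical DQ rule: an order must have a customer id and units > 0."""
    if not row.get("customer_id"):
        return "missing customer_id"
    if row.get("units", 0) <= 0:
        return "units must be positive"
    return None

rows = [
    {"customer_id": "C1", "units": 3},
    {"customer_id": None, "units": 2},
    {"customer_id": "C2", "units": 0},
]
loadable, discards = in_flight_check(rows, order_rule)
```

Capturing the reason alongside each discarded row is what makes the discard reports in the next section possible — a bare reject count tells business users nothing actionable.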

10. Implementing a business process for data circuit breakers

The in-flight DQ check is helpful, but what happens after is key. In most cases, building a business process that reviews data discards, modifies data in-flight or at the source, and notifies key personnel on challenges is critical.

As you build out your in-flight DQ solutions, remember the business users you support. Do not let the process hold data back without a resolution path. That pattern is a top reason business users lose confidence in data quality and in the technical teams that support them.

Key deliverables for implementing a business process for circuit breakers include:

  1. Building discard reports
  2. Circulating discard reports
  3. Providing oversight for discard follow-ups
  4. Reporting on exceptions and trends over time

Note: this area of business process for discard processing should be viewed as a specialty use case closely related to data triage.
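
A minimal circuit breaker can wrap an in-flight check like the one above: halt the load when the discard rate crosses a threshold, rather than silently loading a badly degraded batch. The 5% threshold and the rule are assumed examples:

```python
class CircuitBreakerTripped(Exception):
    """Raised when a batch fails too many DQ checks to load safely."""

def guarded_load(rows, rule, max_discard_rate=0.05):
    """Load rows unless the share failing the DQ rule trips the breaker."""
    discards = [r for r in rows if rule(r)]
    rate = len(discards) / len(rows)
    if rate > max_discard_rate:
        raise CircuitBreakerTripped(
            f"{rate:.0%} of rows failed DQ checks; halting load for review"
        )
    return [r for r in rows if not rule(r)]

def rule(row):
    return "missing id" if not row.get("id") else None

good_batch = [{"id": i} for i in range(1, 101)]
loaded = guarded_load(good_batch, rule)  # clean batch: all 100 rows load

bad_batch = good_batch[:90] + [{"id": None}] * 10
try:
    guarded_load(bad_batch, rule)
    tripped = False
except CircuitBreakerTripped:
    tripped = True  # 10% failure rate exceeds the 5% threshold
```

The exception is the technical half; the business half described above — who gets notified, who reviews the discards, and who resets the breaker — is what keeps a tripped load from becoming a stalled business process.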

11. Rolling out a "State of Data" report for leadership

As your data program moves forward and gains the attention of leaders, it is vital to include them and share progress. Transparency on data quality progress is critical to grow the trust and confidence of senior management.

Build a "State of Data" report that gets refreshed quarterly. It should cover a wide range of topics related to data, data governance, and data quality. In the data quality space, one or two slides that tell the story of data quality progress are recommended. Those slides often include:

  • Number of data quality actions identified
  • Number of data quality actions completed
  • Trending reports on the improvement of data by DQ dimension across quarters
  • Number of people trained on DQ topics
  • Next steps for the next 30, 60, and 90 days

The goal is to provide leadership with a consistent picture of where data stands each quarter, so leaders spend their time asking the right questions:

  • What are your top priorities for the next quarter?
  • What assistance do you need with resourcing and priorities?
  • What are the benefits from our recent investments?
  • What should I be most concerned about?
  • What is next?

Role of data quality in AI trust

AI is increasingly relied on to automate and accelerate business processes, whether through machine learning (ML), natural language processing (NLP), or generative AI (GenAI). But all of these models, just like traditional analytics, share one requirement: they need quality data.

A focus on quality both at-rest and in-flight is required to have trusted data and processes. Agents operating on low-quality data can provide incorrect outcomes, cause business interruptions, and risk the reputation of your enterprise.

It is critical to have continuous monitoring, regular review, and efficient stewardship strategies for creating and maintaining high-quality data that a firm can trust. You cannot trust AI without first trusting your data.

Bigeye's role in data quality

Bigeye, as a data observability tool, brings data quality forward in an actionable format. It addresses a large number of data quality capabilities, including data quality dimensions, checks, reports and dashboards, data profiling, alerts, and issue management.

| Capability | Description | Benefit |
| --- | --- | --- |
| Data quality dimensions | Bigeye gives each customer the ability to establish their own data quality dimensions — not a forced set, but dimensions configured to match the organization's needs. These can be used to define what checks are needed and to report a summary view of data health by dimension. | Organizations can align data quality reporting to their own standards rather than adapting to a rigid framework. |
| Data quality checks | With over 70 out-of-the-box metrics, a data quality analyst can set up approximately 80% of checks in minutes across a table or schema. These checks can also be embedded inside pipeline processes, and circuit breakers can be put into place to stop processing under certain conditions. | Rather than spending hours writing data quality rules from scratch, Bigeye covers the majority out of the box. Teams can focus manual effort on the custom business logic that requires it. |
| Data quality dashboards and reports | Inside Bigeye, a set of reporting widgets, reports, and dashboards are available. Additionally, Bigeye can surface its metrics inside your own BI tool, making the current state of data visible in the tools teams already use. | The combination of out-of-the-box reporting and BI flexibility supports data quality visibility at every level of the organization. |
| Data profiling | Bigeye offers data profiling capabilities that help accelerate the development of new data products with a focus on data quality, auto-generate metrics from the profile, and assist data stewards in cleaning up the data. | Profiling that used to take days of manual SQL work can be completed quickly, giving engineers a head start on data quality rules and pipeline design. |
| Data quality alerts | The Bigeye platform uses machine learning to establish thresholds and generate alerts when action is needed. It identifies real anomalies and filters out outliers that appear problematic but reflect normal business or technical operation. | Bigeye's alerting balances the need for notification with the concern about alert overload, so data teams act on what matters rather than chasing noise. |
| Issue management | Bigeye's issue resolution portal allows data teams to take action on data issues and helps teach the ML model for better decision-making in the future. Issues can be assigned, tracked, and resolved within the platform. | A unified place to collect, manage, and resolve issues removes the coordination overhead of tracking problems across email threads and spreadsheets. |
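The circuit-breaker pattern mentioned above can be illustrated with a short sketch. This is a generic pattern, not Bigeye's API; the field name and threshold are hypothetical.

```python
# Generic pipeline circuit-breaker sketch (illustrative only -- not Bigeye's
# API). A check runs before the load step; if it fails, processing halts so
# bad data never reaches downstream consumers.

def null_rate(rows, field):
    """Fraction of rows where `field` is missing or None."""
    if not rows:
        return 1.0
    return sum(1 for r in rows if r.get(field) is None) / len(rows)

class CircuitBreakerTripped(Exception):
    pass

def guarded_load(rows, max_null_rate=0.05):
    # The breaker: stop processing when the null rate exceeds the threshold.
    rate = null_rate(rows, "customer_id")  # hypothetical key field
    if rate > max_null_rate:
        raise CircuitBreakerTripped(
            f"customer_id null rate {rate:.1%} exceeds {max_null_rate:.1%}"
        )
    return rows  # in a real pipeline: write to the warehouse

batch = [{"customer_id": 1}, {"customer_id": None}, {"customer_id": 3}]
try:
    guarded_load(batch)
except CircuitBreakerTripped as e:
    print("halted:", e)
```

The point of the pattern is that a quality check becomes a hard gate rather than a report read after the fact.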

Summary

Data quality is critical to running your business on data. This paper covered a wide range of topics: from data quality dimensions and data quality checks to profiling, data triage, and the eleven focus areas that move programs from reactive to proactive. Trusted data is vital to your business, and data quality is where that trust is built.

Explore the Series

Every great data program is built from the ground up.

The House of Data breaks down the ten components of a mature, trustworthy data organization. Click any section to explore that paper.

Data Leadership Data Literacy Data Quality Privacy Data Security DataOps Compliance Data Enablement Data Consumption Data Architecture

References

Caballero, I., Verboon, N., & Piattini, M. (2020). Dimensions of data quality (Version 1.2). DAMA-NL. https://dama-nl.org/wp-content/uploads/2020/09/DDQ-Dimensions-of-Data-Quality-Research-Paper-version-1.2-d.d.-3-Sept-2020.pdf

Khatri, V., & Brown, C. V. (2010). Designing data governance. Communications of the ACM, 53(1), 148–152. https://doi.org/10.1145/1629175.1629210

Lee, Y. W., Strong, D. M., Kahn, B. K., & Wang, R. Y. (2002). AIMQ: A methodology for information quality assessment. Information & Management, 40(2), 133–146.

McGilvray, D. (2021). Executing data quality projects: Ten steps to quality data and trusted information (2nd ed.). Academic Press. https://www.amazon.com/Executing-Data-Quality-Projects-Information/dp/0128180153

Naumann, F. (2002). Quality-driven query answering for integrated information systems. Springer.

Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45(4), 211–218. https://doi.org/10.1145/505248.506010

Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. M. (2021). "Everyone wants to do the model work, not the data work": Data cascades in high-stakes AI. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3411764.3445518

Strong, D. M., Lee, Y. W., & Wang, R. Y. (1997). Data quality in context. Communications of the ACM, 40(5), 103–110. https://doi.org/10.1145/253769.253804

Taleb, I., Serhani, M. A., & Dssouli, R. (2024). A review of data quality dimensions. Procedia Computer Science, 232, 187–196. https://www.sciencedirect.com/science/article/pii/S187705092400365X

Trehan, A. (2024). An intelligent approach to data quality management: AI-powered quality monitoring in analytics. https://www.researchgate.net/publication/387298750

Wang, R. Y., & Strong, D. M. (1996). Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4), 5–33. https://doi.org/10.1080/07421222.1996.11518099

Zhang, Y., et al. (2022). Data quality challenges in deep learning. The VLDB Journal, 31, 1–23. https://doi.org/10.1007/s00778-022-00775-9

| Resource | Monthly cost ($) | Number of resources | Time (months) | Total cost ($) |
| --- | --- | --- | --- | --- |
| Software/Data engineer | 15,000 | 3 | 12 | 540,000 |
| Data analyst | 12,000 | 2 | 6 | 144,000 |
| Business analyst | 10,000 | 1 | 3 | 30,000 |
| Data/product manager | 20,000 | 2 | 6 | 240,000 |
| Total cost | | | | 954,000 |
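Each line in the staffing table is simply monthly cost x headcount x duration, summed to the grand total. A quick arithmetic check:

```python
# Verify the staffing-cost table: line total = monthly cost x headcount x months.
rows = [
    ("Software/Data engineer", 15_000, 3, 12),
    ("Data analyst",           12_000, 2, 6),
    ("Business analyst",       10_000, 1, 3),
    ("Data/product manager",   20_000, 2, 6),
]
line_totals = {name: cost * n * months for name, cost, n, months in rows}
grand_total = sum(line_totals.values())
print(line_totals["Software/Data engineer"])  # 540000
print(grand_total)                            # 954000
```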
| Role | Goals | Common needs |
| --- | --- | --- |
| Data engineers | Overall data flow. Data is fresh and operating at full volume. Jobs are always running, so data outages don't impact downstream systems. | Freshness + volume monitoring; schema change detection; lineage monitoring |
| Data scientists | Specific datasets in great detail. Looking for outliers, duplication, and other — sometimes subtle — issues that could affect their analysis or machine learning models. | Freshness monitoring; completeness monitoring; duplicate detection; outlier detection; distribution shift detection; dimensional slicing and dicing |
| Analytics engineers | Rapidly testing the changes they're making within the data model. Move fast and not break things — without spending hours writing tons of pipeline tests. | Lineage monitoring; ETL blue/green testing |
| Business intelligence analysts | The business impact of data. Understand where they should spend their time digging in, and when they have a red herring caused by a data pipeline problem. | Integration with analytics tools; anomaly detection; custom business metrics; dimensional slicing and dicing |
| Other stakeholders | Data reliability. Customers and stakeholders don't want data issues to bog them down, delay deadlines, or provide inaccurate information. | Integration with analytics tools; reporting and insights |
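Duplicate detection — one of the data-scientist needs in the table above — is among the simplest of these checks to sketch. A pure-Python illustration on a hypothetical order feed; in practice such checks run as SQL against the warehouse:

```python
# Duplicate detection on a natural key. The records and key name are
# hypothetical; a real check would group by the key in the warehouse.
from collections import Counter

orders = [
    {"order_id": "A-1", "amount": 40},
    {"order_id": "A-2", "amount": 15},
    {"order_id": "A-1", "amount": 40},  # same order loaded twice
]
counts = Counter(o["order_id"] for o in orders)
dupes = {key: n for key, n in counts.items() if n > 1}
print(dupes)  # {'A-1': 2}
```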

What are the five dimensions of data quality?

Completeness (are all expected records present?), conformity (does the data match the expected format or schema?), consistency (does the same value appear consistently across systems?), uniqueness (are there duplicates where there shouldn't be?), and timeliness (did the data arrive when it was supposed to?). Each dimension can fail independently, which is why checking one doesn't substitute for checking all five.
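Each dimension translates directly into a mechanical check. A minimal sketch, assuming a hypothetical customer-email batch and a second system (a CRM) holding the same field; real checks would run against the warehouse:

```python
# One illustrative check per dimension. Field names, the email-format rule,
# and the 24-hour freshness window are assumptions for this example.
import re
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
rows = [
    {"id": 1, "email": "a@example.com", "loaded_at": now},
    {"id": 2, "email": "b@example.com", "loaded_at": now - timedelta(hours=2)},
]
crm_emails = {1: "a@example.com", 2: "b@example.com"}  # same field in a second system

completeness = all(r["email"] for r in rows)                          # values present?
conformity = all(re.fullmatch(r"[^@]+@[^@]+\.[^@]+", r["email"]) for r in rows)  # format?
consistency = all(crm_emails[r["id"]] == r["email"] for r in rows)    # matches across systems?
uniqueness = len({r["id"] for r in rows}) == len(rows)                # no duplicate keys?
timeliness = all(now - r["loaded_at"] <= timedelta(hours=24) for r in rows)  # arrived on time?
print(completeness, conformity, consistency, uniqueness, timeliness)
```

Because each flag is computed independently, a batch can pass four checks and still fail the fifth — the independence the answer above describes.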

What is data stewardship and why does it matter for quality?

Data stewardship assigns ownership of specific datasets to specific people or teams. When a quality issue surfaces, stewardship answers the question "whose problem is this?" before anyone has to spend time figuring it out. Without clear ownership, even a well-instrumented quality program degrades into alert fatigue and unresolved incidents.

How does data quality affect AI?

AI models are trained on historical data and make predictions based on ongoing feeds. Quality issues that would surface quickly in a dashboard — a null rate that doubled, a schema field that changed — can silently corrupt a model's training data or inference inputs. The consequences compound over time. A model trained on six months of skewed data doesn't just give one wrong answer; it encodes that skew into its weights.
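The "null rate that doubled" example can be caught mechanically by comparing today's feed against a trailing baseline. A minimal sketch with made-up values and an illustrative doubling threshold:

```python
# Detect silent drift on a model input: alarm when today's null rate is
# more than double the historical baseline. Data and threshold are
# illustrative, not a production tuning recommendation.
def null_rate(values):
    return sum(v is None for v in values) / len(values)

baseline = [10, 12, None, 11, 13, 12, None, 10, 11, 12]        # ~20% nulls historically
today = [None, 12, None, None, 13, None, 11, 10, None, 12]     # nulls spiked

b, t = null_rate(baseline), null_rate(today)
drifted = t > 2 * b  # the "null rate doubled" alarm
print(f"baseline={b:.0%} today={t:.0%} drifted={drifted}")
```

A check this cheap, run daily on training and inference inputs, is what keeps a six-month skew from ever accumulating in the first place.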

about the author

Jim Barker

Director of Professional Services

Jim Barker is a lifelong data practitioner, industry thought leader, and passionate advocate for treating data as a strategic asset. With more than four decades of experience spanning data quality, governance, warehousing, migration, and architecture, Jim brings a rare blend of hands-on expertise and executive perspective to the evolving data landscape.

Jim’s journey in data began at just 14 years old. Since then, he has held leadership roles across organizations including Honeywell, Informatica, Thomson Reuters, Winshuttle (Precisely), Alation, nCloud Integrators, and Wavicle, contributing to advancements in data governance, migration methodologies, and enterprise data strategies. His work has included building global data quality programs, developing scalable governance frameworks, and driving innovation recognized across the industry.

His research and writing focus on lean data management, governance strategies, and the intersection of AI, data quality, and enterprise value creation.

Now at Bigeye as Director of Professional Services, Jim is energized by the company’s vision for data observability and its role in shaping the future of trusted data. He continues to share his perspectives through writing and speaking, aiming to elevate the conversation around data, cut through industry noise, and help organizations do data the right way.

Outside of work, Jim enjoys coaching and spending time with his family, often on the basketball court or soccer field, where many of the same lessons about teamwork, discipline, and leadership apply.

As Jim puts it: “Data matters.”

