The House Of Data Series: Data Quality
This paper focuses on the five dimensions of data quality, how to measure and monitor them, the role of data stewardship, and what data quality means for AI Trust. It does not cover pipeline architecture or data observability tooling in depth — those are addressed in the DataOps whitepaper.

House of Data Series
Every strong data program is built like a house. Data Architecture forms the foundation — the platforms, pipelines, and operating model that everything else depends on. Seven domain pillars rise from that foundation, each one essential to a complete data program: Data Quality, Privacy, Data Security, DataOps, Compliance, Data Enablement, and Data Consumption. Data Literacy runs across all seven as a connecting beam, ensuring people at every level can read, interpret, and act on data. At the top, People & Leadership sets the direction, accountability, and culture that holds the whole structure together.
This series of whitepapers covers each component of the House of Data in depth. Each paper was written by a practitioner with direct experience in that domain. Together, they form a practical guide to building data programs that earn — and keep — trust.
This paper covers Data Quality — the first pillar of the House of Data, and the one that underpins every other capability. Without accurate, complete, and consistent data, pipelines can't be trusted, analytics mislead rather than inform, and AI systems encode problems rather than solve them. Data quality isn't a cleanup task. It's a continuous discipline.
Introduction
Data has become foundational to how organizations operate, compete, and scale. It powers financial reporting, customer engagement, supply chains, regulatory compliance, analytics, and increasingly, automated decision-making through AI. As reliance on data has grown, so has the cost of getting it wrong.
Despite modern data platforms and significant investment, many organizations still struggle with a basic question: can we trust the data we are using to run the business? Conflicting reports, unexpected pipeline failures, and late-stage remediation remain common. In AI-driven use cases, these challenges are amplified. Poor data quality no longer results only in flawed insights, but in automated outcomes that are difficult to detect, explain, or reverse.
Data quality is not about making all data perfect. It is about ensuring that data is fit for business use. That means defining what "good" looks like for the data that matters most, detecting when data falls below acceptable thresholds, addressing issues efficiently, and preventing the same failures from recurring. When data meets those expectations, it should be explicitly recognized as suitable for use.
This paper examines eleven focus areas that, in practice, have the greatest impact on improving data quality outcomes. These areas span organizational roles, identification of critical data, definition of quality expectations, visibility into data health, operational response to issues, and transparency for leadership. Together, they form a practical path from reactive data cleanup toward proactive, scalable data quality management.
Data quality
Data quality defined
Data quality is a core component of successful data programs. To deliver value, data must be "fit for business" or "fit for purpose." The goal is to understand whether data meets business readiness thresholds. When it is not fit for purpose, fix it. When it does meet the business need, certify it as "good for use."
DAMA definition: The planning, implementation, and control of activities that apply data quality management techniques to data in order to assure it is fit for consumption and meeting the needs of data consumers. (Source: DMBOK, 2017)
Gartner definition: Data quality refers to the usability and applicability of data used for an organization's priority use cases, including AI and machine learning initiatives. Data quality is usually one of the goals of effective data management and data governance. Yet too often organizations treat it like an afterthought.
McGilvray definition: The degree to which information and data can be a trusted source for any and/or all required uses.
Practical definition: Data quality is the level at which data can be trusted for the purpose for which it was collected. It needs to be addressed while in-flight and at-rest, and all users of the data should follow the mantra of "See It, Say It, and Sort It."
Data quality terms of interest
The following vocabulary forms the working foundation for any serious data quality program. Each term represents a concept teams will encounter when building, running, or improving quality capabilities.
Focus areas of data quality
Each of those terms is important to data quality, and each could take several pages to fully explore. This paper dives into eleven of them -- the eleven most critical to supporting data quality progress in an organization.
- People
- Critical Data Objects and Critical Data Elements (CDEs)
- Defining data quality dimensions
- Defining a data quality process
- Developing DQ rules for CDEs
- Profiling CDEs
- Building DQ reports (data at-rest)
- Rolling out a data triage process for DQ
- Piloting in-flight DQ checks
- Implementing a business process for data circuit breakers
- Rolling out a "State of Data" report for leadership
1. People
People are the key to data programs, and the data quality focus area is no different. Three main points about people should be understood.
Everyone has a role to play in data quality.
- Leaders (business, data, and technical) need to understand the importance of data in their organization and how critical good data is for operations, AI, digital transformation, analytics, and execution. Leaders need to pay attention to data quality efforts, set expectations, support funding requests, and have the ability to align resources to prevent "garbage in, garbage out" challenges.
- Data analysts need to understand the importance of good data for the things they do. Rather than simply complaining that data is bad, they should provide specifics, speak up when they see a problem, and do their part to support those trying to improve data.
- Data engineers should think about data quality in everything they do when designing, building, and maintaining solutions. They should create solutions that help with data quality, ask for business input related to quality, and build solutions with data quality at their core.
- Data quality specialists work to bring people together, listen to input, build solutions, and promote the help they receive. They should not be afraid to ask for help, to teach others, and to raise the profile and benefits of high-quality data.
- Data stewards bring it all together and promote data quality activities across the organization. They build out teams across the organization that can walk and talk the benefits of data quality and genuinely believe in a community of practice working together for better data to run a better business.
See It, Say It, Sort It.
Establish a system that provides a mechanism to record issues as you find them, brings people together, and gives people the help they need. Build out the notion that if you see something, learn something, or need something, you write it down. The worst problems encountered in business occur when someone knows something is a problem but does not take the time to share it with others. As you see something wrong, write it down ("say it") and then get it fixed ("sort it"). This makes all the difference.
Who's who in data quality.
Provide a register that is easy to find and tells people who the experts are for data quality, and for data overall. This should include references to functional expertise, technical capabilities, business processes, audit concerns, and overall execution. Knowing who people are, recording challenges as they arise, and understanding that everyone has a role to play in data will help the organization move forward. Get the people right and everything is easier.
2. Critical Data Objects and Critical Data Elements
For years, data governance practitioners have spoken of CDEs, or Critical Data Elements. This concept is of particular importance for data quality, but it helps to take it a step further, because it actually covers two different things:
- Critical Data Elements (CDEs) are the fields in files or tables that are critical to business processes and analytics. If they are wrong, the business is in serious trouble.
- Critical Data Objects are the tables or files that hold CDEs. It is only with an understanding of both that firms can manage data quality work effectively.
As part of your data program, record what your CDEs are. Use this to manage the work of getting and keeping data of high quality and report on the level of data quality for these critical objects. Firms can have millions of columns in their data, but a smaller number will be critical. Know the difference. Focusing first on CDEs can be a game changer for all organizations.
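The register of CDEs described above can be as simple as a small structured list. The sketch below is a minimal, hypothetical illustration in Python; the object names, field names, and steward groups are invented for the example, not taken from any real system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CriticalDataElement:
    object_name: str   # the Critical Data Object (table or file) holding the field
    field_name: str    # the Critical Data Element itself
    owner: str         # steward group accountable for its quality

# Illustrative register entries -- a real one would live in a catalog or repository.
cde_register = [
    CriticalDataElement("customer", "customer_id", "customer-data-stewards"),
    CriticalDataElement("customer", "email", "customer-data-stewards"),
    CriticalDataElement("orders", "order_total", "finance-data-stewards"),
]

def cdes_for_object(register, object_name):
    """Return the critical fields recorded for one Critical Data Object."""
    return [c.field_name for c in register if c.object_name == object_name]

print(cdes_for_object(cde_register, "customer"))  # ['customer_id', 'email']
```

Even a register this small answers the two questions the section raises: which objects are critical, and which fields inside them must be managed first.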
3. Defining data quality dimensions
Data quality dimensions have been at the core of data management for the last 25 years. It is important to define what your data quality dimensions are and use them for building out better data quality rules, reporting on the quality of your data, and building improvement plans based on those details.
The typical six dimensions most data experts start with are listed below in the recommended sequence to address. Completeness is often addressed first and last: first for critical elements to make a complete record, and last to bring together aspects of other data quality dimension details.
While this is the most common list to work with, there are many other ways to look at this. The DAMA organization in the Netherlands performed a project that categorized a wider variety of data quality dimensions. The conceptual groupings below can provide insight when a firm works to define its own dimensions. (Source: DAMA DMBOK 2.0)
When defining data quality dimensions for your organization, right-size the list. Find a set comprehensive enough to meet your needs, but small enough to socialize across the organization. The goal is to use these dimensions to drive improvement.
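As a concrete illustration of using dimensions to drive measurement, two of the most common ones, completeness and uniqueness, can be scored with plain Python. The field values below are illustrative assumptions, not data from the paper.

```python
def completeness(values):
    """Fraction of records where the field is populated."""
    populated = [v for v in values if v not in (None, "")]
    return len(populated) / len(values)

def uniqueness(values):
    """Fraction of populated values that are distinct."""
    populated = [v for v in values if v not in (None, "")]
    return len(set(populated)) / len(populated)

# Illustrative sample: one missing value and one duplicate.
emails = ["a@x.com", "b@x.com", "a@x.com", None]
print(round(completeness(emails), 2))  # 0.75
print(round(uniqueness(emails), 2))    # 0.67
```

Scores like these, computed per dimension and per CDE, are the raw material for the dimension reports and trending discussed later in the paper.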
4. Defining a data quality process
Data quality often gets a bad reputation and can be a difficult area to advance. Experience has shown that when people understand a common process and work together to follow it, data quality can improve in an efficient manner.
Danette McGilvray wrote the seminal book on data quality process, Executing Data Quality Projects. The great gift to the data industry from that work was "The Ten Step Process." The original steps are:

These ten steps have evolved in recent years with more focus on data catalogs and data observability. The following evolution is more prescriptive and leads to more proactive action. Automated data observability checks are implemented first and expanded to custom checks when necessary.

5. Developing DQ rules for CDEs
Documenting the business definition of what "business fit" means for a given field is very important. That does not mean a firm should invest time in defining the data quality rule for every field in their systems. Common sense must rule. Most firms will start by documenting their CDEs and potentially expand from there.
This task involves meeting with subject matter experts for a field, reviewing the data and data profile when available, and documenting the specifics of a rule: what makes a field meet the definition of "business fit" for your organization. The DQ rule should include:
- Field name -- the technical name of the field in the file or database
- Business name (if different) -- the business common name of the field
- Data type -- the technical detail of the field (string, number, date, etc.)
- Data size/scale -- the length of the data field
- Business definition -- describes what the field holds from a business perspective
- Business specification -- one or two sentences describing what this data field needs to contain
An example business DQ rule:
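As a stand-in illustration, a hypothetical rule for a customer email field, following the six fields listed above, might be recorded and enforced like this. Every name and value here is invented for the example.

```python
import re

# Hypothetical DQ rule record following the structure described in this section.
dq_rule = {
    "field_name": "cust_email",
    "business_name": "Customer Email Address",
    "data_type": "string",
    "data_size": 254,
    "business_definition": "The primary email address used to contact the customer.",
    "business_specification": (
        "Must be populated for every active customer and match the pattern "
        "local-part@domain; placeholder values such as 'none@none.com' are not allowed."
    ),
}

def passes_rule(value):
    """Check one value against the specification above (illustrative)."""
    if not value or value.lower() == "none@none.com":
        return False
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value) is not None

print(passes_rule("jane.doe@example.com"))  # True
print(passes_rule("none@none.com"))         # False
```

The point of the structure is that the business specification is written in plain language first; the executable check is derived from it, not the other way around.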
6. Profiling CDEs
Profiling is technology that, with the click of a button, lets an end user read a table or file and get a report summarizing the characteristics of the data source. This typically includes:
- Sample data and frequency
- Number of nulls
- Number of probable duplicates
- System data type, size, and scale
- Inferred data type, size, and scale
- Patterns and frequency
- Minimum and maximum of data value
- Minimum and maximum length of data value
- Average length of data value
- Average value of data field
Use profiling technology to profile each table or file that holds CDEs. By running the profile and paying particular attention to data patterns, nulls and duplicates, and sample values, the writing of data quality rules can be expedited.
This profiling activity helps answer key questions: Is this data really populated? What is the general hygiene of this data? What are the formats of the data, and does it need more cleanup? By increasing the understanding of CDEs, teams are set up for a more robust definition of CDE DQ rules.
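The kinds of statistics a profiler produces can be sketched in a few lines of plain Python. This is a minimal illustration of the idea, not a substitute for profiling tooling; the phone-number sample is an invented example.

```python
from collections import Counter
import re

def profile_column(values):
    """Compute a minimal column profile: nulls, distincts, lengths, top pattern."""
    populated = [v for v in values if v is not None]
    lengths = [len(str(v)) for v in populated]
    # Generalize each value into a pattern: digits -> 9, letters -> A.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v))) for v in populated
    )
    return {
        "null_count": len(values) - len(populated),
        "distinct_count": len(set(populated)),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "top_pattern": patterns.most_common(1)[0][0],
    }

phones = ["555-0100", "555-0101", "5550102", None]
print(profile_column(phones))
```

Here the pattern summary immediately surfaces the hygiene question the section describes: most values follow `999-9999`, but one does not, which is exactly the kind of finding that sharpens a DQ rule.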
7. Building DQ reports (data at-rest)
There is an important need that is often missed: the development of reports that support data quality initiatives. This does not need to be done in a dedicated data quality tool -- it is most effective when done with resident BI and analytics software packages. These reports tend to fall into three categories:
- Tactical spreadsheet-level reports that illustrate which rows of data have issues and what those issues are. These reports are for data experts and act like audit reports that say what is wrong and what needs to be done to fix it.
- DQ dimension reports -- graphical reports that show where data stands across tables or functional boundaries. They show overall trending and help illustrate where more work is needed.
- Executive-level reports that show the current state and the historical reference, designed to show leadership where things were, where they are now, and what the focus is moving forward.
These reports give data teams what they need to action DQ situations and are often the basis of capabilities inside DQ dashboards.
[Screenshots: executive-level report, DQ dimension report, and tactical audit report examples]
8. Rolling out a data triage process
A critical part of data quality is the build-out of data stewardship and functional support for data. As your data quality program grows, there will be a need to fix data in operational systems. Some practitioners refer to this as "shifting left." The idea is to fix data at the source when possible so it does not need to be fixed and patched across multiple data pipelines.
The roll-out of a data triage process typically includes:
- A help desk notification or workflow that captures requests and allows assignment to people who can address those challenges
- Aging reports that show outstanding requests approaching or exceeding service level agreement timeframes
- Tooling so that business users can make the necessary updates in a timely manner, reducing the lift required to make changes
- Reporting that generates volumetrics to track requests, completions, and the time taken to complete the work
- A program to recognize top performers who provide support for data triage activities
Note: some firms get caught in a cycle of pushing these requests into help desk software like ServiceNow or JIRA. It is worth asking whether those platforms are the right ones for this type of activity, or whether the work should happen in software packages closer to the users of data.
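The aging report in the deliverables list above can be sketched simply. The ticket IDs, SLA length, and dates below are illustrative assumptions.

```python
from datetime import date

SLA_DAYS = 10  # illustrative service level agreement for triage requests

def aging_report(open_requests, as_of):
    """Return (request_id, age_in_days) for requests at or past the SLA, oldest first."""
    breaches = [
        (req_id, (as_of - opened).days)
        for req_id, opened in open_requests
        if (as_of - opened).days >= SLA_DAYS
    ]
    return sorted(breaches, key=lambda r: r[1], reverse=True)

open_requests = [
    ("DQ-101", date(2024, 5, 1)),
    ("DQ-102", date(2024, 5, 12)),
    ("DQ-103", date(2024, 5, 14)),
]
print(aging_report(open_requests, as_of=date(2024, 5, 15)))
# [('DQ-101', 14)] -- only DQ-101 has aged past the 10-day SLA
```

Whatever platform hosts the workflow, the underlying logic is the same: compare open-request age against the agreed SLA and surface the breaches.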
9. Piloting in-flight DQ checks
In-flight DQ checks are the technical processes built inside data pipelines that use a variety of techniques to identify when data being processed fails to meet the established DQ business rules. These processes are built with capabilities including reference data and data quality or data observability software, and they generate defects and restrict loading if data is not fit for purpose.
The general process: data is read from a source system, a DQ check runs for validity, data that passes is transformed and loaded, while discards are flagged separately before the data reaches the target system.
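The pass/discard split at the heart of that process can be sketched as follows. The rule and records are illustrative; real implementations would sit inside pipeline or observability tooling.

```python
def run_inflight_check(rows, rule):
    """Split incoming rows into those that pass the DQ rule and those discarded."""
    passed, discards = [], []
    for row in rows:
        (passed if rule(row) else discards).append(row)
    return passed, discards

# Illustrative rule: order_total must be present and non-negative.
def order_total_rule(row):
    return row.get("order_total") is not None and row["order_total"] >= 0

rows = [{"order_total": 25.0}, {"order_total": -3.0}, {"order_total": None}]
passed, discards = run_inflight_check(rows, order_total_rule)
print(len(passed), len(discards))  # prints: 1 2
```

The discards are not silently dropped; they become the input to the circuit-breaker business process described in the next focus area.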

10. Implementing a business process for data circuit breakers
The in-flight DQ check is helpful, but what happens after is key. In most cases, building a business process that reviews data discards, modifies data in-flight or at the source, and notifies key personnel of challenges is critical.
As you build out your in-flight DQ solutions, remember the business users you support. Do not let the process hold data back without a resolution path. That pattern is a top reason business users lose confidence in data quality and in the technical teams that support them.
Key deliverables for implementing a business process for circuit breakers include:
- Building discard reports
- Circulating discard reports
- Providing oversight for discard follow-ups
- Reporting on exceptions and trends over time
Note: this area of business process for discard processing should be viewed as a specialty use case closely related to data triage.
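A minimal sketch of the circuit-breaker decision itself: if a batch's discard rate exceeds an agreed threshold, the load halts and someone is notified, rather than partial data landing silently. The threshold value and notify hook are illustrative assumptions.

```python
DISCARD_THRESHOLD = 0.05  # illustrative: trip the breaker above a 5% discard rate

def circuit_breaker(passed_count, discard_count, notify):
    """Return True if the batch is safe to load; otherwise halt and notify."""
    total = passed_count + discard_count
    discard_rate = discard_count / total if total else 0.0
    if discard_rate > DISCARD_THRESHOLD:
        notify(f"Load halted: discard rate {discard_rate:.1%} exceeds threshold")
        return False  # do not load; route discards to the review process
    return True       # safe to load

alerts = []
ok = circuit_breaker(passed_count=900, discard_count=100, notify=alerts.append)
print(ok, alerts)
# False ['Load halted: discard rate 10.0% exceeds threshold']
```

The important design point from this section is the resolution path: tripping the breaker must trigger review and follow-up, not just block data indefinitely.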
11. Rolling out a "State of Data" report for leadership
As your data program moves forward and gains the attention of leaders, it is vital to include them and share progress. Transparency on data quality progress is critical to grow the trust and confidence of senior management.
Build a "State of Data" report that gets refreshed quarterly. It should cover a wide range of topics related to data, data governance, and data quality. In the data quality space, one or two slides that tell the story of data quality progress are recommended. Those slides often include:
- Number of data quality actions identified
- Number of data quality actions completed
- Trending reports on the improvement of data by DQ dimension across quarters
- Number of people trained on DQ topics
- Next steps for the next 30, 60, and 90 days
The goal is to provide leadership with a consistent picture of where data stands each quarter, so leaders spend their time asking the right questions:
- What are your top priorities for the next quarter?
- What assistance do you need with resourcing and priorities?
- What are the benefits from our recent investments?
- What should I be most concerned about?
- What is next?
Role of data quality in AI trust
AI is widely regarded as an efficient way to automate business processes. This can be done using machine learning (ML), natural language processing (NLP), or generative AI (GenAI). But all these models, just like traditional analytics, have one requirement: they need quality data.
A focus on quality both at-rest and in-flight is required to have trusted data and processes. Agents operating on low-quality data can provide incorrect outcomes, cause business interruptions, and risk the reputation of your enterprise.
It is critical to have continuous monitoring, regular review, and efficient stewardship strategies for creating and maintaining high-quality data that a firm can trust. You cannot trust AI without first trusting your data.
Bigeye's role in data quality
Bigeye, as a data observability tool, brings data quality forward in an actionable format. It addresses a large number of data quality capabilities, including data quality dimensions, checks, reports and dashboards, data profiling, alerts, and issue management.
Summary
Data quality is critical to using data to run your business. This paper covered a wide range of topics: from data quality dimensions and data quality checks to profiling, data triage, and the eleven focus areas that move programs from reactive to proactive. It is vital to your business to have data that you can trust. Data quality is where that trust is built.
References
Caballero, I., Verboon, N., & Piattini, M. (2020). Dimensions of data quality (Version 1.2). DAMA-NL. https://dama-nl.org/wp-content/uploads/2020/09/DDQ-Dimensions-of-Data-Quality-Research-Paper-version-1.2-d.d.-3-Sept-2020.pdf
Khatri, V., & Brown, C. V. (2010). Data governance: The missing approach to data quality. California Management Review, 52(2), 86–103. https://www.proquest.com/openview/4b405a8360f99610460c0640fc680668
Lee, Y. W., Strong, D. M., Kahn, B. K., & Wang, R. Y. (2002). AIMQ: A methodology for information quality assessment. Information & Management, 40(2), 133–146.
Naumann, F. (2002). Quality-driven query answering for integrated information systems. Springer.
Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45(4), 211–218. https://doi.org/10.1145/505248.506010
Wang, R. Y., & Strong, D. M. (1996). Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4), 5–33. https://doi.org/10.1080/07421222.1996.11518099
Strong, D. M., Lee, Y. W., & Wang, R. Y. (1997). Data quality in context. Communications of the ACM, 40(5), 103–110. https://doi.org/10.1145/253769.253804
Taleb, I., Serhani, M. A., & Dssouli, R. (2024). A review of data quality dimensions. Procedia Computer Science, 232, 187–196. https://www.sciencedirect.com/science/article/pii/S187705092400365X
Trehan, A. (2024). An intelligent approach to data quality management: AI-powered quality monitoring in analytics. https://www.researchgate.net/publication/387298750
Zhang, Y., et al. (2022). Data quality challenges in deep learning. The VLDB Journal, 31, 1–23. https://doi.org/10.1007/s00778-022-00775-9
McGilvray, D. (2021). Executing data quality projects: Ten steps to quality data and trusted information (2nd ed.). Academic Press. https://www.amazon.com/Executing-Data-Quality-Projects-Information/dp/0128180153
Sambasivan, N., et al. (2020). Data quality and explainable AI. ACM Digital Library. https://dl.acm.org/doi/10.1145/3386687
What are the five dimensions of data quality?
Completeness (are all expected records present?), conformity (does the data match the expected format or schema?), consistency (does the same value appear consistently across systems?), uniqueness (are there duplicates where there shouldn't be?), and timeliness (did the data arrive when it was supposed to?). Each dimension can fail independently, which is why checking one doesn't substitute for checking all five.
What is data stewardship and why does it matter for quality?
Data stewardship assigns ownership of specific datasets to specific people or teams. When a quality issue surfaces, stewardship answers the question "whose problem is this?" before anyone has to spend time figuring it out. Without clear ownership, even a well-instrumented quality program degrades into alert fatigue and unresolved incidents.
How does data quality affect AI?
AI models are trained on historical data and make predictions based on ongoing feeds. Quality issues that would surface quickly in a dashboard — a null rate that doubled, a schema field that changed — can silently corrupt a model's training data or inference inputs. The consequences compound over time. A model trained on six months of skewed data doesn't just give one wrong answer; it encodes that skew into its weights.

