The House of Data Series: DataOps

Jim Barker | Thought leadership | February 27, 2026

This paper focuses on building, running, and monitoring data pipelines — including the collaboration model between teams, quality checks in flight, and what DataOps requires to support AI Trust. It does not cover data quality dimensions, governance policy, or architectural design decisions in depth — those are addressed in the Data Quality, Compliance, and Data Architecture whitepapers.


House of Data Series

Every strong data program is built like a house. Data Architecture forms the foundation — the platforms, pipelines, and operating model that everything else depends on. Seven domain pillars rise from that foundation, each one essential to a complete data program: Data Quality, Privacy, Data Security, DataOps, Compliance, Data Enablement, and Data Consumption. Data Literacy runs across all seven as a connecting beam, ensuring people at every level can read, interpret, and act on data. At the top, People & Leadership sets the direction, accountability, and culture that holds the whole structure together.

This series of whitepapers covers each component of the House of Data in depth. Each paper was written by a practitioner with direct experience in that domain. Together, they form a practical guide to building data programs that earn — and keep — trust.


This paper covers DataOps — the engineering and operational discipline that builds and runs data pipelines, bridging ideation to business value across both Build and Run phases.

DataOps

DataOps is the fourth pillar of the House of Data — the engineering and operational discipline that builds and runs data pipelines. It bridges the ideation-to-business-value cycle, covering both the Build phase (designing and delivering new data capabilities) and the Run phase (operating them reliably at production scale). Where other pillars define policies and standards, DataOps is where they get implemented.

Introduction

Organizations that aim to capitalize on data to grow their business and drive down costs are constantly looking for new data capabilities. DataOps is the core set of activities for rolling out and operating those capabilities. The goal is to build solid processes on solid designs at the lowest possible cost of maintenance. This white paper delves into DataOps with a split focus on Build and Run: Build is creating the new capabilities, and Run is operating those processes on an ongoing basis.

As we move forward to discuss DataOps, remember the DataOps Manifesto: "Through firsthand experience working with data across organizations, tools, and industries, we have uncovered a better way to develop and deliver analytics that we call DataOps." Whether referred to as data science, data engineering, data management, big data, business intelligence, or the like, we have come to value analytics through our work. (DataOps Manifesto)

Atwal discusses a number of great development strategies in Practical DataOps. These include an agile framework and delivery, KPIs for data product feedback, the importance of building trust through the use of data lineage, and finally organizing different personas for developing new data capabilities FAST.

Schmidt in DataOps: The Authoritative Edition speaks of building data products using DataOps that are well designed, efficient, and both prepare data and address data quality needs. The goal, as Schmidt puts it, is to develop quickly and save time by avoiding the dreaded data integration 'hairball'.

The goal is to build a model with smart pipelines, built the right way, and then use Data Observability to capture situations where data is not delivered when it should be, flag record volumes outside the norm, and stop processing when data quality thresholds are not met.

This approach combines four elements: (1) data governance; (2) data competency; (3) planning and design; (4) strategic development and rollout. Further, it requires using a reference data architecture to build processes right the first time, leveraging well-established testing strategies, and capturing details throughout the development process. This not only accelerates development; it also simplifies enablement.

The next two sections detail the Build process and introduce a Run process for ongoing support, aimed at increasing the confidence of the business community and lowering the total cost of maintenance (TCM) for technical teams.

Build

Build describes the actions from ideation to delivery, with a focus on business value. Data engineers are expected to build processes on configurable approaches: table-based reference data, data quality checks, and well-tested transformation logic that runs quickly and efficiently on reduced compute resources. These objectives hold whether the processing is streaming or traditional batch. The goal is to develop these processes so they carry the lowest possible total cost of maintenance (TCM).

The build process follows this methodology: (1) Ideation; (2) Determine what data is available; (3) Data profiling; (4) Build data models; (5) Build data pipelines; (6) Test data pipelines and resulting data; (7) Test the full process; (8) Build AI, BI, and Digital Transformation prototypes; (9) Run pipelines in production, which feeds into data enablement and use of the new data capabilities.

These steps are detailed in the following narrative.

Ideation

A key set of actions for driving out new capabilities is listening to the business for ideas. Ideas can come through strategic objectives, official requests, data governance survey responses, business challenges, or changing market conditions. The recommended approach is to have mechanisms in place to ask business users what is needed or wanted. This ideation cycle should be open to all data consumers, and no request should ever be auto-deleted. Keep a list that captures any and all requests. One way to do that is with a data governance inbox: a place to get help, record ideas, and collect appropriate detail.

Evaluate

Evaluate these requests with an appropriate level of pre-discovery, and obtain approval and funding to move forward. If a request is declined or deferred, share the feedback with the requestor, but keep the details and any work produced for use in future requests. Reach out to the submitter, thank them for their effort, and either assure them you will watch for similar requests so the need can be addressed in the future, or share the forecasted path to addressing this type of request. Use that submitter to assist in design review and testing, and celebrate their contribution when their ask goes live. Reviewing submitters' ideas with them and thanking them for their enthusiasm pays dividends in the long run.

Determine source data

In reviewing the request and starting design, discover the available data that "may" work for the request. This can come from data catalogs, past projects, data lineage, or very often the subject matter experts. Don't hesitate to use SMEs for help — a simple "where do I find this data" is often massively helpful and rarely causes negative feedback. People in the data space really do want to help.

Profile the data

Utilize profiling techniques to explore all the possible data sources. The outcome of a good profile typically includes:

  • Column name
  • Data type
  • Inferred data type
  • Min/max/key attributes
  • String length
  • Patterns
  • Number of records

This can be used to both: (1) determine the physical data model; (2) detail data transformation rules, data quality rules, and business logic. While any engineer can write SQL quickly, profiling lets the engineer examine many data sets at once and rapidly accelerates the development process.
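The profile outputs listed above can be sketched in a few lines. This is an illustrative pure-Python sketch, not a production profiler; `profile_column` and its output keys are hypothetical names chosen for this example.

```python
from collections import Counter

def profile_column(name, values):
    """Summarize one column: record count, nulls, inferred type, and ranges."""
    non_null = [v for v in values if v is not None]
    types = Counter(type(v).__name__ for v in non_null)
    profile = {
        "column": name,
        "records": len(values),
        "nulls": len(values) - len(non_null),
        "inferred_type": types.most_common(1)[0][0] if types else None,
        "distinct": len(set(non_null)),
    }
    # Min/max for numeric columns; string lengths for text columns.
    if non_null and all(isinstance(v, (int, float)) for v in non_null):
        profile["min"], profile["max"] = min(non_null), max(non_null)
    elif non_null and all(isinstance(v, str) for v in non_null):
        lengths = [len(v) for v in non_null]
        profile["min_length"], profile["max_length"] = min(lengths), max(lengths)
    return profile

# Example: profile a candidate source column before modeling.
print(profile_column("age", [34, 41, None, 29, 41]))
```

A real profiler would add pattern detection and key-candidate analysis, but even this level of summary is enough to compare many candidate sources quickly.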

Pipeline development

Once the target data model is established, business transformation rule requirements are known, and in-flight data checks for circuit breakers are designed, the data pipeline can be created. These pipelines should be developed, run, and tested. The team should then conclude with string testing, component testing, and system testing. The resulting target data should be profiled as part of that testing.

As these processes are developed, the detailed metadata should be collected and added to a metadata repository. A set of sample reports should be developed and run for validation and verification purposes. The metadata discovered by data engineers is extremely valuable. Engineers tend to resist recording it, but capturing it during development is more efficient, more accurate, and more reliable than trying to capture it in the days before enablement and roll-out.

Best practices for pipeline development include:

  • Valid value checks
  • Range tests
  • Reasonability tests
  • Cross-reference transformation
  • Well-formed numerical calculations
  • Performance tuning of processes

The goal is to have well-formed code that is easy to maintain by any developer. This lowers the level of effort and cost for maintenance — that is, it lowers the total cost of maintenance.

Reference Data Architecture (RDA) is a best practice that builds out processes based on table-driven design. RDA is one way to build processes that are simpler to read and understand, and lower the level of effort to make any changes.
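As an illustration of table-driven design, the sketch below resolves coded values through a reference mapping instead of hard-coded branches. The names (`status_codes`, `transform_row`) are hypothetical, and in practice the mapping would be loaded from a governed reference table rather than declared inline.

```python
# Table-driven transformation: the mapping lives in reference data, not in
# code, so most changes become data edits rather than code releases.
status_codes = {  # illustrative; normally loaded from a reference table
    "A": "Active",
    "I": "Inactive",
    "P": "Pending",
}

def transform_row(row, reference):
    """Resolve a coded value through the reference table; flag unknowns for triage."""
    code = row["status"]
    return {
        **row,
        "status": reference.get(code, "UNMAPPED"),
        "needs_review": code not in reference,
    }

print(transform_row({"id": 1, "status": "A"}, status_codes))
print(transform_row({"id": 2, "status": "X"}, status_codes))
```

Because unknown codes are flagged rather than dropped, the same pattern feeds the triage process described later: stewards add the missing reference row, and no pipeline change is needed.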

Business user involvement: It is easy for technical folks to build their solutions heads-down and only show business users at completion. This pattern should be challenged; keep some business users involved throughout development. Business users who submitted the idea can be a great source of validation if given the opportunity to share their experiences.

Create sample reports

It is expected that the developer will build a set of sample reports as part of the build to complete validation and verification. This is also known as building BI prototypes: examples of the output envisioned for the solution. The goal of these "official sample reports" is to take this a step further: build business-ready example reports that can be used for job (training) aids and for showing business users during roll-out of the new features. It is critical that these reports or dashboards demonstrate data correctness, clean formats, and a high degree of ease of use.

Validation and verification

Validation is the set of steps executed to make sure each and every calculation is correct. It is the meticulous task completed as the final evaluation.

Verification is the multi-cycle test that confirms process performance, i.e., that the code runs correctly over many runs. It should also include testing restart processing.
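The two definitions can be made concrete with a small sketch. `validate_totals` and `verify_determinism` are hypothetical helpers assuming a simple additive calculation, not part of any particular toolkit.

```python
def validate_totals(source_rows, target_total):
    """Validation: confirm the pipeline's calculated total matches an
    independent recomputation from the source data."""
    expected = sum(r["amount"] for r in source_rows)
    return expected == target_total

def verify_determinism(pipeline, source_rows, runs=3):
    """Verification: the same input must produce identical output across runs."""
    results = [pipeline(source_rows) for _ in range(runs)]
    return all(r == results[0] for r in results)

source = [{"amount": 100}, {"amount": 250}]
print(validate_totals(source, 350))                                  # correctness
print(verify_determinism(lambda rows: sum(r["amount"] for r in rows), source))
```

Validation asks "is the answer right?"; verification asks "is the answer right every time the process runs?". Both belong in the test plan before production.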

Run

Run is the operating phase; it focuses on the continuous execution, integration, and business value derived. After these processes are developed, tested, and implemented, an operations mindset takes over. Any and all failures will be addressed within a service level agreement (SLA) and typically a rotating support model will be established. Good data engineering teams will establish a continuous improvement mindset, and routinely address any run issue prior to business awareness.

It is critical to monitor pipeline reliability consistently. This includes freshness (is data being refreshed on a consistent schedule?) and volume (is the correct amount of data being delivered in each run?).

Once the data pipelines are developed, tested, and deployed, the focus needs to shift to the run phase of use or "Operate". During this phase there needs to be an operational process that covers five areas:

  1. Run and manage data pipelines
  2. Monitor the data quality
  3. Adapt the pipelines as needs arise
  4. Triage — i.e., fix data upon request from business, technical, and stewardship teams
  5. Oversee the data to ensure pipeline reliability and data quality

Run and manage data pipelines

Once the solution is built and tested, set up the data pipelines to run regularly. Set up pipeline reliability capabilities to alert the "Run" team when pipelines: (1) run late; (2) run early; (3) fail; (4) process too many records; (5) process too few records. These notifications, along with process failure alerts, ensure that the business gets the data they have come to expect.

Data pipelines need to be scheduled and run reliably. When they fail, notifications must go out, because pipelines need to run to completion on a regular schedule. Any delays need to be communicated, escalated, and managed.

Data observability software helps by providing pipeline reliability capabilities such as freshness and volume details, and the necessary alerts when something is off. All failures should be reviewed and addressed to limit bad results in business solutions. Business users are typically not supportive of waiting for data, as the efficient management of business processes is critical to the bottom line.
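A minimal sketch of the freshness and volume checks described above. This is not any product's implementation; `check_pipeline_run`, its parameters, and the thresholds are illustrative assumptions.

```python
from datetime import datetime, timedelta

def check_pipeline_run(now, last_loaded, expected_interval, row_count, expected_range):
    """Return reliability alerts for one run; an empty list means healthy."""
    alerts = []
    # Freshness: has the data been refreshed within its expected window?
    if now - last_loaded > expected_interval:
        alerts.append("FRESHNESS: data is past its expected refresh window")
    # Volume: did the run deliver a plausible number of records?
    low, high = expected_range
    if not low <= row_count <= high:
        alerts.append(f"VOLUME: {row_count} rows outside expected range {low}-{high}")
    return alerts

# A run that is both stale and under-delivering trips both alerts.
alerts = check_pipeline_run(
    now=datetime(2026, 1, 2, 6, 0),
    last_loaded=datetime(2026, 1, 1, 5, 0),   # 25 hours ago
    expected_interval=timedelta(hours=24),
    row_count=50,
    expected_range=(900, 1100),
)
print(alerts)
```

In practice the expected interval and row-count range would be learned from historical runs rather than hard-coded, which is exactly the instrumentation burden observability tooling removes.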

It is recommended to report pipeline effectiveness numbers to senior leadership each quarter in a state-of-data report.

Monitor the data quality

Staff need to trust data. To achieve this, data quality checks are often implemented both in-flight and at-rest. As the pipeline runs, failing data quality checks can suspend processing so the issue is triaged within operational parameters, often by the next day or sooner. Stopping a pipeline due to data quality failures is referred to as a circuit breaker.

At-rest data quality checks are a set of data quality checks that are run on a regular basis when the data has completed processing. Some of these checks will be daily/weekly/monthly. The goal of at-rest data quality checks is to find problems that may be caused by other data processes, by the shifting of time, or from other semi-related attributes changing in value.

In-flight data quality checks are a set of data quality checks that execute as data is processed. These capture data quality problems as data is flowing from one database to another, and will fire data quality circuit-breakers when too many failures occur. These in-flight checks can also be part of discard processing.

Circuit breakers are a step built into data pipelines that executes a data quality check and, when failures exceed a set threshold, stops the process and signals the DataOps team, who fix the process before reprocessing the data.
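The circuit-breaker pattern can be sketched as below. This is an illustrative sketch, not a specific product's implementation; the names and the 5% default threshold are assumptions.

```python
class CircuitBreakerTripped(Exception):
    """Raised to halt the pipeline when quality failures exceed the threshold."""

def run_with_circuit_breaker(records, check, max_failure_rate=0.05):
    """Apply an in-flight check to each record; stop the whole run if too many fail."""
    passed, failed = [], []
    for record in records:
        (passed if check(record) else failed).append(record)
    rate = len(failed) / len(records) if records else 0.0
    if rate > max_failure_rate:
        raise CircuitBreakerTripped(
            f"{len(failed)} of {len(records)} records failed ({rate:.0%}); halting for triage"
        )
    # Below the threshold, failed records route to discard processing
    # while the rest of the run continues.
    return passed, failed

# Example in-flight check: amounts must be positive.
good, discarded = run_with_circuit_breaker(
    [{"amount": 10}, {"amount": 5}], lambda r: r["amount"] > 0
)
```

The key design choice is that a small number of failures flows to discard processing, while a failure rate above the threshold stops the run entirely so bad data never lands in the target.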

At-rest data quality checks and alerts should also be triaged and corrected. Many firms find the use of a RAIL (Ready Action Item List or "Risks" - "Activities" - "Issues" Listing) is helpful to capture needs and manage the completion of those activities and resolutions. With the RAIL, data teams can manage staff, re-prioritize needs, and have oversight into the addressing of the data pipeline and data quality issues.

Adapt the pipelines as needs arise

Data engineering teams collect feedback from the data processes, from business users, and from data stewards on areas where improvements could be made. Once these improvements are categorized, researched, and approved, a set of work efforts is executed, following the same process as new development: profiling, development, testing, and documentation. A couple of Six Sigma tools can be used to better understand where these problems exist.

Cause and effect diagrams

One example is the cause-and-effect diagram (Ishikawa diagram), or fishbone chart. The idea is to place the problem at the far right, draw branches for the main categories of causes off a central spine, and list specific causes along each branch.

PDCA diagram

Another Six Sigma tool that provides even greater benefits in this area is a PDCA (Plan-Do-Check-Act) Diagram, or the Shewhart Circle. This is the idea that once problems are identified, the following steps will be taken to improve the situation: (1) Plan; (2) Do; (3) Check; (4) Act.

  • Plan — The engineering team reviews the process and determines what improvements to make.
  • Do — Make the necessary changes, test the improvement, and deploy to production.
  • Check — After the change is implemented, evaluate its benefits and document any concerns.
  • Act — Make any additional changes needed based on the results of the check.

Triage

Ideally, an assigned engineer finds a problem and changes code through the maintenance lifecycle so the problem is prevented. There are many instances, however, where a pipeline change is determined not to be the right answer. In those cases, data triage requests are sent to the data steward to take action. The data steward researches the problem and makes the necessary changes in the organization's systems, or enlists the functional teams for support. The functional teams will often make the changes and close out the ticket.

Managing these triage requests is critical, and it is vital to have a place to track them. Some shops use a process like the RAIL to collect each need, manage it, and build an aging report to find older requests and either action or re-prioritize them.

Oversight

The last major run area of DataOps is regular checks of pipeline reliability and at-rest data quality. By looking across time for variation in run details, pipeline and data quality reviews surface changes that are difficult to notice at any single point in time.

By looking across time, data teams find errors exposed only by reviewing trends. The DataOps team should also review dashboards and drill into reports to find areas of concern, analyze the trends, and take action.
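As a small illustration of trend-based oversight, the sketch below flags runs whose record counts drift sharply from the series as a whole, something a single-run check would miss. `volume_outliers` and its median-based tolerance are illustrative assumptions, not a product feature.

```python
from statistics import median

def volume_outliers(daily_counts, tolerance=0.3):
    """Return indexes of runs whose record count deviates from the series
    median by more than `tolerance` (as a fraction of the median)."""
    m = median(daily_counts)
    return [
        i for i, count in enumerate(daily_counts)
        if m and abs(count - m) / m > tolerance
    ]

# Four normal days and one collapsed run; only the trend view exposes day 4.
print(volume_outliers([1000, 1010, 990, 1005, 400]))
```

The median is used rather than the mean so a single bad run does not distort the baseline it is being compared against.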

Run summary

It is critical that data engineering and data stewards cooperate across the five areas: (1) Run and manage data pipelines; (2) Monitor the data quality; (3) Adapt the pipelines as needs arise; (4) Triage, i.e., fix data upon request from business, technical, and stewardship teams; (5) Oversee the data to ensure pipeline reliability and data quality. This collaboration is the difference between a solid data program and dysfunction and frustration.

AI trust relationships with DataOps

DataOps, as the consistent building and running of data-related processes, must be part of today's AI/ML, DLP, and GenAI landscape. This means the development of new AI capabilities must take into account the goals and needs of: (1) Data Security; (2) Data Privacy; (3) Data Quality; (4) Policies and Enforcement.

The outcome of DataOps is solid data. Many businesses build out data for AI and partition it in a location reserved for AI processes, or, put more specifically, allow AI to pull only from that location. These systems must be developed to restrict sensitive data and prevent AI applications from using data that is off-limits to AI solutions.

Remember: Good AI answers start with trusted data.

Bigeye's role in DataOps

Bigeye has a large role to play in DataOps. Its features align with running the process efficiently and increasing the confidence of your user community. DataOps teams spend most of their time on pipeline reliability, quality monitoring, and incident triage. The capabilities below map directly to those operational responsibilities, reducing the manual instrumentation required and surfacing issues before they reach downstream consumers.

  • Pipeline reliability metrics — Monitors freshness and volume across pipelines, flagging data that loads early or late and record counts that fall outside expected ranges. Benefit: continuous reliability monitoring without requiring instrumentation inside the pipeline itself, automatically surfacing risk during normal operational runs.
  • Data quality checks — Provides out-of-the-box metrics for identifying out-of-range values, plus support for custom business rules and stakeholder notifications. Benefit: covers the majority of common quality checks without custom rule development, and alerts data stewards before issues reach the business.
  • Sensitive data scanning — Scans data environments to locate sensitive elements in their native fields and sensitive data embedded within other fields. Benefit: enables privacy teams to identify and remediate data sensitivity issues proactively, without large volumes of custom detection code.
  • Issue resolution management — A unified issues inbox for collecting, managing, and resolving anomalies, with ML-based filtering to distinguish genuine issues from expected operational variation. Benefit: balances the need for notification with the risk of alert fatigue, surfacing real issues without burying teams in noise.
  • Lineage — Maps where data originates, how it's transformed at each stage, and where it flows downstream across the environment. Benefit: accelerates root cause analysis, supports impact assessment before pipeline changes, and gives data consumers confidence in the provenance of what they're using.

Summary

DataOps is the set of activities most tightly aligned to data engineers: the building and running of data pipelines. Data pipelines are how data gets from originating source systems into analytics data platforms. This white paper also covered the role of DataOps in AI Trust, the benefits of continuous improvement tools, and a framework for building a highly efficient back-end for all data processing.

Explore the Series

Every great data program is built from the ground up.

The House of Data breaks down the ten pillars of a mature, trustworthy data organization: Data Leadership, Data Literacy, Data Quality, Privacy, Data Security, DataOps, Compliance, Data Enablement, Data Consumption, and Data Architecture.

References

  • Atwal, H. (2020). Practical DataOps: Delivering Agile Data Science at Scale (1st ed.). Apress.
  • Schmidt, J., & Basu, K. DataOps: The Authoritative Edition.
  • DataOps Manifesto. https://dataopsmanifesto.org
  • Ishikawa, K. (1976). Guide to Quality Control (2nd ed.). Asian Productivity Organization.
  • Shewhart, W. A. (1939). Statistical Method from the Viewpoint of Quality Control. The Graduate School, Department of Agriculture, Washington.
  • Deming, W. E. (1986). Out of the Crisis. MIT Center for Advanced Engineering Study.
about the author

Jim Barker

Director of Professional Services

Jim Barker is a lifelong data practitioner, industry thought leader, and passionate advocate for treating data as a strategic asset. With more than four decades of experience spanning data quality, governance, warehousing, migration, and architecture, Jim brings a rare blend of hands-on expertise and executive perspective to the evolving data landscape.

Jim’s journey in data began at just 14 years old. Since then, he has held leadership roles across organizations including Honeywell, Informatica, Thomson Reuters, Winshuttle (Precisely), Alation, nCloud Integrators, and Wavicle, contributing to advancements in data governance, migration methodologies, and enterprise data strategies. His work has included building global data quality programs, developing scalable governance frameworks, and driving innovation recognized across the industry.

His research and writing focus on lean data management, governance strategies, and the intersection of AI, data quality, and enterprise value creation.

Now at Bigeye as Director of Professional Services, Jim is energized by the company’s vision for data observability and its role in shaping the future of trusted data. He continues to share his perspectives through writing and speaking, aiming to elevate the conversation around data, cut through industry noise, and help organizations do data the right way.

Outside of work, Jim enjoys coaching and spending time with his family, often on the basketball court or soccer field, where many of the same lessons about teamwork, discipline, and leadership apply.

As Jim puts it: “Data matters.”

