Product
September 27, 2022

Bigconfig empowers data teams to implement data reliability at scale

Bigeye has partnered with dozens of top data teams to create Bigconfig, a declarative, YAML-based monitoring-as-code solution that lets teams manage data monitoring the same way they manage the rest of their stack: as code, inside their existing workflows.

Kate Wendell

Modern data teams manage their infrastructure, pipelines, and analytics tools as code. Bigeye has partnered with dozens of top data teams to create Bigconfig, a declarative, YAML-based monitoring-as-code solution that allows teams to continue this convention and integrate data monitoring into their workflows as code.

Bigconfig provides Bigeye customers with the scalability, repeatability, and governance enterprise data engineering teams need. It comes with conveniences like saved metrics, dynamic tags with wildcard asset identifiers, and Autothresholds. Our solution empowers data teams to apply standardized monitoring across thousands of tables to ensure full coverage and early detection of incidents.

This post will help your team get started. It covers creating your first Bigconfig, helpful tips to maximize coverage, and details on advanced features.

Monitor business-critical datasets with table deployments

The simplest way to get started with Bigconfig is with table deployments. Most data teams have a handful of business-critical datasets—tables that power reports for the executive team or training data for a production pricing model. We recommend using a table deployment in Bigconfig to create and manage metrics on these datasets. This lets you control and customize those metrics with fine-grained precision so that even subtle pattern changes in your data are detected.

The good news is Bigconfig makes it easy to customize what you want—and automatically configures the rest.

Start with table metrics to ensure the dataset is updating on time with expected row counts.

Then apply column-level metrics to ensure the accuracy of the data. You can choose from Bigeye's 60+ predefined metrics and customize them at the field level by filtering rows with conditions, grouping by relevant dimensions, or defining custom thresholds. You don't have to manually define every attribute, though: Bigconfig automatically applies workspace defaults for anything not specified. For example, Autothresholds automatically analyze each metric's history and alert you to anomalies, so there's no need to determine acceptable ranges yourself.

If your table loads incrementally, we recommend setting a row creation time to apply a windowing function on metrics. This will optimize query performance and enhance anomaly detection.

Check out the example below to get started:
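Here is a minimal sketch of what a table deployment can look like, pulling together the pieces above: a row creation time, table-level freshness and volume metrics, and customized column-level metrics. The warehouse, schema, table, and column names are placeholders, and the exact keys and predefined metric identifiers shown here are illustrative; check them against the Bigconfig reference documentation before using them.

```yaml
type: BIGCONFIG_FILE

# Row creation times let Bigeye window metric queries on incrementally loaded tables
# (placeholder fully qualified column name).
row_creation_times:
  column_selectors:
    - name: analytics_warehouse.sales.orders.created_at

table_deployments:
  - collection:
      name: Business-Critical Tables
    deployments:
      - fq_table_name: analytics_warehouse.sales.orders
        # Table-level metrics: is the table loading on time, with the expected volume?
        table_metrics:
          - metric_type:
              predefined_metric: HOURS_SINCE_LAST_LOAD
          - metric_type:
              predefined_metric: ROWS_INSERTED
        # Column-level metrics: is the data itself accurate?
        columns:
          - column_name: order_id
            metrics:
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: PERCENT_DUPLICATES
          - column_name: order_total
            metrics:
              # No thresholds specified, so autothresholds learn acceptable ranges
              # from the metric's history.
              - metric_type:
                  predefined_metric: AVERAGE
                conditions:
                  - order_status = 'complete'   # filter rows before computing the metric
                group_by:
                  - region                      # track the metric per dimension value
```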

Finally, if you find yourself repeating the same metric across multiple columns, it's a good idea to save it in saved_metric_definitions so it stays consistent across datasets and can be applied with a single line of code. Here are a couple of common examples:
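As an illustration, here is roughly what a pair of commonly reused saved metrics might look like. The saved_metric_id values are names you choose; the metric and threshold keys are illustrative and worth verifying against the Bigconfig docs.

```yaml
saved_metric_definitions:
  metrics:
    # Reusable "no NULLs allowed" check, e.g. for primary keys and IDs.
    - saved_metric_id: no_nulls
      metric_type:
        predefined_metric: PERCENT_NULL
      threshold:
        type: CONSTANT
        upper_bound: 0
    # Reusable uniqueness check for columns that should never contain duplicates.
    - saved_metric_id: no_duplicates
      metric_type:
        predefined_metric: PERCENT_DUPLICATES
      threshold:
        type: CONSTANT
        upper_bound: 0
```

Once defined, attaching one of these to any column in a table or tag deployment takes a single line, for example: - saved_metric_id: no_nulls.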

Implement broad, standardized monitoring across your warehouse with tag deployments

Once you've monitored business-critical tables with table deployments, use tags to deploy broad coverage across all of your datasets. Tags empower you to deploy metrics across your warehouse consistently and automatically, so your data observability scales with your data.

For example, we recommend tracking consistency of table updates with hours since last load and row count checks on all tables. Further, primary key or ID fields should be monitored for NULLs, duplicates, and proper formatting in all tables. Emails and other contact information can similarly be monitored. Finally, ensure critical KPIs have distribution checks to catch any anomalies.

To do this, first define your tags with a list of column selectors. Tag definitions are designed to match common semantic standards in your warehouse. You can include wildcards to dynamically match values across your warehouse, within a specific schema, or in any dataset prefixed/suffixed with a certain name. See below for some examples:
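For example, tag definitions along these lines would match ID columns anywhere in the warehouse, email columns in any table, and every column of fact tables in one schema. The selector format (warehouse.schema.table.column with * wildcards) follows the pattern described above; the specific names are placeholders.

```yaml
tag_definitions:
  # Every ID-suffixed column, in any schema and any table.
  - tag_id: ID_COLUMNS
    column_selectors:
      - name: analytics_warehouse.*.*.*_id
  # Email columns anywhere in the warehouse.
  - tag_id: EMAIL_COLUMNS
    column_selectors:
      - name: analytics_warehouse.*.*.email
  # Every column in tables prefixed with fct_ in the marts schema.
  - tag_id: FACT_TABLES
    column_selectors:
      - name: analytics_warehouse.marts.fct_*.*
```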

Next, deploy metrics on these tags in tag_deployments. You can reuse saved metrics or define metrics inline as needed. Autothresholds ensure that each metric is trained on and tuned to its specific dataset, so there's no need for tedious threshold definitions or maintenance. See the example below:
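Continuing the tag examples above, a sketch of a tag deployment might look like the following. Again, the keys and metric identifiers are illustrative and should be confirmed against the Bigconfig reference.

```yaml
tag_deployments:
  - collection:
      name: Warehouse-Wide Coverage
    deployments:
      # Freshness and volume checks on every table matched by FACT_TABLES.
      - tag_id: FACT_TABLES
        table_metrics:
          - metric_type:
              predefined_metric: HOURS_SINCE_LAST_LOAD
          - metric_type:
              predefined_metric: ROWS_INSERTED
      # Reuse the saved metrics from earlier on every ID column in the warehouse.
      - tag_id: ID_COLUMNS
        metrics:
          - saved_metric_id: no_nulls
          - saved_metric_id: no_duplicates
      # Autothresholds tune each resulting metric to its own dataset's history.
      - tag_id: EMAIL_COLUMNS
        metrics:
          - saved_metric_id: no_nulls
```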

Integrate Bigconfig into your DevOps Workflow

Implementing monitoring-as-code makes it easy to integrate Bigconfig into your existing workflows. First, we recommend versioning your Bigconfig YAML in Git so that changes are governed by pull request reviews and approvals. From there, you can integrate and automate Bigconfig by defining tasks in the CI/CD tool of your choice, such as GitHub Actions, Bamboo, or Jenkins. For example, you could create a GitHub Action that:

  1. Automatically runs a Bigconfig Plan when a pull request is opened that changes your Bigconfig YAML files.
  2. Automatically runs a Bigconfig Apply after that pull request is approved and merged.

Finally, you could automatically Plan and Apply when other code files, like dbt YAML, are released by triggering tasks off those files. This ensures that metrics are automatically enabled on new tables and views.
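As a sketch, a GitHub Actions workflow along these lines could run a Plan on pull requests and an Apply on merges to main. The workflow structure is standard GitHub Actions; the directory layout, secret name, and Bigeye CLI commands shown here are assumptions for illustration, so check them against the Bigeye CLI documentation.

```yaml
# .github/workflows/bigconfig.yml (illustrative)
name: bigconfig

on:
  pull_request:
    paths: ["monitoring/**"]   # directory assumed to hold your Bigconfig YAML
  push:
    branches: [main]
    paths: ["monitoring/**"]

jobs:
  plan:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - run: pip install bigeye-cli
      # Illustrative command: report what the proposed change would create, update, or delete.
      - run: bigeye bigconfig plan
        env:
          BIGEYE_API_KEY: ${{ secrets.BIGEYE_API_KEY }}   # hypothetical secret name

  apply:
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - run: pip install bigeye-cli
      # Illustrative command: apply the approved changes to your Bigeye workspace.
      - run: bigeye bigconfig apply
        env:
          BIGEYE_API_KEY: ${{ secrets.BIGEYE_API_KEY }}
```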

Stay tuned for our next blog, where we'll discuss using the Bigeye CLI to integrate data observability into your CI/CD data pipelines.
