Product
September 27, 2022

Bigconfig empowers data teams to implement data reliability at scale

Bigeye has partnered with dozens of top data teams to create Bigconfig, a declarative, YAML-based monitoring-as-code solution that lets teams manage data monitoring the same way they manage the rest of their stack: as code, inside their existing workflows.

Kate Wendell

Modern data teams manage their infrastructure, pipelines, and analytics tools as code. Bigeye has partnered with dozens of top data teams to create Bigconfig, a declarative, YAML-based monitoring-as-code solution that allows teams to continue this convention and integrate data monitoring into their workflows as code.

Bigconfig provides Bigeye customers with the scalability, repeatability, and governance enterprise data engineering teams need. It comes with conveniences like saved metrics, dynamic tags with wildcard asset identifiers, and Autothresholds. Our solution empowers data teams to apply standardized monitoring across thousands of tables to ensure full coverage and early detection of incidents.

This post will help your team get started. It covers creating your first Bigconfig, helpful tips to maximize coverage, and details on advanced features.

Monitor business-critical datasets with table deployments

The simplest way to get started with Bigconfig is with table deployments. Most data teams have a handful of business-critical datasets—tables that power reports for the executive team or training data for a production pricing model. We recommend using a table deployment in Bigconfig to create and manage metrics on these datasets. This lets you control and customize those metrics with fine-grained precision so that even subtle pattern changes in your data are detected.

The good news is Bigconfig makes it easy to customize what you want—and automatically configures the rest.

Start with table metrics to ensure the dataset is updating on time with expected row counts.

Then apply column-level metrics to ensure the accuracy of the data. You can choose from Bigeye's 60+ predefined metrics and customize them at the field level by filtering rows with conditions, grouping by relevant dimensions, or defining custom thresholds. You don't have to manually define every attribute, though: Bigconfig automatically applies workspace defaults for anything not specified. For example, Autothresholds automatically analyze each metric's history and alert you to anomalies, so there's no need to determine acceptable ranges yourself.

If your table loads incrementally, we recommend setting a row creation time to apply a windowing function on metrics. This will optimize query performance and enhance anomaly detection.

Check out the example below to get started:
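Here is a minimal sketch of what a table deployment can look like, pulling together the pieces above: a row creation time, table-level freshness and volume metrics, and customized column-level metrics. The warehouse, schema, table, and column names are placeholders, and the exact keys and predefined metric identifiers shown here are illustrative; check them against the Bigconfig reference documentation before using them.

```yaml
type: BIGCONFIG_FILE

# Row creation times let Bigeye window metric queries on incrementally loaded tables
# (placeholder fully qualified column name).
row_creation_times:
  column_selectors:
    - name: analytics_warehouse.sales.orders.created_at

table_deployments:
  - collection:
      name: Business-Critical Tables
    deployments:
      - fq_table_name: analytics_warehouse.sales.orders
        # Table-level metrics: is the table loading on time, with the expected volume?
        table_metrics:
          - metric_type:
              predefined_metric: HOURS_SINCE_LAST_LOAD
          - metric_type:
              predefined_metric: ROWS_INSERTED
        # Column-level metrics: is the data itself accurate?
        columns:
          - column_name: order_id
            metrics:
              - metric_type:
                  predefined_metric: PERCENT_NULL
              - metric_type:
                  predefined_metric: PERCENT_DUPLICATES
          - column_name: order_total
            metrics:
              # No thresholds specified, so autothresholds learn acceptable ranges
              # from the metric's history.
              - metric_type:
                  predefined_metric: AVERAGE
                conditions:
                  - order_status = 'complete'   # filter rows before computing the metric
                group_by:
                  - region                      # track the metric per dimension value
```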

Finally, if you find yourself repeating the same metric across multiple columns, it's a good idea to save it in saved_metric_definitions so it stays consistent across datasets and can be applied with a single line of code. Here are a couple of common examples:
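As an illustration, here is roughly what a pair of commonly reused saved metrics might look like. The saved_metric_id values are names you choose; the metric and threshold keys are illustrative and worth verifying against the Bigconfig docs.

```yaml
saved_metric_definitions:
  metrics:
    # Reusable "no NULLs allowed" check, e.g. for primary keys and IDs.
    - saved_metric_id: no_nulls
      metric_type:
        predefined_metric: PERCENT_NULL
      threshold:
        type: CONSTANT
        upper_bound: 0
    # Reusable uniqueness check for columns that should never contain duplicates.
    - saved_metric_id: no_duplicates
      metric_type:
        predefined_metric: PERCENT_DUPLICATES
      threshold:
        type: CONSTANT
        upper_bound: 0
```

Once defined, attaching one of these to any column in a table or tag deployment takes a single line, for example: - saved_metric_id: no_nulls.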

Implement broad, standardized monitoring across your warehouse with tag deployments

Once you've monitored business-critical tables with table deployments, use tags to deploy broad coverage across all of your datasets. Tags empower you to deploy metrics across your warehouse consistently and automatically, so your data observability scales with your data.

For example, we recommend tracking consistency of table updates with hours since last load and row count checks on all tables. Further, primary key or ID fields should be monitored for NULLs, duplicates, and proper formatting in all tables. Emails and other contact information can similarly be monitored. Finally, ensure critical KPIs have distribution checks to catch any anomalies.

To do this, first define your tags with a list of column selectors. Tag definitions are designed to match common semantic standards in your warehouse. You can include wildcards to dynamically match values across your warehouse, within a specific schema, or in any dataset prefixed/suffixed with a certain name. See below for some examples:
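For example, tag definitions along these lines would match ID columns anywhere in the warehouse, email columns in any table, and every column of fact tables in one schema. The selector format (warehouse.schema.table.column with * wildcards) follows the pattern described above; the specific names are placeholders.

```yaml
tag_definitions:
  # Every ID-suffixed column, in any schema and any table.
  - tag_id: ID_COLUMNS
    column_selectors:
      - name: analytics_warehouse.*.*.*_id
  # Email columns anywhere in the warehouse.
  - tag_id: EMAIL_COLUMNS
    column_selectors:
      - name: analytics_warehouse.*.*.email
  # Every column in tables prefixed with fct_ in the marts schema.
  - tag_id: FACT_TABLES
    column_selectors:
      - name: analytics_warehouse.marts.fct_*.*
```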

Next, deploy metrics on these tags in tag_deployments. You can reuse saved metrics or define metrics inline as needed. Autothresholds ensure that each metric is trained on and tuned to its specific dataset, so there's no need for tedious threshold definitions or maintenance. See the example below:
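Continuing the tag examples above, a sketch of a tag deployment might look like the following. Again, the keys and metric identifiers are illustrative and should be confirmed against the Bigconfig reference.

```yaml
tag_deployments:
  - collection:
      name: Warehouse-Wide Coverage
    deployments:
      # Freshness and volume checks on every table matched by FACT_TABLES.
      - tag_id: FACT_TABLES
        table_metrics:
          - metric_type:
              predefined_metric: HOURS_SINCE_LAST_LOAD
          - metric_type:
              predefined_metric: ROWS_INSERTED
      # Reuse the saved metrics from earlier on every ID column in the warehouse.
      - tag_id: ID_COLUMNS
        metrics:
          - saved_metric_id: no_nulls
          - saved_metric_id: no_duplicates
      # Autothresholds tune each resulting metric to its own dataset's history.
      - tag_id: EMAIL_COLUMNS
        metrics:
          - saved_metric_id: no_nulls
```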

Integrate Bigconfig into your DevOps Workflow

Implementing monitoring-as-code makes it easy to integrate Bigconfig into your existing workflows. First, we recommend versioning your Bigconfig YAML in Git so that changes are governed by pull request reviews and approvals. From there, you can integrate and automate Bigconfig by defining tasks in the CI/CD tool of your choice, such as GitHub Actions, Bamboo, or Jenkins. For example, you could create a GitHub Action that:

  1. Automatically runs a Bigconfig Plan when a pull request is opened that changes your Bigconfig YAML files.
  2. Automatically runs a Bigconfig Apply after that pull request is approved and merged.

Finally, you could automatically Plan and Apply when other code files, like dbt YAML, are released by triggering tasks off those files. This ensures that metrics are automatically enabled on new tables and views.
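As a sketch, a GitHub Actions workflow along these lines could run a Plan on pull requests and an Apply on merges to main. The workflow structure is standard GitHub Actions; the directory layout, secret name, and Bigeye CLI commands shown here are assumptions for illustration, so check them against the Bigeye CLI documentation.

```yaml
# .github/workflows/bigconfig.yml (illustrative)
name: bigconfig

on:
  pull_request:
    paths: ["monitoring/**"]   # directory assumed to hold your Bigconfig YAML
  push:
    branches: [main]
    paths: ["monitoring/**"]

jobs:
  plan:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - run: pip install bigeye-cli
      # Illustrative command: report what the proposed change would create, update, or delete.
      - run: bigeye bigconfig plan
        env:
          BIGEYE_API_KEY: ${{ secrets.BIGEYE_API_KEY }}   # hypothetical secret name

  apply:
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - run: pip install bigeye-cli
      # Illustrative command: apply the approved changes to your Bigeye workspace.
      - run: bigeye bigconfig apply
        env:
          BIGEYE_API_KEY: ${{ secrets.BIGEYE_API_KEY }}
```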

Stay tuned for our next blog, where we'll discuss using the Bigeye CLI to integrate data observability into your CI/CD data pipelines.
