May 11, 2022

The need for speed: evolving indexing architecture in Bigeye

The Bigeye engineering team explains how it worked through the challenge of processing customers’ database catalogs containing increasingly diverse data sets

Ryan Atwood

Rapidly growing companies like Bigeye often face engineering challenges as they scale: technical decisions made early on to unlock functionality become bottlenecks at new levels of scale. Sound familiar?

One of the challenges the Bigeye engineering team recently solved was processing customers’ database catalogs that contain increasingly diverse data sets. Through this process, referred to as “catalog refresh,” Bigeye identifies a customer’s schemas, datasets, and columns for a data source, allowing us to apply our expanding list of data reliability and anomaly detection features to customer data accurately and in real time.

Detecting new, modified, or removed database catalog items efficiently is one thing. But with a growing list of customers like Zoom, Confluent, and Instacart, we feel the need for speed.

Speed, however, is not enough.  We also need the process to be reliable for every supported relational database type.  

This post details how we met the challenge of scaling our architecture and design to meet demand—from Redis to queuing to locking and more.  

The Challenge: Increasingly diverse needs across many regions

Our challenge was three-fold:

  • Process large amounts of catalog data quickly at scale and know when the processing is complete.
  • Incorporate multi-region failover without the possibility of duplicating work initiated simultaneously in other geographic regions.
  • Make the process more granular: it needed to reliably process an entire data source with one or many schemas, each with one or many tables, while allowing each schema to be processed independently.

When did my asynchronous processing complete?

The amount of data being processed at Bigeye is on track to grow exponentially, and catalog data is no exception. Every new customer, large or small, needs to be able to sign up and connect to their data source(s) without performance suffering in the least.

This isn’t even the tricky part. Any developer can change an endpoint to kick off a process on a background thread and return a “202 Accepted” HTTP status code. How do we scale parallel processing to meet the lion’s share of generic asynchronous processing use cases while keeping track of the overall set of operations?

East Coast and West Coast initiate catalog refreshing simultaneously… and many times each

Remember when life was so simple and all our customers were essentially in the same time zone? When customers within a single company worked from one city? Those were the days. Now, with great data sources in many regions comes great responsibility.

For example, suppose a data engineer, who works remotely from North Carolina, has added a few new tables and wants to refresh the catalog to take advantage of Bigeye’s anomaly detection. What happens if the VP of marketing in Seattle, who happens to be preparing for a presentation to leadership, also wants things in Bigeye up to date? Do you process everything twice? What happens if the app has a blip and the process is accidentally requested not once but three times?

Not only is performance important, but processing also gets expensive for the client (e.g., Snowflake charges for compute time). How do we prevent processing the same catalog repeatedly when we can’t truly depend on our current application database to store and exchange this information reliably across regions? This is, of course, assuming you have already solved the scalability problem detailed in the previous section.

A hatchet or a scalpel?

Adding new functionality over time creates greater demand for the ability to refresh not just a data source as a whole, but also individual schemas or even individual datasets.

For example, Bigeye’s Deltas feature automatically maps the columns between your source and target data and intelligently applies data quality metrics from Bigeye’s library of more than 70 metrics. In a matter of seconds, Deltas generates reports that show which metrics have drifted across which dimensions. Deltas helps you determine the root cause, so you can resolve those issues quickly. If the source and target tables live in different sources, it is better to refresh just those two tables rather than refreshing both entire sources before running the comparison.

Customers often want Bigeye to simply recognize a new column that was added without running a full catalog refresh process against their data source with 25 schemas in it. Yet what is the best way to add this granularity while ensuring we do not increase the strain on our clients’ databases?

Combine these challenges, and one has quite the task.

The Solution: Granular, multi-region failover catalog refreshing at scale

We solved these challenges by creating:

  • A layered system of queues where granular requests are processed in parallel.
  • A “workflow” tracking mechanism so that the system is aware when processing is complete.
  • A high-speed locking mechanism to prevent simultaneous duplicate catalog processing.  
  • Monitoring and alerts that proactively keep us abreast of unexpected failures, long-running tasks, queue volume, and the need to scale up.

The need…for speed

Here is a picture of our prior, synchronous architecture:

To parallelize catalog refresh processing in a repeatable way that future asynchronous processing at Bigeye can reuse, we created two layers of queued requests. The first layer accepts requests to refresh a source, schema, or dataset, and a scalable number of workers are spun up to process these requests. Each request then determines a series of datasets that need to be added, updated, or deleted based on the current state of the client’s catalog. Furthermore, if a source has multiple schemas, those schemas are assessed in parallel as well. For example, if a source has 3 schemas, each with 10 tables, the first layer of workers will process the 3 schemas in parallel; once the first layer determines the 30 tables that need to be processed, the second layer will process those 30 tables in parallel.
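To make the layering concrete, here is a minimal sketch of the two-layer model using in-process queues and threads in Python. The queue names, worker counts, and helper functions are illustrative stand-ins rather than our actual implementation, which runs these workers across containers.

```python
# Minimal sketch of a two-layer queued catalog refresh (names are hypothetical).
import queue
import threading

schema_queue = queue.Queue()      # layer 1: source/schema/dataset refresh requests
dataset_op_queue = queue.Queue()  # layer 2: add/update/delete operations per dataset

def assess_schema(request):
    """Compare the client's catalog to our records and return dataset operations."""
    # Placeholder: in reality this inspects the client's information schema.
    return [("upsert", table) for table in request["tables"]]

def apply_dataset_operation(action, table):
    print(f"{action}: {table}")

def schema_worker():
    while True:
        request = schema_queue.get()
        try:
            for op in assess_schema(request):
                dataset_op_queue.put(op)  # fan out to the second layer
        finally:
            schema_queue.task_done()

def dataset_worker():
    while True:
        action, table = dataset_op_queue.get()
        try:
            apply_dataset_operation(action, table)  # add/update/delete the dataset
        finally:
            dataset_op_queue.task_done()

# Each layer gets its own, independently scalable pool of workers.
for _ in range(3):
    threading.Thread(target=schema_worker, daemon=True).start()
for _ in range(10):
    threading.Thread(target=dataset_worker, daemon=True).start()

# A source with 3 schemas, each with 10 tables: the 3 schemas are assessed in
# parallel, then the 30 resulting dataset operations are processed in parallel.
for i in range(3):
    schema_queue.put({"schema": f"schema_{i}", "tables": [f"t_{i}_{j}" for j in range(10)]})

schema_queue.join()
dataset_op_queue.join()
```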

The second queuing layer, which processes the adds, updates, and deletes for datasets and their underlying fields, likewise has a scalable number of workers that handle those dataset operations in parallel. This decision was made in the spirit of microservices without actually having to build out the infrastructure for another microservice. If in six months we find that the load falls mainly on processing the adds, updates, and deletes, we can scale up the number of “dataset operation” workers. If we find the assessment of each schema proves to be most of the work, we can scale up the number of workers for that layer.

Additionally, each container in our architecture has its own set of workers. This means that if one instance of our application in the cloud is temporarily down, our other instances will not only handle the load but can also be scaled up to match previous performance if required. Being able to scale from two to three different angles promises to provide us with ample flexibility going forward.

Knowing when a workflow has been completed

Thanks to the scalability features of modern cloud applications, rapidly handling 10-100x the number of datasets observed by our system is vastly simpler than it was 10 years ago. However, we also needed a reusable way to know when that work has finished, in any asynchronous processing situation.

To solve this, we built a “workflow” service that tracks both an overall workflow and its individual processes, along with when each completes. Features that perform asynchronous processing in our systems now have the option to create a workflow up front and easily associate each queued request for an operation with a “workflow process.”

Adding the right database indexing made it performant to determine whether a just-completed process was the final piece of the puzzle. When the last process completes, the workflow automatically detects this and marks itself complete. We then added API endpoints to check the status of a workflow at any level: for a source, a schema, or an individual dataset. Now, consumers of our API, including our user interface, can know whether a fully parallelized process has completed successfully, failed, or partially failed.
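As a rough illustration of that bookkeeping, the sketch below uses an in-memory SQLite table with an index on (workflow_id, status) to mark a process complete and close the workflow once nothing is left running. The table and column names are hypothetical, and our production service relies on the application database’s transactional guarantees to keep concurrent workers from racing.

```python
# Minimal sketch of workflow completion tracking (schema and names are hypothetical).
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE workflow (id INTEGER PRIMARY KEY, status TEXT DEFAULT 'RUNNING');
CREATE TABLE workflow_process (
    id INTEGER PRIMARY KEY,
    workflow_id INTEGER NOT NULL,
    status TEXT DEFAULT 'RUNNING'
);
-- The index that makes "is anything still running?" cheap to answer.
CREATE INDEX idx_process_workflow_status ON workflow_process (workflow_id, status);
""")

def complete_process(process_id, workflow_id):
    """Mark one process complete and close the workflow if it was the last one."""
    with db:  # single transaction so the check and the updates stay consistent
        db.execute("UPDATE workflow_process SET status = 'COMPLETE' WHERE id = ?", (process_id,))
        (remaining,) = db.execute(
            "SELECT COUNT(*) FROM workflow_process WHERE workflow_id = ? AND status = 'RUNNING'",
            (workflow_id,),
        ).fetchone()
        if remaining == 0:
            db.execute("UPDATE workflow SET status = 'COMPLETE' WHERE id = ?", (workflow_id,))

# One workflow with three queued processes (e.g. three schemas in a source).
db.execute("INSERT INTO workflow (id) VALUES (1)")
db.executemany("INSERT INTO workflow_process (id, workflow_id) VALUES (?, 1)", [(1,), (2,), (3,)])

for pid in (1, 2, 3):
    complete_process(pid, workflow_id=1)

print(db.execute("SELECT status FROM workflow WHERE id = 1").fetchone())  # ('COMPLETE',)
```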

This design also lends itself to a reactive architecture, should a future priority warrant one, without any of the recent work becoming throwaway. Currently, when a workflow completes, the process looks something like this:

In the future, we may have a more specific need for our web application to react immediately to an event triggered by our server-side architecture. Let’s suppose we develop a complex dashboard of some kind that has many real-time interactions that are triggered by asynchronous processes completing. In such a case, we could add to the existing design by introducing web sockets or technology like AWS Amplify or Azure SignalR, as noted in the diagram below.  

Preventing unnecessary, duplicate processing

Culling duplicate processing of a client’s catalog is not just about performance. Sensitivity to larger payloads and processing costs can vary greatly from customer to customer, so as the customer base grows, so does the need to be more precise in how we interact with our clients’ data sources.

To prevent running the same catalog refresh operation multiple times, we developed a locking mechanism using Redis. We chose Redis for several reasons:

  • It is fast.
  • Redis is geo-redundant in the cloud, which allows the various instances of our application to “lock” an “item” in one region without worrying that an instance in another region will fail to see the lock in time. In catalog refreshing’s case, sources, schemas, and datasets are the lockable items.
  • It is tech-stack agnostic and prevents us from being vendor-locked, allowing us flexibility in integrating with some of our larger clients’ architectures.
  • As a bonus, locks expire automatically with Redis’ time-to-live feature. No cleanup required!

The nifty part is that catalog refresh locks cascade up and down the dependency tree in a catalog. For example, if Adam in New York refreshes the catalog for a source, the source acquires a lock. If Tina in Georgia then wants to refresh the catalog for an individual schema or table within that same source, the lock on the source will prevent the individual schema or dataset from being refreshed. This saves time and processing power for us and our clients.
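The sketch below shows roughly how such TTL-based, cascading locks can be expressed with Redis’ atomic SET NX EX. The key layout, TTL, and function names are illustrative assumptions rather than our production scheme, and it glosses over the small race between checking the parent lock and acquiring the child one.

```python
# Minimal sketch of TTL-based, cascading catalog locks in Redis
# (key layout and names are hypothetical, not Bigeye's actual scheme).
import redis

r = redis.Redis()  # assumes a geo-redundant Redis reachable from every region
LOCK_TTL_SECONDS = 15 * 60  # locks expire on their own; no cleanup required


def try_lock_schema(source_id: int, schema_id: int) -> bool:
    """Lock a schema for refresh, respecting any lock already held on its parent source."""
    source_key = f"catalog-refresh:source:{source_id}"
    schema_key = f"catalog-refresh:source:{source_id}:schema:{schema_id}"

    # A lock on the parent source already covers this schema, so refuse to duplicate work.
    if r.exists(source_key):
        return False

    # SET NX EX is atomic: only one caller in any region can acquire the lock,
    # and it expires automatically after the TTL.
    return bool(r.set(schema_key, "locked", nx=True, ex=LOCK_TTL_SECONDS))


# Adam refreshes the whole source, acquiring the source-level lock ...
r.set("catalog-refresh:source:42", "locked", nx=True, ex=LOCK_TTL_SECONDS)
# ... so Tina's request to refresh a schema inside that same source is skipped.
print(try_lock_schema(source_id=42, schema_id=7))  # False
```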

Keep Watch!

Automated testing for asynchronous processing is complex. Unit tests can cover the individual components, but a feature like this is also a stronger case for integration tests, which can require special setup and even mocking specific cases in certain situations. It is easier than usual to miss something, harder to have full visibility into a process like this, and all the more important to get things right.

Ensuring the reliability of functionality like this with all of its ins and outs requires thorough vigilance. We set up Datadog monitoring and alerts down to the individual queue level so that we stay informed in the following circumstances:

  • Our queue volume exceeds a certain threshold.
  • Our internal or external endpoints begin to accrue failures or incur degraded performance.

This way, we are the first to know if anything is about to go wrong with this feature.
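To give a rough sense of how those signals reach Datadog, the sketch below emits per-queue and per-endpoint metrics through the DogStatsD client. The metric names and tags are hypothetical, and the actual alert thresholds live in Datadog monitors rather than in code.

```python
# Minimal sketch of per-queue monitoring with DogStatsD
# (metric names and tags are hypothetical; alert thresholds are configured in Datadog).
from datadog import initialize, statsd

initialize(statsd_host="localhost", statsd_port=8125)  # DogStatsD agent sidecar

def report_queue_depth(queue_name: str, depth: int) -> None:
    # Gauge the current depth of each queue; a Datadog monitor alerts when it
    # stays above the threshold set for that queue.
    statsd.gauge("catalog_refresh.queue.depth", depth, tags=[f"queue:{queue_name}"])

def report_request_outcome(endpoint: str, success: bool, duration_ms: float) -> None:
    # Count failures and time requests so degraded endpoints surface quickly.
    statsd.increment(
        "catalog_refresh.requests",
        tags=[f"endpoint:{endpoint}", f"status:{'ok' if success else 'error'}"],
    )
    statsd.timing("catalog_refresh.request.duration", duration_ms, tags=[f"endpoint:{endpoint}"])
```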

Bigeye is growing

In order to meet the architectural challenges that accompany rapid growth, it’s important to implement features that will scale to continue meeting client expectations. The stakes can be high, but the ride can be super enjoyable.

Think you would thrive on a team that is building data observability solutions and scaling out our systems? We are hiring! Check out our careers page for current listings. Or if you are interested in taking a close look at what we’ve built, get a demo.

The Data Reliability Engineering conference—DRE-Con—is taking place on May 25th & 26th. Hear speakers from Verizon, Strava, DoorDash, attend a hands-on workshop, and participate in live Q&A. DRE-Con is virtual and free to attend.
