Data observability at scale: monitoring a 50,000 table data lake with Bigeye
You never know if what you theorize will work in practice! Recently, the Bigeye platform was put to the test when a partner connected a data warehouse with thousands of schemas and 50,000 tables—more than 10 times what we usually expect.
Our early engineering team cut their teeth building large-scale systems at Uber. As a result, we designed Bigeye’s infrastructure to handle large-scale environments from the get-go.
We recently got the chance to stress test our infrastructure when a partner connected a Databricks Delta Lake data source with 2500+ schemas and 50,000+ tables to Bigeye. “Big data” might not be trendy anymore, but it certainly seems to still be around in the enterprise space!
We’ve monitored some large environments before, but this was the biggest single data source Bigeye had been connected to yet. Thanks to our penchant for designing scalability into Bigeye from the early days, it handled this unexpected stress test gracefully. It indexed and profiled the source as expected, without bothering our SRE team or—even more importantly—slowing down the user who connected the large source.
In this blog post, we will give a bit more color on Bigeye’s infrastructure, and the engineering best practices we’ve implemented to make it robust in the face of scale.
As shown in the architecture diagram above, Bigeye consists of a few services:
- Datawatch - Bigeye’s scalable core service which executes most of the core logic, including metric runs and indexing
- ML platform - A training layer and a serving layer that handles all of our Autothresholds anomaly detection models
- Scheduler - Orchestrates metric runs, Autothresholds model retrains, and other recurring tasks
- AWS RDS (MySQL) - Bigeye’s application database
- Redis - Bigeye’s caching layer for responsiveness
- RabbitMQ - For queueing tasks like metric collections
How Bigeye indexes and profiles data sources
When you first connect a new data source, Bigeye needs to build an index of every schema and table it can see in the data source. This enables Bigeye’s most fundamental functions, like letting users browse their data in catalog form, see what metrics our Autometrics recommender system suggests they track, and so on.
Once the schemas have been indexed and Bigeye knows what’s in the data source, we need to profile the tables so we can suggest monitoring for them. The profiler, a component of Datawatch, queries each table for two days of history with a limit of 10,000 rows. If Bigeye doesn’t recognize a row-creation time column, it falls back to a simple SELECT * and lets the data source decide which 10,000 rows to return.
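To make the fallback concrete, here’s a minimal sketch of that query-building logic. The function name, the exact SQL dialect, and the interval syntax are illustrative assumptions, not Bigeye’s actual implementation:

```python
from typing import Optional

def build_profile_query(table: str, created_col: Optional[str],
                        row_limit: int = 10_000) -> str:
    """Prefer a two-day window on the row-creation column; if no such
    column is recognized, fall back to a plain SELECT * and let the
    data source decide which rows come back."""
    if created_col:
        return (
            f"SELECT * FROM {table} "
            f"WHERE {created_col} >= CURRENT_TIMESTAMP - INTERVAL '2' DAY "
            f"LIMIT {row_limit}"
        )
    return f"SELECT * FROM {table} LIMIT {row_limit}"
```

The LIMIT keeps any single profiling query cheap regardless of table size, which matters when there are 50,000 tables to get through.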
Once profiling is completed, Autometrics uses the profiling results to generate recommendations based on both physical data types and semantic concepts:
- Low cardinality indicates a potential enum column
- Lack of duplicates suggests a uniqueness metric
- High occurrence of UUID patterns suggests a UUID regex metric
- Numeric data types suggest the need for distribution metrics like min, max, avg
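The heuristics above can be sketched as a simple rule engine over a column’s sampled values. The thresholds, function names, and metric labels below are hypothetical placeholders, not Autometrics’ real rules:

```python
import re

# Loose UUID shape: 8-4-4-4-12 hex groups
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)

def suggest_metrics(values, is_numeric=False):
    """Toy version of profile-driven metric suggestions."""
    suggestions = []
    non_null = [v for v in values if v is not None]
    if is_numeric:
        # Numeric types -> distribution metrics
        return ["min", "max", "avg"]
    distinct = set(non_null)
    if non_null and len(distinct) <= 10:              # low cardinality -> enum check
        suggestions.append("value_in_list")
    if non_null and len(distinct) == len(non_null):   # no duplicates -> uniqueness
        suggestions.append("uniqueness")
    if non_null and sum(bool(UUID_RE.match(str(v))) for v in non_null) / len(non_null) > 0.9:
        suggestions.append("uuid_regex")              # mostly UUID-shaped values
    return suggestions
```

In practice a recommender like this would weigh sample size and combine signals, but the shape of the logic is the same: profile statistics in, candidate metrics out.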
The profiling process is generally expensive, so we need to be careful about what we profile and how often, to avoid racking up an infrastructure bill. Not all schemas need to be monitored, so selective profiling allows Bigeye to skip unused or unimportant schemas entirely. Reprofiling the same table can also be avoided unless Bigeye sees a change in the table’s columns, tracked by watching for ALTER TABLE statements run on the table.
Note that the samples coming from the profiler are not stored in Bigeye’s application database but rather only used for autometric generation.
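One way to implement the “only reprofile on schema change” guard is to fingerprint each table’s column definitions and reprofile only when the fingerprint changes. This is a sketch under that assumption; the function names are illustrative:

```python
import hashlib

def schema_fingerprint(columns):
    """Hash a table's column names and types. A changed hash signals a
    schema change (e.g., from an ALTER TABLE) and triggers a reprofile."""
    canonical = "|".join(f"{name}:{dtype}" for name, dtype in sorted(columns))
    return hashlib.sha256(canonical.encode()).hexdigest()

def needs_reprofile(last_fingerprint, columns):
    return schema_fingerprint(columns) != last_fingerprint
```

Storing one small hash per table is far cheaper than re-running profiling queries against 50,000 tables on every sync.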
Bottlenecks to solve
In the profiling flow above, there are a couple of weak points that, if improperly designed, can result in service issues.
The first potential weak point is the customer’s data source. We don’t want to slam the data source with a huge number of profiling queries and risk slowing down their data science teams or any applications that might be running on it.
The second potential weak point is Bigeye’s own service database and local store. When processing a large data source like the one our partner connected, we’re temporarily doing a lot of writes to Bigeye. In theory there are also limits on the number of API calls, but in practice those limits mostly manifest as load on our databases.
Bigeye has been deliberately designed to minimize pressure on these chokepoints. In particular, the initial profiling of tables is not done synchronously. Rather, it’s broken into asynchronous chunks. This also results in a better user experience – by making these heavy requests asynchronous rather than synchronous, the user doesn’t have to sit and wait for Bigeye to index and profile every single table before they can start interacting with the UI.
To process the tables asynchronously, Bigeye uses a message queue, specifically RabbitMQ. RabbitMQ provides an efficient way to store pending transactions (in this case, requests for tables to be processed) and forward them on to somewhere else that handles them. (ELI5 tip: you use RabbitMQ whenever your service is doing something that may take a significant amount of time, like uploading large files, sending large emails, video encoding, etc).
What does this mean in more detail? The tables aren’t just placed in the queue all at once. Rather, each table (or set of columns) is scheduled for an arbitrary time throughout the day. This ensures that Bigeye doesn’t overwhelm the customer’s data source or take up too much available computing power.
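A minimal sketch of that spreading-out step, assuming a 24-hour window and JSON task messages bound for a queue (the message shape and function name are made up for illustration):

```python
import json
import random

def schedule_profiling_tasks(tables, window_seconds=86_400, seed=None):
    """Assign each table a random offset within a 24-hour window so the
    profiling load is jittered across the day instead of hitting the
    customer's data source all at once. Returns (offset, message) pairs
    sorted by offset, each message ready to publish to a queue."""
    rng = random.Random(seed)
    tasks = []
    for table in tables:
        offset = rng.randrange(window_seconds)
        tasks.append((offset, json.dumps({"table": table,
                                          "run_at_offset_s": offset})))
    return sorted(tasks)
```

A worker consuming from RabbitMQ would then pick these up at (or after) their scheduled offsets, so concurrency against the source stays bounded.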
Another best practice that Bigeye implements is connection pooling. Ordinarily, every time you connect to a database, you open and close a connection and a network socket. This opening and closing takes compute power. Most of the time, for simple operations, this is not very expensive, but as things scale up, it can impact performance.
Often, it instead makes sense to “pool” the connections, keeping them open and passing them from operation to operation. Bigeye does this when connecting to data sources with JDBC APIs. (Pooling is provided as a standard feature by Java libraries.) We also configure the JDBC connection pool to accept only a certain maximum number of connections – this causes the app to block briefly while waiting for a new connection, which Bigeye uses to provide back-pressure that prevents overloading the customer’s warehouse.
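The back-pressure idea is easiest to see in a toy pool. This sketch (in Python rather than the Java/JDBC stack the post describes) shows how a bounded pool makes callers block when every connection is checked out:

```python
import queue

class BoundedPool:
    """Minimal connection pool: at most max_size connections ever exist.
    borrow() blocks when all of them are checked out - that blocking is
    the back-pressure that caps concurrent load on the warehouse."""

    def __init__(self, connect, max_size):
        self._conns = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            self._conns.put(connect())  # pre-open all connections once

    def borrow(self, timeout=None):
        # Blocks (up to timeout) if the pool is exhausted
        return self._conns.get(timeout=timeout)

    def release(self, conn):
        self._conns.put(conn)
```

Production pools (e.g., the standard Java pooling libraries) add validation, eviction, and timeouts, but the capacity-as-back-pressure principle is the same.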
There’s one more thing Bigeye stays conscious of. Bigeye’s product runs on top of customer data sources, querying them directly, and most data sources charge by usage. To prevent a situation where a customer is stuck with a huge infrastructure bill, Bigeye is judicious about its querying, in particular:
- Storing the aggregate data as it comes back in its own database – unless you’re going to the preview page, Bigeye is not live querying data
- Batching metric queries together; for instance, rather than running three separate queries for a max, a min, and an average, running one query for all three keeps the cost down
- Being aware of warehouse particularities and working around them. For example, since Snowflake charges customers based on how long the warehouse is running, Bigeye tries to run all the Snowflake queries at the same time, rather than, say, every ten minutes, which would keep the customer’s instance up all night.
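The query-batching point is simple to illustrate: compute several aggregates in one table scan instead of one scan per metric. This is a generic sketch, not Bigeye’s query generator; the function name and alias scheme are assumptions:

```python
def batched_metrics_query(table, column, metrics=("min", "max", "avg")):
    """One round trip instead of three: several aggregates in a single
    scan of the table, which is what the warehouse bills for."""
    select_list = ", ".join(
        f"{m.upper()}({column}) AS {m}_{column}" for m in metrics
    )
    return f"SELECT {select_list} FROM {table}"
```

On a usage-billed warehouse, collapsing three scans into one roughly cuts the cost of those metrics to a third.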
Our engineering team loves tackling dynamic, real-world situations like this one that help test the boundaries of the product. It’s a great reminder that the design work we do upfront pays dividends later, when Bigeye needs to perform for a large customer or partner.
Having worked in large scale environments before, we’re familiar with running into scalability challenges with our tools. It’s fun to be on the other side of the table and to build for others who are dealing with similar environments.
Schema change detection