Thought leadership
January 6, 2023

Data in Practice: Building out a data quality infrastructure with Brex

In the "Data in Practice" series, we talk to real data engineers who have put data reliability engineering concepts into practice, learning from their challenges and successes in the real world.

Kyle Kirwan

In this interview, we look at one example from Brex, the fintech unicorn. We spoke with Andre Menck, a former member of Brex’s data platform team, to learn how they approached building reliable data infrastructure through four years of hypergrowth.

Andre's role at Brex

Andre was the senior engineer on the data platform team at Brex for four years. As one of the earliest data engineers at the company, he had a hand in most of the projects on the team, either in an advising capacity or doing hands-on work.

The data platform team’s function was to create self-serve platforms for data scientists, analytics users, and business users.

Evolution of Brex's data infrastructure

Andre was heavily involved in building out Brex’s data infrastructure, including migrating data storage to Snowflake, deploying Airflow, and supporting the three core data science use cases that developed over the next few years:

  • Fraud detection - things like transaction fraud, ACH fraud, account takeover.
  • Underwriting - measuring customer creditworthiness and credit risk.
  • Product features involving data science - for example, ranking customer value so customer experience (CX) agents could prioritize calls, and showing customers insights about their own spending habits.

Where they started

When Andre joined Brex, there was no data platform team. At the time, Brex’s data “warehouse” consisted of a MySQL instance. The ETL process consisted of data engineers using AWS DMS (Data Migration Service) to replicate data from the production database into the MySQL instance.

In terms of analytics, Brex employees with read-access on the database ran manual SQL queries. And, there were no data science use cases.

Brex's data stack today

Brex’s data stack at the beginning of 2023 has evolved by leaps and bounds from where they were a few years ago. It now consists of:

  • Orchestration: Airflow
  • Data warehouse: Snowflake (with some custom permissioning infrastructure on top)
  • Transformation: dbt
  • Asynchronous events: Kafka
  • BI: Looker
  • Notebooking for data scientists: Databricks

Making machine learning self-service

As the number of data science use cases increased, the team decided to build a platform to support more use cases without having to create engineering bottlenecks. They hoped to move toward a more self-service model, only requiring engineering help for extreme or fringe cases.

The platform (essentially a Python library) allowed data scientists to prototype machine learning models, then save them in the platform, where they would run in production. Each model was then available for consumption as an API endpoint on the platform, with the platform handling all the data fetching.
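The article doesn't describe the library's actual interface, but the workflow it outlines (prototype a model, save it to the platform, consume it as an endpoint) might look roughly like this sketch. Names such as `ModelPlatform`, `register`, and `predict` are illustrative assumptions, not Brex's real API:

```python
# Hypothetical sketch of the self-serve model workflow described above.
# ModelPlatform, register, and predict are made-up names for illustration.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class ModelPlatform:
    """Registry that stores models and serves them behind named endpoints."""
    _models: Dict[str, Callable[[dict], Any]] = field(default_factory=dict)

    def register(self, name: str, model: Callable[[dict], Any]) -> None:
        # A data scientist "saves" a prototyped model into the platform.
        self._models[name] = model

    def predict(self, name: str, features: dict) -> Any:
        # In production this would sit behind an HTTP endpoint, with the
        # platform fetching the features itself rather than taking them
        # as an argument.
        return self._models[name](features)


platform = ModelPlatform()
platform.register("txn_fraud", lambda f: f["amount"] > 10_000)
print(platform.predict("txn_fraud", {"amount": 25_000}))  # True
```

The point of this design is the division of labor: data scientists only write and register the model function, while the platform owns deployment, serving, and data fetching.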

Solving data quality

Because Brex works with payments and financial data, it was important to ensure that the analytics used for fraud detection were up to date and error-free. Data quality is critical to the business: in the worst cases, data quality issues can quickly spiral into regulatory, financial, and legal catastrophes. Andre helped firefight a number of data quality issues while at Brex:

Problem 1: Airflow DAGs

Brex had around 3,000 data transformations running on Airflow. This huge number was a direct consequence of hypergrowth: “Our analytics team was probably 30, 40 people when I left, so you had all these new people, PMs and analysts, they go to our Snowflake, they see tables, and they’re like, oh, I don’t know where to get this data. It looks like no one has written this query before, so I’ll just do it myself…it’s sort of the flip side of having very independently operating teams.”

While each Airflow transformation was internally consistent (each is a directed acyclic graph), Airflow didn’t know that some of the DAGs also depended on each other – the output of one was often the expected input of another. After seven or eight layers of transformations, teams often ended up with 10-day-old data.

The solution to this problem was to merge everything into one DAG. This was still an ongoing project when Andre left the company.
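A back-of-the-envelope model shows why staleness compounds across uncoordinated schedules, and why merging everything into one DAG fixes it. This is an illustration of the scheduling math, not Brex's actual pipeline:

```python
# Staleness model for chained, independently scheduled daily jobs.
# Each layer reads whatever its upstream last published, so in the worst
# case every layer adds a full scheduling interval of lag.
def worst_case_staleness_days(layers: int, interval_days: float = 1.0) -> float:
    """Upper bound on data age after `layers` uncoordinated runs."""
    return layers * interval_days


def merged_dag_staleness_days(interval_days: float = 1.0) -> float:
    # With all layers in a single DAG, downstream tasks wait on upstream
    # tasks inside the same run, so data is at most one interval old
    # regardless of how deep the transformation chain is.
    return interval_days


# Seven or eight chained daily transformations, each on its own schedule:
print(worst_case_staleness_days(8))   # 8.0 -- in the ballpark of the
                                      # "10 day old data" described above
print(merged_dag_staleness_days())    # 1.0 -- after merging into one DAG
```

Airflow also offers mechanisms like `ExternalTaskSensor` to express cross-DAG dependencies explicitly, but consolidating into one DAG, as the team chose to do, makes the full dependency graph visible to the scheduler directly.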

Problem 2: Connections with banks

The second problem was in the underwriting space. Brex used Plaid to get customers' financial information from their banks to determine their real-time creditworthiness, but maintaining an ongoing connection to a customer's bank account was challenging.

To handle this issue, Brex came up with a couple of different solutions.

The first was to build an underwriting policy that could deal with stale data – to make sure that if a customer's data was stale, Brex didn’t lower their credit limits right away. “We made that more and more complex to basically be able to guess what is the risk that we attribute to this customer. Let’s say the data we have on the customer is 90 days old, but if they had $100 million in their bank account, they’re probably not insolvent now.”
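The intuition in Andre's example can be sketched as a simple rule: only act on stale data when the last-known balance was also weak. This is a toy illustration of the idea, not Brex's real policy; the thresholds are invented:

```python
# Toy staleness-aware underwriting rule (NOT Brex's actual policy).
# Thresholds below are hypothetical, chosen only to mirror the example
# in the quote: $100M seen 90 days ago is still a solvent customer.
def should_lower_limit(balance_usd: float, data_age_days: int) -> bool:
    """Lower a credit limit only when stale data and a weak balance coincide."""
    if data_age_days <= 30:
        # Fresh enough: trust the numbers directly.
        return balance_usd < 50_000
    if balance_usd >= 100_000_000:
        # $100M on hand 90 days ago: almost certainly not insolvent now.
        return False
    # Stale data and only a modest last-known balance: act conservatively.
    return True


print(should_lower_limit(100_000_000, 90))  # False
print(should_lower_limit(20_000, 90))       # True
```

The real policy, as Andre notes, grew "more and more complex" over time; the point is that data age becomes an explicit input to the risk decision rather than a silent failure mode.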

The second solution was to monitor the bank connections in a more systematic way, so they could pin down exactly which bank was having problems and talk directly to Plaid before customers were impacted.

Problem 3: Product Teams

The third problem was product teams changing the events they emitted without informing the data platform team, which meant Brex suddenly lacked the data behind certain features used by its machine learning models.

While one solution to this problem would have been to write extremely thorough integration tests, this would have been expensive and difficult to maintain. Instead, the data platform team wrote some manual tests and supplemented them with a more “process driven” solution – close coordination between the data science team that consumed the data and the teams that produced it.

For example, they had a library that was used to produce events to Kafka for data science models. Whenever a PR changed how that library was used anywhere in the codebase, a data scientist would be alerted by a linter or autotagged on the PR.

Andre’s advice for building data infrastructure

When asked what advice he would give to himself if he were to build data infrastructure again for a new startup, Andre had the following two principles:

Automate as much as possible

Andre advises building intentionally, and automating as much as possible, even when you're under the gun: “You always build fraud-detection products in a rush, because you’re trying to stop the crime. As such, there’s a tendency toward relying on manual processes to test them. But that ends up being way more of a pain in the long term.”

Grow thoughtfully

Says Andre: “If you just build and build and build, you get that disorganized mess of data assets, and it’s impossible to recover from it as a company. If you have 2,000 employees and that’s your data picture, that will be with you forever."

