Engineering - October 12, 2022

Does your data team hate NoSQL? It doesn't have to be that way!


Egor Gryaznov

It’s no secret that many engineers love NoSQL. That’s because they kind of…invented it. Yes, NoSQL data stores were invented by engineers. For engineers, by engineers. So what’s the problem?

The problem lands on data teams. When it comes to NoSQL, the picture isn’t quite as rosy for us data folks. Developers love it, but data teams bear the brunt of the pain once the data has been written. Many data people want to use that data for analytics and need to get it into the data stack, a concern that developers have the luxury of rarely considering.

SQL versus NoSQL: How did we get here?

For decades, relational databases have been the default way to store data, and SQL has been by far the most popular way to access that data. Relational databases are designed around the notion that data is stored in a tabular format, with well-known columns and types. This means that in order to add any new data to a relational database, someone has to first specify a schema - i.e. what the data will look like. Having a schema allows users to know what they’re looking at, and allows the database to perform many optimizations at storage and query time.

The downside to having a schema is that if you ever want to store new information, you have to update the schema every time. And this gets worse if you don’t really know what you want yet! Some projects might go through two or three iterations before landing on a schema that makes sense. This is very painful for developers, who prefer to move faster and are typically measured by the speed with which they can get new features out. On top of this, any schema change requires an expensive “database migration”, which often results in temporary performance issues or possibly even outages.
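To make the schema-first constraint concrete, here is a minimal sketch using Python's built-in sqlite3 module. The users table and the signup_source column are hypothetical names chosen for illustration, and SQLite stands in for whatever relational database your application actually uses. A real migration on a large production table is far more involved than a single statement.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")

# The schema has to exist before any row can be written.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)")
conn.execute("INSERT INTO users (full_name) VALUES (?)", ("Ada Lovelace",))

# Writing a field the schema does not know about is rejected outright.
try:
    conn.execute(
        "INSERT INTO users (full_name, signup_source) VALUES (?, ?)",
        ("Grace Hopper", "referral"),
    )
    insert_rejected = False
except sqlite3.OperationalError:
    insert_rejected = True

# The "migration": change the schema first, then the new field can be stored.
conn.execute("ALTER TABLE users ADD COLUMN signup_source TEXT")
conn.execute(
    "INSERT INTO users (full_name, signup_source) VALUES (?, ?)",
    ("Grace Hopper", "referral"),
)

rows = conn.execute(
    "SELECT full_name, signup_source FROM users ORDER BY id"
).fetchall()
print(insert_rejected, rows)
```

Rows written before the migration simply have no value for the new column, which is exactly the kind of detail that needs a decision every time the schema changes.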

Enter NoSQL! NoSQL databases store and expose data using non-tabular structures: key-value pairs (DynamoDB, Redis), JSON (“document stores” - MongoDB, Elasticsearch), graphs (Neo4j), or something else completely. The one thing all of these databases have in common is that they typically accommodate data without a predefined structure, and they evolve easily because each record can carry different fields. Not having to think about a schema upfront for every single change means that developers can move quickly, but this causes a huge headache for the data teams downstream.

How NoSQL impacts data teams

Data teams are measured differently than developers. Their success relies on their ability to deliver reliable, understandable data to their users. This data is consolidated from sources across the business, and is typically used to perform analytics that span many domains. It’s really hard to find and use data without a well-defined structure, so most analytical databases are relational. This means that data teams have to ingest data from application databases and fit it into something that’s relational, and therefore must have a schema. This was an easy job when developers used only relational databases. With NoSQL, there’s the added complexity of not only knowing how to access the data in a way that’s specific to each database, but also applying a structure to it after the fact.

The burden of exposing the valuable data stored in these NoSQL systems falls on the data team, but they are on the losing end of two battles. On one hand, they don’t know when the data changes, for example when fields are added or removed. On the other hand, they need to make sure that their product (the data in the data warehouse) is understood and usable by their stakeholders. At the end of the day, the data team ends up having to maintain a tabular schema for all this non-tabular data, and that creates overhead that no one wants to take on.
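Here is a rough sketch of what that overhead looks like in practice. The documents, field names, and helper functions below are all hypothetical. The idea is that the data team keeps a fixed list of columns, projects every document onto it, and has to go looking for fields that showed up without warning.

```python
from typing import Any

# Hypothetical documents as they might come out of a document store.
documents: list[dict[str, Any]] = [
    {"_id": 1, "full_name": "Ada Lovelace", "plan": "free"},
    {"_id": 2, "full_name": "Grace Hopper"},
    {"_id": 3, "full_name": "Alan Turing", "plan": "pro", "referrer": "newsletter"},
]

# The tabular schema the data team currently maintains in the warehouse.
expected_columns = ["_id", "full_name", "plan"]


def to_rows(docs, columns):
    """Project each document onto fixed columns, filling gaps with None."""
    return [tuple(doc.get(col) for col in columns) for doc in docs]


def unexpected_fields(docs, columns):
    """Return fields present in the documents but missing from the schema."""
    seen = set()
    for doc in docs:
        seen.update(doc.keys())
    return sorted(seen - set(columns))


rows = to_rows(documents, expected_columns)
drift = unexpected_fields(documents, expected_columns)
print(rows)
print(drift)
```

In this sketch the referrer field never reaches the warehouse table unless someone notices it in the drift list and updates expected_columns by hand.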

The happy medium: a compromise between applications and data

Data inherently has a schema and structure, which your application also expects. If you’re using an unstructured database, then you’re really just applying the schema at read time. Your application is making the assertions about the structure, rather than the datastore itself.
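A small sketch of what applying the schema at read time can look like. The User class, the read_user function, and the field names are assumptions made up for this example, not part of any particular database driver. The datastore accepts any document, and the application is the one that asserts the structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    """The structure the application assumes, written down explicitly."""

    user_id: int
    full_name: str


def read_user(doc: dict) -> User:
    """Apply the schema at read time; raise if the document does not fit."""
    if not isinstance(doc.get("_id"), int):
        raise ValueError("expected integer '_id'")
    if not isinstance(doc.get("full_name"), str):
        raise ValueError("expected string 'full_name'")
    return User(user_id=doc["_id"], full_name=doc["full_name"])


# A document that matches the assumed structure reads cleanly.
user = read_user({"_id": 1, "full_name": "Ada Lovelace"})

# A document that does not match only fails here, at read time.
try:
    read_user({"_id": 2})
    read_failed = False
except ValueError:
    read_failed = True

print(user, read_failed)
```

Once the expected structure is written down like this, it is also something that can be shared with the data team instead of living only in application code.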

If engineers work with data teams to explicitly declare that structure, everyone can be happy! If you’re an engineer, that means alerting data teams about changes that might have a downstream impact to the model. Allow those teams time and information to work around those changes. That extra step can save endless time later.

For example, take a document store like MongoDB, where an object has five fields and an engineer is splitting one of them into two. Say they’re taking a “full name” field and splitting it into “first name” and “last name.” The engineer should talk to their data team and say, “Hey, there used to be five fields and now there are six. This is how I am transforming the data, and whether or not I am backfilling the existing values.”
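The full name split could be sketched roughly as follows. The document, its field names, and the split_full_name helper are hypothetical, and the splitting rule (first token is the first name, the rest is the last name) is a simplifying assumption that will not hold for every real name. The last few lines compute the kind of summary an engineer could pass along to the data team.

```python
def split_full_name(doc: dict) -> dict:
    """Return a copy of doc with full_name replaced by first_name and last_name.

    Assumption for this sketch: the first whitespace-separated token is the
    first name and everything after it is the last name.
    """
    migrated = {key: value for key, value in doc.items() if key != "full_name"}
    first, _, last = doc.get("full_name", "").strip().partition(" ")
    migrated["first_name"] = first
    migrated["last_name"] = last.strip()
    return migrated


# A hypothetical five-field document before the change.
before = {
    "_id": 1,
    "full_name": "Ada Lovelace",
    "email": "ada@example.com",
    "plan": "free",
    "created_at": "2022-10-12",
}

after = split_full_name(before)

# The change summary worth sharing with the data team.
added = sorted(set(after) - set(before))
removed = sorted(set(before) - set(after))
print(len(before), len(after), added, removed)
```

Whether split_full_name is also run over existing documents (the backfill) or only applied to new writes is exactly the detail the data team needs to hear about, because it decides whether old records still carry the old field.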

In general, if you’re an engineer, tell data teams what changes you’re making, and how they will impact historical data. With enough heads up, those teams won’t need to reproduce changes on their end, or get surprised by sudden changes in the data they’re working with. It’s true that you add a slight layer of complexity when you have to openly communicate with your data team - there’s an extra step in walking down a hall and having a conversation, or sending a few Slack messages. However, you still get all the benefits of evolving the data model quickly in your application, while keeping the data team informed and happy. Your data team doesn’t want surprises (and this is why Bigeye exists, by the way!).

NoSQL vendors are even starting to rediscover patterns from traditional relational databases. They’re realizing that their databases have to serve both developers and data teams. Data teams speak SQL, so NoSQL vendors are starting to add SQL access to their data, in order to support the largest possible audience in one place.

The best of both worlds is possible; it just requires communication and transparency between engineers and data teams.

