Engineering

October 12, 2022

Does your data team hate NoSQL? It doesn't have to be that way!

Egor Gryaznov

It’s no secret that many engineers love NoSQL. That’s because they kind of…invented it. NoSQL data stores were built by engineers, for engineers. So what’s the problem?

The problem shows up for data teams. When it comes to NoSQL, the picture isn’t quite as rosy for us data folks. While developers love it, data teams bear the brunt of the pain once the data has been written. Many data people want to use that data for analytics and need to get it into the data stack, an issue that developers have the luxury of rarely considering.

SQL versus NoSQL: How did we get here?

For decades, relational databases have been the default way to store data, and SQL has been by far the most popular way to access that data. Relational databases are designed around the notion that data is stored in a tabular format, with well-known columns and types. This means that in order to add any new data to a relational database, someone first has to specify a schema - i.e. what the data will look like. Having a schema lets users know what they’re looking at, and lets the database perform many optimizations at storage and query time.

The downside to having a schema is that whenever you want to store new information, you have to update the schema first. And this gets worse if you don’t really know what you want yet! Some projects might go through two or three iterations before landing on a schema that makes sense. This is very painful for developers, who prefer to move fast and are typically measured by the speed with which they can get new features out. On top of this, a schema change usually requires a “database migration”, which can be expensive and often results in temporary performance issues or possibly even outages.
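
To make that concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The `users` table and the `signup_source` column are made up for illustration. The database rejects a write that mentions a column the schema doesn’t know about, and only accepts it after a migration adds that column.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)")
conn.execute("INSERT INTO users (full_name) VALUES (?)", ("Ada Lovelace",))

# Writing a field the schema does not declare fails outright.
try:
    conn.execute(
        "INSERT INTO users (full_name, signup_source) VALUES (?, ?)",
        ("Grace Hopper", "referral"),
    )
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# The "migration": change the schema first, then the same write succeeds.
conn.execute("ALTER TABLE users ADD COLUMN signup_source TEXT")
conn.execute(
    "INSERT INTO users (full_name, signup_source) VALUES (?, ?)",
    ("Grace Hopper", "referral"),
)

# Rows written before the migration have no value for the new column.
user_rows = conn.execute(
    "SELECT full_name, signup_source FROM users ORDER BY id"
).fetchall()
print(rejected, user_rows)
```

Adding a nullable column to a tiny table is about the cheapest migration there is. On a large production table, the same kind of change is where the cost and risk described above tend to come from.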

Enter NoSQL! NoSQL databases store and expose data using non-tabular structures: key-value pairs (DynamoDB, Redis), JSON (“document stores” - MongoDB, Elasticsearch), graphs (Neo4j), or something else completely. The one thing all of these databases have in common is that they typically accommodate data without a predefined structure, and they evolve easily because each record can carry different fields. Not having to think about a schema upfront for every single change means that developers can move quickly, but it causes a huge headache for the data teams downstream.

How NoSQL impacts data teams

Data teams are measured differently than developers. Their success relies on their ability to deliver reliable, understandable data to their users. This data is consolidated from sources across the business, and is typically used to perform interesting analytics that span across many domains. It’s really hard to find and use data without having a well-defined structure, so most analytical databases are relational. This means that data teams have to ingest data from application databases and fit it into something that’s relational, and therefore must have a schema. This was an easy job when developers used only relational databases, but with NoSQL, there’s the added complexity of not only knowing how to access the data in a way that’s specific to the database, but also applying a structure to it after the fact.
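
As a rough illustration of “applying a structure after the fact”, the sketch below flattens nested documents into rows for a fixed list of warehouse columns. The column list and the sample documents are hypothetical, and a real ingestion job would also deal with arrays, types, and naming collisions.

```python
# Hypothetical target columns that the data team has to choose and maintain.
COLUMNS = ["id", "full_name", "address_city", "plan"]


def flatten(doc, parent_key="", sep="_"):
    """Flatten nested dicts: {"address": {"city": "X"}} -> {"address_city": "X"}."""
    flat = {}
    for key, value in doc.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def to_row(doc):
    """Project a document onto the fixed columns; absent fields become None."""
    flat = flatten(doc)
    return tuple(flat.get(column) for column in COLUMNS)


documents = [
    {"id": 1, "full_name": "Ada Lovelace", "address": {"city": "London"}},
    {"id": 2, "full_name": "Grace Hopper", "plan": "pro"},
]
warehouse_rows = [to_row(doc) for doc in documents]
print(warehouse_rows)
```

Notice that any field not listed in `COLUMNS` is silently dropped, which is exactly how a new field in the application can go unnoticed downstream.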

At this point, the burden of exposing the valuable data stored in these NoSQL systems falls on the data team, but they are on the losing end of both battles. On one hand, they don’t know when the data changes, for example when fields are added or removed. On the other hand, they need to make sure that their product (the data in the data warehouse) is understood and usable by their stakeholders. At the end of the day, the data team ends up having to maintain a tabular schema for all this non-tabular data, and that creates overhead that no one wants to take on.
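
Without a heads-up from the application team, about the best a data team can do is notice the change after it happens. Here is a minimal sketch of that kind of check, assuming you can pull two batches of documents (say, yesterday’s and today’s) as Python dicts. The function names and sample data are made up for illustration.

```python
def field_names(documents):
    """Collect every top-level field name seen across a batch of documents."""
    names = set()
    for doc in documents:
        names.update(doc.keys())
    return names


def diff_fields(previous_docs, current_docs):
    """Report which fields appeared or disappeared between two batches."""
    before = field_names(previous_docs)
    after = field_names(current_docs)
    return {"added": sorted(after - before), "removed": sorted(before - after)}


yesterday = [{"id": 1, "full_name": "Ada Lovelace"}]
today = [{"id": 2, "first_name": "Grace", "last_name": "Hopper"}]

drift = diff_fields(yesterday, today)
print(drift)
```

A check like this tells you that something changed, but not why, or what happened to the historical records.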

The happy medium: a compromise between applications and data

Data inherently has a schema and structure, which your application also expects. If you’re using an unstructured database, then you’re really just applying the schema at read time. Your application is making the assertions about the structure, rather than the datastore itself.
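
Here is a small sketch of what “applying the schema at read time” can look like inside an application. The `User` type and its fields are hypothetical. The datastore will accept any document, and it is this parsing code that decides what a valid user looks like.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    """The structure the application expects. The datastore knows nothing about it."""
    id: int
    full_name: str
    email: str


REQUIRED_FIELDS = ("id", "full_name", "email")


def parse_user(doc: dict) -> User:
    """Assert the expected structure when a document is read."""
    missing = [field for field in REQUIRED_FIELDS if field not in doc]
    if missing:
        raise ValueError(f"document is missing fields: {missing}")
    return User(id=int(doc["id"]), full_name=str(doc["full_name"]), email=str(doc["email"]))


user = parse_user({"id": "7", "full_name": "Ada Lovelace", "email": "ada@example.com"})
print(user)
```

The schema exists either way. The only question is whether it lives in application code like this, where only engineers see it, or somewhere the data team can see it too.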

If engineers work with data teams to explicitly declare that structure, everyone can be happy! If you’re an engineer, that means alerting data teams about changes that might have a downstream impact on the data model. Give those teams the time and information they need to work around those changes. That extra step can save a lot of time later.

For example: in a document store like MongoDB, say an object has five fields, and an engineer is splitting one of them into two. They’re taking a “full name” field and splitting it into “first name” and “last name.” The engineer should talk to their data team and say, “Hey, there used to be five fields and now there are six. This is how I am transforming it, and this is whether or not I am backfilling the values.”
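
Here is a sketch of that change on plain Python dicts standing in for documents. The field names are hypothetical, and the split on the first space is deliberately naive (real names are messier). What matters is that the transformation and the backfill decision are written down where the data team can read them.

```python
def split_full_name(doc):
    """Return a copy of the document with full_name split into first_name and last_name.

    Assumption for illustration: everything before the first space is the first name.
    Documents that have already been migrated are returned unchanged.
    """
    if "full_name" not in doc:
        return dict(doc)
    updated = {key: value for key, value in doc.items() if key != "full_name"}
    first, _, last = doc["full_name"].partition(" ")
    updated["first_name"] = first
    updated["last_name"] = last
    return updated


old_doc = {
    "_id": 1,
    "full_name": "Ada Lovelace",
    "email": "ada@example.com",
    "plan": "pro",
    "created_at": "2022-10-12",
}

# Backfilling means running the same transformation over historical documents.
new_doc = split_full_name(old_doc)
print(len(old_doc), "fields ->", len(new_doc), "fields")
```

If the historical documents are not backfilled, the data team needs to know that too, because both shapes will keep showing up in the same collection.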

In general, if you’re an engineer, tell data teams what changes you’re making, and how they will impact historical data. With enough heads up, those teams won’t need to reproduce changes on their end, or get surprised by sudden changes in the data they’re working with. It’s true that you add a slight layer of complexity when you have to openly communicate with your data team - there’s an extra step in walking down a hall and having a conversation, or sending a few Slack messages. However, you still get all the benefits of evolving the data model quickly in your application, while still keeping the data team informed and happy. Your data team doesn’t want surprises (and this is why Bigeye exists, by the way!).

NoSQL vendors are even starting to rediscover patterns from traditional relational databases. They’re realizing that their database will have to be used by both developers and data teams, and both need to access it. Data teams speak SQL, and NoSQL vendors are starting to add SQL access to their data, in order to support the largest possible audience in one place.

The best of both worlds is possible. It just requires communication and transparency between engineers and data teams.
