Engineering
March 6, 2023

Why everyone tries to reinvent SQL (and why it never works)

Inventing a better SQL is easier said than done, even though many have tried. Here's why, in our opinion, SQL is here to stay.

Egor Gryaznov

“It’s the new Excel” is as ubiquitous as “The Uber for (fill in the blank).” However, as soon as something is pronounced “the new Excel,” those words tend to be famous last ones. As much as people try to reinvent Excel, things always eventually end up in an Excel spreadsheet. Between the fact that every business person is familiar with Excel and the extensibility that it provides for power users, Excel can support any use case you can think of while also being easily accessible to non-technical users.

The same goes for SQL. SQL has been the lingua franca for data access for decades, embedding itself in the minds and workflows of every major data user. There’s business logic encoded in SQL (often so convoluted that no one even knows how it works) that no one will ever want to migrate to anything else. Many people think they can create a more efficient way of accessing data than SQL, but they gloss over a few major points that make SQL as sticky and irreplaceable as it is.

SQL has been around for a long time

The first advantage SQL has is longevity. This has allowed it to be present in some way across every data system, with even more SQL in legacy systems (those created before any major alternative existed). On top of that, SQL has had decades’ worth of iterations, which have made it more robust and expressive over time, with most dialects adopting commonly used advanced features quickly.

On top of that, migrating to a new solution means one of two things: a single large lift-and-shift migration of all existing SQL, or building everything new in the new solution while maintaining the legacy SQL until it is replaced. Both approaches are large organizational undertakings, and what ends up happening is that the legacy SQL, due to its complexity, is never migrated and continues running for years to come. Many major NoSQL vendors are now creating SQL-like layers on top of their solutions to ease the burden of migrating existing workloads and to attract more users.
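
Cassandra’s CQL is one illustration of this: it reads like familiar SQL even though the engine underneath is a distributed wide-column store, not a relational database. A minimal sketch (the table and column names here are made up for illustration):

```sql
-- CQL deliberately mirrors SQL syntax so data users keep their existing
-- mental model, even though this query hits a partitioned NoSQL store.
SELECT trip_id, rider_id, fare
FROM trips
WHERE city = 'San Francisco'  -- a partition-key lookup, not a relational scan
LIMIT 10;
```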

SQL is highly extensible

New solutions entering the market have the best intentions: making data access statements simpler to understand. However, this simplicity comes at the cost of expressiveness, and those solutions inevitably run into the same feature requests that SQL has already faced. When they try to address more advanced cases, such as window functions, the additions feel like an afterthought, making the once simple solution much harder to understand and less appealing.
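
Window functions are a good example of that expressiveness gap. The query below computes a rolling seven-day average in a single standard-SQL statement (the table and column names are illustrative); a deliberately minimal query language either can’t express this at all or bolts it on awkwardly later:

```sql
-- Rolling 7-day average of revenue per region, expressed declaratively
-- with a single window function.
SELECT
  region,
  day,
  revenue,
  AVG(revenue) OVER (
    PARTITION BY region   -- restart the window for each region
    ORDER BY day          -- define the ordering the frame moves over
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS revenue_7d_avg
FROM daily_revenue;
```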

But I really want to invent a new way of accessing data…

All of this is not to say that SQL is a data access panacea. There are languages and projects that are much better fits for data access, but only in very specific use cases. For example, Neo4j’s Cypher graph query language is purpose-built for querying and traversing graphs. Storing and traversing graph data in a relational database using SQL is particularly difficult, so having a custom solution for this makes a lot of sense, as the sketch below shows.
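
To see why, consider finding everyone within three hops of a given person in a friendship graph. In Cypher that is roughly a one-liner, `MATCH (a:Person)-[:KNOWS*1..3]->(b)`; in SQL it takes a recursive common table expression over an edge table. A sketch, assuming a hypothetical friendships(person_id, friend_id) table:

```sql
-- Everyone reachable within three hops of person 42.
WITH RECURSIVE reachable AS (
  -- base case: direct friends
  SELECT friend_id, 1 AS depth
  FROM friendships
  WHERE person_id = 42
  UNION ALL
  -- recursive case: friends of friends, up to depth 3
  SELECT f.friend_id, r.depth + 1
  FROM friendships f
  JOIN reachable r ON f.person_id = r.friend_id
  WHERE r.depth < 3
)
SELECT DISTINCT friend_id FROM reachable;
```

The SQL version works, but every extra hop or relationship type makes it harder to read and tune, which is exactly the gap a purpose-built language like Cypher fills.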

That said, creating a new way to access data will still require you to support all the other parts of the data ecosystem, such as extracting the data into a data warehouse, modeling it for analytical consumption, and exposing it to your data ops tooling.

A case study: Uber

Let’s look at this phenomenon in action. In 2015, Uber moved its trip storage from Postgres to a homegrown NoSQL data store. The solution was a great fit for the scale and extensibility needed on the transactional side of the business. However, for the data team, this store was quite painful. Having a clustered data store with 4,096 nodes, each of which had to be queried separately to ingest data into the warehouse, was a nightmare from a data integration perspective. Although the problem was later addressed, it was a strong example of how doing the right thing for engineers and transactional systems did not work as well when it flowed downstream to the data teams.

Conclusion

Data teams aren’t going to trade a universally used solution that is familiar and widespread for a totally new solution that works 10% better. The new solution needs to be drastically better across the vast majority of use cases, including the edge cases. A new data access language that does the same thing but “more efficiently” doesn’t provide enough value to make restarting the learning curve worth it.

In a perfect world, if someone invented a new SQL that were 10 times more performant across all use cases and encompassed all the functionality expressible in SQL, it would be worth considering. However, typing the same query in a different syntax for marginal gains is not useful for data consumers. It just adds complexity and cognitive overhead to an already complex workflow, and data consumers will never see the value in that.

That means, like it or not, for better or for worse, SQL is here to stay.

