Why everyone tries to reinvent SQL (and why it never works)
Inventing a better SQL is easier said than done, even though many have tried. Here's why, in our opinion, SQL is here to stay.
“It’s the new Excel” is as ubiquitous as “The Uber for (fill in the blank).” However, as soon as something is pronounced “the new Excel,” those words tend to be famous last ones. As much as people try to reinvent Excel, things always eventually end up in an Excel spreadsheet. Between the fact that every business person is familiar with Excel and the extensibility that it provides for power users, Excel can support any use case you can think of while also being easily accessible to non-technical users.
The same goes for SQL. SQL has been the lingua franca for data access for decades, embedding itself in any major data users’ mind and process. There’s business logic encoded in SQL, often so convoluted that no one even knows how it works, that no one will want to migrate to anything else. Many people think they can create a more efficient way of accessing data than SQL, but they are glossing over a few major points that make SQL as sticky and irreplaceable as it is.
SQL has been around for a long time
The first advantage SQL has is that of a long life. This has allowed it to be present in some way across every data system, with more SQL present in legacy systems (those created before any major alternative existed). On top of that, SQL has had decades worth of iterations, which has made it more robust and expressive over time, with most dialects adopting commonly used advanced features very quickly.
On top of that, migrating to a new solution means one of two things - one large lift-and-shift migration of all existing SQL or building anything new in the new solution while maintaining the legacy SQL solutions until they are replaced. Both of these approaches are large organizational undertakings, and what ends up happening is the legacy SQL, due to it’s complexity, is never migrated and will continue running for years to come. Many major NoSQL vendors are starting to create SQL-like layers on top of their solution to ease the burden of migrating existing solutions and to attract more users.
SQL is highly extensible
New solutions entering the market have the best intentions: making simpler to understand statements to access your data. However, this simplicity comes at the cost of expressibility, and those solutions will inevitably run into the same feature requests that SQL has in the past. When they try to address more advanced cases, such as window functions, the addition of these features feels like an afterthought, making the once simple solution much more difficult to understand and less appealing.
But I really want to invent a new way of accessing data…
All of this is not to say that SQL is a data access panacea. There are languages and projects that are much better fits for data access, but in very specific use cases. For example, Neo4j’s Cypher graph query language is purpose-fit for querying and traversing graphs. Storing and traversing graph data in a relational database using SQL is particularly difficult, so having a custom solution for this makes a lot of sense.
That said, creating a new way to access data will still require you to support all the other parts of the data ecosystem, such as extracting the data into a data warehouse, modeling it for analytical consumption, and exposing it to your data ops tooling.
A case study: Uber
Let’s look at this phenomenon in action. In 2015, Uber moved its trip storage from Postgres to a homegrown NoSQL data store. The solution was a great fit for the scale and extensibility that is needed for the transactional side of the business. However, for the data team, this store was quite painful. Having a clustered data store with 4,096 nodes that each had to be queried separately to ingest data into the warehouse was a nightmare from a data integration perspective. Although the problem was later addressed, it was a strong example of how doing the right thing for engineers and transactional systems did not work as well when it flowed downstream to the data teams.
Data teams aren’t going to trade a universally-used solution that is familiar and widespread, for a totally new solution that works 10% better. The solution needs to be drastically better across the vast majority of use cases, including any edge cases. A new data access language that does the same thing but “more efficiently” doesn’t provide enough value to make restarting the learning curve worth it.
In a perfect world, if someone invented a new SQL that is 10 times more performant across all use cases and encompasses all the functionality that is expressible in SQL, it would be worth considering. However typing the same query into a different syntax for marginal gains is not useful for data consumers. It will just add to the complexity and cognitive overhead to an already complex workflow and data consumers will never see the value in that.
That means, like it or not, for better or for worse, SQL is here to stay.
Schema change detection