Top data trends for 2023
In 2022, the data space matured by leaps and bounds. On the tooling side, we saw consolidation around analytic data warehouses like Snowflake and Redshift, SQL tools like dbt, and dashboarding tools like Looker and Mode. On the process side, we saw increasing focus on data quality, transparency, and governance. Here are nine top data trends we think we'll see in the coming year.
1. The year of DRE
Data Reliability Engineering (DRE) refers to creating standards, processes, alignment, and tooling to keep data and the applications built on it, like dashboards and ML models, reliable. It’s a term inspired by Google’s Site Reliability Engineering.
This work is done by data engineers, data scientists, and analytics engineers who historically have not had the mature tools and processes at their disposal that modern software engineering and DevOps teams already enjoy. So data reliability work today usually involves spot-checking data, kicking off late-night backfills, and hand-rolling some SQL-into-Grafana monitoring, rather than scalable and repeatable processes like systematic monitoring and incident management.
Under the name Data Reliability Engineering (DRE), some data teams are starting to change that by borrowing from SRE and DevOps.
2. Data teams will run like product teams
Most data teams today run like service organizations, similar to IT. A request or question ("Can I get last week’s revenue numbers?" "What percentage of our website traffic comes from Instagram versus TikTok?") goes in, and an answer comes out. This makes data teams reactive rather than proactive.
Most people aren't happy with this state of affairs. Data teams are perennially overwhelmed with one-off requests, while data consumers don’t trust the answers they get.
Instead of serving ad-hoc requests from other parts of the organization, data teams will move toward a different model. More and more, data teams are deciding to proactively build a data product that helps the business make better decisions.
The implications of this shift are considerable: data teams will have a clearer vision for their product and a deep understanding of their customers, and they will drive revenue and good business outcomes. They will staff up multidisciplinary teams with engineers, analysts, designers, and technical writers, and they should market the “features” they build, just as product teams do.
The data product is defined broadly. Technically, any information people use to make decisions can be part of the data product. It might include elements like:
- Every piece of data that flows between people, systems, and processes
- Every analysis the team produces and every tool that has some analytic capability
- Every spreadsheet
3. Stricter data governance and in-house governance builds
Data governance encompasses activities like discovering data assets, viewing lineage information, and providing general context around the status of data and tables in an organization.
With the ingestion and modeling of ever more data, data governance is becoming an acute problem. A number of big tech companies have built in-house solutions:
- LinkedIn: DataHub
- Lyft: Amundsen
- WeWork: Marquez
- Airbnb: Dataportal
- Spotify: Lexikon
- Netflix: Metacat
- Uber: Databook
It's likely that many more will follow suit. As these solutions become open-sourced and commercialized via startups, it will become standard to implement stricter, more thorough data governance in 2023.
4. Data contracts will proliferate
Data contracts are API-like agreements between data producers and data consumers. They help teams export high-quality data that is resilient to change.
In the data contract paradigm, instead of dumping data generated by production services into data warehouses, service owners decide which data to expose to consumers. Then they expose it in an agreed-upon, structured fashion, similar to an API endpoint. As a result, responsibility for data quality shifts from the data scientist and analyst to the software engineer.
For example, imagine a ride-share application. Production microservices write into the "rides", "payments", "customers", and "trip request" tables in the database. These schemas evolve as the business runs promos and expands into different markets.
If no action is taken, all of these production tables end up in a data warehouse. Subsequently, any machine learning engineer or data engineer consuming the analogous tables in the data warehouse must rewrite data transformations upon schema changes.
Data contracts change that paradigm. With data contracts, data analysts and scientists don’t consume near-raw tables in data warehouses. Instead, they consume from an API that has already munged the data and produced a human-readable event, like a “trip request.” The trip request metadata will be attached (pricing, yes/no surge pricing, promo, payment details, reviews). More teams will adopt data contracts as the more efficient way to consume data.
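To make the idea concrete, here is a minimal sketch of what a contract for the "trip request" event might look like in Python. The field names, the `TripRequest` class, and the `parse_trip_request` helper are all hypothetical and chosen for illustration; real contracts are often expressed in a schema language and enforced by the producing service.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class TripRequest:
    """Hypothetical contract for a 'trip request' event exposed to consumers."""
    trip_id: str
    customer_id: str
    requested_at: datetime
    price_usd: float
    surge_pricing: bool
    promo_code: Optional[str] = None  # optional field, absent when no promo applies


def parse_trip_request(raw: dict) -> TripRequest:
    """Validate a raw payload against the contract and fail loudly on violations."""
    required = ("trip_id", "customer_id", "requested_at", "price_usd", "surge_pricing")
    missing = [name for name in required if name not in raw]
    if missing:
        raise ValueError(f"contract violation, missing fields: {missing}")
    if not isinstance(raw["surge_pricing"], bool):
        raise ValueError("contract violation: surge_pricing must be a boolean")
    price = float(raw["price_usd"])
    if price < 0:
        raise ValueError("contract violation: price_usd must be non-negative")
    return TripRequest(
        trip_id=str(raw["trip_id"]),
        customer_id=str(raw["customer_id"]),
        requested_at=datetime.fromisoformat(raw["requested_at"]),
        price_usd=price,
        surge_pricing=raw["surge_pricing"],
        promo_code=raw.get("promo_code"),
    )
```

Because the producer validates before publishing, a schema change in the underlying production tables surfaces as a failed contract check on the producer side instead of as a broken transformation downstream.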
5. The proliferation of real-time infrastructure
Currently, most data infrastructure uses batch-based operations like polling and job scheduling, because the use-case is primarily analytics/dashboards.
Moving into 2023, companies will increasingly look at use cases that need streaming/real-time infrastructure, like process automation or operational decision-making.
Major data warehouses are already beginning to move in this direction. Snowflake has streams functionality, and BigQuery and Redshift both offer materialized views. There are also startups building in the space: Meroxa offers change data capture from relational data stores and webhooks, while Materialize is a Postgres-compatible data store that natively supports near-real-time materialized views.
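The difference between batch recomputation and an incrementally maintained view can be illustrated with a toy sketch. This is not how any of the products above are implemented; the `RevenueByCityView` class and the change-event shape (`op`, `city`, `amount`) are assumptions made up for the example.

```python
from collections import defaultdict


class RevenueByCityView:
    """Toy incrementally maintained view: total payment amount per city."""

    def __init__(self):
        self.totals = defaultdict(float)

    def apply(self, change: dict) -> None:
        """Apply one change event (hypothetical shape: op, city, amount)."""
        if change["op"] == "insert":
            self.totals[change["city"]] += change["amount"]
        elif change["op"] == "delete":
            self.totals[change["city"]] -= change["amount"]
        else:
            raise ValueError(f"unknown op: {change['op']}")


def recompute(rows: list) -> dict:
    """Batch equivalent: rescan every row each time the job runs."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["city"]] += row["amount"]
    return dict(totals)
```

The batch version does work proportional to the whole table on every run, while the incremental version does work proportional to each change, which is what makes near-real-time results practical.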
6. More continual learning in ML
Continual learning is the process of iterating on machine learning models after they are deployed to production. Production data is used to improve models and adapt to real-world changes.
While most machine learning models deployed today are retrained on an ad-hoc basis, continual learning either periodically retrains the models, or retrains them upon certain triggers (like performance degradation).
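As a rough sketch, the two retraining policies (periodic and trigger-based) can be combined in a single check. The function name, the accuracy metric, and the default thresholds below are illustrative assumptions, not recommendations.

```python
from datetime import datetime, timedelta


def should_retrain(
    last_trained: datetime,
    now: datetime,
    baseline_accuracy: float,
    recent_accuracy: float,
    max_age: timedelta = timedelta(days=7),  # assumed retraining period
    max_drop: float = 0.05,                  # assumed tolerated accuracy drop
) -> bool:
    """Return True if the schedule trigger or the degradation trigger fires."""
    too_old = now - last_trained >= max_age
    degraded = baseline_accuracy - recent_accuracy > max_drop
    return too_old or degraded
```

In practice, `recent_accuracy` would come from monitoring production outcomes, which is why continual learning depends on good observability.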
In 2023, continual learning is likely to increase, as machine learning adopts monitoring and observability best practices. In particular, there will be an increasing push to monitor not only tables in data warehouses (what Bigeye does) but also direct user outcomes and feedback, such as:
- Clicking on a recommendation
- Churning from the product
- Intervening with autopilot
- Flagging a generation as “offensive”
7. SaaS services will export data directly to databases
Currently, the "Extraction" in ETL is handled by middlemen services like Fivetran and Stitch. They extract data from SaaS app APIs (Salesforce, Shopify, LinkedIn, Zendesk) and load it into the data warehouse. However, some SaaS apps are now going direct, striking partnerships with data warehouses to deliver their service data.
The benefit is that SaaS apps will likely be more diligent about informing their data warehouse partners of API changes than they currently are with middlemen. Customers can look forward to fewer errors in data extraction, probably at a lower cost.
8. Data warehouses will grow beyond just SQL
While dbt has democratized the transformation step of ELT, letting data analysts write SQL to do transformations in data warehouses, there are some kinds of data processing for which SQL isn't ideal. For example, ML model training and more complex transformation logic can be easily handled with a simple Python package.
Towards that end, data warehouses will support more languages like Python directly in their processing engine. Snowflake, for example, recently announced Snowpark, an intuitive API that lets you build applications that process data in Snowflake without moving data to the system where your application code runs.
9. Teams will embrace T-shaped monitoring
T-shaped monitoring tracks fundamentals across all your data while applying deeper monitoring on the most critical datasets, such as those used for financial planning, machine learning models, and executive-level dashboards. This approach ensures you're covered against even the unknown unknowns.
T-shaped monitoring is a philosophy that helps teams avoid that perennial problem of observability: an influx of bad alerts. As data teams learn to better prioritize their monitoring and map it directly to business outcomes, T-shaped monitoring will be a handy tool in this strategic endeavor.
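One way to picture the "T" is as a monitoring plan: a small set of basic checks on every table (the horizontal bar) plus extra checks on the critical ones (the vertical bar). The check names, table names, and the `build_monitoring_plan` helper below are hypothetical, meant only to show the shape of the idea.

```python
# Assumed check names for illustration; real tools define their own metrics.
BASE_CHECKS = ["freshness", "row_count"]
DEEP_CHECKS = ["null_rate", "duplicate_rate", "value_distribution"]


def build_monitoring_plan(tables: list, critical: set) -> dict:
    """Give every table the base checks and critical tables the deep checks too."""
    plan = {}
    for table in tables:
        checks = list(BASE_CHECKS)
        if table in critical:
            checks += DEEP_CHECKS
        plan[table] = checks
    return plan


plan = build_monitoring_plan(
    tables=["events_raw", "finance_forecast", "marketing_clicks"],
    critical={"finance_forecast"},
)
```

Keeping the deep checks limited to the critical set is what keeps alert volume manageable while the base checks still cover everything else.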