5 ways data scientists and ML engineers advance their careers and help hit business targets
5 ways your team’s progress, and your career, are held back by bad data quality, and what you can do to address it.
It’s the ultimate rug pull – you’re recruited for a position building cutting-edge machine learning models, but end up spending 80% of your time cleaning data or dealing with data that’s full of missing values and outliers, has a frequently changing schema, doesn’t always load on time… in short: junk! These sorts of gaps between expectation and reality are commonplace for data scientists and machine learning engineers.
Data scientists enter a role excited to tackle insights and advanced models, but instead get to deal with schema changes, tables that stop updating, and other surprises that silently break their dashboards and models.
“Data science” can mean a wide range of things, from product analytics to putting statistical models in production. In most cases, data scientists (and machine learning engineers) sit at the tail end of the data pipeline. They’re data consumers, pulling data from centralized sources like data warehouses or S3. They analyze data to help make business decisions, or use it as training inputs for machine learning models.
So data quality issues impact them, but often they aren’t empowered to fix them. In the meantime, they either have to write a ton of defensive data preprocessing into their work, or just find another project to work on.
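That defensive preprocessing usually looks something like the sketch below: fail fast on schema drift, and filter nulls and outliers before they skew anything downstream. This is a minimal pandas-based example; the column names and thresholds are hypothetical.

```python
import pandas as pd

def defensive_load(path: str) -> pd.DataFrame:
    """Load event data, guarding against common upstream quality issues."""
    df = pd.read_csv(path)

    # Guard against schema drift: fail fast if expected columns are missing.
    expected = {"user_id", "event_ts", "revenue"}
    missing = expected - set(df.columns)
    if missing:
        raise ValueError(f"Schema changed upstream; missing columns: {missing}")

    # Drop rows with null keys, and filter obvious outliers, rather than
    # letting them silently skew downstream aggregates.
    df = df.dropna(subset=["user_id", "event_ts"])
    df = df[df["revenue"].between(0, 1_000_000)]
    return df
```

The point isn’t that this code is hard to write; it’s that every consumer downstream ends up rewriting some version of it, over and over.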
Don’t just throw your hands up and complain that the data engineering upstream isn’t good and the data is forever busted. Do some science! If you put out analytics tools that your team or your executives rely on—you’re the last step in the pipe and that means you’re responsible. If you put a model into production—you’re responsible.
Let’s talk about 5 ways your team’s progress, and your career, are being held back if you’re avoiding stepping up to the plate—and how you CAN step up to the plate, and make sure that even if you didn’t create a data quality problem, you can prevent it from impacting the people who rely on you.
1. Earn trust with better data quality testing and monitoring
Many business executives are still somewhat hesitant to make key decisions based on data alone. A report from KPMG showed that 60% of companies don't feel very confident in their data, and 49% had leadership teams who didn't fully support the company's data and analytics strategy.
That means if you can increase data’s accuracy and get it into dashboards that help key decision-makers, you’ll have a direct positive impact on your organization. But manually checking data for quality issues is error prone AND a huge drag on your velocity.
Data quality testing (e.g. with dbt tests) and data observability (e.g. how Impira uses Bigeye) help ensure you find out about quality issues before your stakeholders do, winning their trust in you (and the data) over time.
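If dbt or an observability tool isn’t in your stack yet, even a lightweight check run on a schedule beats manual inspection. Here’s a minimal sketch of completeness, uniqueness, and freshness checks; the table schema and thresholds are hypothetical.

```python
import pandas as pd

def run_quality_checks(df: pd.DataFrame) -> list[str]:
    """Return a list of failed checks instead of letting bad data flow on."""
    failures = []

    # Completeness: key columns should never be null.
    if df["order_id"].isna().any():
        failures.append("order_id contains nulls")

    # Uniqueness: primary keys must not repeat.
    if df["order_id"].duplicated().any():
        failures.append("order_id contains duplicates")

    # Freshness: the newest row should be recent.
    latest = pd.to_datetime(df["created_at"]).max()
    if latest < pd.Timestamp.now() - pd.Timedelta(days=1):
        failures.append(f"table is stale; newest row is {latest}")

    return failures
```

Wire the failure list into an alert channel, and you’ll hear about problems hours before a stakeholder opens a broken dashboard.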
2. Avoid the blame game with data SLAs
Data quality problems can easily lead to an annoying blame game between the data science, data engineering, and software engineering teams: who broke the data, whose responsibility was it to know about the problem, and who’s supposed to be fixing it?
But when your stakeholders spot issues with the data, they don’t care whose fault it was. They just want the data to work so they can push the business forward.
To ensure accountability for each step in the data pipeline, you can put Service Level Agreements (SLAs) in place. These define data quality in quantifiable terms and assign responders who spring into action when problems appear, avoiding the blame game.
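An SLA only works if it’s quantifiable, so it helps to encode it directly as code. A sketch of that idea, with hypothetical thresholds and responder name:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class DataSLA:
    """A quantifiable data quality agreement with a named responder."""
    table: str
    max_staleness: timedelta   # how old the newest row may be
    max_null_rate: float       # allowed fraction of null keys
    responder: str             # who springs into action on a breach

    def breaches(self, last_loaded: datetime, null_rate: float) -> list[str]:
        """Compare observed metrics against the agreed thresholds."""
        out = []
        if datetime.utcnow() - last_loaded > self.max_staleness:
            out.append(f"{self.table} is stale; notify {self.responder}")
        if null_rate > self.max_null_rate:
            out.append(f"{self.table} null rate {null_rate:.1%} exceeds SLA; notify {self.responder}")
        return out
```

Because the thresholds and the responder live in the same object, a breach is never ambiguous: the check tells you both what broke and who owns the fix.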
3. Enable more experimentation and faster analysis
We just addressed how trust erodes when your stakeholders catch mistakes, and the blame game ensues. What about when they don’t catch quality issues? In those cases, everyone loses. Either the decision is bad, or the model performs poorly, but either way, the business loses.
For example, duplicate entries in the data can invalidate results from experiments. Suppose a single entity, New York City, is being logged as both “New York City” and “NYC” in the database. When it comes to testing a new feature, everyone in “New York City” is shown variation A, and everyone in “NYC” is shown variation B. At the end of the day, what can you actually conclude about users in New York City? Not only have you reduced the statistical power of the test, the groups may not even be correctly randomized in the first place.
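One low-effort guard is to normalize entity aliases to a canonical name before users are bucketed into variants. A sketch of that step, with an illustrative hand-maintained alias map:

```python
# Map known aliases to a canonical entity name before bucketing users
# into experiment variants. The alias list here is illustrative; in
# practice it would come from your entity-resolution process.
CITY_ALIASES = {
    "NYC": "New York City",
    "New York": "New York City",
    "SF": "San Francisco",
}

def canonical_city(raw: str) -> str:
    """Return the canonical form of a city name, stripping whitespace."""
    cleaned = raw.strip()
    return CITY_ALIASES.get(cleaned, cleaned)
```

With aliases collapsed, randomization operates on one population per city instead of two, and the A/B split stays a real split.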
Instead, clear the path for better experimentation and analysis by laying a foundation for higher-quality data. Once the data is reliable, the experiment results can be trusted, and the team can focus on what to test next, instead of whether the results are correct.
4. Be the point person for data quality and ownership
Confusion surrounding your own work is a drag on your sense of professionalism. Let’s face it: if you don’t really trust that your work is high quality and reliable, you’re going to silently carry that on your shoulders all the time.
Instead, carve out a niche as the person responsible for data quality and data ownership. Help define quality and assign responsibility for fixing different issues. You’ll reduce friction between the data science and data engineering teams, and even with software engineering further upstream.
By leading the charge to define a data quality strategy and reduce confusion, you’ll positively impact almost every other team within your organization, earning you the appreciation of your teammates for reducing a headache that affects nearly everyone.
5. Reduce wasted data storage and compute costs
Incomplete or unreliable data can leave terabytes of junk taking up storage space in your data warehouse and getting included in queries that incur compute costs. This low-quality data is a drag on your infrastructure bill over time, since it has to be filtered out of analyses over and over again.
Identifying problematic data—especially in pipelines that are heavily used for product analytics or machine learning—can lead to a high-value fix-it list. Recollecting, reprocessing, or even just imputing and cleaning the existing values can reduce the storage and compute costs your team is wasting on this dirty data.
As you identify and clean up these tables, keep track of the ranked list, how much data is getting cleaned up, and how many queries are run on those tables. Later, you can report to your team how many queries aren’t being run on junk data anymore, and how many gigs of storage are now being put to better use.
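One way to build and track that ranked list is to weight each dirty table by how often it’s queried and how much junk it stores. A sketch over hypothetical warehouse metadata (in practice these numbers would come from your warehouse’s query logs and table stats):

```python
def rank_fixit_list(tables: list[dict]) -> list[dict]:
    """Rank dirty tables by wasted value: queries/day x GB x junk fraction.

    Each entry is expected to look like:
    {"name": ..., "queries_per_day": ..., "size_gb": ..., "junk_fraction": ...}
    """
    for t in tables:
        t["wasted_score"] = t["queries_per_day"] * t["size_gb"] * t["junk_fraction"]
    return sorted(tables, key=lambda t: t["wasted_score"], reverse=True)
```

A heavily queried mid-size table with 20% junk rows will usually outrank a huge but rarely touched one, which is exactly the prioritization you want when cleanup time is scarce.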
No matter whether you’re just starting your career or are a seasoned professional, as a data scientist or ML engineer you’ve got the opportunity to make yourself an indispensable part of your organization through more reliable data.
Product analytics techniques and machine learning algorithms are becoming more and more commoditized, but the input data is not—it’s always business specific and somehow unique. Even the best-designed analytics tools and ML models don’t provide value if they rely on erroneous or incomplete data. That limits the impact of data science for the entire organization.
These five steps can each help you create more impact at your organization, just by improving the data your work (and every other data scientist on your team) ultimately depends on.
Join Bigeye for a demo to see how we can help monitor for data quality issues and let machine learning engineers and data scientists get back to the work they were hired to do.