Data Science teams have different levels of maturity in terms of their ways of working. In the worst case, every team member works as an individual. Results are poorly explained and impossible to reproduce. In the best case, teams reach full scientific reproducibility with simple conventions and little overhead. This leads to efficiency and confidence in results and minimal friction in productionising models. It is important to be able to measure a team’s maturity so that you can improve your ways of working and so you can attract and retain great talent. This series of questions is a Joel Test of Data Science Maturity. As with Joel’s original test for software development, all questions are a simple Yes/No and a score below 10 is cause for concern. Depressingly, many teams seem to struggle around a 3.
A Joel Test of Data Science Maturity
- Are results reproducible?
- Do you use source control?
- Do you create a data pipeline that you can rebuild with one command?
- Do you manage delivery to a schedule?
- Do you capture your objectives (scientific hypotheses)?
- Do you rebuild pipelines frequently?
- Do you track bugs in your models and your pipeline code?
- Do you analyse the robustness of your models?
- Do you translate model performance to commercial KPIs?
- Do new candidates write code at interview?
- Do you have access to scalable compute and storage?
- Can Data Scientists install libraries and packages without intervention by IT?
- Can Data Scientists deploy their models with minimal dependencies on engineering and infrastructure?
1. Are results reproducible?
A core aspect of traditional science is that results be reproducible. This is essential when building models that aim to improve our understanding of the world. It is no different for Data Science. And it turns out that reproducibility promotes efficiency. Teams no longer waste time wondering which data led to a particular result, which code led to a particular result and why results might have changed as understanding of the problem improved.
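As a rough illustration of what this can look like in practice, the sketch below stores a fingerprint of the input data and the random seed alongside each result, so that a number can later be traced back to the data and settings that produced it. The function names, the toy "analysis" and the sample data are all assumptions made for this example, not a prescribed method.

```python
import hashlib
import json
import random


def fingerprint(rows):
    """Return a SHA-256 hex digest of the input rows (order-sensitive)."""
    payload = json.dumps(rows).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_experiment(rows, seed):
    """Toy analysis (hypothetical): average a seeded random half of the rows."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    sample = rng.sample(rows, k=len(rows) // 2)
    result = sum(sample) / len(sample)
    # Keep everything needed to reproduce the number next to the number.
    return {"data_sha256": fingerprint(rows), "seed": seed, "result": result}


rows = [3, 1, 4, 1, 5, 9, 2, 6]  # made-up data
first = run_experiment(rows, seed=42)
second = run_experiment(rows, seed=42)
print(first == second)  # same data and same seed give the same record
```

If the data changes, the fingerprint changes with it, which answers the "which data led to this result?" question without guesswork.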
2. Do you use source control?
Building algorithms and data pipelines is complex. Source control lets you track changes to your code, roll back poor changes and try out new ideas without breaking working code.
3. Do you create a data pipeline that you can rebuild with one command?
A version-controlled data pipeline allows you to centralise and consolidate your understanding of the data (business and cleaning rules) and your definition of the features that feed into an algorithm. If you can rebuild this pipeline with one command then you can quickly iterate as your understanding of the problem evolves and as you inevitably discover issues with the data.
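A minimal sketch of the idea, assuming the pipeline is small enough to express as an ordered list of Python functions: each rule lives in one named step, and a single entry point rebuilds everything from the raw data. The step names, the cleaning rule and the feature definition are hypothetical.

```python
def clean(raw):
    """Cleaning rule (assumed): drop records with a missing amount."""
    return [r for r in raw if r["amount"] is not None]


def add_features(cleaned):
    """Feature definition (assumed): flag amounts over 100 as large."""
    return [{**r, "is_large": r["amount"] > 100} for r in cleaned]


# The pipeline is an ordered list of steps that lives in source control.
PIPELINE = [clean, add_features]


def build(raw):
    """Rebuild every stage from the raw data, in order, from scratch."""
    data = raw
    for step in PIPELINE:
        data = step(data)
    return data


raw = [
    {"id": 1, "amount": 250},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 40},
]
features = build(raw)
print(features)
```

Saved as a script, this runs with one command. When a new data issue turns up, the fix goes into the relevant step and the whole pipeline is rebuilt rather than patched by hand.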
4. Do you manage delivery to a schedule?
Data science needs a schedule to keep it focused. As projects are often open ended and exploratory, you need to have clear checkpoints where you can make a call that perhaps ‘this data is not fit for purpose’ or ‘there is no value in further iterations of model refinement’. Teams that do not deliver to any schedule tend to drift and let perfection become the enemy of done.
5. Do you capture your objectives?
Every data science problem is really an optimisation problem and you cannot optimise without an objective. Although it can sometimes feel painful or appear ‘picky’, it is essential that the objective of a project and a model are clearly defined. Increase profit? Increase volume? Increase both with some balance? Get clear and agree with your customer.
6. Do you rebuild pipelines often?
As with traditional software, rebuilding often can highlight integration bugs. In the context of data science, integration bugs are effectively problems in how data flows through a pipeline. If you do not rebuild often it is possible to introduce cyclic references into your data preparation, lose the logic for the creation of a feature and pick up other nasty bugs that cause you to lose that essential reproducibility.
7. Do you track bugs in your model and in your pipeline code?
Data science model development is complex. It has many dependencies. Customer feedback and domain knowledge are incredibly valuable. Make sure you are tracking feedback so mistakes are not repeated and so your models are always improving.
8. Do you analyse the robustness of your models?
No model will work in all scenarios and poorly performing models are dangerous. It is important to analyse and understand the conditions under which your model will work and under which it will degrade. This is robustness analysis. Are model outputs biased? Does a model require 6 months of training data or 2 weeks? Does a model only perform once it has seen 5 customer journeys? A mature data science team has confidence pushing its models into production because this type of testing has been done in advance.
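One of these questions, how much history a model needs, can be explored with a simple sweep. The sketch below is an illustration under stated assumptions: the "model" is a naive mean forecast, the series is synthetic, and the function names are hypothetical. A real analysis would swap in the actual model and data.

```python
def mean_forecast(history):
    """Stand-in model (assumed): predict the mean of the history."""
    return sum(history) / len(history)


def error_by_training_window(series, windows):
    """Mean absolute error of the forecast for each training window size.

    Every window size is scored on the same points so results are comparable.
    """
    start = max(windows)
    report = {}
    for window in windows:
        errors = [
            abs(series[i] - mean_forecast(series[i - window:i]))
            for i in range(start, len(series))
        ]
        report[window] = sum(errors) / len(errors)
    return report


series = list(range(20))  # synthetic data with a steady upward trend
report = error_by_training_window(series, windows=[1, 2, 5])
print(report)
```

On this synthetic trend the shortest window happens to give the lowest error. The point is the shape of the analysis: vary one condition, hold the evaluation fixed, and record where performance degrades.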
9. Do you translate model performance to commercial KPIs?
Technical performance metrics are important for you as a technical data scientist. However, to get business buy-in and adoption of your models you need to be able to make your models commercially relevant. That means turning predictions into revenue or cost savings or time savings or whatever the business cares about and whatever will justify further funding of your work.
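As a hedged example of such a translation, the sketch below turns the confusion counts of a classifier into a single monetary figure. The scenario (a customer retention campaign), the counts and every monetary value are made up for illustration. In practice the values would need to be agreed with the business.

```python
def net_value(tp, fp, fn, value_per_tp, cost_per_fp, cost_per_fn):
    """Convert confusion counts into money.

    tp: correctly flagged cases, each worth value_per_tp
    fp: wrongly flagged cases, each costing cost_per_fp (wasted effort)
    fn: missed cases, each costing cost_per_fn (lost opportunity)
    """
    return tp * value_per_tp - fp * cost_per_fp - fn * cost_per_fn


# Hypothetical figures: 80 customers retained, 40 offers wasted, 20 missed.
campaign_value = net_value(
    tp=80, fp=40, fn=20,
    value_per_tp=50, cost_per_fp=5, cost_per_fn=50,
)
print(campaign_value)  # 80*50 - 40*5 - 20*50 = 2800
```

Two models with similar accuracy can produce very different figures here, because false positives and false negatives rarely cost the same.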
10. Do new candidates write code at interview?
Data science is full of hype, bluffers and analytics rebranding itself. You want to filter down to the great candidates who understand the scientific method and can apply it to select and tune models. A technical test that involves using data and writing code is the most effective way to do this.
11. Do you have access to scalable compute and storage?
The complex combination of technologies needed for Data Science often means that organisations struggle to equip their teams with the best technology to do their jobs. If your team does not have access to scalable compute and storage then their success will always be limited. Lack of a central place to store data and workings is a warning sign that Data Science is not taken seriously in an organisation.
12. Can data scientists install libraries and packages without intervention by IT?
If there is one word that summarises the requirements of Data Science it is ‘flexibility’. The nature of the work involves selecting models and tuning them against data. This means being able to quickly install and evaluate lots of model libraries. If a Data Science team needs approval for every library installation and upgrade then its speed of turnaround is going to slow from days to weeks and months.
13. Can Data Scientists deploy their models with minimal dependencies on engineering and infrastructure?
If models cannot be put into use they are of little value beyond curiosities. But deploying a model involves training on reproducible data, monitoring of decisions and performance, and A/B testing of new releases. Delays in deployment mean models go out of date or competitive advantage is lost. The best organisations have platforms that allow model deployment to happen quickly, driven by Data Scientists.
So how do you score a 13/13?
How would your team score on a Joel Test of Data Science Maturity? This is where Guerrilla Analytics can help. Guerrilla Analytics provides guiding principles and conventions for promoting data provenance and reproducibility in Data Science and Analytics work. There are guidelines on how to structure projects at every stage of the life cycle and how to consolidate knowledge in flexible data pipelines. You will also learn how to leverage techniques and tools from software engineering such as testing and source control.