Data Science is a varied mix of activities. It typically includes database design, algorithm coding, interactive analytics (like with iPython) and visualization coding as well as reporting and sharing of results. All of this is highly iterative because of the exploratory nature of Data Science where requirements and data change often.
Without some kind of control, Data Science projects quickly become impossible to manage. Code is fragmented across the languages and technologies involved. Key numbers and analyses cannot be reproduced. This is exacerbated when a team grows and a project is running at pace. The team drowns in the complexity of its own creations.
Data Science is ‘defensive’ if it can withstand the disruptions of changing data and requirements while still producing repeatable, explainable insights. Put another way, Defensive Data Science maintains data provenance. Fortunately, the Guerrilla Analytics Principles make it easy to do defensive Data Science . This blog post describes how.
Why Do Defensive Data Science?
There are many reasons to strive for ‘Defensive Data Science’. When I have run teams of up to 12 data scientists for very demanding stakeholders, maintaining data provenance has increased team efficiency and made Data Science easier to manage and maintain. Advantages include:
- Reduction in time wasted in tracking data sources, data modifications and analysis outputs. Without some kind of data provenance in place, a team wastes time trying to find out where data came from, how data was modified by continuously evolving code and which of many versions of analysis were delivered to a customer.
- Fewer errors. If you have data provenance in place, it is much easier to track everything that a team is doing. You know which version of your code, which version of your data, which version of your 3rd party libraries and which version of your business understanding came together to make any number that your team produced. You can stand over your analyses with confidence.
- Easier sharing of work. If all the inputs and outputs of your Data Science team’s work are easy to identify then sharing of work, collaboration, handovers and on-boarding of new team members are less of a toll on the team.
How should a team do Defensive Data Science?
Fortunately, many of the challenges of defensive Data Science have already been addressed in Software Engineering. There are decades tools and techniques that can be easily adapted to the needs of defensive Data Science. Follow these steps when moving your work into a more defensive setup.
- Semantic project structures. Firstly, get your project structure right. By following simple conventions on project structure, it is easy for a team to know where key project artefacts are located with minimal overhead of documentation. The team know where to put stuff and don’t spend time trying to figure this out. Some of the most important artefacts are incoming data and its versions, code and its versions, 3rd party libraries and their versions and team outputs and their versions. Guerrilla Analytics advocates simple flat project structures, giving a unique ID to incoming data and outgoing work products.
- Clear data steps. Popular Data Science languages such as SQL and Python are quite free form, permitting many routes to solving a given problem. This is their strength but it is also a difficulty when it comes to managing a team solving a diverse range of problems. A given problem can be solved in several different ways. You do not want to discourage this exploratory behaviour. However, by breaking down data extraction, transformation and reshaping into clearly defined modular ‘data steps’ it becomes much easier to automate, test and modify data flows. New steps can be easily introduced, irrelevant steps removed and ‘component’ tests introduced where needed. You can have the best of both worlds – exploration and reproducibility.
- Automation with pipelines. With an easily understood project structure in place and clear modular code, the next most important aspect of defensive Data Science is probably automation. Automation with pipelines means using tools (custom or 3rd party) that facilitate executing code ‘in a single click’. Since Data Science is highly iterative and also complex, working without some form of automation risks not discovering stale data bugs and broken data flows. Having easy automation available encourages ‘continuous integration’ behaviours, running all code and tests often.
- Version control. While you can survive without version control tools, the ability to track all changes by a team member, develop code ‘branches’ in parallel, tag released code and roll back to earlier versions of code is essential for a team that needs to adapt to unforeseen changes in data and requirements. Version control for Data Science has some differences to version control for software development .
- Documentation and workflow tracking. The previous steps go a long way towards enabling a team to produce defensive Data Science. While the overhead of documentation must be kept to a minimum in a very dynamic environment, there are some basic activities that benefit from documentation. Having a wiki makes it easy to version control changes to evolving team knowledge. Typical uses of a wiki for defensive Data Science include documenting data dictionaries and business rules. Furthermore, a simple workflow tracker for keeping tabs on what the team is doing and logging received data will make it much easier to maintain data provenance across the team’s activities.
Defensive Data Science need not be an administrative burden on a team. There are many benefits and a few simple principles go a long way to making a team easier to manage and more adaptable to dynamic project environments.
Do you have anything you wish to add? Please get in touch or have a look at Guerrilla Analytics. I’d be happy to discuss.
 Defensive Data Science: What we can learn from software engineers, http://www.predictiveanalyticsworld.com/patimes/defensive-data-science-what-we-can-learn-from-software-engineers0811153/
 Guerrilla Analytics: A Practical Approach to Working with Data, Enda Ridge 2014