A/B tests! Machine learning! Deep learning! It’s easy to be distracted by new libraries and beautiful visualizations. It’s easy to waste time with scattergun approaches to data and algorithms. It’s easy to forget that as a data scientist you should be taking a scientific approach to understanding data.
Is scientific rigour appropriate in industry or should it be confined to academic research? Is it not better to design an experiment instead of diving into the data? And should you care about reproducibility or move quickly to the next project?
A definition of Science
Here is a definition of ‘science’ followed by what that means for data science in business.
‘the systematic study of the structure and behaviour of the… world through observation and experiment’ (Oxford English Dictionary)
This ‘systematic’ scientific method generally involves the following steps.
- Formulate a question
- Formulate a hypothesis
- Make a prediction
- Test the prediction with an experiment
- Analyse results
So how should data science apply this scientific method in business?
The Rigour of Science is Essential for Successful Data Science in Business
Here are the steps in the scientific method and a data science example.
The Business Objective
- Formulation of a question. This is the most challenging and most important step, both for science in general and for data science in business. The question can be closed, like ‘why are our sales decreasing?’, or open-ended, like ‘can we solve problem X?’
- Example: our buying and marketing teams might engage data science to ask ‘can we predict what customers want to buy?’
- Success in business: If you are not clear on your objective, your project is heading for trouble. Without a well-formulated question, data projects become mired in complexity, go off course and fail.
The Business Case
- Formulation of a hypothesis (conjecture) about a population. This is a testable conjecture that challenges the status quo (in science jargon, the status quo is the null hypothesis).
- Example: a hypothesis might be that a logistic regression applied to established customers (the population) will make better predictions than chance (or the current algorithm in use).
- Success in business: this is your business case. This is the outcome that would cause the business to change the way they work.
- Prediction. This is the logical consequence of the hypothesis. The less likely it is that the predicted outcome could occur by coincidence, the stronger the case for rejecting the status quo when it does occur.
- Example: we could predict an improvement in the number of correctly predicted purchases due to the new algorithm.
- Success in business: this is an extension of the business case. If the business case were successful, then this is what we predict will happen.
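To make the 'unlikely due to coincidence' idea concrete, here is a minimal sketch that asks how often chance alone would produce at least as many correct predictions as we observed. It assumes each prediction is an independent hit or miss with a fixed chance success rate. The function name `chance_tail_probability` and all the numbers are hypothetical, chosen only for illustration.

```python
from math import comb


def chance_tail_probability(n_trials: int, n_correct: int, p_chance: float) -> float:
    """Return P(X >= n_correct) for X ~ Binomial(n_trials, p_chance)."""
    if not 0 <= n_correct <= n_trials:
        raise ValueError("n_correct must be between 0 and n_trials")
    if not 0.0 <= p_chance <= 1.0:
        raise ValueError("p_chance must be a probability")
    # Sum the binomial probabilities of every outcome at least as good as observed.
    return sum(
        comb(n_trials, k) * p_chance**k * (1.0 - p_chance) ** (n_trials - k)
        for k in range(n_correct, n_trials + 1)
    )


if __name__ == "__main__":
    # Hypothetical numbers for illustration only: 100 predictions,
    # 60 were correct, and chance alone is assumed to be right half the time.
    p = chance_tail_probability(n_trials=100, n_correct=60, p_chance=0.5)
    print(f"Probability of 60 or more correct predictions by chance: {p:.4f}")
```

A small probability here means coincidence is a poor explanation for the result, which is what strengthens the business case.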
Evaluating the Business Case
- Testing with experiment. Here is where the prediction is tested in the real world. Experiment design is a field in its own right that I will blog about separately. For our customer algorithm, we might put it live on a website or use some other means to get new predictions in front of real-world customers.
- Success in business: this is where we rigorously evaluate the business case. Experiment design is critical to counteract biases, random chance and external influences that we cannot control.
- Analysing experiment results. This is where the data gathered from the experiment is analysed to determine if the status quo should be rejected. In the example, we would look for a significant difference in customer predictions using the new algorithm instead of the incumbent approach.
- Success in business: this is where we rigorously decide whether the business case will be a success. Because we are leveraging all the previous steps, our conclusion to reject the status quo and change the business can be made with confidence.
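One common way to look for a 'significant difference' between the incumbent and the new algorithm is a two-proportion z-test. The sketch below is one possible analysis, not a prescribed one. It assumes each customer saw exactly one algorithm, that outcomes are independent, and that samples are large enough for the normal approximation. The function name `two_proportion_z_test` and the counts are hypothetical.

```python
from math import erf, sqrt


def two_proportion_z_test(success_a: int, total_a: int,
                          success_b: int, total_b: int) -> tuple[float, float]:
    """One-sided test of whether group B's success rate exceeds group A's.

    Returns (z statistic, p-value), where the p-value is P(Z >= z)
    under the null hypothesis that both groups share one success rate.
    """
    p_a = success_a / total_a
    p_b = success_b / total_b
    # Pooled rate: the best estimate of the shared rate if the status quo holds.
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = sqrt(pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b))
    if std_err == 0.0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / std_err
    # Upper tail of the standard normal distribution, via the error function.
    p_value = 0.5 * (1.0 - erf(z / sqrt(2.0)))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts for illustration only:
    # A = incumbent approach, B = new algorithm.
    z, p_value = two_proportion_z_test(success_a=40, total_a=100,
                                       success_b=60, total_b=100)
    print(f"z = {z:.2f}, one-sided p-value = {p_value:.4f}")
```

The significance threshold should be fixed before the experiment runs, so that it is part of the experiment design and not tuned to the result.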
Rigour of Science drives good data science practices
In addition, the following scientific principles should be adhered to:
- Repeatability. It should be possible to run an experiment again and get the same results and conclusions.
- Success in business: this is how you have confidence that your results are generally applicable to the business. You can take that algorithm out of the lab and run it in production.
- Reproducibility. It should be possible for somebody else to independently follow your experiment steps and get the same results and conclusions.
- Success in business: many businesses are seasonal. Teams change, projects pause and are restarted. Reproducibility allows other teams to successfully inherit your work and use it with confidence.
- Avoidance of bias. This can be human bias due to our inherent subjectivity but can also inadvertently be introduced through poor experiment design.
- Success in business: This is how you stop those great results being extrapolated and interpolated in ways you never intended.
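As a small illustration of these principles working together, here is a sketch of assigning customers to control and treatment groups at random, using a recorded seed. Random assignment guards against a person hand-picking who sees the new algorithm. The recorded seed lets you, or another team, regenerate exactly the same split later. The function name `assign_groups`, the customer identifiers and the seed value are all hypothetical.

```python
import random


def assign_groups(customer_ids: list[str], seed: int) -> dict[str, list[str]]:
    """Randomly split customers into control and treatment groups.

    The same customers and the same seed give the same split,
    regardless of the order in which the customers are supplied.
    """
    # Sort first so that the input order cannot influence the result.
    shuffled = sorted(customer_ids)
    # A dedicated generator avoids depending on global random state.
    rng = random.Random(seed)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"control": shuffled[:half], "treatment": shuffled[half:]}


if __name__ == "__main__":
    # Hypothetical customer identifiers and seed, for illustration only.
    customers = [f"customer_{i}" for i in range(10)]
    groups = assign_groups(customers, seed=2024)
    print(groups["control"])
    print(groups["treatment"])
```

Storing the seed alongside the code and the results is what makes the split recoverable. The exact shuffle may still differ between Python versions, so it is worth recording the version too.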
You can read more about how to bring scientific rigour to your data science in my book Guerrilla Analytics (UK) (USA). It contains simple principles for maintaining reproducibility and repeatability while delivering at pace.