The danger of bias hasn’t been given enough consideration in Data Science. Bias is anything that causes us to skew our conclusions and stop treating results and evidence objectively. It is sometimes unavoidable, sometimes accidental and, unfortunately, sometimes deliberate. While bias is well recognised as a danger in mainstream science, Data Science could benefit from the same vigilance.
In this post I categorise the types of bias encountered in typical Data Science work. I have gathered these from recent blog posts [1–3] and a discussion in my PhD thesis [4]. I also show how to reduce bias using some of the principles you can learn about in Guerrilla Analytics: A Practical Approach to Working with Data.
8 Types of Bias in Data Science
The first step is to be aware of the types of bias you may encounter.
- Confirmation bias. People are less critical of Data Science that supports their prior beliefs than of Data Science that challenges them.
- This happens when results that go against the grain are rejected in favour of results that promote ‘business as usual’. Was the latest quarterly marketing campaign really successful across the board or just for one part of the division?
- Rescue bias. This bias involves selectively finding faults in an experiment that contradicts expectations. It is generally a deliberate attempt to evade and undermine evidence.
- You may fall for this bias when your project results are disappointing. Perhaps your algorithm can’t classify well enough, or the data is too sparse. The temptation is to imply that results would have been different had the experiment been different. This is effectively drawing conclusions without data and without experiments.
- ‘Time will tell’ bias. Taking time to gather more evidence should increase our confidence in a result. This bias skews the amount of evidence deemed necessary before a result is accepted.
- You may encounter this bias when a project is under pressure to plough ahead rather than waiting for more data and more confident Data Science. Should you draw conclusions based on one store or wait until you have more data from a wide variety of stores and several seasons?
- Orientation bias. This is the tendency for experimental and recording errors to fall in the direction that supports the hypothesis.
- You may encounter this bias when your work is needed to support a business decision that has already been made. This arises in the pharmaceutical industry, for example, where trials tend to favour the new drug being tested.
- Cognitive bias. This is the tendency to make skewed decisions based on pre-existing factors rather than on the data and other hard evidence.
- This might be encountered where the Data Scientist has to argue against a ‘hunch’ from ‘experience’ that is not supported by hard data.
- Selection bias. This is the tendency to skew your choice of data sources to those that may be most available, convenient and cost-effective for your purposes.
- You will encounter this bias when you have to ‘demonstrate value’ on a project that has not been properly planned. The temptation is to make ‘best endeavours’ use of whatever data happens to be available.
- Sampling bias. This is the tendency to skew the sampling of data sets towards subgroups of the population.
- An oft-quoted example is the use of Twitter data to make broad inferences about the general population. It turns out that the sample of Twitter users is biased towards certain locations, income levels and education levels.
- Modelling bias. This is the tendency to skew Data Science models by starting with a biased set of assumptions about the problem. This leads to selection of the wrong variables, the wrong data, the wrong algorithms and the wrong metrics.
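The effect of sampling bias in particular is easy to simulate. In this minimal sketch, the subgroups, sizes and spend figures are invented purely for illustration: a sample drawn only from the convenient, over-represented subgroup overestimates the population average.

```python
import random

random.seed(42)

# Hypothetical population: two customer subgroups with different average spend.
urban = [random.gauss(60, 10) for _ in range(8000)]  # higher spend, easy to reach
rural = [random.gauss(40, 10) for _ in range(2000)]  # lower spend, harder to reach
population = urban + rural

true_mean = sum(population) / len(population)

# Biased sample: drawn only from the easy-to-reach 'urban' subgroup.
biased_sample = random.sample(urban, 500)
biased_mean = sum(biased_sample) / len(biased_sample)

# Representative sample: drawn from the whole population.
fair_sample = random.sample(population, 500)
fair_mean = sum(fair_sample) / len(fair_sample)

print(f"true mean:   {true_mean:.1f}")
print(f"biased mean: {biased_mean:.1f}")  # overshoots the true mean
print(f"fair mean:   {fair_mean:.1f}")
```

The biased estimate lands near the urban subgroup's average rather than the population's, and no amount of extra sampling from the same convenient source will correct it.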
So what can you do to counter these biases in your work?
The first step is awareness and hopefully the above list will help you and your colleagues. If you know about bias, you can remain alert to it in your own work and that of others. Be critical and always challenge assumptions and designs.
The next best thing is to do what scientists do and make your work as reproducible and transparent as possible.
- Track your data sources and profile your raw data as much as possible. Look at direct metrics from your data such as distributions and ranges, but also look at the qualitative information about the data. Where did it come from? How representative is it?
- Make sure your data transformations and their influence on your populations can be clearly summarised. Are you filtering data? Why, and with what effect? How are you calculating your variables, and have you evaluated alternatives? Where is the evidence for your decisions?
- Track all your work products and data understanding as they evolve with the project. This allows you to look back at the exploration routes you discarded or didn’t have time to pursue.
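The first two points above can be sketched in a few lines of Python. All column names, values and filters here are invented for illustration: profile the raw data on arrival, then make each filter's effect on the population explicit so nothing is silently dropped.

```python
from collections import Counter

# Hypothetical raw data: store transactions.
rows = [
    {"store": "North", "amount": 12.5},
    {"store": "North", "amount": 99.0},
    {"store": "South", "amount": 7.2},
    {"store": "South", "amount": None},  # missing value
    {"store": "East",  "amount": 15.0},
]

# Profile the raw data before doing anything else: counts, ranges, missingness.
amounts = [r["amount"] for r in rows if r["amount"] is not None]
profile = {
    "row_count": len(rows),
    "amount_missing": sum(1 for r in rows if r["amount"] is None),
    "amount_min": min(amounts),
    "amount_max": max(amounts),
    "store_counts": Counter(r["store"] for r in rows),
}
print(profile)

# Make every filter's impact on the population explicit and auditable.
def apply_filter(rows, name, predicate):
    kept = [r for r in rows if predicate(r)]
    print(f"filter '{name}': {len(rows)} rows -> {len(kept)} rows")
    return kept

clean = apply_filter(rows, "amount present", lambda r: r["amount"] is not None)
clean = apply_filter(clean, "amount < 50", lambda r: r["amount"] < 50)
```

Logging the row counts around each filter is a lightweight habit, but it forces you to justify every reduction of your population rather than discovering a skewed subset at the end of the project.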
Bias is sometimes unavoidable because of funding, politics or resource constraints. However, that does not mean you can ignore it. Recognising the types of bias and understanding their impact on your conclusions will make you a better Data Scientist.
You can read more about how to do reproducible, testable Data Science that helps defend against bias in my book Guerrilla Analytics: A Practical Approach to Working with Data. Can you think of any other biases? Please get in touch!
1. James Kobielus, "Data Scientist: Bias, Backlash and Brutal Self-Criticism", 16 May 2013, http://www.ibmbigdatahub.com/blog/data-scientist-bias-backlash-and-brutal-self-criticism
2. Kate Crawford, "The Hidden Biases in Big Data", 1 April 2013, https://hbr.org/2013/04/the-hidden-biases-in-big-data
3. Lisa Morgan, "7 Common Biases That Skew Big Data Results", 9 July 2015, http://www.informationweek.com/big-data/big-data-analytics/7-common-biases-that-skew-big-data-results/d/d-id/1321211
4. "Design of Experiments for the Tuning of Optimisation Algorithms", PhD thesis, University of York, 2004, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.332.9333&rep=rep1&type=pdf