Data don’t lie, but they could be misleading: understanding biases in data science

By Dr Miko Chang

What can you infer if I tell you that the mean monthly income of 10 people in a room is RM10,000? You might conclude that these 10 people are top-level managers who fall under the T20 category (the top 20% of households by income) in Malaysia. Fair enough. But what if I tell you that one of the 10 earns RM90,000 per month because he runs several businesses? Do the arithmetic quickly and you arrive at a completely different conclusion: the total income in the room is RM100,000, so the remaining nine people share RM10,000 between them, a mean of about RM1,111 each rather than RM10,000. This is a common pitfall of summary statistics such as the arithmetic mean: in a small dataset, a single outlier can dominate the result.
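
The arithmetic in this example can be checked with a few lines of Python. This is only a minimal sketch: the nine individual incomes are hypothetical values picked so that they sum to RM10,000, matching the figures in the example, and the median is shown purely as one possible summary that is less sensitive to a single outlier.

```python
from statistics import mean, median

# Hypothetical monthly incomes (RM) for the nine ordinary earners.
# These values are assumptions chosen to sum to RM10,000.
others = [900, 1000, 1000, 1100, 1100, 1100, 1200, 1300, 1300]
outlier = 90_000  # the business owner from the example

room = others + [outlier]

mean_all = mean(room)       # mean of all 10 people
mean_others = mean(others)  # mean once the outlier is left out
median_all = median(room)   # middle value of all 10 people

print(f"Mean of all 10 people:   RM{mean_all:,.0f}")
print(f"Mean of the other nine:  RM{mean_others:,.0f}")
print(f"Median of all 10 people: RM{median_all:,.0f}")
```

With these assumed values, the median stays close to what most people in the room earn, which is why it is often reported alongside the mean when data are skewed.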

Data do not lie, but our interpretations of them through statistics and mathematical models can mislead us if we are not aware of the potential biases. Data science models are trained and built on historical data. Hence, there is a strong underlying assumption that we must keep in mind when interpreting their results: that the future will resemble the past, i.e., the historical data. When the world was first hit by COVID-19, many time series prediction models failed because the pandemic brought conditions that their historical data had never captured. As a result, data scientists tried salvaging their time series models by simulating different scenarios under varying assumptions, such as “what if the lockdown is ended, maintained or further extended?” The resulting scenarios were meant to help decision-makers devise the most appropriate strategy as the situation changed.

Despite that, building a perfect model by capturing every piece of data is an impossible quest. Data are important, and models are useful as long as their underlying assumptions are not violated. For example, the estimated time of arrival (ETA) on Google Maps, based on posted speed limits and historical traffic patterns, serves as a useful guide for planning our journeys. In the international bestseller “Factfulness: Ten Reasons We’re Wrong About the World”, the Swedish physician and academic Hans Rosling argues that data help us form evidence-based opinions grounded in hard facts. This allows us to view things rationally while downplaying the influence of cognitive biases.

Overcoming cognitive biases requires constant effort, as they stem from the mental shortcuts that our brains developed to simplify information processing in this noisy world. These shortcuts can be surprisingly accurate at times because they are based on our experience or gut feeling, but they can also mislead us due to many factors, including emotions. While data science is an objective discipline in which scientific methods are applied to systematically extract insights from data, the analytical process can be highly subjective, depending on the judgements and interpretations of the stakeholders. Among all these biases, two of the most prominent are survivorship bias and confirmation bias.

Survivorship bias arises when the data we capture come only from the cases that made it through some selection process, rather than from the full population. The data points that did not survive never get the chance to tell their side of the story, so all our analyses are biased toward the samples that we successfully collected. For example, a post-COVID survey that aims to analyse the most critical business challenges faced by companies during the pandemic will suffer from survivorship bias, since the businesses that were hit hardest would be long gone and hence unable to respond to such a survey.
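
To make the survey example concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than real data: the revenue-loss percentages are made up, and so is the rule that any company losing 60% or more of its revenue has closed and therefore cannot answer the survey.

```python
from statistics import mean

# Hypothetical revenue losses (%) for ten companies during the pandemic.
# These numbers are invented for illustration only.
revenue_loss_pct = [5, 10, 15, 20, 30, 40, 65, 75, 85, 95]

# Assumption: companies that lost 60% or more have closed down.
CLOSURE_THRESHOLD = 60

# Only surviving companies are still around to answer the survey.
survivors = [loss for loss in revenue_loss_pct if loss < CLOSURE_THRESHOLD]

mean_all = mean(revenue_loss_pct)  # what actually happened
mean_survivors = mean(survivors)   # what the survey would report

print(f"Companies surveyed: {len(survivors)} of {len(revenue_loss_pct)}")
print(f"Mean loss, all companies:   {mean_all:.0f}%")
print(f"Mean loss, survivors only:  {mean_survivors:.0f}%")
```

Under these assumptions, the survey would understate the average damage simply because the worst-hit companies are missing from the responses.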

The second common yet dangerous bias is confirmation bias, which stems from our existing beliefs or prejudices. Confirmation bias is part of human nature: we tend to stay within our comfort zone, i.e., we only talk to people we like, or we enjoy reading books and news that conform closely to our own views. One effective way to overcome such bias is humility: keeping an open mind by embracing diverse opinions and questioning our own beliefs from time to time.

In conclusion, data are powerful if analysed and interpreted correctly. In this age of information overload, where the wave of digitalisation has resulted in an exponential growth of data, we are constantly exposed to more data science applications such as business forecast reports, interactive data dashboards and predictions from data science models. It is important that we are aware of the underlying assumptions behind all these analytical results, because data do not lie, but they could be misleading due to different biases in data science.

Dr Miko Chang is a Senior Lecturer and Discipline Leader for Bachelor of Information and Communication Technology at Swinburne University of Technology Sarawak Campus. She is contactable at [email protected]