Imagine this.
You’re running for president and you want to be the voice of the majority.
So, you head to an environmentalist meeting and ask five people what they think about the meat industry. All five of them say that meat production should be banned. Immediately, you’re convinced that everyone wants to ban meat production to save Earth.
You make this your key issue and preach it day and night, thinking that this is the secret to winning the election.
Four months later, you end up with less than 1 percent of the vote.
Your idiocy could have easily been avoided if you had known about bias. Bias is an important concept to understand, not just in statistics and machine learning, but in other areas like philosophy, psychology, and business too.
Generally, bias is defined as “prejudice in favor of or against one thing, person, or group compared with another, usually in a way considered to be unfair.”
Bias is bad. We want to minimize it as much as we can. To do so, we need to understand it clearly.
What Is Statistical Bias?
For this article, we’re going to focus on statistical bias. Statistical bias happens when a model or statistic systematically misrepresents the population it is meant to describe. Several types of bias can cause this error.
The most common types of bias include the following:
6 Types of Statistical Bias
- Selection bias.
- Survivorship bias.
- Omitted variable bias.
- Recall bias.
- Observer bias.
- Funding bias.
1. Selection Bias
Selection bias is the phenomenon of selecting “individuals, groups or data for analysis in such a way that proper randomization is not achieved, ultimately resulting in a sample that is not representative of the population.”
Several subtypes of selection bias exist as well.
Sampling Bias
Sampling bias refers to the collection of a biased sample caused by non-random sampling. To give an example, imagine that there are 10 people in a room, and you ask if they prefer grapes or bananas. If you only surveyed the three women present and concluded that the majority of people like grapes, you’d have demonstrated sampling bias.
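The grapes-versus-bananas example can be sketched in a few lines of Python. The individual preferences below are made up for illustration (the article does not say what each person prefers); the point is only that an estimate from the three women can differ sharply from the value for the whole room.

```python
# Hypothetical room of 10 people as (group, preferred fruit) pairs.
# These preferences are invented for illustration only.
room = (
    [("woman", "grapes")] * 3
    + [("man", "bananas")] * 6
    + [("man", "grapes")] * 1
)

def share_preferring(people, fruit):
    """Return the fraction of `people` whose preferred fruit is `fruit`."""
    return sum(1 for _, choice in people if choice == fruit) / len(people)

# Biased sample: only the three women are surveyed.
women_only = [person for person in room if person[0] == "woman"]

biased_estimate = share_preferring(women_only, "grapes")
population_value = share_preferring(room, "grapes")

print(f"Grapes share, women only: {biased_estimate:.0%}")
print(f"Grapes share, whole room: {population_value:.0%}")
```

With these invented numbers, the women-only sample suggests everyone prefers grapes, while fewer than half of the room actually does.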
Time Interval Bias
Time interval bias is caused by intentionally choosing a time range that supports the desired conclusion. For example, estimating the average number of tweets per hour from a sample taken only during peak hours (9 p.m. to 12 a.m.) would overstate the true average.
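As a rough sketch of the tweet example, the snippet below compares the average over the peak window with the average over a full day. The hourly counts are hypothetical numbers chosen for illustration, not real Twitter data.

```python
from statistics import mean

# Hypothetical tweets posted in each hour of one day.
# Index 0 is midnight to 1 a.m., index 23 is 11 p.m. to midnight.
tweets_per_hour = [
    20, 10, 5, 5, 5, 10, 20, 40, 50, 50, 50, 60,
    70, 60, 50, 50, 60, 70, 80, 90, 100, 150, 160, 140,
]

# Peak window from the example: 9 p.m. to 12 a.m. (hours 21, 22, 23).
peak_hours = tweets_per_hour[21:24]

peak_mean = mean(peak_hours)
full_day_mean = mean(tweets_per_hour)

print(f"Average tweets per hour, peak window only: {peak_mean:.1f}")
print(f"Average tweets per hour, full day:         {full_day_mean:.1f}")
```

In this made-up data set, the peak-only average is more than twice the full-day average, which is the kind of distortion a hand-picked time window can produce.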
Susceptibility Bias
Susceptibility bias includes clinical susceptibility bias, protopathic bias, and indication bias, all of which involve mistaking correlation for cause and effect.
Confirmation Bias
Confirmation bias is the tendency to favor information that confirms one’s beliefs.
2. Survivorship Bias
Survivorship bias is a phenomenon where only those that survived a long process are included in an analysis and those that did not are left out, thus creating a biased sample.
Sreenivasan Chandrasekar provides this example:
“We enroll for gym membership and attend for a few days. We see the same faces of many people who are fit, motivated and exercising every day whenever we go to gym. After a few days, we become depressed. Why aren’t we able to stick to our schedule and motivation more than a week when most of the people whom we saw at gym could? What we didn’t see was that many of the people who had enrolled for gym membership had also stopped turning up for gym just after a week, and we didn’t see them.”
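The gym example can be illustrated with a small sketch. The cohort below is hypothetical (100 new members, most of whom quit after one week); it simply shows how looking only at the people still present inflates the apparent share of members who stick with it.

```python
# Hypothetical cohort: number of weeks each new member kept attending.
# 80 members quit after one week, 20 kept going for twelve weeks.
weeks_attended = [1] * 80 + [12] * 20

def still_attending(weeks, at_week):
    """Return the members who are still showing up in week `at_week`."""
    return [w for w in weeks if w >= at_week]

def share_lasting_over_a_week(weeks):
    """Return the fraction of members who attended for more than one week."""
    return sum(1 for w in weeks if w > 1) / len(weeks)

# The "survivors": the only people we can actually see at the gym in week 4.
visible_in_week_4 = still_attending(weeks_attended, at_week=4)

share_among_visible = share_lasting_over_a_week(visible_in_week_4)
share_among_everyone = share_lasting_over_a_week(weeks_attended)

print(f"Lasted over a week, among people we see: {share_among_visible:.0%}")
print(f"Lasted over a week, among all who joined: {share_among_everyone:.0%}")
```

Under these assumed numbers, everyone visible in week 4 looks committed, even though only one in five of the people who enrolled made it past the first week.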
3. Omitted Variable Bias
This bias stems from the absence of relevant variables in a model. In machine learning, leaving out relevant variables, or removing too many variables, results in an underfit model.
An example of this is purchasing a car based on the brand and the car model but not the mileage. Imagine finding a 2020 Porsche 911 Turbo for $10,000. That sounds like a steal until you find out that there are 400,000 miles on it.
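A minimal sketch of the car example follows, using invented listings for a single car model. It compares a price estimate that ignores mileage (the average price for the model) with a hand-written least-squares line that includes mileage. The listings and the exact linear relationship are assumptions made for illustration, not real market data, and this is an informal illustration rather than a formal econometric treatment.

```python
from statistics import mean

# Hypothetical listings for one car model: (mileage in miles, price in dollars).
# The prices are invented so that price = 150_000 - 0.35 * mileage.
listings = [
    (0, 150_000),
    (100_000, 115_000),
    (200_000, 80_000),
    (300_000, 45_000),
    (400_000, 10_000),
]
mileages = [m for m, _ in listings]
prices = [p for _, p in listings]

# Model 1 omits mileage: every car of this model gets the average price.
price_without_mileage = mean(prices)

# Model 2 includes mileage: simple least-squares line fitted by hand.
mean_mileage, mean_price = mean(mileages), mean(prices)
slope = sum(
    (m - mean_mileage) * (p - mean_price) for m, p in listings
) / sum((m - mean_mileage) ** 2 for m in mileages)
intercept = mean_price - slope * mean_mileage

def price_with_mileage(mileage):
    """Predict the price from mileage using the fitted line."""
    return intercept + slope * mileage

print(f"Estimate ignoring mileage:       ${price_without_mileage:,.0f}")
print(f"Estimate at 400,000 miles:       ${price_with_mileage(400_000):,.0f}")
```

With mileage left out, the 400,000-mile car appears to be worth the model's average price, so a $10,000 listing looks like a bargain; once mileage is included, the same listing looks fairly priced.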
4. Recall Bias
Recall bias is a type of information bias where people do not accurately or completely remember previous events or details. It is related to recency bias, where we tend to remember more recent events better.
5. Observer Bias
This bias stems from the subjective viewpoint of observers and how they assess subjective criteria or record subjective information.
6. Funding Bias
Also known as sponsorship bias, funding bias is the tendency to skew a study or the results of a study to support a financial sponsor.