Statistical bias occurs when a statistic is unrepresentative of the population, and it can lead to inaccurate results in a research study, experiment or forecasting model.
What Is Statistical Bias?
Statistical bias is any instance that creates a difference between the expected value of an estimate and the true value of the parameter being estimated, leading to inaccurate results. It can be caused by inadequate data collection and measurement, omission of relevant variables or flawed study design.
Imagine you’re running for president and you want to be the voice of the majority.
So, you head to an environmentalist meeting and ask five people what they think about the meat industry. All five of them say that meat production should be banned. Immediately, you’re convinced that everyone wants to ban meat production to save Earth. You make this your key issue and preach it day and night, thinking that this is the secret to winning the election.
Four months later, you end up with less than 1 percent of the vote.
Your mishap could easily have been avoided if you had known about statistical bias. Statistical bias is an important concept to understand, not just in pure statistics and machine learning, but also in areas like philosophy, psychology, and business.
Bias is bad. We want to minimize it as much as we can. To do so, we need to understand it clearly.
What Is Statistical Bias?
Statistical bias describes any instance that creates a difference between the expected value of an estimate and the true value of the population parameter being estimated. It can also be thought of as the systematic underestimation or overestimation of a population value.
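To make the definition concrete, here is a minimal sketch in Python that approximates bias as the average estimate minus the true value. The setup is an assumption chosen for illustration: a normal population with a variance of 4, samples of five observations, and two standard-library variance estimators. The helper name `estimate_bias` is hypothetical.

```python
import random
import statistics

def estimate_bias(estimator, true_value, sample_size=5, trials=20_000, seed=0):
    """Approximate bias = (average estimate) - (true value) over many samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        # Assumed population for this sketch: Normal(mean=0, sd=2).
        sample = [rng.gauss(0.0, 2.0) for _ in range(sample_size)]
        estimates.append(estimator(sample))
    return statistics.fmean(estimates) - true_value

TRUE_VARIANCE = 4.0  # sd of 2, so variance is 2 ** 2

# pvariance divides by n; variance divides by n - 1.
bias_divide_by_n = estimate_bias(statistics.pvariance, TRUE_VARIANCE)
bias_divide_by_n_minus_1 = estimate_bias(statistics.variance, TRUE_VARIANCE)

print(f"bias when dividing by n:     {bias_divide_by_n:+.2f}")
print(f"bias when dividing by n - 1: {bias_divide_by_n_minus_1:+.2f}")
```

In this setup, dividing by n underestimates the variance on average (in theory by the true variance divided by the sample size, or 0.8 here), while dividing by n - 1 lands close to zero bias. The individual estimates still vary from sample to sample; bias is about where they center, not how much they scatter.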
Several types of bias can cause this error. The most common types of statistical bias include:
6 Types of Statistical Bias
- Selection bias.
- Survivorship bias.
- Omitted variable bias.
- Recall bias.
- Observer bias.
- Funding bias.
1. Selection Bias
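Selection bias occurs when the people or data points selected for analysis do not represent the population being studied. Several subtypes of selection bias exist: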
SAMPLING BIAS
Sampling bias refers to the collection of a biased sample caused by non-random sampling. For example, imagine that there are 10 people in a room, and you want to know whether they prefer grapes or bananas. If you surveyed only the three women present and concluded that the majority of people like grapes, you’d have demonstrated sampling bias.
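The room example can be sketched in a few lines of Python. The individual preferences below are invented for illustration (the example above does not specify them), and `share_preferring_grapes` is a hypothetical helper.

```python
import random

# Hypothetical room of 10 people as (gender, preferred fruit) pairs.
# These preferences are assumptions made up for this sketch.
room = (
    [("woman", "grapes")] * 3
    + [("man", "grapes")] * 1
    + [("man", "bananas")] * 6
)

def share_preferring_grapes(people):
    """Fraction of the given people who prefer grapes."""
    return sum(1 for _, fruit in people if fruit == "grapes") / len(people)

# The true value for the whole room.
population_share = share_preferring_grapes(room)

# Non-random sample: only the three women are surveyed.
women_only_share = share_preferring_grapes([p for p in room if p[0] == "woman"])

# Random samples of three people, repeated many times.
rng = random.Random(42)
random_shares = [share_preferring_grapes(rng.sample(room, 3)) for _ in range(10_000)]
average_random_share = sum(random_shares) / len(random_shares)

print(population_share, women_only_share, round(average_random_share, 2))
```

Under these assumed preferences, surveying only the women suggests everyone prefers grapes, while random samples of the same size average out near the room's actual share.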
TIME INTERVAL BIAS
Time interval bias is caused by intentionally choosing a specific range of time to support a desired conclusion. For example, estimating the average number of tweets per hour from a sample taken only during peak hours (9 p.m. to 12 a.m.) would overstate the true hourly average.
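As a rough illustration of the tweet example, the sketch below compares the average over a full day with the average over the peak window only. The hourly counts are made-up numbers, not real data.

```python
# Hypothetical tweets per hour, indexed by hour of day (0 = midnight to 1 a.m.).
# All counts are invented for this sketch.
tweets_by_hour = [20] * 6 + [60] * 15 + [200] * 3

# Average over the full 24 hours.
full_day_average = sum(tweets_by_hour) / len(tweets_by_hour)

# Average over the peak window only: 9 p.m. to 12 a.m. (hours 21, 22 and 23).
peak_hours = tweets_by_hour[21:24]
peak_only_average = sum(peak_hours) / len(peak_hours)

print(f"full day:  {full_day_average} tweets per hour")
print(f"peak only: {peak_only_average} tweets per hour")
```

With these assumed counts, the peak-only figure is roughly three times the full-day average, even though both are computed from the same underlying data.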
SUSCEPTIBILITY BIAS
Susceptibility bias arises when one condition makes a person susceptible to a second condition, and the treatment for the first condition is mistakenly seen as the cause of the second. For example, a patient with high cholesterol is at greater risk of heart disease and may take medication to lower their cholesterol levels. If the patient later develops heart disease, the medication may be wrongly blamed for causing it.
This type of bias arises particularly in epidemiological studies and includes clinical susceptibility bias, protopathic bias and indication bias, all of which involve mistaking correlation for causation.
CONFIRMATION BIAS
Confirmation bias is the tendency to favor information that confirms one’s beliefs. This can lead individuals to seek out and support only a small subset of the available data and to ignore the rest because it doesn’t align with what they’re searching for.
For example, confirmation bias can surface during presidential elections. Individuals may intentionally look for information that depicts their preferred candidate as a positive figure, while at the same time ignoring information that depicts them as a negative figure.
2. Survivorship Bias
Survivorship bias occurs when an analysis includes only the people or things that survived a long process and overlooks those that did not, thus creating a biased sample.
Sreenivasan Chandrasekar provides this example:
“We enroll for gym membership and attend for a few days. We see the same faces of many people who are fit, motivated and exercising every day whenever we go to gym. After a few days, we become depressed. Why aren’t we able to stick to our schedule and motivation more than a week when most of the people whom we saw at gym could? What we didn’t see was that many of the people who had enrolled for gym membership had also stopped turning up for gym just after a week, and we didn’t see them.”
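The gym scenario can be mimicked with a small simulation. This is only a sketch under invented assumptions: each new member keeps attending for a random number of weeks (four on average), and the observer only sees members who are still attending at week eight.

```python
import random
import statistics

rng = random.Random(7)

# Assumption for this sketch: weeks of attendance before quitting follow an
# exponential distribution with a mean of 4 weeks.
weeks_attended = [rng.expovariate(1 / 4) for _ in range(10_000)]

# Everyone who ever signed up.
average_all = statistics.fmean(weeks_attended)

# Only the "survivors" still visible in the gym at week 8.
survivors = [weeks for weeks in weeks_attended if weeks >= 8]
average_survivors = statistics.fmean(survivors)
survivor_share = len(survivors) / len(weeks_attended)

print(f"all members:  {average_all:.1f} weeks on average")
print(f"survivors:    {average_survivors:.1f} weeks on average")
print(f"share still attending at week 8: {survivor_share:.0%}")
```

Judging persistence from the people still in the gym makes sticking with it look far more common than it is, because everyone who quit early has already dropped out of view.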
3. Omitted Variable Bias
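Omitted variable bias stems from the absence of relevant variables in a machine learning or forecasting model. When an omitted variable is related to both the outcome and the variables that remain in the model, its effect is wrongly attributed to those remaining variables. In machine learning, removing relevant variables, or too many variables, results in an underfit model.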
An example of this is purchasing a car based on the brand and the car model, but not the mileage. Imagine finding a 2020 Porsche 911 Turbo for $10,000. That sounds like a steal until you find out that there are 400,000 miles on it.
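The car example can be turned into a small regression sketch. Everything numeric here is an assumption made for illustration: price depends on both age and mileage, older cars tend to have more miles, and the helper `covariance` is written out by hand so the block needs only the standard library.

```python
import random
from statistics import fmean

def covariance(xs, ys):
    """Sample covariance of two equal-length lists."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

rng = random.Random(1)
n = 5_000

# Hypothetical data-generating process (all coefficients are invented):
# mileage rises with age, and price falls with both age and mileage.
age = [rng.uniform(0, 10) for _ in range(n)]                     # years
mileage = [1.2 * a + rng.gauss(0, 1) for a in age]               # tens of thousands of miles
price = [50 - 2.0 * a - 1.5 * m + rng.gauss(0, 1)                # thousands of dollars
         for a, m in zip(age, mileage)]

# Model that omits mileage: regress price on age alone.
slope_age_only = covariance(age, price) / covariance(age, age)

# Model that includes mileage: solve the two-variable least-squares equations.
s_aa, s_mm, s_am = covariance(age, age), covariance(mileage, mileage), covariance(age, mileage)
s_ap, s_mp = covariance(age, price), covariance(mileage, price)
slope_age_full = (s_mm * s_ap - s_am * s_mp) / (s_aa * s_mm - s_am ** 2)

print(f"age effect, mileage omitted:  {slope_age_only:.2f}")
print(f"age effect, mileage included: {slope_age_full:.2f}")
```

In this setup the true effect of one extra year is -2.0. Leaving mileage out pushes the estimated age effect to about -3.8, because age is now also standing in for the miles that older cars accumulate.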
4. Recall Bias
Recall bias is a type of information bias where people do not accurately remember previous events or details, leading to inaccurate accounts of past exposures. It is related to recency bias, the tendency to remember recent events better than older ones. This can unintentionally influence results in studies that rely on self-reporting.
For example, say you are conducting a survey about who in your town contracted the flu and whether or not they previously received a flu vaccine. When asked whether they got a vaccine, people with the flu may not be able to remember.
5. Observer Bias
Observer bias stems from the subjective viewpoint of observers and how they assess subjective criteria or record subjective information. It is especially likely to occur if the person conducting a study has a desired outcome or a preconception about the results.
For example, suppose a scientist runs a study in which two groups are told they are receiving a medication to help with headaches, but one group actually receives a placebo. The scientist already expects that the placebo group will still experience headaches. Acting on this assumption, the scientist may treat the two groups differently and frame questions about pain level differently for each group.
6. Funding Bias
Also known as sponsorship bias, funding bias is the tendency to skew a study or the results of a study to support a financial sponsor. Funding bias has been seen in studies reporting the nutritional or health effects of commercial products like food, tobacco or pharmaceutical drugs.
For example, a shampoo manufacturer that uses a harmful chemical as one of its ingredients may fund studies that support the positive effects of using the chemical.
Frequently Asked Questions
What is statistical bias?
Statistical bias is any instance that creates a difference between the expected value of an estimate and the true value of the parameter being estimated. In other words, it occurs when a statistic is systematically unrepresentative of the population.
What is an example of statistical bias?
As an example of statistical bias, imagine there are 100 people in a room and you want to determine whether they like ketchup or mustard better. You ask only five people, all of whom you already know like ketchup, for their opinion. From this, you conclude that everyone in the room likes ketchup better.
What are the types of bias in statistics?
Six common types of statistical bias are:
- Selection bias
- Survivorship bias
- Omitted variable bias
- Recall bias
- Observer bias
- Funding bias