Correlation is a really useful statistic. It tells you that two variables tend to move together. It’s also one of the easiest things to measure in statistics and data science: calculating a correlation takes literally one line of code (or a simple formula in Excel).
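For instance, here’s a minimal sketch in Python with pandas (the data and column names are made up):

```python
import pandas as pd

# Hypothetical data: two columns we suspect move together
df = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                   "y": [2.1, 3.9, 6.2, 8.1, 9.8]})

# Pearson correlation in a single line
print(df["x"].corr(df["y"]))  # close to 1.0 for this near-linear toy data
```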
Machine learning models, both the predictive kind and the explanatory kind, are built on correlations. For example, the usefulness of a forecasting model depends heavily on your ability to find and engineer feature variables that are highly correlated with whatever it is you are trying to predict.
But correlation is not causation (I bet you’ve heard this before). A lot of the time this doesn’t matter, but sometimes it matters a lot; it depends on the question you are trying to answer. If all you care about is prediction (say, what will the stock market do next month?), then the distinction between correlation and causation matters little.
But if we’re trying to decide between several policy options to invest in, and we want the chosen policy to bring about some sort of positive outcome, then we had better be sure that it really will. In this case, we care greatly about causality. If we’re wrong and mistake correlation for true causation, we could end up wasting millions of dollars and years of effort.
Say, for instance, we observed a high correlation between hair loss and wealth. We would really regret it if we ripped all our hair out expecting money to start pouring into our bank accounts. This is an example of non-causal correlation: some other, hidden variable (a confounder) is the true driver of the correlation we observe. In our case, it might be that there were a lot of really stressed-out entrepreneurs in our sample, people who worked night and day and ultimately traded their hair and some of their health for a big payout.
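A quick toy simulation makes this concrete. Everything below is invented for illustration: a hidden stress variable drives both hair loss and wealth, and the two end up correlated even though neither causes the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hidden driver: stress (think workaholic entrepreneurs)
stress = rng.normal(size=n)

# Stress independently drives both variables;
# neither one causes the other
hair_loss = 0.8 * stress + rng.normal(size=n)
wealth = 0.8 * stress + rng.normal(size=n)

# Yet the two are clearly correlated (about 0.4 here)
print(np.corrcoef(hair_loss, wealth)[0, 1])
```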
So how can we measure causation?
The Ideal Way: Randomized Experiments
The purest way to establish causation is through a randomized controlled experiment (like an A/B test) where you have two groups: one gets the treatment, one doesn’t. The critical assumption is that the two groups are homogeneous, meaning there are no systematic differences between them (besides one getting the treatment and the other not) that could bias the result.
If the group that gets the treatment reacts positively, then we know there is causation between the treatment and the positive effect we observe. We know this because random assignment controls for all other explanatory factors besides the thing we are testing. So any statistically significant difference between the two groups must be attributable to the treatment.
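Here’s a minimal sketch of that logic on simulated data (the sample size and effect size are arbitrary). Random assignment makes the two groups homogeneous in expectation, so a simple two-sample test on the outcome isolates the treatment effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 5_000

# Randomization: each user is assigned by coin flip, so the
# two groups are homogeneous in expectation
treated = rng.random(n) < 0.5

# Toy outcome: a true treatment effect of +0.2 on top of noise
outcome = 0.2 * treated + rng.normal(size=n)

# Any statistically significant difference is attributable to the treatment
t_stat, p_value = stats.ttest_ind(outcome[treated], outcome[~treated])
effect = outcome[treated].mean() - outcome[~treated].mean()
print(f"estimated effect: {effect:.3f}, p-value: {p_value:.2g}")
```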
What if We Can’t Run an Experiment?
The problem is that, in reality, we often can’t run randomized controlled experiments. Most of the time, we only have observational data to work with. Don’t get me wrong, observational data is great, but it falls short when it comes to proving causation.
With empirical data, we often run into the chicken-and-the-egg problem. For example, at a previous company, I was tasked with a project to prove that our investment advisory service helped increase users’ savings rates. There was a strong correlation between signing up for our service and increased savings rates — people who signed up for our services were much more likely to increase savings than those who didn’t.
But correlation is not enough. Another plausible explanation is that the people who already want to save more are the ones who sign up for our service (self-selection). In that case, it’s not that our service helped them save more, but rather that signing up for our service was a byproduct of wanting to save more (the chicken-and-egg problem). If this were true, then a company that paid for subscriptions to our advisory service for its employees would not see an increase in their savings rates, because the relationship is not causal.
Testing for Causation Using Pseudo-Randomness
So how do we get around this when there’s no way to run an actual experiment? We need to look for events that introduce pseudo-randomness.
Recall that the critical assumption that allows us to prove causation with an A/B test is that the two groups are homogeneous; thus, all differences in outcomes can be attributed to the treatment. So when we can’t run an experiment, we need to look for sub-periods or sub-portions of our data that happen to produce two homogeneous groups we can compare.
Let’s see what I mean by that through our earlier investment advisory service example.
In my case, fortunately there were a handful of companies that opted all their new employees into our service starting in 2014. I could compare how these new employees’ savings rates evolved over time relative to new employees in the years prior to 2014. The big assumption I’d be making here is that the pre- and post-2014 new employee cohorts at these companies were pretty similar across all the characteristics that mattered (such as age, education and salary).
Here, the employee’s job start date (whether it was before 2014 or not) is known as an instrumental variable. An instrumental variable is one that changes the probability of receiving the treatment but affects the outcome only through the treatment, not directly. In other words, it isolates the impact of the treatment across the two groups and creates a reasonable approximation of a randomized controlled experiment.
The new employees pre-2014 were indeed reasonably similar to the new employees that joined these companies in 2014 and after (yes, I checked). But the ones from 2014 onward were defaulted into our advisory service. This allowed me to compare two reasonably homogeneous groups where the only major difference was whether they were defaulted into our service or not.
It turned out that the post-2014 new employees did increase their savings at a higher rate than the pre-2014 cohort. As expected, this increase was smaller than the naive difference observed across the entire population, where self-selection had inflated the gap.
So yes, our service was definitely more attractive to those who already wanted to save more (there was a decent amount of non-causal correlation). But even after removing this effect through an instrumental variable, I still found a statistically significant difference in the savings rate increases of the two groups. And yes, my company’s service did help users increase their savings rates.
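To make the mechanics concrete, here’s a toy simulation of this kind of analysis (all names, coefficients, and numbers are invented). It uses the simple Wald form of the instrumental-variable estimate: divide the outcome gap across instrument groups by the enrollment gap the instrument induced.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# Instrument: joined in/after 2014, when new hires were defaulted in
post_2014 = rng.random(n) < 0.5

# Unobserved confounder: desire to save drives both sign-up and savings
motivation = rng.normal(size=n)

# Treatment: enrollment depends on the default (instrument) and motivation
enrolled = (1.5 * post_2014 + motivation + rng.normal(size=n)) > 0.5

# Outcome: savings-rate increase with a true causal effect of 1.0,
# plus the motivation confounder
savings_increase = 1.0 * enrolled + motivation + rng.normal(size=n)

# Naive comparison is biased upward by self-selection
naive = savings_increase[enrolled].mean() - savings_increase[~enrolled].mean()

# Wald/IV estimate: outcome gap across instrument groups,
# scaled by the enrollment gap the instrument induced
outcome_gap = savings_increase[post_2014].mean() - savings_increase[~post_2014].mean()
enrollment_gap = enrolled[post_2014].mean() - enrolled[~post_2014].mean()
iv_estimate = outcome_gap / enrollment_gap

print(f"naive: {naive:.2f}, IV: {iv_estimate:.2f}")
```

In the simulation, the naive comparison overstates the true effect of 1.0 because motivated savers self-select into enrollment, while the IV estimate recovers something close to it.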
The Value of Determining Causality
Causation is never easy to prove. I got lucky that there was a feasible instrumental variable to use. But generally, good instrumental variables will not be easy to find — you will have to think creatively and really know your data well to uncover them.
But it can be worth it. When you are thinking of investing significant amounts of resources and time, it’s not enough to know that something is correlated with the effect you are after. You need to be reasonably certain that there is a real causal relationship between the action you are considering and the effect you desire.