Likelihood Vs. Probability: What’s the Difference?

When someone asks, “How likely is it to rain tomorrow?” normal people (i.e., not statisticians or economists) simply substitute in the question “What’s the probability that it rains tomorrow?” and then respond with whatever their weather app or their gut tells them. To most, deciphering the difference these two terms imply about inference is not worth wading through the scribblings of a few academics. Even when a term with a strict statistical definition turns up in an everyday context, most individuals outside of academe can work out whether a routine question concerns prediction or evidence and answer it satisfactorily.

What Is the Difference Between Likelihood and Probability?

The fundamental difference between likelihood and probability is what they are designed to numerically represent. Likelihoods weigh the evidence for or against some hypothesis about outcomes that have already occurred, relative to other explanations of those outcomes. Probability considers the entirety of possible future events and assigns each of them a fraction of the space of all possibilities. Since one outcome must occur at the expense of all the other possible outcomes, probabilities must be non-negative and sum to one.

However, statistical definitions, like grammar, have their place in converting between information and insight. This is unfortunately true, even if strict statisticians, in the mold of their grammarian counterparts, often seem hell-bent on wielding their explanatory tools as pedant’s cudgels to punish the world for its collective unscrupulousness. But, at the risk of indulging the dogmatists, consider their primary difference. Probability directs the inferential arrow from insight to outcome. Likelihood reverses the directional arrow to go from outcomes to insight.

“Problems considered by probability and statistics are inverse to each other. In probability theory, we consider some underlying process which has some randomness or uncertainty modeled by random variables, and we figure out what happens. In statistics, we observe something that has happened and try to figure out what underlying process would explain these observations.”

Persi Diaconis

Probability Vs. Likelihood Example

At this point, the foremost question in the reader’s mind is “This is an essay about probability, so when’s a coin flip example going to be brought up?” And what a timely question because here it is: There is a cut-rate gambling hall with a small table where the dealer (or “flipper”) takes wagers on whether a coin will land heads or tails. One person in the crowd, a mathematician specializing in probability (or, “probabilist”) thinks, “Hmm … a two-sided coin. This means that there are only two outcomes, and the result of the next coin flip must be one of these two outcomes and not the other. I’m not sure which of the two outcomes the next flip will result in, but knowing that one out of only two outcomes must come up, I’ll assume the probability of either heads or tails is one out of two. From here, I can figure out the probability of one or many future coin flips, and even compute the probabilities of sequences of coin flips in different scenarios!”

Another person loitering around the coin flipper’s table is a statistician, who thinks “I’d gamble on the outcome of the next coin flip, but this place being a real dive by the look of it, and the fact that coin flipping is a game of chance gives me some pause. How can I be sure that this flipper’s coin is equally likely to come up heads as it is tails? I think I’ll watch some flips and keep track of how often I see heads. After a while, if what I see seems generally consistent with half of this flipper’s spins coming up heads, my assumption of a fair coin will be confirmed. Then, I’ll ask the probabilist how best to play.”

This scenario, adapted from the craps example in Steven Skiena’s Calculated Bets, illustrates the central difference between probability and likelihood. Even more broadly, this also represents the central difference between probability and statistics, the latter of which places the computation of likelihood at its core. Or, as Skiena more deftly states, “In summary, probability theory enables us to find the consequences of a given ideal world, whereas statistical theory permits us to measure the extent to which our world is ideal.”
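The probabilist's forward-looking reasoning can be sketched in a few lines of Python. This is a minimal illustration, assuming independent flips and a fixed heads probability; the function name `prob_heads` is hypothetical and chosen only for this example.

```python
from math import comb

def prob_heads(k: int, n: int, p: float = 0.5) -> float:
    """Probability of exactly k heads in n independent flips,
    assuming each flip lands heads with probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Forward direction: the model (p = 0.5) is assumed and the outcomes are unknown.
three_flip_probs = [prob_heads(k, 3) for k in range(4)]
print(three_flip_probs)       # [0.125, 0.375, 0.375, 0.125]
print(sum(three_flip_probs))  # 1.0
```

The statistician's problem runs the other way: the outcomes are known, and the value of `p` is what is in question.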

At this point, it is beneficial to back up even further and ask two definitional questions: “What are these things called probabilities?” and “What are these things called likelihoods?” Doing so will help solidify definitions. This should help clarify these concepts’ distinction from each other as well as their connection. It should also keep the “statistical grammarians” at bay.

What Is Probability?

So, how to characterize this “probability” stuff? At its core, probability is forward-looking. Contemplating the possibility of a future event mathematically (meaning assigning it some quantity) requires measuring it within the context of other possible future events. Probability, then, is merely a numerical accounting of how prevalent a possible outcome is relative to the set of all possible outcomes. Once an event occurs, its occurrence precludes every other possible event’s occurrence, and its probability is no longer calculable. Because of this, a logical consideration of all events requires at least one of them to occur. Otherwise, the measure has failed to account for every possibility.

3 Rules of Probability

The probability of each event must be non-negative.
The probabilities of all the mutually exclusive possible outcomes must sum to one.
For any two mutually exclusive events, the probability of one or the other occurring is the sum of their individual probabilities.

Thanks to the tedious work of Andrey Kolmogorov, we know that anything quantified as a probability must obey these three axioms.
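For a finite set of mutually exclusive outcomes, these rules are easy to check mechanically. The sketch below uses the usual normalization in which total probability equals one; the helper `is_valid_distribution` and the example dictionaries are hypothetical names introduced for illustration.

```python
from math import isclose
from typing import Mapping

def is_valid_distribution(probs: Mapping[str, float]) -> bool:
    """Check non-negativity and a total of one for a finite set
    of mutually exclusive outcomes."""
    non_negative = all(p >= 0 for p in probs.values())
    sums_to_one = isclose(sum(probs.values()), 1.0)
    return non_negative and sums_to_one

fair = {"heads": 0.5, "tails": 0.5}
loaded = {"heads": 0.1, "tails": 0.9}
broken = {"heads": 0.7, "tails": 0.7}   # totals 1.4, so not a probability

print(is_valid_distribution(fair))    # True
print(is_valid_distribution(loaded))  # True
print(is_valid_distribution(broken))  # False

# Additivity: heads and tails are mutually exclusive, so
# P(heads or tails) = P(heads) + P(tails).
print(fair["heads"] + fair["tails"])  # 1.0
```

Note that both the fair and the loaded coin pass the check: the axioms constrain the total, not how it is split between heads and tails.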

But where does the “sum to one” rule come from? Probability accounting for (read: counting) one or several possible events out of all possible events naturally leads to its representation as a fraction, where the numerator (one or several possible outcomes) is a smaller piece of a denominator (all possible outcomes). In other words, out of all the possible outcomes, one must occur. Since each possible event represents some proportion of the space of all possible events and every possible outcome precludes all others, the sum total of these fractional possibilities must be one, which is the probability that any one of the possible outcomes occurs.

In the context of the canonical coin-flip example, possible outcomes are always either heads or tails, and only one of these two outcomes occurs every flip. So, the probabilities of heads and tails must sum to one. But does this mean that the probability of a heads must equal the probability of a tails? What if a new flipper takes over? What if the flipper inexplicably changes coins? Changes flipping hands? It seems that the probabilist has not taken into account all the possible ways that the probabilities of heads and tails can add up to one.

In fact, the probabilist’s judgment that the probability of heads and tails are each one-half turns on his ignorance of all the possibilities outside of the coin flip. The probabilities of heads or tails must, by definition, sum to one. But their probabilities need not be equal. The probabilist is simply operating from a state of not knowing anything about the coin flip aside from the bare minimum amount of information needed to construct a probability, having never seen the coin flip before.

Enter the statistician, who takes the probabilist’s assumption that the coin flip is “fair” (i.e., both heads and tails are equally likely) as given. “However,” the statistician believes, “I don’t know this to be certain. It seems a bit naïve to think that every process with two outcomes must have equal probabilities that sum to one. I mean, outside it’s either raining or not raining. Does that mean that the probability of precipitation is always 50/50? No, of course not. Rain relies on other factors, many of which I don’t know about. Perhaps there might be something about this coin flip that makes it unfair. Presumably, this flipper will execute more than one coin toss, so I’ll count the number of heads I see and divide it by the total number of flips. That way, after a while, I’ll be more certain that this coin tossing truly is fair.”
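The statistician's bookkeeping (count the heads, divide by the total number of flips) can be sketched as follows. The string encoding of flips and the function name `heads_proportion` are assumptions made for this example; the tally of one heads and nine tails matches the ten-toss record the statistician collects later in the essay.

```python
def heads_proportion(flips: str) -> float:
    """Fraction of heads in a record of flips written as 'H' and 'T'."""
    if not flips:
        raise ValueError("need at least one flip")
    return flips.count("H") / len(flips)

# One heads followed by nine tails.
observed = "H" + "T" * 9
print(heads_proportion(observed))  # 0.1
```

A proportion this far from 0.5 after only ten flips is suggestive rather than conclusive, which is why the statistician keeps watching.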

What the statistician has proposed is accumulating evidence from enough coin flips to be certain of the probabilist’s assumption that there are no other events in the set of possible outcomes besides those pertaining to a fair coin. The amount of evidence the statistician gathers on the number of heads and flips represents the likelihood that the coin is fair. Note the (correct!) use of the term “likelihood” rather than the term “probability.” Rather than starting from axiomatic principles and theoretical inference to construct a numerical possibility of an outcome, likelihood starts with a given numerical possibility and accumulates evidence from outcomes either for or against this specific possibility.

What Is Likelihood?

Likelihood, like probability, is numerically quantifiable. To see how, return to the watchful statistician at the seedy gambling parlor (who is now furiously scribbling a running tally of coin-toss results on cocktail napkins). As the statistician reasons, “I’m testing whether that coin is a fair coin, so I’m assuming a 50 percent probability for heads as well as tails. I’ll record a heads toss as a one, because heads occurred, and a tails toss as a zero, because heads didn’t occur. To count tails, I’ll simply switch these so that tails is one minus heads.”

“But wait,” the statistician thinks, “what do I compare this likelihood to? There must be some alternative explanation for why the heads proportion of the flips count is not 50 percent. Maybe this seedy casino is loading the coin or flipping in a specific way that heads is more likely than tails to come up. How far away from fair do I think that coin might be? I’ll say, as an alternative, that the coin’s probability of coming up heads is 10 percent. That way, I’ll have a comparison of what an unfair coin looks like.”

How to Compute Likelihood

The statistician constructs a simple additive score for the fair-coin assumption as L(fair) = 0.5*(Total # of heads) + 0.5*(Total # of tails). Similarly, the statistician sets up the unfair-coin alternative assumption as L(unfair) = 0.1*(Total # of heads) + 0.9*(Total # of tails). Strictly speaking, the textbook likelihood of a run of independent flips multiplies the per-flip probabilities rather than adding them, so that the fair coin’s likelihood is 0.5^(Total # of heads) * 0.5^(Total # of tails). The additive tally is a simplification that happens to rank the two hypotheses the same way for the outcomes in this example. As the statistician reasons, “I’ll use each likelihood score to show me how the outcomes favor the coin being fair or not.”

After the first throw, a heads, the statistician computes the two likelihood scores as L(fair) = 0.5*(1) + 0.5*(0) = 0.5 and L(unfair) = 0.1*(1) + 0.9*(0) = 0.1. “Comparing the two,” thinks the statistician, “I see that the outcomes favor the coin being fair because 0.5 is greater than 0.1. But I’ll wait on some more tosses to increase my confidence.”

The statistician records 10 tosses’ outcomes as one heads and nine tails and computes scores of L(fair) = 0.5*(1) + 0.5*(9) = 5 and L(unfair) = 0.1*(1) + 0.9*(9) = 8.2. The statistician realizes that the accumulating evidence of the outcomes (or, “data,” to use another precisely correct statistical term) is shifting support away from a fair coin and toward an unfair one. But, letting skepticism prevail, the statistician observes 90 more coin flips and records 90 more tails. The scores are now L(fair) = 0.5*(1) + 0.5*(99) = 50 and L(unfair) = 0.1*(1) + 0.9*(99) = 89.2. The statistician finally concedes that the weight of the coin-flip evidence disfavors the initial assumption (or “null hypothesis”) that the coin is fair.
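For comparison with the tally above, here is a minimal sketch of the standard likelihood for independent flips, which multiplies per-flip probabilities instead of adding them. It is not the essay's own formula; the function names `likelihood` and `log_likelihood_ratio` are hypothetical, and the three (heads, tails) counts are the three stages of the statistician's record.

```python
from math import log

def likelihood(p_heads: float, heads: int, tails: int) -> float:
    """Likelihood of one specific sequence of independent flips,
    given that each flip lands heads with probability p_heads."""
    return p_heads**heads * (1 - p_heads)**tails

def log_likelihood_ratio(p_a: float, p_b: float, heads: int, tails: int) -> float:
    """log[L(p_a) / L(p_b)]; positive values favor p_a over p_b."""
    return (heads * log(p_a / p_b)
            + tails * log((1 - p_a) / (1 - p_b)))

# Stages of the tally: 1 heads; 1 heads and 9 tails; 1 heads and 99 tails.
for heads, tails in [(1, 0), (1, 9), (1, 99)]:
    fair = likelihood(0.5, heads, tails)
    unfair = likelihood(0.1, heads, tails)
    llr = log_likelihood_ratio(0.1, 0.5, heads, tails)
    print(f"{heads}H {tails}T: L(fair)={fair:.3g}, "
          f"L(unfair)={unfair:.3g}, log ratio (unfair vs. fair)={llr:.2f}")
```

The product form leads to the same verdicts as the statistician's tally here: the single heads favors the fair coin, while the later records favor the unfair one (by a factor of roughly 40 after ten flips). Working with the log ratio avoids the very small numbers that products of many probabilities produce.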

How Are Likelihood and Probability Different?

The statistician’s likelihood evaluates the weight of the evidence for an assumed hypothesis. Likelihood does not prove that the initial assumption of a fair coin is true. It only compares how likely the data produced by the coin-flipping process are under the assumption of a fair coin with how likely they are under an alternative hypothesis. This “alternative hypothesis” (i.e., how unfair the coin is purported to be) could have been any non-50 percent figure the statistician might have wanted. The only caveat is that the closer the null and alternative hypotheses, the more data the statistician requires to accurately compare the likelihoods against one another.

In other words, rather than choosing an extreme amount of coin unfairness, the statistician could have chosen an alternative that the coin is just slightly unfair, at say a 51 percent probability of tails. Distinguishing this minuscule level of difference from a fair coin with any amount of confidence, however, necessitates the statistician collecting a large amount of data to ensure that the difference is not due simply to random chance in coin-flip outcomes.
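A rough sense of "a large amount of data" can be had from the average log-likelihood ratio each flip contributes when the alternative is actually true. The sketch below is a heuristic under stated assumptions, not a formal power analysis: the evidence threshold of 20-to-1 (`target_ratio`) is arbitrary, and both function names are hypothetical.

```python
from math import ceil, log

def expected_log_lr_per_flip(p_true: float, p_null: float) -> float:
    """Average per-flip log-likelihood ratio (true vs. null) for a
    two-outcome process; p is the tails probability in both cases."""
    return (p_true * log(p_true / p_null)
            + (1 - p_true) * log((1 - p_true) / (1 - p_null)))

def rough_flips_needed(p_true: float, p_null: float = 0.5,
                       target_ratio: float = 20.0) -> int:
    """Flips needed for the expected log-likelihood ratio to reach
    log(target_ratio). The threshold is an arbitrary assumption."""
    return ceil(log(target_ratio) / expected_log_lr_per_flip(p_true, p_null))

print(rough_flips_needed(0.90))  # extreme alternative: a handful of flips
print(rough_flips_needed(0.51))  # slight alternative: many thousands of flips
```

Under these assumptions, a 90 percent tails coin separates from a fair coin within about ten flips, while a 51 percent tails coin needs on the order of 15,000.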

We should also note that likelihood does not construct a probability. The choice of a fair coin as an initial assumption is arbitrary. The statistician could have just as easily chosen some other probability of heads for a null hypothesis and evaluated it with similar empirical evidence from the coin-flip outcome data.

For instance, the statistician might have chosen a null hypothesis that the coin is slightly biased, with a heads probability of 55 percent. That hypothesis would also have been undermined by the weight of 99 tails and one heads in the likelihood computation.

Likelihoods do not obey the three axioms that probabilities must adhere to. In the example above, the computed likelihood scores are greater than one, and, more generally, the likelihoods of competing hypotheses need not sum to one. Either would violate the second probability axiom.
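To see that likelihoods are not a probability distribution over hypotheses, one can evaluate the standard product-form likelihood of the observed record (one heads, 99 tails) across a grid of candidate heads probabilities. This is an illustrative sketch: the grid spacing of 0.01 and the variable names are assumptions made for the example.

```python
def likelihood(p_heads: float, heads: int, tails: int) -> float:
    """Product-form likelihood of a specific sequence of independent flips."""
    return p_heads**heads * (1 - p_heads)**tails

heads, tails = 1, 99
grid = [i / 100 for i in range(1, 100)]   # candidate heads probabilities 0.01 .. 0.99
values = [likelihood(p, heads, tails) for p in grid]

best_p = grid[values.index(max(values))]
print(best_p)       # 0.01, the observed fraction of heads
print(sum(values))  # nowhere near 1
```

The likelihoods rank the candidate hypotheses (the best-supported one matches the observed heads fraction), but nothing forces them to add up to one, and the total changes if the grid of hypotheses changes.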

In the end, both measures attach numerical quantities to the abstract concept of possible outcomes from some process. Likelihoods attach these numbers to competing hypotheses of what can best explain observed outcomes. Probabilities attach fractional weights to each of all possible future outcomes of a given process. In both cases, rather than being the inquisitorial instruments of definitional Torquemadas, the logical consistency and generalizability of the concepts of likelihood and probability are what make each, in its own right, scientifically valid and immensely useful mathematical tools.