I first came across Cohen’s kappa on Kaggle during the Data Science Bowl competition. I didn’t compete, and the metric there was actually the quadratic weighted kappa, but I forked a kernel to play around with it and understand how it works. I then revisited the topic for the University of Liverpool - Ion Switching competition. Let’s break down Cohen’s kappa.

What Is Cohen’s Kappa?

Cohen’s kappa is a quantitative measure of reliability for two raters that are rating the same thing, correcting for how often the raters may agree by chance.

 

Validity and Reliability Defined

To better understand that definition, I’d like to lay down an important foundation about validity and reliability. When we talk about validity, we are concerned with the degree to which a test measures what it claims to measure. In other words, how accurate the test is. Reliability is concerned with the degree to which a test produces similar results under consistent conditions. Or, to put it another way, the precision of a test.

What Are Validity and Reliability?

  • Validity: This represents the degree to which a test measures what it claims to measure, or in other words, the accuracy of a test. 
  • Reliability: This represents the degree to which a test is able to reproduce similar results. In other words, the precision of a test.  

I will put these two measurements into perspective so that we know how to distinguish them from one another.

The problem that the University of Liverpool had, and the reason it hosted the Ion Switching competition, was that the existing methods of detecting when an ion channel is open are slow and laborious. It wanted data scientists to employ machine learning techniques to enable rapid automatic detection of an ion channel’s current events in raw data. Validity would measure whether the scores obtained represented whether the ion channels were open, and reliability would measure whether the scores obtained were consistent at identifying whether a channel is open or closed.

Dart boards with different score patterns.
Figure 1: The dart board example of reliability and validity. | Image: Trochim, William M. The Research Methods Knowledge Base, 2nd Edition.

For the results of an experiment to be useful, the observers of the test would have to agree on its interpretation. Otherwise, the observer’s subjective interpretation can come into play. Therefore, good reliability is important. Reliability itself can be broken down into two types: intra-rater reliability and inter-rater reliability.

Intra-rater reliability is related to the degree of agreement between different measurements made by the same person.

A drawing of a person rating the same situation twice.
Figure 2: A sketch representing intra-rater reliability. | Image: Kurtis Pykes

Inter-rater reliability is related to the degree of agreement between two or more raters.

A drawing of two people rating the same situation.
Figure 3: A sketch representing inter-rater reliability. | Image: Kurtis Pykes

Recall that earlier we said that Cohen’s kappa is used to measure the reliability for two raters rating the same thing, while correcting for how often the raters may agree by chance.

 

Video explaining the basics of the Cohen’s kappa coefficient. | Video: Christian Hollman


 

How to Calculate Cohen’s Kappa

The value of kappa is at most one, and it can even be negative. A score of one means complete agreement between the raters, a score of zero means the raters agree no more often than they would by chance, and a score below zero means there is even less agreement than random chance would produce. I will show you the formula to work this out, but it is important that you acquaint yourself with “Figure 4” first to better understand it.

A 4 quadrant grid used to interpret rater results.
Figure 4 is an N x N grid used to interpret results of raters. | Image: Kurtis Pykes

The reason I highlighted two of the cells will become clear in a moment, but for now, let me break down each cell (a short code sketch follows the list).

  • A: The total number of instances that both raters said were correct. The raters are in agreement.
  • B: The total number of instances that Rater 1 said were correct but Rater 2 said were incorrect. This is a disagreement.
  • C: The total number of instances that Rater 1 said were incorrect but Rater 2 said were correct. This is also a disagreement.
  • D: The total number of instances that both raters said were incorrect. Raters are in agreement.
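
To make the grid concrete, here is a minimal Python sketch that counts the four cells from two raters’ judgments. The two label lists are made up purely for illustration, and the cell definitions follow the bullet points above.

```python
# Hypothetical judgments from two raters on the same 10 instances.
rater_1 = ["correct", "correct", "incorrect", "correct", "incorrect",
           "correct", "correct", "incorrect", "correct", "incorrect"]
rater_2 = ["correct", "incorrect", "incorrect", "correct", "correct",
           "correct", "correct", "incorrect", "incorrect", "incorrect"]

pairs = list(zip(rater_1, rater_2))

# Count the four cells of the grid in Figure 4.
A = sum(r1 == "correct" and r2 == "correct" for r1, r2 in pairs)      # both say correct
B = sum(r1 == "correct" and r2 == "incorrect" for r1, r2 in pairs)    # Rater 1 correct, Rater 2 incorrect
C = sum(r1 == "incorrect" and r2 == "correct" for r1, r2 in pairs)    # Rater 1 incorrect, Rater 2 correct
D = sum(r1 == "incorrect" and r2 == "incorrect" for r1, r2 in pairs)  # both say incorrect

print(A, B, C, D)  # prints: 4 2 1 3 for the lists above
```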

In order to work out the kappa value, we first need to know the probability of agreement, hence why I highlighted the agreement diagonal. This formula is derived by adding up the number of tests on which the raters agree and then dividing by the total number of tests. Using the example from “Figure 4,” that would mean: (A + D) / (A + B + C + D).

The equation for the probability of agreement.
Figure 5 is the equation for the probability of agreement. | Image: Kurtis Pykes
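
For reference, here is the same agreement probability written out in LaTeX, using the cell labels from the grid above:

```latex
p_o = \frac{A + D}{A + B + C + D}
```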


The next step is to work out the probability of random agreement. Using “Figure 4” as a guide, it is the total number of times that Rater 1 said “correct” divided by the total number of instances, multiplied by the total number of times that Rater 2 said “correct” divided by the total number of instances, added to the total number of times that Rater 1 said “incorrect” multiplied by the total number of times that Rater 2 said “incorrect,” each of those again divided by the total number of instances. That’s a lot of information to take in, so I have formulated the equation using the previous grid.

The formula to derive probability of random agreement.
Figure 6 is the formula to derive probability of random agreement. | Image: Kurtis Pykes
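
In the same notation, with N = A + B + C + D the total number of instances (Rater 1 says “correct” on A + B of them, Rater 2 on A + C, and so on), the random-agreement probability described above is:

```latex
p_e = \frac{A + B}{N} \cdot \frac{A + C}{N} + \frac{C + D}{N} \cdot \frac{B + D}{N},
\qquad N = A + B + C + D
```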

The formula for Cohen’s kappa is the probability of agreement minus the probability of random agreement, divided by one minus the probability of random agreement.

Cohen’s kappa coefficient formula.
Figure 7 is Cohen’s kappa coefficient formula. | Image: Kurtis Pykes
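
In symbols, with p_o the probability of agreement and p_e the probability of random agreement:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```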

You’re now able to distinguish between reliability and validity, explain Cohen’s kappa and evaluate it. This statistic is very useful, and now that I understand how it works, I believe it may be under-utilized when optimizing algorithms against a specific metric. Cohen’s kappa also does a good job of handling both multi-class and imbalanced class problems.
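
To close the loop, here is a short end-to-end sketch, reusing the same hypothetical rater labels from earlier, that computes kappa directly from the formulas above and cross-checks the result against scikit-learn’s cohen_kappa_score.

```python
# A minimal sketch tying the formulas together. The two rater label lists are
# hypothetical; scikit-learn is used only to cross-check the hand calculation.
from sklearn.metrics import cohen_kappa_score

rater_1 = ["correct", "correct", "incorrect", "correct", "incorrect",
           "correct", "correct", "incorrect", "correct", "incorrect"]
rater_2 = ["correct", "incorrect", "incorrect", "correct", "correct",
           "correct", "correct", "incorrect", "incorrect", "incorrect"]

n = len(rater_1)
pairs = list(zip(rater_1, rater_2))

# Observed agreement: fraction of instances on which the raters give the same label.
p_o = sum(r1 == r2 for r1, r2 in pairs) / n

# Random (chance) agreement: product of the raters' marginal rates, summed over labels.
p_e = sum(
    (sum(r1 == label for r1, _ in pairs) / n) *
    (sum(r2 == label for _, r2 in pairs) / n)
    for label in ("correct", "incorrect")
)

kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 3))                                # 0.4
print(round(cohen_kappa_score(rater_1, rater_2), 3))  # 0.4, matches the manual value

# For ordered labels, scikit-learn also supports the linear and quadratic weighted
# kappa mentioned at the start via the weights argument, e.g. weights="quadratic".
```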
