I first came across Cohen’s kappa on Kaggle during the Data Science Bowl competition. While I didn’t compete, and the metric was the quadratic weighted kappa, I forked a kernel to play around with it and understand how it works. I then revisited the topic for the University of Liverpool — Ion Switching competition. Let’s break down Cohen’s kappa.
What Is Cohen’s Kappa?
Cohen’s kappa is a quantitative measure of reliability for two raters who are rating the same thing, corrected for how often the raters may agree by chance.
Validity and Reliability Defined
To better understand that definition, I’d like to lay down an important foundation about validity and reliability. When we talk of validity, we are concerned with the degree to which a test measures what it claims to measure. In other words, how accurate the test is. Reliability is concerned more with the degree to which a test produces similar results under consistent conditions. Or to put it another way, the precision of a test.
What Are Validity and Reliability?
- Validity: This represents the degree to which a test measures what it claims to measure, or in other words, the accuracy of a test.
- Reliability: This represents the degree to which a test is able to reproduce similar results. In other words, the precision of a test.
I will put these two measurements into perspective so that we know how to distinguish them from one another.
The problem that the University of Liverpool had, and the reason it hosted the Ion Switching competition, was that the existing methods of detecting when an ion channel is open are slow and laborious. It wanted data scientists to employ machine learning techniques to enable rapid automatic detection of an ion channel’s current events in raw data. Validity would measure whether the scores obtained actually reflected an open ion channel, and reliability would measure whether those scores were consistent at identifying whether a channel is open or closed.
For the results of an experiment to be useful, the observers of the test would have to agree on its interpretation. Otherwise, subjective interpretation by the observer can come into play. Therefore, good reliability is important. However, reliability can be broken down into two types: intra-rater reliability and inter-rater reliability.
Intra-rater reliability is related to the degree of agreement between different measurements made by the same person.
Inter-rater reliability is related to the degree of agreement between two or more raters.
Recall that earlier we said that Cohen’s kappa is used to measure the reliability for two raters rating the same thing, while correcting for how often the raters may agree by chance.
How to Calculate Cohen’s Kappa
The value for kappa can be less than zero, or in other words, negative. A score of zero means that the agreement between the raters is no better than random chance, whereas a score of one means that there is complete agreement between the raters. Therefore, a score of less than zero means that there is less agreement than random chance. I will show you the formula to work this out, but it is important that you acquaint yourself with “Figure 4” to have a better understanding.
The reason I highlighted two of the cells will become clear in a moment, but for now, let me break down each cell of the grid; a short code sketch after the list shows how to build the same grid from raw ratings.
- A: The total number of instances that both raters said were correct. The raters are in agreement.
- B: The total number of instances that Rater 2 said were incorrect, but Rater 1 said were correct. This is a disagreement.
- C: The total number of instances that Rater 1 said were incorrect, but Rater 2 said were correct. This is also a disagreement.
- D: The total number of instances that both raters said were incorrect. The raters are in agreement.
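To make the grid concrete, here is a minimal Python sketch that builds the same kind of 2×2 table from two raters’ labels. The ratings and counts below are made up purely for illustration; they are not the values from “Figure 4.”

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical ratings from two raters over 50 instances (illustrative only).
rater1 = np.array(["correct"] * 25 + ["incorrect"] * 25)
rater2 = np.array(["correct"] * 20 + ["incorrect"] * 5 +
                  ["correct"] * 10 + ["incorrect"] * 15)

# Rows are Rater 1, columns are Rater 2, in the order ["correct", "incorrect"],
# so the cells line up with A, B, C, D from the list above.
grid = confusion_matrix(rater1, rater2, labels=["correct", "incorrect"])
A, B = grid[0]
C, D = grid[1]
print(grid)        # [[20  5]
                   #  [10 15]]
print(A, B, C, D)  # 20 5 10 15
```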
In order to work out the kappa value, we first need to know the probability of agreement, which is why I highlighted the agreement diagonal. This formula is derived by adding the number of tests in which the raters agree and then dividing it by the total number of tests. Using the example from “Figure 4,” that would mean: (A + D) / (A + B + C + D).
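Plugging in the same made-up counts as in the sketch above (A = 20, B = 5, C = 10, D = 15), the observed agreement works out like this:

```python
# Observed agreement p_o: the agreement diagonal over the total (hypothetical counts).
A, B, C, D = 20, 5, 10, 15
total = A + B + C + D    # 50 instances in this made-up example
p_o = (A + D) / total    # (20 + 15) / 50
print(p_o)               # 0.7
```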
The next step is to work out the probability of random agreement. Using “Figure 4” as a guide, the expected value is the total number of times that “Rater 1” said “correct” divided by the total number of instances, multiplied by the total number of times that “Rater 2” said “correct” divided by the total number of instances, added to the total number of times that “Rater 1” said “incorrect” divided by the total number of instances, multiplied by the total number of times that “Rater 2” said “incorrect” divided by the total number of instances. That’s a lot of information to take in, so in terms of the previous grid it works out to: ((A + B) / N) × ((A + C) / N) + ((C + D) / N) × ((B + D) / N), where N = A + B + C + D.
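Continuing with the same made-up counts, the random-agreement term can be sketched as:

```python
# Probability of random (chance) agreement p_e, using the same hypothetical counts.
A, B, C, D = 20, 5, 10, 15
total = A + B + C + D                                # 50
p_correct = ((A + B) / total) * ((A + C) / total)    # both raters say "correct" by chance
p_incorrect = ((C + D) / total) * ((B + D) / total)  # both raters say "incorrect" by chance
p_e = p_correct + p_incorrect
print(p_e)                                           # 0.5
```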
The formula for Cohen’s kappa is the probability of agreement minus the probability of random agreement, divided by one minus the probability of random agreement: kappa = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the probability of random agreement.
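Putting the two pieces together gives kappa. As a sanity check, scikit-learn’s cohen_kappa_score should land on the same value when given the raw (hypothetical) rating arrays directly:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Same hypothetical p_o and p_e as computed above.
p_o, p_e = 0.7, 0.5
kappa = (p_o - p_e) / (1 - p_e)
print(kappa)  # ≈ 0.4

# Cross-check against scikit-learn using the raw (made-up) rating arrays.
rater1 = np.array(["correct"] * 25 + ["incorrect"] * 25)
rater2 = np.array(["correct"] * 20 + ["incorrect"] * 5 +
                  ["correct"] * 10 + ["incorrect"] * 15)
print(cohen_kappa_score(rater1, rater2))  # ≈ 0.4, up to floating-point rounding
```

With these made-up numbers the raters agree 70% of the time, but because half of that agreement would be expected by chance alone, kappa comes out at only 0.4.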
You’re now able to distinguish between reliability and validity, explain Cohen’s kappa, and evaluate it. This statistic is very useful. Now that I understand how it works, I believe it may be under-utilized when optimizing algorithms for a specific metric. Additionally, Cohen’s kappa does a good job on both multi-class and imbalanced-class problems.