Violin plots are a method of plotting numeric data and can be considered a combination of a boxplot and a kernel density plot. In a violin plot, we can find the same information as in a boxplot:

  • Median: A white dot on the violin plot.
  • Interquartile range: The black bar in the center of a violin plot.
  • The lower/upper adjacent values: The black lines stretched from the bar. They end at the most extreme observations that still lie within first quartile − 1.5 IQR and third quartile + 1.5 IQR, respectively. These two limits are used in a simple outlier detection technique, Tukey’s fences. Observations lying outside of these “fences” can be considered outliers.
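To make the quantities in the list above concrete, here is a minimal sketch that computes them with NumPy. The data values are hypothetical and chosen only for illustration, and the quartiles use NumPy's default linear interpolation, which may differ slightly from what a given plotting library uses.

```python
import numpy as np

# Small illustrative sample (hypothetical data, not from the article).
data = np.array([2.1, 2.4, 2.5, 2.7, 3.0, 3.1, 3.3, 3.6, 4.0, 9.5])

# Quartiles and interquartile range.
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Tukey's fences: 1.5 IQR below the first and above the third quartile.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Adjacent values: the most extreme observations still inside the fences.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
lower_adjacent, upper_adjacent = inside.min(), inside.max()

# Observations outside the fences are flagged as potential outliers.
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"Q1={q1:.3f}, median={median:.3f}, Q3={q3:.3f}, IQR={iqr:.3f}")
print(f"fences: [{lower_fence:.3f}, {upper_fence:.3f}]")
print(f"adjacent values: {lower_adjacent}, {upper_adjacent}")
print(f"potential outliers: {outliers}")
```

In this toy sample only the largest value falls outside the fences, so it is the only observation flagged.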

The unquestionable advantage of the violin plot over the boxplot is that it shows the entire distribution of the data. This is of interest, especially when dealing with multimodal data, i.e., a distribution with more than one peak.

What Is a Violin Plot?

A violin plot depicts numerical data by combining both boxplots and kernel density plots. It shows the entire distribution of data, which can be useful when working with multimodal data. 

 

How to Implement a Violin Plot in Python

In this article, we’ll use the following libraries:

seaborn    0.9.0
numpy      1.17.2
pandas     0.25.1
matplotlib 3.1.1

We start by defining the number of random observations we’ll draw from certain distributions, as well as setting the seed for reproducibility of the results.

import numpy as np
N = 10 ** 4  # number of observations per sample
np.random.seed(42)  # fix the seed for reproducibility

Then, we define a function plotting the following:

  • A histogram with a kernel density estimate (KDE).
  • A boxplot.
  • A violin plot.

We will use this function for inspecting the randomly created samples.

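import matplotlib.pyplot as plt
import seaborn as sns

def plot_comparison(x, title):
    # Three stacked panels sharing the same x-axis
    fig, ax = plt.subplots(3, 1, sharex=True)
    # distplot matches seaborn 0.9.0; it was deprecated in seaborn 0.11,
    # where sns.histplot(x, kde=True, ax=ax[0]) is the replacement
    sns.distplot(x, ax=ax[0])
    ax[0].set_title('Histogram + KDE')
    # Passing x by keyword keeps the plots horizontal
    sns.boxplot(x=x, ax=ax[1])
    ax[1].set_title('Boxplot')
    sns.violinplot(x=x, ax=ax[2])
    ax[2].set_title('Violin plot')
    fig.suptitle(title, fontsize=16)
    plt.show()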


 

Standard Normal Distribution

We start with the most basic distribution, standard normal. First we’ll draw 10,000 numbers at random and plot the results.

sample_gaussian = np.random.normal(size=N)
plot_comparison(sample_gaussian, 'Standard Normal Distribution')
A standard normal distribution displayed in a histogram with kernel density estimation, a boxplot and a violin plot. | Image: Eryk Lewinson

Some of the observations we can make:

  • In the histogram, we see the symmetric shape of the distribution.
  • We can see the previously mentioned metrics (median, IQR, Tukey’s fences) in both the boxplot and the violin plot.
  • The kernel density plot used for creating the violin plot is the same as the one added on top of the histogram. Wider sections of the violin plot represent a higher probability of observations taking a given value; the thinner sections correspond to a lower probability.

I believe that showing these three plots together provides a good idea of what a violin plot is and what information it contains.
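To see where the outline of the violin comes from, here is a minimal sketch of a Gaussian kernel density estimate written with NumPy only. The helper name `gaussian_kde_sketch` is hypothetical, and the bandwidth follows Scott's rule of thumb, which is an assumption of this sketch; the smoothing a plotting library applies may differ by version and settings.

```python
import numpy as np

def gaussian_kde_sketch(sample, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of `sample` on `grid`."""
    sample = np.asarray(sample, dtype=float)
    grid = np.asarray(grid, dtype=float)
    if bandwidth is None:
        # Scott's rule of thumb for 1-D data (an assumption of this sketch).
        bandwidth = sample.std(ddof=1) * sample.size ** (-1 / 5)
    # Standardized distance from every grid point to every observation.
    z = (grid[:, None] - sample[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Average the kernels and rescale so the curve integrates to one.
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(42)
sample = rng.normal(size=1_000)
grid = np.linspace(-5, 5, 201)
density = gaussian_kde_sketch(sample, grid)

# The half-width of the violin at each grid point is proportional to the
# estimated density there; here it is scaled so the widest point equals 1.
half_width = density / density.max()
```

Mirroring `half_width` around a central axis gives the violin shape, which is why wider sections correspond to values with a higher estimated density.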

 

Log-Normal Distribution

In the second example, we consider the log-normal distribution, which, unlike the symmetric normal distribution, is right-skewed.

sample_lognormal = np.random.lognormal(size=N)
plot_comparison(sample_lognormal, 'Log-normal Distribution')
A Log-Normal distribution shown in a histogram and kernel density estimation plot, a boxplot and a violin plot. | Image: Eryk Lewinson

 

Mixture of Gaussians: Bimodal

In the previous two examples, we’ve already seen that the violin plots contain more information than the box plot. This is even more apparent when we consider a multimodal distribution. In this example, we create a bimodal distribution as a mixture of two Gaussian distributions.
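A minimal sketch of how such a mixture could be drawn is shown below. The means and standard deviations of the two components are illustrative assumptions, not necessarily the values behind the figure in this article.

```python
import numpy as np

N = 10 ** 4
np.random.seed(42)

# Assumed mixture components: the locations and scales are illustrative.
# Half of the observations come from each Gaussian.
sample_bimodal = np.concatenate([
    np.random.normal(loc=-2, scale=1, size=N // 2),
    np.random.normal(loc=3, scale=1, size=N // 2),
])

# With plot_comparison defined as earlier in the article, one could call:
# plot_comparison(sample_bimodal, 'Mixture of Gaussians: Bimodal')
```

Concatenating equally sized draws gives each component a weight of one half; unequal sizes would produce peaks of different heights.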

With the boxplot alone, that is, without looking at a histogram or density plot, it would be impossible to spot the two peaks in our data.

A bimodal distribution represented in a histogram and kernel density estimation plot, a boxplot and a violin plot. | Image: Eryk Lewinson
A guide on the basics of violin plots. | Video: GraphPad Software


 

Violin Plots Advanced Usage

Violin plots are often used to compare the distribution of a given variable across some categories. We present a few of the possibilities below. To do so, we load the tips dataset from seaborn.

tips = sns.load_dataset("tips")

In the first example, we look at the distribution of the tips per gender. Additionally, we set inner='quartile' so that the violin plot displays the quartiles only. Other options include inner='point' for showing all the observations and inner='box' for drawing a small box plot inside the violin plot.

ax = sns.violinplot(x="sex", y="tip", inner='quartile', data=tips)
ax.set_title('Distribution of tips', fontsize=16);
A distribution of tips for men and women in a violin plot. | Image: Eryk Lewinson

We see that the overall shape and distribution of the tips are similar for both genders (the quartiles are very close to each other), but there are more outliers in the case of males.

In the second example, we investigate the distribution of the total bill amount per day. Additionally, we’ll split by gender. Immediately we see that the largest difference in the shape of the distribution between genders happens on Fridays.

ax = sns.violinplot(x="day", y="total_bill", hue="sex", data=tips)
ax.set_title('Distribution of total bill amount per day', fontsize=16);
A series of violin plots showing the distribution of total bill amount per day by gender. | Image: Eryk Lewinson

In the last example, we investigate the same thing as in the previous case, but we set split=True. By doing so, instead of eight violins, we end up with four. Each side of a violin corresponds to a different gender.

ax = sns.violinplot(x="day", y="total_bill", hue="sex", split=True, data=tips)
ax.set_title('Distribution of total bill amount per day', fontsize=16);
A violin plot distribution based on gender for a distribution of total bill amounts per day. | Image: Eryk Lewinson

In this article, I showed what violin plots are, how to interpret them and what their advantages are over boxplots. One last remark worth making is that a boxplot does not change as long as the quartiles stay the same. We can modify the data in a way that the quartiles do not change, but the shape of the distribution differs dramatically.
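The last remark can be illustrated with a minimal sketch. The two samples below are hypothetical and constructed by hand so that their minimum, quartiles and maximum coincide under NumPy's default percentile interpolation, even though the values are distributed very differently.

```python
import numpy as np

# Two small hypothetical samples, constructed by hand for illustration.
spread_out = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
clustered = np.array([1, 3, 3, 3, 5, 7, 7, 7, 9])

# Minimum, first quartile, median, third quartile, maximum.
quantiles = [0, 25, 50, 75, 100]
summary_spread = np.percentile(spread_out, quantiles)
summary_clustered = np.percentile(clustered, quantiles)

print("five-number summary (spread out):", summary_spread)
print("five-number summary (clustered): ", summary_clustered)

# Counting the observations per value (1 through 9) exposes the shapes:
# one sample is flat, the other piles up around two values.
print("counts (spread out):", np.bincount(spread_out)[1:])
print("counts (clustered): ", np.bincount(clustered)[1:])
```

A boxplot of either sample would look the same, while a violin plot or histogram would show the two clusters in the second one.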
