Anscombe’s quartet is a group of four data sets that are nearly identical in simple descriptive statistics, yet each contains peculiarities that fool a regression model. Plot them and the story changes: the four data sets have very different distributions and look completely different from one another on scatter plots.

What Is Anscombe’s Quartet?

Anscombe’s quartet was constructed in 1973 by statistician Francis Anscombe to illustrate the importance of plotting data before you analyze it and build your model. The four data sets share nearly identical summary statistics, including the mean and variance of x and y, the correlation between x and y, and the fitted linear regression line. However, when you plot these data sets, they look very different from one another.


What Is the Purpose of Anscombe’s Quartet in Data Visualization?

Anscombe’s quartet demonstrates the importance of visualizing data before applying algorithms to build models. Plotting the features reveals the distribution of the samples and helps you identify anomalies in the data, such as outliers, the diversity of the data and whether the data is linearly separable. It is also a reminder that linear regression is only a good fit for data with a linear relationship and handles other kinds of data sets poorly.

The four data sets are defined as follows:

[Image: the x and y values of the four data sets]

The statistical properties of these four data sets are approximately the same, and we can compute them as follows:

[Image: summary statistics shared by the four data sets]
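
To check this yourself, here is a minimal sketch that computes these summary statistics with pandas, using the copy of Anscombe’s quartet that ships with seaborn (a data frame with dataset, x and y columns):

```python
import seaborn as sns

# Load the built-in copy of Anscombe's quartet (columns: dataset, x, y).
df = sns.load_dataset("anscombe")

# Mean and variance of x and y, plus the x-y correlation, per data set.
for name, group in df.groupby("dataset"):
    print(
        f"dataset {name}: "
        f"x mean={group['x'].mean():.2f}, x var={group['x'].var():.2f}, "
        f"y mean={group['y'].mean():.2f}, y var={group['y'].var():.2f}, "
        f"corr={group['x'].corr(group['y']):.3f}"
    )
```

All four data sets print essentially the same numbers, which is exactly what makes the quartet so deceptive.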

However, when these data sets are drawn on scatter plots, each one produces a very different picture, even though a simple linear regression fits them all with nearly the same line, as you can see below:

[Image: scatter plots of the four data sets]
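
Here is one way to draw those scatter plots, a short sketch using seaborn; any plotting library, such as Matplotlib or Plotly, would work equally well:

```python
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("anscombe")

# One scatter panel per data set, sharing the same axes for easy comparison.
sns.relplot(data=df, x="x", y="y", col="dataset", col_wrap=2, height=3)
plt.show()
```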

We can describe the four data sets as:


  • Data Set 1: fits the linear regression model fairly well.
  • Data Set 2: cannot be fit well by a linear regression model because the relationship between x and y is non-linear (the points follow a curve).
  • Data Set 3: has a linear relationship, but a single outlier pulls the fitted regression line away from the rest of the points.
  • Data Set 4: has x fixed at a single value except for one high-leverage outlier, and that one point alone creates the apparent linear relationship.
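
To make this concrete, here is a small sketch that fits an ordinary least-squares line to each data set; it uses numpy’s polyfit, which is just one of many ways to fit a line:

```python
import numpy as np
import seaborn as sns

df = sns.load_dataset("anscombe")

# Fit an ordinary least-squares line y = intercept + slope * x per data set.
for name, group in df.groupby("dataset"):
    slope, intercept = np.polyfit(group["x"], group["y"], deg=1)
    print(f"dataset {name}: y = {intercept:.2f} + {slope:.2f}x")
```

Every data set yields roughly the same fitted line, about y = 3.00 + 0.50x, even though only the first one is genuinely well described by a straight line.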

As you can see, Anscombe’s quartet illustrates the importance of data visualization and shows how easy it is to fool a regression algorithm. So, before interpreting or modeling the data, or implementing any machine learning algorithm, we first need to visualize the data set to help build a well-fit model.
