Interpreting Anscombe’s quartet.

Yazmin T. Montana
Mar 26, 2023

Anscombe’s quartet is a set of four datasets created by the statistician Francis Anscombe in 1973 to demonstrate the importance of visualizing data. The four datasets have nearly identical means, variances, correlations, and regression lines, yet they look completely different from each other when plotted.

The first dataset is a simple linear relationship with some random noise in the y-values. The second dataset follows a smooth, nonlinear curve that looks quadratic. The third dataset has an almost perfect linear relationship, but one outlier pulls the regression line and lowers the correlation coefficient. Finally, in the fourth dataset every point but one shares the same x-value, so the points form a vertical line, and a single outlier with a much larger x-value determines the regression line on its own.
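To make the effect of that single outlier concrete, here is a minimal sketch that computes the correlation of the third dataset with and without the outlying point (13.0, 12.74). The values are the published ones listed in the table further down; the helper name `pearson` is my own and not part of any library.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient, written out by hand."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Third dataset of Anscombe's quartet (x, y).
x3 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

# Correlation with every point included.
r_all = pearson(x3, y3)

# Correlation after dropping the one outlying point.
kept = [(a, b) for a, b in zip(x3, y3) if (a, b) != (13.0, 12.74)]
r_trimmed = pearson([a for a, _ in kept], [b for _, b in kept])

print(f"with outlier: {r_all:.3f}, without outlier: {r_trimmed:.3f}")
```

With the outlier removed, the remaining ten points sit almost exactly on a straight line, so the correlation jumps to nearly 1.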

Here are the original values of Anscombe’s quartet, with 11 (x, y) pairs per dataset:


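| I: x | I: y | II: x | II: y | III: x | III: y | IV: x | IV: y |
|------|-------|-------|-------|--------|--------|-------|-------|
| 10.0 | 8.04  | 10.0  | 9.14  | 10.0   | 7.46   | 8.0   | 6.58  |
| 8.0  | 6.95  | 8.0   | 8.14  | 8.0    | 6.77   | 8.0   | 5.76  |
| 13.0 | 7.58  | 13.0  | 8.74  | 13.0   | 12.74  | 8.0   | 7.71  |
| 9.0  | 8.81  | 9.0   | 8.77  | 9.0    | 7.11   | 8.0   | 8.84  |
| 11.0 | 8.33  | 11.0  | 9.26  | 11.0   | 7.81   | 8.0   | 8.47  |
| 14.0 | 9.96  | 14.0  | 8.10  | 14.0   | 8.84   | 8.0   | 7.04  |
| 6.0  | 7.24  | 6.0   | 6.13  | 6.0    | 6.08   | 8.0   | 5.25  |
| 4.0  | 4.26  | 4.0   | 3.10  | 4.0    | 5.39   | 19.0  | 12.50 |
| 12.0 | 10.84 | 12.0  | 9.13  | 12.0   | 8.15   | 8.0   | 5.56  |
| 7.0  | 4.82  | 7.0   | 7.26  | 7.0    | 6.42   | 8.0   | 7.91  |
| 5.0  | 5.68  | 5.0   | 4.74  | 5.0    | 5.73   | 8.0   | 6.89  |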

Each pair of columns (x, y) is one dataset. All four share nearly the same descriptive statistics, but their shapes and patterns differ.
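The claim about matching statistics is easy to check. The following is a minimal sketch using only Python's standard library; the function name `summarize` and the dictionary layout are my own choices for illustration.

```python
from statistics import mean, variance

# Anscombe's original values. Datasets I to III share the same x column.
x123 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Return means, sample variances, correlation, and the least-squares line."""
    mx, my = mean(x), mean(y)
    vx, vy = variance(x), variance(y)
    # Sample covariance of x and y.
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    slope = sxy / vx
    return {
        "mean_x": mx, "mean_y": my, "var_x": vx, "var_y": vy,
        "corr": sxy / (vx * vy) ** 0.5,
        "slope": slope, "intercept": my - slope * mx,
    }

summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```

The printed rows should agree to roughly two decimal places, with a fitted line close to y = 3 + 0.5x for every dataset, even though the scatter plots look nothing alike.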

I created an Anscombe-style quartet plot in Python with the following code, which generates synthetic data rather than using Anscombe’s original values:

import numpy as np
import matplotlib.pyplot as plt

# Generate random normally distributed data for Anscombe's Quartet
np.random.seed(42)
x = np.random.normal(10, 2, 100)
y1 = 0.5 * x + np.random.normal(3, 1, 100)
y2 = 0.2 * x**2 + np.random.normal(3, 1, 100)
y3 = 0.5 * x + 5
y4 = np.ones(100) * 7.5

# Create a figure with four subplots
fig, axs = plt.subplots(2, 2, figsize=(10, 8))

# Plot the first dataset
axs[0, 0].scatter(x, y1)
axs[0, 0].set_title("Dataset I")

# Plot the second dataset
axs[0, 1].scatter(x, y2)
axs[0, 1].set_title("Dataset II")

# Plot the third dataset
axs[1, 0].scatter(x, y3)
axs[1, 0].set_title("Dataset III")

# Plot the fourth dataset
axs[1, 1].scatter(x, y4)
axs[1, 1].set_title("Dataset IV")

# Add x and y labels to all subplots
for ax in axs.flat:
    ax.set(xlabel='X', ylabel='Y')

# Add a main title to the plot
fig.suptitle("Anscombe's Quartet")

# Display the plot
plt.show()

This is what the plot looks like:

Notice how:

Dataset I shows a noisy linear trend. Dataset II curves upward because y depends on x². Dataset III falls exactly on a straight line, since no noise was added. Dataset IV is a flat horizontal line at y = 7.5.

Here’s a description of the data:

Dataset I: This dataset consists of 100 (x, y) pairs where x values are randomly generated from a normal distribution with a mean of 10 and standard deviation of 2, and y values are generated from a linear equation y = 0.5x + ε, where ε is random noise generated from a normal distribution with a mean of 3 and standard deviation of 1.
Dataset II: This dataset also consists of 100 (x, y) pairs where x values are generated from the same normal distribution as in Dataset I, and y values are generated from a quadratic equation y = 0.2x² + ε, where ε is random noise generated from a normal distribution with a mean of 3 and standard deviation of 1.
Dataset III: This dataset also consists of 100 (x, y) pairs where x values are the same as in Datasets I and II, and y values are generated from a linear equation y = 0.5x + 5, without any additional noise.
Dataset IV: This dataset consists of 100 (x, y) pairs where x values are again the same as in the other datasets, and every y value is the constant 7.5, without any additional noise.
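It is worth checking what the summary statistics of these generated datasets actually are. Here is a minimal sketch that rebuilds the data with the same seed and recipe as the plotting code and prints a few statistics; the helper `describe` and the dictionary names are my own, and reporting NaN for the correlation of a constant series is a choice I made for this sketch.

```python
import numpy as np

# Same seed and recipe as the plotting code, drawn in the same order.
np.random.seed(42)
x = np.random.normal(10, 2, 100)
datasets = {
    "I": 0.5 * x + np.random.normal(3, 1, 100),
    "II": 0.2 * x**2 + np.random.normal(3, 1, 100),
    "III": 0.5 * x + 5,
    "IV": np.ones(100) * 7.5,
}

def describe(x, y):
    """Mean, sample variance, and correlation with x for one dataset."""
    # Correlation is undefined when y has zero variance, so report NaN.
    corr = np.corrcoef(x, y)[0, 1] if np.std(y) > 0 else float("nan")
    return {
        "mean_y": float(np.mean(y)),
        "var_y": float(np.var(y, ddof=1)),
        "corr": float(corr),
    }

generated_stats = {name: describe(x, y) for name, y in datasets.items()}
for name, s in generated_stats.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```

Unlike Anscombe's original values, these four datasets do not share their statistics: Dataset III correlates perfectly with x, Dataset IV has zero variance, and Dataset II has a much larger mean than the rest.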

The randomly generated datasets loosely echo the kinds of shapes the quartet is known for, but unlike Anscombe’s original data they do not share the same summary statistics: Dataset II’s y-values average far higher than the constant 7.5 of Dataset IV, and Dataset III is perfectly correlated with x. Anscombe’s original four datasets, by contrast, have nearly identical descriptive statistics (mean, variance, correlation, and linear regression line) but look very different when plotted.

Looking only at descriptive statistics, you could conclude the datasets are very similar, but that is not the case.

After plotting the data you can see the differences:

The datasets have different distributions.
The graphs look nothing alike.

The main point of Anscombe’s quartet is to show that summary statistics like mean, variance, and correlation can be the same across different datasets, but the visual patterns and relationships can be very different. This highlights the importance of visualizing data and understanding the underlying patterns, rather than relying solely on summary statistics.

Anscombe’s quartet has since become a classic example used in statistics courses to illustrate the importance of data visualization, and has inspired further research into the use of data visualization in statistics and data science.