Simpson’s Paradox: A Simple Illustration

[This blog post is part of a series of explorations of commonly observed “mysteries” in data science. See also my entries on Stein’s paradox and Lord’s paradox.]

Background

Simpson’s paradox is one of the most counterintuitive phenomena in data analysis. It describes situations where a trend observed within groups disappears—or even reverses—when the data is aggregated. The underlying cause is often a confounding variable that distorts the overall trend. Let’s examine a concrete example.

An Example

Imagine two hospitals, $A$ and $B$, treating patients for a particular condition with two treatment options, $T1$ and $T2$. Hospital $A$, located in a higher-income neighborhood, primarily receives healthier patients, while Hospital $B$, in a lower-income neighborhood, tends to treat sicker patients. The effectiveness of the treatments is measured as improvement in a continuous health score.

We are interested in examining whether one of the treatment options leads to better health outcomes. Consider the following data gathered across both hospitals.

Hospital	Treatment	Mean Health Improvement	N (patients)
A	T1	20	90
A	T2	20	10
B	T1	10	10
B	T2	10	90

Health improvement by hospital and treatment type. Both treatments T1 and T2 are equally effective within each hospital.

Let’s now look at what happens when we combine the data from both hospitals.

Treatment	N	Health Improvement
T1	100	19 = 20 * 0.9 + 10 * 0.1
T2	100	11 = 20 * 0.1 + 10 * 0.9

Health improvement by treatment type. Treatment T1 appears more effective overall.
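The arithmetic in the table above can be checked with a few lines of Python. This is a minimal sketch rather than the code behind the post; the names `cells`, `pooled_mean`, and `within_gap` are made up for illustration, and the numbers are taken directly from the first table.

```python
# Cell summaries from the first table:
# (hospital, treatment) -> (mean health improvement, number of patients)
cells = {
    ("A", "T1"): (20, 90),
    ("A", "T2"): (20, 10),
    ("B", "T1"): (10, 10),
    ("B", "T2"): (10, 90),
}


def pooled_mean(treatment):
    """Patient-weighted mean improvement for one treatment across hospitals."""
    rows = [(m, n) for (h, t), (m, n) in cells.items() if t == treatment]
    total_n = sum(n for _, n in rows)
    return sum(m * n for m, n in rows) / total_n


# Aggregated view: one average per treatment, ignoring the hospital.
pooled = {t: pooled_mean(t) for t in ("T1", "T2")}

# Stratified view: T1 minus T2 inside each hospital.
within_gap = {h: cells[(h, "T1")][0] - cells[(h, "T2")][0] for h in ("A", "B")}

print(pooled)      # {'T1': 19.0, 'T2': 11.0}
print(within_gap)  # {'A': 0, 'B': 0}
```

The pooled averages differ only because the weights differ: 90% of the T1 patients come from Hospital A, while 90% of the T2 patients come from Hospital B.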

Within each hospital, the data shows that both treatments are equally effective. However, combining the data across both hospitals suggests that treatment $T1$ is far more effective overall, with an average improvement of 19 versus 11. Why does this happen?

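The confounding variable here is the underlying health status of patients. Hospital $A$ treats mostly healthier patients, while Hospital $B$ handles more severe cases, and the two hospitals favor different treatments: 90 of the 100 patients who received $T1$ were treated at Hospital $A$, while 90 of the 100 who received $T2$ were treated at Hospital $B$. The aggregate average for $T1$ is therefore dominated by the healthier patients of Hospital $A$, which inflates its overall average improvement relative to $T2$, even though both treatments perform identically within each hospital.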
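One common way to remove this kind of confounding is direct standardization: average each treatment's hospital-specific means using the same hospital mix for both treatments. The sketch below is an illustration under that assumption, not a method described in the post; `share`, `standardized_mean`, and `adjusted` are hypothetical names, and the reference mix is simply each hospital's share of all 200 patients.

```python
# Cell summaries from the first table:
# (hospital, treatment) -> (mean health improvement, number of patients)
cells = {
    ("A", "T1"): (20, 90),
    ("A", "T2"): (20, 10),
    ("B", "T1"): (10, 10),
    ("B", "T2"): (10, 90),
}
hospitals = ("A", "B")
treatments = ("T1", "T2")

# Reference hospital mix (an assumption): each hospital's share of all patients.
total = sum(n for _, n in cells.values())
share = {h: sum(cells[(h, t)][1] for t in treatments) / total for h in hospitals}


def standardized_mean(treatment):
    """Mean improvement if the treatment were given to the reference hospital mix."""
    return sum(share[h] * cells[(h, treatment)][0] for h in hospitals)


adjusted = {t: standardized_mean(t) for t in treatments}
print(share)     # {'A': 0.5, 'B': 0.5}
print(adjusted)  # {'T1': 15.0, 'T2': 15.0}
```

Once both treatments are evaluated on the same patient mix, the 8-point gap vanishes, which matches the within-hospital comparison.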

Visualizing the Paradox

To illustrate, imagine a scatter plot where each dot represents a patient. The color of the dot indicates the treatment they received, and the horizontal lines represent the average health improvement for each group. The vertical axis depicts the outcome variable (health improvement).

In the aggregated data, $T1$ shows a higher average improvement, creating the illusion of greater effectiveness. But when disaggregated by hospital, the averages for $T1$ and $T2$ are identical.
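The patient-level data behind such a plot can be simulated in a few lines. The sketch below is not the code from the repository mentioned in this post; it assumes normally distributed noise with a standard deviation of 2 (`NOISE_SD`, a value not given in the post) around the cell means from the first table, and it computes the group averages that the horizontal lines would show. All names are hypothetical.

```python
import random
from statistics import mean

random.seed(0)  # arbitrary seed, only to make the run reproducible

# (hospital, treatment) -> (true mean improvement, number of patients)
design = {
    ("A", "T1"): (20, 90),
    ("A", "T2"): (20, 10),
    ("B", "T1"): (10, 10),
    ("B", "T2"): (10, 90),
}
NOISE_SD = 2.0  # assumed patient-level spread; not taken from the post

# One (hospital, treatment, improvement) record per simulated patient.
patients = [
    (hospital, treatment, random.gauss(mu, NOISE_SD))
    for (hospital, treatment), (mu, n) in design.items()
    for _ in range(n)
]


def avg(treatment, hospital=None):
    """Mean improvement for a treatment, optionally within a single hospital."""
    return mean(
        y for h, t, y in patients
        if t == treatment and (hospital is None or h == hospital)
    )


pooled_gap = avg("T1") - avg("T2")
gap_by_hospital = {h: avg("T1", h) - avg("T2", h) for h in ("A", "B")}

print(f"pooled T1 - T2 gap: {pooled_gap:.2f}")
for h, gap in gap_by_hospital.items():
    print(f"hospital {h} T1 - T2 gap: {gap:.2f}")
```

With this setup the pooled gap should land near 8, while the per-hospital gaps should hover around 0, up to sampling noise. The `patients` list can be passed to any plotting library to draw the dots and the average lines.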

You can find the code to reproduce the figure in this GitHub repo.

Where to Learn More

Start with the Wikipedia entry, where you will find all necessary additional resources.

Bottom Line

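Simpson’s paradox manifests when a pattern observed within groups disappears or reverses once the data is aggregated. In the example above, the aggregate shows a difference between treatments that neither hospital shows on its own.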
It is a reminder of the critical role confounding variables play in data analysis.
It underscores the importance of stratifying data by meaningful subgroups and carefully considering the context before drawing conclusions from aggregated statistics.
