Simpson’s Paradox: A Simple Illustration

Background

Simpson’s paradox is one of the most counterintuitive phenomena in data analysis. It describes situations where a trend observed within groups disappears, or even reverses, when the data is aggregated. The same mechanism can also make a trend appear in the aggregate that exists in none of the groups, which is the case illustrated here. The usual cause is a confounding variable that distorts the overall trend. Let’s examine a concrete example.

An Example

Imagine two hospitals, A and B, treating patients for a particular condition with two treatment options, T1 and T2. Hospital A, located in a higher-income neighborhood, primarily receives healthier patients, while Hospital B, in a lower-income neighborhood, tends to treat sicker patients. The effectiveness of the treatments is measured as improvement in a continuous health score.

We are interested in examining whether one of the treatment options leads to better health outcomes. Consider the following data gathered across both hospitals.

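| Hospital | Treatment | Health Improvement | N  |
|----------|-----------|--------------------|----|
| A        | T1        | 20                 | 90 |
| A        | T2        | 20                 | 10 |
| B        | T1        | 10                 | 10 |
| B        | T2        | 10                 | 90 |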
Health improvement by hospital and treatment type. Both treatments T1 and T2 are equally effective within each hospital.

Let’s now look at what happens when we combine the data from both hospitals.

| Treatment | N   | Health Improvement         |
|-----------|-----|----------------------------|
| T1        | 100 | 19 = 20 × 0.9 + 10 × 0.1   |
| T2        | 100 | 11 = 10 × 0.9 + 20 × 0.1   |
Health improvement by treatment type. Treatment T1 appears more effective overall.
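
The aggregated figures are patient-weighted averages of the per-hospital values. As a minimal sketch (the variable and function names are illustrative, not taken from the article's repository), the arithmetic can be reproduced like this:

```python
# Rows from the first table: (hospital, treatment, mean improvement, n patients)
rows = [
    ("A", "T1", 20, 90),
    ("A", "T2", 20, 10),
    ("B", "T1", 10, 10),
    ("B", "T2", 10, 90),
]

def pooled_mean(rows, treatment):
    """Patient-weighted mean improvement for one treatment, pooling hospitals."""
    selected = [(mean, n) for _, t, mean, n in rows if t == treatment]
    total_n = sum(n for _, n in selected)
    return sum(mean * n for mean, n in selected) / total_n

pooled = {t: pooled_mean(rows, t) for t in ("T1", "T2")}
print(pooled)  # {'T1': 19.0, 'T2': 11.0}
```

Each hospital's average enters the pooled figure in proportion to how many of that treatment's patients it contributed, which is why T1 is dominated by Hospital A and T2 by Hospital B.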

Within each hospital, the data shows that both treatments are equally effective. However, once the data from both hospitals is combined, treatment T1 appears far more effective overall (an average improvement of 19 versus 11). Why does this happen?

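The confounding variable here is the underlying health status of patients, which is tied to the hospital. Hospital A treats mostly healthier patients, while Hospital B handles more severe cases, so patients at Hospital A improve more regardless of treatment. At the same time, the two hospitals use the treatments in very different proportions: 90% of Hospital A’s patients receive T1, whereas 90% of Hospital B’s patients receive T2. As a result, the T1 group consists mostly of Hospital A’s healthier patients and the T2 group mostly of Hospital B’s sicker patients. This difference in patient distribution drives the gap in the aggregated averages, even though both treatments perform identically within each hospital.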
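
One way to make this concrete is to compute each hospital's treatment mix and then compare the treatments within hospitals before averaging. The sketch below assumes a simple stratified comparison that weights each hospital's T1 minus T2 difference by its patient count; the names are illustrative and this is not code from the article.

```python
# The example data: (hospital, treatment, mean improvement, n patients)
rows = [
    ("A", "T1", 20, 90),
    ("A", "T2", 20, 10),
    ("B", "T1", 10, 10),
    ("B", "T2", 10, 90),
]
cells = {(h, t): (mean, n) for h, t, mean, n in rows}

t1_share = {}     # fraction of each hospital's patients who received T1
within_diff = {}  # T1 minus T2 mean improvement inside each hospital
hospital_n = {}   # total patients per hospital
for h in ("A", "B"):
    mean_t1, n_t1 = cells[(h, "T1")]
    mean_t2, n_t2 = cells[(h, "T2")]
    hospital_n[h] = n_t1 + n_t2
    t1_share[h] = n_t1 / hospital_n[h]
    within_diff[h] = mean_t1 - mean_t2

# Stratified comparison: average the within-hospital differences,
# weighting each hospital by its share of all patients.
total_n = sum(hospital_n.values())
adjusted_diff = sum(within_diff[h] * hospital_n[h] for h in hospital_n) / total_n

print(t1_share)       # {'A': 0.9, 'B': 0.1}
print(adjusted_diff)  # 0.0
```

The unequal T1 shares are what link treatment to hospital. Once the comparison is made hospital by hospital, the apparent advantage of T1 vanishes.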

Visualizing the Paradox

To illustrate, imagine a scatter plot where each dot represents a patient. The color of the dot indicates the treatment they received, and the horizontal lines represent the average health improvement for each group. The vertical axis depicts the outcome variable (health improvement).

In the aggregated data, T1 shows a higher average improvement, creating the illusion of greater effectiveness. But when disaggregated by hospital, the averages for T1 and T2 are identical.
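
To experiment with patient-level data like the dots in such a plot, the sketch below simulates one outcome per patient and prints the group averages that the horizontal lines would mark. It is not the code from the repository mentioned below: the noise level, the random seed, and all names are assumptions made for illustration.

```python
import random
from statistics import mean

rng = random.Random(0)  # arbitrary fixed seed so the run is repeatable

# (hospital, treatment, true mean improvement, n patients) from the example
design = [
    ("A", "T1", 20, 90),
    ("A", "T2", 20, 10),
    ("B", "T1", 10, 10),
    ("B", "T2", 10, 90),
]
NOISE_SD = 2.0  # assumed patient-to-patient spread; not given in the article

# One (hospital, treatment, improvement) record per simulated patient.
patients = [
    (h, t, rng.gauss(mu, NOISE_SD))
    for h, t, mu, n in design
    for _ in range(n)
]

def avg(treatment, hospital=None):
    """Mean improvement for a treatment, optionally within one hospital."""
    return mean(
        y for h, t, y in patients
        if t == treatment and (hospital is None or h == hospital)
    )

for h in ("A", "B"):
    print(h, round(avg("T1", h), 1), round(avg("T2", h), 1))
print("all", round(avg("T1"), 1), round(avg("T2"), 1))
```

Because the simulated outcomes include noise, the within-hospital averages for T1 and T2 come out close rather than exactly equal, while the pooled averages stay several points apart.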

You can find the code to reproduce the figure in this GitHub repo.

Where to Learn More

Start with the Wikipedia entry, where you will find all necessary additional resources.

Takeaways
  • Simpson’s paradox manifests when a pattern observed within groups disappears or reverses once the data is aggregated, or when the aggregate shows a pattern that no group exhibits.
  • It is a reminder of the critical role confounding variables play in data analysis.
  • It underscores the importance of stratifying data by meaningful subgroups and carefully considering the context before drawing conclusions from aggregated statistics.
