Background
Simpson’s paradox is one of the most counterintuitive phenomena in data analysis. It describes situations where a trend observed within groups disappears—or even reverses—when the data is aggregated. The underlying cause is often a confounding variable that distorts the overall trend. Let’s examine a concrete example.
An Example
Imagine two hospitals, A and B, treating patients for a particular condition with two treatment options, T1 and T2. Hospital A, located in a higher-income neighborhood, primarily receives healthier patients, while Hospital B, in a lower-income neighborhood, tends to treat sicker patients. The effectiveness of the treatments is measured as improvement in a continuous health score.
We are interested in examining whether one of the treatment options leads to better health outcomes. Consider the following data gathered across both hospitals.
Hospital | Treatment | Health Improvement | N |
A | T1 | 20 | 90 |
A | T2 | 20 | 10 |
B | T1 | 10 | 10 |
B | T2 | 10 | 90 |
Let’s now look at what happens when we combine the data from both hospitals.
Treatment | N | Health Improvement |
T1 | 100 | 19 = 20 * .9 + 10 * .1 |
T2 | 100 | 11 = 10 * .9 + 20 * .1 |
Within each hospital, the data shows that both treatments are equally effective. However, combining the data across both hospitals makes treatment T1 appear substantially more effective overall (19 versus 11). Why does this happen?
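The confounding variable here is the underlying health status of patients. Hospital A treats mostly healthier patients, while Hospital B handles more severe cases. Because T1 is given mostly at Hospital A (90 of the 100 T1 patients) and T2 mostly at Hospital B (90 of the 100 T2 patients), the pooled average for T1 is dominated by healthier patients and the pooled average for T2 by sicker ones. This difference in patient distribution drives the gap in the overall averages, even though both treatments perform identically within each hospital.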
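The arithmetic behind the combined table can be checked in a few lines. The sketch below is only an illustration: it copies the four rows from the per-hospital table and computes the patient-weighted mean for each treatment. The names `rows` and `pooled_mean` are made up for this example.

```python
# Rows copied from the per-hospital table:
# (hospital, treatment, mean health improvement, number of patients).
rows = [
    ("A", "T1", 20, 90),
    ("A", "T2", 20, 10),
    ("B", "T1", 10, 10),
    ("B", "T2", 10, 90),
]


def pooled_mean(rows, treatment):
    """Patient-weighted mean improvement for one treatment across hospitals."""
    selected = [(mean, n) for _, t, mean, n in rows if t == treatment]
    total_n = sum(n for _, n in selected)
    return sum(mean * n for mean, n in selected) / total_n


for treatment in ("T1", "T2"):
    print(treatment, pooled_mean(rows, treatment))
# T1 19.0
# T2 11.0
```

The weights (90 and 10 patients) differ between the treatments, and that alone produces the gap: the per-hospital means going into each weighted average are the same.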
Visualizing the Paradox
To illustrate, imagine a scatter plot where each dot represents a patient. The color of the dot indicates the treatment they received, and the horizontal lines represent the average health improvement for each group. The vertical axis depicts the outcome variable (health improvement).
In the aggregated data, T1 shows a higher average improvement, creating the illusion of greater effectiveness. But when disaggregated by hospital, the averages for T1 and T2 are identical.
You can find the code to reproduce the figure in this GitHub repo.
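The repository code is not reproduced here. As a rough stand-in, the sketch below simulates the kind of patient-level data such a scatter plot would show and computes the averages that the horizontal lines represent. It is not the author's code: the noise level `NOISE_SD`, the random seed, and the helper name `average` are assumptions chosen for illustration, and only the group means and sizes come from the table above.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# (hospital, treatment) -> (true mean improvement, number of patients),
# taken from the per-hospital table.
design = {
    ("A", "T1"): (20, 90),
    ("A", "T2"): (20, 10),
    ("B", "T1"): (10, 10),
    ("B", "T2"): (10, 90),
}
NOISE_SD = 2.0  # assumed patient-to-patient spread; not given in the article

# One record per simulated patient: (hospital, treatment, improvement).
patients = [
    (hospital, treatment, random.gauss(true_mean, NOISE_SD))
    for (hospital, treatment), (true_mean, n) in design.items()
    for _ in range(n)
]


def average(patients, treatment, hospital=None):
    """Mean improvement for a treatment, optionally within one hospital."""
    return mean(
        y
        for h, t, y in patients
        if t == treatment and (hospital is None or h == hospital)
    )


# Averages within each hospital (the disaggregated view).
for hospital in ("A", "B"):
    for treatment in ("T1", "T2"):
        print(hospital, treatment, round(average(patients, treatment, hospital), 1))

# Averages over all patients (the aggregated view).
for treatment in ("T1", "T2"):
    print("pooled", treatment, round(average(patients, treatment), 1))
```

Within each hospital the two treatment averages should land close to each other, while the pooled averages sit far apart. The `patients` list could be passed to any plotting library to draw the dots and lines described above.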
Where to Learn More
Start with the Wikipedia entry, where you will find all necessary additional resources.
Takeaways
- Simpson’s paradox arises when a pattern observed within groups disappears or reverses once the data is aggregated, or, as in the example above, when a pattern shows up only in the aggregate.
- It is a reminder of the critical role confounding variables play in data analysis.
- It underscores the importance of stratifying data by meaningful subgroups and carefully considering the context before drawing conclusions from aggregated statistics.
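One way to act on the last takeaway is direct standardization, a technique not used in the post itself and introduced here only as an illustration. The idea is to compute each treatment's mean within every hospital, then weight those means by the hospital's share of all patients instead of by how many patients happened to receive that treatment there. The sketch assumes the four table rows from the example, and `standardized_mean` is a hypothetical helper name.

```python
# Rows copied from the per-hospital table:
# (hospital, treatment, mean health improvement, number of patients).
rows = [
    ("A", "T1", 20, 90),
    ("A", "T2", 20, 10),
    ("B", "T1", 10, 10),
    ("B", "T2", 10, 90),
]


def standardized_mean(rows, treatment):
    """Mean improvement for a treatment, weighting each hospital's stratum
    by that hospital's share of all patients (direct standardization)."""
    hospital_totals = {}
    for hospital, _, _, n in rows:
        hospital_totals[hospital] = hospital_totals.get(hospital, 0) + n
    grand_total = sum(hospital_totals.values())
    return sum(
        mean * hospital_totals[hospital] / grand_total
        for hospital, t, mean, _ in rows
        if t == treatment
    )


for treatment in ("T1", "T2"):
    print(treatment, standardized_mean(rows, treatment))
# T1 15.0
# T2 15.0
```

With both treatments evaluated on the same patient mix (half from each hospital), the apparent advantage of T1 vanishes, matching the within-hospital picture.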