E.H. Simpson first described this phenomenon in 1951. The name Simpson’s paradox was introduced by Colin R. Blyth in 1972. Blyth mentioned that:

G. W. Haggstrom pointed out that Simpson’s paradox is the simplest form of the false correlation paradox in which the domain of x is divided into short intervals, on each of which y is a linear function of x with large negative slope, but these short line segments get progressively higher to the right, so that over the whole domain of x, the variable y is practically a linear function of x with large positive slope.

Simpson’s paradox arises from the combination of an ignored confounding variable and a disproportionate allocation of that variable, and it can lead to a conclusion about an intervention effect that is the opposite of the correct inference (hence a paradox). Simpson demonstrated how differential analyses of contingency tables (i.e., analyses in which the confounding variable is excluded or included) can lead to different conclusions. (The topic of interactions in contingency tables dates back over eight decades. Early works include those by Bartlett, Norton, Lancaster, Darroch, Lewis, Whittemore and Davis.)

Mathematically, Simpson’s paradox is the following:

It is possible to have

P(A|B) < P(A|B’)

and have at the same time both

P(A|BC) ≥ P(A|B’C)

P(A|BC’) ≥ P(A|B’C’)

The paradox rests upon the dependence (interaction) of B and C. If B and C were independent, the weights in the two decompositions below (shown in curly brackets) would be identical, and the reversal could not occur:

P(A|B) = {P(C|B)} P(A|BC) + {P(C’|B)} P(A|BC’)

P(A|B’) = {P(C|B’)} P(A|B’C) + {P(C’|B’)} P(A|B’C’)
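To make the role of the weights concrete, here is a minimal sketch in Python. The counts are hypothetical and chosen only for illustration (they do not come from any dataset discussed in this post); the helper names `p_a_given`, `weight` and `p_a_given_group` are likewise made up for this example.

```python
from fractions import Fraction

# Hypothetical counts: (number of times A occurred, number of trials),
# keyed by (group, stratum). Group is B or B', stratum is C or C'.
counts = {
    ("B", "C"): (9, 10),
    ("B", "C'"): (30, 100),
    ("B'", "C"): (80, 100),
    ("B'", "C'"): (2, 10),
}

def p_a_given(group, stratum):
    """P(A | group, stratum), e.g. P(A|BC)."""
    hits, trials = counts[(group, stratum)]
    return Fraction(hits, trials)

def weight(group, stratum):
    """P(stratum | group), i.e. the term in curly brackets."""
    trials = counts[(group, stratum)][1]
    total = sum(n for (g, _), (_, n) in counts.items() if g == group)
    return Fraction(trials, total)

def p_a_given_group(group):
    """P(A | group) as the weighted sum over both strata."""
    return sum(weight(group, s) * p_a_given(group, s) for s in ("C", "C'"))

for stratum in ("C", "C'"):
    print(stratum, p_a_given("B", stratum), p_a_given("B'", stratum))
print("aggregate", p_a_given_group("B"), p_a_given_group("B'"))
```

With these counts, B does better than B’ inside each stratum, yet worse in aggregate, because B is observed mostly in the stratum where A is rare while B’ is observed mostly in the stratum where A is common.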

The extreme form of Simpson’s paradox is given by the following:

Subject to the conditions:

P(A|BC) ≥ γ P(A|B’C)

P(A|BC’) ≥ γ P(A|B’C’)

with γ ≥ 1, it is possible to have

P(A|B) ≅ 0 and P(A|B’) ≅ 1/γ
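The extreme form can be checked numerically. The sketch below builds one hypothetical set of probabilities in which B beats B’ by a factor of γ within both C and C’, yet the aggregate P(A|B) is close to 0 while the aggregate P(A|B’) is close to 1/γ. The function name `extreme_simpson` and the small parameters `eps` and `delta` are assumptions introduced for this illustration only.

```python
def extreme_simpson(gamma, eps=1e-3, delta=1e-3):
    """Return (P(A|B), P(A|B')) for one hypothetical construction.

    Requires gamma >= 1 and gamma * delta <= 1 so that every value
    below is a valid probability.
    """
    # Stratum-level rates: B is gamma times better than B' in both strata.
    p_a_bc = 1.0               # P(A|BC)
    p_a_nbc = 1.0 / gamma      # P(A|B'C)
    p_a_bnc = gamma * delta    # P(A|BC')
    p_a_nbnc = delta           # P(A|B'C')

    # Weights: B falls almost entirely in C', B' almost entirely in C.
    p_c_b = eps                # P(C|B)
    p_c_nb = 1.0 - eps         # P(C|B')

    # Aggregate rates via the weighted decomposition over C and C'.
    p_a_b = p_c_b * p_a_bc + (1.0 - p_c_b) * p_a_bnc
    p_a_nb = p_c_nb * p_a_nbc + (1.0 - p_c_nb) * p_a_nbnc
    return p_a_b, p_a_nb

p_b, p_not_b = extreme_simpson(gamma=2.0)
print(p_b, p_not_b)
```

Shrinking `eps` and `delta` pushes the two aggregates arbitrarily close to 0 and 1/γ respectively, while the within-stratum advantage of B stays fixed at a factor of γ.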

Simpson’s paradox has been extensively studied in a variety of fields including, but not limited to, statistics, medicine, cognitive sciences and social sciences. In the context of operations, we discussed Simpson’s paradox in an earlier blog and a research paper.

In their 1981 paper, Lindley and Novick argued the following about statistical inference:

Standard procedures concentrate on the data and tend to ignore the connection with the case to which the inference is to be applied. … this connection can be established using either de Finetti’s idea of exchangeability or Fisher’s concept of sub-population.

Pearl later presented a “resolution” of Simpson’s paradox using causal calculus.

As alluded to several times previously in our blog series, marrying the context to statistical analysis is key to extracting actionable insights. For instance, the plots below show the Webpage Response Time for the two different offerings of AT&T. The first plot corresponds to performance on Desktop, whereas the second plot corresponds to performance on Mobile.

From the Desktop plot we note that the average (and the median) performance of U-Verse is worse than that of its counterpart. More importantly, U-Verse experiences over 2.5x the number of >10 sec spikes of its counterpart. This difference in performance would be masked in an aggregated view. On deeper analysis, we find geography to be the confounding factor. Concretely, on dissecting the performance between East/West Coast vs. Mid-West in the US, we noted that the difference in performance between U-Verse and its counterpart disappeared.

From the Mobile plot we note that the average (and the median) performance of U-Verse is better (unlike in the Desktop case) than that of its counterpart. Further, U-Verse experiences less than 0.74x the number of >10 sec spikes of its counterpart. This difference in performance would also be masked in an aggregated view. Akin to the Desktop case, the difference in performance between U-Verse and its counterpart disappeared on dissecting the performance between East/West Coast vs. Mid-West in the US.

Akin to Part I, let’s analyze what the impact would be if the Webpage Response Time were downsampled by a factor of two in the aforementioned cases.

On comparing the downsampled Desktop plot with the Desktop plot without downsampling, we note that although U-Verse’s average/median performance is still worse, the ratio of the number of >10 sec spikes reduced from over 2.5x to 2.3x. Downsampling thus makes the worst-case performance look artificially better.

On comparing the downsampled Mobile plot with the Mobile plot without downsampling, we note that although U-Verse’s average/median performance is still better, the ratio of the number of >10 sec spikes reduced from 0.74x to 0.29x. Akin to the Desktop case, downsampling artificially boosts the apparent worst-case performance.
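The effect of downsampling on spike counts can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two series are synthetic (not the measurements behind the plots), and downsampling is assumed to mean keeping every second sample, which may differ from the method used for the plots. The names `downsample`, `spike_count` and `spike_ratio` are hypothetical helpers.

```python
def downsample(series, factor=2):
    """Keep every `factor`-th sample (simple decimation, an assumption)."""
    return series[::factor]

def spike_count(series, threshold_s=10.0):
    """Number of samples above the spike threshold, in seconds."""
    return sum(1 for t in series if t > threshold_s)

def spike_ratio(a, b, threshold_s=10.0):
    """Ratio of spike counts of series a to series b."""
    return spike_count(a, threshold_s) / spike_count(b, threshold_s)

# Synthetic response times in seconds, for illustration only.
offering_x = [2, 12, 3, 11, 2, 15, 3, 2, 14, 2, 13, 3]
offering_y = [2, 3, 11, 2, 3, 2, 12, 3, 2, 3, 2, 3]

print(spike_ratio(offering_x, offering_y))
print(spike_ratio(downsample(offering_x), downsample(offering_y)))
```

Because spikes are short-lived, decimation drops some of them entirely, and it need not drop them evenly across the two series. The spike ratio after downsampling can therefore differ substantially from the ratio in the full-resolution data.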

The key takeaway from the above is that, to avoid Simpson’s paradox, one should judiciously dissect the operational performance along different dimensions. (Drawing conclusions based on high-level aggregate metrics can be misleading.) This would in turn help avoid wild goose chases. Having said that, the sub-samples obtained by slicing along different dimensions should be large enough to ensure high statistical power of the subsequent analysis.
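As a rough guide to how large a slice needs to be, the sketch below uses the standard normal-approximation formula for comparing two proportions, for example the fraction of requests with a >10 sec response time in two offerings. The function name `min_samples_per_slice`, the default significance level and power, and the example spike rates are all assumptions made for illustration; they are not taken from the analysis above.

```python
from math import ceil
from statistics import NormalDist

def min_samples_per_slice(p1, p2, alpha=0.05, power=0.8):
    """Approximate samples needed per group to detect p1 vs p2.

    Uses the normal approximation for a two-sided test of two
    proportions with equal group sizes.
    """
    if p1 == p2:
        raise ValueError("p1 and p2 must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Hypothetical spike rates: 5% vs 2% of requests exceed 10 sec.
print(min_samples_per_slice(0.05, 0.02))
# A smaller gap between the two rates needs many more samples per slice.
print(min_samples_per_slice(0.05, 0.04))
```

The required size grows quickly as the difference between the two rates shrinks, which is why slicing along too many dimensions at once can leave each slice too small to support a reliable comparison.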