Q J Med 2002; 95: 247-249
© 2002 Association of Physicians
The nature of truth: Simpson's Paradox and the limits of statistical data
M. Heydtmann
From the Liver Research Laboratories, Queen Elizabeth Hospital, Birmingham, UK
Introduction
Give me a fruitful error any time, full of seeds, bursting with its own corrections. You can keep your sterile truth for yourself.
Vilfredo Pareto
We usually think in terms of true and false, and often believe that we know which is which. Nonetheless, sometimes information which appears to be true is in fact false. Although we try to base our medical knowledge on objective evidence (research and statistics) rather than our personal opinions, Simpson's Paradox reminds us of the limitations of statistical evidence. In this phenomenon, an apparent paradox arises because aggregated data can support a conclusion which is the opposite of that suggested by the same data before aggregation.
An example
One would generally conclude from the data in Table 1 that treatment A is the treatment of choice for the condition studied (given that side-effects are equal). Suppose, however, that these patients consisted of two subgroups: those with a high serum level of substance X, and those with a low level. Table 2 shows the data for the patients with high serum X. For this subgroup of patients, treatment B seems to be better than treatment A. Since A is the preferable treatment in the group as a whole, one might intuitively expect the other patients to be better off with treatment A. But this is not the case (Table 3). Even in patients with low serum X, treatment B is still better (although fewer of these patients benefit from either treatment).
Table 2 Number of patients with high serum X responding to treatment A vs. treatment B: in this subgroup, B is better than A
Table 3 Number of patients with low serum X responding to treatment A vs. treatment B: in this subgroup too, B is better than A
Thus, if the patient's serum X level is unknown, treatment A seems to be better, but if serum X is known, treatment B is preferable (and one can better predict the response rate of a patient). This phenomenon is a result of the aggregation of two (or more) subgroups.1 The numbers of the example are kept simple to demonstrate this phenomenon of severe confounding, but there are a number of real examples in the literature, including the medical literature.2–4 This aggregation effect can occur in the case of an uneven distribution of a latent variable (in this case the serum X level) among the groups studied.
Clearly, if available, one should consider the data for the subgroups, because they give the most relevant information for a given patient, and in a trial one would have to report the data for the subgroups. The danger lies in a case where only the aggregated data are available, but the detailed analysis would support a different conclusion. One could call this a type S error, after Simpson, its discoverer.
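The mechanism behind the reversal can be written out in two lines. The notation is introduced here only for illustration and does not appear in the tables: r_{T,g} is the response rate on treatment T in subgroup g, and w_T is the fraction of patients on treatment T who have high serum X.

```latex
\begin{align*}
r_A &= w_A\, r_{A,\mathrm{high}} + (1 - w_A)\, r_{A,\mathrm{low}}, \\
r_B &= w_B\, r_{B,\mathrm{high}} + (1 - w_B)\, r_{B,\mathrm{low}}.
\end{align*}
```

Each aggregate rate is a weighted average of the subgroup rates. If patients with high serum X respond better in general and w_A is much larger than w_B, then r_A can exceed r_B even though B has the higher rate in both subgroups. When w_A = w_B the reversal cannot occur.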
Is Simpson's Paradox dependent on the absolute numbers, and does statistical significance protect from the effect?
In the example, the benefit of treatment B over treatment A for the subgroups is not statistically significant: although the p value for the aggregated data is 0.04, the p value for benefit of treatment B over treatment A is 0.3 in both subgroups. But if a zero is added to the numbers of patients in all three tables (Tables 4–6), the differences become statistically significant both in the aggregate group, where A appears better, and in the subgroups, where B is better.
Thus the aggregation effect is not dependent on absolute numbers,
and Simpson's Paradox can occur in cases with statistical significance.
When studies with low numbers of patients in different subgroups
are combined and data become aggregated, Simpson's Paradox can
arise: an important issue in meta-analyses.
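The effect of adding a zero to every count can be checked with a short calculation. The sketch below is an illustration under stated assumptions: it uses a Pearson chi-squared test without continuity correction (the text does not say which test produced the quoted p values), and the counts are hypothetical, not those of Tables 1 to 6.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value_1df(chi2):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical subgroup, NOT the article's data:
# (A responders, A non-responders, B responders, B non-responders)
small = (15, 15, 7, 3)                 # A: 15 of 30 respond, B: 7 of 10 respond
large = tuple(10 * x for x in small)   # same proportions, ten times the patients

chi_small, chi_large = chi2_2x2(*small), chi2_2x2(*large)
p_small, p_large = p_value_1df(chi_small), p_value_1df(chi_large)

print(f"n = 40:  chi2 = {chi_small:.2f}, p = {p_small:.3f}")
print(f"n = 400: chi2 = {chi_large:.2f}, p = {p_large:.5f}")
```

The response rates are identical in both versions, but the test statistic grows tenfold with the sample size, so a difference that is not significant with 40 patients becomes significant with 400.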
In clinical practice, the aggregated data become irrelevant as soon as one performs or is even aware of the more detailed analysis. One would then always favour treatment B in the example. This is true even if one could not measure the serum X level in a patient, because he or she would always fall into one of the two subgroups. Although in Tables 1 to 3 the statistical basis for preferring treatment B is weak, it would be wrong to favour treatment A simply because the benefit in Table 1 was statistically significant.
Does a properly designed trial prevent Simpson's Paradox?
The aggregation effect shown above is dependent on the uneven distribution of subgroups of patients into the two treatment groups. Naturally, one tries to avoid an uneven distribution of variables. An investigator controls for the known variables, and minimizes the effect of the unknown by randomization. In the case above, randomization of 30 out of 40 patients with a latent variable to treatment A and only 10 to treatment B (Tables 1–3) does not seem very likely. Increasing the numbers of patients helps: the randomization of 300 out of 400 patients to one group and only 100 to the other (Tables 4–6) is even more unlikely. But high numbers do not absolutely prevent such an uneven distribution and the possibility of type S error. Considering the astronomical numbers of latent variables that are not and will never be controlled for, there is a good chance that one of them is unevenly distributed. Simpson's aggregation effect could then lead to a false conclusion.
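How unlikely such a split is, and how quickly many latent variables erode that protection, can be estimated with a rough sketch. It assumes simple randomization, with each patient assigned to either arm with probability 1/2, and it treats the latent variables as independent of one another. Neither assumption comes from the text, and the count of 1000 latent variables is purely illustrative.

```python
import math
from math import comb

def tail_at_least(k, n):
    """P(X >= k) for X ~ Binomial(n, 1/2), computed from exact integer counts."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

def imbalance_prob(k, n):
    """Chance that either arm receives at least k of the n patients (requires k > n/2)."""
    return 2 * tail_at_least(k, n)

def at_least_one(p, m):
    """Chance that at least one of m independent variables is this unbalanced.

    Written with log1p/expm1 so that very small p is not lost to rounding.
    """
    return -math.expm1(m * math.log1p(-p))

p40 = imbalance_prob(30, 40)      # a 30:10 split or worse among 40 patients
p400 = imbalance_prob(300, 400)   # a 300:100 split or worse among 400 patients

print(f"30 of 40 in one arm:   {p40:.4g}")
print(f"300 of 400 in one arm: {p400:.4g}")
print(f"at least one of 1000 variables split 30:10 or worse: {at_least_one(p40, 1000):.2f}")
```

A single variable is rarely this unbalanced, and larger trials make it rarer still, but with enough uncontrolled variables in a small trial an imbalance somewhere becomes more likely than not.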
Is Simpson's Paradox common?
The paradox and its associated type S error have been described in both medical and non-medical studies.2–4 In a paper by Charig et al. comparing the success rates of kidney stone removal with different techniques, percutaneous nephrolithotomy had a better overall outcome than open surgery (83% success rate vs. 78%). But when the patients were divided into a group with a single stone <2 cm and a group with one larger stone or multiple stones, success rates were better with open surgery in both groups: open surgery had a 93% success rate vs. 87% for percutaneous nephrolithotomy in the group with a single small stone. In the second group, there was 73% success for open surgery vs. 69% for percutaneous nephrolithotomy. Here the aggregation effect occurs because percutaneous nephrolithotomy was used mostly in patients with a single small stone (234 of its 289 successes), whereas open surgery was used mostly in those with multiple or large stones (192 of its 273 successes).2
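The reversal in the kidney stone data can be reproduced in a few lines. The success counts (234, 55, 192, 81) follow from the totals quoted above. The patient denominators are the figures commonly quoted from Charig et al. and are an assumption here, since only percentages are given in the text; they do agree with those percentages.

```python
# (successes, patients) by treatment and stone group.
# Denominators are assumed from the commonly quoted Charig et al. figures.
data = {
    "open surgery": {
        "single small stone": (81, 87),
        "large or multiple stones": (192, 263),
    },
    "percutaneous nephrolithotomy": {
        "single small stone": (234, 270),
        "large or multiple stones": (55, 80),
    },
}

def rate(successes, patients):
    """Success rate as a fraction."""
    return successes / patients

def pooled(groups):
    """Sum successes and patients over all stone groups of one treatment."""
    successes = sum(s for s, _ in groups.values())
    patients = sum(n for _, n in groups.values())
    return successes, patients

for treatment, groups in data.items():
    for group, counts in groups.items():
        print(f"{treatment:30s} {group:26s} {100 * rate(*counts):.0f}%")
    print(f"{treatment:30s} {'all patients':26s} {100 * rate(*pooled(groups)):.0f}%")
```

Open surgery has the higher rate within each stone group, yet the lower rate once the groups are pooled, because it was used mainly for the harder cases.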
Similarly, Early and Nicholas demonstrated a fall in the percentage of male patients in a psychiatric hospital between 1970 and 1975, but when the results were broken down according to the patients' age (age >65 and age <65) there was an increase in male patients in both age groups. In this study the effect was caused by a predominance of younger males and older females in the hospital, and a marked decrease in the overall number of hospitalized patients during the time studied.3 Reintjes et al. describe another recent medical example in a multi-centre study on nosocomial infections.4
Even though these represent only a minority of published statistics, the error can well occur without us noticing. The more one analysed studies in detail for latent variables, the more examples one would be likely to find. Further, the more we question our currently accepted knowledge in this way, the more uncertain our evidence becomes. If we analyse the data and find a Simpson's Paradox, this does not protect us from a second type S error resulting from a further aggregation effect. Returning to Table 6, where treatment B was preferable to treatment A, it could still be the other way round if we analysed the data according to further latent variables (Tables 7 and 8). Since finding severe confounding does not protect from the type S error, can we believe in the results of statistics at all?
Table 7 Subgroup of patients from Table 6 (low serum X) who have high serum levels of Y, and their responses to treatment A vs. treatment B
Table 8 Subgroup of patients from Table 6 (low serum X) who have low serum levels of Y, and their responses to treatment A vs. treatment B
All results need to be viewed in their scientific context. Where
the results of one study diverge markedly from those of other
studies, care is needed, and latent variables which might have
been overlooked have to be considered. The possibility arises
that the results were confounded by factors that were not controlled
for or perhaps not even measured. Statistical analysis of data
can lead to statistical illusions, which, like
optical illusions, cause misinterpretation. Where the results
of several studies are similar and make sense in the context
of all the other available evidence, they can probably be relied
on. But are they true? And what is truth? The answer of a statistician
would be: a statement is generally regarded as true
if it is true with a probability close to 1, even if this probability
will never reach 1. Can we be sure that what we believe to be
true is not actually false? No, only someone with very limited knowledge
would be able to state that everything he or she knows is true.
The more knowledge a person accumulates, the more likely it
is that some of his or her knowledge is actually false.
Acknowledgments
I would like to thank Professor James Neuberger and Dr Carl
Rasmussen for their helpful discussions and review of the manuscript.
Notes
Address correspondence to Dr M. Heydtmann, Liver Research Laboratories,
Queen Elizabeth Hospital, Edgbaston B15 2TH. e-mail:
m.heydtmann{at}bham.ac.uk 
References
1. Simpson EH. The interpretation of interaction in contingency tables. J R Statist Soc B 1951; 2:238–41.
2. Charig CR, Webb DR, Payne SR, Wickham JE. Comparison of treatment of renal calculi by operative surgery, percutaneous nephrolithotomy, and extracorporeal shock wave lithotripsy. Br Med J 1986; 292:879–82.
3. Early DF, Nicholas M. Dissolution of the mental hospital: fifteen years on. Br J Psychiat 1977; 130:117–22.
4. Reintjes R, de Boer A, van Pelt W, Mintjes-de Groot J. Simpson's paradox: an example from hospital epidemiology. Epidemiology 2000; 11:81–3.
