
Probabilistic reasoning and clinical decision-making: do doctors overestimate diagnostic probabilities?

A. Cahan, D. Gilon, O. Manor, O. Paltiel
DOI: http://dx.doi.org/10.1093/qjmed/hcg122, pp. 763-769. First published online: 18 September 2003

Abstract

Background: The ‘threshold approach’ is based on a physician’s assessment of the likelihood of a disease expressed as a probability. The use of Bayes’ theorem to calculate disease probability in patients with and without a particular characteristic may be hampered by the presence of subadditivity (i.e. the sum of probabilities concerning a single case scenario exceeding 100%).

Aim: To assess the presence of subadditivity in physicians’ estimations of probabilities and the degree of concordance among doctors in their probability assessments.

Design: Prospective questionnaire.

Methods: Residents and trained physicians in Family Medicine, Internal Medicine and Cardiology (n = 84) were asked to estimate the probability of each component of the differential diagnosis in a case scenario describing a patient with chest pain.

Results: Subadditivity was exhibited in 65% of the participants. The total sum of probabilities given by each participant ranged from 44% to 290% (mean 137%). There was wide variability in the assignment of probabilities for each diagnostic possibility (SD 16–21%).

Discussion: The finding of substantial subadditivity, coupled with the marked discordance in probability estimates, questions the applicability of the threshold approach. Physicians need guidance, explicit tools and formal training in probability estimation to optimize the use of this approach in clinical practice.

Introduction

Clinicians make decisions in the face of uncertainty. Kahneman, the recipient of this year’s Nobel prize, together with Tversky and others, provided important insights concerning judgment and decision-making under uncertainty.1,2 In order to deal with uncertainty, doctors often over-emphasize the importance of diagnostic tests, at the expense of the history and physical examination, believing laboratory tests to be more accurate.3 This ignores the fact that medical tests are far from being perfect or innocuous. The inappropriate use of diagnostic tests also contributes to the growing cost of medical care.

The ‘threshold approach’4 is increasingly recommended for optimizing medical decision-making in the context of uncertainty, applying probabilistic thinking to solving questions concerning diagnosis and treatment. Given the high costs and virtual impossibility of achieving 100% diagnostic certainty, the doctor must weigh the relative risks and benefits of testing and of treatment, and make decisions on the basis of probabilities.5 Two thresholds are defined for a given disease investigation (depending on its natural history, facility of diagnosis, risks of the diagnostic tests and the advantages and disadvantages of empiric treatment) (Figure 1): (i) the testing threshold and (ii) the test-treatment threshold. These create three probability zones that guide management.

Figure 1.

Decision-making using the threshold approach. If the estimated probability for the presence of a disease (P) is lower than the test threshold (Tt), there should be no further investigation and no treatment. If, on the other hand, P is greater than the test-treat threshold (Ttrx), treatment is to be initiated. If P lies between the two thresholds, the advisable step is to perform a test and decide on management according to its results. Adapted from reference 4.
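The three zones in Figure 1 can be sketched as a small function. This is a minimal illustration only: the function name, the zone labels and the threshold values in the example are hypothetical and are not taken from the original paper.

```python
def threshold_decision(p: float, test_threshold: float, test_treat_threshold: float) -> str:
    """Return the management zone for an estimated disease probability p.

    p, test_threshold (Tt) and test_treat_threshold (Ttrx) are fractions in [0, 1].
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if not 0.0 <= test_threshold <= test_treat_threshold <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= Tt <= Ttrx <= 1")
    if p < test_threshold:
        return "no test, no treatment"
    if p > test_treat_threshold:
        return "treat"
    return "test"


# Hypothetical thresholds, chosen only for illustration.
for p in (0.05, 0.30, 0.80):
    print(p, threshold_decision(p, test_threshold=0.10, test_treat_threshold=0.60))
```

Because the output depends entirely on where P falls relative to the two thresholds, any bias in P translates directly into a different management zone.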

The basic input in the application of the threshold approach is the physician’s degree of belief regarding the presence of a disease, as expressed by the prior probability estimate P.4–7 The revision or adjustment of P in view of the test’s results can be achieved by means of Bayes’ theorem.6–11 This theorem is used ‘to obtain the probability of disease in a group of people with some characteristic on the basis of the overall rate of that disease (the prior probability of disease) and the likelihood of that characteristic in healthy and diseased individuals’.12
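For reference, the theorem described in this quotation can be written out as follows. The symbols D (disease present) and C (the characteristic or test result observed) are notation assumed here for illustration; P is the prior probability of disease, as in the text.

```latex
P(D \mid C) = \frac{P(C \mid D)\,P}{P(C \mid D)\,P + P(C \mid \neg D)\,(1 - P)}
\qquad\text{or, in odds form,}\qquad
\frac{P(D \mid C)}{1 - P(D \mid C)} = \frac{P}{1 - P} \times \frac{P(C \mid D)}{P(C \mid \neg D)}
```

In the odds form, the ratio P(C | D) / P(C | ¬D) is the likelihood ratio of the characteristic, so the posterior depends on P only through the prior odds.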

This approach necessarily presumes that P is: (i) a probability in its mathematical sense; and (ii) reasonably accurate and unbiased. Furthermore, there should be reasonable consensus among doctors in estimates of P. Failure to fulfil any of the above conditions invalidates the model.

Despite the cardinal nature of these conditions, little attention has been paid to their fulfilment in the medical literature. Some authors have argued that ‘clinical intuition’ and ‘clinical experience’ enable a fairly accurate estimation of the probability of a disease.5–7,11,13 Even when possible biases are adduced, they are not considered to be sufficiently severe to invalidate the entire decision-making framework. In contrast, the question of ‘whether degree of belief can, or should be, represented by the calculus of chance’ has been the focus of a long and lively debate in the psychological literature.14 Bayesians assume that probabilities are assigned to events.14 The ‘support theory’, originally presented by Tversky and Koehler,14,15 claims that probabilities are assigned to hypotheses or descriptions of events (in medical decision-making, the clinical manifestations) rather than to the event itself. The differential diagnosis (DD) formed by the doctor is, in fact, a list of hypotheses. Thus, different subjective probabilities (SPs) may be assigned to different descriptions of the same event, despite the fact that the occurrence of an event has only one probability. Hence, SPs are not probabilities in a formal sense. The SP assigned to a hypothesis reflects the subject’s degree of belief in it and is determined by the support for the hypothesis in one’s mind.

Tversky’s theory points to two main characteristics that distinguish subjective from true probabilities: (i) the unpacking effect, and (ii) subadditivity.

The unpacking effect15–17 is demonstrated when a description of a hypothesis provided in more detail (‘unpacked’) results in an increase in the estimated probability. For example, physicians presented with an identical case scenario were randomly assigned to estimate the probabilities of diagnoses appearing in one of two lists.17 The first list had three categories: ‘gastroenteritis’, ‘ectopic pregnancy’ and a residual category ‘none of the above’. The second list differed only in the third category, which had been unpacked to include three more categories in addition to ‘none of the above’: ‘appendicitis’, ‘pyelonephritis’ and ‘P.I.D.’. According to Bayes’ theorem, unpacking by itself should not change the total probability of the original hypothesis. Thus, one should expect that the sum of the unpacked categories mentioned in the second list would equal that of the ‘none of the above’ category in the first. In fact, a significant difference was demonstrated between them (a mean of 69% vs. 50%), thus demonstrating the unpacking effect.

Subadditivity occurs when the sum of probabilities for more than two alternative hypotheses exceeds 1.0,14,18 reflecting overestimation of the true probabilities. Subadditivity may be demonstrated when the probability is estimated in a quantitative or qualitative manner.14,18

Most studies in this field have not been conducted on doctors. The presence of unpacking in medical decision-making has been characterized in only one study,17 without further confirmation in other studies. Moreover, to the best of our knowledge, the issue of subadditivity of prior probabilities provided by doctors in constructing differential diagnosis (DD) has not been investigated, and its impact on pre-test probability estimations is unknown. Thus, even doctors who use the threshold approach might not be aware of the extent of its limitations.

Consistent with Occam’s razor,19,20 physicians are encouraged to attribute all signs and symptoms to the expression of a single disease. Therefore, the sum of the probabilities of all diagnoses that are part of the DD in most cases ought not exceed 1.0 (or 100%). One could think of it as a pie chart of probabilities. Whether doctors obey this principle is not known. Our study aimed to assess the existence and extent of subadditivity in physicians’ assessment of a written case scenario. A second objective was to assess the level of concordance among doctors in their assignment of prior probabilities.

Methods

Participants

From January 1 to June 30, 2001, 125 doctors (specialists, residents and interns) from four hospitals in Jerusalem, Israel (practicing Cardiology, Internal Medicine and Geriatrics), as well as community-based Family Physicians and General Practitioners, were approached at weekly staff meetings and asked to participate in the study.

Instruments

Participants were asked to complete an anonymous questionnaire (Box 1), in which a clinical case dealing with chest pain including the history, physical examination and electrocardiogram interpretation was presented. Based on the case, the participants were asked to estimate the probabilities (as percentages) of five diseases, which served as the theoretical DD. The doctors were also permitted to add up to four other possible diagnoses (and to estimate the probability of each).

Data analysis

The probabilities assigned by subjects to each diagnosis, including the diagnoses they generated themselves, were summed, yielding the ‘total probability’ (TP).
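The TP calculation can be sketched as below. This is a minimal illustration, not the authors' analysis code: the function names and the example response are hypothetical, and probabilities are taken as percentages, as in the questionnaire.

```python
def total_probability(listed, added=()):
    """Sum the percentages given to the five listed diagnoses and to any
    diagnoses the respondent added (the questionnaire allowed up to four)."""
    if len(listed) != 5:
        raise ValueError("expected estimates for the five listed diagnoses")
    if len(added) > 4:
        raise ValueError("at most four additional diagnoses were permitted")
    return sum(listed) + sum(added)


def shows_subadditivity(tp):
    """A response is counted as subadditive when its TP exceeds 100%."""
    return tp > 100


# Hypothetical respondent: five listed diagnoses plus one added diagnosis.
tp = total_probability([60, 15, 20, 10, 10], added=[20])
print(tp, shows_subadditivity(tp))
```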

Box 1. Case scenario

A 58 y.o woman presents to the E.R with an episodic pressing/burning chest pain that began two days earlier for the first time in her life. The pain started while she was walking, radiates to the back and is accompanied by nausea, diaphoresis and mild dyspnea, but is not increased on inspiration. The latest episode of pain ended half an hour prior to her arrival.

She has had three normal deliveries and had two abortions.

Risk factors: hypertension known for years partially treated (in the past), truncal obesity (height–161 cm, weight–85 kg). She denies smoking, diabetes mellitus, hypercholesterolemia or a family history of heart disease.

She currently takes no medications.

On physical examination upon arrival: appears to be in distress, pulse regular 100/min, B.P 135/80, 18 respirations/min, temperature 36.7°. The lungs are clear, the heart sounds are normal with no murmurs or extra sounds, the abdomen is soft with no organomegaly. No pedal edema is noted and the peripheral pulses are normal.

On the E.C.G: normal sinus rhythm 101/min, axis 45°, borderline ST elevation of 0.5 mm in leads V2-V4.

    Questions:

  • What is the probability (in percents), in your opinion, that the patient has:

  • Active coronary artery disease?

  • A dissecting aortic aneurysm?

  • Reflux esophagitis?

  • Biliary colic?

  • Anxiety disorder?

Do you believe there are other diagnoses relevant to the differential diagnosis in this case? If you do, please specify diagnoses and their probabilities below:

Concordance between doctors was evaluated using two strategies: (a) a quantitative comparison, using crude probabilities, and (b) a method which corrected for between-subject differences in the sum of probabilities. These ‘standardized’ probabilities were computed by dividing the estimated probability for each of the five diagnoses that appeared in the questionnaire by the total probability (for questionnaires with a total probability > 100%). Correlations between physicians’ age, years of experience and number of proposed diagnoses and the sum of probabilities were assessed using Pearson’s coefficient. Associations between categorical variables and subadditivity were assessed using the χ2 test.

For all statistical analyses, a two-tailed p value of < 0.05 was considered statistically significant.
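The ‘standardized’ probabilities described above can be sketched as follows. This is an illustration under stated assumptions: the function name and example values are hypothetical, the result is rescaled to a percentage, and responses with a total probability of 100% or less are assumed to be left unchanged, since the text describes the correction only for totals above 100%.

```python
def standardized_probabilities(listed, added=()):
    """Divide each of the five listed estimates (in percent) by the
    respondent's total probability when that total exceeds 100%."""
    total = sum(listed) + sum(added)
    if total > 100:
        return [100 * p / total for p in listed]
    # Assumption: no correction is applied when the total is <= 100%.
    return [float(p) for p in listed]


# Hypothetical respondent whose estimates sum to 200%.
print(standardized_probabilities([60, 20, 40, 40, 40]))
```

After this correction, the listed estimates of every subadditive respondent sum to at most 100%, so remaining differences between doctors reflect how they divide their belief rather than how much of it they hand out in total.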

Results

Of 125 physicians approached, 84 (67%) filled in the questionnaire, two incompletely. We have no data on non-participants. The demographic characteristics of the participants are shown in Table 1. Participants’ mean age was 40 ± 8.1 years; mean length of clinical experience was 12 ± 8.9 years. Thirty-five percent of the doctors suggested additional diagnoses, the number ranging between 1 and 4. These are listed in Table 2.

Table 1

Demographic characteristics and total probability for all diagnostic categories by subgroup

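| Characteristic | n | % | Mean TP | SD | Participants showing subadditivity | p |
|---|---|---|---|---|---|---|
| Total | 84 | 100 | 136.7 | 53.9 | 65% | |
| Gender* | | | | | | |
| Male | 49 | 58.3 | 127.0 | 49.1 | 57% | 0.05 |
| Female | 34 | 40.5 | 150.6 | 57.9 | 74% | |
| Status | | | | | | |
| Intern | 5 | 5.9 | 149.8 | 73.8 | 60% | 0.20 |
| Resident | 33 | 39.3 | 128.1 | 50.5 | 58% | |
| Specialist | 41 | 48.8 | 142.2 | 55.1 | 71% | |
| General practitioner | 5 | 5.9 | 139.4 | 46.2 | 60% | |
| Specialty | | | | | | |
| Internal medicine¹ | 37 | 44.0 | 129.6 | 42.7 | 64% | 0.39 |
| Cardiology | 8 | 9.5 | 125.8 | 33.9 | 62% | |
| Family medicine | 29 | 34.5 | 146.6 | 67.9 | 70% | |
| Work location | | | | | | |
| Hospital | 50 | 59.5 | 131.0 | 44.2 | 62% | 0.25 |
| Community | 34 | 40.5 | 145.6 | 64.7 | 68% | |

  • ¹Including two residents in geriatrics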

  • TP, total probability (sum of the probabilities offered).

  • *One participant missing data on gender.

Table 2

Alternative diagnoses provided by physicians

| Diagnosis | n |
|---|---|
| Pulmonary embolism | 14 |
| Pericarditis | 10 |
| Musculoskeletal disorder | 10 |
| Pleuritis/pleuritic pain/pneumonia | 6 |
| Diaphragmatic hernia | 5 |
| Peptic ulcer disease | 5 |
| Acute/chronic pancreatitis | 3 |
| Syndrome X | 2 |
| Biceps tendinitis | 1 |
| Chondritis | 1 |
| Myocarditis | 1 |
| Non-ulcer dyspepsia | 1 |
| Cervical/thoracic radiculopathy | 1 |
| Breast congestion | 1 |
| Boerhaave syndrome | 1 |
| Pneumothorax | 1 |
| Esophageal spasm | 1 |
| Acute mesenteric event | 1 |
| Left ventricle hypertrophy/HOCM | 1 |
| Phaeochromocytoma | 1 |

Subadditivity

The distribution of the total probability in the sample is shown in Figure 2. The TP ranged between 44% and 290% (mean 137 ± 54). For 65% of the subjects the TP exceeded 100%. These subjects exhibited subadditivity.

Figure 2.

Frequency distribution of the total probabilities assigned by participants. The mean total probability was 136.7% (± 53.9%). Sixty-five percent of participants had a total probability > 100% (i.e. exhibited subadditivity).

A relatively small proportion (15%) provided answers summing up to exactly 100%. In some of the questionnaires belonging to this latter category, erasures were noted, indicating that the doctors seemingly corrected themselves and changed the numbers so that the total would not exceed 100%. That is, these subjects made a conscious effort to correct their intuitive answers in order to avoid subadditivity.

Physicians’ age (r = −0.04), professional experience (r = −0.08), gender, main working location, status or field of specialization were not associated with the frequency or magnitude of subadditivity (total probabilities assigned). In each of these categories subadditivity was exhibited among the majority of the physicians (see Table 1). Furthermore, the degree of subadditivity was unrelated to the number of diagnoses suggested by the physicians themselves (r = 0.08).

Concordance

Figure 3 and Table 3 indicate the distribution of the probabilities assigned to each of the five diagnoses presented. The standard deviations (SD) of the probability estimates and the inter-quartile ranges were 16–21% and 15–30% across diagnoses, respectively. The ‘standardized’ or corrected probability (CP) distribution is also shown in Figure 3. The CP SDs were 8–20% across the five diagnoses, indicating a high degree of disagreement among physicians. The obtained SDs are particularly high relative to the low means of the last four probabilities, resulting in a coefficient of variation (SD/mean) approximating one. The mean probability assigned to ‘dissecting aortic aneurysm’ was 16/100 (± 17%), higher than the mean probabilities assigned to more common diagnoses such as ‘anxiety disorder’ and ‘biliary colic’ [15/100 (± 16%) and 13/100 (± 16%), respectively].

Table 3

Distribution parameters of estimated probabilities

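| | Active CAD | Dissecting aortic aneurysm | Reflux esophagitis | Biliary colic | Anxiety disorder |
|---|---|---|---|---|---|
| Median | 65% | 10% | 20% | 5% | 10% |
| IQR | 30% | 22.5% | 20% | 17% | 15% |
| 25th percentile | 50% | 5% | 10% | 3% | 5% |
| 75th percentile | 80% | 27.5% | 30% | 20% | 20% |

Figure 3.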

The range of estimated probabilities for each of the five diagnoses suggested to the participants. For each diagnosis, the ranges of ‘crude’ and ‘standardized’ probabilities are shown as the left- and right-hand lines, respectively. The means are shown as dots.

Discussion

This study shows that the sum of estimated probabilities for the presence of diseases comprising the DD for a single patient scenario exceeds 100% for the majority of physicians questioned, regardless of their field of specialization, age, sex or clinical experience. This is despite the fact that the total probability ought not to have reached 100%, for two reasons: (a) all diagnoses described in the questionnaire should be regarded as mutually exclusive and (b) the suggested list of diagnoses did not exhaust the DD.

Our findings provide evidence for the existence of a considerable degree of subadditivity among doctors. The uniformity of results across specialty fields and level of training supports their generalizability and minimizes the possible effects of the relatively low participation rate.

The wide variability in probability estimates for each of the diagnoses or hypotheses is in contrast with the implicit assumptions of the threshold approach. Even within subgroups of participants, expected to be more homogeneous, prominent differences in probability estimations could be demonstrated. Thus, the error attributable to a physician’s estimation might be more significant than the error involved in many of the medical tests.

It can be argued that it is unnatural for people to give numerical estimations, and that using verbal estimations (such as ‘pretty sure’ or ‘unlikely’) may yield more reliable answers. Nevertheless, the threshold approach requires that the prior probability be given in a numerical form. Moreover, numerical estimations have been found to be comparable to verbal estimations.21

The simulative nature of our study may affect the validity of the results. A written case scenario lacks the additional information available in clinical practice (e.g. patient’s appearance and other physical findings). Nevertheless, it is hard to avoid simulation when presenting the same case to a group of doctors. Moreover, the complaint of chest pain is frequently encountered and physicians are exposed to abundant data concerning its prevalence and diagnosis, presumably making them more capable of assessing probabilities associated with it. The relatively small sample size limits the power to analyse subgroups.

The finding that prior probabilities assigned by participating doctors do not conform to the mathematical rules of probability is consistent with accumulating empirical evidence regarding decision-making under uncertainty found in the psychological literature. The only study which demonstrated subadditivity among doctors17 was not concerned with probability estimations in the context of diagnosis, but rather with prognosis. Subadditivity means not only that the sum of SPs exceeds 1.0, but also implies that, in general, estimated probabilities assigned to hypotheses by the doctors are too high. These consequences are the opposite of what the threshold approach was meant to achieve in the first place. This bias in physicians’ estimations of probability will result in more hypotheses crossing the test or test-treat threshold, demanding that more tests be performed and more patients be treated, some unnecessarily.

Let us consider two examples of the possible effects of a biased pre-test probability on the calculated post-test probability, based on the estimated probability of ‘dissecting aortic aneurysm’ (DAA), the second diagnosis in the case scenario. The mean assigned probability was 16%, while the known prevalence of DAA among patients presenting with acute chest pain is < 1%.22–24 The patient described in the case scenario had none of the cardinal characteristics of DAA, and should be assigned an even lower pre-test probability. Using a test with 80% sensitivity and 84% specificity (positive likelihood ratio = 5), a pre-test probability estimate of 1% would result in a post-test probability estimate of about 5% after a positive result, whereas a pre-test probability of 16% would yield a post-test probability of almost 50%. Twenty-five percent of the participants estimated the probability to be 5% or less, while another 25% estimated it to be 27% or more (Table 3). Using the above test, the calculated post-test probabilities for DAA for the lower and upper quartiles, respectively, would be about 21% or less, and about 65% or more. These differences in post-test probability would clearly result in different approaches to management.
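The arithmetic behind these examples can be reproduced with the odds form of Bayes’ theorem. The sketch below assumes a positive test result and uses the sensitivity and specificity quoted above; the function names are illustrative and not from the original.

```python
def positive_likelihood_ratio(sensitivity, specificity):
    """LR+ = sensitivity / (1 - specificity)."""
    return sensitivity / (1.0 - specificity)


def post_test_probability(pre_test, likelihood_ratio):
    """Convert the pre-test probability to odds, multiply by the
    likelihood ratio, and convert the result back to a probability."""
    pre_odds = pre_test / (1.0 - pre_test)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


lr = positive_likelihood_ratio(0.80, 0.84)  # approximately 5
# Pre-test values from the text: prevalence, mean estimate, lower and upper quartiles.
for pre in (0.01, 0.16, 0.05, 0.27):
    print(f"pre-test {pre:.0%} -> post-test {post_test_probability(pre, lr):.1%}")
```

Under these assumptions the four pre-test values map to roughly 5%, 49%, 21% and 65%, which shows how strongly the result depends on the starting estimate.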

The wide range of SPs given by doctors assessing the same hypothesis is a matter of additional concern for the threshold approach. In this study, the standard deviation was about 18%. That is, the average difference between probabilities assigned by any two doctors to the same diagnosis is ∼25%. In our opinion, differences of this magnitude are unacceptable, since they may lead to wide variations in the post-test probability.
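One way to arrive at a figure of this size is sketched below. It is offered as an assumption rather than as the authors’ stated calculation: the estimates of two doctors, written here as X_1 and X_2 (notation introduced for illustration), are treated as independent with a common standard deviation σ ≈ 18%, so that the standard deviation of their difference is √2 σ.

```latex
\operatorname{Var}(X_1 - X_2) = \sigma^2 + \sigma^2 = 2\sigma^2
\quad\Longrightarrow\quad
\sqrt{\operatorname{Var}(X_1 - X_2)} = \sqrt{2}\,\sigma \approx 1.414 \times 18\% \approx 25\%
```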

To summarize, the prior probabilities as estimated by doctors in this study do not fulfil the demands of the threshold approach, as: (i) they are not genuine probabilities; (ii) they are not consistent with Bayes’ theorem; and (iii) they are inaccurate and biased. Thus, their use as a tool for clinical decision-making might cause more harm than benefit.

Conclusions

We have shown that doctors’ estimations of prior probabilities based on a written case scenario show marked subadditivity. The impact of cognitive biases affecting a physician’s probability assessment requires further investigation and confirmation in other settings.

In order for the threshold approach to be applicable, explicit tools for estimating the pre-test probability need to be used, and reliable sources of estimates of disease prevalence and presentation must be available in the clinical setting, taking into consideration the way doctors think. There is some evidence that probability estimation can be learned.18 Several tools to accomplish this end have been suggested: published likelihood ratios calculated for clinical symptoms and signs3 can help the physician calculate the pre-test probability more accurately; clinical prediction rules25 guide evidence-based management without the need for direct estimation of probabilities; and computer programs26 may be able to translate clinical findings into statistically meaningful data. Combining these means in a user-friendly way may help physicians and patients benefit more from the rapidly growing knowledge we have on diseases, tests and therapies.
