The term predictive value has often been used as a synonym for the
posttest probability. Unfortunately, clinicians commonly misinterpret
reported predictive values as intrinsic measures of test accuracy rather
than calculated probabilities. Studies of diagnostic test performance
compound the confusion by calculating predictive values from the
same sample used to measure sensitivity and specificity. Such calculations are misleading unless the test is applied subsequently to populations with exactly the same disease prevalence. For these reasons, the
term predictive value is best avoided in favor of the more descriptive
posttest probability following a positive or a negative test result.
The nomogram version of Bayes’ rule (Fig. 4-2) helps us to understand, at a conceptual level, how the rule estimates the posttest probability of
disease. In this nomogram, the impact of the diagnostic test result is
summarized by the likelihood ratio, which is defined as the ratio of
the probability of a given test result (e.g., “positive” or “negative”) in a
patient with disease to the probability of that result in a patient without
disease, thereby providing a measure of how well the test distinguishes
those with from those without disease.
The likelihood ratio for a positive test is calculated as the ratio of the
true-positive rate to the false-positive rate (or sensitivity/[1 – specificity]).
For example, a test with a sensitivity of 0.90 and a specificity of 0.90
has a likelihood ratio of 0.90/(1 – 0.90), or 9. Thus, for this hypothetical test, a “positive” result is 9 times more likely in a patient with the
disease than in a patient without it. Most tests in medicine have likelihood ratios for a positive result between 1.5 and 20. Higher values
are associated with tests that more substantially increase the posttest
likelihood of disease. A very high likelihood ratio positive (>10) usually
implies high specificity, so a positive high specificity test helps “rule
in” disease (the “SpPin” mnemonic introduced earlier). If sensitivity is
excellent but specificity is less so, the likelihood ratio positive will be
reduced substantially (e.g., with a 90% sensitivity but a 55% specificity,
the likelihood ratio positive is 2.0).
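To make the arithmetic concrete, the likelihood ratio for a positive result can be computed directly from sensitivity and specificity. The sketch below is illustrative only (the function name is ours) and simply reproduces the two examples in the text.

```python
def positive_likelihood_ratio(sensitivity: float, specificity: float) -> float:
    """LR+ = true-positive rate / false-positive rate = sensitivity / (1 - specificity)."""
    return sensitivity / (1.0 - specificity)

# Hypothetical test with 90% sensitivity and 90% specificity
print(positive_likelihood_ratio(0.90, 0.90))  # ~9
# Same sensitivity, but specificity of only 55%
print(positive_likelihood_ratio(0.90, 0.55))  # ~2
```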
The corresponding likelihood ratio for a negative test is the ratio of the
false-negative rate to the true-negative rate (or [1 – sensitivity]/specificity).
[Figure 4-1 graphic: ROC curves labeled “Good,” “Fair,” and “No predictive value,” plotted as true-positive rate (vertical axis) versus false-positive rate (horizontal axis).]
FIGURE 4-1 Each receiver operating characteristic (ROC) curve illustrates a tradeoff that occurs between improved test sensitivity (accurate detection of patients
with disease) and improved test specificity (accurate detection of patients without
disease), as the test value defining when the test turns from “negative” to “positive”
is varied. A 45° line would indicate a test with no predictive value (true-positive
rate = false-positive rate at every test value). The area under each ROC curve is a measure of
the information content of the test. Thus, a larger ROC area signifies increased
diagnostic accuracy.
[Figure 4-2 graphic: nomogram with columns for pretest probability (%), likelihood ratio, and posttest probability (%).]
FIGURE 4-2 Nomogram version of Bayes’ theorem used to predict the posttest probability of disease (right-hand scale)
using the pretest probability of disease (left-hand scale) and the likelihood ratio for a positive or a negative test (middle
scale). See text for information on calculation of likelihood ratios. To use, place a straightedge connecting the pretest
probability and the likelihood ratio and read off the posttest probability. The right-hand part of the figure illustrates the
value of a positive exercise treadmill test (likelihood ratio 4, green line) and a positive exercise thallium single-photon
emission CT perfusion study (likelihood ratio 9, broken yellow line) in a patient with a pretest probability of coronary
artery disease of 50%. (Adapted from Centre for Evidence-Based Medicine: Likelihood ratios. Available at http://www.
cebm.net/likelihood-ratios/.)
Lower likelihood ratio negative values more substantially lower the
posttest likelihood of disease. A very low likelihood ratio negative
(falling below 0.10) usually implies high sensitivity, so a negative
high sensitivity test helps “rule out” disease (the SnNout mnemonic).
The hypothetical test considered above with a sensitivity of 0.9 and a
specificity of 0.9 would have a likelihood ratio for a negative test result
of (1 – 0.9)/0.9, or 0.11, meaning that a negative result is about one-tenth as likely in patients with disease as in those without disease
(or about 10 times more likely in those without disease than in those
with disease).
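The corresponding calculation for a negative result follows the same pattern; again, this is only an illustrative sketch with a function name of our choosing.

```python
def negative_likelihood_ratio(sensitivity: float, specificity: float) -> float:
    """LR- = false-negative rate / true-negative rate = (1 - sensitivity) / specificity."""
    return (1.0 - sensitivity) / specificity

# The hypothetical test with 90% sensitivity and 90% specificity
print(round(negative_likelihood_ratio(0.90, 0.90), 2))  # 0.11
```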
■ APPLICATIONS TO DIAGNOSTIC TESTING IN CAD
Consider two tests commonly used in the diagnosis of CAD: an exercise treadmill and an exercise single-photon emission CT (SPECT)
myocardial perfusion imaging test (Chap. 241). A positive treadmill
ST-segment response has an average sensitivity of ~60% and an average
specificity of ~75%, yielding a likelihood ratio positive of 2.4 (0.60/
[1 – 0.75]) (consistent with modest discriminatory ability because it
falls between 2 and 5). For a 41-year-old man with nonanginal pain and
a 10% pretest probability of CAD, the posttest probability of disease
after a positive result rises to only ~30%. For a 60-year-old woman with
typical angina and a pretest probability of CAD of 80%, a positive test
result raises the posttest probability of disease to ~95%.
In contrast, the exercise SPECT myocardial perfusion test is more accurate for the diagnosis of CAD. For simplicity, assume that the finding of a
reversible exercise-induced perfusion defect has both a sensitivity and
a specificity of 90% (a bit higher than
reported), yielding a likelihood ratio for
a positive test of 9.0 (0.90/[1 – 0.90])
(consistent with intermediate discriminatory ability because it falls between
5 and 10). For the same 10% pretest
probability patient, a positive test raises
the probability of CAD to 50% (Fig.
4-2). However, despite the differences in
posttest probabilities between these two
tests (30 vs 50%), the more accurate test
may not improve diagnostic likelihood
enough to change patient management
(e.g., decision to refer to cardiac catheterization) because the more accurate
test has only moved the physician from
being fairly certain that the patient
did not have CAD to a 50:50 chance
of disease. In a patient with a pretest
probability of 80%, exercise SPECT test
raises the posttest probability to 97%
(compared with 95% for the exercise
treadmill). Again, the more accurate test
does not provide enough improvement
in posttest confidence to alter management, and neither test has improved
much on what was known from clinical
data alone.
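The nomogram in Fig. 4-2 is a graphical form of the odds version of Bayes’ rule: posttest odds = pretest odds × likelihood ratio. A minimal sketch of that calculation, reproducing the SPECT example above (likelihood ratio 9 applied to pretest probabilities of 10% and 80%), is shown below; the function name is ours.

```python
def posttest_probability(pretest_prob: float, likelihood_ratio: float) -> float:
    """Odds form of Bayes' rule: posttest odds = pretest odds * likelihood ratio."""
    pretest_odds = pretest_prob / (1.0 - pretest_prob)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1.0 + posttest_odds)

print(posttest_probability(0.10, 9.0))  # 0.50 -> 50% posttest probability of CAD
print(posttest_probability(0.80, 9.0))  # ~0.97 -> 97%
```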
In general, positive results with an
accurate test (e.g., likelihood ratio for
a positive test of 10) when the pretest
probability is low (e.g., 20%) do not
move the posttest probability to a range
high enough to rule in disease (e.g.,
80%). In screening situations, pretest
probabilities are often particularly low
because patients are asymptomatic. In
such cases, specificity becomes especially important. For example, in screening first-time female blood donors
without risk factors for HIV, a positive
test raised the likelihood of HIV to only
67% despite a specificity of 99.995%
because the prevalence was 0.01%. Conversely, with a high pretest
probability, a negative test may not rule out disease adequately if it is
not sufficiently sensitive. Thus, the largest change in diagnostic likelihood following a test result occurs when the clinician is most uncertain
(i.e., pretest probability between 30 and 70%). For example, in patients
with a pretest probability for CAD of 50%, a positive exercise treadmill test moves the posttest probability to 80% and a positive exercise
SPECT perfusion test moves it to 90% (Fig. 4-2).
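The HIV screening example above can be reproduced with the same arithmetic expressed in terms of prevalence. For illustration, the sketch below assumes a test sensitivity of essentially 100% (a value not stated in the text); with a prevalence of 0.01% and a specificity of 99.995%, the posttest probability after a positive result comes out to about two-thirds.

```python
def posttest_probability_from_prevalence(prevalence: float, sensitivity: float,
                                         specificity: float) -> float:
    """Posttest probability after a positive result (positive predictive value)."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)

# Prevalence 0.01%, specificity 99.995%, sensitivity assumed ~100% for illustration
print(posttest_probability_from_prevalence(0.0001, 1.00, 0.99995))  # ~0.67
```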
As presented above, Bayes’ rule employs a number of important
simplifications that should be considered. First, few tests provide only
“positive” or “negative” results. Many tests have multidimensional outcomes (e.g., extent of ST-segment depression, exercise duration, and
exercise-induced symptoms with exercise testing). Although Bayes’
theorem can be adapted to this more detailed test result format, it
is computationally more complex to do so. Similarly, when multiple
sequential tests are performed, the posttest probability may be used
as the pretest probability to interpret the second test. However, this
simplification assumes conditional independence—that is, that the
results of the first test do not affect the likelihood of the second test
result—and this is often not true.
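If conditional independence is assumed to hold, sequential testing amounts to multiplying the pretest odds by each likelihood ratio in turn. The sketch below is illustrative only; the function name is ours, and the likelihood ratios are the treadmill and SPECT values cited earlier.

```python
def sequential_posttest_probability(pretest_prob: float, likelihood_ratios: list) -> float:
    """Chain test results by multiplying the odds by each likelihood ratio in turn.

    Valid only if the tests are conditionally independent given disease status.
    """
    odds = pretest_prob / (1.0 - pretest_prob)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)

# Illustrative: pretest probability 20%, then a positive treadmill test (LR ~2.4)
# followed by a positive SPECT perfusion study (LR ~9)
print(sequential_posttest_probability(0.20, [2.4, 9.0]))  # ~0.84
```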
Finally, many texts assert that sensitivity and specificity are
prevalence-independent parameters of test accuracy. This statistically
useful assumption, however, is often incorrect. A treadmill exercise
test, for example, has a sensitivity of ~30% in a population of patients
with one-vessel CAD, whereas its sensitivity in patients with severe
three-vessel CAD approaches 80%. Thus, the best estimate of sensitivity
to use in a particular decision may vary, depending on the severity of
disease in the local population. A hospitalized, symptomatic, or referral
population typically has a higher prevalence of disease and, in particular, a higher prevalence of more advanced disease than does an outpatient population. Consequently, test sensitivity will likely be higher in
hospitalized patients and test specificity higher in outpatients.
■ STATISTICAL PREDICTION MODELS
Bayes’ rule, when used as presented above, is useful in studying diagnostic testing concepts, but predictions based on multivariable statistical models can more accurately address these more complex problems
by simultaneously accounting for additional relevant patient characteristics. In particular, these models explicitly account for multiple, even
possibly overlapping, pieces of patient-specific information and assign
a relative weight to each on the basis of its unique independent contribution to the prediction in question. For example, a logistic regression
model to predict the probability of CAD ideally considers all the relevant independent factors from the clinical examination and diagnostic
testing and their relative importance instead of the limited data that
clinicians can manage in their heads or with Bayes’ rule. However,
despite this strength, prediction models are usually too complex computationally to use without a calculator or computer. Guideline-driven
treatment recommendations based on statistical prediction models
available online, e.g., the American College of Cardiology/American
Heart Association risk calculator for primary prevention with statins
and the CHA2DS2-VASc calculator for anticoagulation for atrial fibrillation, have generated more widespread usage. When electronic health records (EHRs) will provide sufficient platform support to allow routine use of predictive models in clinical practice, and thereby increase their impact on clinical encounters and outcomes, remains uncertain.
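To make the general form of such models concrete, the sketch below shows how a logistic regression model converts weighted patient characteristics into a predicted probability. The coefficients, variable names, and patient values are purely hypothetical and do not correspond to any validated CAD model.

```python
import math

def logistic_probability(intercept: float, coefficients: dict, patient: dict) -> float:
    """Predicted probability = 1 / (1 + exp(-(intercept + sum(beta_i * x_i))))."""
    linear = intercept + sum(beta * patient[name] for name, beta in coefficients.items())
    return 1.0 / (1.0 + math.exp(-linear))

# Purely hypothetical coefficients and patient values for a CAD-probability model
coefficients = {"age_decades": 0.5, "male_sex": 0.8, "typical_angina": 1.6, "diabetes": 0.7}
patient = {"age_decades": 6.0, "male_sex": 1.0, "typical_angina": 1.0, "diabetes": 0.0}
print(logistic_probability(-5.0, coefficients, patient))  # ~0.60 (hypothetical)
```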
One reason for limited clinical use is that, to date, only a handful
of prediction models have been validated sufficiently (for example,
Wells criteria for pulmonary embolism; Table 4-2). The importance
of independent validation in a population separate from the one used
to develop the model cannot be overstated. An unvalidated prediction
model should be viewed with the skepticism appropriate for any new
drug or medical device that has not had rigorous clinical trial testing.
When statistical survival models in cancer and heart disease have
been compared directly with clinicians’ predictions, the survival models have been found to be more consistent, as would be expected, but
not always more accurate. On the other hand, comparison of clinicians
with websites and apps that generate lists of possible diagnoses to
help patients with self-diagnosis found that physicians outperformed
the currently available programs. For students and less-experienced
clinicians, the biggest value of diagnostic decision support may be in
extending diagnostic possibilities and triggering “rational override,”
but their impact on knowledge, information-seeking, and problem-solving needs additional research.
FORMAL DECISION SUPPORT TOOLS
■ DECISION SUPPORT SYSTEMS
Over the past 50 years, many attempts have been made to develop
computer systems to aid clinical decision-making and patient management. Conceptually, computers offer several levels of potentially
useful support for clinicians. At the most basic level, they provide
ready access to vast reservoirs of information, which may, however, be
quite difficult to sort through to find what is needed. At higher levels,
computers can support care management decisions by making accurate
predictions of outcome, or can simulate the whole decision process,
and provide algorithmic guidance. Computer-based predictions using
Bayesian or statistical regression models inform a clinical decision but
do not actually reach a “conclusion” or “recommendation.” Machine
learning methods are being applied to pattern recognition tasks such
as the examination of skin lesions and the interpretation of x-rays.
Artificial intelligence (AI) systems attempt to simulate or replace
human reasoning with a computer-based analogue. Natural language
processing allows the system to access and process large amounts of
data, both from the EHR and from the medical literature. To date, such
approaches have achieved only limited success. The most prominent
example, IBM’s Watson program, introduced publicly in 2011, has
yet to produce persuasive evidence of clinical decision support utility.
Reminder or protocol-directed systems do not make predictions but
use existing algorithms, such as guidelines or appropriate utilization criteria, to direct clinical practice. In general, however, decision
support systems have so far had little impact on practice. Reminder
systems built into EHRs have shown the most promise, particularly in
correcting drug dosing and promoting adherence to guidelines. Checklists may also help avoid or reduce errors.
■ DECISION ANALYSIS
Compared with the decision support methods discussed earlier,
decision analysis represents a normative prescriptive approach to
decision-making in the face of uncertainty. Its principal application
is in complex decisions. For example, public health policy decisions
often involve trade-offs in length versus quality of life, benefits versus
resource use, population versus individual health, and uncertainty
regarding efficacy, effectiveness, and adverse events as well as values or
preferences regarding mortality and morbidity outcomes.
One recent analysis using this approach involved the optimal
screening strategy for breast cancer, which has remained controversial,
in part because a randomized controlled trial to determine when to
begin screening and how often to repeat screening mammography is
impractical. In 2016, the National Cancer Institute–sponsored Cancer
Intervention and Surveillance Modeling Network (CISNET) examined eight
strategies differing by whether to initiate mammography screening at
age 40, 45, or 50 years and whether to screen annually, biennially, or
annually for women in their forties and biennially thereafter (hybrid).
The six simulation models found biennial strategies to be the most
efficient for average-risk women. Biennial screening for 1000 women
from age 50–74 years versus no screening avoided seven breast cancer
deaths. Screening annually from age 40–74 years avoided three additional deaths but required 20,000 additional mammograms and yielded
1988 more false-positive results. Factors that influenced the results included baseline risk: in patients with a 2–4-fold higher risk of developing breast cancer, annual screening from age 40–74 years yielded benefits similar to those of biennial screening of average-risk women from age 50–74 years. For average-risk patients
with moderate or severe comorbidities, screening could be stopped
earlier, at age 66–68 years.
This analysis involved six models that reproduced epidemiologic
trends and a screening trial result, accounted for advances in digital technology and treatment, and considered quality of life, risk factors, breast
density, and comorbidity. It provided novel insights into a public health
problem in the absence of a randomized clinical trial and helped weigh
the pros and cons of such a health policy recommendation. Although
such models have been developed for selected clinical problems, their
benefit and application to individual real-time clinical management
has yet to be demonstrated.
TABLE 4-2 Wells Clinical Prediction Rule for Pulmonary Embolism (PE)
CLINICAL FEATURE                                                POINTS
Clinical signs of deep-vein thrombosis                          3
Alternative diagnosis is less likely than PE                    3
Heart rate >100 beats/min                                       1.5
Immobilization ≥3 days or surgery in previous 4 weeks           1.5
History of deep-vein thrombosis or pulmonary embolism           1.5
Hemoptysis                                                      1
Malignancy (with treatment within 6 months) or palliative       1
INTERPRETATION
Score >6.0       High
Score 2.0–6.0    Intermediate
Score <2.0       Low
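A direct translation of Table 4-2 into code might look like the sketch below; the feature names are ours, but the point values and cutoffs are taken from the table.

```python
WELLS_POINTS = {
    "clinical_signs_of_dvt": 3.0,
    "pe_more_likely_than_alternative": 3.0,
    "heart_rate_over_100": 1.5,
    "immobilization_or_recent_surgery": 1.5,
    "prior_dvt_or_pe": 1.5,
    "hemoptysis": 1.0,
    "malignancy": 1.0,
}

def wells_category(findings: set) -> tuple:
    """Total the points for the findings present and map the score to a risk category."""
    score = sum(WELLS_POINTS[f] for f in findings)
    if score > 6.0:
        return score, "High"
    if score >= 2.0:
        return score, "Intermediate"
    return score, "Low"

# Example: tachycardia plus a history of DVT -> score 3.0, intermediate probability of PE
print(wells_category({"heart_rate_over_100", "prior_dvt_or_pe"}))
```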
DIAGNOSIS AS AN ELEMENT OF QUALITY
OF CARE
High-quality medical care begins with accurate diagnosis. The incidence of diagnostic errors has been estimated by a variety of methods
including postmortem examinations, medical record reviews, and
medical malpractice claims, with each yielding complementary but
different estimates of this quality of care patient-safety problem. In the
past, diagnostic errors tended to be viewed as a failure of individual
clinicians. The modern view is that they mostly reflect deficiencies in the system of care. Current estimates suggest that nearly everyone will experience at least one diagnostic error in their lifetime, leading to mortality, morbidity, unnecessary tests and procedures, costs, and anxiety.
Solutions to the “diagnostic errors as a system of care” problem
have focused on system-level approaches, such as decision support
and other tools integrated into EHRs. The use of checklists has been
proposed as a means of reducing some of the cognitive errors discussed
earlier in the chapter, such as premature closure. While checklists have
been shown to be useful in certain medical contexts, such as operating
rooms and intensive care units, their value in preventing diagnostic
errors that lead to patient adverse events remains to be shown.
EVIDENCE-BASED MEDICINE
Clinical medicine is defined traditionally as a practice combining medical knowledge (including scientific evidence), intuition, and judgment
in the care of patients (Chap. 1). Evidence-based medicine (EBM)
updates this construct by placing much greater emphasis on the processes by which clinicians gain knowledge of the most up-to-date and
relevant clinical research to determine for themselves whether medical
interventions alter the disease course and improve the length or quality
of life. The phrase “evidence-based medicine” is now used so often and
in so many different contexts that many practitioners are unaware of
its original meaning. The intention of the EBM program, as described
in the early 1990s by its founding proponents at McMaster University,
becomes clearer through an examination of its four key steps:
1. Formulating the management question to be answered
2. Searching the literature and online databases for applicable research
data
3. Appraising the evidence gathered with regard to its validity and
relevance
4. Integrating this appraisal with knowledge about the unique aspects
of the patient (including the patient’s preferences about the possible
outcomes)
The process of searching the world’s research literature and appraising the quality and relevance of studies can be time-consuming and
requires skills and training that most clinicians do not possess. In a
busy clinical practice, the work required is also logistically not feasible.
This has led to a focus on finding recent systematic overviews of the
problem in question as a useful shortcut in the EBM process. Systematic reviews are regarded by some as the highest level of evidence in the
EBM hierarchy because they are intended to comprehensively summarize the available evidence on a particular topic. To avoid the potential
biases found in narrative review articles, predefined reproducible
explicit search strategies and inclusion and exclusion criteria seek to
find all of the relevant scientific research and grade its quality. The prototype for this kind of resource is the Cochrane Database of Systematic
Reviews. When appropriate, a meta-analysis is used to quantitatively
summarize the systematic review findings (discussed further below).
Unfortunately, systematic reviews are not uniformly the acme of
the EBM process they were initially envisioned to be. In select circumstances, they can provide a much clearer picture of the state of
the evidence than is available from any individual clinical report, but
their value is less clear when only a few trials are available, when trials
and observational studies are mixed, or when the evidence base is only
observational. They cannot compensate for deficiencies in the underlying research available, and many are created without the requisite
clinical insights. The medical literature is now flooded with systematic
reviews of varying quality and clinical utility. The peer review system
has, unfortunately, not proved to be an effective arbiter of quality of
these papers. Therefore, systematic reviews should be used with circumspection in conjunction with selective reading of some of the best
empirical studies.
■ SOURCES OF EVIDENCE: CLINICAL TRIALS AND
REGISTRIES
The notion of learning from observation of patients is as old as medicine itself. Over the past 50 years, physicians’ understanding of how
best to turn raw observation into useful evidence has evolved considerably. Medicine has received a hard refresher lesson in this process
from the COVID-19 pandemic. Starting in the spring of 2020, case reports, personal and institutional anecdotal experience, and small single-center case series started appearing in the peer-reviewed literature
and within months turned into a flood of confusing and often contradictory evidence. Observational reports of treatments for COVID-19
fueled the confusion. Despite >40,000 publications appearing in the
first 7 months of the pandemic, an enormous amount of uncertainty
around prevention, diagnosis, treatment, and prognosis of the disease remained. Many of the early 2020 publications were either small
observational series or reviews of published series, neither of which
can resolve the key uncertainties clinicians need to address in caring
for these patients. These small observational studies often have substantial limitations in validity and generalizability, and although they
may generate important hypotheses or be the first reports of adverse
events or therapeutic benefit, they have no role in formulating modern
standards of practice. The major tools used to develop reliable evidence
consist of randomized clinical trials supplemented strategically by large
(high-quality) observational registries. A registry or database typically
is focused on a disease or syndrome (e.g., different types of cancer,
acute or chronic CAD, pacemaker capture, or chronic heart failure), a
clinical procedure (e.g., bone marrow transplantation, coronary revascularization), or an administrative process (e.g., claims data used for
billing and reimbursement).
By definition, in observational data, the investigator does not control patient care. Carefully collected prospective observational data,
however, can at times achieve a level of evidence quality approaching
that of major clinical trial data. At the other end of the spectrum, data
collected retrospectively (e.g., chart review) are limited in form and
content to what previous observers recorded and may not include the
specific research data being sought (e.g., claims data). Advantages of
observational data include the inclusion of a broader population, as encountered in practice, than is typically represented in clinical trials, whose inclusion and exclusion criteria are often restrictive. In addition,
observational data provide primary evidence for research questions
when a randomized trial cannot be performed. For example, it would
be difficult to randomize patients to test diagnostic or therapeutic
strategies that are unproven but widely accepted in practice, and it
would be unethical to randomize based on sex, racial/ethnic group,
socioeconomic status, or country of residence or to randomize patients
to a potentially harmful intervention, such as smoking or deliberately
overeating to develop obesity.
A well-done prospective observational study of a particular management strategy differs from a well-done randomized clinical trial most
importantly by its lack of protection from treatment selection bias.
The use of observational data to compare diagnostic or therapeutic
strategies assumes that sufficient uncertainty and heterogeneity exists
in clinical practice to ensure that similar patients will be managed
differently by diverse physicians. In short, the analysis assumes that a
sufficient element of randomness (in the sense of disorder rather than
in the formal statistical sense) exists in clinical management. In such
cases, statistical models attempt to adjust for important imbalances
to “level the playing field” so that a fair comparison among treatment
options can be made. When management is clearly not random (e.g.,
all eligible left main CAD patients are referred for coronary bypass
surgery), the problem may be too confounded (biased) for statistical
correction, and observational data may not provide reliable evidence.
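One commonly used way to attempt this “leveling of the playing field” (the text does not specify a particular method) is adjustment for measured confounders with a propensity score, the modeled probability of receiving a treatment given a patient’s characteristics. The sketch below, with invented data and using scikit-learn, shows the idea with inverse-probability weighting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented example data: covariates are age and disease severity; treated = 1 if treated
X = np.array([[55, 1], [63, 2], [70, 3], [48, 1], [66, 2], [72, 3]])
treated = np.array([0, 1, 1, 0, 1, 0])
outcome = np.array([1, 1, 0, 1, 1, 0])  # 1 = favorable outcome

# Propensity score: modeled probability of receiving treatment given the covariates
propensity = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Inverse-probability weights attempt to balance the two groups on measured covariates
weights = np.where(treated == 1, 1 / propensity, 1 / (1 - propensity))
treated_mean = np.average(outcome[treated == 1], weights=weights[treated == 1])
control_mean = np.average(outcome[treated == 0], weights=weights[treated == 0])
print(treated_mean - control_mean)  # weighted outcome difference (illustrative only)
```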