Greene T (University of Utah VA), Rubin M (University of Utah VA), Nebeker J (University of Utah VA), Sauer B (University of Utah VA), Samore M (University of Utah VA), Leecaster M (University of Utah VA)
Objectives:
Clinical classifications made by experts are widely used as a gold standard in health services research, adverse event surveillance, and performance monitoring. Typically, these expert assessments are used to estimate the accuracy of diagnosis or outcome classifications made by other raters or assigned by computable algorithms. When the available information is limited, the gold standard itself is fallible, and conventional estimates of sensitivity and specificity are biased. We propose a new method for evaluating the performance of diagnostic tests in which experts provide numerical estimates of the probability that cases are positive for the condition of interest.
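To illustrate the bias introduced by a fallible gold standard, consider the standard result for an imperfect reference (this illustration and its symbols, such as Se_T and Se_R, are not part of the abstract and assume the test and the reference err independently given true status): with prevalence p, test sensitivity and specificity Se_T and Sp_T, and reference sensitivity and specificity Se_R and Sp_R, the apparent sensitivity computed against the reference is

    \mathrm{Se}_{\mathrm{app}} = \Pr(T^{+} \mid R^{+}) = \frac{\mathrm{Se}_T\,\mathrm{Se}_R\,p + (1-\mathrm{Sp}_T)(1-\mathrm{Sp}_R)(1-p)}{\mathrm{Se}_R\,p + (1-\mathrm{Sp}_R)(1-p)},

which equals Se_T only when Se_R = Sp_R = 1; an analogous expression governs the apparent specificity.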
Methods:
We define the true probability that a case is positive for a condition as the fraction of cases with similar evidence that have the condition. Given a test sample, we propose a statistical model with two components: one relates the true probabilities of the condition to the estimated probabilities provided by each of two or more experts, and the other relates the true probabilities to dichotomous classifications provided by one or more raters. The true probabilities are treated as realizations of a beta-distributed latent variable. The model accounts for fallibility in the estimated probabilities through the between-expert variation in those estimates. Using simulated data, we compare estimates of false-positive and false-negative rates obtained with the new method to conventional estimates based on dichotomous expert classifications of the presence or absence of the condition.
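As a concrete sketch of the simulation comparison (not the authors' code; the sample size, Beta parameters, rater error rates, and logit-scale noise model are illustrative assumptions), the Python snippet below draws beta-distributed latent case probabilities, simulates a dichotomous rater and a noisy expert probability estimate, and contrasts conventional error-rate estimates computed against the dichotomized expert call with the true rates computed against the latent status.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative data-generating step (all parameter values are assumptions)
    n = 5000                                   # number of cases
    p_true = rng.beta(0.8, 1.6, size=n)        # latent probability each case is positive
    y = rng.binomial(1, p_true)                # latent true status of each case

    # Rater: dichotomous classification with fixed error rates (assumed values)
    fn_rate, fp_rate = 0.10, 0.05
    rater = np.where(y == 1,
                     rng.binomial(1, 1 - fn_rate, n),   # sensitivity = 1 - FN rate
                     rng.binomial(1, fp_rate, n))       # 1 - specificity = FP rate

    # Expert: noisy probability estimate on the logit scale, then dichotomized at 0.5
    logit = np.log(p_true / (1 - p_true))
    expert_prob = 1 / (1 + np.exp(-(logit + rng.normal(0, 0.7, n))))
    expert_dichot = (expert_prob >= 0.5).astype(int)

    # Conventional estimates treat the dichotomized expert call as the gold standard
    fn_conv = np.mean(rater[expert_dichot == 1] == 0)
    fp_conv = np.mean(rater[expert_dichot == 0] == 1)

    # True error rates, computed against the latent status y
    fn_true = np.mean(rater[y == 1] == 0)
    fp_true = np.mean(rater[y == 0] == 1)

    print(f"FN rate: conventional {fn_conv:.3f} vs true {fn_true:.3f}")
    print(f"FP rate: conventional {fp_conv:.3f} vs true {fp_true:.3f}")

Fitting the proposed latent-variable model to the experts' numerical estimates is not shown; the sketch only reproduces the data-generating setup and the conventional comparison described above.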
Results:
Under many scenarios, estimates of false-positive and false-negative rates deviate from the true values by more than twofold when rater performance is evaluated by conventional methods using dichotomous expert classifications. Bias is significantly reduced when the proposed model is fit to the experts’ numerical estimates of the disease probabilities, provided those estimates are approximately median-unbiased.
Implications:
When evidence for the presence of a medical condition is ambiguous, some of the difficulties associated with the use of a fallible gold standard may be addressed by substituting numerical probability estimates for dichotomous classifications.
Impacts:
Improved methods for quantifying uncertainty in gold standards may provide a better understanding of the accuracy of classification procedures used in health services research and other applications.