## Abstract

Multireader multicase (MRMC) variance analysis has become widely utilized to analyze observer studies for which the summary measure is the area under the receiver operating characteristic (ROC) curve. We extend MRMC variance analysis to binary data and also to generic study designs in which every reader may not interpret every case. A subset of the fundamental moments central to MRMC variance analysis of the area under the ROC curve (AUC) is found to be required. Through multiple simulation configurations, we compare our unbiased variance estimates to naïve estimates across a range of study designs, average percent correct, and numbers of readers and cases.

© 2007 Optical Society of America

## 1. INTRODUCTION

The study of image quality often involves the use of psychophysical studies to evaluate an imaging system, or perhaps as a validation of model observer predictions for circumstances new to that model observer. Studies involving human readers are also central to the evaluation of new imaging technologies for which there is no alternative to the use of clinical images from actual patients. Just as important as the mean performance of the observer is the uncertainty of the measurement.

Previous publications have presented methods for the analysis of the uncertainty in the summary measure of observer performance using the multireader multicase (MRMC) paradigm, mainly in the context of analyzing the area under the receiver operating characteristic (ROC) curve [[1], [2], [3], [4], [5]] and a “fully crossed” study design, where every reader reads every case. The data analyzed in these publications are typically the matrix of ROC scores obtained from each reader for each case.

In this paper we present an unbiased method for estimating the variance in an experiment with multiple readers and multiple cases for which the outcomes are binary and the summary performance measure is a percent correct (PC). We also extend the analysis beyond the fully crossed study design to allow arbitrary study designs, including the “doctor–patient” study design, where each doctor sees his or her own patients.

Some examples of PCs are sensitivity, specificity, and the PC in an *M*-alternative forced-choice (MAFC) experiment. Sensitivity is the percent of abnormals correctly identified, and specificity is the percent of normals correctly identified. We shall also refer to the abnormals as the signal-present cases (hypothesis 1, ${H}_{1}$), and the normals as the signal-absent cases (hypothesis 0, ${H}_{0}$).

In an MAFC experiment, the reader must choose which of *M*-alternatives within a trial contains the signal. So, in the typical two-alternative forced-choice (2AFC) task a trial is often a pair of images, one signal-absent and one signal-present, displayed side by side or in sequence. The outcome of the choice is binary; the reader is either right or wrong. The rate at which the reader correctly picks the alternative with the signal is the PC.

Regardless of the specific task, readers, and cases, we denote the binary success outcome generically by $s(g,\gamma )$, where *g* specifies the case and *γ* specifies the reader. This success outcome is 0 when reader *γ* incorrectly identifies case *g* and 1 when the reader is successful.

In a particular study, there is a set of ${N}_{g}$ cases and a set of ${N}_{\gamma}$ readers. Without replicating readings, we could collect ${N}_{g}\times {N}_{\gamma}$ outcomes if every reader reads every case (the fully crossed design). For the doctor–patient study design, depicted pictorially on the left of Fig. 1, some of these data are not collected. The shaded area in Fig. 1 indicates which cases were read by which readers. Since each case is read by only one reader, a substantial amount of data is missing compared to the fully crossed design, which would fill the whole matrix. Additionally, we allow the number of cases, or “case load,” read by each reader to be different.

On the right in Fig. 1 we provide a simple example demonstrating the data from a binary-outcome experiment with multiple readers, each reading their own cases. The PC in the last row weighs each reading equally: 100 correct decisions divided by 130 readings is 77%. Now one might assume that the readings are all independent and identically distributed (iid) and estimate the standard error as the square root of the sample variance divided by the total number of readings; this equals 3.7%. However, since each reader may have a different skill at the task, the readings are not identically distributed and this naïve estimate likely underestimates the true variance.

Instead of calculating the average performance as in Fig. 1, one might average the three reader-specific PCs, yielding $(88+73+20)/3=60$, which is noticeably different from the previous average performance. One might continue and estimate the standard error as the square root of the sample variance of the three reader-specific PCs divided by the number of readers, yielding 20.6%. This result is more than five times the previous underestimate but, in reality, probably overestimates the true variance. This overestimate arises because the reader-specific PCs are noisy realizations of the true PCs.
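The arithmetic behind the two naïve standard errors can be checked with a few lines of Python. This is only a sketch using the summary numbers quoted in the text (100 correct out of 130 readings; reader PCs of 88%, 73%, and 20%). It assumes the pooled estimate uses the Bernoulli form $p(1-p)/N$; the source may divide by $N-1$ instead, which does not change the rounded result.

```python
import math
import statistics

# Summary values quoted in the text for the Fig. 1 example.
n_correct, n_readings = 100, 130
reader_pcs = [0.88, 0.73, 0.20]  # reader-specific PCs

# Naive estimate 1: treat every reading as an iid Bernoulli trial
# (assumed form p(1-p)/N).
pc_pooled = n_correct / n_readings
se_pooled = math.sqrt(pc_pooled * (1 - pc_pooled) / n_readings)

# Naive estimate 2: treat the reader PCs as error-free iid samples.
pc_readers = statistics.mean(reader_pcs)
se_readers = math.sqrt(statistics.variance(reader_pcs) / len(reader_pcs))

print(f"pooled PC = {100 * pc_pooled:.0f}%, SE = {100 * se_pooled:.1f}%")
print(f"reader-averaged PC = {100 * pc_readers:.0f}%, SE = {100 * se_readers:.1f}%")
```

The two printed standard errors correspond to the 3.7 and 20.6 quoted above.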

This simple example highlights two naïve estimates of variance. The first incorrectly treats the readings as identically distributed, and the second incorrectly treats the reader PCs as being measured without error. The variance estimate that we provide appropriately accounts for the readers, cases, and correlations that arise from the actual study design. These variance estimates apply to the average PC when readers are treated equally or when readings are treated equally.

In what follows, we make the following assumptions: Readers are iid, cases are also iid, and readers are independent of cases. Additionally, given a reader and a case, an outcome can be deterministic, as when the reader is a mathematical classifier, or an outcome can be a random variable, as might be expected when the reader is a human and unable to reproduce the same decision on subsequent readings (reader jitter). This distinction is unnecessary for the current work; our variance estimate accounts for reader jitter whether it exists or not.

## 2. THEORY AND METHODS

#### 2A. Setup

We define a design matrix *D* and a success matrix *S*. Both matrices are ${N}_{g}\times {N}_{\gamma}$; their elements are denoted ${d}_{ir}$ and ${s}_{ir}$, where *i* stands for the $i\text{th}$ case and *r* for the $r\text{th}$ reader. The design matrix holds a one in every position where an outcome was collected and a zero everywhere else. The success matrix holds the observed success outcomes ${s}_{ir}=s({g}_{i},{\gamma}_{r})$. For the $r\text{th}$ reader, we denote the number of cases read by ${N}_{g\mid r}={\sum}_{i=1}^{{N}_{g}}{d}_{ir}$ and the PC by

$${\widehat{p}}_{r}=\frac{1}{{N}_{g\mid r}}\sum_{i=1}^{{N}_{g}}{d}_{ir}{s}_{ir}.$$

When an outcome is not collected, ${d}_{ir}=0$ and ${s}_{ir}$ is technically undefined. In practice, we can set ${s}_{ir}$ to any number we want when ${d}_{ir}=0$, since it will always appear with ${d}_{ir}$ and the product will always be zero. Therefore, to ease the transition to ensemble statistics, we think of ${s}_{ir}$ as the success outcome whether or not it was collected in the study.

We shall assume that the design matrix does not depend on the success matrix and vice versa, as such dependencies would certainly bias the study. In this paper we shall consider fixed study designs and random study designs. For a fixed study design, *D* is specified before data are collected; for a random study design, there is a protocol, or sampling scheme, that determines a distribution for the possible study designs.

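The typical endpoint in a study is a reader-averaged PC:

$$\widehat{P}=\sum_{r=1}^{{N}_{\gamma}}{w}_{r}{\widehat{p}}_{r},$$

where ${w}_{r}$ is the weight given to the PC of the $r\text{th}$ reader.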

While this average appears trivial, there is a choice to be made about how to average, that is, how to weigh each reader. Two common choices exist for the doctor–patient study design, as mentioned in the Introduction: weigh each reader equally $({w}_{r}=1/{N}_{\gamma})$ or weigh each reading equally $({w}_{r}={N}_{g\mid r}/{N}_{g})$. We denote the resulting PCs as ${\widehat{P}}_{\gamma}$ and ${\widehat{P}}_{g}$, respectively. Now, when cases are read by more than one reader, the total number of readings is more than ${N}_{g}$. Considering this situation, a more general expression for the second set of weights is ${w}_{r}={N}_{g\mid r}/{\sum}_{{r}^{\prime}=1}^{{N}_{\gamma}}{N}_{g\mid {r}^{\prime}}$. These weights always sum to one. Of course, if each reader reads the same number of cases, ${\widehat{P}}_{g}={\widehat{P}}_{\gamma}$, whereas if the case load of each reader is random, the weights of ${\widehat{P}}_{g}(\ne {\widehat{P}}_{\gamma})$ will also be random.

Other choices for weights may be driven by the experience or skill of each reader. In the most general framework the weights are arbitrary, as long as they sum to one.
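The setup above maps directly onto array operations. The following is a minimal sketch, assuming NumPy and a small made-up doctor–patient design (six cases, two readers); the function names `reader_pcs` and `reader_averaged_pc` are illustrative and do not come from the paper.

```python
import numpy as np

def reader_pcs(D, S):
    """Reader-specific PCs: collected successes over collected readings."""
    n_g_given_r = D.sum(axis=0)              # case load of each reader
    return (D * S).sum(axis=0) / n_g_given_r

def reader_averaged_pc(D, S, weights):
    """Weighted average of the reader PCs; weights are assumed to sum to one."""
    return float(np.dot(weights, reader_pcs(D, S)))

# Hypothetical design: rows are cases, columns are readers, each case read once.
D = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [0, 1], [0, 1]])
S = np.array([[1, 0], [1, 0], [1, 0], [0, 0], [0, 1], [0, 0]])

n_g_given_r = D.sum(axis=0)
w_reader = np.full(D.shape[1], 1 / D.shape[1])   # weigh each reader equally
w_reading = n_g_given_r / n_g_given_r.sum()      # weigh each reading equally

P_gamma = reader_averaged_pc(D, S, w_reader)
P_g = reader_averaged_pc(D, S, w_reading)
print(P_gamma, P_g)
```

With unequal case loads the two averages differ; weighing each reading equally reproduces the pooled fraction of correct readings.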

#### 2B. Population Quantities

### 2B1. Fixed Study Designs

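The mean of $\widehat{P}$ for a fixed study design *D* is straightforward:

$$\langle \widehat{P}\mid D\rangle =\sum_{r=1}^{{N}_{\gamma}}\frac{{w}_{r}}{{N}_{g\mid r}}\sum_{i=1}^{{N}_{g}}{d}_{ir}\langle {s}_{ir}\rangle =\langle s(g,\gamma )\rangle ,$$

where $\langle \cdot \mid D\rangle$ indicates that we fix *D* and average over the remaining random quantities: the readers and cases. The expected reader-averaged PC, as is shown above, has no dependence on the study design or the reader weights.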

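Next, carefully accounting for possible correlations across readers and cases (see Appendix A), the population variance of $\widehat{P}$ for a fixed study design is

$${V}_{\mid D}=\mathrm{var}(\widehat{P}\mid D)={c}_{1}\langle s{(g,\gamma )}^{2}\rangle +{c}_{4}\langle {\langle s(g,\gamma )\mid \gamma \rangle}^{2}\rangle +{c}_{5}\langle {\langle s(g,\gamma )\mid g\rangle}^{2}\rangle +{c}_{8}{\langle s(g,\gamma )\rangle}^{2},\tag{4}$$

where the coefficients depend only on the design matrix and the reader weights.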

The unique numbering of the coefficients above is driven by how we label the moments. We refer to the moments in Eq. (4) as ${M}_{1}$, ${M}_{4}$, ${M}_{5}$, and ${M}_{8}$ to coincide with notation previously derived for the empirical area under the ROC curve (AUC) [[4], [5]]. For AUC, there are eight fundamental moments of the success outcomes. The factor of 2 increase in the number of moments comes from partitioning cases into two subsets: signal-absent and signal-present.

The variance can be written concisely as a scalar product between the coefficients and the moments arranged in vectors $\underline{c}$ and $\underline{M}$; that is, ${V}_{\mid D}={\underline{c}}^{t}\underline{M}$, where coefficients ${c}_{2}$, ${c}_{3}$, ${c}_{6}$, and ${c}_{7}$ are all understood to equal zero. This variance will carry a subscript *γ* or *g* when needed to indicate weights treating each reader equally or weights treating each reading equally. The moments themselves are nothing more than second moments (${M}_{1}$ through ${M}_{7}$) and a mean squared $\left({M}_{8}\right)$, as are expected in a variance. Finally, we shall extend this notation to include ${M}_{0}=\langle s(g,\gamma )\rangle$, the success outcome averaged over reader *γ* and case *g*.
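To make the scalar product concrete, the sketch below computes the four nonzero coefficients from a design matrix and reader weights and then forms $V_{\mid D}$. The coefficient formulas are an assumption here: they are worked out from the four-part split of the quadruple sum described in Appendix A, not copied from the paper's equations. The function name `variance_coefficients` and the example moments are illustrative.

```python
import numpy as np

def variance_coefficients(D, w):
    """Coefficients (c1, c4, c5, c8) for a fixed design D and reader weights w.

    Assumed forms, derived from splitting the squared sum into
    same/different reader and same/different case terms.
    """
    D = np.asarray(D, dtype=float)
    w = np.asarray(w, dtype=float)
    n = D.sum(axis=0)                    # cases read by each reader
    a = w / n                            # weight carried by one reading
    shared = D.T @ D                     # shared[r, r'] = cases read by both
    pair_w = np.outer(a, a)
    off = ~np.eye(len(w), dtype=bool)    # reader pairs with r != r'

    c1 = np.sum(w**2 / n)                                    # same reader, same case
    c4 = np.sum(w**2 * (n - 1) / n)                          # same reader, different cases
    c5 = np.sum((pair_w * shared)[off])                      # different readers, same case
    c8 = np.sum((pair_w * (np.outer(n, n) - shared))[off]) - 1.0  # all different, minus mean squared
    return np.array([c1, c4, c5, c8])

# Fully crossed example: 4 cases, 3 readers, equal reader weights.
c = variance_coefficients(np.ones((4, 3)), np.full(3, 1 / 3))

# Hypothetical moments with no reader or case effects: M4 = M5 = M8 = p^2.
p = 0.8
M = np.array([p, p * p, p * p, p * p])
V = float(c @ M)
print(c, V)
```

In this effect-free example the variance collapses to the binomial value $p(1-p)/(N_g N_\gamma)$, which is a convenient sanity check on the coefficients.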

The simple form of the variance expression in Eq. (4) hides complexity that comes with all the different possible study designs and weights. It is worthwhile to see how the variance of $\widehat{P}$ is related to the variances of the reader-specific ${\widehat{p}}_{r}$. In general,

$${V}_{\mid D}=\sum_{r=1}^{{N}_{\gamma}}{w}_{r}^{2}\,\mathrm{var}({\widehat{p}}_{r}\mid D)+\sum_{r=1}^{{N}_{\gamma}}\sum_{{r}^{\prime}\ne r}{w}_{r}{w}_{{r}^{\prime}}\,\mathrm{cov}({\widehat{p}}_{r},{\widehat{p}}_{{r}^{\prime}}\mid D),\tag{5}$$

where $\mathrm{var}({\widehat{p}}_{r}\mid D)$ is the variance you get when you select a random reader and a random set of ${N}_{g\mid r}$ cases from the entire population. Since readers are sampled from a common population, this variance does not depend on any particular reader. This variance depends only on the number of cases read, which can be different for each reader depending on the study design. The covariance of ${\widehat{p}}_{r},{\widehat{p}}_{{r}^{\prime}}$ has a stronger dependence on the study design since it considers two readers.

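In Appendix A we derive the variance and covariance appearing in Eq. (5). The single-reader variance is

$$\mathrm{var}({\widehat{p}}_{r}\mid D)=\frac{1}{{N}_{g\mid r}}{M}_{1}+\frac{{N}_{g\mid r}-1}{{N}_{g\mid r}}{M}_{4}-{M}_{8}.$$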

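The covariance for the general study design simplifies for the fully crossed and doctor–patient study designs. The covariance for the fully crossed study design is Eq. (A13) minus the mean squared, or

$$\mathrm{cov}({\widehat{p}}_{r},{\widehat{p}}_{{r}^{\prime}}\mid D)=\frac{1}{{N}_{g}}{M}_{5}+\frac{{N}_{g}-1}{{N}_{g}}{M}_{8}-{M}_{8}=\frac{{M}_{5}-{M}_{8}}{{N}_{g}}.$$

For the doctor–patient study design, no two readers share a case, so the covariance is zero.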

### 2B2. Special Cases and Random Study Designs

The vector of coefficients for a fixed study design is made up of complicated sums that simplify for the study designs considered in this paper (see Table 1). If we allow the study design to be random (with some distribution), we get the variance of $\widehat{P}$ by averaging the coefficients of the fixed study design over the distribution of study designs. This is possible because we assume the design and success matrices are independent. When averaged over the distribution of study designs, the variance is no longer dependent, or conditional, on a fixed study, and the subscript $\mid D$ should be dropped.

#### 2C. Variance Estimates

### 2C1. Fixed Study Design

Expressing the fixed-study-design variance of $\widehat{P}$ as a linear combination of moments, as described in the previous section, leads to the unbiased moment estimator that we present here,

The weights for each pair of observations are analogous to the weights used for the average performance: Each case (or pair of cases) is given equal weight for each reader, and the readers are given the same (relative) weights as before. In theory, the weights could be different from those for $\widehat{P}$; however, there does not seem to be a good reason to make them different. As for $\widehat{P}$ and ${V}_{\mid D}$, we add the subscript *γ* or *g* to ${\widehat{V}}_{\mid D}$ when necessary to indicate whether the weights equally weigh each reader or each reading.

In situations where two readers $r,{r}^{\prime}$ have nonoverlapping case samples, the denominator in ${\widehat{M}}_{5}$ can be zero. But at the same time, the numerator will be zero as well. In these situations the $r,{r}^{\prime}$ contribution to ${\widehat{M}}_{5}$ is taken to be zero. Consequently, for the doctor–patient study design, where readers never read the same cases, ${\widehat{M}}_{5}$ is entirely zero.

When replacing the expected values with sums for estimation there are two things to remember: Avoid biases and count the number of samples that are being summed. The elements of the design matrix are an easy way to count the number of samples that are being summed.

Biases creep in when we replace a squared average with a squared sum. To avoid the bias, replace the squared average with two sums and do not include the index of the first sum in the second sum. For example, the estimate of ${M}_{5}$ squares the average over readers for a fixed case. When replacing this squared average with sums over *r* and ${r}^{\prime}$, we do not let ${r}^{\prime}$ equal *r*. We also normalize the weights in the sum over ${r}^{\prime}$ so that they sum to one. The result can be shown to be unbiased with standard algebraic and probabilistic manipulations.
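The recipe in the preceding paragraphs (count samples with the design matrix, never pair an index with itself, and renormalize the weights over ${r}^{\prime}$) can be sketched in code. This is one possible implementation consistent with that description, not a transcription of the paper's estimator equations; the exact weighting used there is an assumption here, and `moment_estimates` is an illustrative name. It requires every reader to read at least two cases.

```python
import numpy as np

def moment_estimates(D, S, w):
    """Estimates of (M1, M4, M5, M8) from design D, successes S, reader weights w."""
    D = np.asarray(D, dtype=float)
    w = np.asarray(w, dtype=float)
    X = np.where(D > 0, np.asarray(S, dtype=float), 0.0)  # ignore uncollected outcomes
    n = D.sum(axis=0)            # cases read by each reader (needs n >= 2)
    tot = X.sum(axis=0)          # successes per reader
    sq = (X ** 2).sum(axis=0)    # sum of squared outcomes per reader

    m1 = np.sum(w * sq / n)
    # Same reader, two different cases: squared sum minus the i == i' terms.
    m4 = np.sum(w * (tot ** 2 - sq) / (n * (n - 1)))

    m5 = m8 = 0.0
    n_readers = len(w)
    for r in range(n_readers):
        for rp in range(n_readers):
            if rp == r:                          # never pair a reader with itself
                continue
            rel = w[r] * w[rp] / (1.0 - w[r])    # weights over r' renormalized to sum to one
            n_same = D[:, r] @ D[:, rp]          # cases read by both readers
            same = X[:, r] @ X[:, rp]
            if n_same > 0:                       # no shared cases: contribution taken as zero
                m5 += rel * same / n_same
            n_diff = n[r] * n[rp] - n_same       # pairs of different cases
            if n_diff > 0:
                m8 += rel * (tot[r] * tot[rp] - same) / n_diff
    return np.array([m1, m4, m5, m8])

# Small fully crossed example: 3 cases, 2 readers, equal weights.
M_hat = moment_estimates(np.ones((3, 2)), [[1, 1], [1, 0], [0, 0]], [0.5, 0.5])
print(M_hat)
```

The variance estimate is then the scalar product of the design coefficients with these moment estimates. Because one coefficient is negative, the result can be negative for small samples.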

Unfortunately, our moment-based MRMC variance estimate is not necessarily positive. It is a linear combination of sums of squares, where one coefficient, ${c}_{8}$, is negative. The possibility of negative estimates is a consequence of estimating variances with sums of squares and too few samples. Bayesian and maximum-likelihood estimates could avoid negative estimates, but that approach is beyond the scope of the nonparametric treatment of this paper.

### 2C2. Random Study Design

The only change needed to account for random study designs is to replace $\underline{c}$ with an estimate of $\langle \underline{c}\rangle$. One estimate of $\langle \underline{c}\rangle$ is just the observed $\underline{c}$ itself, which would not be an actual change of the fixed-study-design variance estimator. Other estimates of $\langle \underline{c}\rangle$ would require priors on the distribution of possible study designs. For this manuscript, we shall investigate the fixed-study-design estimator and consider other estimators at a later date.

### 2C3. Naïve Estimates

As a basis for comparison, we consider the two naïve estimates described in the Introduction. Neither accounts for the MRMC nature of the data, but both have been used in the literature. The first estimate essentially assumes that all the readings are iid, indirectly assuming that readers all have the same skill and are reading different cases. Given this assumption, the success outcomes are all independent Bernoulli trials with the same probability of success, and the variance of the reader-averaged PC is estimated as

The second estimate uses the sample variance of the reader-specific PCs:

#### 2D. Simulation

### 2D1. Model

We shall utilize the Monte Carlo (MC) simulation scheme developed by Roe and Metz [[6]] to investigate the variance estimates presented above in a 2AFC experiment. This simulation scheme assumes that a reader generates two scores $\left({t}_{0ir},{t}_{1ir}\right)$ for each case, where a case represents a signal-absent and signal-present pair of alternatives. If the score of the signal-absent alternative is lower than the score of the signal-present alternative, the success outcome for the case is one; otherwise, it is zero:

$${s}_{ir}=s({t}_{1ir}-{t}_{0ir})=\begin{cases}1, & {t}_{1ir}>{t}_{0ir},\\ 0, & \text{otherwise}.\end{cases}$$

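The model for the scores is a sum of Gaussian random variables:

$$\begin{aligned}{t}_{0ir}&={\left[R\right]}_{0r}+{\left[C\right]}_{0i}+{\left[RC\right]}_{0ir},\\ {t}_{1ir}&={\mu}_{t}+{\left[R\right]}_{1r}+{\left[C\right]}_{1i}+{\left[RC\right]}_{1ir}.\end{aligned}$$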

Here, ${t}_{0ir}$ and ${t}_{1ir}$ are the $r\text{th}$ reader’s scores for the signal-absent and signal-present alternatives of the $i\text{th}$ case. Except for ${\mu}_{t}$, which indicates the separation between the two score distributions, the terms in ${t}_{0ir},{t}_{1ir}$ are independent zero-mean Gaussian random variables that we refer to as the reader effect $\left({\sigma}_{R}^{2}\right)$, the case effect $\left({\sigma}_{C}^{2}\right)$, and the reader/case interaction effect $\left({\sigma}_{RC}^{2}\right)$. We shall follow the convenient constraint used by Roe and Metz on the sum of the variances of the random effects such that

$${\sigma}_{R}^{2}+{\sigma}_{C}^{2}+{\sigma}_{RC}^{2}=1.$$

With such a simple description for the scores, we can characterize the distribution of the PC to second order. First, the $r\text{th}$ reader’s skill averaged over all cases is

$${p}_{r}=\langle s({t}_{1ir}-{t}_{0ir})\mid r\rangle ,$$

where $s({t}_{1ir}-{t}_{0ir})$ equals one for ${t}_{1ir}>{t}_{0ir}$, and zero otherwise. Since we have fixed the reader, the remaining randomness in ${t}_{1ir}-{t}_{0ir}$ is the sum of two case terms and two reader/case terms. Since all these terms are independent, ${t}_{1ir}-{t}_{0ir}$ given ${\left[R\right]}_{1r}-{\left[R\right]}_{0r}$ is a Gaussian random variable with mean ${\mu}_{t}+{\left[R\right]}_{1r}-{\left[R\right]}_{0r}$ and variance $2{\sigma}_{C}^{2}+2{\sigma}_{RC}^{2}$. Therefore,

$${p}_{r}=\Phi \left(\frac{{\mu}_{t}+{\left[R\right]}_{1r}-{\left[R\right]}_{0r}}{\sqrt{2{\sigma}_{C}^{2}+2{\sigma}_{RC}^{2}}}\right),$$

where *Φ* is the cumulative distribution function (cdf) of the standard normal. Furthermore, since the only randomness in this last expression comes from the currently fixed reader effects ${\left[R\right]}_{1r}-{\left[R\right]}_{0r}$, the cdf of reader skill is given by

The first option is to (numerically) calculate the average over the two independent reader components [Eq. (21)], letting *τ* go to infinity. The second option starts over, eliminating the condition on *r* in Eq. (20). Noticing that ${t}_{1ir}-{t}_{0ir}$ is simply a Gaussian with variance two centered on ${\mu}_{t}$,

$${M}_{0}=\langle s({t}_{1ir}-{t}_{0ir})\rangle =\Phi \left(\frac{{\mu}_{t}}{\sqrt{2}}\right).$$

This leaves ${M}_{4}$ and ${M}_{5}$ as the remaining second-order moments unaccounted for in this problem. Without a familiar probability density function (pdf) for ${p}_{r}$, the only option we found for calculating these moments is through numerical integration. The integral expressions for ${M}_{4}$ and ${M}_{5}$ are

### 2D2. Simulation Configurations

The relevant parameters for the simulation are listed in Table 2. We vary all the simulation parameters in a factorial design, yielding $3\times 3\times 3\times 3\times 3\times 3=729$ total configurations. For each of these, we run 10,000 MC trials. Compared to the simulation parameters of Roe and Metz [[6]], we consider a broader range of reader variance $\left({\sigma}_{R}^{2}\right)$ for the scores, especially on the high end. The range they considered was 1%–10% of the total; our range is 5%–83%.
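A single MC trial of the score model in Section 2D1 might be generated as in the sketch below. It assumes NumPy, a fully crossed design, and independent reader, case, and reader/case terms for the two alternatives, as the text describes. The parameter values are illustrative only and are not taken from Table 2; `simulate_success_matrix` is a hypothetical name.

```python
import numpy as np

def simulate_success_matrix(n_cases, n_readers, mu_t, var_r, var_c, var_rc, rng):
    """One MC trial: fully crossed 2AFC success matrix from Gaussian scores."""
    def scores(mean):
        # Independent reader, case, and reader/case interaction terms.
        reader = rng.normal(0.0, np.sqrt(var_r), size=(1, n_readers))
        case = rng.normal(0.0, np.sqrt(var_c), size=(n_cases, 1))
        inter = rng.normal(0.0, np.sqrt(var_rc), size=(n_cases, n_readers))
        return mean + reader + case + inter

    t0 = scores(0.0)     # signal-absent alternative
    t1 = scores(mu_t)    # signal-present alternative
    return (t1 > t0).astype(int)

# Illustrative values: the variances sum to one and mu_t = 2.5 gives a PC near 0.96.
rng = np.random.default_rng(0)
S = simulate_success_matrix(500, 200, mu_t=2.5, var_r=0.1, var_c=0.3, var_rc=0.6, rng=rng)
print(S.shape, S.mean())
```

For a study design that is not fully crossed, the same matrix can be multiplied elementwise by the design matrix so that only the collected outcomes enter the analysis.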

Another factor that we investigate is how the cases are distributed among the readers. We investigate six study designs with the expected number of cases read by each reader given in Table 2. Table 3 exemplifies the study designs with five readers and an average of 102 cases read by each reader. The first four of the study designs listed are doctor–patient study designs, the next is fully crossed, and the last has a unique hybrid structure that is neither fully crossed nor doctor–patient.

The first doctor–patient study design is flat; every reader reads the same number of cases. For the Poisson doctor–patient study design, the number of cases each reader reads is five cases plus a Poisson random variable with mean ${\overline{N}}_{g\mid r}-5$. For the uniform distributions, the number is selected from the interval $[5,2{\overline{N}}_{g\mid r}-5]$ for the broad distribution or $[0.5{\overline{N}}_{g\mid r},1.5{\overline{N}}_{g\mid r}]$ for the moderate distribution. These distributions force a minimum of five readings per reader.
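The four doctor–patient case-load distributions could be sampled as sketched below. The sketch assumes NumPy, an integer mean case load, and integer-valued uniform draws with both endpoints included; how the source rounds the interval endpoints is not stated, so that detail is an assumption. The function names and the design labels are illustrative.

```python
import numpy as np

def sample_case_loads(kind, n_readers, mean_load, rng):
    """Draw the number of cases read by each reader for one study design."""
    if kind == "flat":
        return np.full(n_readers, mean_load)
    if kind == "poisson":
        return 5 + rng.poisson(mean_load - 5, size=n_readers)
    if kind == "uniform_broad":
        return rng.integers(5, 2 * mean_load - 5, size=n_readers, endpoint=True)
    if kind == "uniform_moderate":
        low, high = mean_load // 2, mean_load + mean_load // 2  # assumed rounding
        return rng.integers(low, high, size=n_readers, endpoint=True)
    raise ValueError(f"unknown design: {kind}")

def doctor_patient_design(loads):
    """Design matrix in which every case is read by exactly one reader."""
    loads = np.asarray(loads)
    D = np.zeros((int(loads.sum()), len(loads)), dtype=int)
    start = 0
    for r, n in enumerate(loads):
        D[start:start + n, r] = 1    # reader r reads the next n cases
        start += n
    return D

rng = np.random.default_rng(1)
loads = sample_case_loads("poisson", 5, 102, rng)
D = doctor_patient_design(loads)
print(loads, D.shape)
```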

The final study design we consider is motivated by an observer study conducted by investigators at the National Cancer Institute. The observer study used a subset of images from the atypical squamous cells of undetermined significance (ASCUS) low-grade squamous intraepithelial lesion (LSIL) triage study known as ALTS [[7], [8]]. In that study a small subset of the cases were read by all the study colposcopists. The remaining cases were each read by three readers. Here we have a data set of ${N}_{\gamma}({\overline{N}}_{g\mid r}/3-3)$ cases. Each reader reads the first three cases of this data set; the remaining cases are each read by three randomly selected readers. The curious size of the data set is chosen so that the total number of readings for this study design is ${N}_{\gamma}\times {\overline{N}}_{g\mid r}$, the same total expected for the other study designs. We shall refer to this study design as the hybrid study design.

Finally, we consider both weighting methods mentioned above: equally weighing readers and equally weighing readings.

## 3. SIMULATION RESULTS AND DISCUSSION

In what follows, we compare our variance estimators to the truth, the population quantities. Our population quantities are calculated from the integral expressions in Eqs. (22, 23, 24) and the MC averages for the coefficients. So the truth still has an element of uncertainty in it; the expected values of the nonlinear coefficients are intractable.
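As an illustration of how such population moments could be computed numerically for the score model, the sketch below uses Gauss–Hermite quadrature from NumPy. The expressions for $M_4$ and $M_5$ are an assumption: they follow from fixing the reader (or the case) and averaging the squared conditional success probability, and they are not copied from the paper's integral expressions. The function names are illustrative, and the sketch assumes the variance sums in the denominators are positive.

```python
import math
import numpy as np

def phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gaussian_mean(f, variance, n_nodes=60):
    """E[f(Z)] for Z ~ N(0, variance) by Gauss-Hermite quadrature."""
    x, wts = np.polynomial.hermite.hermgauss(n_nodes)
    z = math.sqrt(2.0 * variance) * x
    return sum(wt * f(zi) for wt, zi in zip(wts, z)) / math.sqrt(math.pi)

def success_moments(mu_t, var_r, var_c, var_rc):
    """Success moments for the score model (assumed forms for M4 and M5)."""
    m0 = phi(mu_t / math.sqrt(2.0 * (var_r + var_c + var_rc)))
    # Fix the reader: the reader terms shift the mean by a N(0, 2*var_r) difference.
    skill = lambda d: phi((mu_t + d) / math.sqrt(2.0 * (var_c + var_rc)))
    m4 = gaussian_mean(lambda d: skill(d) ** 2, 2.0 * var_r)
    # Fix the case: the case terms shift the mean by a N(0, 2*var_c) difference.
    ease = lambda d: phi((mu_t + d) / math.sqrt(2.0 * (var_r + var_rc)))
    m5 = gaussian_mean(lambda d: ease(d) ** 2, 2.0 * var_c)
    # For binary outcomes s^2 = s, so M1 = M0; M8 is the mean squared.
    return {"M0": m0, "M1": m0, "M4": m4, "M5": m5, "M8": m0 ** 2}

# Illustrative parameter values (variances sum to one).
m = success_moments(mu_t=2.5, var_r=0.3, var_c=0.3, var_rc=0.4)
print(m)
```

Averaging the conditional reader skill with the same quadrature rule should reproduce $M_0=\Phi(\mu_t/\sqrt{2})$, which gives a quick consistency check between the two routes to the mean.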

To verify the integral expressions, we compare each population variance (from integration) to the sample variance of 10,000 independent MC performance estimates. Figure 2 gives a separate point for each of the 729 simulation configurations, 6 types of study designs, and both ways to weigh individual reader PCs. Across all these simulation configurations, which cover a broad range of variances, the maximum difference found was 6% and the mean was $-0.1\%$.

*Expected Variance*. Before we assess our estimates, it is worthwhile to show the variances expected from all the experiments. Figure 3 shows the population variances for all the high-PC (0.96) simulation configurations compared to the expected values (from MC averaging) of the naïve variance estimates. The expected values of our moment estimators are unbiased and thus equal the population variances. At the bottom of each column of plots, the *x* axis is labeled according to the size of the simulated experiment. The 27 different components of variance configurations are then explored within each experiment size according to the reader component of variance $\left({\sigma}_{R}^{2}\right)$. This sorting shows that the reader component of variance has a strong impact on the expected variance of the experiment. The size of the experiment also affects the experimental variance, though to a lesser degree. Additionally, the simulation configurations for lower PCs (0.86, 0.70; not shown) are quite similar to those given in Fig. 3 except that they are shifted upward. This behavior mimics the binomial variance, which increases with decreasing performance.

Interestingly, the overall scale across the different study designs is relatively constant for each experiment size. Recall that each study design has the same expected number of readings given the same experiment size. However, we can see that different study designs behave differently across different components-of-variance configurations.

Regarding the impact of reader weights, ${V}_{\gamma}=\mathrm{var}\left({\widehat{P}}_{\gamma}\right)$ lies on top of ${V}_{g}=\mathrm{var}\left({\widehat{P}}_{g}\right)$ in all the plots except for the broad uniform doctor–patient study design (Fig. 3d). In that plot, ${V}_{g}$ can differ from ${V}_{\gamma}$ by as much as $\pm 30\%$ (notice some dots peeking out from behind the solid curve). What this means is that the variance of the reader-averaged PC does not depend on the reader weights except when the reader case loads are very different.

Finally, in each plot the naïve estimates bracket the true MRMC variances: $\langle {\widehat{V}}_{\text{naive},\gamma}\rangle$ is biased high (the dotted curve upper bound) and $\langle {\widehat{V}}_{\text{naive},g}\rangle$ is biased low (the dashed curve lower bound). In the plots, ${\widehat{V}}_{\text{naive},\gamma}$ can be nine times the true variance, whereas ${\widehat{V}}_{\text{naive},g}$ can be as little as 2% of the true variance.

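*Root-mean-square error*. Here we assess the variance estimators with the relative root-mean-square error (RRMSE), or

$$\mathrm{RRMSE}=\frac{\sqrt{\langle {(\widehat{V}-V)}^{2}\rangle}}{V},$$

where $\widehat{V}$ is a variance estimate, $V$ is the corresponding population variance, and the average is taken over the MC trials.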

Figure 4 plots the RRMSE $(\times 100\%)$ for the fully crossed study design: Plot A shows the high PC (0.96), and Plot B shows the low PC (0.70). As for the previous plots, the *x* axis is labeled according to the size of the simulated experiment, while the different variance configurations are explored within each experiment size, sorted by the reader component of variance $\left({\sigma}_{R}^{2}\right)$. Recall that for the fully crossed study design, equally weighing readers is the same as equally weighing readings. Consequently, our MRMC variance estimators of ${\widehat{P}}_{\gamma}$ and ${\widehat{P}}_{g}$ are also equal: ${\widehat{V}}_{\gamma}={\widehat{V}}_{g}$.

We first point out that at high PC (Fig. 4a) and with only three readers, the RRMSE of our MRMC estimators runs above 100% (solid curve). Three readers are not enough to do the MRMC variance estimation, and as the reader component of variance increases, the estimator gets even noisier. In this regime, the naïve estimator ${\widehat{V}}_{\text{naive},g}$ appears to be performing fairly well (dashed curve), that is, until we recall how biased it is (Fig. 3e). The bias of ${\widehat{V}}_{\text{naive},\gamma}$, on the other hand, is driving the RRMSE to extreme values (dotted curve).

As the size of the experiment grows, the RRMSE of our MRMC estimator decreases, while that of ${\widehat{V}}_{\text{naive},g}$ does not. That is to say, our MRMC estimator improves with more data, while the naïve estimator cannot adapt to the overdispersive nature of the data. Nonetheless, even with ten readers, each reading 102 cases on average, our MRMC estimator has too much error when the PC and reader variability are high.

When the PC is lower (Fig. 4b), the estimation can be done with reasonable precision and accuracy. When there are ten readers and 50 cases in the experiment, the RRMSE of our MRMC estimator ranges between 20% and 40%.

For the broad uniform study design (Figs. 5a, 5b), the overall story is similar, but we now see a difference from the reader weights. In experiments with little data and high PC, the error estimating ${V}_{g}$ (dashed-dotted curve) is significantly larger than that estimating ${V}_{\gamma}$ (solid curve). However, for the experiments with adequate readers (ten) and moderate PC, where the RRMSE ranges between 30% and 60%, the difference in errors becomes negligible.

Finally, the RRMSE stories for the other study designs are similar to either that of the fully crossed or the broad uniform study designs. The hybrid study design, with its additional case correlations from cases being read by at least three readers, mimics the fully crossed study design. The other doctor–patient study designs mimic the broad uniform doctor–patient study design, although the differences between the RRMSEs for ${\widehat{V}}_{\gamma}$ and ${\widehat{V}}_{g}$ are not as pronounced.

In summary, the reader weights do not play a significant role in the total variance of the average PC except when the case loads are very different, as in the broad uniform study design. Additionally, it takes about ten readers and a moderate PC to reasonably estimate the MRMC variance. In this regime the error estimating ${V}_{\gamma}$ and ${V}_{g}$ is about the same. Finally, our MRMC estimator improves as more data are collected and performance is moderate; it is a consistent estimator. In contrast, the naïve estimators are not consistent; they do not get closer to the truth with more data.

## 4. CONCLUSIONS AND FUTURE WORK

We have presented a framework for estimating the variance of a binary-outcome experiment that appropriately accounts for readers and cases as random effects. This framework is based on the larger one developed for estimating the MRMC variance of AUC [[4], [5]] obtained according to a fully crossed study design. The MRMC variance of AUC has eight fundamental second-order moments of the success outcomes, whereas for the binary-outcome experiment there are only four fundamental moments. We have also generalized the framework to accommodate any MRMC random or fixed study design. A fully crossed study design is not required, though we have highlighted it and another special study design, the doctor–patient study design.

In addition to quantifying the uncertainty of the MRMC experiment conducted, the framework provided can be used to consider other study designs. For example, a small pilot study can be used to estimate the moments of the success outcomes. Then a larger pivotal study can be considered by simply changing the study design matrix, which will change the coefficients $\underline{c}$. This larger pivotal study does not even need to be of the same type as the pilot study, as long as the appropriate moments have been estimated.

We have examined our estimator with the MC simulation scheme developed by Roe and Metz [[6]]. This simulation was originally developed to investigate the Dorfman–Berbaum–Metz (DBM) linear-random-effects (components-of-variance) model of AUC [[1]] and has since served as a testbed for assessing other MRMC approaches [[9], [10], [11]]. Within our framework, we have also been able to derive integral expressions for numerically calculating the fundamental moments of the success outcomes for the Roe and Metz simulation. An extension of these results to the eight fundamental moments of the MRMC variance of AUC is available upon request from the author and is being drafted for publication. This result ties off a loose end that has been present since the simulation model was developed. For a short discussion showing how the success moments are related to the components of variance, see Appendix B.

The variance estimates presented are useful for the visual perception investigator performing clinical studies or human psychophysics experiments, as well as for the investigator developing models of the human or ideal observer. For the latter, the utility comes to bear when the model observer is estimated from a finite set of training cases. If another set of cases is obtained (same size), another estimate of the observer (same model) could be obtained. These two model-observer estimates can be thought of as samples from a population of readers. In this setting, an MRMC performance experiment can be run where we generate a sample of readers (trained on independent sets of cases) and a sample of testing cases (cases that are independent of the ones used for training any observer). Performing an MRMC variance analysis on this experiment will allow the investigator to account for the variability from training the model with a finite set of training cases and from testing the model with a finite set of testing cases. Such an accounting is essential to model development and is starting to be appreciated in the field of computer-aided diagnosis and detection of disease [[12]].

One direction for future work in this area is to estimate MRMC covariances. The method we presented in this paper generalizes easily to estimating covariances when the readers and cases are paired across two reading conditions or modalities. Simply replace the success matrix with a difference of success matrices and proceed as described for the single-modality MRMC variance analysis. These covariances can be used to quantify the statistical difference between the performance of a set of readers reading the same cases in two modalities, or the difference between two observer models.
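The substitution can be sketched as follows. This is a minimal illustration, assuming a readers-by-cases array layout, equal weighting of all observations, and hypothetical names (`paired_difference`, `design`); it only forms the difference matrix and the resulting difference in PC, and does not implement the variance estimator itself.

```python
import numpy as np

def paired_difference(success_a, success_b, design):
    """Return the difference matrix and the pooled difference in percent correct.

    success_a, success_b: (readers x cases) arrays of 0/1 success outcomes,
        one per modality, with readers and cases paired across modalities.
    design: (readers x cases) array of 0/1 flags, 1 where the reader read the
        case in both modalities (assumed layout, for illustration only).
    """
    success_a = np.asarray(success_a, dtype=float)
    success_b = np.asarray(success_b, dtype=float)
    design = np.asarray(design, dtype=float)
    diff = design * (success_a - success_b)  # entries are -1, 0, or +1
    delta_pc = diff.sum() / design.sum()     # equal weights: PC_A minus PC_B
    return diff, delta_pc

# Two readers, four cases, fully crossed in both modalities.
s_a = np.array([[1, 1, 0, 1], [1, 0, 1, 1]])
s_b = np.array([[1, 0, 0, 1], [0, 0, 1, 1]])
d = np.ones_like(s_a)
diff, delta_pc = paired_difference(s_a, s_b, d)
```

Unlike a success matrix, the difference matrix takes values in {-1, 0, +1}, so any analysis applied to it should not assume binary entries.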

Another direction for future work is to take the general study design concepts to AUC [[5]]. ROC scores are pooled across readers just as success outcomes are. The subsequent variance analyses and hypothesis tests typically do not account for the fact that the scores from several readers reading different cases are not identically distributed. For AUC, however, not only is the variance analysis wrong, but the actual pooled AUC can be quite different from the average reader AUC [[13]], especially when the readers use the ROC score axis differently.
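A small constructed example makes the gap between the pooled AUC and the average reader AUC concrete. The scores below are invented for illustration, and `empirical_auc` is a hypothetical helper implementing the usual Mann–Whitney form of the empirical AUC. Each reader separates his or her own cases perfectly, but the second reader uses a higher part of the score axis.

```python
import numpy as np

def empirical_auc(absent, present):
    """Empirical (Mann-Whitney) AUC: fraction of correctly ordered pairs, ties count 1/2."""
    absent = np.asarray(absent, dtype=float)
    present = np.asarray(present, dtype=float)
    diff = present[:, None] - absent[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# (signal-absent scores, signal-present scores) for each reader; readers read different cases.
readers = [
    ([1, 2], [3, 4]),  # reader 1 uses the low end of the score axis
    ([5, 6], [7, 8]),  # reader 2 uses the high end of the score axis
]

average_auc = float(np.mean([empirical_auc(a, p) for a, p in readers]))

pooled_absent = np.concatenate([a for a, _ in readers])
pooled_present = np.concatenate([p for _, p in readers])
pooled_auc = empirical_auc(pooled_absent, pooled_present)

print(average_auc, pooled_auc)  # 1.0 0.75
```

Pooling mixes reader 1's signal-present scores with reader 2's signal-absent scores, so 4 of the 16 pooled pairs are ordered incorrectly even though neither reader made an ordering error.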

## APPENDIX A: SECOND-MOMENT, FIXED STUDY DESIGN

Here we assume that the design matrix and weights are fixed, and we calculate the second moment of $\widehat{P}$. It is

The squared sum over readers and cases is a quadruple sum that we separate into four parts:

Since the readers and cases are iid, the moments in each line of the expression above do not depend on *r*, ${r}^{\prime}$ or *i*, ${i}^{\prime}$, which we define to coincide with notation previously derived for the empirical AUC [[4], [5]].

Given that the moments in Eq. (A2) are independent of the readers *r*, ${r}^{\prime}$ and cases *i*, ${i}^{\prime}$, we can see that the second moment is simply four moments weighted by four coefficients. The variance utilizes the same four coefficients, while subtracting 1 from the last coefficient to account for subtracting the mean squared from $\langle{\widehat{P}}^{2}\mid D\rangle$. Therefore, after some algebraic manipulations, the coefficients are

The general case expression in Eq. (A2) simplifies for the study designs considered in this paper (see Table 1). For the fully crossed study design, ${d}_{ir}$ always equals one, so sums over all *i* equal ${N}_{g}$ and sums over ${i}^{\prime}\ne i$ equal ${N}_{g}-1$. For doctor–patient study designs, readers never read the same cases, so the sum over *i* of ${d}_{ir}{d}_{i{r}^{\prime}}$ always equals zero, and the sum over *i* and ${i}^{\prime}\ne i$ of ${d}_{ir}{d}_{{i}^{\prime}{r}^{\prime}}$ equals ${N}_{g\mid r}{N}_{g\mid {r}^{\prime}}$.
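These design-matrix sums are easy to check numerically. The sketch below is illustrative only: it assumes a cases-by-readers array `design[i, r]` holding ${d}_{ir}$, and the helper name `pair_sums` and the small example designs are hypothetical.

```python
import numpy as np

def pair_sums(design, r, r_prime):
    """For two distinct readers, return
    (sum_i d_ir d_ir', sum_i sum_{i' != i} d_ir d_i'r')."""
    col_r = design[:, r].astype(float)
    col_rp = design[:, r_prime].astype(float)
    same_case = float(np.dot(col_r, col_rp))       # cases read by both readers
    all_pairs = float(col_r.sum() * col_rp.sum())  # every (i, i') pair, including i' == i
    return same_case, all_pairs - same_case        # drop the i' == i terms

n_cases, n_readers = 6, 3

# Fully crossed: every reader reads every case.
crossed = np.ones((n_cases, n_readers), dtype=int)

# Doctor-patient: reader 0 reads cases 0-2, reader 1 reads cases 3-4, reader 2 reads case 5.
doctor_patient = np.zeros((n_cases, n_readers), dtype=int)
doctor_patient[0:3, 0] = 1
doctor_patient[3:5, 1] = 1
doctor_patient[5:6, 2] = 1

print(pair_sums(crossed, 0, 1))         # (6.0, 30.0)
print(pair_sums(doctor_patient, 0, 1))  # (0.0, 6.0)
```

For the fully crossed design the results are ${N}_{g}=6$ and ${N}_{g}({N}_{g}-1)=30$. For the doctor–patient design they are zero and ${N}_{g\mid r}{N}_{g\mid {r}^{\prime}}=3\times 2=6$, in agreement with the simplifications stated above.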

It is also handy to derive the expected value of

## APPENDIX B: COMPONENTS OF VARIANCE

In this section we relate our moment decomposition of the variance given in Eq. (4) to a components-of-variances (CofVs) decomposition [[1], [2], [14]]. We begin by considering the distribution of reader skill; some readers are better than others. The skill of a reader is the success outcome for a given reader *γ* averaged over all cases in the population, or

*γ* reading a random set of ${N}_{g\mid \gamma}$ cases. Specifically,

Likewise, we consider the distribution of case difficulty. The case difficulty is the success outcome for a given case *g* averaged over all readers in the population, or

Instead of the development above, the DBM model starts by decomposing the performance into three random effects:

where *G* denotes a set of cases; $\overline{\beta}$ denotes the average performance; ${\beta}_{\gamma}$ is a random effect accounting for reader skill; ${\beta}_{G}$ is a random effect accounting for the difficulty of the case set; and ${\beta}_{G\gamma}$ quantifies two random effects, a possible reader-case interaction and reader jitter. The interaction and reader jitter effects are inseparable if there are no repeated readings. All the random effects are assumed to be independent zero-mean Gaussian random variables. The corresponding reader CofV is identical to ${\sigma}_{\gamma}^{2}$, and the case CofV equals ${\sigma}_{g}^{2}$ scaled per case set, or ${\sigma}_{g}^{2}/{N}_{g\mid r}$.

The variance of the interaction term is not immediately obvious because it depends on the study design, that is, on how the readers and cases are sampled and combined in the summary performance statistic. We can determine this variance by starting with the total variance and organizing it according to reciprocal powers of ${N}_{\gamma}$, ${N}_{g}$, much as is done in the work of Barrett *et al.* [[15], [16]]. For the fully crossed study design, we have

## REFERENCES

**1.** D. D. Dorfman, K. S. Berbaum, and C. E. Metz, “Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method,” Invest. Radiol. **27**, 723–731 (1992).

**2.** S. V. Beiden, R. F. Wagner, and G. Campbell, “Components-of-variance models and multiple-bootstrap experiments: an alternative method for random-effects, receiver operating characteristic analysis,” Acad. Radiol. **7**, 341–349 (2000).

**3.** N. A. Obuchowski, S. V. Beiden, K. S. Berbaum, S. L. Hillis, H. Ishwaran, H. H. Song, and R. F. Wagner, “Multireader, multicase receiver operating characteristic analysis: an empirical comparison of five methods,” Acad. Radiol. **11**, 980–995 (2004).

**4.** B. D. Gallas, “One-shot estimate of MRMC variance: AUC,” Acad. Radiol. **13**, 353–362 (2006).

**5.** B. D. Gallas and D. G. Brown, “Reader studies for validation of CAD systems,” submitted to Neural Networks.

**6.** C. A. Roe and C. E. Metz, “Dorfman–Berbaum–Metz method for statistical analysis of multireader, multimodality receiver operating characteristic (ROC) data: validation with computer simulation,” Acad. Radiol. **4**, 298–303 (1997).

**7.** M. Schiffman and M. E. Adrianza, “ASCUS-LSIL triage study: design, methods and characteristics of trial participants,” Acta Cytol. **44**, 726–742 (2000).

**8.** J. Jeronimo, L. S. Massad, and M. Schiffman, “Visual appearance of the uterine cervix: correlation with human papillomavirus detection and type,” Am. J. Obstet. Gynecol. **97**, 47.e1–47.e8 (2007).

**9.** S. L. Hillis and K. S. Berbaum, “Monte Carlo validation of the Dorfman–Berbaum–Metz method using normalized pseudovalues and less data-based model simplification,” Acad. Radiol. **12**, 1534–1541 (2005).

**10.** S. L. Hillis, N. A. Obuchowski, K. M. Schartz, and K. S. Berbaum, “A comparison of the Dorfman–Berbaum–Metz and Obuchowski–Rockette methods for receiver operating characteristic (ROC) data,” Stat. Med. **24**, 1579–1607 (2005).

**11.** X. Song and X.-H. Zhou, “A marginal model approach for analysis of multi-reader multi-test receiver operating characteristic (ROC) data,” Biostatistics **6**, 303–312 (2005).

**12.** W. A. Yousef, R. F. Wagner, and M. H. Loew, “Assessing classifiers from two independent data sets using ROC analysis: a nonparametric approach,” IEEE Trans. Pattern Anal. Mach. Intell. **28**, 1809–1817 (2006).

**13.** M. S. Pepe, *The Statistical Evaluation of Medical Tests for Classification and Prediction* (Oxford U. Press, 2003).

**14.** C. A. Roe and C. E. Metz, “Variance-component modeling in the analysis of receiver operating characteristic (ROC) index estimates,” Acad. Radiol. **4**, 587–600 (1997).

**15.** H. H. Barrett, M. A. Kupinski, and E. Clarkson, “Probabilistic Foundations of the MRMC Method,” Proc. SPIE **5749**, 21–31 (2005).

**16.** E. Clarkson, M. A. Kupinski, and H. H. Barrett, “A probabilistic model for the MRMC method. Part 1. theoretical development,” Acad. Radiol. **13**, 1410–1421 (2006).