
Benchmarking Rater Agreement Indices: Statistical Properties and Power Analysis / Vanacore, Amalia; Pellegrino, Maria Sole. - (2017), pp. 100-100.

Benchmarking Rater Agreement Indices: Statistical Properties and Power Analysis

Vanacore, Amalia (Methodology); Pellegrino, Maria Sole
2017

Abstract

This paper presents a critical review of some kappa-type indices proposed in the literature to measure the degree of rater agreement. Single measures of agreement provide only limited information and do not account for statistical uncertainty; thus, following recommended guidelines for reporting agreement studies, we present the agreement indices together with their confidence intervals. The magnitude of each estimated agreement coefficient is related to the notion of extent of agreement by comparing the lower limit of its confidence interval against a benchmark scale. Specifically, we explore the case of agreement among series of ratings referring to n items classified into k ordered categories by different raters (i.e., inter-rater agreement) or, equivalently, by the same rater on different occasions (i.e., intra-rater agreement). The reviewed indices are Gwet’s AC2 and the linearly weighted variants of Scott’s Pi coefficient, Cohen’s Kappa and the Brennan-Prediger statistic. In order to evaluate the statistical behavior of the reviewed indices and of a non-parametric benchmarking procedure, a Monte Carlo simulation study was conducted for several scenarios differing in sample size, rating-scale dimension and agreement level. Estimation precision is evaluated in terms of relative bias, variance and the coverage rate of the percentile bootstrap confidence interval, whereas the effectiveness of the benchmarking procedure is assessed in terms of statistical power. Simulation results suggest that the analyzed indices have satisfactory estimation precision, which improves as n, k and the agreement level increase, and a coverage rate close to the nominal level only for n ≥ 30; the benchmarking procedure is generally adequately powered in testing both null and non-null cases of rater agreement and can thus be suitably applied to characterize agreement over a small or moderate number of subjective ratings provided by one or more raters.
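
As a purely illustrative sketch of the procedure described in the abstract (not the authors' implementation), the Python code below estimates one of the reviewed indices, a linearly weighted Cohen's Kappa for two series of ratings on k ordered categories, builds a percentile bootstrap confidence interval by resampling the n rated items, and compares the lower confidence limit against a benchmark scale. The Landis-Koch thresholds, the number of bootstrap replications and all function names are assumptions chosen for illustration only.

```python
import numpy as np

def linear_weighted_kappa(r1, r2, k):
    """Linearly weighted Cohen's Kappa for two rating series on k ordered categories (coded 1..k)."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    n = len(r1)
    cats = np.arange(1, k + 1)
    # linear agreement weights: w_ij = 1 - |i - j| / (k - 1)
    w = 1.0 - np.abs(cats[:, None] - cats[None, :]) / (k - 1)
    # observed joint proportions of the two rating series
    p_obs = np.zeros((k, k))
    for a, b in zip(r1, r2):
        p_obs[a - 1, b - 1] += 1.0 / n
    # chance-expected proportions from the raters' marginal distributions (Cohen's definition)
    p_exp = np.outer(p_obs.sum(axis=1), p_obs.sum(axis=0))
    po, pe = (w * p_obs).sum(), (w * p_exp).sum()
    return (po - pe) / (1.0 - pe)

def percentile_bootstrap_ci(r1, r2, k, B=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI obtained by resampling the n rated items with replacement."""
    rng = np.random.default_rng(seed)
    r1, r2 = np.asarray(r1), np.asarray(r2)
    n = len(r1)
    estimates = [linear_weighted_kappa(r1[idx], r2[idx], k)
                 for idx in (rng.integers(0, n, n) for _ in range(B))]
    lower, upper = np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

# Benchmarking: relate the estimated coefficient to an extent-of-agreement category by
# comparing the lower confidence limit against a benchmark scale
# (Landis-Koch thresholds used here purely as an illustrative choice).
BENCHMARK_SCALE = [(0.8, "almost perfect"), (0.6, "substantial"), (0.4, "moderate"),
                   (0.2, "fair"), (0.0, "slight")]

def benchmark(lower_limit):
    for threshold, label in BENCHMARK_SCALE:
        if lower_limit >= threshold:
            return label
    return "poor"

# Example usage with hypothetical ratings of n = 30 items on a k = 5 point ordinal scale
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    rater1 = rng.integers(1, 6, 30)
    rater2 = np.clip(rater1 + rng.integers(-1, 2, 30), 1, 5)  # mostly agreeing second series
    kw = linear_weighted_kappa(rater1, rater2, k=5)
    lo, hi = percentile_bootstrap_ci(rater1, rater2, k=5)
    print(f"weighted kappa = {kw:.3f}, 95% CI = ({lo:.3f}, {hi:.3f}), benchmark: {benchmark(lo)}")
```

The same skeleton carries over to the other reviewed indices (Gwet's AC2, Scott's Pi, the Brennan-Prediger statistic), which differ essentially in how the chance-agreement term pe is computed.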
ISBN: 978-961-240-322-5

Use this identifier to cite or link to this document: https://hdl.handle.net/11588/689401