Validation of model performance (binary outcomes)

1. Discrimination

A model with a new marker is often compared to a model without the marker to evaluate how much the marker improves performance. When a given risk threshold is used to classify patients as positive or negative for the disease under study, we have shown through a simulation study that adding a predictive marker will often decrease either sensitivity or specificity (1). This shows the importance of reporting the changes in sensitivity and specificity separately, rather than through a single statistical summary measure such as accuracy, the Youden index, or NRI. When a single summary measure is desired, decision-analytic measures such as Net Benefit, Relative Utility, or weighted NRI are suggested.
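
As a minimal sketch of this point, the example below classifies patients with a fixed risk threshold and reports sensitivity and specificity for a model without and with a marker. The function name `sens_spec`, the toy risks, and the threshold of 0.20 are all hypothetical choices for illustration and are not taken from the cited study.

```python
def sens_spec(risks, outcomes, threshold):
    """Sensitivity and specificity of the rule 'positive if risk >= threshold'."""
    pairs = list(zip(risks, outcomes))
    tp = sum(1 for r, y in pairs if y == 1 and r >= threshold)
    fn = sum(1 for r, y in pairs if y == 1 and r < threshold)
    tn = sum(1 for r, y in pairs if y == 0 and r < threshold)
    fp = sum(1 for r, y in pairs if y == 0 and r >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: 4 patients with the disease (1), 6 without (0).
outcomes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
risk_without_marker = [0.30, 0.25, 0.15, 0.12, 0.22, 0.18, 0.08, 0.07, 0.05, 0.04]
risk_with_marker = [0.45, 0.35, 0.25, 0.15, 0.28, 0.24, 0.06, 0.05, 0.03, 0.02]
threshold = 0.20  # assumed risk threshold

sens_old, spec_old = sens_spec(risk_without_marker, outcomes, threshold)
sens_new, spec_new = sens_spec(risk_with_marker, outcomes, threshold)

print(f"without marker: sensitivity={sens_old:.2f}, specificity={spec_old:.2f}")
print(f"with marker:    sensitivity={sens_new:.2f}, specificity={spec_new:.2f}")
```

In this toy example sensitivity rises from 0.50 to 0.75 while specificity falls from 5/6 to 4/6, and accuracy is 0.70 for both models, so the summary measure hides the opposite movements.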

2. Calibration

We have shown using simulated data that models that are calibrated for a given population, i.e. models that give accurate risk estimates for the population at hand, result in better decision making compared to models that are not calibrated (i.e. models that systematically over- or underestimate risk, or models that provide risk estimates that are too extreme or not extreme enough) (2). This underscores the relevance of assessing calibration of a model.
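
The kinds of miscalibration named above can be illustrated with a small sketch. It assumes, purely for illustration, that miscalibrated risks arise from a shift and a rescaling of the calibrated risks on the logit scale. The helper `distort` and its parameters `shift` and `scale` are hypothetical names; they describe the distortion applied, not a fitted calibration intercept or slope.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def expit(x):
    return 1 / (1 + math.exp(-x))


def distort(risk, shift=0.0, scale=1.0):
    """Distort a calibrated risk on the logit scale (illustrative assumption)."""
    return expit(shift + scale * logit(risk))


# Hypothetical calibrated risks for five patients.
calibrated = [0.05, 0.10, 0.20, 0.40, 0.70]

overestimated = [distort(p, shift=0.5) for p in calibrated]        # systematically too high
underestimated = [distort(p, shift=-0.5) for p in calibrated]      # systematically too low
too_extreme = [distort(p, scale=2.0) for p in calibrated]          # pushed towards 0 or 1
not_extreme_enough = [distort(p, scale=0.5) for p in calibrated]   # pulled towards 0.5

for name, risks in [("overestimated", overestimated),
                    ("underestimated", underestimated),
                    ("too extreme", too_extreme),
                    ("not extreme enough", not_extreme_enough)]:
    print(name, [round(r, 3) for r in risks])
```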

3. Decision-analytic evaluation of risk prediction models

In recent years, several decision-analytic performance measures have been proposed. These measures try to move beyond purely statistical evaluation in order to assess the potential clinical utility of a model without performing a full-blown cost-effectiveness analysis.

We have worked out the relationships between the most common decision-analytic measures (Net Benefit, Relative Utility, and weighted NRI), and have elaborated on how these measures relate to the statistical reclassification measure NRI (3). These measures are in fact a revival of a measure that was proposed back in 1884 (indeed, the 19th century) in the fourth volume of Science. This is a remarkable observation, and we have described the sudden increase in citations of this old paper (4). As explained above, we have also shown that potential clinical utility is lower when a risk model is not calibrated for a given population (2).

When adding a new marker to a model, using such decision-analytic measures is more informative than using traditional measures such as the change in AUC or the NRI (5). The potential utility of the new marker can then be expressed through the test tradeoff, the number of patients on which to use the new marker to achieve one additional true positive at the same level of false positives (6).
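
A minimal sketch of these quantities follows. It assumes the usual decision-curve definition of Net Benefit at risk threshold t, namely TP/n minus FP/n weighted by t/(1 - t), and computes the test tradeoff as the reciprocal of the Net Benefit difference between the two models. Both formulas come from the decision-analytic literature rather than from the text above, and the data and the threshold of 0.20 are hypothetical.

```python
def net_benefit(risks, outcomes, threshold):
    """Net Benefit of treating patients with risk >= threshold (assumed definition)."""
    n = len(outcomes)
    tp = sum(1 for r, y in zip(risks, outcomes) if y == 1 and r >= threshold)
    fp = sum(1 for r, y in zip(risks, outcomes) if y == 0 and r >= threshold)
    weight = threshold / (1 - threshold)  # harm of a false positive relative to a true positive
    return tp / n - weight * fp / n


# Hypothetical toy data: 4 patients with the disease (1), 6 without (0).
outcomes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
risk_without_marker = [0.30, 0.25, 0.15, 0.12, 0.22, 0.18, 0.08, 0.07, 0.05, 0.04]
risk_with_marker = [0.45, 0.35, 0.25, 0.15, 0.28, 0.24, 0.06, 0.05, 0.03, 0.02]
threshold = 0.20  # assumed risk threshold

nb_old = net_benefit(risk_without_marker, outcomes, threshold)
nb_new = net_benefit(risk_with_marker, outcomes, threshold)
delta_nb = nb_new - nb_old

# Test tradeoff: patients in whom the marker is measured per net additional
# true positive (only meaningful when delta_nb > 0).
test_tradeoff = 1 / delta_nb

print(f"Net Benefit without marker: {nb_old:.3f}")
print(f"Net Benefit with marker:    {nb_new:.3f}")
print(f"Test tradeoff:              {test_tradeoff:.1f}")
```

With these toy numbers Net Benefit increases from 0.175 to 0.250, so the marker would need to be measured in about 13 patients per net additional true positive.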

4. NRI, IDI and added value of novel markers

Several performance measures exist to compare different models, most often a model without and a model with a new marker of interest. The NRI (Net Reclassification Improvement) and IDI (Integrated Discrimination Improvement) were introduced for this task in 2008 (7). These are simple and attractive measures that quickly became very popular in the medical literature (8). We noticed similarities between the IDI from the medical statistics community and ‘probabilistic AUCs’ from the machine learning community (9). Given that both communities continue to be fairly separate, we described these similarities in Statistics in Medicine (9).

After the introduction of NRI, decision-analytic measures gained more attention and approval. We described the relationships between NRI and decision-analytic measures, and have suggested that the latter are more informative for assessing new markers (1, 3, 5). We have also described the test tradeoff as an interesting statistic to describe the added value of a new marker from a decision-analytic perspective (6), and have presented an overview of approaches to graphically present the added value of a new marker (10).

The NRI has been criticized because it can give misleading results. The main explanation for this is that NRI is not a ‘proper’ measure of performance improvement: problems with model calibration can influence the results, and can even suggest that random variables have added value. We have therefore presented a distinction between two frameworks: the marker perspective and the model perspective (11). In the marker perspective, the focus is on the potential added value of a marker. In this perspective, calibration issues have to be resolved before NRI (or decision-analytic alternatives) can be used. In the model perspective, the focus is on the performance of a risk model when it is applied to patients as a decision support system. In that perspective, calibration is an inherent aspect of model performance that requires the use of decision-analytic measures instead of NRI.
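
For readers unfamiliar with the two measures, the sketch below computes them on toy data. It assumes the two-category NRI with a single risk threshold (net proportion of events moving up plus net proportion of non-events moving down) and the IDI as the change in the difference of mean predicted risk between events and non-events, following the usual definitions in the literature. The function names, data, and threshold are hypothetical.

```python
def nri(old, new, outcomes, threshold):
    """Two-category NRI for a single risk threshold (assumed definition)."""
    def net_up(pairs):
        up = sum(1 for o, n in pairs if o < threshold <= n)
        down = sum(1 for o, n in pairs if n < threshold <= o)
        return (up - down) / len(pairs)

    events = [(o, n) for o, n, y in zip(old, new, outcomes) if y == 1]
    nonevents = [(o, n) for o, n, y in zip(old, new, outcomes) if y == 0]
    # Upward moves are good for events and bad for non-events.
    return net_up(events) - net_up(nonevents)


def idi(old, new, outcomes):
    """IDI: gain in mean risk among events minus gain among non-events."""
    def mean(values):
        return sum(values) / len(values)

    gain_events = mean([n - o for o, n, y in zip(old, new, outcomes) if y == 1])
    gain_nonevents = mean([n - o for o, n, y in zip(old, new, outcomes) if y == 0])
    return gain_events - gain_nonevents


# Hypothetical toy data: 4 patients with the disease (1), 6 without (0).
outcomes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
risk_without_marker = [0.30, 0.25, 0.15, 0.12, 0.22, 0.18, 0.08, 0.07, 0.05, 0.04]
risk_with_marker = [0.45, 0.35, 0.25, 0.15, 0.28, 0.24, 0.06, 0.05, 0.03, 0.02]

nri_value = nri(risk_without_marker, risk_with_marker, outcomes, threshold=0.20)
idi_value = idi(risk_without_marker, risk_with_marker, outcomes)

print(f"NRI: {nri_value:.3f}")
print(f"IDI: {idi_value:.3f}")
```

Neither function checks calibration: both take the predicted risks at face value, which is why miscalibrated risks can distort the result.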

5. Overview

We have given an overview of performance measures, including the distinction between the evaluation of the prediction model and a resulting prediction rule (12). A prediction rule results from a prediction model by classifying patients as positive or negative using a risk threshold.

6. Updating

We used data from five international prostate biopsy cohorts to assess the value of a dynamic updating strategy in which a risk model is updated regularly as new data come in (13). The risk model under study is the Prostate Cancer Prevention Trial Risk Calculator (PCPTRC) (14). Yearly updating of the risk model, more specifically recalibration, appeared beneficial for the model's predictive performance.
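
One simple form of recalibration can be sketched as follows. The example assumes an intercept-only update ("recalibration-in-the-large"): a constant is added on the logit scale so that the mean updated risk equals the observed event rate in the newest data. Whether the cited study used this exact form is not stated here, so the approach, the function name `recalibrate_intercept`, and the data are illustrative assumptions.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def expit(x):
    return 1 / (1 + math.exp(-x))


def recalibrate_intercept(risks, outcomes, lo=-10.0, hi=10.0, tol=1e-10):
    """Find the logit-scale shift that makes the mean updated risk equal the
    observed event rate (requires an event rate strictly between 0 and 1)."""
    observed = sum(outcomes) / len(outcomes)
    linear_predictors = [logit(p) for p in risks]

    def mean_risk(shift):
        return sum(expit(shift + lp) for lp in linear_predictors) / len(linear_predictors)

    # Bisection: the mean updated risk increases monotonically with the shift.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean_risk(mid) < observed:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical data from the most recent year: mean predicted risk is 0.35,
# but only 3 of 10 patients had the outcome.
risks = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.20, 0.30, 0.40, 0.50]
outcomes = [0, 0, 0, 0, 1, 1, 0, 0, 0, 1]

shift = recalibrate_intercept(risks, outcomes)
updated_risks = [expit(shift + logit(p)) for p in risks]

print(f"shift on the logit scale: {shift:.3f}")
print(f"mean risk before: {sum(risks) / len(risks):.3f}")
print(f"mean risk after:  {sum(updated_risks) / len(updated_risks):.3f}")
```

In a dynamic updating strategy, a step like this would be repeated each time a new batch of data becomes available, for example once per year.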

References

1. Van Calster B, Steyerberg EW, D'Agostino RB, Sr., Pencina MJ. Sensitivity and specificity can change in opposite directions when new predictive markers are added to risk models. Med Decis Making. 2014;34(4):513-22.

2. Van Calster B, Vickers AJ. Calibration of risk prediction models: impact on decision-analytic performance. Med Decis Making. 2015;35(2):162-9.

3. Van Calster B, Vickers AJ, Pencina MJ, Baker SG, Timmerman D, Steyerberg EW. Evaluation of markers and risk prediction models: overview of relationships between NRI and decision-analytic measures. Med Decis Making. 2013;33(4):490-501.

4. Van Calster B. It takes time: A remarkable example of delayed recognition. Journal of the American Society for Information Science and Technology. 2012;63(11):2341-4.

5. Steyerberg EW, Pencina MJ, Lingsma HF, Kattan MW, Vickers AJ, Van Calster B. Assessing the incremental value of diagnostic and prognostic markers: a review and illustration. Eur J Clin Invest. 2012;42(2):216-28.

6. Baker SG, Van Calster B, Steyerberg EW. Evaluating a new marker for risk prediction using the test tradeoff: an update. Int J Biostat. 2012;8(1).

7. Pencina MJ, D'Agostino RB, Sr., D'Agostino RB, Jr., Vasan RS. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med. 2008;27(2):157-72; discussion 207-12.

8. Leening MJ, Vedder MM, Witteman JC, Pencina MJ, Steyerberg EW. Net reclassification improvement: computation, interpretation, and controversies: a literature review and clinician's guide. Ann Intern Med. 2014;160(2):122-31.

9. Van Calster B, Van Huffel S. Integrated discrimination improvement and probability-sensitive AUC variants. Stat Med. 2010;29(2):318-9.

10. Steyerberg EW, Vedder MM, Leening MJ, Postmus D, D'Agostino RB, Sr., Van Calster B, et al. Graphical assessment of incremental value of novel markers in prediction models: From statistical to decision analytical perspectives. Biom J. 2014.

11. Leening MJ, Steyerberg EW, Van Calster B, D'Agostino RB, Sr., Pencina MJ. Net reclassification improvement and integrated discrimination improvement require calibrated models: relevance from a marker and model perspective. Stat Med. 2014;33(19):3415-8.

12. Steyerberg EW, Van Calster B, Pencina MJ. [Performance measures for prediction models and markers: evaluation of predictions and classifications]. Rev Esp Cardiol. 2011;64(9):788-94.

13. Strobl AN, Vickers AJ, van Calster B, Steyerberg E, Leach RJ, Thompson IM, et al. Improving patient prostate cancer risk assessment: Moving from static, globally-applied to dynamic, practice-specific cancer risk calculators. J Biomed Inform. 2015.

14. Thompson IM, Ankerst DP, Chi C, Goodman PJ, Tangen CM, Lucia MS, et al. Assessing prostate cancer risk: results from the Prostate Cancer Prevention Trial. J Natl Cancer Inst. 2006;98(8):529-34.