Is the 4C Mortality Score fit for purpose? Some comments and concerns
Rapid response to:
Risk stratification of patients admitted to hospital with covid-19 using the ISARIC WHO Clinical Characterisation Protocol: development and validation of the 4C Mortality Score
To the Editor
We read with interest the paper in the BMJ by Knight et al.,[1] proposing a new risk prediction model for patients admitted to hospital with COVID-19, which the Guardian indicates is expected to be rolled out in the NHS this week (https://www.theguardian.com/world/2020/sep/09/risk-calculator-for-covid-...). On the whole, the paper appears of higher quality than most other articles we have reviewed in our living review.[2] For example, the dataset was large enough;[3] there was a very clear target population; missing data were handled using multiple imputation; multiple metrics of predictive performance were considered (including calibration and net benefit, which are often ignored); and reporting followed the TRIPOD guideline.[4 5] However, we have identified some concerns and issues that we want to flag to BMJ readers.
Firstly, a potential issue is that calibration (i.e. agreement between observed and predicted risks) appears perfect in both the development dataset and the validation dataset, with a calibration-in-the-large (CITL) of 0 and a calibration slope of 1. Now, we would expect these results in the model development dataset (at least when using unpenalised regression models; here the lasso was used, but the dataset is so large that penalisation is most likely negligible, so a CITL of 0 and a calibration slope of 1 may well be possible). However, to observe such perfect results upon validation is highly unusual. Our concern is whether the model was recalibrated (perhaps unknowingly) in the validation dataset before the calibration measures were estimated. Perhaps the authors can ask an independent statistician to check this? It would also help to know how they calculated CITL and the calibration slope. Perhaps their statistical analysis code could be made available?
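For transparency, the standard calculations we would expect are sketched below in Python (purely illustrative, not the authors' analysis code, and with hypothetical variable names): CITL is the intercept of a logistic model that includes the linear predictor as an offset, and the calibration slope is the coefficient of the linear predictor in a logistic model with a freely estimated intercept.

```python
import numpy as np
import statsmodels.api as sm

def citl_and_slope(y, predicted_risk):
    """y: 0/1 outcomes; predicted_risk: model-based risks strictly between 0 and 1 (hypothetical inputs)."""
    y = np.asarray(y)
    p = np.asarray(predicted_risk)
    lp = np.log(p / (1 - p))  # linear predictor on the logit scale

    # Calibration-in-the-large: intercept of a logistic model with lp as an offset
    citl_fit = sm.GLM(y, np.ones((len(y), 1)), family=sm.families.Binomial(), offset=lp).fit()

    # Calibration slope: coefficient of lp in a logistic model with a free intercept
    slope_fit = sm.GLM(y, sm.add_constant(lp), family=sm.families.Binomial()).fit()

    return citl_fit.params[0], slope_fit.params[1]
```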
A related point is that a calibration plot is given for the derivation dataset (Fig 2 and Appendix 11), but not for the validation dataset; or perhaps Figure 2 is in fact for the validation dataset? Regardless, it appears that straight lines have been used to join the grouped points on the plot (which should be avoided), rather than adding a smoothed calibration curve.[6 7] This may hide potential deviations from good calibration.
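For illustration only, a smoothed calibration curve of the kind recommended in [6 7] could be produced as follows (again a sketch with hypothetical variable names, not the authors' plotting code):

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

def plot_smoothed_calibration(y, predicted_risk):
    """Lowess-smoothed calibration curve of observed 0/1 outcome against predicted risk."""
    y = np.asarray(y)
    p = np.asarray(predicted_risk)
    smoothed = lowess(y, p, frac=0.75)      # returns sorted pairs (predicted risk, smoothed observed)
    plt.plot(smoothed[:, 0], smoothed[:, 1], label="Smoothed calibration curve")
    plt.plot([0, 1], [0, 1], linestyle="--", label="Ideal calibration")
    plt.xlabel("Predicted risk")
    plt.ylabel("Observed proportion (smoothed)")
    plt.legend()
    plt.show()
```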
Secondly, the authors have created a simplified score from their final prediction model (shown in Table 2), and it is this simplified score that they recommend for practice. However, it is not clear whether it is the score itself that was validated, or the lasso-based model equation. If the performance metrics relate to the original (lasso) model, then they do not reflect the performance of the simplified score. Unlike the regression-based model, a simplified score developed in this manner is challenging to validate, chiefly because it does not itself produce predicted risks, which impedes an assessment of calibration. How can the calibration slope be 1 (even in model development) when the score no longer maps one-to-one to the risk estimates of the original model? Clarity is needed. For example, how do the authors obtain predicted risks? Do they take the average risk for everyone with that score (Fig 2, centre panel)? How do they then examine CITL and the calibration slope?
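One possible approach, and we stress that we do not know whether this is what the authors did, would be to map each score value to the mean lasso-model risk among development-set patients with that score, and to carry that mapping into the validation set; a sketch with hypothetical column names is below. Whether the calibration slope could still be exactly 1 after such a many-to-one mapping is precisely our question.

```python
import pandas as pd

def score_to_risk(dev: pd.DataFrame) -> pd.Series:
    """Map each value of the points score to the mean lasso-model risk in the development data."""
    return dev.groupby("score")["predicted_risk"].mean()

def risks_for_validation(val: pd.DataFrame, mapping: pd.Series) -> pd.Series:
    """Assign each validation-set patient the risk attached to their score value."""
    return val["score"].map(mapping)
```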
A related issue is that the decision curve for a points-based score should take the form of a step function, but the one shown here is smooth. Something therefore appears amiss, and it is important that the authors clarify this.
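To illustrate why we would expect steps: in decision curve analysis the net benefit at a threshold probability pt is TP/n - (FP/n) x pt/(1 - pt), and with a points-based score the classification (and hence TP and FP) can only change when pt crosses one of the finite set of risk values attached to the score, producing visible jumps rather than a curve that is smooth throughout. A sketch (hypothetical variable names, not the authors' code) is below.

```python
import numpy as np

def net_benefit_curve(y, risk, thresholds):
    """y: 0/1 outcomes; risk: risk attached to each patient's score; thresholds: candidate pt values."""
    y = np.asarray(y)
    risk = np.asarray(risk)
    n = len(y)
    nb = []
    for pt in thresholds:
        treat = risk >= pt                     # classify as high risk at threshold pt
        tp = np.sum(treat & (y == 1)) / n      # true positives per patient
        fp = np.sum(treat & (y == 0)) / n      # false positives per patient
        nb.append(tp - fp * pt / (1 - pt))
    return np.array(nb)

# With a discrete score, tp and fp change only where pt crosses one of the score's
# attached risk values, so the curve over, say, np.linspace(0.01, 0.5, 200) shows
# jumps at those values rather than being smooth.
```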
Thirdly, as part of the three-stage model building strategy, the authors selected 'optimal' cut-off values to categorise (often dichotomise) continuous predictors. It is well known that this approach is biologically implausible and loses information.[8-11] A simplified score can still be created after modelling the continuous predictors on their continuous scale and allowing for non-linear relationships.[12 13] Indeed, the failure to allow for non-linearity in the regression-based model is one potential reason why it performs (slightly) worse in terms of discrimination than the machine learning model (XGBoost). Moreover, it is not clear why results from this particular machine learning method were presented at all (by definition, the method precludes bedside use); they serve no obvious purpose except to distract, and they provide an unfair comparison with the regression-based model.
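For example, continuous predictors such as age, respiratory rate, urea, and CRP could be kept on their continuous scale using restricted cubic splines; the sketch below (hypothetical variable and column names, not the authors' model) illustrates the idea using patsy's cr() terms within a logistic regression.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def fit_continuous_predictor_model(dev: pd.DataFrame):
    """Logistic model keeping continuous predictors continuous, with natural cubic spline terms."""
    formula = ("died ~ cr(age, df=4) + cr(respiratory_rate, df=4) + cr(urea, df=4)"
               " + cr(crp, df=4) + sex + comorbidity_count")
    return smf.glm(formula, data=dev, family=sm.families.Binomial()).fit()
```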
Fourthly, when evaluating the performance of a prediction model in large datasets (as done here), it is important to evaluate performance in relevant subgroups and settings, not just on average.[14 15] A model may perform well on average (e.g. across England, Wales, and Scotland) but not in particular countries, regions, or hospitals. A particular concern is potential miscalibration at the regional or individual hospital level, where heterogeneity (e.g. in case mix or clinical care) may lead to large differences from the average. Surely this should be checked closely before the NHS decides to implement this approach? Otherwise, decisions may be made on potentially miscalibrated risk predictions in individual hospitals. A related point: it is good to see that discrimination was checked in ethnic and sex subgroups, but why not also calibration?
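Such checks are straightforward in a dataset of this size; for example, CITL and the calibration slope could be estimated per region or per hospital, reusing the citl_and_slope sketch above (column names again hypothetical):

```python
import pandas as pd

def calibration_by_group(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Per-group CITL and calibration slope; df needs 0/1 'died', 'predicted_risk', and group_col."""
    rows = []
    for group, sub in df.groupby(group_col):
        citl, slope = citl_and_slope(sub["died"].to_numpy(), sub["predicted_risk"].to_numpy())
        rows.append({group_col: group, "n": len(sub), "CITL": citl, "calibration slope": slope})
    return pd.DataFrame(rows)
```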
Fifthly, logistic regression models were used for model development, and the authors 'included patients without an outcome after four weeks and considered to have had no event'. But what about longer term mortality? It should be made much clearer that this model predicts mortality risk by four weeks after admission, and not longer term outcomes.
Sixthly, we note that predictions from this new tool should be viewed as predicted risks in the context of current care. That is, a low risk does not mean that the patient should immediately be sent home without care; rather, it means that, in the context of current care and treatment pathways, the patient is unlikely to die within four weeks. They might still be severely ill well beyond four weeks. If the model is rolled out in the NHS, this needs to be absolutely clear to those who are implementing it.
Finally, the comparison with existing scores is problematic: the sample size varies between 197 and 19361 (i.e. each existing model is validated in a different sample, depending on the availability of its predictors), and the two models added to the decision curve were developed 15 years ago for a related but different condition. The claim that the new score outperforms existing scores is therefore poorly supported.
We hope readers and the authors find our comments constructive.
With best wishes,
Richard D Riley, Gary S Collins, Maarten van Smeden, Kym Snell, Ben Van Calster, Laure Wynants
Reference List
1. Knight SR, Ho A, Pius R, et al. Risk stratification of patients admitted to hospital with covid-19 using the ISARIC WHO Clinical Characterisation Protocol: development and validation of the 4C Mortality Score. BMJ 2020;370:m3339.
2. Wynants L, Van Calster B, Collins GS, et al. Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal. BMJ 2020;369:m1328.
3. Riley RD, Ensor J, Snell KIE, et al. Calculating the sample size required for developing a clinical prediction model. BMJ 2020;368:m441.
4. Collins GS, Reitsma JB, Altman DG, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ 2015;350:g7594.
5. Moons KG, Altman DG, Reitsma JB, et al. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med 2015;162(1):W1-73.
6. Austin PC, Steyerberg EW. Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers. Stat Med 2014;33(3):517-35.
7. Van Calster B, Nieboer D, Vergouwe Y, et al. A calibration hierarchy for risk models was defined: from utopia to empirical data. J Clin Epidemiol 2016;74:167-76.
8. Collins GS, Ogundimu EO, Cook JA, et al. Quantifying the impact of different approaches for handling continuous predictors on the performance of a prognostic model. Stat Med 2016;35(23):4124-35.
9. Cohen J. The cost of dichotomization. Appl Psychol Meas 1983;7:249-53.
10. Altman DG, Royston P. Statistics notes: The cost of dichotomising continuous variables. BMJ 2006;332:1080.
11. Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med 2006;25(1):127-41.
12. Bonnett LJ, Snell KIE, Collins GS, et al. Guide to presenting clinical prediction models for use in clinical settings. BMJ 2019;365:l737.
13. Sullivan LM, Massaro JM, D'Agostino RB, Sr. Presentation of multivariate data for clinical use: The Framingham Study risk score functions. Stat Med 2004;23(10):1631-60.
14. Riley RD, Ensor J, Snell KI, et al. External validation of clinical prediction models using big datasets from e-health records or IPD meta-analysis: opportunities and challenges. BMJ 2016;353:i3140.
15. Riley RD, van der Windt D, Croft P, et al., editors. Prognosis Research in Healthcare: Concepts, Methods and Impact. Oxford, UK: Oxford University Press, 2019.
Competing interests: No competing interests
15 September 2020
Richard D Riley
Professor of Biostatistics
Gary S Collins, Maarten van Smeden, Kym IE Snell, Ben Van Calster, Laure Wynants