Is the 4C Mortality Score fit for purpose? Authors' response.
Rapid response to:
Research
Risk stratification of patients admitted to hospital with covid-19 using the ISARIC WHO Clinical Characterisation Protocol: development and validation of the 4C Mortality Score
Is the 4C Mortality Score fit for purpose? Authors' response.
Dear Editor
Many thanks to all correspondents for their interest in our work on the 4C Mortality Score. The feedback is greatly appreciated, and we will respond to all points in due course. We write here specifically in response to the letter from Professor Riley and colleagues (1).
We are pleased to be able to use the large and detailed ISARIC 4C dataset to provide a pragmatic decision support tool for use by clinicians in these challenging times. We thank the correspondents for highlighting the quality of the work; we are particularly indebted to the large and dedicated team of research nurses and students who worked hard to collect these data. We are grateful for the opportunity to address the technical points raised.
Development of this tool began during lockdown, when nurses and doctors in emergency departments and intensive care units were working incredibly hard in difficult circumstances to provide the best care for patients unwell with covid-19. The request from clinical teams was for a pragmatic decision support tool that could be applied quickly using admission information without the need for a web or mobile phone-based app, given contamination risks (2). This is what has been achieved.
The first and second points concern score calibration. Calibration describes the relationship between predicted and observed risk and is important in ensuring that a prognostic tool is accurate across the range of risk. Figure 2 refers to validation data; an erratum has been requested to correct a legend substituted in error. All performance metrics and analyses presented in this study were performed using the prognostic index (the score), not the regression models used to generate it. Calibration in the validation dataset is excellent, though not perfect. It is correct to highlight that these are averages within deciles of risk. The good performance is due in part to this being an internal validation, albeit using data from different patients admitted at later points in time. Particularly in a pandemic, calibration is likely to change over time, perhaps differently by region and by patient subgroup. The simplicity of the 4C Mortality Score makes it susceptible to this, and we would not expect calibration to be as good in planned external validation exercises.

Calibration-in-the-large (CITL) and the slope of the calibration curve were generated in the standard manner. The score was fitted to the outcome in a logistic regression model using the derivation data. Predictions on the log-odds scale were then made in the validation dataset and fitted in a logistic regression model to determine the calibration slope (1.034). The same linear predictions were entered as an offset to determine CITL (0.030). LOESS curve fitting did not alter the appearance of the plot.
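For readers wishing to reproduce these two quantities, the following is a minimal sketch of the standard procedure described above. The published analysis code (linked at the end of this letter) is written in R; this sketch uses Python with statsmodels, and the variable names (score_dev, died_dev, and so on) are illustrative rather than taken from that code.

```python
import numpy as np
import statsmodels.api as sm

def calibration_slope_citl(score_dev, died_dev, score_val, died_val):
    """Calibration slope and calibration-in-the-large (CITL) for a points score."""
    # Relate the points score to the outcome in the derivation data.
    fit_dev = sm.GLM(died_dev, sm.add_constant(score_dev),
                     family=sm.families.Binomial()).fit()

    # Linear predictor (log-odds scale) for the validation patients.
    lp_val = fit_dev.params[0] + fit_dev.params[1] * np.asarray(score_val, dtype=float)

    # Calibration slope: coefficient of the linear predictor when it is
    # refitted against the validation outcomes.
    fit_slope = sm.GLM(died_val, sm.add_constant(lp_val),
                       family=sm.families.Binomial()).fit()

    # CITL: intercept when the linear predictor is entered as an offset
    # (slope fixed at 1).
    fit_citl = sm.GLM(died_val, np.ones(len(lp_val)),
                      family=sm.families.Binomial(), offset=lp_val).fit()

    return fit_slope.params[1], fit_citl.params[0]
```

A slope close to 1 and a CITL close to 0 indicate good calibration in the validation data; the values reported above were 1.034 and 0.030 respectively.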
With regard to clinical utility, we included a comparison of decision curves for the best discriminating scores that could be applied to >50% of the complete case cohort. It is asked why the decision curves do not have the form of a step function (a staircase relationship in lay terms; in mathematical terms, a linear combination of indicator functions). We admit to being a little confused by this question, as we know the correspondents have a great deal of experience in this area. Net benefit at a given threshold probability is defined as the fraction of true positives minus the fraction of false positives multiplied by the odds of that threshold (3). Because a decision curve is a function of the threshold, and the odds term varies continuously with it, the result can never be a step function. Discrete changes in net benefit arising from the discrete nature of the prognostic index are seen, as expected.
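The point is perhaps easiest to see from the definition itself. The sketch below computes a decision curve for a discrete prognostic index; it is illustrative only and does not reproduce the exact analysis in the paper.

```python
import numpy as np

def net_benefit_curve(y_true, risk, thresholds):
    """Net benefit of 'treat if predicted risk >= threshold' over a grid of
    threshold probabilities: NB(t) = TP/n - (FP/n) * t / (1 - t)."""
    y_true = np.asarray(y_true)
    risk = np.asarray(risk)
    n = len(y_true)
    curve = []
    for t in thresholds:
        treat = risk >= t
        tp = np.sum(treat & (y_true == 1))
        fp = np.sum(treat & (y_true == 0))
        curve.append(tp / n - (fp / n) * t / (1 - t))
    return np.array(curve)

# The classification (treat / do not treat) changes only at the score's own
# risk levels, but the odds term t/(1 - t) varies continuously with the
# threshold, so the resulting curve is piecewise smooth, not a step function.
```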
There are difficulties in using decision curves for points score models, given that such a model does not incorporate an underlying probability function in the way a regression model does. We considered three approaches to this. The first is to use the original outcome probabilities of the comparison scores across the full range of risk, where available; these are then fitted to the validation data. The second is to refit the comparison scores in the derivation data, and to use these to predict outcomes for the validation data. The third is to refit all scores in the validation data and provide a direct comparison. We included the third approach in the paper and found no difference in conclusions using the second approach. All these analyses were performed in the same nested dataset for each score.
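As an illustration of the third approach, a points score can be mapped onto the probability scale by refitting it against the validation outcomes before decision curve analysis. The sketch below is a simplification of what the published R code does; the function and variable names are ours.

```python
import numpy as np
import statsmodels.api as sm

def score_to_probability(score_val, died_val):
    """Map an integer points score onto the probability scale by refitting it
    against the validation outcomes with a logistic regression, so that
    decision curves can be compared across scores on a common scale."""
    X = sm.add_constant(np.asarray(score_val, dtype=float))
    fit = sm.GLM(np.asarray(died_val), X, family=sm.families.Binomial()).fit()
    return fit.predict(X)  # predicted in-hospital mortality risk per patient
```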
In point 3, it is asked why a machine learning model was developed and reported alongside the other approaches. As stated in the paper, we believe it is important, when presenting a simple and pragmatic score, to understand what discrimination might be achieved using alternative classification tools. We challenged our simple derived score by asking, “to what extent is discrimination being sacrificed for the sake of expediency?” It is an interesting philosophical question to consider the upper bound of predictive power given a set of information at a particular decision point. We sought to approach that bound by using all included variables in a flexible modelling framework. The correspondents make no comment on this comparison, which is of interest. Differences in discrimination between modelling approaches were numerically small, possibly reflecting limitations of the area under the receiver operating characteristic curve as a metric. Much of the direct correspondence we have received for this study has praised the inclusion of alternative approaches when defining a pragmatic score. With regard to continuous variables, capturing non-linearity may bring benefits in other performance metrics or in score calibration within subgroups, at the expense of requiring a nomogram or calculator app.
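For concreteness, the kind of comparison described here can be sketched as follows. GradientBoostingClassifier is used purely as a stand-in for a flexible modelling framework; it is not necessarily the method used in the published analysis (the actual R code is linked at the end of this letter), and the data objects are hypothetical.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def compare_discrimination(X_dev, died_dev, X_val, died_val, score_val):
    """Compare discrimination (area under the ROC curve) of the simple points
    score with a flexible model fitted on all candidate variables."""
    flexible = GradientBoostingClassifier(random_state=0).fit(X_dev, died_dev)
    auc_flexible = roc_auc_score(died_val, flexible.predict_proba(X_val)[:, 1])
    # AUROC is rank based, so the raw score can be evaluated directly.
    auc_score = roc_auc_score(died_val, score_val)
    return auc_score, auc_flexible
```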
In point 4, it is emphasised that performance should be ensured across geographical regions and patient subgroups. We completely agree. Discrimination was considered in geographical subsets, by sex, and by ethnicity. It will be useful to externally validate this model in further geographical regions and subgroups, and we look forward to doing so.
In point 5, clarification is asked on the inclusion criteria. As stated, patients were required to have at least 4 weeks' follow-up at the time of data extraction. Events occurring after 4 weeks were considered at the time-point at which they occurred. Included patients who had no recorded outcome, either because it was missing or because it had not yet occurred (3.6% of the derivation dataset), were considered to have had no event. The outcome measure is therefore in-hospital mortality. It should be noted that outcomes after day 28 may not have been as reliably collected, given the pressure the data collection teams were under during the pandemic.
In point 6, it is stated that this “new tool should be viewed as predicted risks in the context of current care [and] a low risk does not mean that the patient should immediately be sent home without care”. Yes, we emphasise that the key aim of risk stratification is to support clinical management decisions, not to replace them.
In the final point, the comparison with existing scores is described as problematic given the inability to apply particular scores to the data commonly available at admission to hospital. It is imperative that any new prognostic tool is put in the context of what has come before. As described above, the decision curves analysis of the best scores was performed within a nested dataset. That some existing prognostic scores can only be applied in a small proportion of patients in this dataset is itself an important result. This was a prospective non-interventional study using routine data. Prediction scores that require information not commonly available at the time of decision making have limited applicability in practice. This is particularly important in situations of surge when healthcare demand is high and clinical resources are limited.
Novel biomarkers may be important. Our deep phenotyping work progresses, and we hope to identify biomarkers that can be incorporated into similar tools to help characterise and guide the treatment of patients with covid-19.
Many thanks again for the opportunity to clarify these important points. As clinician-scientists working in hospitals, we are acutely aware of the balances that must be struck around pragmatism when creating decision support tools. We welcome data sharing requests (https://isaric4c.net) and, as with all our projects, the code is publicly available (https://github.com/SurgicalInformatics/4C_mortality_score). We hope this code provides useful solutions to others working in this area.
1. Riley RD, Collins GS, van Smeden M, Snell KIE, Van Calster B, Wynants L. Is the 4C Mortality Score fit for purpose? Some comments and concerns. 2020 Sep 15 [cited 2020 Sep 16]; Available from: https://www.bmj.com/content/370/bmj.m3339/rr-3
2. Phua J, Weng L, Ling L, Egi M, Lim C-M, Divatia JV, et al. Intensive care management of coronavirus disease 2019 (COVID-19): challenges and recommendations. The Lancet Respiratory Medicine. 2020 May 1;8(5):506–17.
3. Vickers AJ, Van Calster B, Steyerberg EW. Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ [Internet]. 2016 Jan 25 [cited 2020 Sep 16];352:i6. Available from: https://www.bmj.com/content/352/bmj.i6
Competing interests:
No competing interests
18 September 2020
Ewen M Harrison
Professor of Surgery and Data Science
Stephen R Knight, Antonia Ho, Rishi Gupta, Mahdad Noursadeghi, Peter JM Openshaw, J Kenneth Baillie, Malcolm G Semple, Annemarie B Docherty
Centre for Medical Informatics, Usher Institute, University of Edinburgh