Decision curve analysis or why interpretation of medical decision support models should be based on their potential medical benefit

--

Where we explore why decision curve analysis is a powerful way to evaluate models in a medical context

Written by Paul Hiemstra, PhD

Introduction

To guarantee good patient outcomes, making sound diagnostic decisions is key in any medical workflow. The amount of available data to accurately make these decisions is growing quickly and has the potential to improve this decision making process. However, the increase in data can also increase the workload for medical professionals (Huisman et al, 2024). This is where decision support models can ease the workload by synthesizing large amounts of data into more accessible chunks of information or even concrete decisions. For example, we could use a model to predict the probability a patient has Coronary Artery Disease (CAD) based on a number of characteristics specific to that patient. A medical professional now has to deal only with a single probability instead of the intricate patterns and interactions of the numerous characteristics of that patient.

If we use a model to synthesize our data, the performance of this model should be good enough to support this decision making process. However, how do we express model performance? Many performance measures (metrics) can be used for this, and a logical choice would be how often the model returns the correct answer (i.e., accuracy). An alternative is to look at the model’s sensitivity and specificity (i.e., how good the model is at correctly identifying people with and without CAD). This allows us to balance the performance of the model for both patients and healthy individuals.

Therefore, choosing a metric that matches the medical objectives we have for using the model is vital. The goal of this article is to introduce an alternative metric, net benefit (Vickers and Elkin (2006), Vickers et al (2019)), which has a number of key advantages for medical practice. To provide context we also explore a number of common metrics such as accuracy, sensitivity/specificity and area under the curve (AUC). We explore their underlying concepts, and show how net benefit builds on these other metrics. But first, we introduce the case study that we use to illustrate all these metrics.

Note that the case study and code in this article are heavily inspired by the official DCA tutorial. What we added is placing DCA in a relevant statistical context and exploring what makes it work. You can also read the article on our github page.

Our case study

Prostate cancer is a common disease amongst older men and is much more treatable when detected early. Elevated PSA levels in a patient’s blood indicate that something might be wrong, but this is not enough to conclusively diagnose prostate cancer. The definitive way to determine prostate cancer is to collect a tissue sample of the prostate via a biopsy. However, the biopsy itself is not without risks, so we want to choose wisely whom to biopsy. In this case study we use a statistical model to predict the probability of cancer, and if the probability is high enough we order a biopsy.

As context to the model decision option, we also consider the situation where we biopsy no patients, and where we biopsy all patients. For a model to be worthwhile, it should perform better than either of these two trivial options. Note that an interesting comparison here would be between the model’s decisions and those of a medical specialist; however, the dataset does not contain this information, and it is not directly relevant to the message of this article.

To translate the probability of cancer the model returns into a concrete decision, we need to choose above which probability we order a biopsy. This is called the probability threshold (𝜃), and the process of choosing the threshold is called model calibration. If, for example, we choose a threshold of 0.2 we deem a biopsy to be necessary when the probability of cancer is 20% or higher. Using a higher threshold in this case seems unrealistic to us: any patient who is told that their chance of cancer is 1 in 5 (odds 1:4) would want a biopsy to be sure.

The model uses the age of the patient (age), a genetic marker (marker) and whether or not there is a family history of cancer (famhistory) to predict the probability of cancer. For this historical data we also know which patient had cancer or not (cancer), and that will be used as a label to train our regression model.

import numpy as np
import pandas as pd
from helper_functions import *

# Load the prostate cancer dataset from the official DCA tutorial repository
df_cancer_dx = pd.read_csv('https://raw.githubusercontent.com/ddsjoberg/dca-tutorial/main/data/df_cancer_dx.csv').drop(columns=['cancerpredmarker', 'risk_group'])
df_cancer_dx.head()

Note that typically we would split the data into a training and test dataset to prevent overfitting, but for the flow of the article this is not needed.

Based on this data we fit a logistic regression model using the aforementioned variables:

import statsmodels.api as sm

biopsy_model = sm.GLM.from_formula('cancer ~ age + marker + famhistory', data=df_cancer_dx, family=sm.families.Binomial()).fit()
print(biopsy_model.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                 cancer   No. Observations:                  750
Model:                            GLM   Df Residuals:                      746
Model Family:                Binomial   Df Model:                            3
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -213.89
Date:                Thu, 23 May 2024   Deviance:                       427.79
Time:                        07:15:37   Pearson chi2:                     743.
No. Iterations:                     6   Pseudo R-squ. (CS):             0.2130
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -22.0703      2.211     -9.984      0.000     -26.403     -17.738
age            0.2820      0.031      9.040      0.000       0.221       0.343
marker         0.9783      0.118      8.276      0.000       0.747       1.210
famhistory     0.8652      0.308      2.807      0.005       0.261       1.469
==============================================================================

From this summary, it is mainly important to understand the fitted model coefficients. To get nicely interpretable coefficients, we exponentiate the parameters, which puts them on the odds scale:

np.exp(biopsy_model.params)
Intercept     2.600116e-10
age           1.325770e+00
marker        2.660012e+00
famhistory    2.375435e+00
dtype: float64

These exponentiated coefficients are interpreted on the odds scale (𝑝/(1−𝑝)) of a patient having cancer: the intercept is a baseline odds, and the other coefficients are odds ratios, i.e. the multiplicative change in the odds per unit increase of the corresponding variable (a short sanity check follows the list below):

  • intercept: for a patient with the hypothetical age 0, a marker score of zero and no family history of cancer, the odds of having cancer are almost zero.
  • age: for each year the patient is older, the odds of cancer become 1.3 times larger.
  • marker: for each unit increase of the marker, the odds of prostate cancer become 2.7 times larger.
  • famhistory: a family history of prostate cancer increases the odds of cancer 2.4 times.
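
As a quick sanity check of this interpretation, we can combine the coefficients by hand for a hypothetical patient (the age of 65, marker value of 1.0 and famhistory of 1 are made-up illustrative values) and compare the result with the model’s own prediction. A minimal sketch, assuming the fitted biopsy_model from above is still in scope:

# Hypothetical patient, purely for illustration
example_patient = pd.DataFrame({'age': [65], 'marker': [1.0], 'famhistory': [1]})

# Linear predictor on the log-odds scale: intercept + sum of coefficient * value
log_odds = (biopsy_model.params['Intercept']
            + biopsy_model.params['age'] * 65
            + biopsy_model.params['marker'] * 1.0
            + biopsy_model.params['famhistory'] * 1)
prob_manual = 1 / (1 + np.exp(-log_odds))

# Both numbers should be identical
print('manual:', prob_manual)
print('model :', biopsy_model.predict(example_patient))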

To get a first impression of the performance of this model we draw a scatterplot of age versus marker, including the decision boundary for a probability threshold of 0.2 and famhistory == 1. Patients above the green line have a predicted probability of 0.2 or higher, and will thus be given a biopsy. The scatterplot shows that the model performs quite nicely:

import matplotlib.pyplot as plt

p_thold = 0.2
# Transform the probability threshold to the log-odds (logit) scale of the model
logit_p = np.log(p_thold / (1 - p_thold))

ax = (df_cancer_dx[['cancer', 'marker', 'age']]
      .plot(kind='scatter', x='marker', y='age', c=df_cancer_dx['cancer'],
            colormap='viridis', colorbar=False))

c_intercept, c_age, c_marker, c_famhistory = biopsy_model.params
marker_range = df_cancer_dx['marker'].agg(['min', 'max'])

# Decision boundary: solve the regression equation for age at the threshold log-odds
famhistory = 1
age_pred = (logit_p - c_intercept - c_marker * marker_range - c_famhistory * famhistory) / c_age
ax.axline((marker_range['min'], age_pred['min']), (marker_range['max'], age_pred['max']),
          color='green', label='famhistory')
Decision boundary for decision threshold 0.2 and family history equal to 1.

The confirmed cancer patients (yellow dots) are generally above the decision line, and the bulk of the healthy people are below the decision line. However, some patients that turned out to have cancer are not classified as such and vice versa. So, the model has predictive power, but is certainly not flawless in deciding who should get a biopsy or not.

The building blocks of model evaluation

The building blocks of our metrics are how often the model makes the correct or wrong decision. In the plot above, we have patients with cancer that are not given a biopsy (yellow dots under the line). This is called a False Negative (FN) as we falsely assume that no biopsy is needed. The inverse of that is the False Positive (purple dots above the green line), where based on the model we falsely assume that a biopsy is needed. The model also makes good decisions, in this case True Positives (TP, correctly assign biopsy) and True Negatives (TN, correctly omit biopsy).

TP, TN, FP and FN form the basis for all of the metrics we discuss in this article. The following figure illustrates them for our case study and data:

from sklearn.metrics import ConfusionMatrixDisplay

(ConfusionMatrixDisplay.from_predictions(df_cancer_dx['cancer'],
                                         biopsy_model.predict(df_cancer_dx) >= 0.2,
                                         normalize='true',
                                         display_labels=['biopsy-not-needed', 'biopsy-needed']))
Confusion matrix for the logistic regression model

This shows that for 85% of the healthy patients the model correctly decides that a biopsy is not needed (TN), which leaves 15% (FP) where an unnecessary biopsy is predicted. For patients that are sick, the percentages are 71% (TP) and 29% (FN) respectively. Note that these numbers depend on the decision threshold used. We can increase the performance for sick patients by lowering the threshold, but this will increase the error rate for healthy patients. Setting the decision threshold is therefore an important step in the modeling process.
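
To make this dependence on the threshold concrete, the following sketch recomputes the raw counts at the 0.2 threshold and at a lower, purely illustrative threshold of 0.1 (assuming the model and data from above are still in scope):

from sklearn.metrics import confusion_matrix

probs = biopsy_model.predict(df_cancer_dx)
for thold in [0.2, 0.1]:  # 0.1 is only an illustrative alternative threshold
    tn, fp, fn, tp = confusion_matrix(df_cancer_dx['cancer'], probs >= thold).ravel()
    print(f'threshold {thold}: TP={tp}, FN={fn}, FP={fp}, TN={tn}')

Lowering the threshold turns some false negatives into true positives, but at the price of extra false positives, which is exactly the tradeoff described above.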

The most basic metric: accuracy

Accuracy is simply defined as the fraction of times the model makes the correct decision:

accuracy = (TP + TN) / 𝑛_𝑡𝑜𝑡𝑎𝑙

where 𝑛_𝑡𝑜𝑡𝑎𝑙 is the total number of patients in the data. The accuracy of our biopsy model is:

from sklearn.metrics import accuracy_score
thold = 0.2

# accuracy of the model decisions at the 0.2 threshold
acc_model = accuracy_score(df_cancer_dx['cancer'], biopsy_model.predict(df_cancer_dx[['age', 'marker', 'famhistory']]) >= thold)
# biopsy everyone, so the probability of cancer is always 1
acc_all = accuracy_score(df_cancer_dx['cancer'], np.ones(len(df_cancer_dx['cancer'])) >= thold)
# biopsy no one, so the probability of cancer is always 0
acc_none = accuracy_score(df_cancer_dx['cancer'], np.zeros(len(df_cancer_dx['cancer'])) >= thold)

accs = [acc_none, acc_model, acc_all]
opts = ['biopsy-none', 'biopsy-model', 'biopsy-all']
fig, ax = plt.subplots()

ax.bar(opts, accs)
ax.set_ylabel('Accuracy')
Accuracy scores for all three options: biopsy no one, biopsy everyone, biopsy based on the logistic regression model.

The figure shows that the model makes the correct decision 83% of the time when using a probability threshold of 0.2. This is in line with our previous observation that the model has quite some predictive power. To add context, we also plotted the accuracy for the two other treatment options (biopsy-all, biopsy-none). This yields an unexpected result: our model seemed to have considerable predictive power, yet its accuracy is lower than simply giving no one a biopsy (83% versus 86%). This seems counterintuitive: under the biopsy-none option, all the cancer patients in the dataset (14%, 105 patients) would be denied a biopsy and thus not get the treatment they need. So, why is accuracy such a poor metric in this case?

Looking at the accuracies for biopsy-all and biopsy-none provides a hint as to what is going on: these accuracies are exactly the fractions of patients and healthy individuals in the dataset. The number of healthy individuals is much larger, or in statistical terms: the dataset is unbalanced. Because accuracy only compares the number of TP+TN to the size of the entire dataset, the extremely simple biopsy-no-one option already achieves a very high accuracy. So accuracy is far too strongly influenced by the imbalance in the dataset to provide a meaningful interpretation of the performance of our model compared to the biopsy-all and biopsy-none options.
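
A quick check of the class balance confirms this. A minimal sketch, assuming df_cancer_dx from above is still in scope; the two fractions should match the biopsy-none and biopsy-all accuracies in the bar chart:

# Fraction of healthy individuals (0) and cancer patients (1) in the dataset
print(df_cancer_dx['cancer'].value_counts(normalize=True))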

A better metric: Area under the curve (AUC)

So, how do we deal with unbalanced datasets? A solution that is often used is to not look at the fraction of good decisions over the entire dataset, but at the fraction of good decisions per outcome category (cancer/non-cancer). The two key rates are the True Positive Rate (TPR) and the False Positive Rate (FPR):

  • TPR, what fraction of the time does the model predict a biopsy when the patient actually has cancer (𝑇𝑃𝑅=𝑇𝑃/𝑛_𝑐𝑎𝑛𝑐𝑒𝑟), and thus a biopsy was correctly performed. This is also known as the sensitivity of the model.
  • FPR, what fraction of the time does the model predict a biopsy when the patient actually does not have cancer (𝐹𝑃𝑅=𝐹𝑃/𝑛_𝑛𝑜𝑛−𝑐𝑎𝑛𝑐𝑒𝑟), and a biopsy was thus unnecessary. This is also known as one minus the specificity of the model.

By separating the performance per category, the performance of correctly assigning biopsies (TPR) becomes obvious, whereas in the case of accuracy it was obscured by the large number of healthy people.
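
As a minimal illustration (assuming the model and data from above are still in scope), the TPR and FPR at the 0.2 threshold can be computed directly from the confusion matrix counts; the results should match the normalized confusion matrix shown earlier:

from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(df_cancer_dx['cancer'],
                                  biopsy_model.predict(df_cancer_dx) >= 0.2).ravel()
tpr = tp / (tp + fn)  # sensitivity: cancer patients that are correctly sent for a biopsy
fpr = fp / (fp + tn)  # 1 - specificity: healthy individuals sent for an unnecessary biopsy
print(f'TPR (sensitivity): {tpr:.2f}, FPR (1 - specificity): {fpr:.2f}')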

As we stated earlier, to make biopsy/no-biopsy decisions we need to pick a decision threshold 𝜃 above which we assign a biopsy. But how does 𝜃 influence the TPR and FPR? The following plot shows the distribution of the probabilities produced by our biopsy model, with separate density lines for the patients and the healthy individuals. The density lines are comparable to histograms: they show what fraction of the data has a particular model probability. The black line represents the decision threshold of 0.2, i.e. patients to the right of this line are given a biopsy. From this plot we can see how strongly our model probabilities correlate with the cancer diagnosis that was later established:

df_cancer_dx['biopsy_prob'] = biopsy_model.predict(df_cancer_dx)
df_cancer_dx['noone_prob'] = np.zeros(len(df_cancer_dx['cancer']))
df_cancer_dx['everyone_prob'] = np.ones(len(df_cancer_dx['cancer']))

(prob_density_plot(df_cancer_dx,
                   y='biopsy_prob',
                   by='cancer',
                   vline=[0.2],
                   title='Biopsy Model',
                   xlabel='Probability of model that patient has cancer',
                   labels=['Healthy population', 'Patient population']))
Density plot showing the distribution of the probability output of the model. The orange line shows the population that turned out to have cancer, the blue line shows the population that turned out to be healthy.

Important to note in the plot:

  • Non-cancer patients overwhelmingly have a low predicted probability of cancer (blue line). Choosing a threshold probability of 0.2 means that the True Negative Rate or specificity will be high. Because of that, the False Positive Rate (unnecessary biopsies) will be quite low.
  • Cancer patients show a broader range of predicted probabilities (orange line), which means that whichever probability threshold we choose, the True Positive Rate (accurately assigned biopsies) will never be close to one.

Here we see that there is a tradeoff between the model accurately assigning biopsies to cancer patients (TPR) and limiting unneeded biopsies (FPR). If we plot the TPR and FPR for different values of 𝜃 we get a so-called Receiver Operating Characteristic (ROC) curve. The next figure shows the ROC curves for our three treatment options:

from sklearn.metrics import RocCurveDisplay

plt.figure(figsize=(12, 15))
base_subplot = 420

for i, col in enumerate(['biopsy_prob', 'noone_prob', 'everyone_prob']):
    roc_pos_offset = 2 * (i + 1)
    plt.subplot(base_subplot + roc_pos_offset - 1)
    if col == 'biopsy_prob':
        prob_density_plot(df_cancer_dx, col, 'cancer', title='Biopsy logistic model', vline=[],
                          labels=['Healthy population', 'Sick population'])
    elif col == 'noone_prob':
        df_cancer_dx[col].plot(kind='hist', title='Biopsy no one')
    else:
        ax = df_cancer_dx[col].plot(kind='hist', title='Biopsy everyone')
        ax.set(xlabel='Probability of model that patient has cancer')
    ax = plt.subplot(base_subplot + roc_pos_offset)
    RocCurveDisplay.from_predictions(y_true=df_cancer_dx['cancer'], y_pred=df_cancer_dx[col], ax=ax)
Probability distribution and ROC curves for each of the options: biopsy-none, biopsy-everyone, biopsy based on model.

Note that for the biopsy-no-one and biopsy-everyone options the probability of cancer is always 0 or 1 respectively. A good intuition for the ROC curve is that its area measures whether the predictions of the model rank patients correctly, i.e. it tells you the probability that a randomly chosen cancer patient is ranked higher than a randomly chosen healthy individual (this link).
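
To make this ranking interpretation tangible, here is a small sketch (assuming the model and data from above are still in scope) that compares the AUC reported by scikit-learn with the fraction of (cancer, healthy) pairs in which the cancer patient receives the higher predicted probability:

from sklearn.metrics import roc_auc_score

probs = np.asarray(biopsy_model.predict(df_cancer_dx))
labels = df_cancer_dx['cancer'].to_numpy()
cancer_probs = probs[labels == 1]
healthy_probs = probs[labels == 0]

# Fraction of (cancer, healthy) pairs where the cancer patient is ranked higher;
# ties count for a half
pairwise_auc = ((cancer_probs[:, None] > healthy_probs[None, :]).mean()
                + 0.5 * (cancer_probs[:, None] == healthy_probs[None, :]).mean())

print(pairwise_auc, roc_auc_score(labels, probs))  # should be (nearly) identical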

The plots also show a so-called AUC score: the area under the ROC curve. The AUC quantifies the performance of the model over the whole range of possible decision thresholds, in contrast to accuracy, which looks at one particular decision threshold. The AUC ranges between 0 and 1: 1 means perfect performance, 0.5 means that the model is no better than randomly choosing whom to give a biopsy, and lower than 0.5 means that it is worse than choosing randomly. Important to note in the figure:

  • The ROC curves for the ‘biopsy-no-one’ and ‘biopsy-everyone’ options show a flat line and an AUC of 0.5. For example, for ‘biopsy-everyone’ the probability of cancer is always 1, so for every decision threshold everyone is given a biopsy. This means the TPR is always 1, as all cancer patients are given a biopsy, but the FPR is also 1, as every non-cancer patient is also given a biopsy. The AUC confirms our intuition that such a simplistic way of assigning biopsies is not effective: an AUC of 0.5 indicates this option is no better than choosing randomly.
  • The ROC curve for the biopsy model is a lot better, showing the model to be significantly better than randomly assigning biopsies.

So the AUC is a better metric that is less sensitive to unbalanced data than accuracy, and this makes it very suitable for evaluating decision (support) models. There is still one problem however: the AUC judges a correctly assigned biopsy (TP) and an unnecessary biopsy (FP) as equally important. That does not fit our case study: missing a cancer patient is obviously more harmful than performing an unnecessary biopsy. Next we will look at a metric that addresses this shortcoming: the net benefit metric from decision curve analysis.

Balancing necessary and unnecessary biopsies: net benefit

Decision curve analysis introduces a new metric called the net benefit. It frames the performance of the model as a cost-benefit analysis. The benefit part of net benefit is derived from how good the model is at accurately assigning biopsies, i.e. the True Positive Rate (TPR, sensitivity). The cost part of the net benefit is derived from how many unnecessary biopsies the model suggests, i.e. the False Positive Rate (FPR, 1 − specificity). Note that these are the exact same metrics that the ROC curve and thus the AUC score use.

The unique addition of net benefit is that we can scale how important a correct biopsy is versus an unnecessary biopsy. This scaling is derived from the decision threshold 𝜃. A 𝜃 of 0.2 means that from a 20% probability of cancer onwards we order a biopsy. Implicitly, this means that for each needed biopsy (20% chance) we are willing to accept 4 unnecessary ones (100% − 20% = 80% chance). In other words, an accurate biopsy is 4 times more important, and it counts four times more towards the net benefit.

Given a decision threshold 𝜃, the net benefit is defined as:

net benefit = 𝑇𝑃𝑅_𝜃 × 𝑛_𝑐𝑎𝑛𝑐𝑒𝑟/𝑛 − 𝐹𝑃𝑅_𝜃 × 𝑛_𝑛𝑜𝑛−𝑐𝑎𝑛𝑐𝑒𝑟/𝑛 × 𝜃/(1−𝜃)

where:

  • 𝑇𝑃𝑅_𝜃×𝑛_𝑐𝑎𝑛𝑐𝑒𝑟/𝑛 is the benefit of the comparison, in this case the fraction of cancer patients in the data that get the biopsy they need.
  • 𝐹𝑃𝑅_𝜃×𝑛_𝑛𝑜𝑛−𝑐𝑎𝑛𝑐𝑒𝑟/𝑛 is the cost of the comparison, in this case the fraction of healthy people in the data getting an unnecessary biopsy.
  • 𝜃 the decision threshold: how much risk of cancer would warrant a biopsy? Setting this to 0.2 means that when the probability is 20% or higher, we order a biopsy.
  • 𝜃/(1−𝜃) are the odds that specify how much more or less important we judge an unnecessary biopsy to be, which is linked to the decision threshold. For our previous example of 𝜃=0.2, this gives 1/4, meaning an unnecessary biopsy is 4 times less important than a correctly assigned biopsy. Alternatively: we accept 4 unnecessary biopsies for each correct biopsy. The exact value of 𝜃 should be decided together with medical professionals. A short computational check of the net benefit follows this list.
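
Before turning to the decision curve itself, here is a minimal sketch of the calculation above for a single threshold (assuming the model and data from earlier are still in scope); the values should be comparable to the curves produced by the dcurves package below:

def net_benefit(y_true, probs, theta):
    # Counts of needed (TP) and unnecessary (FP) biopsies at this threshold
    y_true = np.asarray(y_true)
    biopsy = np.asarray(probs) >= theta
    n = len(y_true)
    tp = np.sum(biopsy & (y_true == 1))
    fp = np.sum(biopsy & (y_true == 0))
    # Benefit minus cost, with the cost weighted by the odds theta / (1 - theta)
    return tp / n - fp / n * theta / (1 - theta)

theta = 0.2
nb_model = net_benefit(df_cancer_dx['cancer'], biopsy_model.predict(df_cancer_dx), theta)
nb_all = net_benefit(df_cancer_dx['cancer'], np.ones(len(df_cancer_dx)), theta)
print(f'Net benefit at theta={theta}: model {nb_model:.3f}, biopsy-all {nb_all:.3f}, biopsy-none 0.000')
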

The calculation above is for one value of the decision threshold 𝜃, but we can also calculate it for a range of 𝜃 values. If we graph 𝜃 versus net benefit, we obtain the titular decision curve. A decision curve analysis contains a curve for each of the decision options (biopsy-all, biopsy-none, biopsy-model). For our case study we draw the decision curve using the dcurves package:

from dcurves import dca, plot_graphs

dca_multi_df = dca(data=df_cancer_dx,
                   outcome='cancer',
                   modelnames=['biopsy_prob'],
                   thresholds=np.arange(0, 0.36, 0.01))

plot_graphs(plot_df=dca_multi_df,
            y_limits=[-0.05, 0.2],
            graph_type='net_benefit',
            color_names=['blue', 'red', 'green'],
            smooth_frac=0.4)
Decision curve plot showing decision threshold versus net benefit

In the code above we used a number of settings of the dcurves package. thresholds determines which decision threshold values we explore. We capped this at 0.36 because if the chance of cancer is larger than that, we feel every medical professional would simply schedule a biopsy. In addition, we limit the negative part of the y-axis (net benefit): a negative net benefit means that a medical decision option does more harm than good and doing nothing is a better option, so there is little point in showing much of that region. For more guidance on configuring the DCA, we recommend this article.

A number of observations are important in the decision curve above:

  • biopsy-none (none) always shows a net benefit of 0: you never make a mistake, but you also never correctly assign a biopsy. No cost, no benefit.
  • biopsy-all (all) shows good net benefit for lower values of the decision threshold 𝜃. This makes sense, as lower 𝜃 values emphasize giving biopsies to those with cancer (TP). In this lower 𝜃 range, biopsy-all is a better option than biopsy-none.
  • biopsy-all (all) shows a negative net benefit from a 𝜃 of 0.14 onwards. This means that more harm is done by unnecessary biopsies than benefit is gained from giving biopsies to cancer patients (a short check of why the crossover lies at 0.14 follows this list).
  • The net benefit of our biopsy model is almost always higher than that of both other options. Only for really low 𝜃 values does the biopsy-everyone option become viable, as we are then willing to accept a lot of unnecessary biopsies anyway.
  • Plotting 𝜃 versus net benefit for all these options shows a more subtle performance distribution than we could observe with accuracy or AUC. In terms of AUC, the biopsy-none and biopsy-everyone options were completely useless compared to the model. Now we see that the story is a bit more subtle: in extreme cases one might even prefer the biopsy-everyone option.
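
The crossover at 0.14 in the second bullet is no coincidence: for biopsy-all both TPR and FPR equal 1, so its net benefit reduces to the prevalence minus (1 − prevalence) × 𝜃/(1−𝜃), which is exactly zero when 𝜃 equals the prevalence. A minimal check, assuming df_cancer_dx from above is still in scope:

# Prevalence: the fraction of cancer patients in the dataset (roughly 0.14)
prevalence = df_cancer_dx['cancer'].mean()
theta = prevalence

# Net benefit of biopsy-all: TPR = FPR = 1, so the formula reduces to
# prevalence - (1 - prevalence) * theta / (1 - theta)
nb_all = prevalence - (1 - prevalence) * theta / (1 - theta)
print(prevalence, nb_all)  # nb_all should be (numerically) zero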

Conclusion

Net benefit and decision curves have a number of advantages over accuracy and the AUC score. First, net benefit can deal with unbalanced datasets, which typically contain more healthy individuals than patients. Second, it allows us to weigh different types of errors differently (FP vs. FN), in this case reflecting that a FN (no biopsy for a sick patient) is worse than a FP (unnecessary biopsy). In addition, the decision threshold 𝜃 has a really nice clinical interpretation: how many unnecessary biopsies are you willing to accept for each correctly assigned biopsy? This makes it easier to use the decision curve when communicating the worth of a model to a medical professional. Together, these advantages make decision curve analysis and net benefit a strong candidate to be the go-to performance metric for these kinds of decision (support) models.

Acknowledgements

This work is part of the Medical Data Analytics Center project, a collaboration between the Isala hospital, the software company AppBakkers, and Windesheim University of Applied Sciences, and has been made possible through a grant from TechForFuture. Special thanks go out to Eline Ekkelenkamp for her thorough feedback on early drafts and to Levi Schilder, Gido Hakvoort, and Rob Wanders for reviewing this article.

Background information & further reading

If you want to delve deeper into decision curve analysis and evaluation metrics, I recommend the following resources:

  • This article by Frank Harrell discusses a number of common mistakes made when applying DCA.
  • The official DCA page contains good resources such as a literature list. It also includes a full tutorial for the various versions of the DCA software.
  • This paper warns against the use of data augmentation methods such as SMOTE, and lists DCA as a good alternative to SMOTE.
  • This article discusses a number of evaluation metrics, including the AUC score. It describes the PR AUC, which could be used as an alternative to net benefit.
  • The precision-recall (PR) curve is listed by this link as a good alternative to the AUC score for unbalanced data. However, this paper provides some caveats for using ROC curves and AUC for unbalanced datasets.

--

Research Group for IT-innovations in Healthcare

This Research Group is part of Windesheim University. Our focus: practical research on sustainable embedding of promising IT innovations in care practice.