10 Restrictions on Candidate Predictors
Case-study:
Mortality after surgery for esophageal cancer
Let’s consider the example of predicting 30-day mortality after surgery for esophageal cancer. We analyzed data from the SEER-Medicare database. Among 2041 patients who were over 65 years old and diagnosed between 1991 and 1996, 221 (10.8%) had died by 30 days \[JCO 2006\].
For a robust evaluation of the prognostic relevance of comorbidity, we create a simple sum score. It is the sum of the (partly imputed) values for 5 comorbidities. The observed maximum is 3 comorbidities in these data.
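As a minimal sketch of how such a sum score can be formed, assume a data frame that holds the five comorbidity columns under the names used in the variable table. The toy values are hypothetical and are not taken from the SEER-Medicare data; fractional entries stand in for imputed values.

```r
# Toy data with hypothetical values; fractional entries mimic imputed comorbidities
toy <- data.frame(
  CPD      = c(0, 1, 0.12, 0),
  Cardio   = c(0, 1, 0.08, 0),
  Diabetes = c(0, 0, 0.07, 1),
  Liver    = c(0, 0, 0.01, 0),
  Renal    = c(0, 1, 0.02, 0)
)
components <- c("CPD", "Cardio", "Diabetes", "Liver", "Renal")

# Simple sum score: every comorbidity gets the same weight of 1
toy$COMORBI <- rowSums(toy[, components])
toy
```

Because every component enters with weight 1, the score spends a single degree of freedom in a regression model, whatever the number of conditions.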
Variable | Meaning |
---|---|
COMORBI | Comorbidity score based on the count of 5 comorbidities |
CPD | Chronic Pulmonary Disease |
Cardio | Cardiovascular disease |
Diabetes | Diabetes |
Liver | Liver disease |
Renal | Renal disease |
We describe the data below. Note that some missing values for comorbidities were imputed with values between 0 and 1. A regression imputation model was used, with the expected value used as a single imputed value.
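The imputation step can be sketched as follows. The source does not state which predictors entered the imputation model, so the use of age alone, the simulated data, and all object names here are assumptions made purely for illustration.

```r
set.seed(1)
n <- 500
age <- runif(n, 65, 90)
# Hypothetical comorbidity whose prevalence rises with age
cpd <- rbinom(n, 1, plogis(-6 + 0.05 * age))
# Set roughly 25% of the values to missing
cpd_obs <- ifelse(runif(n) < 0.25, NA, cpd)
toy <- data.frame(age = age, CPD = cpd_obs)

# Imputation model fitted on the complete cases (glm drops NA rows by default)
imp_model <- glm(CPD ~ age, family = binomial, data = toy)

# Replace each missing value by its expected value: a probability between 0 and 1
is_missing <- is.na(toy$CPD)
toy$CPD[is_missing] <- predict(imp_model, newdata = toy[is_missing, ],
                               type = "response")

summary(toy$CPD)
```

This explains why the comorbidity variables described next take many distinct values between 0 and 1 in addition to the observed 0 and 1.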
Code
# rms provides lrm(); Hmisc provides describe(), html() and Cs()
library(rms)
library(Hmisc)
# Import SEER data set, n=2041
Surgery <- read.csv("data/EsoSurgery.csv")
options(prType='html')
html(describe(Surgery), scroll=TRUE)
8 Variables 2041 Observations
D30
n | missing | distinct | Info | Sum | Mean | Gmd |
---|---|---|---|---|---|---|
2041 | 0 | 2 | 0.29 | 221 | 0.1083 | 0.1932 |
AGE
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2041 | 0 | 801 | 1 | 73.5 | 6.723 | 65.83 | 66.42 | 68.58 | 72.42 | 77.25 | 81.91 | 85.17 |
COMORBI
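n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2041 | 0 | 393 | 0.737 | 0.2983 | 0.4589 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.3756 | 1.0000 | 1.1968 |

lowest : 0.00 0.18 0.18 0.19 0.19, highest: 2.14 2.16 2.16 2.22 3.00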
CPD
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2041 | 0 | 376 | 0.591 | 0.09873 | 0.1718 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.07714 | 0.16931 | 1.00000 |

lowest : 0.000 0.062 0.065 0.065 0.067, highest: 0.226 0.228 0.233 0.234 1.000

Value | 0.00 | 0.06 | 0.07 | 0.08 | 0.09 | 0.10 | 0.11 | 0.12 | 0.13 | 0.14 | 0.15 | 0.16 | 0.17 | 0.18 | 0.19 | 0.20 | 0.21 | 0.22 | 0.23 | 1.00 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Frequency | 1514 | 1 | 12 | 23 | 37 | 23 | 36 | 35 | 55 | 35 | 30 | 25 | 22 | 17 | 7 | 7 | 3 | 2 | 4 | 153 |
Proportion | 0.742 | 0.000 | 0.006 | 0.011 | 0.018 | 0.011 | 0.018 | 0.017 | 0.027 | 0.017 | 0.015 | 0.012 | 0.011 | 0.008 | 0.003 | 0.003 | 0.001 | 0.001 | 0.002 | 0.075 |

For the frequency table, variable is rounded to the nearest 0.01
Cardio
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2041 | 0 | 374 | 0.593 | 0.09677 | 0.1702 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0536 | 0.1692 | 1.0000 |

lowest : 0.000 0.043 0.043 0.045 0.045, highest: 0.297 0.300 0.326 0.365 1.000
Diabetes
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2041 | 0 | 378 | 0.594 | 0.09295 | 0.1641 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0566 | 0.1304 | 1.0000 |

lowest : 0.000 0.046 0.046 0.046 0.048, highest: 0.191 0.195 0.196 0.206 1.000

Value | 0.00 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 | 0.10 | 0.11 | 0.12 | 0.13 | 0.14 | 0.15 | 0.16 | 0.17 | 0.18 | 0.19 | 0.20 | 0.21 | 1.00 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Frequency | 1511 | 10 | 59 | 60 | 43 | 43 | 33 | 26 | 33 | 29 | 14 | 9 | 5 | 5 | 2 | 3 | 1 | 1 | 154 |
Proportion | 0.740 | 0.005 | 0.029 | 0.029 | 0.021 | 0.021 | 0.016 | 0.013 | 0.016 | 0.014 | 0.007 | 0.004 | 0.002 | 0.002 | 0.001 | 0.001 | 0.000 | 0.000 | 0.075 |

For the frequency table, variable is rounded to the nearest 0.01
Liver
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2041 | 0 | 391 | 0.474 | 0.002736 | 0.005334 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.001706 | 0.005047 |

lowest : 0.00000 0.00028 0.00028 0.00029 0.00029, highest: 0.02400 0.02591 0.02903 0.03643 1.00000

Value | 0.00 | 0.01 | 0.02 | 0.03 | 0.04 | 1.00 |
---|---|---|---|---|---|---|
Frequency | 1937 | 78 | 19 | 2 | 1 | 4 |
Proportion | 0.949 | 0.038 | 0.009 | 0.001 | 0.000 | 0.002 |

For the frequency table, variable is rounded to the nearest 0.01
Renal
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2041 | 0 | 391 | 0.48 | 0.007155 | 0.01381 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.007221 | 0.012013 |

lowest : 0.00000 0.00024 0.00041 0.00055 0.00060, highest: 0.03916 0.04100 0.04135 0.07466 1.00000

Value | 0.00 | 0.01 | 0.02 | 0.03 | 0.04 | 0.07 | 1.00 |
---|---|---|---|---|---|---|---|
Frequency | 1757 | 218 | 37 | 8 | 9 | 1 | 11 |
Proportion | 0.861 | 0.107 | 0.018 | 0.004 | 0.004 | 0.000 | 0.005 |

For the frequency table, variable is rounded to the nearest 0.01
Code
# Univariate logistic regression of 30-day mortality on the sum score
fit1 <- lrm(D30~COMORBI, data=Surgery)
print(fit1)
lrm(formula = D30 ~ COMORBI, data = Surgery)
| | Model Likelihood Ratio Test | Discrimination Indexes | Rank Discrim. Indexes |
---|---|---|---|
Obs 2041 | LR χ² 13.99 | R² 0.014 | C 0.549 |
0 1820 | d.f. 1 | R²(1,2041) 0.006 | Dxy 0.098 |
1 221 | Pr(>χ²) 0.0002 | R²(1,591.2) 0.022 | γ 0.162 |
max \|∂log L/∂β\| 3×10⁻⁹ | | Brier 0.096 | τa 0.019 |

| | β | S.E. | Wald Z | Pr(>\|Z\|) |
---|---|---|---|---|
Intercept | -2.2643 | 0.0850 | -26.65 | <0.0001 |
COMORBI | 0.4443 | 0.1129 | 3.93 | <0.0001 |
The prognostic value of the score is modest: on its own (univariate logistic regression of `D30~COMORBI`), it yields a c statistic of 0.55.
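To make the fitted coefficients more tangible, the short sketch below converts them into an odds ratio and predicted risks, using only the values printed in the model output. The object names are illustrative. The risk rises from roughly 9% without comorbidity to roughly 14% with one comorbidity, a limited spread that is in line with the modest c statistic.

```r
# Coefficients as reported in the lrm output
intercept <- -2.2643
beta      <-  0.4443

# Odds ratio per additional comorbidity
exp(beta)

# Predicted 30-day mortality for sum scores 0 to 3
score <- 0:3
risk  <- plogis(intercept + beta * score)
round(data.frame(score, risk), 3)

# Somers' Dxy and the c statistic are linked by C = 0.5 + Dxy / 2
0.5 + 0.098 / 2
```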
Testing the equal weights assumption in a simple sumscore
Simple sums of predictors assume equal weights for each predictor. This assumption can be assessed in at least two ways:
- An overall test: is a more refined coding preferable over a simple sum?
  In the example of comorbidity, we consider the sum of 5 comorbidity conditions as a simple sumscore (Table 10.2). A model considering the 5 comorbidity conditions separately has 5 df and a likelihood ratio statistic of 18, in contrast to 14 with 1 df for the simple sumscore. The difference of 3.6 with 4 df has a p-value of 0.46, which is far from convincing evidence against using the simple sumscore.
- Component-wise testing: does one of the comorbidities really deviate in prognostic value?
  We add the conditions one by one to a regression model that already contains the sumscore. The coefficient of the added condition indicates its deviation from the common effect based on the other conditions.

We note that the deviations from the common effect are relatively small, except for liver disease and renal disease. Renal disease even seemed to have a protective effect. Both effects were based on small numbers: the standard errors of the estimates were large, and the effects were statistically nonsignificant.
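The p-value of the overall test can be checked from the reported numbers alone. The sketch below takes the difference in likelihood ratio statistics (3.6) and the difference in degrees of freedom (5 - 1 = 4) from the text; the variable names are illustrative.

```r
# Difference in LR statistics between the 5-condition model and the sum-score model
lr_diff <- 3.6
# Difference in degrees of freedom: 5 separate conditions versus 1 sum score
df_diff <- 5 - 1

# Upper-tail chi-square probability
p_value <- pchisq(lr_diff, df = df_diff, lower.tail = FALSE)
round(p_value, 2)
```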
Code
# Make a function that takes a sum score plus its components;
# outcome and data are specified as well
test.equal.weights <- function(data=Surgery, y="D30", sumscore="COMORBI",
                               components=Cs(CPD, Cardio, Diabetes, Liver, Renal)) {
  # results in matrix: one row for the sum score, one for the full model, one per component
  matrix.coefs <- matrix(nrow=(2+length(components)), ncol=7)
  # labels.components <- dput(as.character(components)) # to get it nice for row.labels
  # Cs() turns the unquoted p-value into the column label "p - value"
  dimnames(matrix.coefs) <- list(c("sumscore", "ALL", dput(as.character(components))),
                                 Cs(Coef.Sumscore, SE.Sumscore, Coef.Component, SE.Component,
                                    LR, df, p-value))
  # Make models:
  # 1. sumscore
  fit1 <- lrm(data[,y] ~ data[,sumscore])
  matrix.coefs[1,] <- c(fit1$coef[2], sqrt(fit1$var[2,2]), NA, NA, fit1$stats[3:5])

  # 2. full model for overall comparison
  fit.full <- lrm(data[,y] ~ as.matrix(data[,components]), x=TRUE)
  # compare model fits
  p.anova.comparison <- pchisq(fit.full$stats[3] - fit1$stats[3],
                               df=fit.full$stats[4] - fit1$stats[4], lower.tail=FALSE)
  matrix.coefs[2,] <- c(NA, NA, NA, NA, fit.full$stats[3:4], p.anova.comparison)

  # 3. fit incremental differences to sumscore
  for (i in seq_along(components)) {
    fiti <- update(fit1, .~.+ fit.full$x[,i])
    # compare model fits
    p.anova.comparison <- pchisq(fiti$stats[3] - fit1$stats[3],
                                 df=fiti$stats[4] - fit1$stats[4], lower.tail=FALSE)
    matrix.coefs[2+i,] <- c(fiti$coef[2], sqrt(fiti$var[2,2]), fiti$coef[3], sqrt(fiti$var[3,3]),
                            fiti$stats[3:4], p.anova.comparison)
  } # end loop
  return(matrix.coefs)
} # end function
kable(test.equal.weights(data=Surgery, y="D30", sumscore="COMORBI",
components=Cs(CPD, Cardio, Diabetes, Liver, Renal)),
caption="**Table 10.2**: Testing deviations for each condition in a sum score. Data from esophageal cancer patients who underwent surgery (2041 patients from SEER-Medicare data, 221 died by 30 days). The overall test for deviations from a simple sum score had a p-value of 0.46 (overall LR test, 4 *df*)")
## c("CPD", "Cardio", "Diabetes", "Liver", "Renal")
| | Coef.Sumscore | SE.Sumscore | Coef.Component | SE.Component | LR | df | p - value |
---|---|---|---|---|---|---|---|
sumscore | 0.44 | 0.11 | NA | NA | 14 | 1 | 0.00 |
ALL | NA | NA | NA | NA | 18 | 5 | 0.46 |
CPD | 0.51 | 0.14 | -0.22 | 0.31 | 14 | 2 | 0.48 |
Cardio | 0.49 | 0.16 | -0.13 | 0.32 | 14 | 2 | 0.69 |
Diabetes | 0.35 | 0.15 | 0.32 | 0.29 | 15 | 2 | 0.27 |
Liver | 0.42 | 0.12 | 1.31 | 1.03 | 16 | 2 | 0.22 |
Renal | 0.48 | 0.12 | -1.09 | 1.11 | 15 | 2 | 0.26 |
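The component rows of Table 10.2 can be read with the following decomposition. The notation is introduced here for illustration and is not the author's: $x_k$ denotes the five conditions, $\beta$ the sumscore coefficient (Coef.Sumscore), and $\gamma_j$ the coefficient of the added condition $j$ (Coef.Component).

```latex
\operatorname{logit} P(\mathrm{D30}=1)
  = \alpha + \beta \sum_{k} x_k + \gamma_j x_j
  = \alpha + \beta \sum_{k \neq j} x_k + (\beta + \gamma_j)\, x_j
```

The implied effect of condition $j$ is hence $\beta + \gamma_j$. With the rounded estimates in the table this gives 0.42 + 1.31 = 1.73 for liver disease and 0.48 - 1.09 = -0.61 for renal disease, where the negative sign corresponds to the seemingly protective effect; both estimates carry standard errors above 1 for the deviation.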
Discussion
In the SEER data case study, we stick to our assumption of a similar effect for all comorbidities. The apparently most deviant effects, those of liver disease and renal disease, were unreliable. We may hence assume similar effects for `Liver` and `Renal` as for the other comorbidities.
An extension of the component-wise testing might be to apply a LASSO regression model where the sumscore effect is taken as an offset, with shrinkage of deviations from this offset. In the case study, the overall test was far from statistically significant, and hence we expect shrinkage of most or all of these deviations to zero. This idea is similar to updating of a prediction model, with shrinkage of deviating coefficients to values of a prior model: `y~predictor, offset=linear.predictor` \[Stat Med 2004\].
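A minimal sketch of this idea, assuming the `glmnet` package and simulated data in which all five conditions truly share one common effect. This was not applied to the case study in the source; the data, the effect sizes, and the object names are all hypothetical.

```r
library(glmnet)

set.seed(42)
n <- 2000
# Five hypothetical binary conditions that truly share one common effect of 0.4
x <- matrix(rbinom(n * 5, 1, 0.15), ncol = 5,
            dimnames = list(NULL, c("CPD", "Cardio", "Diabetes", "Liver", "Renal")))
sumscore <- rowSums(x)
y <- rbinom(n, 1, plogis(-2.3 + 0.4 * sumscore))

# Step 1: sum-score model; its linear predictor becomes the offset
fit_sum <- glm(y ~ sumscore, family = binomial)
lp <- predict(fit_sum, type = "link")

# Step 2: LASSO on the components with the sum-score effect held fixed;
# the penalized coefficients are deviations from the common effect
cv_fit <- cv.glmnet(x, y, family = "binomial", offset = lp, alpha = 1)
deviations <- coef(cv_fit, s = "lambda.1se")
deviations
```

Coefficients that the penalty sets to zero indicate conditions for which the common sumscore effect is retained; a nonzero coefficient is a shrunken deviation from that common effect.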
Further discussions on robust modeling were motivated by a case study on prediction of mutations based on family history \[[JAMA 2006](https://pubmed.ncbi.nlm.nih.gov/17003395/ "Prediction of MLH1 and MSH2 mutations in Lynch syndrome")\]. A simple weighting of second-degree relatives as half the effect of first-degree relatives worked well in this case. Moreover, the effect of age at diagnosis in a relative could be assumed identical for the index patient (proband) and the first- and second-degree relatives. This simplification saved degrees of freedom in the modeling process at the expense of potentially missing specific patterns in the data \[[Stat Med 2007](https://pubmed.ncbi.nlm.nih.gov/17948867/ "Data reduction for prediction: a case study on robust coding of age and family history for the risk of having a genetic mutation")\].
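The half-weight coding can be sketched as a restricted model that is compared with a model estimating separate effects. The data below are simulated and the variable names are hypothetical; this is an illustration of the principle, not the analysis from the cited papers.

```r
set.seed(7)
n <- 1500
# Hypothetical counts of affected first- and second-degree relatives
first  <- rpois(n, 0.4)
second <- rpois(n, 0.6)
# Simulated truth: second-degree relatives carry half the weight
y <- rbinom(n, 1, plogis(-2.5 + 0.8 * first + 0.4 * second))

# Robust coding: one combined family-history term, 1 df
fit_robust <- glm(y ~ I(first + 0.5 * second), family = binomial)
# Flexible coding: separate effects, 2 df
fit_free <- glm(y ~ first + second, family = binomial)

# LR test of the restriction (1 df); the robust model is nested in the flexible one
anova(fit_robust, fit_free, test = "Chisq")
```

The restricted model estimates one coefficient fewer, which is the saving in degrees of freedom referred to above.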
A robust approach is especially attractive in relatively small data sets. Indeed, a major study attempted to model family history for different cancers (colon, endometrial, other) separately for first- and second-degree relatives, with dichotomization of age as below or above 50 years, in 870 patients with only 38 mutations identified \[[NEJM 2006](https://pubmed.ncbi.nlm.nih.gov/16807412/ "Identification and survival of carriers of mutations in DNA mismatch-repair genes in colon cancer")\]. Simulations confirmed that attempting such modeling was a bad idea \[JCE 2018\], both because of severe overfitting (38 events) and dichotomania. In contrast, the robust modeling strategy was applied in various versions of prediction models for mutation status \[[Gastroenterology 2011](https://pubmed.ncbi.nlm.nih.gov/20727894/ "The PREMM(1,2,6) model predicts risk of MLH1, MSH2, and MSH6 germline mutations based on cancer history"); [JCO 2017](https://pubmed.ncbi.nlm.nih.gov/28489507/ "Development and Validation of the PREMM5 Model for Comprehensive Risk Assessment of Lynch Syndrome")\], with satisfactory performance in a large-scale international validation study \[[JNCI 2015](https://pubmed.ncbi.nlm.nih.gov/26582061/ "Comparison of Prediction Models for Lynch Syndrome Among Individuals With Colorectal Cancer")\].
Literature
SEER data case study:
- Steyerberg EW, Neville BA, Koppert LB, Lemmens VE, Tilanus HW, Coebergh JW, Weeks JC, Earle CC. Surgical mortality in patients with esophageal cancer: development and validation of a simple risk score. J Clin Oncol. 2006 Sep 10;24(26):4277-84. doi: 10.1200/JCO.2005.05.0658. PMID: 16963730
Methods papers:
- Steyerberg EW, Borsboom GJ, van Houwelingen HC, Eijkemans MJ, Habbema JD. Validation and updating of predictive logistic regression models: a study on sample size and shrinkage. Stat Med. 2004 Aug 30;23(16):2567-86. doi: 10.1002/sim.1844. PMID: 15287085
- Steyerberg EW, Uno H, Ioannidis JPA, van Calster B; Collaborators. Poor performance of clinical prediction models: the harm of commonly applied methods. J Clin Epidemiol. 2018 Jun;98:133-143. doi: 10.1016/j.jclinepi.2017.11.013. Epub 2017 Nov 24. PMID: 29174118
- Steyerberg EW, Balmaña J, Stockwell DH, Syngal S. Data reduction for prediction: a case study on robust coding of age and family history for the risk of having a genetic mutation. Stat Med. 2007 Dec 30;26(30):5545-56. doi: 10.1002/sim.3119. PMID: 17948867
Mutation status prediction:
- Balmaña J, Stockwell DH, Steyerberg EW, …, Burbidge LA, Syngal S. Prediction of MLH1 and MSH2 mutations in Lynch syndrome. JAMA. 2006 Sep 27;296(12):1469-78. doi: 10.1001/jama.296.12.1469. PMID: 17003395
- Barnetson RA, Tenesa A, Farrington SM, Nicholl ID, Cetnarskyj R, Porteous ME, Campbell H, Dunlop MG. Identification and survival of carriers of mutations in DNA mismatch-repair genes in colon cancer. N Engl J Med. 2006 Jun 29;354(26):2751-63. doi: 10.1056/NEJMoa053493. PMID: 16807412
- Kastrinos F, Uno H, Ukaegbu C, …, Steyerberg EW, Syngal S. Development and Validation of the PREMM5 Model for Comprehensive Risk Assessment of Lynch Syndrome. J Clin Oncol. 2017 Jul 1;35(19):2165-2172. doi: 10.1200/JCO.2016.69.6120. Epub 2017 May 10. PMID: 28489507
- Kastrinos F, Steyerberg EW, Mercado R, …, Wenstrup RJ, Syngal S. The PREMM(1,2,6) model predicts risk of MLH1, MSH2, and MSH6 germline mutations based on cancer history. Gastroenterology. 2011 Jan;140(1):73-81. doi: 10.1053/j.gastro.2010.08.021. Epub 2010 Aug 19. PMID: 20727894
- Kastrinos F, Ojha RP, Leenen C, …, Syngal S, Steyerberg EW; Lynch Syndrome prediction model validation study group. Comparison of Prediction Models for Lynch Syndrome Among Individuals With Colorectal Cancer. J Natl Cancer Inst. 2015 Nov 18;108(2):djv308. doi: 10.1093/jnci/djv308. Print 2016 Feb. PMID: 26582061