10 Restrictions on Candidate Predictors

Case-study:

Mortality after surgery for esophageal cancer

Let’s consider the example of predicting 30-day mortality after surgery for esophageal cancer. We analyzed data from the SEER-Medicare database. Among 2041 patients who were over 65 years old and diagnosed between 1991 and 1996, 221 had died by 30 days \[JCO 2006\].

For a robust evaluation of the prognostic relevance of comorbidity, we create a simple sum score. It is the sum of the (partly imputed) values for 5 comorbidities. The maximum observed score in these data is 3.

Comorbidity variables

| Variable | Meaning |
|---|---|
| COMORBI | Comorbidity score based on the count of 5 comorbidities |
| CPD | Chronic Pulmonary Disease |
| Cardio | Cardiovascular disease |
| Diabetes | Diabetes |
| Liver | Liver disease |
| Renal | Renal disease |

We describe the data below. Note that some missing values for comorbidities were imputed with values between 0 and 1. A regression imputation model was used, with the expected value used as a single imputed value.
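
The imputation and scoring steps described above can be sketched on simulated data. This is a minimal illustration, not the imputation model used for the SEER-Medicare data: the toy data frame, the predictors in the imputation model (`AGE`, `Diabetes`), and the proportion of missing values are all assumptions made for the example.

```r
# Toy data: variable names mirror the case study, values are simulated
set.seed(1)
n <- 200
toy <- data.frame(
  AGE      = round(runif(n, 65, 90)),
  CPD      = rbinom(n, 1, 0.25),
  Diabetes = rbinom(n, 1, 0.20)
)
# Set 30 CPD values to missing
toy$CPD[sample(n, 30)] <- NA

# Regression imputation: model the comorbidity on the observed cases
# (glm drops rows with a missing outcome by default)
imp_fit <- glm(CPD ~ AGE + Diabetes, family = binomial, data = toy)

# Fill each missing value with its expected value, a probability in (0, 1)
miss <- is.na(toy$CPD)
toy$CPD[miss] <- predict(imp_fit, newdata = toy[miss, ], type = "response")

# Simple sum score over the (partly imputed) components
toy$COMORBI <- rowSums(toy[, c("CPD", "Diabetes")])
summary(toy$COMORBI)
```

Because imputed values are fractions rather than 0 or 1, the sum score takes many distinct non-integer values, which matches the large number of distinct values reported for COMORBI in the descriptives.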

Code
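# Load packages: rms provides lrm(); Hmisc provides describe(), html(), and Cs()
library(Hmisc)
library(rms)
# Import SEER data set, n=2041
Surgery <- read.csv("data/EsoSurgery.csv")
options(prType='html')
html(describe(Surgery), scroll=TRUE)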
Surgery Descriptives
Surgery

8 Variables   2041 Observations

D30
        n  missing distinct     Info      Sum     Mean      Gmd 
     2041        0        2     0.29      221   0.1083   0.1932 

AGE
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
     2041        0      801        1     73.5    6.723    65.83    66.42    68.58 
      .50      .75      .90      .95 
    72.42    77.25    81.91    85.17 
lowest : 65 65 65 65 65 , highest: 95 95 95 98 102
COMORBI
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
     2041        0      393    0.737   0.2983   0.4589   0.0000   0.0000   0.0000 
      .50      .75      .90      .95 
   0.0000   0.3756   1.0000   1.1968 
 
lowest : 0.00 0.18 0.18 0.19 0.19 , highest: 2.14 2.16 2.16 2.22 3.00
CPD
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
     2041        0      376    0.591  0.09873   0.1718  0.00000  0.00000  0.00000 
      .50      .75      .90      .95 
  0.00000  0.07714  0.16931  1.00000 
 
lowest : 0.000 0.062 0.065 0.065 0.067 , highest: 0.226 0.228 0.233 0.234 1.000
 Value       0.00  0.06  0.07  0.08  0.09  0.10  0.11  0.12  0.13  0.14  0.15  0.16
 Frequency   1514     1    12    23    37    23    36    35    55    35    30    25
 Proportion 0.742 0.000 0.006 0.011 0.018 0.011 0.018 0.017 0.027 0.017 0.015 0.012
                                                           
 Value       0.17  0.18  0.19  0.20  0.21  0.22  0.23  1.00
 Frequency     22    17     7     7     3     2     4   153
 Proportion 0.011 0.008 0.003 0.003 0.001 0.001 0.002 0.075
 
For the frequency table, variable is rounded to the nearest 0.01
Cardio
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
     2041        0      374    0.593  0.09677   0.1702   0.0000   0.0000   0.0000 
      .50      .75      .90      .95 
   0.0000   0.0536   0.1692   1.0000 
 
lowest : 0.000 0.043 0.043 0.045 0.045 , highest: 0.297 0.300 0.326 0.365 1.000
Diabetes
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
     2041        0      378    0.594  0.09295   0.1641   0.0000   0.0000   0.0000 
      .50      .75      .90      .95 
   0.0000   0.0566   0.1304   1.0000 
 
lowest : 0.000 0.046 0.046 0.046 0.048 , highest: 0.191 0.195 0.196 0.206 1.000
 Value       0.00  0.05  0.06  0.07  0.08  0.09  0.10  0.11  0.12  0.13  0.14  0.15
 Frequency   1511    10    59    60    43    43    33    26    33    29    14     9
 Proportion 0.740 0.005 0.029 0.029 0.021 0.021 0.016 0.013 0.016 0.014 0.007 0.004
                                                     
 Value       0.16  0.17  0.18  0.19  0.20  0.21  1.00
 Frequency      5     5     2     3     1     1   154
 Proportion 0.002 0.002 0.001 0.001 0.000 0.000 0.075
 
For the frequency table, variable is rounded to the nearest 0.01
Liver
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
     2041        0      391    0.474 0.002736 0.005334 0.000000 0.000000 0.000000 
      .50      .75      .90      .95 
 0.000000 0.000000 0.001706 0.005047 
 
lowest : 0.00000 0.00028 0.00028 0.00029 0.00029 , highest: 0.02400 0.02591 0.02903 0.03643 1.00000
 Value       0.00  0.01  0.02  0.03  0.04  1.00
 Frequency   1937    78    19     2     1     4
 Proportion 0.949 0.038 0.009 0.001 0.000 0.002
 
For the frequency table, variable is rounded to the nearest 0.01
Renal
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
     2041        0      391     0.48 0.007155  0.01381 0.000000 0.000000 0.000000 
      .50      .75      .90      .95 
 0.000000 0.000000 0.007221 0.012013 
 
lowest : 0.00000 0.00024 0.00041 0.00055 0.00060 , highest: 0.03916 0.04100 0.04135 0.07466 1.00000
 Value       0.00  0.01  0.02  0.03  0.04  0.07  1.00
 Frequency   1757   218    37     8     9     1    11
 Proportion 0.861 0.107 0.018 0.004 0.004 0.000 0.005
 
For the frequency table, variable is rounded to the nearest 0.01
Code
fit1 <- lrm(D30~COMORBI, data=Surgery)
print(fit1)
Logistic Regression Model

 lrm(formula = D30 ~ COMORBI, data = Surgery)

                         Model Likelihood       Discrimination       Rank Discrim.
                               Ratio Test              Indexes             Indexes
 Obs              2041   LR χ2      13.99    R2          0.014    C          0.549
  0               1820   d.f.           1    R2(1,2041)  0.006    Dxy        0.098
  1                221   Pr(>χ2)   0.0002    R2(1,591.2) 0.022    γ          0.162
 max |∂log L/∂β| 3×10⁻⁹                      Brier       0.096    τa         0.019

            β        S.E.    Wald Z   Pr(>|Z|)
 Intercept  -2.2643  0.0850  -26.65   <0.0001
 COMORBI     0.4443  0.1129    3.93   <0.0001

The prognostic value of the score is modest; on its own (univariate logistic regression of D30~COMORBI), we find a c statistic of 0.55.
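
To get a feel for what these numbers imply, the printed estimates can be plugged into the logistic formula. The sketch below only re-uses values copied from the model output (intercept, slope, and Dxy); the variable names are introduced for illustration.

```r
# Estimates copied from the lrm() output
b0  <- -2.2643   # intercept
b1  <-  0.4443   # coefficient of COMORBI
dxy <-  0.098    # Somers' Dxy

# Odds ratio per additional comorbidity
or_per_point <- exp(b1)
or_per_point

# Predicted 30-day mortality for scores 0 to 3
score <- 0:3
risk  <- plogis(b0 + b1 * score)
data.frame(score, risk = round(risk, 3))

# The c statistic follows from Somers' Dxy: C = 0.5 + Dxy / 2
c_stat <- 0.5 + dxy / 2
c_stat
```

The predicted risk rises with the score, but most patients have a score of 0, which is one reason the score discriminates only modestly on its own.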

Testing the equal weights assumption in a simple sumscore

Simple sums of predictors make the assumption of equal weights for each predictor. This assumption can be assessed in at least two ways:

  1. An overall test: is a more refined coding preferable over a simple sum?
    In the example of comorbidity, we consider the sum of 5 comorbidity conditions as a simple sumscore (Table 10.2). A model considering the 5 comorbidity conditions separately has 5 df and a likelihood ratio statistic of 18, in contrast to 14 with 1 df for the simple sumscore. The difference of 3.6 (based on the unrounded statistics) with 4 df has a p-value of 0.46, far from convincing evidence against using the simple sumscore.
  2. Component-wise testing: is one of the comorbidities really deviant in its prognostic value?
    We add the conditions one by one to a regression model that already contains the sumscore. The coefficient of the added condition indicates its deviation from the common effect based on the other conditions.
    We note that the deviations from the common effect are relatively small, except for liver disease and renal disease. Renal disease even seemed to have a protective effect. Both effects were based on small numbers. The standard errors of the estimates were large, and the effects were statistically nonsignificant.
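
The arithmetic of the overall test can be checked directly. The sumscore model is nested in the model with 5 separate conditions (it forces the 5 coefficients to be equal), so the difference in likelihood ratio statistics is referred to a chi-square distribution with 5 - 1 = 4 df. The sketch uses only the numbers quoted in the text.

```r
# Difference in LR statistics between the full model (5 df)
# and the sumscore model (1 df), as reported in the text
lr_diff <- 3.6
df_diff <- 5 - 1

# Upper-tail chi-square probability
p_overall <- pchisq(lr_diff, df = df_diff, lower.tail = FALSE)
round(p_overall, 2)
```
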
Code
# Function that takes a sum score plus its components;
# the outcome and the data are specified as well.
# Requires rms (lrm) and Hmisc (Cs).
test.equal.weights <- function(data=Surgery, y="D30", sumscore="COMORBI",
                  components=Cs(CPD, Cardio, Diabetes, Liver, Renal)) {
  # Results matrix: one row for the sumscore model, one for the full model,
  # and one row per component
  matrix.coefs <- matrix(nrow=(2+length(components)), ncol=7)
  # dput() returns the component names and also echoes them to the console;
  # Cs() deparses the unquoted p-value as "p - value"
  dimnames(matrix.coefs) <- list(c("sumscore", "ALL", dput(as.character(components))),
                              Cs(Coef.Sumscore, SE.Sumscore, Coef.Component, SE.Component,
                                 LR, df, p-value))
  # Make models:
  # 1. sumscore only; stats[3:5] hold the model LR chi-square, its df, and its p-value
  fit1 <- lrm(data[,y] ~ data[,sumscore])
  matrix.coefs[1,]  <- c(fit1$coef[2], sqrt(fit1$var[2,2]), NA, NA, fit1$stats[3:5])

  # 2. full model with all components, for the overall comparison
  fit.full <- lrm(data[,y] ~ as.matrix(data[,components]), x=TRUE)
  # compare model fits: difference in LR chi-square and in df
  p.anova.comparison <- pchisq(fit.full$stats[3] - fit1$stats[3],
                               df= fit.full$stats[4] - fit1$stats[4], lower.tail = FALSE)
  matrix.coefs[2,]  <- c(NA, NA, NA, NA, fit.full$stats[3:4], p.anova.comparison)

  # 3. fit incremental differences to the sumscore: add one component at a time
  for (i in seq_along(components)) {
    fiti <- update(fit1, .~.+ fit.full$x[,i])
    # compare model fits (1 df test for the added component)
    p.anova.comparison <- pchisq(fiti$stats[3] - fit1$stats[3],
                                 df= fiti$stats[4] - fit1$stats[4], lower.tail = FALSE)
    # coef[2] is the sumscore effect, coef[3] the deviation of the added component
    matrix.coefs[2+i,]  <- c(fiti$coef[2], sqrt(fiti$var[2,2]), fiti$coef[3], sqrt(fiti$var[3,3]),
                             fiti$stats[3:4], p.anova.comparison)
  } # end loop
  return(matrix.coefs)
} # end function

kable(test.equal.weights(data=Surgery, y="D30", sumscore="COMORBI", 
                  components=Cs(CPD, Cardio, Diabetes, Liver, Renal)), 
                  caption="**Table 10.2**: Testing deviations for each condition in a sum score. Data from esophageal cancer patients who underwent surgery (2041 patients from SEER-Medicare data, 221 died by 30 days). The overall test for deviations from a simple sum score had a p-value of 0.46 (overall LR test, 4 *df*)")
## c("CPD", "Cardio", "Diabetes", "Liver", "Renal")
Table 10.2: Testing deviations for each condition in a sum score. Data from esophageal cancer patients who underwent surgery (2041 patients from SEER-Medicare data, 221 died by 30 days). The overall test for deviations from a simple sum score had a p-value of 0.46 (overall LR test, 4 df)

| | Coef.Sumscore | SE.Sumscore | Coef.Component | SE.Component | LR | df | p - value |
|---|---|---|---|---|---|---|---|
| sumscore | 0.44 | 0.11 | NA | NA | 14 | 1 | 0.00 |
| ALL | NA | NA | NA | NA | 18 | 5 | 0.46 |
| CPD | 0.51 | 0.14 | -0.22 | 0.31 | 14 | 2 | 0.48 |
| Cardio | 0.49 | 0.16 | -0.13 | 0.32 | 14 | 2 | 0.69 |
| Diabetes | 0.35 | 0.15 | 0.32 | 0.29 | 15 | 2 | 0.27 |
| Liver | 0.42 | 0.12 | 1.31 | 1.03 | 16 | 2 | 0.22 |
| Renal | 0.48 | 0.12 | -1.09 | 1.11 | 15 | 2 | 0.26 |
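
For readers without access to the data file, both tests can be tried on simulated data with base R only. This is a sketch under stated assumptions, not a re-analysis: the prevalences, the coefficients used to generate the outcome, and the object names are invented, and `glm()` with `anova()` stands in for `lrm()`.

```r
set.seed(42)
n <- 2000
comps <- c("CPD", "Cardio", "Diabetes", "Liver", "Renal")

# Simulated 0/1 comorbidities (made-up prevalences)
sim <- as.data.frame(sapply(c(0.10, 0.10, 0.10, 0.02, 0.02),
                            function(p) rbinom(n, 1, p)))
names(sim) <- comps
sim$COMORBI <- rowSums(sim[, comps])

# Outcome generated under equal weights for all conditions
sim$D30 <- rbinom(n, 1, plogis(-2.3 + 0.45 * sim$COMORBI))

fit_sum  <- glm(D30 ~ COMORBI, family = binomial, data = sim)
fit_full <- glm(reformulate(comps, response = "D30"),
                family = binomial, data = sim)

# 1. Overall test: 5 separate effects versus 1 common effect (4 df)
overall <- anova(fit_sum, fit_full, test = "Chisq")
overall

# 2. Component-wise tests: add each condition to the sumscore model (1 df each)
componentwise <- t(sapply(comps, function(v) {
  fit_v <- glm(reformulate(c("COMORBI", v), response = "D30"),
               family = binomial, data = sim)
  a <- anova(fit_sum, fit_v, test = "Chisq")
  c(deviation = unname(coef(fit_v)[v]),   # deviation from the common effect
    LR        = a$Deviance[2],            # LR chi-square for the added term
    p         = a[["Pr(>Chi)"]][2])
}))
round(componentwise, 2)
```

Because the outcome is simulated under equal weights, large deviations should appear only by chance, most readily for the rare conditions, where the standard errors are widest.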

Discussion

In the SEER data case study, we stick to our assumption of a similar effect for all comorbidities. The apparently most deviant effects of liver disease and renal disease were unreliable. We hence may assume similar effects for Liver and Renal as for the other comorbidities.

An extension of the component-wise testing might be to apply a LASSO regression model where the sumscore effect is taken as an offset, with shrinkage of deviations from this offset. In the case study, the overall test was far from statistically significant, and hence we expect shrinkage of most or all of these deviations to zero. This idea is similar to updating of a prediction model, with shrinkage of deviating coefficients to values of a prior model: y~predictor, offset=linear.predictor \[Stat Med 2004\].
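
The offset idea can be sketched as follows, again on simulated data. Several choices here are assumptions rather than the approach of the cited paper: the data are invented, the intercept is re-estimated rather than fixed, and the `glmnet` package is one possible tool for the LASSO step (the text names no software). The LASSO step runs only if `glmnet` is installed.

```r
set.seed(7)
n <- 2000
comps <- c("CPD", "Cardio", "Diabetes", "Liver", "Renal")
sim <- as.data.frame(sapply(c(0.10, 0.10, 0.10, 0.02, 0.02),
                            function(p) rbinom(n, 1, p)))
names(sim) <- comps
sim$COMORBI <- rowSums(sim[, comps])
sim$D30 <- rbinom(n, 1, plogis(-2.3 + 0.45 * sim$COMORBI))

# Step 1: common effect from the sumscore model
fit_sum <- glm(D30 ~ COMORBI, family = binomial, data = sim)

# Offset: the sumscore part of the linear predictor (common slope times score)
off <- unname(coef(fit_sum)["COMORBI"]) * sim$COMORBI

# Step 2a: unpenalized deviations from the common effect
fit_dev <- glm(D30 ~ CPD + Cardio + Diabetes + Liver + Renal + offset(off),
               family = binomial, data = sim)
deviations <- coef(fit_dev)[comps]
round(deviations, 2)

# Step 2b: LASSO shrinkage of the deviations toward zero (optional package)
if (requireNamespace("glmnet", quietly = TRUE)) {
  x <- as.matrix(sim[, comps])
  cv_fit <- glmnet::cv.glmnet(x, sim$D30, family = "binomial", offset = off)
  # Coefficients set exactly to zero mean "no deviation from the common effect"
  print(as.matrix(coef(cv_fit, s = "lambda.min")))
}
```

With the offset in place, a coefficient of zero for a condition means that the condition simply follows the common sumscore effect, so shrinking toward zero is shrinking toward the equal-weights assumption.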

Further discussions on robust modeling were motivated by a case study on prediction of mutations based on family history \[[JAMA 2006](https://pubmed.ncbi.nlm.nih.gov/17003395/ "Prediction of MLH1 and MSH2 mutations in Lynch syndrome")\]. A simple weighting of second degree relatives as half the effect of first degree relatives worked well in this case. And the effect for age of diagnosis in a relative could be assumed identical for the index patient (proband) and the first and second degree relatives. This simplification saved degrees of freedom in the modeling process at the expense of potentially missing specific patterns in the data \[[Stat Med 2007](https://pubmed.ncbi.nlm.nih.gov/17948867/ "Data reduction for prediction: a case study on robust coding of age and family history for the risk of having a genetic mutation")\].
A robust approach is especially attractive in relatively small data sets. Indeed, a major study attempted to model family history for different cancers (colon, endometrial, other) separately for first and second degree relatives, with dichotomization of age as below or above 50 years in 870 patients with only 38 mutations identified \[[NEJM 2006](https://pubmed.ncbi.nlm.nih.gov/16807412/ "Identification and survival of carriers of mutations in DNA mismatch-repair genes in colon cancer")\]. Simulations confirmed that attempting such modeling was a bad idea \[JCE 2018\], both because of severe overfitting (38 events) and dichotomania. In contrast, the robust modeling strategy was applied in various versions of prediction models for mutation status \[[Gastroenterology 2011](https://pubmed.ncbi.nlm.nih.gov/20727894/ "The PREMM(1,2,6) model predicts risk of MLH1, MSH2, and MSH6 germline mutations based on cancer history"); [JCO 2017](https://pubmed.ncbi.nlm.nih.gov/28489507/ "Development and Validation of the PREMM5 Model for Comprehensive Risk Assessment of Lynch Syndrome")\], with satisfactory performance in a large-scale international validation study \[[JNCI 2015](https://pubmed.ncbi.nlm.nih.gov/26582061/ "Comparison of Prediction Models for Lynch Syndrome Among Individuals With Colorectal Cancer")\].
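
The weighting of relatives described above amounts to a simple recoding. The sketch below uses hypothetical counts and column names; it only illustrates how two family-history predictors collapse into one, so that the model spends 1 df instead of 2.

```r
# Hypothetical counts of affected relatives for four probands
fh <- data.frame(
  n_first  = c(0, 1, 2, 0),   # affected first degree relatives
  n_second = c(0, 0, 1, 2)    # affected second degree relatives
)

# Robust coding: second degree relatives count half
fh$fh_score <- fh$n_first + 0.5 * fh$n_second
fh
```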

Literature

SEER data case study:

Methods papers:

Mutation status prediction: