10  Restrictions on Candidate Predictors

Case-study:

Mortality after surgery for esophageal cancer

Let’s consider the example of predicting 30-day mortality after surgery for esophageal cancer. We analyzed data from the SEER-Medicare database. Among 2041 patients who were over 65 years old and diagnosed between 1991 and 1996, 221 had died by 30 days (JCO 2006).

For a robust evaluation of the prognostic relevance of comorbidity, we create a simple sum score: the sum of the (partly imputed) values of 5 comorbidities. The maximum observed score is 3 in this case.

Comorbidity variables
Variable   Meaning
COMORBI    Comorbidity score based on the count of 5 comorbidities
CPD        Chronic pulmonary disease
Cardio     Cardiovascular disease
Diabetes   Diabetes
Liver      Liver disease
Renal      Renal disease

We describe the data below. Note that some missing values for comorbidities were imputed with values between 0 and 1. A regression imputation model was used, with the expected value used as a single imputed value.
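As a minimal sketch of this kind of single regression imputation, the block below uses simulated data, not the SEER-Medicare file. The variable names (`age`, `cpd`, `cardio`), the use of age as the only predictor in the imputation model, and the 20% missingness are assumptions for illustration; the imputation model actually used in the case study is not described here.

```r
# Simulated data (illustrative only; not the SEER-Medicare data)
set.seed(10)
n      <- 500
age    <- runif(n, 65, 95)
cpd    <- rbinom(n, 1, plogis(-2 + 0.03 * (age - 75)))  # hypothetical comorbidity
cardio <- rbinom(n, 1, 0.10)                            # fully observed comorbidity
cpd[sample(n, 100)] <- NA                               # set 20% of values to missing

d <- data.frame(age, cpd, cardio)

# Imputation model, fitted on the patients with observed cpd
# (glm drops rows with a missing outcome by default)
imp.fit <- glm(cpd ~ age, family = binomial, data = d)

# Replace each missing value by its expected value (a predicted probability)
miss <- is.na(d$cpd)
d$cpd[miss] <- predict(imp.fit, newdata = d[miss, ], type = "response")

# Sum score over the (partly imputed) components
d$score <- rowSums(d[, c("cpd", "cardio")])
summary(d$cpd[miss])   # imputed values are fractions between 0 and 1
```

Observed values stay at 0 or 1 while imputed values are fractions, which is consistent with the several hundred distinct values that `describe()` reports for each comorbidity.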

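# Packages: Hmisc (describe, html, Cs), rms (lrm), knitr (kable)
library(Hmisc)
library(rms)
library(knitr)
# Import SEER data set, n=2041
Surgery <- read.csv("data/EsoSurgery.csv")
options(prType='html')
html(describe(Surgery), scroll=TRUE)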
Surgery Descriptives
Surgery

8 Variables   2041 Observations

D30
        n  missing distinct     Info      Sum     Mean      Gmd 
     2041        0        2     0.29      221   0.1083   0.1932 

AGE
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
     2041        0      801        1     73.5    6.723    65.83    66.42    68.58 
      .50      .75      .90      .95 
    72.42    77.25    81.91    85.17 
lowest : 65.0021 65.076 65.0815 65.0842 65.1608 , highest: 94.6639 94.6667 95.0856 97.9986 101.744
COMORBI
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
     2041        0      393    0.737   0.2983   0.4589   0.0000   0.0000   0.0000 
      .50      .75      .90      .95 
   0.0000   0.3756   1.0000   1.1968  
lowest : 0 0.179426 0.182296 0.189824 0.193299 , highest: 2.14073 2.15795 2.15974 2.21896 3
CPD
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
     2041        0      376    0.591  0.09873   0.1718  0.00000  0.00000  0.00000 
      .50      .75      .90      .95 
  0.00000  0.07714  0.16931  1.00000  
 Value       0.00  0.06  0.07  0.08  0.09  0.10  0.11  0.12  0.13  0.14  0.15  0.16
 Frequency   1514     5    17    32    29    35    28    56    40    30    29    25
 Proportion 0.742 0.002 0.008 0.016 0.014 0.017 0.014 0.027 0.020 0.015 0.014 0.012
                                                           
 Value       0.17  0.18  0.19  0.20  0.21  0.22  0.23  1.00
 Frequency     18    11     7     5     2     3     2   153
 Proportion 0.009 0.005 0.003 0.002 0.001 0.001 0.001 0.075 
For the frequency table, variable is rounded to the nearest 0.01
Cardio
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
     2041        0      374    0.593  0.09677   0.1702   0.0000   0.0000   0.0000 
      .50      .75      .90      .95 
   0.0000   0.0536   0.1692   1.0000  
lowest : 0 0.0425048 0.0433258 0.0447525 0.0448201 , highest: 0.296846 0.299705 0.325783 0.36526 1

Diabetes
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
     2041        0      378    0.594  0.09295   0.1641   0.0000   0.0000   0.0000 
      .50      .75      .90      .95 
   0.0000   0.0566   0.1304   1.0000  
 Value       0.00  0.04  0.05  0.06  0.07  0.08  0.09  0.10  0.11  0.12  0.13  0.14
 Frequency   1511     5    33    59    54    47    35    33    24    32    19    12
 Proportion 0.740 0.002 0.016 0.029 0.026 0.023 0.017 0.016 0.012 0.016 0.009 0.006
                                                     
 Value       0.15  0.16  0.17  0.18  0.19  0.20  1.00
 Frequency      7     7     3     2     3     1   154
 Proportion 0.003 0.003 0.001 0.001 0.001 0.000 0.075 
For the frequency table, variable is rounded to the nearest 0.01
Liver
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
     2041        0      391    0.474 0.002736 0.005334 0.000000 0.000000 0.000000 
      .50      .75      .90      .95 
 0.000000 0.000000 0.001706 0.005047  
lowest : 0 0.000277464 0.000284278 0.000291895 0.000293725 , highest: 0.024005 0.0259095 0.0290324 0.0364262 1

Renal
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
     2041        0      391     0.48 0.007155  0.01381 0.000000 0.000000 0.000000 
      .50      .75      .90      .95 
 0.000000 0.000000 0.007221 0.012013  
lowest : 0 0.000242474 0.000406405 0.000550073 0.00060003 , highest: 0.0391617 0.0409958 0.0413482 0.0746603 1

fit1 <- lrm(D30~COMORBI, data=Surgery)
print(fit1)

Logistic Regression Model

lrm(formula = D30 ~ COMORBI, data = Surgery)

                          Model Likelihood    Discrimination        Rank Discrim.
                             Ratio Test           Indexes              Indexes
Obs               2041    LR χ2      13.99    R2           0.014    C      0.549
 0                1820    d.f.           1    R2(1,2041)   0.006    Dxy    0.098
 1                 221    Pr(>χ2)   0.0002    R2(1,591.2)  0.022    γ      0.162
max |∂log L/∂β| 3×10^-9                       Brier        0.096    τa     0.019

            β        S.E.    Wald Z  Pr(>|Z|)
Intercept   -2.2643  0.0850  -26.65  <0.0001
COMORBI      0.4443  0.1129    3.93  <0.0001

The prognostic value of the score is modest; on its own (univariate logistic regression of D30~COMORBI), we find a c statistic of 0.55.
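To make the c statistic concrete, the following sketch computes it from ranks, which is equivalent to the proportion of event/non-event pairs in which the event has the higher prediction (ties count as one half). The function name `c.stat` and the toy values are hypothetical; only the numbers 0.549 and 0.098 come from the model output above.

```r
# c statistic from ranks (equivalent to the Wilcoxon-Mann-Whitney statistic)
c.stat <- function(pred, y) {
  r  <- rank(pred)        # average ranks, so ties count as 1/2
  n1 <- sum(y == 1)       # number of events
  n0 <- sum(y == 0)       # number of non-events
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy example: 3 of the 4 event/non-event pairs are ordered correctly
c.toy <- c.stat(pred = c(0.10, 0.40, 0.35, 0.80), y = c(0, 0, 1, 1))
c.toy   # 0.75

# Somers' Dxy is a rescaling of c: Dxy = 2 * (c - 0.5)
dxy <- 2 * (0.549 - 0.5)
dxy     # 0.098, as reported in the lrm output
```

A value of 0.5 corresponds to no discrimination, so 0.55 is only slightly better than chance.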

Testing the equal weights assumption in a simple sumscore

Simple sums of predictors assume equal weights for each predictor. This assumption can be assessed in at least two ways:

  1. An overall test: is a more refined coding preferable to a simple sum?
    In the example of comorbidity, we consider the sum of 5 comorbidity conditions as a simple sumscore (Table 10.2). A model considering the 5 comorbidity conditions separately has 5 df and a likelihood ratio (LR) statistic of 18, in contrast to 14 with 1 df for the simple sumscore. The difference of 3.6 with 4 df has a p-value of 0.46, which provides no convincing evidence against using the simple sumscore.
  2. Component-wise testing: does one of the comorbidities really deviate in prognostic value?
    We add the conditions one by one to a regression model that already contains the sumscore. The coefficient of the added condition indicates its deviation from the common effect based on the other conditions.
    We note that the deviations from the common effect are relatively small, except for liver disease and renal disease. Renal disease even seemed to have a protective effect. Both effects were based on small numbers: the standard errors of the estimates were large, and the effects were statistically nonsignificant.
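The arithmetic of the overall test in item 1 can be checked directly from the rounded numbers quoted in the text. The object names below are ours; this is a sketch of the calculation, not part of the original analysis code.

```r
# Rounded values reported in the text
lr.diff <- 3.6     # LR of the 5-component model minus LR of the sumscore model
df.diff <- 5 - 1   # 5 separate coefficients versus 1 common coefficient

# Upper tail of the chi-square distribution gives the p-value of the LR test
p.overall <- pchisq(lr.diff, df = df.diff, lower.tail = FALSE)
round(p.overall, 2)   # 0.46
```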
# Function that takes a sum score plus its components;
# outcome and data are specified as well
test.equal.weights <- function(data=Surgery, y="D30", sumscore="COMORBI", 
                  components=Cs(CPD, Cardio, Diabetes, Liver, Renal)) {
  # results in matrix: one row for the sumscore model, one for the full model,
  # and one per component
  matrix.coefs <- matrix(nrow=(2+length(components)), ncol=7)
  dimnames(matrix.coefs) <- list(c("sumscore", "ALL", as.character(components)),
                              c("Coef.Sumscore", "SE.Sumscore", "Coef.Component", "SE.Component", 
                                "LR", "df", "p-value"))
  # Make models:
  # 1. sumscore; stats[3:5] are model LR, df, and p-value
  fit1 <- lrm(data[,y] ~ data[,sumscore])
  matrix.coefs[1,]  <- c(fit1$coef[2], sqrt(fit1$var[2,2]), NA, NA, fit1$stats[3:5])

  # 2. full model for overall comparison
  fit.full <- lrm(data[,y] ~ as.matrix(data[,components]), x=TRUE)
  # compare model fits: LR test of full model versus sumscore model
  p.anova.comparison <- pchisq(fit.full$stats[3] - fit1$stats[3], 
                               df=fit.full$stats[4] - fit1$stats[4], lower.tail=FALSE)
  matrix.coefs[2,]  <- c(NA, NA, NA, NA, fit.full$stats[3:4], p.anova.comparison)

  # 3. fit incremental differences to sumscore, one component at a time
  for (i in seq_along(components)) {
    fiti <- update(fit1, .~.+ fit.full$x[,i])
    # compare model fits: LR test of sumscore + component versus sumscore only
    p.anova.comparison <- pchisq(fiti$stats[3] - fit1$stats[3], 
                                 df=fiti$stats[4] - fit1$stats[4], lower.tail=FALSE)
    matrix.coefs[2+i,]  <- c(fiti$coef[2], sqrt(fiti$var[2,2]), fiti$coef[3], sqrt(fiti$var[3,3]),
                             fiti$stats[3:4], p.anova.comparison)
  } # end loop
  return(matrix.coefs)
} # end function

kable(test.equal.weights(data=Surgery, y="D30", sumscore="COMORBI", 
                  components=Cs(CPD, Cardio, Diabetes, Liver, Renal)), 
                  caption="**Table 10.2**: Testing deviations for each condition in a sum score. Data from esophageal cancer patients who underwent surgery (2041 patients from SEER-Medicare data, 221 died by 30 days). The overall test for deviations from a simple sum score had a p-value of 0.46 (overall LR test, 4 *df*)")
Table 10.2: Testing deviations for each condition in a sum score. Data from esophageal cancer patients who underwent surgery (2041 patients from SEER-Medicare data, 221 died by 30 days). The overall test for deviations from a simple sum score had a p-value of 0.46 (overall LR test, 4 df)
          Coef.Sumscore  SE.Sumscore  Coef.Component  SE.Component  LR  df  p-value
sumscore           0.44         0.11              NA            NA  14   1     0.00
ALL                  NA           NA              NA            NA  18   5     0.46
CPD                0.51         0.14           -0.22          0.31  14   2     0.48
Cardio             0.49         0.16           -0.13          0.32  14   2     0.69
Diabetes           0.35         0.15            0.32          0.29  15   2     0.27
Liver              0.42         0.12            1.31          1.03  16   2     0.22
Renal              0.48         0.12           -1.09          1.11  15   2     0.26

Discussion

In the SEER data case study, we stick to our assumption of a similar effect for all comorbidities. The apparently most deviant effects, those of liver disease and renal disease, were unreliable. We hence may assume similar effects for Liver and Renal as for the other comorbidities.

An extension of the component-wise testing might be to apply a LASSO regression model where the sumscore effect is taken as an offset, with shrinkage of deviations from this offset. In the case study, the overall test was far from statistically significant, and hence we expect shrinkage of most or all of these deviations to zero. This idea is similar to updating a prediction model, with shrinkage of deviating coefficients towards the values of a prior model: y~predictor, offset=linear.predictor (Stat Med 2004).
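The offset idea can be sketched without shrinkage using base R `glm()` on simulated data. Everything in this block is an assumption for illustration: the data are simulated with equal true effects, `glm()` stands in for `lrm()`, and no LASSO penalty is applied. For the shrinkage step, the second `glm()` call would be replaced by a penalized fitting routine that supports offsets (the glmnet package has an `offset` argument).

```r
# Simulated data (illustrative only); true effects are equal across conditions
set.seed(1)
n     <- 2000
comps <- matrix(rbinom(n * 5, 1, 0.1), nrow = n,
                dimnames = list(NULL, c("CPD", "Cardio", "Diabetes", "Liver", "Renal")))
score <- rowSums(comps)
y     <- rbinom(n, 1, plogis(-2.3 + 0.45 * score))

# Prior model: one common effect for all conditions
fit.sum <- glm(y ~ score, family = binomial)
lp      <- predict(fit.sum, type = "link")   # linear predictor of the sumscore model

# Deviations from the common effect, with the sumscore model as offset
fit.dev <- glm(y ~ comps + offset(lp), family = binomial)

# Same model parameterized directly: one free coefficient per condition
fit.full <- glm(y ~ comps, family = binomial)

# Each full-model coefficient equals the common effect plus its deviation
cbind(common    = coef(fit.sum)["score"],
      deviation = coef(fit.dev)[-1],
      full      = coef(fit.full)[-1])
```

Without a penalty the offset model is only a reparameterization of the full model; shrinking the deviation column towards zero is what would pull the fit back towards the simple sumscore.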

Further discussions on robust modeling were motivated by a case study on prediction of mutations based on family history (JAMA 2006). A simple weighting of second degree relatives as half the effect of first degree relatives worked well in this case. In addition, the effect of age at diagnosis in a relative could be assumed identical for the index patient (proband) and the first and second degree relatives. This simplification saved degrees of freedom in the modeling process, at the expense of potentially missing specific patterns in the data (Stat Med 2007).
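A restricted coding of this kind amounts to building one weighted predictor instead of estimating separate coefficients. The sketch below uses made-up counts and hypothetical column names (`n_first`, `n_second`, `fh_weighted`); it illustrates the half-weight coding only, not the published models.

```r
# Hypothetical counts of affected relatives per proband (illustrative values only)
fam <- data.frame(n_first  = c(0, 1, 2, 0),
                  n_second = c(2, 0, 1, 0))

# Restricted coding: a single predictor (1 df), with second degree relatives
# given half the weight of first degree relatives
fam$fh_weighted <- fam$n_first + 0.5 * fam$n_second
fam$fh_weighted   # 1.0 1.0 2.5 0.0
```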
A robust approach is especially attractive in relatively small data sets. Indeed, a major study attempted to model family history for different cancers (colon, endometrial, other) separately for first and second degree relatives, with dichotomization of age as below or above 50 years, in 870 patients with only 38 mutations identified (NEJM 2006). Simulations confirmed that attempting such modeling was a bad idea (JCE 2018), both because of severe overfitting (38 events) and because of dichotomania. In contrast, the robust modeling strategy was applied in various versions of prediction models for mutation status (Gastroenterology 2011; JCO 2017), with satisfactory performance in a large-scale international validation study (JNCI 2015).

Literature

SEER data case study:

Methods papers:

Mutation status prediction: