10 Restrictions on Candidate Predictors

Case-study:

Mortality after surgery for esophageal cancer

Let’s consider the example of predicting 30-day mortality after surgery for esophageal cancer. We analyzed data from the SEER-Medicare database. Among 2041 patients who were over 65 years old and diagnosed between 1991 and 1996, 221 had died by 30 days (JCO 2006).

For a robust evaluation of the prognostic relevance of comorbidity, we create a simple sum score: the sum of the (partly imputed) values of 5 comorbidities. The maximum observed score is 3 comorbidities in this case.

Comorbidity variables

| Variable | Meaning |
|---|---|
| `COMORBI` | Comorbidity score based on the count of 5 comorbidities |
| `CPD` | Chronic Pulmonary Disease |
| `Cardio` | Cardiovascular disease |
| `Diabetes` | Diabetes |
| `Liver` | Liver disease |
| `Renal` | Renal disease |

We describe the data below. Note that some missing values for comorbidities were imputed with values between 0 and 1. A regression imputation model was used, with the expected value used as a single imputed value.
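The single regression imputation described above can be sketched as follows. This is a minimal illustration on simulated data, not the imputation model used for the SEER-Medicare analysis: the variable names (`age`, `cpd`), the predictor set, and the effect sizes are all assumptions made for the example.

```r
# Simulated illustration; names and effect sizes are hypothetical
set.seed(1)
n <- 500
d <- data.frame(age = runif(n, 65, 95))
d$cpd <- rbinom(n, 1, plogis(-3 + 0.02 * d$age))  # true comorbidity status (0/1)
d$cpd[sample(n, 100)] <- NA                        # set 100 values to missing

# Imputation model, fitted on the observed cases (glm drops NA rows by default)
imp.fit <- glm(cpd ~ age, family = binomial, data = d)

# Replace each missing value by its expected value: a predicted probability
miss <- is.na(d$cpd)
d$cpd.imputed <- d$cpd
d$cpd.imputed[miss] <- predict(imp.fit, newdata = d[miss, ], type = "response")

# Observed values stay 0 or 1; imputed values lie strictly between 0 and 1
summary(d$cpd.imputed[miss])
```

This explains why the comorbidity variables described here take many distinct values between 0 and 1 in addition to exact zeros and ones.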

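# Packages used in this chapter
library(Hmisc)   # describe(), html(), Cs()
library(rms)     # lrm()
library(knitr)   # kable()
# Import SEER data set, n=2041
Surgery <- read.csv("data/EsoSurgery.csv")
options(prType='html')
html(describe(Surgery), scroll=TRUE)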

Surgery Descriptives

Surgery

8 Variables 2041 Observations

D30

n	missing	distinct	Info	Sum	Mean	Gmd
2041	0	2	0.29	221	0.1083	0.1932

AGE

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
2041	0	801	1	73.5	6.723	65.83	66.42	68.58	72.42	77.25	81.91	85.17

lowest : 65.0021 65.076 65.0815 65.0842 65.1608 , highest: 94.6639 94.6667 95.0856 97.9986 101.744

COMORBI

        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
     2041        0      393    0.737   0.2983   0.4589   0.0000   0.0000   0.0000 
      .50      .75      .90      .95 
   0.0000   0.3756   1.0000   1.1968

lowest : 0 0.179426 0.182296 0.189824 0.193299 , highest: 2.14073 2.15795 2.15974 2.21896 3

CPD

        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
     2041        0      376    0.591  0.09873   0.1718  0.00000  0.00000  0.00000 
      .50      .75      .90      .95 
  0.00000  0.07714  0.16931  1.00000

 Value       0.00  0.06  0.07  0.08  0.09  0.10  0.11  0.12  0.13  0.14  0.15  0.16
 Frequency   1514     5    17    32    29    35    28    56    40    30    29    25
 Proportion 0.742 0.002 0.008 0.016 0.014 0.017 0.014 0.027 0.020 0.015 0.014 0.012
                                                           
 Value       0.17  0.18  0.19  0.20  0.21  0.22  0.23  1.00
 Frequency     18    11     7     5     2     3     2   153
 Proportion 0.009 0.005 0.003 0.002 0.001 0.001 0.001 0.075

For the frequency table, variable is rounded to the nearest 0.01

Cardio

        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
     2041        0      374    0.593  0.09677   0.1702   0.0000   0.0000   0.0000 
      .50      .75      .90      .95 
   0.0000   0.0536   0.1692   1.0000

lowest :	0	0.0425048	0.0433258	0.0447525	0.0448201
highest:	0.296846	0.299705	0.325783	0.36526	1

Diabetes

        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
     2041        0      378    0.594  0.09295   0.1641   0.0000   0.0000   0.0000 
      .50      .75      .90      .95 
   0.0000   0.0566   0.1304   1.0000

 Value       0.00  0.04  0.05  0.06  0.07  0.08  0.09  0.10  0.11  0.12  0.13  0.14
 Frequency   1511     5    33    59    54    47    35    33    24    32    19    12
 Proportion 0.740 0.002 0.016 0.029 0.026 0.023 0.017 0.016 0.012 0.016 0.009 0.006
                                                     
 Value       0.15  0.16  0.17  0.18  0.19  0.20  1.00
 Frequency      7     7     3     2     3     1   154
 Proportion 0.003 0.003 0.001 0.001 0.001 0.000 0.075

For the frequency table, variable is rounded to the nearest 0.01

Liver

        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
     2041        0      391    0.474 0.002736 0.005334 0.000000 0.000000 0.000000 
      .50      .75      .90      .95 
 0.000000 0.000000 0.001706 0.005047

lowest :	0	0.000277464	0.000284278	0.000291895	0.000293725
highest:	0.024005	0.0259095	0.0290324	0.0364262	1

Renal

        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
     2041        0      391     0.48 0.007155  0.01381 0.000000 0.000000 0.000000 
      .50      .75      .90      .95 
 0.000000 0.000000 0.007221 0.012013

lowest :	0	0.000242474	0.000406405	0.000550073	0.00060003
highest:	0.0391617	0.0409958	0.0413482	0.0746603	1

fit1 <- lrm(D30~COMORBI, data=Surgery)
print(fit1)

Logistic Regression Model

lrm(formula = D30 ~ COMORBI, data = Surgery)

| | Model Likelihood Ratio Test | Discrimination Indexes | Rank Discrim. Indexes |
|---|---|---|---|
| Obs 2041 | LR χ² 13.99 | R² 0.014 | C 0.549 |
| 0 1820 | d.f. 1 | R²_1,2041 0.006 | D_xy 0.098 |
| 1 221 | Pr(>χ²) 0.0002 | R²_1,591.2 0.022 | γ 0.162 |
| max \|∂log L/∂β\| 3×10^-9 | | Brier 0.096 | τ_a 0.019 |

| | β | S.E. | Wald Z | Pr(>\|Z\|) |
|---|---|---|---|---|
| Intercept | -2.2643 | 0.0850 | -26.65 | <0.0001 |
| COMORBI | 0.4443 | 0.1129 | 3.93 | <0.0001 |

The prognostic value of the score is modest: on its own (univariate logistic regression of D30~COMORBI), it yields a c statistic of only 0.55, where 0.5 indicates no discrimination.
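To make the c statistic concrete, the sketch below computes it from ranks (the Wilcoxon-Mann-Whitney form) for a tiny made-up example, and then recovers the value in the `lrm` output from Somers' D_xy through the identity C = 0.5 + D_xy/2. The helper `c.stat` is a hypothetical name introduced for this illustration; it is not part of `rms`.

```r
# c statistic: the proportion of (event, non-event) pairs in which the
# event has the higher prediction, with ties counted as 1/2
c.stat <- function(pred, y) {
  r  <- rank(pred)                  # midranks handle tied predictions
  n1 <- sum(y == 1)                 # number of events
  n0 <- sum(y == 0)                 # number of non-events
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy data: 3 of the 4 (event, non-event) pairs are concordant
c.example <- c.stat(pred = c(0.1, 0.4, 0.35, 0.8), y = c(0, 0, 1, 1))
c.example            # 0.75

# The reported C follows from the reported D_xy
c.from.dxy <- 0.5 + 0.098 / 2
c.from.dxy           # 0.549
```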

Testing the equal weights assumption in a simple sumscore

Simple sums of predictors assume equal weights for each predictor. This assumption can be assessed in at least two ways:

1. An overall test: is a more refined coding preferable over a simple sum?
   In the example of comorbidity, we consider the sum of 5 comorbidity conditions as a simple sumscore (Table 10.2). A model considering the 5 comorbidity conditions separately has 5 df and a Likelihood Ratio statistic of 18, in contrast to 14 with 1 df for the simple sumscore. The difference of 3.6 with 4 df has a p-value of 0.46, which provides no convincing evidence against using the simple sumscore.
2. Component-wise testing: does one of the comorbidities really deviate in prognostic value?
   We add the conditions one by one to a regression model that already contains the sumscore. The coefficient of the added condition indicates its deviation from the common effect based on the other conditions. In Table 10.2, the LR and df columns refer to each fitted model as a whole; for the rows below the sumscore row, the p-value refers to the comparison with the sumscore-only model.
   We note that the deviations from the common effect are relatively small, except for liver disease and renal disease. Renal disease even seemed to have a protective effect. Both effects were based on small numbers: the standard errors of the estimates were large, and the effects were statistically nonsignificant.
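Why the coefficient of an added condition measures a deviation from the common effect can be checked numerically. The sketch below uses simulated data with only two hypothetical conditions, `x1` and `x2`; in that special case the model with the sumscore plus one condition is an exact reparametrization of the model with separate effects. With five conditions, as in the case study, the component-wise model additionally constrains the other four conditions to a common effect.

```r
# Simulated data with two binary conditions of unequal effect (hypothetical)
set.seed(2)
n  <- 2000
x1 <- rbinom(n, 1, 0.3)
x2 <- rbinom(n, 1, 0.2)
y  <- rbinom(n, 1, plogis(-2 + 0.4 * x1 + 0.9 * x2))
S  <- x1 + x2                                    # simple sumscore

fit.sep <- glm(y ~ x1 + x2, family = binomial)   # separate effects
fit.dev <- glm(y ~ S + x2, family = binomial)    # sumscore + one condition

# Coefficient of x2 next to the sumscore ...
deviation.x2 <- unname(coef(fit.dev)["x2"])
# ... equals the difference between the two separate effects
difference.sep <- unname(coef(fit.sep)["x2"] - coef(fit.sep)["x1"])

c(deviation = deviation.x2, difference = difference.sep)
```

The sumscore coefficient in `fit.dev` equals the effect of `x1` in `fit.sep`, so a deviation of zero corresponds to equal weights.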

# Function that takes a sum score plus its components;
# the outcome and the data set are specified as well
test.equal.weights <- function(data=Surgery, y="D30", sumscore="COMORBI", 
                  components=Cs(CPD, Cardio, Diabetes, Liver, Renal)) {
  # Results are collected in a matrix with one row per fitted model
  matrix.coefs <- matrix(nrow=(2+length(components)), ncol=7)
  dimnames(matrix.coefs) <- list(c("sumscore", "ALL", as.character(components)),
                              c("Coef.Sumscore", "SE.Sumscore", "Coef.Component", "SE.Component", 
                                "LR", "df", "p-value"))
  # Make models:
  # 1. sumscore only; stats[3:5] are the model LR, its df, and its p-value
  fit1 <- lrm(data[,y] ~ data[,sumscore])
  matrix.coefs[1,]  <- c(fit1$coef[2], sqrt(fit1$var[2,2]), NA, NA, fit1$stats[3:5])

  # 2. full model with separate components, for the overall comparison
  fit.full <- lrm(data[,y] ~ as.matrix(data[,components]), x=TRUE)
  # LR test of the full model versus the sumscore model
  p.anova.comparison <- pchisq(fit.full$stats[3] - fit1$stats[3], 
                               df=fit.full$stats[4] - fit1$stats[4], lower.tail=FALSE)
  matrix.coefs[2,]  <- c(NA, NA, NA, NA, fit.full$stats[3:4], p.anova.comparison)

  # 3. add each component in turn to the sumscore model
  for (i in seq_along(components)) {
    fiti <- update(fit1, .~.+ fit.full$x[,i])
    # LR test of the added component versus the sumscore model
    p.anova.comparison <- pchisq(fiti$stats[3] - fit1$stats[3], 
                                 df=fiti$stats[4] - fit1$stats[4], lower.tail=FALSE)
    matrix.coefs[2+i,]  <- c(fiti$coef[2], sqrt(fiti$var[2,2]), fiti$coef[3], sqrt(fiti$var[3,3]),
                              fiti$stats[3:4], p.anova.comparison)
  } # end loop
  return(matrix.coefs)
} # end function

kable(test.equal.weights(data=Surgery, y="D30", sumscore="COMORBI", 
                  components=Cs(CPD, Cardio, Diabetes, Liver, Renal)), 
                  caption="**Table 10.2**: Testing deviations for each condition in a sum score. Data from esophageal cancer patients who underwent surgery (2041 patients from SEER-Medicare data, 221 died by 30 days). The overall test for deviations from a simple sum score had a p-value of 0.46 (overall LR test, 4 *df*)")

**Table 10.2**: Testing deviations for each condition in a sum score. Data from esophageal cancer patients who underwent surgery (2041 patients from SEER-Medicare data, 221 died by 30 days). The overall test for deviations from a simple sum score had a p-value of 0.46 (overall LR test, 4 df)

| | Coef.Sumscore | SE.Sumscore | Coef.Component | SE.Component | LR | df | p-value |
|---|---|---|---|---|---|---|---|
| sumscore | 0.44 | 0.11 | NA | NA | 14 | 1 | 0.00 |
| ALL | NA | NA | NA | NA | 18 | 5 | 0.46 |
| CPD | 0.51 | 0.14 | -0.22 | 0.31 | 14 | 2 | 0.48 |
| Cardio | 0.49 | 0.16 | -0.13 | 0.32 | 14 | 2 | 0.69 |
| Diabetes | 0.35 | 0.15 | 0.32 | 0.29 | 15 | 2 | 0.27 |
| Liver | 0.42 | 0.12 | 1.31 | 1.03 | 16 | 2 | 0.22 |
| Renal | 0.48 | 0.12 | -1.09 | 1.11 | 15 | 2 | 0.26 |
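The overall p-value in Table 10.2 can be reproduced from the reported numbers. The short check below uses the LR difference of 3.6 quoted in the text and the difference in degrees of freedom between the full model (5 df) and the sumscore model (1 df); the object names are introduced here for illustration only.

```r
# Overall LR test: full model (5 df) versus simple sumscore (1 df)
lr.diff   <- 3.6        # difference in LR statistics, as reported in the text
df.diff   <- 5 - 1      # difference in degrees of freedom
p.overall <- pchisq(lr.diff, df = df.diff, lower.tail = FALSE)
round(p.overall, 2)     # 0.46
```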

Discussion

In the SEER data case study, we stick to our assumption of a similar effect for all comorbidities. The apparently most deviant effects, those of liver disease and renal disease, were unreliable. We may hence assume similar effects for Liver and Renal as for the other comorbidities.

An extension of the component-wise testing might be to apply a LASSO regression model where the sumscore effect is taken as an offset, with shrinkage of deviations from this offset. In the case study, the overall test was far from statistically significant, and hence we expect shrinkage of most or all of these deviations to zero. This idea is similar to updating a prediction model, with shrinkage of deviating coefficients towards the values of a prior model: `y ~ predictor, offset = linear.predictor` (Stat Med 2004).
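The offset idea can be sketched in base R. The example below is an assumption-laden illustration on simulated data with three hypothetical conditions (`A`, `B`, `C`) and truly equal weights; it fits the deviations without any shrinkage, so it shows only the unpenalized starting point. A LASSO version would penalize these deviation coefficients, for example by passing the same linear predictor as the `offset` argument of `glmnet`; that step is not run here.

```r
# Simulated data: three binary conditions with equal true weights (hypothetical)
set.seed(3)
n    <- 2000
comp <- matrix(rbinom(n * 3, 1, 0.15), ncol = 3,
               dimnames = list(NULL, c("A", "B", "C")))
d       <- data.frame(comp)
d$score <- rowSums(comp)
d$y     <- rbinom(n, 1, plogis(-2 + 0.45 * d$score))

# Step 1: sumscore model; its linear predictor becomes the offset
fit.score <- glm(y ~ score, family = binomial, data = d)
d$lp      <- predict(fit.score, type = "link")

# Step 2: estimate deviations from the sumscore effect, holding the offset fixed
fit.dev <- glm(y ~ A + B + C + offset(lp), family = binomial, data = d)
round(coef(fit.dev), 2)   # deviations; shrinkage would pull these towards 0
```

Because the offset already carries the common effect, each coefficient in `fit.dev` is a deviation from that common effect, which is the quantity a penalty would shrink.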

Further discussions on robust modeling were motivated by a case study on prediction of mutations based on family history (JAMA 2006). A simple weighting of second degree relatives as half the effect of first degree relatives worked well in this case. In addition, the effect of age at diagnosis in a relative could be assumed identical for the index patient (proband) and the first and second degree relatives. This simplification saved degrees of freedom in the modeling process at the expense of potentially missing specific patterns in the data (Stat Med 2007).

A robust approach is especially attractive in relatively small data sets. Indeed, a major study attempted to model family history for different cancers (colon, endometrial, other) separately for first and second degree relatives, with dichotomization of age as below or above 50 years, in 870 patients with only 38 mutations identified (NEJM 2006). Simulations confirmed that attempting such modeling was a bad idea (JCE 2018), both because of severe overfitting (38 events) and dichotomania. In contrast, the robust modeling strategy was applied in various versions of prediction models for mutation status (Gastroenterology 2011; JCO 2017), with satisfactory performance in a large-scale international validation study (JNCI 2015).
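The half-weight coding of second degree relatives mentioned above amounts to a one-line data reduction. The sketch below uses made-up counts and hypothetical column names (`n.first`, `n.second`, `fh.score`); it is not the coding used in the published models.

```r
# Hypothetical counts of affected relatives for four probands
fh <- data.frame(n.first  = c(0, 1, 2, 0),
                 n.second = c(0, 0, 1, 2))

# Robust coding: a second degree relative counts as half a first degree relative
fh$fh.score <- fh$n.first + 0.5 * fh$n.second
fh
```

Entering `fh.score` as a single predictor spends one degree of freedom, where separate terms for first and second degree relatives would spend two.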

Literature

SEER data case study:

Steyerberg EW, Neville BA, Koppert LB, Lemmens VE, Tilanus HW, Coebergh JW, Weeks JC, Earle CC. Surgical mortality in patients with esophageal cancer: development and validation of a simple risk score. J Clin Oncol. 2006 Sep 10;24(26):4277-84. doi: 10.1200/JCO.2005.05.0658. PMID: 16963730

Methods papers:

Steyerberg EW, Borsboom GJ, van Houwelingen HC, Eijkemans MJ, Habbema JD. Validation and updating of predictive logistic regression models: a study on sample size and shrinkage. Stat Med. 2004 Aug 30;23(16):2567-86. doi: 10.1002/sim.1844. PMID: 15287085
Steyerberg EW, Uno H, Ioannidis JPA, van Calster B; Collaborators. Poor performance of clinical prediction models: the harm of commonly applied methods. J Clin Epidemiol. 2018 Jun;98:133-143. doi: 10.1016/j.jclinepi.2017.11.013. Epub 2017 Nov 24. PMID: 29174118
Steyerberg EW, Balmaña J, Stockwell DH, Syngal S. Data reduction for prediction: a case study on robust coding of age and family history for the risk of having a genetic mutation. Stat Med. 2007 Dec 30;26(30):5545-56. doi: 10.1002/sim.3119. PMID: 17948867

Mutation status prediction:

Balmaña J, Stockwell DH, Steyerberg EW, …, Burbidge LA, Syngal S. Prediction of MLH1 and MSH2 mutations in Lynch syndrome. JAMA. 2006 Sep 27;296(12):1469-78. doi: 10.1001/jama.296.12.1469. PMID: 17003395
Barnetson RA, Tenesa A, Farrington SM, Nicholl ID, Cetnarskyj R, Porteous ME, Campbell H, Dunlop MG. Identification and survival of carriers of mutations in DNA mismatch-repair genes in colon cancer. N Engl J Med. 2006 Jun 29;354(26):2751-63. doi: 10.1056/NEJMoa053493
Kastrinos F, Uno H, Ukaegbu C, …, Steyerberg EW, Syngal S. Development and Validation of the PREMM₅ Model for Comprehensive Risk Assessment of Lynch Syndrome. J Clin Oncol. 2017 Jul 1;35(19):2165-2172. doi: 10.1200/JCO.2016.69.6120. Epub 2017 May 10. PMID: 28489507
Kastrinos F, Steyerberg EW, Mercado R, …, Wenstrup RJ, Syngal S. The PREMM(1,2,6) model predicts risk of MLH1, MSH2, and MSH6 germline mutations based on cancer history. Gastroenterology. 2011 Jan;140(1):73-81. doi: 10.1053/j.gastro.2010.08.021. Epub 2010 Aug 19. PMID: 20727894
Kastrinos F, Ojha RP, Leenen C, …, Syngal S, Steyerberg EW; Lynch Syndrome prediction model validation study group. Comparison of Prediction Models for Lynch Syndrome Among Individuals With Colorectal Cancer. J Natl Cancer Inst. 2015 Nov 18;108(2):djv308. doi: 10.1093/jnci/djv308. Print 2016 Feb. PMID: 26582061