Lab 7: Kaplan–Meier estimate of the survival functions and Cox regression model

1 Reminder about data frames

1.1

Download hm.csv, store it in a variable df, and display the first rows of the dataset.

  ID time age drug status    entdate    enddate
1  1    5  46    0      1 15/05/2015 14/10/2015
2  2    6  35    1      0 19/09/2014 20/03/2015
3  3    8  30    1      1 21/04/2016 20/12/2016
4  4    3  30    1      1 03/01/2016 04/04/2016
5  5   22  36    0      1 18/09/2014 19/07/2016
6  6    1  32    1      0 18/03/2016 17/04/2016

This data frame contains information about patients participating in a clinical trial.

time: duration of follow-up (in months)
age: age of the participant
drug: treatment indicator (0 = no drug, 1 = drug)
status: event indicator:
- 1 = event of interest observed
- 0 = censored observation

In this dataset, the event of interest is death related to the condition under study.

If a patient dies from an unrelated cause, or leaves the study early, the observation is censored: we only know that the patient survived at least up to that time.

1.2

Find the mean age of the participants.

[1] 36.07

1.3

Find the number of observations (rows) in the dataset.

[1] 100

2 Kaplan–Meier estimate

We use the package survival (you may need to install it, see Lab 2 on how to install a package).

library(survival)

To work with survival data, we first create a survival object using Surv():

sf <- Surv(df$time, df$status)
sf

  [1]  5   6+  8   3  22   1+  7   9   3  12   2+ 12   1  15  34   1   4  19+
 [19]  3+  2   2+  6  60+ 11   2+  5   4+  1  13   3   2   1  30   7   4   8 
 [37]  5  10   2   9  36   3   9   3  35   8  11  56+  2+  3  15   1  10   1 
 [55]  7   3   3   2  32   3  10+ 11   3   7   5  31   5  58   1   3+ 43   1 
 [73]  6  53  14   4  54   1   1   8   5   1   1   2   7+  1+ 10  24+  7  12+
 [91]  4  57   1  12+  7   1   5  60+  2+  1

Censored observations are marked with a +. This means that the event was not observed at that time.

We now compute the Kaplan–Meier estimator:

km <- survfit(sf ~ 1)
km

Call: survfit(formula = sf ~ 1)

       n events median 0.95LCL 0.95UCL
[1,] 100     80      7       5      10

The formula ~ 1 means that we estimate a single survival curve for the whole sample, without using any covariates.

Plotting the estimator:

plot(km)

The solid line is the Kaplan–Meier estimate \(\hat{S}(t)\)
The dashed lines represent a \(95\)% confidence interval

Roughly speaking, if we repeated the study many times, the true survival probabilities would lie within these bounds in about \(95\)% of cases.
To change the confidence level, we the paramter conf.int=value of function plot where value is a decimal between \(0\) and \(1\).

Higher confidence level → wider interval.
Numerical values:

summary(km)

Call: survfit(formula = sf ~ 1)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1    100      15   0.8500  0.0357       0.7828        0.923
    2     83       5   0.7988  0.0402       0.7237        0.882
    3     73      10   0.6894  0.0473       0.6026        0.789
    4     61       4   0.6442  0.0493       0.5544        0.748
    5     56       7   0.5636  0.0517       0.4709        0.675
    6     49       2   0.5406  0.0521       0.4476        0.653
    7     46       6   0.4701  0.0526       0.3775        0.586
    8     39       4   0.4219  0.0525       0.3306        0.538
    9     35       3   0.3857  0.0520       0.2962        0.502
   10     32       3   0.3496  0.0511       0.2625        0.466
   11     28       3   0.3121  0.0500       0.2280        0.427
   12     25       2   0.2872  0.0490       0.2055        0.401
   13     21       1   0.2735  0.0486       0.1931        0.387
   14     20       1   0.2598  0.0480       0.1809        0.373
   15     19       2   0.2325  0.0467       0.1568        0.345
   22     16       1   0.2179  0.0460       0.1441        0.330
   30     14       1   0.2024  0.0453       0.1305        0.314
   31     13       1   0.1868  0.0444       0.1173        0.298
   32     12       1   0.1712  0.0433       0.1043        0.281
   34     11       1   0.1557  0.0421       0.0916        0.264
   35     10       1   0.1401  0.0407       0.0793        0.247
   36      9       1   0.1245  0.0390       0.0674        0.230
   43      8       1   0.1090  0.0371       0.0559        0.212
   53      7       1   0.0934  0.0349       0.0449        0.194
   54      6       1   0.0778  0.0324       0.0344        0.176
   57      4       1   0.0584  0.0296       0.0216        0.157
   58      3       1   0.0389  0.0253       0.0109        0.139

* `time`: event times

* `n.risk`: individuals at risk just before time

* `n.event`: number of events at that time

* `surv`: estimated survival probability

Recall that the Kaplan–Meier estimator is

\[ \hat{S}(t) = \prod_{t_j \le t} \left(1 - \frac{d_j}{n_j}\right). \]

Extracting survival probabilities

Instead of reading values manually, we can query the model:

summary(km, times = c(13, 42))$surv

[1] 0.2734790 0.1245306

2.1

Compute the ratio \[ \frac{\hat{S}(17)}{\hat{S}(51)}. \]

[1] 2.133333

2.2

Find the estimated probability of the death in the next \(2\) years.

[1] 0.7820714

Median survival time

km

Call: survfit(formula = sf ~ 1)

       n events median 0.95LCL 0.95UCL
[1,] 100     80      7       5      10

The output includes the median survival time, i.e. the time when \(\hat{S}(t) = 0.5\).

3 Kaplan–Meier with covariates

We now compare groups:

kmdrug <- survfit(sf ~ df$drug)
kmdrug

Call: survfit(formula = sf ~ df$drug)

           n events median 0.95LCL 0.95UCL
df$drug=0 51     42     11       8      30
df$drug=1 49     38      5       3       7

plot(kmdrug, col = c("blue", "red"))
legend("topright", legend = c("No drug", "Drug"),
       col = c("blue", "red"), lty = c(1,1))

3.1

Modify the plot so that:

“No drug” is dashed seagreen
“Drug” is solid orange

Embedded data. The dataframe lung is embedded into survival package. You can get information about this data frame by typing in the console ?lung.

kmlung <- survfit(Surv(time, status == 2) ~ sex, data = lung)
kmlung

Call: survfit(formula = Surv(time, status == 2) ~ sex, data = lung)

        n events median 0.95LCL 0.95UCL
sex=1 138    112    270     212     310
sex=2  90     53    426     348     550

Here we explicitly define the event as status == 2 (death), which is clearer.

plot(kmlung, col = c("blue", "red"))
legend("topright", legend = c("Male", "Female"),
       col = c("blue", "red"), lty = c(1,1))

Surely, there are numerous R packages which can make plots more nice. For example, one can install and load survminer package:

library(survminer)
ggsurvplot(kmlung, conf.int = 0.95)

4 Cox regression model

The Cox model assumes:

\[ \lambda(t, z) = \lambda_0(t) e^{\beta z}. \]

It assumes proportional hazards, meaning hazard ratios do not depend on time.

One covariate case

fit <- coxph(Surv(time, status == 2) ~ sex, data = lung)
summary(fit)

Call:
coxph(formula = Surv(time, status == 2) ~ sex, data = lung)

  n= 228, number of events= 165 

       coef exp(coef) se(coef)      z Pr(>|z|)   
sex -0.5310    0.5880   0.1672 -3.176  0.00149 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    exp(coef) exp(-coef) lower .95 upper .95
sex     0.588      1.701    0.4237     0.816

Concordance= 0.579  (se = 0.021 )
Likelihood ratio test= 10.63  on 1 df,   p=0.001
Wald test            = 10.09  on 1 df,   p=0.001
Score (logrank) test = 10.33  on 1 df,   p=0.001

coef: estimate of \(\beta\)
exp(coef): hazard ratio

Interpretation:

If exp(coef) = 0.588, then females have about 0.588 times the hazard of males (i.e. about 41% lower hazard).

Multiple covariates case

fit2 <- coxph(Surv(time, status == 2) ~ sex + age, data = lung)
summary(fit2)

Call:
coxph(formula = Surv(time, status == 2) ~ sex + age, data = lung)

  n= 228, number of events= 165 

         coef exp(coef)  se(coef)      z Pr(>|z|)   
sex -0.513219  0.598566  0.167458 -3.065  0.00218 **
age  0.017045  1.017191  0.009223  1.848  0.06459 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    exp(coef) exp(-coef) lower .95 upper .95
sex    0.5986     1.6707    0.4311    0.8311
age    1.0172     0.9831    0.9990    1.0357

Concordance= 0.603  (se = 0.025 )
Likelihood ratio test= 14.12  on 2 df,   p=9e-04
Wald test            = 13.47  on 2 df,   p=0.001
Score (logrank) test = 13.72  on 2 df,   p=0.001

For age:

b <- coef(fit2)
exp(b["age"])

     age 
1.017191

This gives the multiplicative increase per year.

4.1

Compute the hazard ratio for:

a male aged 70
a female aged 80

     sex 
1.408838

Likelihood ratio test

A natural question is: when we used both age and gender covariates, we get different answers comparing with using eitehr of covariates alone. Does adding new covariates improve the model? More formally, let \(L_p\) be the likelihood function with \(p\in\mathbb{N}\) covariates and let \(L_{p+q}\) be the likelihood function with \(p+q\in\mathbb{N}\) covariates. How to estimate if \(q\) additional covariates were significant to add?

For this, one can use the statistical test. The test statistic is

\[ \lambda:= 2(l_{p+q}-l_p)=2(\ln L_{p+q}-\ln L_{p})=2\ln\frac{L_{p+q}}{L_p}>0, \]

where all functions are calculated at the points of their maximums. Note \(L_{p+q}>L_p\) because \(L_{p+q}\) is calculated at maximum over all possible \(p+q\) values, while \(L_p\) may be understood as the maximum of the function of \(p+q\) variables over the first \(p\) variables with fixed last \(q\) variables equal to \(0\):

\[ \beta_1 \cdot z_1+ \ldots +\beta_p \cdot z_p + 0\cdot z_{p+1} + \ldots + 0\cdot z_{p+q} =\beta_1 \cdot z_1+ \ldots +\beta_p \cdot z_p. \]

The \(H_0\) hypotehsis is then that \(\beta_{p+1}+\ldots+\beta_{p+q}=0\). The test is \(\chi^2\) test with \(q\) degrees of freedom:

fit1 <- coxph(Surv(time, status == 2) ~ sex, data = lung)
fit2 <- coxph(Surv(time, status == 2) ~ sex + age, data = lung)

anova(fit1, fit2, test = "LRT")

Analysis of Deviance Table
 Cox model: response is  Surv(time, status == 2)
 Model 1: ~ sex
 Model 2: ~ sex + age
   loglik  Chisq Df Pr(>|Chi|)  
1 -744.59                       
2 -742.85 3.4895  1    0.06176 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The number \(0.06176\) is the \(p\)-value. Hence, if we need the (standard) \(5\)% significance level, we do not regect the null hypothesis (as \(0.06176>0.05\)), so there is some evidence that age affects the hazard (after accounting for sex factor), but it is not strong enough at the \(5\)% level.

4.2

Check if sex affects the hazard here, at the \(5\)% level.

Check you answer: the \(p\)-value is \(0.001669<0.05\), hence, we reject the null hypothesis that sex does not affect the hazard, i.e. after accounting for age, sex does significantly improve the model.