Creating Count Process Data Set With TimeVarying Covariates in R
I am working through Chapter 15 of Applied Longitudinal DataAnalysis by Singer and Willett, on Extending the Cox Regression model, but the UCLA website here has no example R code for this chapter. I am trying to recreate the section on timevarying covariates and am stuck on how to create a count process dataset from the personlevel dataframe provided. I have looked through the survival
package vignette but am having trouble creating my own dataset. Here is toy data modelled on the Singer and Willett example of predictors of time to first cocaine use in men between 17 and 40 years of age. id
is ID, ageInit
is age of either first cocaine use or censoring, used
is the event log, indicating first cocaine use (1
) or censoring (0
). There is one time invariant predictor earlyMJ
, indicating marijuana use prior to 17 years of age. There are also three timevarying covariates that need to inform the creation of the dataset: timeMJ
, which is age of first marijuana use after 17, sellMJ
which is age of selling marijuana, and odFirst
, which is age of first using some other drug. NA
s in these predictors indicate the participant did not perform the action in question at any time.
set.seed(1356)
df < data.frame(id = 1:6,
ageInit = c(25,34,40,29,27,40),
used = c(0,1,1,0,1,1),
earlyMJ = c(0,0,1,0,1,1),
timeMJ = c(18,27,22,21,22,19),
sellMJ = c(NA,NA,25,NA,35,NA),
odFirst = c(19,22,35,NA,22,34))
Following Therneau et al.'s process from the survival
vignette we create a second dataset, in this case reexpressing the outcome variable ageInit
, and the timevarying predictors as number of years from study initiation (i.e. 17 years old).
tdata < with(df, data.frame (id = id,
usedTime = ageInit  17,
timeToFirstMJ = timeMJ  17,
timeToSellMJ = sellMJ  17,
timeToFirstOD = odFirst  17,
usedCocaine = used))
The we merge this new data with the original dataset, creating, via the event()
call, two new columns expressing time intervals for each of the covariates. We also create, via the tdc()
call, a new row for each participant for each time they experienced one of the covariate events.
sdata < tmerge(df, tdata,
id=id,
firstUse = event(futime, usedCocaine),
t1MJ = tdc(timeToFirstMJ),
t1SMJ = tdc(timeToSellMJ),
t1OD = tdc(timeToFirstOD),
options= list(idname="subject"))
attr(sdata, "tcount")
The issue is that when I run the Cox regression, using both the timeinvariant, and a few of the timevarying covariates, I cannot make the model work.
coxph(Surv(tstart, tstop, firstUse) ~ earlyMJ + t1SMJ + t1OD, data= sdata, ties="breslow")
and get nonsensical coefficient and the warning message
In fitter(X, Y, strats, offset, init, control, weights = weights, :
Loglik converged before variable 1,3 ; beta may be infinite.
Furthermore I do not even know if this is a proper count process dataframe, because in Singer and Willett they suggest that each participant needs to have a row for every time anyone in the dataset experiences the event.
Any guidance in these matters would be much appreciated.
1 answer

A good start is the Using Time Dependent Covariates and Time Dependent Coefficients in the Cox Model vignette in the
survival
package.The issue is that when I run the Cox regression, using both the timeinvariant, and a few of the timevarying covariates, I cannot make the model work.
What you do seems correct at first glance but it might just be the fact that you only have 6 persons and three parameters to estimate?
get nonsensical coefficient and the warning message
The warning makes good sense. The standard error on two of your parameter estimates is greater than 1000. See below.
Furthermore do not even know if this is a proper count process dataframe, because in Singer and Willett they suggest that each participant needs to have a row for every time anyone in the dataset experiences the event.
This is handled internally in
coxph
.
Here is my code to reproduce your results
##### # setup data df < data.frame(id = 1:6, ageInit = c(25,34,40,29,27,40), used = c( 0, 1, 1, 0, 1, 1), earlyMJ = c( 0, 0, 1, 0, 1, 1), timeMJ = c(18,27,22,21,22,19), sellMJ = c(NA,NA,25,NA,35,NA), odFirst = c(19,22,35,NA,22,34)) shift_cols < c("ageInit", "timeMJ", "sellMJ", "odFirst") df[shift_cols] < lapply(df[shift_cols], "", 17) library(survival) est_df < df[, c("id", "ageInit", "earlyMJ", "used")] est_df < tmerge( est_df, est_df, id = id, start_using = event(ageInit, used)) est_df < tmerge( est_df, df, id = id, t1MJ = tdc(timeMJ), t1SMJ = tdc(sellMJ), t1OD = tdc(odFirst)) ##### # fit model fit < coxph( Surv(tstart, tstop, start_using) ~ earlyMJ + t1SMJ + t1OD, data = est_df) #R> Warning message: #R> In fitter(X, Y, strats, offset, init, control, weights = weights, : #R> Loglik converged before variable 1,3 ; beta may be infinite. summary(fit) #R> Call: #R> coxph(formula = Surv(tstart, tstop, start_using) ~ earlyMJ + #R> t1SMJ + t1OD, data = est_df) #R> #R> n= 17, number of events= 4 #R> #R> coef exp(coef) se(coef) z Pr(>z) #R> earlyMJ 2.047e+01 7.792e+08 2.791e+04 0.001 0.999 #R> t1SMJ 4.744e16 1.000e+00 1.414e+00 0.000 1.000 #R> t1OD 4.157e+01 1.136e+18 3.883e+04 0.001 0.999 #R> #R> exp(coef) exp(coef) lower .95 upper .95 #R> earlyMJ 7.792e+08 1.283e09 0.00000 Inf #R> t1SMJ 1.000e+00 1.000e+00 0.06255 15.99 #R> t1OD 1.136e+18 8.806e19 0.00000 Inf #R> #R> Concordance= 1 (se = 0.258 ) #R> Rsquare= 0.273 (max possible= 0.33 ) #R> Likelihood ratio test= 5.42 on 3 df, p=0.1437 #R> Wald test = 0 on 3 df, p=1 #R> Score (logrank) test = 4.14 on 3 df, p=0.2463