Cox regression with huge data sets.


Göran Broström
Department of Statistics, University of Umeå
S-90187 Umeå, Sweden

Abstract

In demographic and epidemiologic research today, huge data sets are very common, for several reasons. Most obvious is the fact that large databases and population registers are being built all over the world, giving researchers easy access to huge numbers of individual life histories in many different contexts. Second, the information on each individual tends to be more and more detailed. Third, statistical methods, like Cox regression, have been developed to handle complicated data structures such as time-varying covariates, truncation and censoring. Their drawback is that computing time tends to grow faster than linearly with sample size.

A common technique for dealing with time-varying covariates is to cut a spell into two pieces at the duration where a change in the covariate value occurs. The first piece is right censored with the first value of the covariate, while the second piece is left truncated with the new value. Thus, with many time-varying covariates, even moderately sized data sets may grow to be huge, with large numbers of left-truncated and right-censored spells, causing the same kind of problems as originally huge data sets. Typically, a data set may contain five to ten times more spells after this splitting procedure has taken place.
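
As an illustration, the sketch below shows what this splitting step might look like in code. It is a minimal sketch only; the Spell record and the split_spell function are hypothetical names chosen for the example and are not taken from any particular package.

    # A minimal sketch of the spell-splitting step described above.
    # The Spell record and split_spell are hypothetical names for this example.
    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Spell:
        enter: float      # entry time (left truncation point)
        exit: float       # exit time (end of the spell)
        event: bool       # True if the spell ends with the terminal event
        x: float          # current value of the time-varying covariate

    def split_spell(spell: Spell, changes: List[Tuple[float, float]]) -> List[Spell]:
        """Cut a spell at every time point where the covariate changes value.

        `changes` is a list of (time, new_value) pairs with enter < time < exit.
        Every piece except the last is right censored at the change point, and
        every piece except the first is left truncated at the same point.
        """
        pieces = []
        start, value = spell.enter, spell.x
        for time, new_value in sorted(changes):
            pieces.append(Spell(start, time, event=False, x=value))
            start, value = time, new_value
        pieces.append(Spell(start, spell.exit, event=spell.event, x=value))
        return pieces

    # One spell observed on (0, 10], ending in the event, covariate changing at 4 and 7:
    print(split_spell(Spell(0.0, 10.0, True, 1.0), [(4.0, 2.0), (7.0, 0.0)]))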

Some techniques for dealing with huge data sets will be discussed and illustrated. Sampling (and resampling) is an obvious way to reduce (and increase again!) the computational burden, for instance sampling of risk sets ([1], [3], [4]). Simple random sampling of individual life histories may be less satisfactory if the terminal event is rare; in that case a matching approach may be fruitful ([2]).
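
The following sketch illustrates sampling within risk sets, in the spirit of the nested case-control designs of [1] and [4]: at each observed event, the case is kept together with a simple random sample of controls from the risk set. The function name and the (enter, exit, event) data layout are assumptions made only for this example.

    # A minimal sketch of sampling within risk sets (nested case-control);
    # names and data layout are assumptions for this example only.
    import random

    def sample_risk_sets(spells, n_controls=5, seed=1):
        """For every observed event, keep the case plus a simple random sample
        of `n_controls` subjects from its risk set.

        `spells` is a list of (enter, exit, event) tuples; a subject is at risk
        at time t whenever enter < t <= exit.  Returns a list of
        (event_time, case_index, control_indices) triples.
        """
        rng = random.Random(seed)
        sampled = []
        for i, (enter_i, exit_i, event_i) in enumerate(spells):
            if not event_i:
                continue
            t = exit_i
            risk_set = [j for j, (enter, exit, _) in enumerate(spells)
                        if enter < t <= exit and j != i]
            controls = rng.sample(risk_set, min(n_controls, len(risk_set)))
            sampled.append((t, i, controls))
        return sampled

    spells = [(0.0, 5.0, True), (0.0, 8.0, False), (2.0, 6.0, True)]
    print(sample_risk_sets(spells, n_controls=1))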

The technique of viewing the development of the cohort as a marked point process (with two types of marks: an indication of new entry, terminal event or censoring, plus covariate information) is computationally very efficient when there are no external time-dependent covariates and censoring is not too heavy. However, with time-varying internal and external covariates, a static approach based on risk sets is faster, provided that enough computer memory is available.
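
The sketch below illustrates the sequential, point-process style of computation for the Cox partial log-likelihood with left-truncated and right-censored spells: entries and exits are merged into one time-ordered stream of marks, and the sum of exp(beta * x) over the current risk set is updated incrementally, so no risk set is ever stored explicitly. It assumes a single fixed covariate per spell and Breslow-style handling of tied event times; all names are illustrative, not a definitive implementation.

    # A minimal sketch of the sequential (point-process) evaluation of the Cox
    # partial log-likelihood; one fixed covariate per spell, Breslow ties.
    import math
    from itertools import groupby

    def cox_loglik_sequential(spells, beta):
        """`spells` is a list of (enter, exit, event, x) tuples;
        a subject is at risk at time t whenever enter < t <= exit."""
        marks = []
        for enter, exit, event, x in spells:
            marks.append((enter, "enter", False, x))
            marks.append((exit, "exit", bool(event), x))
        marks.sort(key=lambda m: m[0])

        risk_sum, loglik = 0.0, 0.0
        for t, group in groupby(marks, key=lambda m: m[0]):
            group = list(group)
            # Events at time t use the risk set as it stands just before t.
            for _, kind, is_event, x in group:
                if kind == "exit" and is_event:
                    loglik += beta * x - math.log(risk_sum)
            # Then update the risk set: remove exits, add new entries.
            for _, kind, _, x in group:
                if kind == "exit":
                    risk_sum -= math.exp(beta * x)
                else:
                    risk_sum += math.exp(beta * x)
        return loglik

    spells = [(0.0, 5.0, True, 1.0), (0.0, 8.0, False, 0.0), (2.0, 6.0, True, 0.5)]
    print(cox_loglik_sequential(spells, beta=0.3))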

References

[1] Borgan, Ø. and Langholz, B. (1993). Non-parametric estimation of relative mortality from nested case-control studies. Biometrics 49, 593-602.

[2] Broström, G. (1987). The influence of mother's mortality on infant mortality: A case study in matched data survival analysis. Scandinavian Journal of Statistics 14, 113-123.

[3] Cox, D.R. and Oakes, D. (1984). Analysis of Survival Data. Chapman and Hall, London.

[4] Liddell, F.D.K., McDonald, J.C. and Thomas, D.C. (1977). Methods of cohort analysis: Appraisal by application to asbestos mining. Journal of the Royal Statistical Society A 140, 469-491.