Survival Analysis
The main function for building a Cox-regression-ready dataset is build_survival_dataset(). It handles prevalent/incident classification, follow-up time calculation, and supports separate source definitions for baseline exclusion and outcome ascertainment. It also supports participant-flow reporting (show_flow) and temporary data.table thread tuning (dt_threads) for large datasets.
Quick start
library(UKBAnalytica)
library(data.table)
ukb_data <- fread("population.csv")
diseases <- get_predefined_diseases()[
  c("AA", "Hypertension", "Diabetes")
]
analysis_dt <- build_survival_dataset(
  dt = ukb_data,
  disease_definitions = diseases,
  primary_disease = "AA",
  censor_date = as.Date("2023-10-31"),
  show_flow = TRUE,
  dt_threads = 8
)
Output columns
The wide-format output contains two indicator columns for each disease, plus two outcome columns for the primary disease:
| Column | Description |
|---|---|
| {Disease}_history | 1 if prevalent (diagnosed at or before baseline) |
| {Disease}_incident | 1 if incident (diagnosed after baseline) |
| outcome_status | Primary disease event indicator (1 = event, 0 = censored, NA = prevalent) |
| outcome_surv_time | Follow-up time in years (NA for prevalent cases) |
Prevalent cases of the primary disease receive NA for both outcome_status and outcome_surv_time because they are not at risk for incident disease. In a Cox model you should exclude or handle them explicitly.
Participant flow reporting
When show_flow = TRUE and output = "wide", the function prints a step-by-step participant attrition table to the console and stores it as an attribute of the returned object:
flow_dt <- attr(analysis_dt, "participant_flow")
if (!is.null(flow_dt)) {
  print(flow_dt)
}
Case classification logic
The function classifies each participant as follows:
- Prevalent case – earliest diagnosis date (from prevalent_sources) is on or before the baseline date. The participant already had the disease at enrollment.
- Incident case – earliest diagnosis date (from outcome_sources) is after the baseline date. This is a new event during follow-up.
- Censored – no diagnosis by the end of follow-up. Follow-up ends at death or the administrative censor date, whichever comes first.
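The three rules above can be sketched on toy data in base R. This only illustrates the logic and is not the package's internal code: classify_case() and the column names are hypothetical, and treating a diagnosis dated after the end of follow-up as censored is an assumption.

```r
# Hypothetical helper: label each participant as prevalent, incident, or censored.
# prev_dx = earliest date from the prevalent sources, out_dx = earliest date
# from the outcome sources (both may be NA when no record exists).
classify_case <- function(baseline, prev_dx, out_dx, end_followup) {
  is_prevalent <- !is.na(prev_dx) & prev_dx <= baseline
  # Assumption: a diagnosis after the end of follow-up does not count as an event
  is_incident <- !is_prevalent & !is.na(out_dx) &
    out_dx > baseline & out_dx <= end_followup
  ifelse(is_prevalent, "prevalent",
         ifelse(is_incident, "incident", "censored"))
}

toy <- data.frame(
  baseline     = as.Date(c("2008-05-01", "2008-05-01", "2008-05-01")),
  prev_dx      = as.Date(c("2005-03-10", NA, NA)),
  out_dx       = as.Date(c("2005-03-10", "2015-07-20", NA)),
  end_followup = as.Date(c("2023-10-31", "2023-10-31", "2019-02-14"))
)
toy$case_type <- classify_case(
  toy$baseline, toy$prev_dx, toy$out_dx, toy$end_followup
)
print(toy)
```

Checking the prevalent rule first means a participant with a pre-baseline diagnosis is never counted as an incident case, even if a later record also exists.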
Recommended workflow for users
For most prospective UKB analyses, use build_survival_dataset() as the only disease-phenotyping entry point. The user-facing workflow is:
- Choose diseases with get_predefined_diseases() or create one custom definition with create_disease_definition().
- Choose primary_disease, which controls outcome_status and outcome_surv_time.
- Use prevalent_sources for baseline exclusion/history covariates.
- Use outcome_sources for incident endpoint ascertainment.
- After the function returns, keep participants with non-missing outcome_status and valid non-negative outcome_surv_time for Cox models.
Example for an arrhythmia endpoint ascertained from hospital (ICD-10/ICD-9) and death registry records:
rhythm_defs <- get_predefined_diseases()[
  c("Arrhythmia", "Atrial_Fibrillation", "Ventricular_Arrhythmia")
]
analysis_dt <- build_survival_dataset(
  dt = ukb_data,
  disease_definitions = rhythm_defs,
  primary_disease = "Arrhythmia",
  baseline_col = "p53_i0",
  censor_date = as.Date("2023-10-31"),
  prevalent_sources = c("ICD10", "ICD9", "Self-report"),
  outcome_sources = c("ICD10", "ICD9", "Death"),
  show_flow = TRUE,
  dt_threads = 8
)
cox_dt <- analysis_dt[
  !is.na(outcome_status) &
    !is.na(outcome_surv_time) &
    outcome_surv_time >= 0
]
This keeps the full cohort in analysis_dt for auditing and covariate merges, while cox_dt is the at-risk analysis subset. Prevalent primary-disease cases are encoded as outcome_status = NA and outcome_surv_time = NA because they were not at risk for an incident event.
Follow-up time
Follow-up time for the primary disease is calculated as:
- Prevalent case: NA (not at risk).
- Incident case: (diagnosis_date - baseline_date) / 365.25.
- Censored: (min(death_date, censor_date) - baseline_date) / 365.25.
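The formulas above can be checked by hand with a small base R sketch. followup_years() is a hypothetical helper written for this illustration, not a function exported by the package, and it assumes one diagnosis date per participant.

```r
# Hypothetical helper: follow-up time in years under the three rules above.
followup_years <- function(baseline, dx_date, death_date, censor_date) {
  # End of follow-up: death or administrative censoring, whichever comes first
  end_date <- pmin(death_date, censor_date, na.rm = TRUE)
  prevalent <- !is.na(dx_date) & dx_date <= baseline
  incident  <- !is.na(dx_date) & dx_date > baseline & dx_date <= end_date
  # Incident cases stop at diagnosis; everyone else stops at end of follow-up
  stop_date <- end_date
  stop_date[incident] <- dx_date[incident]
  years <- as.numeric(stop_date - baseline) / 365.25
  years[prevalent] <- NA_real_  # prevalent cases are not at risk
  years
}

baseline   <- as.Date(rep("2010-01-01", 4))
dx_date    <- as.Date(c("2005-06-01", "2015-01-01", NA, NA))
death_date <- as.Date(c(NA, NA, "2012-01-01", NA))
censor     <- as.Date("2020-01-01")

surv_time <- followup_years(baseline, dx_date, death_date, censor)
print(round(surv_time, 3))
```

Dividing by 365.25 averages over leap years, so a five-calendar-year interval comes out marginally below 5 rather than exactly 5.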
Dual-source design
The prevalent_sources and outcome_sources arguments let you use different data sources for baseline exclusion versus outcome ascertainment. This is recommended because:
| Aspect | Prevalent (history) | Outcome (incident) |
|---|---|---|
| Purpose | Exclude baseline cases / define history | Define endpoint |
| Self-report | Include (captures pre-existing) | Exclude (imprecise dates) |
| Date precision | Less critical | Critical for survival time |
| Death records | Usually not required for history | Useful for fatal events and censoring |
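To see why the two source lists matter, the base R sketch below takes the earliest date per participant from each source list separately. The record table, its column names, and earliest_by_source() are hypothetical and only illustrate the idea; the package's internal representation may differ.

```r
# Toy long-format diagnosis records (hypothetical layout)
records <- data.frame(
  eid    = c(1, 1, 2, 2, 3),
  source = c("Self-report", "ICD10", "ICD10", "Death", "Death"),
  date   = as.Date(c("2004-01-01", "2011-06-15", "2016-09-01",
                     "2018-02-02", "2019-05-05"))
)
ids      <- c(1, 2, 3)
baseline <- as.Date(rep("2009-01-01", 3))

# Hypothetical helper: earliest date per participant, restricted to some sources
earliest_by_source <- function(records, sources, ids) {
  keep <- records[records$source %in% sources, ]
  by_id <- split(as.numeric(keep$date), keep$eid)
  first_day <- vapply(by_id, min, numeric(1))
  # Participants without a matching record get NA
  as.Date(unname(first_day[as.character(ids)]), origin = "1970-01-01")
}

prev_dx <- earliest_by_source(records, c("ICD10", "ICD9", "Self-report"), ids)
out_dx  <- earliest_by_source(records, c("ICD10", "ICD9", "Death"), ids)

history  <- !is.na(prev_dx) & prev_dx <= baseline
incident <- !history & !is.na(out_dx) & out_dx > baseline
print(data.frame(eid = ids, prev_dx, out_dx, history, incident))
```

Participant 1 has a self-reported diagnosis before baseline, so they are flagged as history even though their first hospital record falls after baseline. With hospital records alone they would have been counted as an incident case.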
"Algorithm" can be added to either source list when your disease definition includes algo_date_field (and optionally algo_source_field). The parser accepts both UKB column naming styles: p{field}_i0 and p{field}.
"CancerRegistry" can be added for malignant neoplasm endpoints when the extracted dataset contains cancer registry fields (p40006_i*, p40005_i*, and optionally p40011_i* / p40012_i*). For example:
lung_defs <- get_predefined_diseases()["Lung_Cancer"]
lung_dt <- build_survival_dataset(
  dt = ukb_data,
  disease_definitions = lung_defs,
  prevalent_sources = c("CancerRegistry", "ICD10"),
  outcome_sources = c("CancerRegistry", "ICD10", "Death"),
  primary_disease = "Lung_Cancer"
)
"FirstOccurrence" can also be added when your dataset contains UKB Category 1712 fields. Predefined diseases include common First Occurrence field IDs, so most users only need to add the source name:
analysis_dt <- build_survival_dataset(
  dt = ukb_data,
  disease_definitions = get_predefined_diseases()[c("MI", "Stroke", "Diabetes")],
  prevalent_sources = c("ICD10", "ICD9", "Self-report", "FirstOccurrence"),
  outcome_sources = c("ICD10", "ICD9", "Death", "FirstOccurrence"),
  primary_disease = "MI"
)
First Occurrence dates with UKB special date coding 819 are treated as missing and excluded before prevalent/incident classification.
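A minimal sketch of that masking step in base R follows. The list of placeholder dates is an assumption based on the commonly cited values for data-coding 819 and should be verified against the UKB Showcase; mask_special_dates() is a hypothetical helper, not part of the package.

```r
# Assumed placeholder dates under UKB data-coding 819 (verify before relying on them)
special_dates <- as.Date(c(
  "1900-01-01", "1901-01-01", "1902-02-02", "1903-03-03", "2037-07-07"
))

# Hypothetical helper: replace placeholder dates with NA, keep real dates
mask_special_dates <- function(x) {
  x[x %in% special_dates] <- NA
  x
}

first_occurrence <- as.Date(c("1999-04-12", "1900-01-01", "2037-07-07", NA))
cleaned <- mask_special_dates(first_occurrence)
print(cleaned)
```

Masking before classification matters because a placeholder such as 1900-01-01 would otherwise look like a diagnosis long before baseline and wrongly mark the participant as prevalent.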
Default settings reflect this design:
# prevalent_sources includes Self-report
# outcome_sources excludes Self-report
analysis_dt <- build_survival_dataset(
  dt = ukb_data,
  disease_definitions = diseases,
  prevalent_sources = c("ICD10", "ICD9", "Self-report"),
  outcome_sources = c("ICD10", "ICD9", "Death"),
  primary_disease = "AA",
  show_flow = TRUE
)
Performance tuning (large data)
For very large cohorts, you can limit or increase the threads used by data.table during this function call:
analysis_dt <- build_survival_dataset(
  dt = ukb_data,
  disease_definitions = diseases,
  primary_disease = "AA",
  dt_threads = 8
)
dt_threads only applies inside this call. The previous data.table thread setting is restored automatically when the function exits.
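A common way to get this set-then-restore behaviour in R is on.exit() combined with data.table's getDTthreads() and setDTthreads(). The sketch below shows that pattern with a hypothetical wrapper, with_dt_threads(); whether build_survival_dataset() is implemented this way is an assumption.

```r
library(data.table)

# Hypothetical wrapper: run an expression with a temporary thread count
with_dt_threads <- function(threads, expr) {
  old <- getDTthreads()
  # Restore the previous setting even if expr fails
  on.exit(setDTthreads(old), add = TRUE)
  setDTthreads(threads)
  force(expr)  # expr is evaluated lazily, so it runs under the new setting
}

before <- getDTthreads()
inside <- with_dt_threads(1, getDTthreads())
after  <- getDTthreads()
print(c(before = before, inside = inside, after = after))
```

Because the restore is registered with on.exit(), the caller's thread setting survives errors and early returns inside the wrapped call.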
Date parsing in this pipeline is robust to malformed values: non-standard dates are converted to NA with warnings instead of stopping the full run.
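The behaviour described here can be approximated in base R as follows. parse_dates_safely() is a hypothetical helper and the ISO date format is an assumption, so the package's actual parser may accept more formats or word its warnings differently.

```r
# Hypothetical helper: parse dates, turning malformed values into NA with a warning
parse_dates_safely <- function(x, format = "%Y-%m-%d") {
  parsed <- as.Date(x, format = format)
  # Malformed = non-empty input that failed to parse
  bad <- !is.na(x) & nzchar(x) & is.na(parsed)
  if (any(bad)) {
    warning(sum(bad), " malformed date value(s) converted to NA", call. = FALSE)
  }
  parsed
}

raw <- c("2015-03-01", "31/12/2020", "not a date", NA, "")
parsed <- withCallingHandlers(
  parse_dates_safely(raw),
  warning = function(w) {
    # Report the warning and continue instead of stopping
    message("Warning caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(parsed)
```

Missing and empty inputs are not counted as malformed here, so the warning count reflects only values that were present but unparseable.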
Running a Cox model
After building the survival dataset, exclude prevalent primary-disease cases and fit a Cox model:
library(survival)
# Exclude prevalent AA cases (they have NA outcome)
cox_data <- analysis_dt[!is.na(outcome_status)]
cox_model <- coxph(
  Surv(outcome_surv_time, outcome_status) ~
    Hypertension_history + Diabetes_history,
  data = cox_data
)
summary(cox_model)
Sensitivity analyses
You can run sensitivity analyses by varying the source definitions:
# Sensitivity 1: hospital records only (strictest)
strict_dt <- build_survival_dataset(
  ukb_data, diseases,
  prevalent_sources = c("ICD10", "ICD9"),
  outcome_sources = c("ICD10", "ICD9"),
  primary_disease = "AA"
)
# Sensitivity 2: broad baseline history and fatal events
broad_dt <- build_survival_dataset(
  ukb_data, diseases,
  prevalent_sources = c("ICD10", "ICD9", "Self-report"),
  outcome_sources = c("ICD10", "ICD9", "Death"),
  primary_disease = "AA"
)
Long format output
If you need a case-level (long) table instead of the wide cohort table, set output = "long":
long_dt <- build_survival_dataset(
  ukb_data, diseases,
  primary_disease = "AA",
  output = "long",
  include_all = TRUE
)
head(long_dt)
With output = "long" and include_all = TRUE, the function returns one row per participant per disease, including non-cases with status = 0. With include_all = FALSE, it returns only extracted case records from the selected outcome sources.