Survival Analysis

The main function for building a Cox-regression-ready dataset is build_survival_dataset(). It handles prevalent/incident classification, follow-up time calculation, and supports separate source definitions for baseline exclusion and outcome ascertainment. It also supports participant-flow reporting (show_flow) and temporary data.table thread tuning (dt_threads) for large datasets.

Quick start

library(UKBAnalytica)
library(data.table)

ukb_data <- fread("population.csv")

diseases <- get_predefined_diseases()[
  c("AA", "Hypertension", "Diabetes")
]

analysis_dt <- build_survival_dataset(
  dt = ukb_data,
  disease_definitions = diseases,
  primary_disease = "AA",
  censor_date = as.Date("2023-10-31"),
  show_flow = TRUE,
  dt_threads = 8
)

Output columns

The wide-format output contains history and incident indicators for each disease, plus two outcome columns for the primary disease:

Column	Description
`{Disease}_history`	1 if prevalent (diagnosed at or before baseline)
`{Disease}_incident`	1 if incident (diagnosed after baseline)
`outcome_status`	Primary disease event indicator (1 = event, 0 = censored, NA = prevalent)
`outcome_surv_time`	Follow-up time in years (NA for prevalent cases)

Prevalent cases of the primary disease receive NA for both outcome_status and outcome_surv_time because they are not at risk for incident disease. In a Cox model you should exclude or handle them explicitly.

Participant flow reporting

When show_flow = TRUE and output = "wide", the function prints a step-by-step participant attrition table to the console and attaches it to the returned object as the participant_flow attribute:

flow_dt <- attr(analysis_dt, "participant_flow")
if (!is.null(flow_dt)) {
  print(flow_dt)
}

Case classification logic

The function classifies each participant as follows:

Prevalent case – earliest diagnosis date (from prevalent_sources) is on or before the baseline date. The participant already had the disease at enrollment.
Incident case – earliest diagnosis date (from outcome_sources) is after the baseline date. This is a new event during follow-up.
Censored – no diagnosis by the end of follow-up. Follow-up ends at death or the administrative censor date, whichever comes first.
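The three rules above can be sketched on toy data. This is an illustration only, not the package's internal code: the column names (baseline, prev_date, out_date) and the helper classify_case() are hypothetical.

```r
# Toy cohort, one row per participant (all column names are hypothetical).
toy <- data.frame(
  eid       = 1:4,
  baseline  = as.Date(c("2008-03-01", "2009-06-15", "2010-01-10", "2007-11-20")),
  # Earliest diagnosis date found in the prevalent sources (NA = none found)
  prev_date = as.Date(c("2005-02-01", NA, NA, "2007-11-20")),
  # Earliest diagnosis date found in the outcome sources (NA = none found)
  out_date  = as.Date(c("2005-02-01", "2015-09-30", NA, "2007-11-20"))
)

classify_case <- function(baseline, prev_date, out_date) {
  # Prevalent: diagnosed on or before the baseline date
  prevalent <- !is.na(prev_date) & prev_date <= baseline
  # Incident: not prevalent, and diagnosed strictly after baseline
  incident <- !prevalent & !is.na(out_date) & out_date > baseline
  # Everyone else is treated as censored in this sketch; the package may
  # handle edge cases (e.g. an outcome-source date before baseline) differently
  ifelse(prevalent, "prevalent", ifelse(incident, "incident", "censored"))
}

toy$case_type <- classify_case(toy$baseline, toy$prev_date, toy$out_date)
print(toy[, c("eid", "case_type")])
```

Participant 4 is diagnosed exactly on the baseline date and therefore counts as prevalent, which matches the "on or before" rule.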

Recommended workflow for users

For most prospective UKB analyses, use build_survival_dataset() as the only disease-phenotyping entry point. The user-facing workflow is:

Choose diseases with get_predefined_diseases() or create one custom definition with create_disease_definition().
Choose primary_disease, which controls outcome_status and outcome_surv_time.
Use prevalent_sources for baseline exclusion/history covariates.
Use outcome_sources for incident endpoint ascertainment.
After the function returns, keep participants with non-missing outcome_status and valid non-negative outcome_surv_time for Cox models.

Example for an arrhythmia endpoint using ICD-10, ICD-9, and death registry records for outcome ascertainment:

rhythm_defs <- get_predefined_diseases()[
  c("Arrhythmia", "Atrial_Fibrillation", "Ventricular_Arrhythmia")
]

analysis_dt <- build_survival_dataset(
  dt = ukb_data,
  disease_definitions = rhythm_defs,
  primary_disease = "Arrhythmia",
  baseline_col = "p53_i0",
  censor_date = as.Date("2023-10-31"),
  prevalent_sources = c("ICD10", "ICD9", "Self-report"),
  outcome_sources = c("ICD10", "ICD9", "Death"),
  show_flow = TRUE,
  dt_threads = 8
)

cox_dt <- analysis_dt[
  !is.na(outcome_status) &
    !is.na(outcome_surv_time) &
    outcome_surv_time >= 0
]

This keeps the full cohort in analysis_dt for auditing and covariate merges, while cox_dt is the at-risk analysis subset. Prevalent primary-disease cases are encoded as outcome_status = NA and outcome_surv_time = NA because they were not at risk for an incident event.

Follow-up time

Follow-up time for the primary disease is calculated as:

Prevalent case: NA (not at risk).
Incident case: (diagnosis_date - baseline_date) / 365.25.
Censored: (min(death_date, censor_date) - baseline_date) / 365.25.
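As a worked example of these formulas, the sketch below computes follow-up time for one incident case and two censored participants. The data frame and its column names are hypothetical and only mirror the definitions above.

```r
# Hypothetical toy data: one incident case, one death, one alive at censoring.
fu <- data.frame(
  baseline   = as.Date(c("2010-01-10", "2010-01-10", "2010-01-10")),
  dx_date    = as.Date(c("2015-09-30", NA, NA)),
  death_date = as.Date(c(NA, "2020-05-01", NA))
)
censor_date <- as.Date("2023-10-31")

# For participants without an event, follow-up ends at the earlier of
# death and the administrative censor date (a missing death date is ignored).
end_censored <- pmin(fu$death_date, censor_date, na.rm = TRUE)

# Incident cases end follow-up at the diagnosis date.
fu$end_date <- fu$dx_date
no_event <- is.na(fu$dx_date)
fu$end_date[no_event] <- end_censored[no_event]

# Convert days to years using 365.25 days per year.
fu$surv_years <- as.numeric(fu$end_date - fu$baseline) / 365.25
print(fu)
```

The first participant contributes 2089 days (about 5.72 years); the second is censored at death and the third at censor_date.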

Dual-source design

The prevalent_sources and outcome_sources arguments let you use different data sources for baseline exclusion and for outcome ascertainment. The table below summarizes why the two roles call for different sources:

Aspect	Prevalent (history)	Outcome (incident)
Purpose	Exclude baseline cases / define history	Define endpoint
Self-report	Include (captures pre-existing)	Exclude (imprecise dates)
Date precision	Less critical	Critical for survival time
Death records	Usually not required for history	Useful for fatal events and censoring
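To see why the two source lists can give different dates for the same participant, consider the toy sketch below. It is not the package's extraction code: the records table, its columns, and earliest_by_sources() are hypothetical.

```r
# Hypothetical long-format diagnosis records for a single disease.
records <- data.frame(
  eid    = c(1, 1, 2, 3, 3),
  source = c("Self-report", "ICD10", "ICD10", "Self-report", "Death"),
  date   = as.Date(c("2004-07-01", "2012-03-15", "2016-08-02",
                     "2006-01-01", "2019-11-30"))
)

prevalent_sources <- c("ICD10", "ICD9", "Self-report")
outcome_sources   <- c("ICD10", "ICD9", "Death")

# Earliest record per participant, restricted to the given sources.
earliest_by_sources <- function(records, sources) {
  keep <- records[records$source %in% sources, c("eid", "date")]
  keep <- keep[order(keep$eid, keep$date), ]
  keep[!duplicated(keep$eid), ]
}

prev_dates <- earliest_by_sources(records, prevalent_sources)
out_dates  <- earliest_by_sources(records, outcome_sources)
print(prev_dates)
print(out_dates)
```

Participant 1 has a self-reported onset in 2004 and a first hospital record in 2012. With self-report in prevalent_sources only, the 2004 date can flag baseline history while never being used as an event time.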

"Algorithm" can be added to either source list when your disease definition includes algo_date_field (and optionally algo_source_field). The parser accepts both UKB column naming styles: p{field}_i0 and p{field}.

"CancerRegistry" can be added for malignant neoplasm endpoints when the extracted dataset contains cancer registry fields (p40006_i*, p40005_i*, and optionally p40011_i* / p40012_i*). For example:

lung_defs <- get_predefined_diseases()["Lung_Cancer"]

lung_dt <- build_survival_dataset(
  dt = ukb_data,
  disease_definitions = lung_defs,
  prevalent_sources = c("CancerRegistry", "ICD10"),
  outcome_sources = c("CancerRegistry", "ICD10", "Death"),
  primary_disease = "Lung_Cancer"
)

"FirstOccurrence" can also be added when your dataset contains UKB Category 1712 fields. Predefined diseases include common First Occurrence field IDs, so most users only need to add the source name:

analysis_dt <- build_survival_dataset(
  dt = ukb_data,
  disease_definitions = get_predefined_diseases()[c("MI", "Stroke", "Diabetes")],
  prevalent_sources = c("ICD10", "ICD9", "Self-report", "FirstOccurrence"),
  outcome_sources = c("ICD10", "ICD9", "Death", "FirstOccurrence"),
  primary_disease = "MI"
)

First Occurrence dates that carry a special value under UKB date coding 819 (placeholder dates rather than real event dates) are treated as missing and excluded before prevalent/incident classification.
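The masking step can be pictured with the sketch below. The list of special values is an assumption made for illustration and may be incomplete, so check the coding 819 entry in the UKB Showcase for your data release. mask_special_dates() is a hypothetical helper, not a package function.

```r
# Assumed placeholder dates (illustrative; verify against UKB coding 819).
special_dates <- as.Date(c("1900-01-01", "1901-01-01", "1902-02-02",
                           "1903-03-03", "2037-07-07"))

# Replace placeholder dates with NA so they cannot be classified as
# prevalent or incident events.
mask_special_dates <- function(x) {
  x[x %in% special_dates] <- NA
  x
}

fo_dates <- as.Date(c("2011-04-12", "1900-01-01", "2037-07-07", NA))
masked <- mask_special_dates(fo_dates)
print(masked)
```

Only the first value survives as a usable diagnosis date; the two placeholders join the existing NA.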

The default source settings reflect the dual-source design:

# prevalent_sources includes Self-report
# outcome_sources excludes Self-report
analysis_dt <- build_survival_dataset(
  dt = ukb_data,
  disease_definitions = diseases,
  prevalent_sources = c("ICD10", "ICD9",
                        "Self-report"),
  outcome_sources = c("ICD10", "ICD9", "Death"),
  primary_disease = "AA",
  show_flow = TRUE
)

Performance tuning (large data)

For very large cohorts, you can limit or increase the threads used by data.table during this function call:

analysis_dt <- build_survival_dataset(
  dt = ukb_data,
  disease_definitions = diseases,
  primary_disease = "AA",
  dt_threads = 8
)

dt_threads only applies inside this call. The previous data.table thread setting is restored automatically when the function exits.
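The save-and-restore behaviour can be reproduced with data.table's own getDTthreads() and setDTthreads(). The wrapper below is a hypothetical sketch of the pattern, not the package's implementation.

```r
library(data.table)

# Hypothetical helper: run a function with a temporary thread count,
# then restore whatever setting was active before.
with_dt_threads <- function(n_threads, fun) {
  old <- getDTthreads()
  setDTthreads(n_threads)
  # on.exit() restores the previous setting even if fun() fails
  on.exit(setDTthreads(old), add = TRUE)
  fun()
}

before <- getDTthreads()
inside <- with_dt_threads(1, function() getDTthreads())
after  <- getDTthreads()

print(c(before = before, inside = inside, after = after))
```

Using on.exit() rather than a manual reset at the end of the function means the caller's thread setting survives errors inside the call.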

Date parsing in this pipeline is robust to malformed values: non-standard dates are converted to NA with warnings instead of stopping the full run.
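A lenient parser of this kind might look like the sketch below. It assumes ISO-formatted input ("%Y-%m-%d") and is not the package's actual parser; parse_date_lenient() is a hypothetical name.

```r
# Parse dates without stopping on malformed values.
# With an explicit format, as.Date() returns NA for strings it cannot parse.
parse_date_lenient <- function(x, format = "%Y-%m-%d") {
  parsed <- as.Date(x, format = format)
  # Count values that were present but failed to parse
  bad <- !is.na(x) & x != "" & is.na(parsed)
  if (any(bad)) {
    warning(sum(bad), " value(s) could not be parsed as dates and were set to NA")
  }
  parsed
}

# A valid date, a wrong format, an empty string, a missing value,
# and an impossible calendar date.
raw <- c("2012-03-15", "15/03/2012", "", NA, "2012-02-30")
parsed <- suppressWarnings(parse_date_lenient(raw))
print(parsed)
```

Empty strings and existing NA values are not counted as failures, so the warning reports only values that looked like data but could not be read.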

Running a Cox model

After building the survival dataset with primary_disease = "AA" (as in the quick start), exclude prevalent primary-disease cases and fit a Cox model:

library(survival)

# Exclude prevalent AA cases (NA outcome) and rows without valid follow-up time
cox_data <- analysis_dt[
  !is.na(outcome_status) &
    !is.na(outcome_surv_time) &
    outcome_surv_time >= 0
]

cox_model <- coxph(
  Surv(outcome_surv_time, outcome_status) ~
    Hypertension_history + Diabetes_history,
  data = cox_data
)
summary(cox_model)

Sensitivity analyses

You can run sensitivity analyses by varying the source definitions:

# Sensitivity 1: hospital records only (strictest)
strict_dt <- build_survival_dataset(
  ukb_data, diseases,
  prevalent_sources = c("ICD10", "ICD9"),
  outcome_sources   = c("ICD10", "ICD9"),
  primary_disease   = "AA"
)

# Sensitivity 2: broad baseline history and fatal events
broad_dt <- build_survival_dataset(
  ukb_data, diseases,
  prevalent_sources = c("ICD10", "ICD9",
                        "Self-report"),
  outcome_sources   = c("ICD10", "ICD9",
                        "Death"),
  primary_disease   = "AA"
)
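Before refitting models, it can help to compare how many participants, events, and person-years each source definition yields. The sketch below uses toy stand-ins for strict_dt and broad_dt; summarise_outcome() is a hypothetical helper that relies only on the outcome_status and outcome_surv_time columns described earlier.

```r
# Summarise the at-risk subset of one survival dataset.
summarise_outcome <- function(dt) {
  at_risk <- dt[!is.na(dt$outcome_status) &
                  !is.na(dt$outcome_surv_time) &
                  dt$outcome_surv_time >= 0, ]
  data.frame(
    n_at_risk    = nrow(at_risk),
    n_events     = sum(at_risk$outcome_status == 1),
    person_years = sum(at_risk$outcome_surv_time)
  )
}

# Toy stand-ins (made-up values) for the two sensitivity datasets.
strict_toy <- data.frame(outcome_status    = c(1, 0, 0, NA),
                         outcome_surv_time = c(2.5, 10, 8, NA))
broad_toy  <- data.frame(outcome_status    = c(1, 1, 0, NA),
                         outcome_surv_time = c(2.5, 6, 8, NA))

comparison <- do.call(rbind, lapply(
  list(strict = strict_toy, broad = broad_toy),
  summarise_outcome
))
print(comparison)
```

In real use, pass list(strict = strict_dt, broad = broad_dt) instead of the toy data frames. Large shifts in event counts between definitions are worth inspecting before interpreting differences in hazard ratios.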

Long format output

If you need a case-level (long) table instead of the wide cohort table, set output = "long":

long_dt <- build_survival_dataset(
  ukb_data, diseases,
  primary_disease = "AA",
  output = "long",
  include_all = TRUE
)
head(long_dt)

With output = "long" and include_all = TRUE, the function returns one row per participant per disease, including non-cases with status = 0. With include_all = FALSE, it returns only extracted case records from the selected outcome sources.