Survival Analysis

The main function for building a Cox-regression-ready dataset is build_survival_dataset(). It handles prevalent/incident classification and follow-up time calculation, and it supports separate source definitions for baseline exclusion and outcome ascertainment. It also supports participant-flow reporting (show_flow) and temporary data.table thread tuning (dt_threads) for large datasets.

Quick start

library(UKBAnalytica)
library(data.table)

ukb_data <- fread("population.csv")

diseases <- get_predefined_diseases()[
  c("AA", "Hypertension", "Diabetes")
]

analysis_dt <- build_survival_dataset(
  dt = ukb_data,
  disease_definitions = diseases,
  primary_disease = "AA",
  censor_date = as.Date("2023-10-31"),
  show_flow = TRUE,
  dt_threads = 8
)

Output columns

The wide-format output contains two indicator columns for each disease, plus two outcome columns for the primary disease:

Column               Description
{Disease}_history    1 if prevalent (diagnosed at or before baseline)
{Disease}_incident   1 if incident (diagnosed after baseline)
outcome_status       Primary disease event indicator (1 = event, 0 = censored, NA = prevalent)
outcome_surv_time    Follow-up time in years (NA for prevalent cases)

Prevalent cases of the primary disease receive NA for both outcome_status and outcome_surv_time because they are not at risk for incident disease. In a Cox model you should exclude or handle them explicitly.

Participant flow reporting

When show_flow = TRUE and output = "wide", the function prints a step-by-step participant attrition table to the console and attaches it to the returned object as the participant_flow attribute:

flow_dt <- attr(analysis_dt, "participant_flow")
if (!is.null(flow_dt)) {
  print(flow_dt)
}

Case classification logic

The function classifies each participant as follows:

  1. Prevalent case – earliest diagnosis date (from prevalent_sources) is on or before the baseline date. The participant already had the disease at enrollment.
  2. Incident case – earliest diagnosis date (from outcome_sources) is after the baseline date. This is a new event during follow-up.
  3. Censored – no diagnosis by the end of follow-up. Follow-up ends at death or the administrative censor date, whichever comes first.
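The three rules above can be sketched in a few lines of base R. This is only an illustration, not the package's internal code: classify_case() and its arguments are hypothetical names, and giving the prevalent rule precedence when both rules match is an assumption.

```r
# Hypothetical helper (not part of UKBAnalytica): classify one disease per
# participant from the earliest diagnosis date in each source set.
# NA means no record was found in that source set.
classify_case <- function(baseline, first_prev_dx, first_outcome_dx) {
  status <- rep("censored", length(baseline))

  # A diagnosis after baseline in the outcome sources is an incident event.
  incident <- !is.na(first_outcome_dx) & first_outcome_dx > baseline
  status[incident] <- "incident"

  # Assumption: a diagnosis on or before baseline in the prevalent sources
  # takes precedence over any later outcome record.
  prevalent <- !is.na(first_prev_dx) & first_prev_dx <= baseline
  status[prevalent] <- "prevalent"

  status
}

baseline         <- as.Date(c("2008-03-01", "2008-03-01", "2008-03-01"))
first_prev_dx    <- as.Date(c("2005-06-15", NA, NA))
first_outcome_dx <- as.Date(c(NA, "2015-09-30", NA))

classify_case(baseline, first_prev_dx, first_outcome_dx)
# "prevalent" "incident" "censored"
```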

Follow-up time

Follow-up time for the primary disease is calculated as:

  • Prevalent case: NA (not at risk).
  • Incident case: (diagnosis_date - baseline_date) / 365.25.
  • Censored: (min(death_date, censor_date) - baseline_date) / 365.25.
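As a rough sketch of these rules in base R, assuming a status vector with the values "prevalent", "incident", and "censored" (follow_up_years() and its argument names are hypothetical, not the package's API):

```r
# Hypothetical helper (not part of UKBAnalytica) mirroring the rules above.
follow_up_years <- function(status, baseline, dx_date, death_date, censor_date) {
  # Censored participants: follow-up ends at death or the administrative
  # censor date, whichever comes first (a missing death date is ignored).
  end_date <- pmin(death_date, censor_date, na.rm = TRUE)

  # Incident cases: follow-up ends at the diagnosis date.
  is_incident <- status == "incident"
  end_date[is_incident] <- dx_date[is_incident]

  years <- as.numeric(difftime(end_date, baseline, units = "days")) / 365.25

  # Prevalent cases are not at risk, so they get no follow-up time.
  years[status == "prevalent"] <- NA_real_
  years
}

status     <- c("prevalent", "incident", "censored", "censored")
baseline   <- as.Date(rep("2008-03-01", 4))
dx_date    <- as.Date(c("2005-06-15", "2015-09-30", NA, NA))
death_date <- as.Date(c(NA, NA, "2019-01-10", NA))

years <- follow_up_years(status, baseline, dx_date, death_date,
                         censor_date = as.Date("2023-10-31"))
round(years, 2)
```

The third participant is censored at death and the fourth at the administrative censor date.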

Dual-source design

The prevalent_sources and outcome_sources arguments let you use different data sources for baseline exclusion versus outcome ascertainment. This is recommended because the two purposes place different demands on the data:

Aspect           Prevalent (history)                        Outcome (incident)
Purpose          Exclude baseline cases / define history    Define endpoint
Self-report      Include (captures pre-existing)            Exclude (imprecise dates)
Date precision   Less critical                              Critical for survival time
Death records    Usually not required for history           Useful for fatal events and censoring
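To see why the source sets matter, consider a toy example in base R. The records table and the earliest_date() helper are hypothetical and only illustrate the idea; they do not reflect the package's internal data layout.

```r
# Toy long-format diagnosis records (hypothetical layout).
records <- data.frame(
  id     = c(1, 1, 2, 2),
  source = c("Self-report", "ICD10", "ICD10", "Death"),
  date   = as.Date(c("2004-01-01", "2012-05-20", "2016-08-11", "2018-02-03"))
)

# Hypothetical helper: earliest record per participant within a source set.
earliest_date <- function(records, sources) {
  kept <- records[records$source %in% sources, ]
  kept <- kept[order(kept$id, kept$date), ]
  kept[!duplicated(kept$id), c("id", "date")]
}

prev <- earliest_date(records, c("ICD10", "ICD9", "Self-report"))
out  <- earliest_date(records, c("ICD10", "ICD9", "Death"))

prev  # participant 1: 2004-01-01 (self-report)
out   # participant 1: 2012-05-20 (first hospital record)
```

With a baseline of 2008-03-01, participant 1 is prevalent according to the self-report. If self-report were left out of the prevalent sources, the same participant would be counted as an incident case in 2012.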

"Algorithm" can be added to either source list when your disease definition includes algo_date_field (and optionally algo_source_field). The parser accepts both UKB column naming styles: p{field}_i0 and p{field}.

"CancerRegistry" can be added for malignant neoplasm endpoints when the extracted dataset contains cancer registry fields (p40006_i*, p40005_i*, and optionally p40011_i* / p40012_i*). For example:

lung_defs <- get_predefined_diseases()["Lung_Cancer"]

lung_dt <- build_survival_dataset(
  dt = ukb_data,
  disease_definitions = lung_defs,
  prevalent_sources = c("CancerRegistry", "ICD10"),
  outcome_sources = c("CancerRegistry", "ICD10", "Death"),
  primary_disease = "Lung_Cancer"
)

"FirstOccurrence" can also be added when your dataset contains UKB Category 1712 fields. Predefined diseases include common First Occurrence field IDs, so most users only need to add the source name:

analysis_dt <- build_survival_dataset(
  dt = ukb_data,
  disease_definitions = get_predefined_diseases()[c("MI", "Stroke", "Diabetes")],
  prevalent_sources = c("ICD10", "ICD9", "Self-report", "FirstOccurrence"),
  outcome_sources = c("ICD10", "ICD9", "Death", "FirstOccurrence"),
  primary_disease = "MI"
)

First Occurrence dates with UKB special date coding 819 are treated as missing and excluded before prevalent/incident classification.
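A minimal sketch of this masking step is shown below. The list of placeholder dates is an assumption based on the published UKB data coding 819 and should be checked against the UK Biobank Showcase. mask_special_dates() is a hypothetical name, not a package function.

```r
# Assumed placeholder dates for UKB data coding 819 (verify before use).
special_dates_819 <- as.Date(c(
  "1900-01-01", "1901-01-01", "1902-02-02", "1903-03-03", "2037-07-07"
))

# Hypothetical helper: set placeholder dates to NA so they cannot be
# mistaken for real diagnosis dates during classification.
mask_special_dates <- function(x) {
  x[x %in% special_dates_819] <- NA
  x
}

fo_dates <- as.Date(c("1900-01-01", "2011-04-12", "2037-07-07"))
mask_special_dates(fo_dates)
# NA "2011-04-12" NA
```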

The default settings reflect the dual-source design:

# prevalent_sources includes Self-report
# outcome_sources excludes Self-report
analysis_dt <- build_survival_dataset(
  dt = ukb_data,
  disease_definitions = diseases,
  prevalent_sources = c("ICD10", "ICD9",
                        "Self-report"),
  outcome_sources = c("ICD10", "ICD9", "Death"),
  primary_disease = "AA",
  show_flow = TRUE
)

Performance tuning (large data)

For very large cohorts, you can lower or raise the number of threads data.table uses during this function call:

analysis_dt <- build_survival_dataset(
  dt = ukb_data,
  disease_definitions = diseases,
  primary_disease = "AA",
  dt_threads = 8
)

dt_threads only applies inside this call. The previous data.table thread setting is restored automatically when the function exits.
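One common way to get this behaviour is to pair data.table::setDTthreads() with on.exit(). The sketch below shows that pattern under the assumption that the function works along these lines. with_dt_threads() is a hypothetical wrapper, not part of the package.

```r
library(data.table)

# Hypothetical wrapper: run an expression with a temporary thread count and
# restore the previous setting afterwards, even if the expression fails.
with_dt_threads <- function(threads, expr) {
  old <- getDTthreads()
  on.exit(setDTthreads(old), add = TRUE)
  setDTthreads(threads)
  force(expr)  # evaluated here, while the temporary setting is active
}

before <- getDTthreads()
inside <- with_dt_threads(1, getDTthreads())
after  <- getDTthreads()

c(before = before, inside = inside, after = after)
```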

Date parsing in this pipeline is robust to malformed values: non-standard dates are converted to NA with warnings instead of stopping the full run.
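The idea can be illustrated with a small base R helper. parse_dates_safely() is a hypothetical name, and the single ISO date format is an assumption. The package's own parser may accept more formats.

```r
# Hypothetical helper: parse dates, turn unparseable values into NA, and
# warn once instead of stopping.
parse_dates_safely <- function(x, format = "%Y-%m-%d") {
  x <- as.character(x)
  # With an explicit format, as.Date() returns NA for values it cannot parse.
  parsed <- as.Date(x, format = format)

  # Count values that were present but could not be parsed.
  bad <- is.na(parsed) & !is.na(x) & nzchar(x)
  if (any(bad)) {
    warning(sum(bad), " value(s) could not be parsed as dates and were set to NA",
            call. = FALSE)
  }
  parsed
}

raw <- c("2020-01-15", "not-a-date", NA)
parsed <- suppressWarnings(parse_dates_safely(raw))
parsed
```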

Running a Cox model

After building the survival dataset, exclude prevalent primary-disease cases and fit a Cox model:

library(survival)

# Exclude prevalent AA cases (they have NA outcome)
cox_data <- analysis_dt[!is.na(outcome_status)]

cox_model <- coxph(
  Surv(outcome_surv_time, outcome_status) ~
    Hypertension_history + Diabetes_history,
  data = cox_data
)
summary(cox_model)

Sensitivity analyses

You can run sensitivity analyses by varying the source definitions:

# Sensitivity 1: hospital records only (strictest)
strict_dt <- build_survival_dataset(
  ukb_data, diseases,
  prevalent_sources = c("ICD10", "ICD9"),
  outcome_sources   = c("ICD10", "ICD9"),
  primary_disease   = "AA"
)

# Sensitivity 2: broad baseline history and fatal events
broad_dt <- build_survival_dataset(
  ukb_data, diseases,
  prevalent_sources = c("ICD10", "ICD9",
                        "Self-report"),
  outcome_sources   = c("ICD10", "ICD9",
                        "Death"),
  primary_disease   = "AA"
)

Long format output

If you need a case-level (long) table instead of the wide cohort table, set output = "long":

long_dt <- build_survival_dataset(
  ukb_data, diseases,
  primary_disease = "AA",
  output = "long",
  include_all = TRUE
)
head(long_dt)

With output = "long" and include_all = TRUE, the function returns one row per participant per disease, including non-cases with status = 0. With include_all = FALSE, it returns only extracted case records from the selected outcome sources.