Chapter 1 Introduction

UKBAnalytica is a high-performance R package for processing UK Biobank (UKB) Research Analysis Platform (RAP) data exports. It provides efficient extraction of diagnosis records from multiple sources and generates Cox regression-ready survival datasets.

1.1 Key features

  • Built on data.table for efficient processing of large-scale biobank data.
  • Supports ICD-10, ICD-9, self-reported illness, and death registry data.
  • Supports optional UKB algorithm-defined outcomes (Category 42).
  • Flexible dual-source case definitions for main and sensitivity analyses.
  • Generates survival datasets with proper prevalent/incident case handling.
  • Supports participant-flow reporting and optional thread tuning in survival dataset construction.
  • Includes predefined definitions for common cardiovascular and metabolic diseases.
  • Provides standardized preprocessing for common UKB baseline variables.
  • Provides advanced analysis modules: subgroup analysis, propensity score methods, mediation analysis, and multiple imputation pooling.
  • Includes sensitivity preprocessing helpers for restricted-dataset regression analyses.

1.2 Recent survival updates (v0.6.2)

  • build_survival_dataset() accepts a show_flow argument that prints participant attrition and stores it on the result as attr(result, "participant_flow").
  • build_survival_dataset() accepts an optional dt_threads argument that temporarily controls the number of data.table threads, which can help with large datasets.
  • Algorithm-defined outcome fields (Category 42) are now recognized under both the p{field}_i0 and p{field} naming styles.
  • Date parsing is more robust to malformed values in ICD, self-report, and death records.
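
The last two items can be pictured with a minimal base-R sketch. The helpers resolve_field_column() and parse_date_safely() are hypothetical names introduced here for illustration, not functions exported by UKBAnalytica, and field 42000 is used only as an example of a Category 42 date field. The sketch assumes dates arrive as ISO YYYY-MM-DD strings; the package's own logic may differ.

```r
# Hypothetical helper (not part of UKBAnalytica): locate the column for a UKB
# field, accepting either the p{field}_i0 or the p{field} naming style.
resolve_field_column <- function(col_names, field_id) {
  candidates <- c(sprintf("p%s_i0", field_id), sprintf("p%s", field_id))
  hit <- candidates[candidates %in% col_names]
  if (length(hit) == 0L) NA_character_ else hit[[1L]]
}

# Hypothetical helper: parse date strings, mapping values that do not match
# the format to NA instead of raising an error. Assumes "YYYY-MM-DD" input.
parse_date_safely <- function(x, format = "%Y-%m-%d") {
  as.Date(as.character(x), format = format)
}

# Two exports that name the same field differently (illustrative columns).
cols_a <- c("eid", "p42000_i0")
cols_b <- c("eid", "p42000")
col_a <- resolve_field_column(cols_a, "42000")
col_b <- resolve_field_column(cols_b, "42000")

# A mix of valid, empty, malformed, and missing date values.
raw_dates <- c("2012-05-17", "", "not-a-date", NA)
parsed <- parse_date_safely(raw_dates)
```

Supplying an explicit format is what makes as.Date() return NA for unparseable strings; without it, base R raises an error on input such as "not-a-date".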

1.3 Installation

Install the development version from GitHub:

# install.packages("devtools")
devtools::install_github("Hinna0818/UKBAnalytica_v2")

1.4 Package overview

The package covers the following workflow:

  1. Data download – Python scripts for downloading data from the RAP.
  2. Variable preprocessing – Standardized cleaning and recoding of baseline characteristics.
  3. Disease phenotyping – Predefined and custom disease definitions from multiple coding systems.
  4. Survival analysis – Build Cox-ready datasets with proper prevalent/incident classification.
  5. Baseline tables – Create Table 1 summaries stratified by case/control status.
  6. Multiple imputation – Impute missing covariates and merge results back to the full dataset.
  7. Main analysis – Correlation analysis, regression models, and sensitivity preprocessing.
  8. Advanced analysis – Subgroup analysis, propensity score methods, mediation analysis, and MI pooling.

The remaining chapters walk through each step with code examples.
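
As a rough illustration of what step 4 involves, the following base-R sketch classifies participants as prevalent or incident cases and derives a follow-up time and event indicator. It does not call UKBAnalytica; the column names, dates, censoring date, and the convention that a diagnosis on the baseline date counts as prevalent are all assumptions made for this example.

```r
# Toy cohort; every column name and date here is an illustrative assumption.
cohort <- data.frame(
  eid            = 1:3,
  baseline_date  = as.Date(c("2008-03-01", "2009-06-15", "2007-11-20")),
  diagnosis_date = as.Date(c("2005-07-04", "2015-02-28", NA))
)
censor_date <- as.Date("2022-12-31")  # assumed administrative end of follow-up

# Prevalent case: first diagnosis on or before the baseline assessment.
cohort$prevalent <- !is.na(cohort$diagnosis_date) &
  cohort$diagnosis_date <= cohort$baseline_date

# Incident case: first diagnosis after baseline and within follow-up.
cohort$incident <- !is.na(cohort$diagnosis_date) &
  cohort$diagnosis_date > cohort$baseline_date &
  cohort$diagnosis_date <= censor_date

# Follow-up ends at the event for incident cases, otherwise at censoring.
end_date <- rep(censor_date, nrow(cohort))
end_date[cohort$incident] <- cohort$diagnosis_date[cohort$incident]
cohort$time_years <- as.numeric(end_date - cohort$baseline_date) / 365.25
cohort$status <- as.integer(cohort$incident)

# A common choice (not necessarily the package default) is to drop prevalent
# cases before fitting a Cox model.
analysis_set <- cohort[!cohort$prevalent, ]
```

The resulting time_years and status columns are the pair a Cox model expects, for example in survival::Surv(time_years, status).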

1.5 Core functions

Function                                  Description
----------------------------------------  ---------------------------------------------------
build_survival_dataset()                  Build wide-format survival dataset
extract_cases_by_source()                 Extract cases from specified data sources
get_predefined_diseases()                 Get predefined disease definitions
create_disease_definition()               Create a custom disease definition
preprocess_baseline()                     Preprocess baseline characteristics
create_baseline_table()                   Create a Table 1 (via tableone)
run_imputation()                          Multiple imputation with mice
sensitivity_exclude_early_events()        Remove early events before time-to-event regression
sensitivity_exclude_missing_covariates()  Remove rows with missing specified covariates
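
To make the two sensitivity helpers concrete, here is a minimal base-R sketch of the kind of filtering their descriptions suggest. It does not use the package functions, whose arguments and exact rules are not shown in this chapter; the column names, the two-year threshold, and the choice to drop only early events (rather than all short follow-up) are assumptions.

```r
# Toy survival data; column names and values are illustrative assumptions.
surv <- data.frame(
  eid        = 1:6,
  time_years = c(0.5, 3.2, 1.9, 8.0, 10.4, 6.1),
  status     = c(1L, 1L, 0L, 0L, 1L, 1L),
  bmi        = c(27.1, NA, 31.4, 24.8, 29.0, 22.5),
  smoking    = c("never", "current", "former", NA, "never", "former")
)

# Sensitivity analysis 1: drop participants whose event occurred within the
# first `min_years` of follow-up (assumed threshold), to limit reverse causation.
min_years <- 2
early_event <- surv$status == 1L & surv$time_years < min_years
surv_no_early <- surv[!early_event, ]

# Sensitivity analysis 2: keep only rows with complete data on the
# covariates named for the regression model.
covariates <- c("bmi", "smoking")
surv_complete <- surv[complete.cases(surv[, covariates]), ]
```

Each filtered data frame can then be passed to the same regression model as the main analysis, so that the estimates are directly comparable.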

1.6 Advanced analysis functions

Function                     Description
---------------------------  ------------------------------------------------
run_subgroup_analysis()      Stratified analysis with interaction p-values
run_multi_subgroup()         Batch subgroup analysis across multiple variables
estimate_propensity_score()  Estimate propensity scores
match_propensity()           Propensity score matching
calculate_weights()          IPTW weight calculation
run_mediation()              Causal mediation analysis
pool_mi_models()             Pool results from multiple imputation
plot_forest()                Forest plot visualization
plot_km_curve()              Kaplan-Meier curves
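
For readers unfamiliar with inverse probability of treatment weighting (IPTW), the idea behind the propensity score functions can be sketched in base R. This example uses simulated data and stats::glm() directly; it does not call estimate_propensity_score() or calculate_weights(), whose arguments are not shown in this chapter, and the variable names are assumptions.

```r
set.seed(42)

# Simulated cohort: exposure depends on age and sex (illustrative only).
n <- 500
age <- rnorm(n, mean = 55, sd = 8)
sex <- rbinom(n, size = 1, prob = 0.5)
treated <- rbinom(n, size = 1, prob = plogis(-0.5 + 0.03 * (age - 55) + 0.4 * sex))
dat <- data.frame(age, sex, treated)

# Propensity score: estimated probability of exposure given covariates.
ps_model <- glm(treated ~ age + sex, data = dat, family = binomial())
dat$ps <- fitted(ps_model)

# Unstabilized weights for the average treatment effect:
# 1 / ps for the exposed, 1 / (1 - ps) for the unexposed.
dat$iptw <- ifelse(dat$treated == 1, 1 / dat$ps, 1 / (1 - dat$ps))

# Stabilized weights multiply by the marginal exposure probability,
# which keeps the mean weight close to 1 and reduces variance.
p_treated <- mean(dat$treated)
dat$iptw_stab <- ifelse(dat$treated == 1,
                        p_treated / dat$ps,
                        (1 - p_treated) / (1 - dat$ps))
```

Either weight column can be supplied through the weights argument of a regression function; stabilized weights are often preferred when some propensity scores are close to 0 or 1.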

1.7 Machine learning functions

Function                Description
----------------------  ------------------------------------------------
ukb_ml_model()          Train ML models (RF, XGBoost, GLMNet, SVM, NNet)
ukb_ml_predict()        Generate predictions from ML model
ukb_ml_cv()             Cross-validation for ML models
ukb_ml_compare()        Compare multiple ML models
ukb_ml_metrics()        Calculate performance metrics
ukb_ml_roc()            ROC curve analysis
ukb_ml_calibration()    Calibration curve analysis
ukb_ml_confusion()      Confusion matrix
ukb_ml_importance()     Variable importance
ukb_shap()              Compute SHAP values
ukb_ml_survival()       Survival ML models (RSF, GBM, CoxNet)
plot_shap_summary()     SHAP summary plot
plot_shap_dependence()  SHAP dependence plot