UKBAnalytica: A Guide to UK Biobank Data Analysis
Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University
nanh302311@gmail.com
2026-04-03
Chapter 1 Introduction
UKBAnalytica is a high-performance R package for processing UK Biobank (UKB)
Research Analysis Platform (RAP) data exports. It provides efficient extraction
of diagnosis records from multiple sources and generates Cox regression-ready
survival datasets.
1.1 Key features
- Built on `data.table` for efficient processing of large-scale biobank data.
- Supports ICD-10, ICD-9, self-reported illness, and death registry data.
- Flexible dual-source case definitions for main and sensitivity analyses.
- Generates survival datasets with proper prevalent/incident case handling.
- Includes predefined definitions for common cardiovascular and metabolic diseases.
- Provides standardized preprocessing for common UKB baseline variables.
- Advanced analysis modules: subgroup analysis, propensity score methods, mediation analysis, and multiple imputation pooling.
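To make the prevalent/incident distinction concrete, here is a minimal base R sketch on toy data. It does not use the package's own functions; the column names, the `classify_case()` helper, and the rule that a diagnosis dated on the baseline date counts as prevalent are all assumptions made for illustration.

```r
# Toy data: one row per participant (all column names are hypothetical).
toy <- data.frame(
  eid            = 1:4,
  baseline_date  = as.Date(c("2008-03-01", "2009-06-15", "2010-01-10", "2007-11-20")),
  diagnosis_date = as.Date(c("2005-07-04", "2014-02-11", NA, "2007-11-20")),
  censor_date    = as.Date(rep("2022-12-31", 4))
)

# Hypothetical helper: diagnosis on or before baseline -> prevalent,
# diagnosis after baseline -> incident, no diagnosis -> none.
classify_case <- function(baseline, diagnosis) {
  ifelse(is.na(diagnosis), "none",
         ifelse(diagnosis <= baseline, "prevalent", "incident"))
}

toy$case_type <- classify_case(toy$baseline_date, toy$diagnosis_date)

# Analyses of incident disease usually drop prevalent cases from the risk set.
at_risk <- toy[toy$case_type != "prevalent", ]
at_risk$event <- as.integer(at_risk$case_type == "incident")

# Follow-up ends at diagnosis for events and at the censoring date otherwise.
end_date <- at_risk$diagnosis_date
no_event <- is.na(end_date)
end_date[no_event] <- at_risk$censor_date[no_event]
at_risk$time_years <- as.numeric(
  difftime(end_date, at_risk$baseline_date, units = "days")
) / 365.25
```

The resulting `event` and `time_years` columns are the two ingredients a Cox model needs; a sensitivity analysis could rerun the same steps with a stricter or looser case definition.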
1.3 Package overview
The package covers the following workflow:
- Data download – Python scripts for downloading data from the RAP.
- Variable preprocessing – Standardized cleaning and recoding of baseline characteristics.
- Disease phenotyping – Predefined and custom disease definitions from multiple coding systems.
- Survival analysis – Build Cox-ready datasets with proper prevalent/incident classification.
- Baseline tables – Create Table 1 summaries stratified by case/control status.
- Multiple imputation – Impute missing covariates and merge results back to the full dataset.
- Main analysis – Correlation analysis and regression models (linear, logistic, Cox).
- Advanced analysis – Subgroup analysis, propensity score, mediation, MI pooling.
The remaining chapters walk through each step with code examples.
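As a rough picture of where the workflow ends, the sketch below fits a Cox model with the `survival` package on its bundled `lung` dataset. It does not call any UKBAnalytica function; the dataset and the covariates are stand-ins for a survival dataset built from UKB data, chosen only because they ship with R.

```r
library(survival)

# `lung` ships with the survival package: `time` is follow-up in days and
# `status` is coded 1 = censored, 2 = dead, which Surv() accepts directly.
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Hazard ratios with 95% confidence intervals.
hr_table <- cbind(
  HR = exp(coef(fit)),
  exp(confint(fit))
)
print(round(hr_table, 3))
```

Any dataset with one follow-up time column and one event indicator column can be passed to `coxph()` in the same way.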
1.4 Core functions
| Function | Description |
|---|---|
| `build_survival_dataset()` | Build wide-format survival dataset |
| `extract_cases_by_source()` | Extract cases from specified data sources |
| `get_predefined_diseases()` | Get predefined disease definitions |
| `create_disease_definition()` | Create a custom disease definition |
| `preprocess_baseline()` | Preprocess baseline characteristics |
| `create_baseline_table()` | Create a Table 1 (via tableone) |
| `run_imputation()` | Multiple imputation with mice |
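The idea behind code-based case extraction can be sketched in base R without the package. The example below assumes long-format diagnosis records with hypothetical column names and matches ICD-10 codes by prefix; the prefixes `I21` and `I22` (acute and subsequent myocardial infarction) are chosen for illustration and are not taken from the package's predefined definitions.

```r
# Hypothetical long-format diagnosis records: one row per participant-code pair.
# UKB stores ICD-10 codes without the dot (e.g. "I210" rather than "I21.0").
records <- data.frame(
  eid   = c(1, 1, 2, 3, 3, 4),
  icd10 = c("I210", "E119", "I251", "I219", "J459", "E110"),
  stringsAsFactors = FALSE
)

# Illustrative disease definition: any code starting with one of these prefixes.
mi_prefixes <- c("I21", "I22")
pattern <- paste0("^(", paste(mi_prefixes, collapse = "|"), ")")

# Keep matching records, then reduce to the set of participants with a match.
mi_records <- records[grepl(pattern, records$icd10), ]
mi_cases   <- unique(mi_records$eid)
print(mi_cases)
```

Prefix matching keeps a definition short, since a three-character prefix covers all of its four-character subcodes.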
1.5 Advanced analysis functions
| Function | Description |
|---|---|
| `run_subgroup_analysis()` | Stratified analysis with interaction p-values |
| `run_multi_subgroup()` | Batch subgroup analysis across multiple variables |
| `estimate_propensity_score()` | Estimate propensity scores |
| `match_propensity()` | Propensity score matching |
| `calculate_weights()` | IPTW weight calculation |
| `run_mediation()` | Causal mediation analysis |
| `pool_mi_models()` | Pool results from multiple imputation |
| `plot_forest()` | Forest plot visualization |
| `plot_km_curve()` | Kaplan-Meier curves |
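For readers unfamiliar with inverse probability of treatment weighting (IPTW), the following base R sketch shows the underlying calculation on simulated data. It is not the package's implementation; the variable names and the simulated exposure model are assumptions made for illustration.

```r
set.seed(42)

# Simulated cohort (all values and names are hypothetical).
n   <- 500
age <- rnorm(n, mean = 57, sd = 8)
sex <- rbinom(n, 1, 0.45)
treated <- rbinom(n, 1, plogis(-0.5 + 0.03 * (age - 57) + 0.4 * sex))
toy <- data.frame(age, sex, treated)

# Step 1: propensity score = P(treated = 1 | covariates), via logistic regression.
ps_model <- glm(treated ~ age + sex, data = toy, family = binomial())
toy$ps <- fitted(ps_model)

# Step 2: unstabilized weights targeting the average treatment effect:
# 1 / ps for the treated, 1 / (1 - ps) for the untreated.
toy$iptw <- ifelse(toy$treated == 1, 1 / toy$ps, 1 / (1 - toy$ps))

# Step 3: stabilized weights multiply by the marginal treatment probability,
# which keeps the weights centred near 1 and reduces their variance.
p_treated <- mean(toy$treated)
toy$siptw <- ifelse(toy$treated == 1,
                    p_treated / toy$ps,
                    (1 - p_treated) / (1 - toy$ps))

summary(toy$siptw)
```

The weights can then be supplied to an outcome model through its `weights` argument, typically together with robust standard errors.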
1.6 Machine learning functions
| Function | Description |
|---|---|
| `ukb_ml_model()` | Train ML models (RF, XGBoost, GLMNet, SVM, NNet) |
| `ukb_ml_predict()` | Generate predictions from ML model |
| `ukb_ml_cv()` | Cross-validation for ML models |
| `ukb_ml_compare()` | Compare multiple ML models |
| `ukb_ml_metrics()` | Calculate performance metrics |
| `ukb_ml_roc()` | ROC curve analysis |
| `ukb_ml_calibration()` | Calibration curve analysis |
| `ukb_ml_confusion()` | Confusion matrix |
| `ukb_ml_importance()` | Variable importance |
| `ukb_shap()` | Compute SHAP values |
| `ukb_ml_survival()` | Survival ML models (RSF, GBM, CoxNet) |
| `plot_shap_summary()` | SHAP summary plot |
| `plot_shap_dependence()` | SHAP dependence plot |
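Two of the metrics named above, the area under the ROC curve and the confusion matrix, can be computed by hand in base R. The sketch below is only meant to show what these quantities are; the `auc_rank()` helper, the toy labels and scores, and the 0.5 threshold are assumptions and do not reflect how the package computes them.

```r
# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) identity.
# rank() assigns average ranks to ties, so tied scores count as half.
auc_rank <- function(y, score) {
  r     <- rank(score)
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy labels and predicted probabilities.
y     <- c(0, 0, 1, 1)
score <- c(0.1, 0.4, 0.35, 0.8)

auc <- auc_rank(y, score)
print(auc)  # 0.75

# Confusion matrix at an illustrative 0.5 threshold.
predicted <- as.integer(score >= 0.5)
conf_mat  <- table(predicted = predicted, observed = y)
print(conf_mat)
```

The AUC is threshold-free, whereas the confusion matrix changes with the cut-off, so the two are usually reported together.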