Chapter 1 Introduction

UKBAnalytica is a high-performance R package for processing data exported from the UK Biobank (UKB) Research Analysis Platform (RAP). It extracts diagnosis records efficiently from multiple sources and generates survival datasets ready for Cox regression.

1.1 Key features

  • Built on data.table for efficient processing of large-scale biobank data.
  • Supports ICD-10, ICD-9, self-reported illness, and death registry data.
  • Flexible dual-source case definitions for main and sensitivity analyses.
  • Generates survival datasets with proper prevalent/incident case handling.
  • Includes predefined definitions for common cardiovascular and metabolic diseases.
  • Provides standardized preprocessing for common UKB baseline variables.
  • Advanced analysis modules: subgroup analysis, propensity score methods, mediation analysis, and multiple imputation pooling.

1.2 Installation

Install the development version from GitHub:

# install.packages("devtools")
devtools::install_github("Hinna0818/UKBAnalytica")

1.3 Package overview

The package covers the following workflow:

  1. Data download – Python scripts for downloading data from the RAP.
  2. Variable preprocessing – Standardized cleaning and recoding of baseline characteristics.
  3. Disease phenotyping – Predefined and custom disease definitions from multiple coding systems.
  4. Survival analysis – Build Cox-ready datasets with proper prevalent/incident classification.
  5. Baseline tables – Create Table 1 summaries stratified by case/control status.
  6. Multiple imputation – Impute missing covariates and merge results back to the full dataset.
  7. Main analysis – Correlation analysis and regression models (linear, logistic, Cox).
  8. Advanced analysis – Subgroup analysis, propensity score methods, mediation analysis, and multiple imputation (MI) pooling.

The remaining chapters walk through each step with code examples.
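As a quick illustration of what "proper prevalent/incident classification" means in step 4, the following base R sketch classifies toy participants by comparing a diagnosis date with the baseline date. It does not call UKBAnalytica; the column names (`baseline_date`, `diagnosis_date`, `censor_date`) and the dates are hypothetical, and the package's own logic may differ in detail.

```r
# Toy data: one row per participant (all column names are hypothetical).
dat <- data.frame(
  eid            = 1:4,
  baseline_date  = as.Date(c("2008-03-01", "2009-06-15", "2007-11-20", "2010-01-10")),
  diagnosis_date = as.Date(c("2005-01-01", "2015-09-30", NA, "2012-04-02")),
  censor_date    = as.Date(rep("2022-12-31", 4))
)

# Prevalent case: diagnosed on or before the baseline assessment.
dat$prevalent <- !is.na(dat$diagnosis_date) &
  dat$diagnosis_date <= dat$baseline_date

# Incident case: diagnosed after baseline and no later than censoring.
dat$incident <- !is.na(dat$diagnosis_date) &
  dat$diagnosis_date > dat$baseline_date &
  dat$diagnosis_date <= dat$censor_date

# Follow-up ends at the incident event, otherwise at the censoring date.
end_date <- dat$censor_date
end_date[dat$incident] <- dat$diagnosis_date[dat$incident]
dat$time_years <- as.numeric(end_date - dat$baseline_date) / 365.25
dat$status <- as.integer(dat$incident)

# Analysis cohort for a Cox model: prevalent cases are dropped.
cohort <- dat[!dat$prevalent, c("eid", "time_years", "status")]
print(cohort)
```

Participant 1 is diagnosed before baseline and is removed as a prevalent case; the remaining rows carry a follow-up time and an event indicator, which is the shape a Cox model expects.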

1.4 Core functions

Function                     Description
build_survival_dataset()     Build wide-format survival dataset
extract_cases_by_source()    Extract cases from specified data sources
get_predefined_diseases()    Get predefined disease definitions
create_disease_definition()  Create a custom disease definition
preprocess_baseline()        Preprocess baseline characteristics
create_baseline_table()      Create a Table 1 (via tableone)
run_imputation()             Multiple imputation with mice

1.5 Advanced analysis functions

Function                     Description
run_subgroup_analysis()      Stratified analysis with interaction p-values
run_multi_subgroup()         Batch subgroup analysis across multiple variables
estimate_propensity_score()  Estimate propensity scores
match_propensity()           Propensity score matching
calculate_weights()          IPTW weight calculation
run_mediation()              Causal mediation analysis
pool_mi_models()             Pool results from multiple imputation
plot_forest()                Forest plot visualization
plot_km_curve()              Kaplan-Meier curves
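
To make the IPTW entry concrete, here is a minimal base R sketch of the underlying calculation: fit a logistic propensity model, then weight each participant by the inverse probability of the exposure they actually received. The data are simulated and the variable names are hypothetical; this illustrates the general technique and is not the implementation of `estimate_propensity_score()` or `calculate_weights()`.

```r
set.seed(1)
n <- 500

# Simulated covariates and a binary exposure that depends on them.
age     <- rnorm(n, mean = 57, sd = 8)
sex     <- rbinom(n, 1, 0.5)
treated <- rbinom(n, 1, plogis(-0.5 + 0.03 * (age - 57) + 0.4 * sex))

# Propensity score: estimated probability of exposure given covariates.
ps_fit <- glm(treated ~ age + sex, family = binomial())
ps     <- fitted(ps_fit)

# Unstabilized weights targeting the average treatment effect (ATE).
w_ate <- ifelse(treated == 1, 1 / ps, 1 / (1 - ps))

# Stabilized weights: the numerator is the marginal exposure probability,
# which typically reduces the variance caused by extreme weights.
p_treat <- mean(treated)
w_stab  <- ifelse(treated == 1, p_treat / ps, (1 - p_treat) / (1 - ps))

summary(w_stab)
```

Inspecting the weight distribution before fitting a weighted outcome model is a common sanity check, since very large weights point to poor overlap between exposure groups.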

1.6 Machine learning functions

Function                     Description
ukb_ml_model()               Train ML models (RF, XGBoost, GLMNet, SVM, NNet)
ukb_ml_predict()             Generate predictions from ML model
ukb_ml_cv()                  Cross-validation for ML models
ukb_ml_compare()             Compare multiple ML models
ukb_ml_metrics()             Calculate performance metrics
ukb_ml_roc()                 ROC curve analysis
ukb_ml_calibration()         Calibration curve analysis
ukb_ml_confusion()           Confusion matrix
ukb_ml_importance()          Variable importance
ukb_shap()                   Compute SHAP values
ukb_ml_survival()            Survival ML models (RSF, GBM, CoxNet)
plot_shap_summary()          SHAP summary plot
plot_shap_dependence()       SHAP dependence plot
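
As a small illustration of the kind of quantity that ROC analysis reports, the sketch below computes the area under the ROC curve (AUC) from predicted scores using the rank-based (Mann-Whitney) formula in base R. The helper `auc_rank()` is hypothetical and written only for this example; it is not part of UKBAnalytica and says nothing about how `ukb_ml_roc()` or `ukb_ml_metrics()` are implemented.

```r
# AUC as the probability that a randomly chosen positive case receives a
# higher score than a randomly chosen negative case (ties count as 1/2).
auc_rank <- function(score, label) {
  stopifnot(length(score) == length(label), all(label %in% c(0, 1)))
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  r <- rank(score)  # average ranks, so tied scores are handled correctly
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

label <- c(0, 0, 1, 1)
score <- c(0.1, 0.4, 0.35, 0.8)
auc_rank(score, label)  # 0.75
```

Of the four positive/negative pairs in this toy example, three are ranked correctly, which gives an AUC of 0.75.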