Machine Learning Workflow

This chapter describes the recommended non-survival machine learning workflow in UKBAnalytica. The main user-facing function is ukb_ml_workflow(). It is designed to keep the API simple while enforcing a standard analysis structure:

Freeze the test set before model development.
Use only training and validation data for feature selection, tuning, and threshold learning.
Refit the final model on the training-development data.
Evaluate once on the frozen test set.

This chapter covers binary classification, multiclass classification, continuous outcomes, and time-to-event prediction. For survival ML, use ukb_ml_survival_workflow(); the legacy ukb_ml_survival() interface remains available only for backward compatibility.

What users usually need

For most projects, users only need one function:

library(UKBAnalytica)

ml <- ukb_ml_workflow(
  outcome ~ age + sex + bmi + smoking + sbp,
  data = analysis_dt,
  model = "logistic",
  seed = 42
)

ml$final_test_metrics
head(ml$final_test_predictions)

This minimal call is useful for a quick baseline model. For a manuscript-ready binary classification workflow, use an explicit train/validation/test split and learn the classification threshold on validation data:

ml <- ukb_ml_workflow(
  outcome_status ~ age + sex + bmi + smoking + sbp + dbp,
  data = analysis_dt,
  model = "rf",
  split_params = list(
    split = "train_valid_test",
    train_ratio = 0.70,
    validation_ratio = 0.10,
    test_ratio = 0.20
  ),
  feature_select = "filter",
  feature_params = list(max_features = 50),
  tune = TRUE,
  tune_params = list(
    resampling = "validation",
    metric = "auc",
    param_grid = data.frame(
      num.trees = c(300, 500),
      mtry = c(3, 5),
      min.node.size = c(5, 10)
    )
  ),
  threshold_method = "youden",
  seed = 42
)

ml$final_test_metrics
head(ml$final_test_predictions)

The test set is not used during feature selection, tuning, or threshold learning. It is used only by ukb_ml_evaluate_test() inside the final step.

Prepare an ML-ready dataset

Start from an analysis-ready table. For incident disease prediction, this is usually the wide table returned by build_survival_dataset(). Keep one row per participant and include only variables available at prediction time.

library(data.table)
library(UKBAnalytica)

ml_dt <- as.data.table(analysis_dt)[
  ,
  .(
    eid,
    outcome_status,
    age,
    sex,
    bmi,
    smoking,
    sbp,
    dbp,
    diabetes_history,
    hypertension_history
  )
]

ml_dt <- ukb_clean_missing(ml_dt)
ukb_snapshot(ml_dt, label = "ml_input")

Keep the outcome and predictors conceptually separated:

Role	Example	Rule
Outcome	`outcome_status`, `diabetes`, `biomarker_value`	The target to predict
Predictors	`age`, `sex`, `bmi`, biomarkers	Must be known before the prediction time
ID	`eid`	Used to check overlap; do not use as a predictor
Leakage variables	follow-up time, diagnosis date after baseline	Do not include as predictors

If you use outcome_status from build_survival_dataset(), make sure the predictor set does not include follow-up information or post-baseline diagnosis variables.

Model choices

ukb_ml_workflow() supports the following non-survival models:

`model`	Outcome type	Best use
`"logistic"`	Binary	Baseline, interpretable classifier
`"linear"`	Continuous	Baseline regression
`"rf"`	Binary, multiclass, continuous	Robust nonlinear tabular model
`"xgboost"`	Binary, multiclass, continuous	Gradient boosting for stronger prediction
`"glmnet"`	Binary, multiclass, continuous	Regularized model and sparse features
`"svm"`	Binary, multiclass, continuous	Nonlinear boundaries in smaller datasets
`"nnet"`	Binary, multiclass, continuous	Simple neural network or multinomial model
`"rpart"`	Binary, multiclass, continuous	Lightweight decision tree baseline
`"naive_bayes"`	Binary, multiclass	Fast probabilistic classifier

Outcome type is inferred automatically in common cases:

Outcome values	Inferred `outcome_type`
Two-level factor or character	`"binary"`
Numeric 0/1	`"binary"`
Factor or character with 3+ levels	`"multiclass"`
Numeric with many values	`"continuous"`

If the outcome is ambiguous, set outcome_type explicitly.

Binary classification

Binary classification is the most common use case. Use a factor with the negative class first and the positive class second when you want full control.

ml_dt$outcome_status <- factor(
  ml_dt$outcome_status,
  levels = c(0, 1),
  labels = c("control", "case")
)

ml <- ukb_ml_workflow(
  outcome_status ~ age + sex + bmi + smoking + sbp + dbp,
  data = ml_dt,
  model = "logistic",
  split_params = list(split = "train_valid_test"),
  tune = TRUE,
  tune_params = list(resampling = "validation", metric = "auc"),
  threshold_method = "youden",
  seed = 42
)

ml$final_test_metrics[c("auc", "accuracy", "sensitivity", "specificity", "f1")]
ml$threshold_result

For binary outcomes, final_test_predictions contains:

Column	Meaning
`truth`	Observed class in the test set
`prob`	Predicted probability for the positive class
`pred_class`	Class after applying the fixed threshold

Continuous outcomes

Use model = "linear" for a baseline model or model = "rf"/"xgboost" for nonlinear prediction.

ml_reg <- ukb_ml_workflow(
  biomarker_value ~ age + sex + bmi + smoking + sbp,
  data = ml_dt,
  model = "linear",
  outcome_type = "continuous",
  split_params = list(split = "train_test", train_ratio = 0.75),
  threshold_method = "none",
  seed = 42
)

ml_reg$final_test_metrics[c("rmse", "mae", "rsquared")]
head(ml_reg$final_test_predictions)

Continuous predictions are reported in the prediction column.

Multiclass outcomes

For multiclass classification, set threshold_method = "none" because a single binary threshold is not meaningful.

ml_multi <- ukb_ml_workflow(
  disease_subtype ~ age + sex + bmi + smoking + sbp,
  data = ml_dt,
  model = "nnet",
  outcome_type = "multiclass",
  split_params = list(split = "train_valid_test"),
  tune = TRUE,
  tune_params = list(
    resampling = "validation",
    metric = "macro_f1",
    param_grid = data.frame(decay = c(0, 0.01, 0.05))
  ),
  threshold_method = "none",
  seed = 42
)

ml_multi$final_test_metrics[c("accuracy", "macro_f1", "logloss")]
head(ml_multi$final_test_predictions)

Multiclass predictions include pred_class plus one probability column per class.

When train/test files already exist

If another workflow has already created train_df, valid_df, and test_df, register them with ukb_ml_as_split(). This function checks duplicated or overlapping IDs when id_col is supplied.

split_obj <- ukb_ml_as_split(
  train_data = train_df,
  validation_data = valid_df,
  test_data = test_df,
  id_col = "eid",
  outcome = "outcome_status",
  outcome_type = "binary"
)

ml <- ukb_ml_workflow(
  outcome_status ~ age + sex + bmi + smoking + sbp,
  split = split_obj,
  model = "rf",
  threshold_method = "youden",
  seed = 42
)

This is useful when the cohort split must be fixed across multiple packages, teams, or external validation analyses.

Feature selection

Feature selection is optional and is always performed without touching the test set.

`feature_select`	Meaning	Notes
`"none"`	Use all formula predictors	Recommended for small predictor sets
`"filter"`	Rank features by simple univariate signal	Fast baseline for many predictors
`"glmnet"`	Select nonzero regularized coefficients	Good for sparse linear signals
`"boruta"`	Boruta all-relevant feature selection	Requires `Boruta`; can be slower

Example:

ml <- ukb_ml_workflow(
  outcome_status ~ .,
  data = ml_dt[, setdiff(names(ml_dt), "eid"), with = FALSE],
  model = "rf",
  feature_select = "glmnet",
  feature_params = list(max_features = 100),
  threshold_method = "youden",
  seed = 42
)

ml$feature_result$selected_features

Use feature_select = "none" when you already have a clinically defined predictor set and do not want data-driven feature selection.

Hyperparameter tuning

ukb_ml_tune() is called internally when tune = TRUE. The easiest standard setup is validation tuning:

ml <- ukb_ml_workflow(
  outcome_status ~ age + sex + bmi + smoking + sbp,
  data = ml_dt,
  model = "xgboost",
  split_params = list(split = "train_valid_test"),
  tune = TRUE,
  tune_params = list(
    resampling = "validation",
    metric = "auc",
    param_grid = data.frame(
      nrounds = c(50, 100),
      max_depth = c(2, 4),
      eta = c(0.05, 0.10)
    )
  ),
  threshold_method = "youden",
  seed = 42
)

If you do not create a validation set, use cross-validation inside the training set:

ml <- ukb_ml_workflow(
  outcome_status ~ age + sex + bmi + smoking + sbp,
  data = ml_dt,
  model = "rf",
  split_params = list(split = "train_test"),
  tune = TRUE,
  tune_params = list(resampling = "cv", folds = 5, metric = "auc"),
  threshold_method = "youden",
  seed = 42
)

Do not use test-set metrics to choose hyperparameters.

Output object

ukb_ml_workflow() returns a list-like ukb_ml_workflow object:

Element	Content
`split`	Frozen `ukb_ml_split` object
`feature_result`	Selected features and feature-selection metadata
`tune_result`	Candidate grid, best parameters, best development-set score
`threshold_result`	Binary threshold learned from validation, CV OOF, or train data
`final_model`	Refit final model
`final_test_metrics`	Test-set metrics computed once
`final_test_predictions`	Row-level test-set predictions

Common inspection commands:

print(ml)

ml$feature_result$selected_features
ml$tune_result$best_params
ml$threshold_result
ml$final_test_metrics

test_pred <- ml$final_test_predictions

Predict on a new RAP-side dataset with the final model:

new_prob <- predict(ml$final_model, newdata = new_participants, type = "prob")
new_class <- predict(ml$final_model, newdata = new_participants, type = "class")

For continuous outcomes, use type = "response".

SHAP interpretation

Use ukb_shap() directly on the workflow object. If data is not supplied, the frozen test set stored in ml$split$test is used, so the interpretation is aligned with the final reported performance.

shap <- ukb_shap(
  ml,
  sample_n = 500,
  nsim = 100,
  seed = 42
)

ukb_shap_summary(shap)
plot_shap_summary(shap)

You can also explain a final model on a specific external dataset:

external_shap <- ukb_shap(
  ml$final_model,
  data = external_validation_df,
  sample_n = 500,
  seed = 42
)

For multiclass models, choose the class probability to explain:

shap_case <- ukb_shap(ml_multi, class_level = "case", sample_n = 500)

Advanced: run the workflow step by step

Most users should use ukb_ml_workflow(). The lower-level functions are useful when you need to save intermediate objects or compare multiple models on the same frozen split.

split_obj <- ukb_ml_split_data(
  ml_dt,
  outcome = "outcome_status",
  split = "train_valid_test",
  seed = 42
)

features <- ukb_ml_feature_select(
  split_obj,
  outcome_status ~ age + sex + bmi + smoking + sbp,
  method = "filter",
  max_features = 50,
  seed = 42
)

tuned <- ukb_ml_tune(
  split_obj,
  formula = features$formula,
  model = "rf",
  resampling = "validation",
  metric = "auc",
  seed = 42
)

final <- ukb_ml_fit_final(
  split_obj,
  formula = features$formula,
  model = "rf",
  best_params = tuned$best_params,
  feature_spec = features,
  seed = 42
)

test_eval <- ukb_ml_evaluate_test(final, split_obj)
test_eval$metrics

Legacy quick-model interface

ukb_ml_model(), ukb_ml_cv(), ukb_ml_metrics(), ukb_ml_roc(), and ukb_ml_compare() remain available for backward compatibility, but they now emit a deprecation warning. For new analyses intended for reporting, use ukb_ml_workflow() because it makes the frozen-test design explicit.

ukb_shap() is not deprecated. It now works with both new workflow objects and legacy ukb_ml objects.

ml_rf <- ukb_ml_model(
  outcome_status ~ age + sex + bmi + smoking,
  data = ml_dt,
  model = "rf",
  task = "classification",
  split_ratio = 0.8,
  seed = 42
)

ukb_ml_metrics(ml_rf, ci = TRUE)
ukb_ml_roc(ml_rf)

Survival machine learning

For time-to-event prediction, use ukb_ml_survival_workflow(). It follows the same structure as ukb_ml_workflow():

Freeze the survival test set.
Select features and tune hyperparameters using only training/validation data.
Refit the final model on training plus validation data.
Evaluate once on the frozen test set.

The simplest baseline uses a Cox model and requires only the survival backend:

surv_ml <- ukb_ml_survival_workflow(
  Surv(outcome_surv_time, outcome_status) ~ age + sex + bmi + smoking,
  data = analysis_dt,
  model = "cox",
  split_params = list(
    split = "train_valid_test",
    train_ratio = 0.70,
    validation_ratio = 0.10,
    test_ratio = 0.20
  ),
  feature_select = "filter",
  feature_params = list(max_features = 50),
  tune = TRUE,
  tune_params = list(resampling = "validation"),
  evaluation_times = c(1, 3, 5, 10),
  seed = 42
)

surv_ml$final_test_metrics
head(surv_ml$final_test_predictions)

Supported survival models:

`model`	Backend	Best use
`"cox"`	`survival`	Interpretable baseline survival model
`"coxnet"`	`glmnet`	Regularized high-dimensional Cox model
`"rsf"`	`randomForestSRC`	Nonlinear random survival forest
`"gbm_surv"`	`gbm`	Gradient-boosted Cox model

The final metrics include Harrell’s C-index and naive time-specific Brier scores at evaluation_times. The Brier scores are not IPCW-adjusted; use them as quick model checks, not as the only censoring-aware performance statistic.

Predict survival probabilities or risk scores from the final model:

surv_prob <- predict(
  surv_ml,
  newdata = external_validation_df,
  times = c(1, 3, 5),
  type = "survival"
)

risk_score <- predict(surv_ml$final_model, newdata = external_validation_df, type = "risk")

If train/validation/test data already exist, register them explicitly:

surv_split <- ukb_ml_survival_as_split(
  train_data = train_df,
  validation_data = valid_df,
  test_data = test_df,
  time = "outcome_surv_time",
  event = "outcome_status",
  id_col = "eid"
)

surv_ml <- ukb_ml_survival_workflow(
  Surv(outcome_surv_time, outcome_status) ~ age + sex + bmi + smoking,
  split = surv_split,
  model = "cox",
  evaluation_times = c(1, 3, 5),
  seed = 42
)

Lower-level survival workflow helpers mirror the non-survival API: ukb_ml_survival_split_data(), ukb_ml_survival_as_split(), ukb_ml_survival_feature_select(), ukb_ml_survival_tune(), ukb_ml_survival_fit_final(), and ukb_ml_survival_evaluate_test().

ukb_ml_survival() now emits a deprecation warning. Use ukb_ml_survival_workflow() for new analyses.

Practical checklist

Before reporting ML results, check the following:

The test set was created before feature selection and tuning.
eid, diagnosis dates, follow-up time, and post-baseline variables are not used as predictors.
The outcome type is correct (binary, multiclass, or continuous).
Binary positive class is the second factor level if you set factor levels manually.
Hyperparameters and thresholds are chosen on train/validation or CV only.
final_test_metrics are reported as final frozen-test performance.
For survival ML, outcome_surv_time and outcome_status are used only as the left-hand side of Surv(time, event), not as predictors.

Optional packages

Install only the model backends you plan to use. UKBAnalytica checks optional packages only when the selected model or helper needs them. It does not install extra ML dependencies unless you explicitly opt in:

options(UKBAnalytica.auto_install_ml = TRUE)

The default is FALSE, which is safer on RAP and HPC systems because users can control the package library and CRAN/Bioconductor mirror themselves.

install.packages(c(
  "ranger",      # random forest
  "xgboost",     # gradient boosting
  "glmnet",      # regularized regression and glmnet feature selection
  "e1071",       # SVM and naive Bayes
  "nnet",        # neural network / multinomial regression
  "rpart",       # decision tree
  "Boruta",      # Boruta feature selection
  "pROC",        # legacy ROC helpers
  "survival",    # Cox and survival metrics
  "gbm",         # boosted survival model
  "randomForestSRC"  # random survival forest
))

logistic and linear models use base R and do not require extra model backend packages. naive_bayes reuses e1071, so no separate package is needed if you already installed it for SVM. For survival ML, model = "cox" uses survival, model = "coxnet" uses glmnet, model = "gbm_surv" uses gbm, and model = "rsf" uses randomForestSRC.