library(UKBAnalytica)
ml <- ukb_ml_workflow(
outcome ~ age + sex + bmi + smoking + sbp,
data = analysis_dt,
model = "logistic",
seed = 42
)
ml$final_test_metrics
head(ml$final_test_predictions)Machine Learning Workflow
This chapter describes the recommended non-survival machine learning workflow in UKBAnalytica. The main user-facing function is ukb_ml_workflow(). It is designed to keep the API simple while enforcing a standard analysis structure:
- Freeze the test set before model development.
- Use only training and validation data for feature selection, tuning, and threshold learning.
- Refit the final model on the training-development data.
- Evaluate once on the frozen test set.
This chapter covers binary classification, multiclass classification, continuous outcomes, and time-to-event prediction. For survival ML, use ukb_ml_survival_workflow(); the legacy ukb_ml_survival() interface remains available only for backward compatibility.
What users usually need
For most projects, users only need one function:
This minimal call is useful for a quick baseline model. For a manuscript-ready binary classification workflow, use an explicit train/validation/test split and learn the classification threshold on validation data:
ml <- ukb_ml_workflow(
outcome_status ~ age + sex + bmi + smoking + sbp + dbp,
data = analysis_dt,
model = "rf",
split_params = list(
split = "train_valid_test",
train_ratio = 0.70,
validation_ratio = 0.10,
test_ratio = 0.20
),
feature_select = "filter",
feature_params = list(max_features = 50),
tune = TRUE,
tune_params = list(
resampling = "validation",
metric = "auc",
param_grid = data.frame(
num.trees = c(300, 500),
mtry = c(3, 5),
min.node.size = c(5, 10)
)
),
threshold_method = "youden",
seed = 42
)
ml$final_test_metrics
head(ml$final_test_predictions)The test set is not used during feature selection, tuning, or threshold learning. It is used only by ukb_ml_evaluate_test() inside the final step.
Prepare an ML-ready dataset
Start from an analysis-ready table. For incident disease prediction, this is usually the wide table returned by build_survival_dataset(). Keep one row per participant and include only variables available at prediction time.
library(data.table)
library(UKBAnalytica)
ml_dt <- as.data.table(analysis_dt)[
,
.(
eid,
outcome_status,
age,
sex,
bmi,
smoking,
sbp,
dbp,
diabetes_history,
hypertension_history
)
]
ml_dt <- ukb_clean_missing(ml_dt)
ukb_snapshot(ml_dt, label = "ml_input")Keep the outcome and predictors conceptually separated:
| Role | Example | Rule |
|---|---|---|
| Outcome | outcome_status, diabetes, biomarker_value |
The target to predict |
| Predictors | age, sex, bmi, biomarkers |
Must be known before the prediction time |
| ID | eid |
Used to check overlap; do not use as a predictor |
| Leakage variables | follow-up time, diagnosis date after baseline | Do not include as predictors |
If you use outcome_status from build_survival_dataset(), make sure the predictor set does not include follow-up information or post-baseline diagnosis variables.
Model choices
ukb_ml_workflow() supports the following non-survival models:
model |
Outcome type | Best use |
|---|---|---|
"logistic" |
Binary | Baseline, interpretable classifier |
"linear" |
Continuous | Baseline regression |
"rf" |
Binary, multiclass, continuous | Robust nonlinear tabular model |
"xgboost" |
Binary, multiclass, continuous | Gradient boosting for stronger prediction |
"glmnet" |
Binary, multiclass, continuous | Regularized model and sparse features |
"svm" |
Binary, multiclass, continuous | Nonlinear boundaries in smaller datasets |
"nnet" |
Binary, multiclass, continuous | Simple neural network or multinomial model |
"rpart" |
Binary, multiclass, continuous | Lightweight decision tree baseline |
"naive_bayes" |
Binary, multiclass | Fast probabilistic classifier |
Outcome type is inferred automatically in common cases:
| Outcome values | Inferred outcome_type |
|---|---|
| Two-level factor or character | "binary" |
| Numeric 0/1 | "binary" |
| Factor or character with 3+ levels | "multiclass" |
| Numeric with many values | "continuous" |
If the outcome is ambiguous, set outcome_type explicitly.
Binary classification
Binary classification is the most common use case. Use a factor with the negative class first and the positive class second when you want full control.
ml_dt$outcome_status <- factor(
ml_dt$outcome_status,
levels = c(0, 1),
labels = c("control", "case")
)
ml <- ukb_ml_workflow(
outcome_status ~ age + sex + bmi + smoking + sbp + dbp,
data = ml_dt,
model = "logistic",
split_params = list(split = "train_valid_test"),
tune = TRUE,
tune_params = list(resampling = "validation", metric = "auc"),
threshold_method = "youden",
seed = 42
)
ml$final_test_metrics[c("auc", "accuracy", "sensitivity", "specificity", "f1")]
ml$threshold_resultFor binary outcomes, final_test_predictions contains:
| Column | Meaning |
|---|---|
truth |
Observed class in the test set |
prob |
Predicted probability for the positive class |
pred_class |
Class after applying the fixed threshold |
Continuous outcomes
Use model = "linear" for a baseline model or model = "rf"/"xgboost" for nonlinear prediction.
ml_reg <- ukb_ml_workflow(
biomarker_value ~ age + sex + bmi + smoking + sbp,
data = ml_dt,
model = "linear",
outcome_type = "continuous",
split_params = list(split = "train_test", train_ratio = 0.75),
threshold_method = "none",
seed = 42
)
ml_reg$final_test_metrics[c("rmse", "mae", "rsquared")]
head(ml_reg$final_test_predictions)Continuous predictions are reported in the prediction column.
Multiclass outcomes
For multiclass classification, set threshold_method = "none" because a single binary threshold is not meaningful.
ml_multi <- ukb_ml_workflow(
disease_subtype ~ age + sex + bmi + smoking + sbp,
data = ml_dt,
model = "nnet",
outcome_type = "multiclass",
split_params = list(split = "train_valid_test"),
tune = TRUE,
tune_params = list(
resampling = "validation",
metric = "macro_f1",
param_grid = data.frame(decay = c(0, 0.01, 0.05))
),
threshold_method = "none",
seed = 42
)
ml_multi$final_test_metrics[c("accuracy", "macro_f1", "logloss")]
head(ml_multi$final_test_predictions)Multiclass predictions include pred_class plus one probability column per class.
When train/test files already exist
If another workflow has already created train_df, valid_df, and test_df, register them with ukb_ml_as_split(). This function checks duplicated or overlapping IDs when id_col is supplied.
split_obj <- ukb_ml_as_split(
train_data = train_df,
validation_data = valid_df,
test_data = test_df,
id_col = "eid",
outcome = "outcome_status",
outcome_type = "binary"
)
ml <- ukb_ml_workflow(
outcome_status ~ age + sex + bmi + smoking + sbp,
split = split_obj,
model = "rf",
threshold_method = "youden",
seed = 42
)This is useful when the cohort split must be fixed across multiple packages, teams, or external validation analyses.
Feature selection
Feature selection is optional and is always performed without touching the test set.
feature_select |
Meaning | Notes |
|---|---|---|
"none" |
Use all formula predictors | Recommended for small predictor sets |
"filter" |
Rank features by simple univariate signal | Fast baseline for many predictors |
"glmnet" |
Select nonzero regularized coefficients | Good for sparse linear signals |
"boruta" |
Boruta all-relevant feature selection | Requires Boruta; can be slower |
Example:
ml <- ukb_ml_workflow(
outcome_status ~ .,
data = ml_dt[, setdiff(names(ml_dt), "eid"), with = FALSE],
model = "rf",
feature_select = "glmnet",
feature_params = list(max_features = 100),
threshold_method = "youden",
seed = 42
)
ml$feature_result$selected_featuresUse feature_select = "none" when you already have a clinically defined predictor set and do not want data-driven feature selection.
Hyperparameter tuning
ukb_ml_tune() is called internally when tune = TRUE. The easiest standard setup is validation tuning:
ml <- ukb_ml_workflow(
outcome_status ~ age + sex + bmi + smoking + sbp,
data = ml_dt,
model = "xgboost",
split_params = list(split = "train_valid_test"),
tune = TRUE,
tune_params = list(
resampling = "validation",
metric = "auc",
param_grid = data.frame(
nrounds = c(50, 100),
max_depth = c(2, 4),
eta = c(0.05, 0.10)
)
),
threshold_method = "youden",
seed = 42
)If you do not create a validation set, use cross-validation inside the training set:
ml <- ukb_ml_workflow(
outcome_status ~ age + sex + bmi + smoking + sbp,
data = ml_dt,
model = "rf",
split_params = list(split = "train_test"),
tune = TRUE,
tune_params = list(resampling = "cv", folds = 5, metric = "auc"),
threshold_method = "youden",
seed = 42
)Do not use test-set metrics to choose hyperparameters.
Output object
ukb_ml_workflow() returns a list-like ukb_ml_workflow object:
| Element | Content |
|---|---|
split |
Frozen ukb_ml_split object |
feature_result |
Selected features and feature-selection metadata |
tune_result |
Candidate grid, best parameters, best development-set score |
threshold_result |
Binary threshold learned from validation, CV OOF, or train data |
final_model |
Refit final model |
final_test_metrics |
Test-set metrics computed once |
final_test_predictions |
Row-level test-set predictions |
Common inspection commands:
print(ml)
ml$feature_result$selected_features
ml$tune_result$best_params
ml$threshold_result
ml$final_test_metrics
test_pred <- ml$final_test_predictionsPredict on a new RAP-side dataset with the final model:
new_prob <- predict(ml$final_model, newdata = new_participants, type = "prob")
new_class <- predict(ml$final_model, newdata = new_participants, type = "class")For continuous outcomes, use type = "response".
SHAP interpretation
Use ukb_shap() directly on the workflow object. If data is not supplied, the frozen test set stored in ml$split$test is used, so the interpretation is aligned with the final reported performance.
shap <- ukb_shap(
ml,
sample_n = 500,
nsim = 100,
seed = 42
)
ukb_shap_summary(shap)
plot_shap_summary(shap)You can also explain a final model on a specific external dataset:
external_shap <- ukb_shap(
ml$final_model,
data = external_validation_df,
sample_n = 500,
seed = 42
)For multiclass models, choose the class probability to explain:
shap_case <- ukb_shap(ml_multi, class_level = "case", sample_n = 500)Advanced: run the workflow step by step
Most users should use ukb_ml_workflow(). The lower-level functions are useful when you need to save intermediate objects or compare multiple models on the same frozen split.
split_obj <- ukb_ml_split_data(
ml_dt,
outcome = "outcome_status",
split = "train_valid_test",
seed = 42
)
features <- ukb_ml_feature_select(
split_obj,
outcome_status ~ age + sex + bmi + smoking + sbp,
method = "filter",
max_features = 50,
seed = 42
)
tuned <- ukb_ml_tune(
split_obj,
formula = features$formula,
model = "rf",
resampling = "validation",
metric = "auc",
seed = 42
)
final <- ukb_ml_fit_final(
split_obj,
formula = features$formula,
model = "rf",
best_params = tuned$best_params,
feature_spec = features,
seed = 42
)
test_eval <- ukb_ml_evaluate_test(final, split_obj)
test_eval$metricsLegacy quick-model interface
ukb_ml_model(), ukb_ml_cv(), ukb_ml_metrics(), ukb_ml_roc(), and ukb_ml_compare() remain available for backward compatibility, but they now emit a deprecation warning. For new analyses intended for reporting, use ukb_ml_workflow() because it makes the frozen-test design explicit.
ukb_shap() is not deprecated. It now works with both new workflow objects and legacy ukb_ml objects.
ml_rf <- ukb_ml_model(
outcome_status ~ age + sex + bmi + smoking,
data = ml_dt,
model = "rf",
task = "classification",
split_ratio = 0.8,
seed = 42
)
ukb_ml_metrics(ml_rf, ci = TRUE)
ukb_ml_roc(ml_rf)Survival machine learning
For time-to-event prediction, use ukb_ml_survival_workflow(). It follows the same structure as ukb_ml_workflow():
- Freeze the survival test set.
- Select features and tune hyperparameters using only training/validation data.
- Refit the final model on training plus validation data.
- Evaluate once on the frozen test set.
The simplest baseline uses a Cox model and requires only the survival backend:
surv_ml <- ukb_ml_survival_workflow(
Surv(outcome_surv_time, outcome_status) ~ age + sex + bmi + smoking,
data = analysis_dt,
model = "cox",
split_params = list(
split = "train_valid_test",
train_ratio = 0.70,
validation_ratio = 0.10,
test_ratio = 0.20
),
feature_select = "filter",
feature_params = list(max_features = 50),
tune = TRUE,
tune_params = list(resampling = "validation"),
evaluation_times = c(1, 3, 5, 10),
seed = 42
)
surv_ml$final_test_metrics
head(surv_ml$final_test_predictions)Supported survival models:
model |
Backend | Best use |
|---|---|---|
"cox" |
survival |
Interpretable baseline survival model |
"coxnet" |
glmnet |
Regularized high-dimensional Cox model |
"rsf" |
randomForestSRC |
Nonlinear random survival forest |
"gbm_surv" |
gbm |
Gradient-boosted Cox model |
The final metrics include Harrell’s C-index and naive time-specific Brier scores at evaluation_times. The Brier scores are not IPCW-adjusted; use them as quick model checks, not as the only censoring-aware performance statistic.
Predict survival probabilities or risk scores from the final model:
surv_prob <- predict(
surv_ml,
newdata = external_validation_df,
times = c(1, 3, 5),
type = "survival"
)
risk_score <- predict(surv_ml$final_model, newdata = external_validation_df, type = "risk")If train/validation/test data already exist, register them explicitly:
surv_split <- ukb_ml_survival_as_split(
train_data = train_df,
validation_data = valid_df,
test_data = test_df,
time = "outcome_surv_time",
event = "outcome_status",
id_col = "eid"
)
surv_ml <- ukb_ml_survival_workflow(
Surv(outcome_surv_time, outcome_status) ~ age + sex + bmi + smoking,
split = surv_split,
model = "cox",
evaluation_times = c(1, 3, 5),
seed = 42
)Lower-level survival workflow helpers mirror the non-survival API: ukb_ml_survival_split_data(), ukb_ml_survival_as_split(), ukb_ml_survival_feature_select(), ukb_ml_survival_tune(), ukb_ml_survival_fit_final(), and ukb_ml_survival_evaluate_test().
ukb_ml_survival() now emits a deprecation warning. Use ukb_ml_survival_workflow() for new analyses.
Practical checklist
Before reporting ML results, check the following:
- The test set was created before feature selection and tuning.
eid, diagnosis dates, follow-up time, and post-baseline variables are not used as predictors.- The outcome type is correct (
binary,multiclass, orcontinuous). - Binary positive class is the second factor level if you set factor levels manually.
- Hyperparameters and thresholds are chosen on train/validation or CV only.
final_test_metricsare reported as final frozen-test performance.- For survival ML,
outcome_surv_timeandoutcome_statusare used only as the left-hand side ofSurv(time, event), not as predictors.
Optional packages
Install only the model backends you plan to use. UKBAnalytica checks optional packages only when the selected model or helper needs them. It does not install extra ML dependencies unless you explicitly opt in:
options(UKBAnalytica.auto_install_ml = TRUE)The default is FALSE, which is safer on RAP and HPC systems because users can control the package library and CRAN/Bioconductor mirror themselves.
install.packages(c(
"ranger", # random forest
"xgboost", # gradient boosting
"glmnet", # regularized regression and glmnet feature selection
"e1071", # SVM and naive Bayes
"nnet", # neural network / multinomial regression
"rpart", # decision tree
"Boruta", # Boruta feature selection
"pROC", # legacy ROC helpers
"survival", # Cox and survival metrics
"gbm", # boosted survival model
"randomForestSRC" # random survival forest
))logistic and linear models use base R and do not require extra model backend packages. naive_bayes reuses e1071, so no separate package is needed if you already installed it for SVM. For survival ML, model = "cox" uses survival, model = "coxnet" uses glmnet, model = "gbm_surv" uses gbm, and model = "rsf" uses randomForestSRC.