Chapter 7 Multiple Imputation
Missing data are common in large cohort studies. UKBAnalytica provides run_imputation(), a wrapper around the mice package that handles the subset-impute-merge workflow in a single call.
7.1 Motivation
A typical imputation pipeline for UKB data involves:
- Selecting the covariates to impute.
- Running mice::mice().
- Extracting each completed dataset.
- Merging the imputed covariates back to the full data (exposures, outcomes, etc.).
- Optionally joining additional omics datasets (proteomics, metabolomics).
run_imputation() automates all five steps.
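For comparison, the manual version of the first four steps might look roughly like the sketch below. It is not the implementation of run_imputation(); it only illustrates the subset-impute-merge pattern using mice and data.table directly. The toy data come from mice's built-in nhanes dataset, and the eid and outcome columns are hypothetical stand-ins added for illustration.

```r
library(mice)
library(data.table)

# Toy data: mice's built-in nhanes, plus a hypothetical ID and outcome column
dt <- as.data.table(nhanes)
dt[, eid := .I]
dt[, outcome := rep(c(0, 1), length.out = .N)]

imp_vars <- c("age", "bmi", "hyp", "chl")

# Step 1: select the covariates to impute
sub <- dt[, imp_vars, with = FALSE]

# Step 2: run mice on the subset only
imp <- mice(sub, m = 3, maxit = 5, seed = 1234, printFlag = FALSE)

# Steps 3-4: extract each completed dataset and bind it back to the
# untouched columns (row order is preserved by mice::complete)
rest <- dt[, setdiff(names(dt), imp_vars), with = FALSE]
data_list <- lapply(seq_len(imp$m), function(i) {
  cbind(rest, as.data.table(mice::complete(imp, i)))
})

length(data_list)
```

Each element of data_list here carries the ID and outcome columns alongside the imputed covariates, which is the shape the rest of this chapter assumes.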
7.2 Basic usage
library(UKBAnalytica)
library(data.table)
dt <- fread("population.csv")
# Variables to impute
imp_vars <- c(
"sex", "age", "bmi",
"smoking", "drinking",
"education", "income",
"exercise_intensity", "sleep_duration",
"ethnicity",
"ldl", "hdl", "hba1c",
"sbp", "dbp"
)
# Categorical variables (will be coerced to factor)
cat_vars <- c(
"sex", "smoking", "drinking",
"education", "income", "ethnicity",
"exercise_intensity"
)
res <- run_imputation(
data = dt,
id_col = "eid",
vars = imp_vars,
factor_vars = cat_vars,
method = "pmm",
m = 5,
maxit = 10,
seed = 1234,
print = FALSE
)
7.3 Working with the output
run_imputation() returns a list with two elements:
| Element | Description |
|---|---|
| imp | The mice mids object (for diagnostics, pooling, etc.) |
| data_list | A list of m completed datasets with imputed covariates merged back |
Access individual datasets:
# First completed dataset
imputed_dt1 <- res$data_list[[1]]
dim(imputed_dt1)
# Second completed dataset
imputed_dt2 <- res$data_list[[2]]
All columns that were not in vars (exposures, outcomes, follow-up time, etc.) are preserved as-is.
7.4 Merging additional datasets
If you have separate omics datasets (e.g., proteomics, metabolomics), you can merge them automatically:
pro_df <- fread("protein_imputed.csv")
meta_df <- fread("metabolomics_imputed.csv")
res <- run_imputation(
data = dt,
id_col = "eid",
vars = imp_vars,
factor_vars = cat_vars,
m = 5,
seed = 1234,
print = FALSE,
additional_data = list(
protein = pro_df,
metabolomics = meta_df
),
additional_join = "inner"
)
# Each element of data_list now includes the
# omics columns as well
dim(res$data_list[[1]])
Set additional_join = "left" if you want to keep all participants even when they are missing from the omics data.
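The difference between the two join types can be seen on a toy example. The sketch below uses data.table::merge() directly and assumes that additional_join follows the usual inner/left merge semantics; the tables and the protein_a column are made up for illustration.

```r
library(data.table)

# Hypothetical main data: four participants
main <- data.table(eid = 1:4, bmi = c(22, 27, 31, 25))

# Hypothetical omics data: participant 3 has no measurements
omics <- data.table(eid = c(1L, 2L, 4L), protein_a = c(0.3, 1.1, 0.7))

# Inner join: only participants present in both tables are kept
inner <- merge(main, omics, by = "eid")

# Left join: all participants are kept; missing omics values become NA
left <- merge(main, omics, by = "eid", all.x = TRUE)

nrow(inner)  # 3
nrow(left)   # 4
```

With an inner join the analysis sample shrinks to participants who have omics data, so it is worth checking the row count of each completed dataset after merging.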
7.6 Pooled analysis
For analyses that account for imputation uncertainty, use with() on the mids object (mice provides the with.mids method) and mice::pool():
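A minimal sketch of this workflow follows. To stay self-contained it builds a mids object from mice's built-in nhanes data; in practice res$imp from run_imputation() would take the place of imp, and the model formula would use your own imputed variables. The linear model here is purely illustrative.

```r
library(mice)

# Stand-in for res$imp: a mids object built from mice's example data
imp <- mice(nhanes, m = 5, maxit = 10, seed = 1234, printFlag = FALSE)

# Fit the same model in every completed dataset
fit <- with(imp, lm(bmi ~ age + chl))

# Combine estimates and standard errors using Rubin's rules
pooled <- pool(fit)
summary(pooled)
```

The pooled summary reports one row per model term, with standard errors that reflect both within- and between-imputation variance.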
7.8 Parameters
| Argument | Default | Description |
|---|---|---|
| data | – | Input data.frame or data.table |
| id_col | "eid" | Name of the ID column |
| vars | – | Variables to impute |
| factor_vars | NULL | Subset of vars to treat as factors |
| method | "pmm" | Imputation method (passed to mice) |
| m | 5 | Number of imputations |
| maxit | 10 | Maximum iterations |
| seed | 1234 | Random seed |
| print | TRUE | Show iteration logs |
| additional_data | NULL | Named list of extra datasets to merge |
| additional_join | "inner" | Join type for additional datasets |