Chapter 7 Multiple Imputation

Missing data are common in large cohort studies. UKBAnalytica provides run_imputation(), a wrapper around the mice package that handles the subset-impute-merge workflow in a single call.

7.1 Motivation

A typical imputation pipeline for UKB data involves:

1. Selecting the covariates to impute.
2. Running mice::mice().
3. Extracting each completed dataset.
4. Merging the imputed covariates back to the full data (exposures, outcomes, etc.).
5. Optionally joining additional omics datasets (proteomics, metabolomics).

run_imputation() automates all five steps.
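
For orientation, the manual workflow that run_imputation() replaces might look roughly like the sketch below. This is not the package's implementation. It uses the small nhanes dataset shipped with mice, and the eid and outcome columns are made up for illustration.

```r
library(mice)

# Toy data: mice's built-in nhanes data plus a made-up ID and outcome column
set.seed(1)
toy <- data.frame(
  eid     = seq_len(nrow(nhanes)),
  nhanes,                                   # age, bmi, hyp, chl
  outcome = rbinom(nrow(nhanes), 1, 0.3)    # hypothetical outcome
)

# Step 1: select the covariates to impute
vars <- c("age", "bmi", "chl")

# Step 2: run mice on that subset only
imp <- mice(toy[, vars], m = 5, maxit = 10, method = "pmm",
            seed = 1234, printFlag = FALSE)

# Steps 3 and 4: extract each completed dataset and put the
# untouched columns back (complete() keeps the original row order)
other_cols <- setdiff(names(toy), vars)
data_list <- lapply(seq_len(imp$m), function(i) {
  completed <- complete(imp, i)
  cbind(toy[, other_cols, drop = FALSE], completed)
})

length(data_list)
names(data_list[[1]])
```

Step 5 (joining omics data) is omitted here; it amounts to one further merge per completed dataset.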

7.2 Basic usage

library(UKBAnalytica)
library(data.table)

dt <- fread("population.csv")

# Variables to impute
imp_vars <- c(
  "sex", "age", "bmi",
  "smoking", "drinking",
  "education", "income",
  "exercise_intensity", "sleep_duration",
  "ethnicity",
  "ldl", "hdl", "hba1c",
  "sbp", "dbp"
)

# Categorical variables (will be coerced to factor)
cat_vars <- c(
  "sex", "smoking", "drinking",
  "education", "income", "ethnicity",
  "exercise_intensity"
)

res <- run_imputation(
  data        = dt,
  id_col      = "eid",
  vars        = imp_vars,
  factor_vars = cat_vars,
  method      = "pmm",
  m           = 5,
  maxit       = 10,
  seed        = 1234,
  print       = FALSE
)

7.3 Working with the output

run_imputation() returns a list with two elements:

Element	Description
`imp`	The `mice` `mids` object (for diagnostics, pooling, etc.)
`data_list`	A list of `m` completed datasets with imputed covariates merged back

Access individual datasets:

# First completed dataset
imputed_dt1 <- res$data_list[[1]]
dim(imputed_dt1)

# Second completed dataset
imputed_dt2 <- res$data_list[[2]]

All columns that were not in vars (exposures, outcomes, follow-up time, etc.) are preserved as-is.
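
A quick sanity check on the output is to confirm that no missing values remain in the imputed variables. The helper below, count_missing(), is hypothetical (it is not part of UKBAnalytica) and is shown on a made-up two-element list standing in for res$data_list.

```r
# Hypothetical helper: count remaining NAs per variable in each completed dataset.
# Uses [[ ]] so it works for both data.frame and data.table inputs.
count_missing <- function(data_list, vars) {
  counts <- lapply(data_list, function(d) {
    vapply(vars, function(v) sum(is.na(d[[v]])), integer(1))
  })
  do.call(rbind, counts)  # one row per completed dataset, one column per variable
}

# Made-up stand-in for res$data_list; the second dataset still has one NA in sbp
toy_list <- list(
  data.frame(eid = 1:3, bmi = c(22.1, 27.4, 30.2), sbp = c(120, 135, 128)),
  data.frame(eid = 1:3, bmi = c(22.1, 26.9, 30.2), sbp = c(120, 135, NA))
)

missing_tab <- count_missing(toy_list, c("bmi", "sbp"))
missing_tab
```

With real output, the equivalent call would be count_missing(res$data_list, imp_vars). A non-zero entry flags a variable that was not fully imputed in that dataset.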

7.4 Merging additional datasets

If you have separate omics datasets (e.g., proteomics, metabolomics), you can merge them automatically:

pro_df  <- fread("protein_imputed.csv")
meta_df <- fread("metabolomics_imputed.csv")

res <- run_imputation(
  data        = dt,
  id_col      = "eid",
  vars        = imp_vars,
  factor_vars = cat_vars,
  m           = 5,
  seed        = 1234,
  print       = FALSE,
  additional_data = list(
    protein      = pro_df,
    metabolomics = meta_df
  ),
  additional_join = "inner"
)

# Each element of data_list now includes the
# omics columns as well
dim(res$data_list[[1]])

Set additional_join = "left" if you want to keep all participants even when they are missing from the omics data.
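
The difference between the two join types can be seen on a small made-up example with base R's merge(). Whether run_imputation() uses merge() internally is an assumption; the sketch only illustrates the usual inner versus left join semantics.

```r
# Made-up main data (five participants) and omics data (three participants)
main    <- data.frame(eid = 1:5, bmi = c(22.1, 27.4, 30.2, 24.8, 26.0))
protein <- data.frame(eid = c(1L, 2L, 4L), crp = c(0.8, 2.1, 1.3))

# Inner join: only participants present in both datasets are kept
inner <- merge(main, protein, by = "eid")

# Left join: all participants in main are kept; missing omics values become NA
left <- merge(main, protein, by = "eid", all.x = TRUE)

nrow(inner)
nrow(left)
sum(is.na(left$crp))
```

An inner join therefore shrinks the analysis sample to participants with omics measurements, while a left join keeps the full sample at the cost of NA values in the omics columns.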

7.5 Diagnostics

Because the raw mids object is returned, you can use standard mice diagnostics:

library(mice)

# Convergence plots
plot(res$imp)

# Strip plots for a specific variable
stripplot(res$imp, bmi ~ .imp)

# Density plots
densityplot(res$imp, ~ bmi)

7.6 Pooled analysis

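For analyses that account for imputation uncertainty, fit the model in each imputed dataset with with() (mice supplies the method for mids objects) and combine the results with mice::pool():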

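library(mice)
library(survival)  # only needed for time-to-event models such as coxph()

# Fit the same model in each of the m imputed datasets
fit <- with(res$imp, lm(bmi ~ age + sex))

# Combine the m sets of estimates using Rubin's rules
pooled <- pool(fit)
summary(pooled)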
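
To make the pooling step less of a black box, the sketch below applies Rubin's rules by hand to a single coefficient. The estimates and standard errors are made-up numbers, not output from any real model. In practice pool() does this for every coefficient and also computes the degrees of freedom.

```r
# Made-up coefficient estimates and standard errors from m = 5 imputed datasets
est <- c(0.52, 0.48, 0.55, 0.50, 0.45)
se  <- c(0.10, 0.11, 0.10, 0.12, 0.11)
m   <- length(est)

q_bar <- mean(est)     # pooled estimate: mean of the m estimates
u_bar <- mean(se^2)    # within-imputation variance
b     <- var(est)      # between-imputation variance

# Total variance under Rubin's rules
t_var     <- u_bar + (1 + 1 / m) * b
pooled_se <- sqrt(t_var)

c(estimate = q_bar, se = pooled_se)
```

The pooled standard error is larger than the average within-imputation standard error because it also carries the between-imputation variance. Analysing a single completed dataset would miss this component.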

7.7 Saving completed datasets

for (i in seq_along(res$data_list)) {
  saveRDS(
    res$data_list[[i]],
    file = sprintf("imputed_dataset_%d.rds", i)
  )
}
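
To reload the files later, the same naming pattern can be used with readRDS(). The sketch below writes a made-up two-element list to a temporary directory so that it runs on its own. With real output, toy_list would be res$data_list and out_dir your project folder.

```r
# Made-up stand-in for res$data_list
toy_list <- list(
  data.frame(eid = 1:3, bmi = c(22.0, 25.0, 28.0)),
  data.frame(eid = 1:3, bmi = c(22.0, 26.0, 28.0))
)

out_dir <- tempdir()
files <- file.path(
  out_dir,
  sprintf("imputed_dataset_%d.rds", seq_along(toy_list))
)

# Save each completed dataset to its own file
for (i in seq_along(toy_list)) {
  saveRDS(toy_list[[i]], file = files[i])
}

# Read the files back into a list, in the same order
restored <- lapply(files, readRDS)
length(restored)
```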

7.8 Parameters

Argument	Default	Description
`data`	–	Input data.frame or data.table
`id_col`	`"eid"`	Name of the ID column
`vars`	–	Variables to impute
`factor_vars`	`NULL`	Subset of `vars` to treat as factors
`method`	`"pmm"`	Imputation method (passed to `mice`)
`m`	`5`	Number of imputations
`maxit`	`10`	Maximum iterations
`seed`	`1234`	Random seed
`print`	`TRUE`	Show iteration logs
`additional_data`	`NULL`	Named list of extra datasets to merge
`additional_join`	`"inner"`	Join type for additional datasets