RAP-internal Data Extraction

UKBAnalytica provides R-native helpers for extracting phenotype data inside the UK Biobank Research Analysis Platform (RAP). The recommended workflow calls the DNAnexus dx CLI from R, using dx extract_dataset for small synchronous extracts and the RAP table-exporter app for large cloud-side extracts.

The older Python scripts are still shipped in inst/python/ as legacy/helper entry points for users who already rely on those command-line workflows.

File structure

inst/
  python/
    ukb_data_loader.py          # Demographics & metabolites (Spark)
    protein_loader.py           # Proteomics (dx commands)
    field_ids_demographic.txt   # Example demographic field IDs
  extdata/
    metabolites_non_ratio.txt   # Non-ratio metabolite reference (170 fields)

R-native phenotype extraction

Run these functions inside a RAP R session, or in a local environment where the DNAnexus dx CLI is installed and authenticated. For most users, the intended workflow is RAP JupyterLab -> Terminal -> R.

Start an R session on RAP

  1. Log in to the UK Biobank RAP.
  2. Open the project that contains your approved UKB .dataset file.
  3. Start a JupyterLab or R-capable workstation.
  4. Use a small instance such as mem1_ssd1_v2_x4 for field checks and dry-runs. Use rap_submit_extract() for large jobs instead of pulling many fields into the active R session.
  5. In JupyterLab, open Terminal.
  6. Start R:
R

If the package is not installed yet:

# install.packages("remotes")  # run once if remotes is not installed
remotes::install_github("Hinna0818/UKBAnalytica")
library(UKBAnalytica)

RAP project files are usually mounted under /mnt/project. Files written by an interactive session may also appear in the current working directory. Check both locations when looking for outputs:

getwd()
list.files()
list.files("/mnt/project")

Confirm the dataset

The package expects a .dataset file in the current RAP project root. Detect it with:

dataset <- rap_find_dataset()
dataset

A successful result looks like this (the application ID and date in the name are placeholders and will differ in your project):

"app12345_20260401.dataset"

If no dataset is found, confirm that you opened the correct RAP project and that the .dataset file is visible in the project root.

Download the official RAP data dictionary

Before extracting data, it is useful to create an official project-specific data dictionary from the RAP dataset. UKBAnalytica does not ship a third-party field dictionary. Instead, it can call the DNAnexus dx extract_dataset -ddd command inside RAP and use the generated official dictionary for field lookup.

dict_path <- ukb_download_rap_dictionary(
  dataset = dataset,
  output_dir = "ukb_metadata"
)

dict_path

ukb_download_rap_dictionary() is a RAP-protected function. If it is called outside a RAP-like environment, it stops and asks the user to run ukb_check_rap_env() first. This keeps dictionary generation tied to the approved RAP project rather than a local copy of participant-level data.

Search fields using English or Chinese terms

ukb_query_dictionary() combines two sources:

  • the official RAP data dictionary generated from your approved .dataset;
  • the built-in UKBAnalytica Chinese field-path dictionary.

English field names, UKB field IDs, and RAP column names are searched directly against the official dictionary:

ukb_query_dictionary(
  query = c("sex", "age at recruitment", "21022", "p41270"),
  official_dict = dict_path
)

Chinese queries are first matched against the built-in Chinese dictionary and then converted into English candidate terms for official-dictionary matching:

bp_query <- ukb_query_dictionary(
  query = c("收缩压", "舒张压", "吸烟", "教育"),
  official_dict = dict_path
)

bp_query$chinese_matches
bp_query$official_matches

The returned object contains:

  • query_info: detected language and English candidate terms;
  • chinese_matches: matched Chinese dictionary entries;
  • official_matches: official UKB field IDs, RAP/UKB column names, value types, units, coding names, and matching scores.

Chinese-to-English matching is intentionally conservative. Results with lower matching scores are marked as needs_review, so users can verify the selected field IDs before extraction.

Validate requested columns

After querying fields or reading an extracted dataset, check that all required columns are present:

needed_cols <- c("eid", "participant.p31", "participant.p21022", "participant.p21001")
available_cols <- c("eid", "p31", "p21022", "p21001", "p20116")

ukb_validate_columns(
  data = available_cols,
  columns = needed_cols,
  ignore_entity_prefix = TRUE
)

ignore_entity_prefix = TRUE treats participant.p31 and p31 as equivalent. For analysis scripts where missing columns should stop the workflow, use:

ukb_validate_columns(
  data = extracted_data,
  columns = needed_cols,
  error = TRUE
)
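The equivalence applied by ignore_entity_prefix can be pictured with a short standalone sketch. It is written in Python purely for illustration and is not how ukb_validate_columns() is implemented. The helper names strip_entity_prefix and missing_columns are hypothetical, and the sketch assumes that the entity prefix is everything up to the first dot.

```python
def strip_entity_prefix(name: str) -> str:
    """Drop a leading 'entity.' prefix, e.g. 'participant.p31' -> 'p31'."""
    return name.split(".", 1)[1] if "." in name else name


def missing_columns(available, needed, ignore_entity_prefix=True):
    """Return the needed columns that have no match among the available ones."""
    norm = strip_entity_prefix if ignore_entity_prefix else (lambda s: s)
    have = {norm(col) for col in available}
    return [col for col in needed if norm(col) not in have]


needed_cols = ["eid", "participant.p31", "participant.p21022", "participant.p21001"]
available_cols = ["eid", "p31", "p21022", "p21001", "p20116"]

# With prefixes ignored, every needed column is found.
print(missing_columns(available_cols, needed_cols))
# With exact matching, the three prefixed names are reported as missing.
print(missing_columns(available_cols, needed_cols, ignore_entity_prefix=False))
```

Comparing normalised names on both sides keeps the check symmetric: it does not matter whether the prefix appears in the requested names, the available names, or both.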

Inspect available fields

Use rap_list_fields() to search fields that are approved for your project:

rap_list_fields(pattern = "^participant\\.p31|^participant\\.p53|^participant\\.p21022|^participant\\.p21001")

rap_list_fields(pattern = "blood pressure|smoking|alcohol")

Common demographic and baseline field IDs:

Field ID   Meaning
--------   -----------------------------------
31         Sex
53         Date of attending assessment centre
21022      Age at recruitment
21001      Body mass index
21000      Ethnic background
20116      Smoking status
20117      Alcohol drinker status
738        Household income
6138       Qualifications
4080       Automated systolic blood pressure
4079       Automated diastolic blood pressure
93         Manual systolic blood pressure
94         Manual diastolic blood pressure

You can also use the predefined variable names used by preprocess_baseline():

get_variable_info("demographics")
get_variable_info("anthropometrics")
get_variable_info("lifestyle")
get_variable_info("socioeconomic")
get_variable_info("blood_pressure")

Get structured field metadata

The simplest current workflow is:

metadata setup -> search/info -> extract -> decode

Use ukb_metadata_setup() to combine the RAP-approved field list with any optional UKB data dictionary or coding/encoding table you already have.

If you only need the RAP-approved field list:

fields <- rap_list_fields(
  pattern = "blood pressure|^participant\\.p31|^participant\\.p21022"
)

meta <- ukb_metadata_setup(
  source = "files",
  fields_df = fields
)

If you also have UKB metadata files, add them. data_dict can be a RAP data_dictionary.csv generated by dx extract_dataset -ddd, an older Data_Dictionary_Showcase.tsv, or an equivalent field metadata table. codings can be an older Codings.tsv or an equivalent coding/encoding table.

meta <- ukb_metadata_setup(
  source = "files",
  fields_df = fields,
  data_dict = "Data_Dictionary_Showcase.tsv",
  codings = "Codings.tsv"
)

Inspect one official field definition or search by keyword:

ukb_field_info(4080, metadata = meta)

bp_fields <- ukb_search_fields("blood pressure", metadata = meta)
bp_fields[, c("field_id", "title", "rap_field_names", "n_rap_columns")]

UKBAnalytica also includes curated extraction-oriented variable sets. These are useful when you want a coherent group of field IDs without manually typing every field:

unique(get_variable_sets()$set)

clinical_core <- get_variable_set("clinical_core")
clinical_core[, c("variable", "field_id", "ukb_col", "category")]

air_ids <- get_variable_set("air_pollution", output = "field_id")

Preview before extracting

Always dry-run large jobs before submission:

plan <- rap_submit_extract(
  variables = c(
    "sex", "age", "ethnicity",
    "bmi", "height", "weight",
    "smoking", "drinking",
    "education", "income",
    "sbp_auto_1", "sbp_auto_2",
    "dbp_auto_1", "dbp_auto_2"
  ),
  file = "baseline_demographics",
  dry_run = TRUE
)

plan$fields
plan$matched
plan$command

Use variables for exact columns defined by UKBAnalytica. Use field_id when you want all instances and arrays for a UKB field:

plan <- rap_submit_extract(
  field_id = c(31, 53, 21022, 21001, 21000, 20116, 20117, 738, 6138, 4080, 4079),
  file = "baseline_demographics",
  dry_run = TRUE
)
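The difference between an exact column and "all instances and arrays" follows from the RAP column naming pattern p<field>_i<instance>_a<array>, in which either suffix may be absent. As an illustration only, the expansion of a field ID into its columns can be sketched in Python as below. The helper columns_for_fields is hypothetical and the column names are made up for the example; rap_submit_extract() resolves fields internally and may do so differently.

```python
import re

# Optional entity prefix, field ID, optional instance and array suffixes.
RAP_COLUMN = re.compile(r"^(?:\w+\.)?p(\d+)(?:_i(\d+))?(?:_a(\d+))?$")


def columns_for_fields(columns, field_ids):
    """Group the column names that belong to each requested field ID."""
    wanted = {int(f) for f in field_ids}
    grouped = {f: [] for f in sorted(wanted)}
    for col in columns:
        m = RAP_COLUMN.match(col)
        if m and int(m.group(1)) in wanted:
            grouped[int(m.group(1))].append(col)
    return grouped


example_columns = [
    "participant.eid",
    "participant.p31",
    "participant.p4080_i0_a0",
    "participant.p4080_i0_a1",
    "participant.p4080_i1_a0",
    "participant.p40800",  # made-up name: must not be mistaken for field 4080
]

print(columns_for_fields(example_columns, [31, 4080]))
```

Matching on the parsed field number rather than on a name prefix avoids picking up unrelated fields whose IDs merely start with the same digits.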

You can also extract directly from field IDs or a search result with the new metadata-aware wrapper:

baseline <- ukb_extract_fields(
  field_id = c(31, 21022, 21001, 4080, 4079),
  metadata = meta,
  mode = "sync",
  strip_entity_prefix = FALSE
)

baseline_decoded <- ukb_decode(baseline, metadata = meta)

The same wrapper can extract a curated variable set:

air_pollution <- ukb_extract_fields(
  field_id = get_variable_set("air_pollution", output = "field_id"),
  metadata = meta,
  mode = "sync",
  strip_entity_prefix = FALSE
)

Small synchronous extraction

For a few fields, rap_extract_pheno() can extract data with dx extract_dataset and read the CSV directly into R:

baseline_small <- rap_extract_pheno(
  variables = c("sex", "age", "bmi", "smoking", "drinking"),
  strip_entity_prefix = TRUE
)

dim(baseline_small)
head(baseline_small)

Write the result to the current RAP working directory:

data.table::fwrite(baseline_small, "baseline_small.csv")

If you want a stable local path in the active RAP session:

baseline_small <- rap_extract_pheno(
  variables = c("sex", "age", "bmi", "smoking", "drinking"),
  output = "baseline_small.csv",
  strip_entity_prefix = TRUE
)

For large extractions, do not use synchronous mode. Submit a table-exporter job instead.

Monitor the table-exporter job

From the same JupyterLab Terminal, use the returned job ID:

dx describe job-xxxx

List recent jobs:

dx find jobs --brief

After completion, check that the output file exists:

dx ls
ls /mnt/project

Depending on the RAP session and project mount, the completed output can be visible in the project file browser, the project root, or /mnt/project.
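Checking the job by hand can be replaced by a small polling loop. The following Python sketch assumes that dx describe job-xxxx --json returns a JSON object with a state field, and that done, failed, and terminated are the states after which a job no longer changes. The function names dx_job_state and wait_for_job, the polling interval, and the poll limit are all illustrative choices, not part of UKBAnalytica.

```python
import json
import subprocess
import time

# Job states after which the job is assumed not to change any more.
TERMINAL_STATES = {"done", "failed", "terminated"}


def dx_job_state(job_id: str) -> str:
    """Read the job state from `dx describe <job_id> --json`."""
    result = subprocess.run(
        ["dx", "describe", job_id, "--json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)["state"]


def wait_for_job(job_id, get_state=dx_job_state, poll_seconds=60, max_polls=120):
    """Poll until the job reaches a terminal state or the poll budget runs out."""
    for _ in range(max_polls):
        state = get_state(job_id)
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_id} still not finished after {max_polls} polls")


# Offline demonstration with a stand-in for dx, so no real job is contacted.
fake_states = iter(["runnable", "running", "done"])
print(wait_for_job("job-xxxx", get_state=lambda _id: next(fake_states), poll_seconds=0))
```

Passing the state lookup as an argument keeps the loop testable without a dx login; in a real session the default dx_job_state would be used with the job ID returned at submission.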

Read extracted data back into R

If the file is in the current working directory:

baseline <- data.table::fread("baseline_demographics.csv")

If the file is visible under /mnt/project:

baseline <- data.table::fread("/mnt/project/baseline_demographics.csv")

Then continue with UKBAnalytica preprocessing:

baseline <- preprocess_baseline(
  baseline,
  variables = c(
    "sex", "age", "ethnicity",
    "bmi", "smoking", "drinking",
    "education", "income"
  )
)

baseline <- calculate_blood_pressure(baseline, type = "sbp")
baseline <- calculate_blood_pressure(baseline, type = "dbp")

Minimal complete demographic workflow

library(UKBAnalytica)

rap_find_dataset()

# Define the variable list once and reuse it for the dry-run and the real job
baseline_vars <- c(
  "sex", "age", "ethnicity",
  "bmi", "height", "weight",
  "smoking", "drinking",
  "education", "income",
  "sbp_auto_1", "sbp_auto_2",
  "dbp_auto_1", "dbp_auto_2"
)

# Preview the plan without submitting anything
plan <- rap_submit_extract(
  variables = baseline_vars,
  file = "baseline_demographics",
  dry_run = TRUE
)

plan$fields

# Submit the table-exporter job
job <- rap_submit_extract(
  variables = baseline_vars,
  file = "baseline_demographics",
  priority = "low"
)

job$job_id

Legacy Python helpers

The Python scripts in inst/python/ extract the requested fields and write them to a .csv file. They are supplementary: the R-native functions described above are the recommended interface, and the scripts are kept for users who already rely on the command-line workflow.

Demographic data

Extract any combination of UKB fields by specifying their field IDs. The loader uses Spark via dxdata under the hood.

# Pass IDs directly
python ukb_data_loader.py demographic \
  --ids 31,53,21022,21001 \
  -o population.csv

# Or read IDs from a file (recommended for many fields)
python ukb_data_loader.py demographic \
  --id-file field_ids_demographic.txt \
  -o population.csv

The ID file may contain comments (introduced by #) and field IDs separated by commas, spaces, or line breaks:

# Demographics
31      # Sex
53      # Date of assessment
21022   # Age at recruitment

# Comma separated
20116, 20117, 1289
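To make the accepted format concrete, here is a minimal Python sketch of a parser for such a file. It is not the code used by ukb_data_loader.py; the function name parse_field_ids is hypothetical, and dropping duplicate IDs while keeping their first-seen order is an assumption made for the example.

```python
import re


def parse_field_ids(text: str) -> list[int]:
    """Extract field IDs from ID-file text, ignoring '#' comments."""
    ids = []
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # drop the comment part, if any
        for token in re.split(r"[,\s]+", line.strip()):
            if token:
                ids.append(int(token))  # raises ValueError on a non-numeric token
    # Preserve order while removing duplicates.
    return list(dict.fromkeys(ids))


example = """\
# Demographics
31      # Sex
53      # Date of assessment
21022   # Age at recruitment

# Comma separated
20116, 20117, 1289
"""

print(parse_field_ids(example))
```

Failing loudly on a non-numeric token is deliberate in this sketch: a typo in the ID file is then caught before any extraction starts.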

Metabolomics data (NMR)

Extract NMR metabolomics data. You can retrieve all 251 metabolite fields or restrict to the curated non-ratio subset of 170 fields.

# All metabolite fields
python ukb_data_loader.py metabolites -o metabolites_all.csv

# Non-ratio subset only
python ukb_data_loader.py metabolites \
  --non-ratio \
  -o metabolites_non_ratio.csv

Common UKB field IDs

The table below lists commonly used field IDs for reference:

Category            Field IDs                     Description
------------------  ----------------------------  ------------------------------
Basic demographics  31, 53, 21022, 21001          Sex, assessment date, age, BMI
Lifestyle           20116, 20117, 1160            Smoking, alcohol, sleep
Blood pressure      93, 94, 4079, 4080            Systolic/diastolic BP
Biomarkers          30870, 30780, 30760, 30750    Triglycerides, LDL, HDL, HbA1c
Hospital records    41270, 41280, 41271, 41281    ICD-10/9 diagnoses + dates
Death registry      40000, 40001, 40002           Death date, causes

Typical output sizes

Data type                File                       Approx. size  Rows    Columns
-----------------------  -------------------------  ------------  ------  --------
Demographics             population.csv             10–500 MB     ~500 K  variable
Metabolites (all)        metabolites_all.csv        ~300 MB       ~120 K  251
Metabolites (non-ratio)  metabolites_non_ratio.csv  ~200 MB       ~120 K  170
Proteomics               protein_all_merged.csv     ~800 MB       ~50 K   ~3000

Troubleshooting

  • dx is not found: start the workflow inside a RAP JupyterLab Terminal or use an environment where DNAnexus dx-toolkit is installed and authenticated.
  • No .dataset file is found: confirm that you opened the correct RAP project and that the dataset file is visible in the project root.
  • Cannot find the exported CSV: check the project file browser, dx ls, list.files(), and list.files("/mnt/project"). Table-exporter outputs are written to the RAP project rather than directly into a local R object.
  • Spark initialisation fails: make sure you are running inside a RAP JupyterLab session.
  • Field not found: some fields require instance/array suffixes in the column name (e.g., _i0, _a0, as in p4080_i0_a0).
  • Memory issues: reduce --batch-size for protein extraction outputs.
  • Timeout: increase --delay.