RAP-internal Data Extraction
UKBAnalytica provides R-native helpers for extracting phenotype data inside the UK Biobank Research Analysis Platform (RAP). The recommended workflow calls the DNAnexus dx CLI from R, using dx extract_dataset for small synchronous extracts and the RAP table-exporter app for large cloud-side extracts.
The older Python scripts are still shipped in inst/python/ as legacy/helper entry points for users who already rely on those command-line workflows.
File structure
inst/
  python/
    ukb_data_loader.py           # Demographics & metabolites (Spark)
    protein_loader.py            # Proteomics (dx commands)
    field_ids_demographic.txt    # Example demographic field IDs
  extdata/
    metabolites_non_ratio.txt    # Non-ratio metabolite reference (170 fields)
R-native phenotype extraction
Run these functions inside a RAP R session, or in a local environment where the DNAnexus dx CLI is installed and authenticated. For most users, the intended workflow is RAP JupyterLab -> Terminal -> R.
Start an R session on RAP
- Log in to the UK Biobank RAP.
- Open the project that contains your approved UKB .dataset file.
- Start a JupyterLab or R-capable workstation.
- Use a small instance such as mem1_ssd1_v2_x4 for field checks and dry-runs. Use rap_submit_extract() for large jobs instead of pulling many fields into the active R session.
- In JupyterLab, open Terminal.
- Start R:

R

If the package is not installed yet:
#install.packages("remotes")
remotes::install_github("Hinna0818/UKBAnalytica")
library(UKBAnalytica)

RAP project files are usually mounted under /mnt/project. Files written by an interactive session may also appear in the current working directory. Check both locations when looking for outputs:
getwd()
list.files()
list.files("/mnt/project")

Confirm the dataset
The package expects a .dataset file in the current RAP project root. Detect it with:
dataset <- rap_find_dataset()
dataset

A successful result looks like this (the ID shown is illustrative; yours will differ):
"app12345_20260401.dataset"
If no dataset is found, confirm that you opened the correct RAP project and that the .dataset file is visible in the project root.
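As a rough illustration of what detection involves, the sketch below filters a project listing for names ending in .dataset. This is not necessarily how rap_find_dataset() works internally; the helper name find_dataset_names() is hypothetical, and the listing is a made-up example so the code runs without the dx CLI.

```r
# Hypothetical helper: keep only names that end in ".dataset".
find_dataset_names <- function(listing) {
  listing <- trimws(listing)
  listing[grepl("\\.dataset$", listing)]
}

# Made-up project listing, used here instead of a live dx call.
listing <- c("app12345_20260401.dataset", "notes.txt", "results/")
find_dataset_names(listing)

# In a RAP session, the listing could come from the dx CLI instead
# (requires an authenticated dx installation):
# listing <- system2("dx", "ls", stdout = TRUE)
```

If this returns an empty vector for your own project listing, the dataset is probably not in the folder you listed.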
Download the official RAP data dictionary
Before extracting data, it is useful to create an official project-specific data dictionary from the RAP dataset. UKBAnalytica does not ship a third-party field dictionary. Instead, it can call the DNAnexus dx extract_dataset -ddd command inside RAP and use the generated official dictionary for field lookup.
dict_path <- ukb_download_rap_dictionary(
dataset = dataset,
output_dir = "ukb_metadata"
)
dict_path

ukb_download_rap_dictionary() is a RAP-protected function. If it is called outside a RAP-like environment, it stops and asks the user to run ukb_check_rap_env() first. This keeps dictionary generation tied to the approved RAP project rather than a local copy of participant-level data.
Search fields using English or Chinese terms
ukb_query_dictionary() combines two sources:
- the official RAP data dictionary generated from your approved .dataset;
- the built-in UKBAnalytica Chinese field-path dictionary.
English field names, UKB field IDs, and RAP column names are searched directly against the official dictionary:
ukb_query_dictionary(
query = c("sex", "age at recruitment", "21022", "p41270"),
official_dict = dict_path
)

Chinese queries are first matched against the built-in Chinese dictionary and then converted into English candidate terms for official-dictionary matching:
bp_query <- ukb_query_dictionary(
# Chinese terms for: systolic blood pressure, diastolic blood pressure, smoking, education
query = c("收缩压", "舒张压", "吸烟", "教育"),
official_dict = dict_path
)
bp_query$chinese_matches
bp_query$official_matches

The returned object contains:
- query_info: detected language and English candidate terms;
- chinese_matches: matched Chinese dictionary entries;
- official_matches: official UKB field IDs, RAP/UKB column names, value types, units, coding names, and matching scores.
Chinese-to-English matching is intentionally conservative. Results with lower matching scores are marked as needs_review, so users can verify the selected field IDs before extraction.
Validate requested columns
After querying fields or reading an extracted dataset, check that all required columns are present:
needed_cols <- c("eid", "participant.p31", "participant.p21022", "participant.p21001")
available_cols <- c("eid", "p31", "p21022", "p21001", "p20116")
ukb_validate_columns(
data = available_cols,
columns = needed_cols,
ignore_entity_prefix = TRUE
)

ignore_entity_prefix = TRUE treats participant.p31 and p31 as equivalent. For analysis scripts where missing columns should stop the workflow, use:
ukb_validate_columns(
data = extracted_data,
columns = needed_cols,
error = TRUE
)

Inspect available fields
Use rap_list_fields() to search fields that are approved for your project:
rap_list_fields(pattern = "^participant\\.p31|^participant\\.p53|^participant\\.p21022|^participant\\.p21001")
rap_list_fields(pattern = "blood pressure|smoking|alcohol")

Common demographic and baseline field IDs:
| Field ID | Meaning |
|---|---|
| 31 | Sex |
| 53 | Date of attending assessment centre |
| 21022 | Age at recruitment |
| 21001 | Body mass index |
| 21000 | Ethnic background |
| 20116 | Smoking status |
| 20117 | Alcohol drinker status |
| 738 | Household income |
| 6138 | Qualifications |
| 4080 | Automated systolic blood pressure |
| 4079 | Automated diastolic blood pressure |
| 93 | Manual systolic blood pressure |
| 94 | Manual diastolic blood pressure |
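The field IDs in this table can be turned into an anchored pattern similar to the one passed to rap_list_fields() earlier. The sketch below is a minimal base-R illustration; field_pattern() is a hypothetical helper, and it assumes RAP column names of the form participant.p<ID>, optionally followed by _i<n> or _a<n> suffixes. Requiring an underscore or the end of the name after the ID prevents field 31 from also matching a longer ID such as 3100.

```r
# Hypothetical helper: build an anchored regex for RAP column names.
# Assumes names like participant.p31 or participant.p4080_i0_a1.
field_pattern <- function(field_ids, entity = "participant") {
  ids <- format(field_ids, scientific = FALSE, trim = TRUE)
  paste0("^", entity, "\\.p", ids, "(_|$)", collapse = "|")
}

pattern <- field_pattern(c(31, 21022, 4080))

# Made-up column names to show what the pattern keeps and rejects.
cols <- c("participant.p31", "participant.p3100_i0",
          "participant.p4080_i0_a1", "participant.p21022")
grepl(pattern, cols)
# TRUE FALSE TRUE TRUE
```

The resulting string could then be supplied as the pattern argument shown above.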
You can also use the predefined variable names used by preprocess_baseline():
get_variable_info("demographics")
get_variable_info("anthropometrics")
get_variable_info("lifestyle")
get_variable_info("socioeconomic")
get_variable_info("blood_pressure")

Get structured field metadata
The simplest current workflow is:
metadata setup -> search/info -> extract -> decode
Use ukb_metadata_setup() to combine the RAP-approved field list with any optional UKB data dictionary or coding/encoding table you already have.
If you only need the RAP-approved field list:
fields <- rap_list_fields(
pattern = "blood pressure|^participant\\.p31|^participant\\.p21022"
)
meta <- ukb_metadata_setup(
source = "files",
fields_df = fields
)

If you also have UKB metadata files, add them. data_dict can be a RAP data_dictionary.csv generated by dx extract_dataset -ddd, an older Data_Dictionary_Showcase.tsv, or an equivalent field metadata table. codings can be an older Codings.tsv or an equivalent coding/encoding table.
meta <- ukb_metadata_setup(
source = "files",
fields_df = fields,
data_dict = "Data_Dictionary_Showcase.tsv",
codings = "Codings.tsv"
)

Inspect one official field definition or search by keyword:
ukb_field_info(4080, metadata = meta)
bp_fields <- ukb_search_fields("blood pressure", metadata = meta)
bp_fields[, c("field_id", "title", "rap_field_names", "n_rap_columns")]

UKBAnalytica also includes curated extraction-oriented variable sets. These are useful when you want a coherent group of field IDs without manually typing every field:
unique(get_variable_sets()$set)
clinical_core <- get_variable_set("clinical_core")
clinical_core[, c("variable", "field_id", "ukb_col", "category")]
air_ids <- get_variable_set("air_pollution", output = "field_id")

Preview before extracting
Always dry-run large jobs before submission:
plan <- rap_submit_extract(
variables = c(
"sex", "age", "ethnicity",
"bmi", "height", "weight",
"smoking", "drinking",
"education", "income",
"sbp_auto_1", "sbp_auto_2",
"dbp_auto_1", "dbp_auto_2"
),
file = "baseline_demographics",
dry_run = TRUE
)
plan$fields
plan$matched
plan$command

Use variables for exact columns defined by UKBAnalytica. Use field_id when you want all instances and arrays for a UKB field:
plan <- rap_submit_extract(
field_id = c(31, 53, 21022, 21001, 21000, 20116, 20117, 738, 6138, 4080, 4079),
file = "baseline_demographics",
dry_run = TRUE
)

You can also extract directly from field IDs or a search result with the new metadata-aware wrapper:
baseline <- ukb_extract_fields(
field_id = c(31, 21022, 21001, 4080, 4079),
metadata = meta,
mode = "sync",
strip_entity_prefix = FALSE
)
baseline_decoded <- ukb_decode(baseline, metadata = meta)

The same wrapper can extract a curated variable set:
air_pollution <- ukb_extract_fields(
field_id = get_variable_set("air_pollution", output = "field_id"),
metadata = meta,
mode = "sync",
strip_entity_prefix = FALSE
)

Small synchronous extraction
For a few fields, rap_extract_pheno() can extract data with dx extract_dataset and read the CSV directly into R:
baseline_small <- rap_extract_pheno(
variables = c("sex", "age", "bmi", "smoking", "drinking"),
strip_entity_prefix = TRUE
)
dim(baseline_small)
head(baseline_small)

Write the result to the current RAP working directory:
data.table::fwrite(baseline_small, "baseline_small.csv")

If you want a stable local path in the active RAP session:
baseline_small <- rap_extract_pheno(
variables = c("sex", "age", "bmi", "smoking", "drinking"),
output = "baseline_small.csv",
strip_entity_prefix = TRUE
)

For large extractions, do not use synchronous mode. Submit a table-exporter job instead.
Recommended demographic extraction with table-exporter
Use rap_submit_extract() for regular baseline datasets:
job <- rap_submit_extract(
variables = c(
"sex", "age", "ethnicity",
"bmi", "height", "weight",
"smoking", "drinking",
"sleep_duration",
"education", "income",
"sbp_auto_1", "sbp_auto_2",
"dbp_auto_1", "dbp_auto_2",
"triglycerides", "ldl", "hdl", "hba1c", "glucose"
),
file = "baseline_demographics",
priority = "low"
)
job$job_id
job$output

This submits a RAP table-exporter job and creates baseline_demographics.csv in the RAP project when the job finishes.
Monitor the table-exporter job
From the same JupyterLab Terminal, use the returned job ID:
dx describe job-xxxx

List recent jobs:

dx find jobs --brief

After completion, check that the output file exists, both in the project (dx ls) and on the local project mount (ls):

dx ls
ls /mnt/project

Depending on the RAP session and project mount, the completed output can be visible in the project file browser, the project root, or /mnt/project.
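Because the output may show up in more than one place, it can help to check the candidate locations from R in one step. The sketch below uses only base R; locate_output() is a hypothetical helper, not a UKBAnalytica function, and the default directories simply mirror the two locations mentioned above.

```r
# Hypothetical helper: return the first existing path for a file name,
# checking the working directory first and then the project mount.
locate_output <- function(file, dirs = c(getwd(), "/mnt/project")) {
  candidates <- file.path(dirs, file)
  found <- candidates[file.exists(candidates)]
  if (length(found) == 0) NA_character_ else found[1]
}

path <- locate_output("baseline_demographics.csv")
if (is.na(path)) {
  message("Output not found yet; the table-exporter job may still be running.")
} else {
  message("Found output at: ", path)
}
```

An NA result does not necessarily mean the job failed; checking the job state with dx describe is the more reliable signal.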
Read extracted data back into R
If the file is in the current working directory:
baseline <- data.table::fread("baseline_demographics.csv")

If the file is visible under /mnt/project:

baseline <- data.table::fread("/mnt/project/baseline_demographics.csv")

Then continue with UKBAnalytica preprocessing:
baseline <- preprocess_baseline(
baseline,
variables = c(
"sex", "age", "ethnicity",
"bmi", "smoking", "drinking",
"education", "income"
)
)
baseline <- calculate_blood_pressure(baseline, type = "sbp")
baseline <- calculate_blood_pressure(baseline, type = "dbp")

Minimal complete demographic workflow
library(UKBAnalytica)
rap_find_dataset()
plan <- rap_submit_extract(
variables = c(
"sex", "age", "ethnicity",
"bmi", "height", "weight",
"smoking", "drinking",
"education", "income",
"sbp_auto_1", "sbp_auto_2",
"dbp_auto_1", "dbp_auto_2"
),
file = "baseline_demographics",
dry_run = TRUE
)
plan$fields
job <- rap_submit_extract(
variables = c(
"sex", "age", "ethnicity",
"bmi", "height", "weight",
"smoking", "drinking",
"education", "income",
"sbp_auto_1", "sbp_auto_2",
"dbp_auto_1", "dbp_auto_2"
),
file = "baseline_demographics",
priority = "low"
)
job$job_id

Legacy Python helpers
These Python scripts extract data from the command line and write the queried fields to a .csv file. To keep the package consistently R-based, they are supplementary to the R-native workflow described above.
Demographic data
Extract any combination of UKB fields by specifying their field IDs. The loader uses Spark via dxdata under the hood.
# Pass IDs directly
python ukb_data_loader.py demographic \
--ids 31,53,21022,21001 \
-o population.csv
# Or read IDs from a file (recommended for many fields)
python ukb_data_loader.py demographic \
--id-file field_ids_demographic.txt \
-o population.csv

The ID file supports comments (#), comma-separated and space-separated formats:
# Demographics
31 # Sex
53 # Date of assessment
21022 # Age at recruitment
# Comma separated
20116, 20117, 1289

Metabolomics data (NMR)
Extract NMR metabolomics data. You can retrieve all 251 metabolite fields or restrict to the curated non-ratio subset of 170 fields.
# All metabolite fields
python ukb_data_loader.py metabolites -o metabolites_all.csv
# Non-ratio subset only
python ukb_data_loader.py metabolites \
--non-ratio \
-o metabolites_non_ratio.csv

Proteomics data (Olink)
Extract Olink protein expression data via dx commands. The loader handles batching, merging, and progress tracking automatically.
# Default settings
python protein_loader.py -o ./protein_data
# Custom batch size and delay
python protein_loader.py \
-o ./protein_data \
--batch-size 100 \
--delay 3
# Skip merging batch files
python protein_loader.py -o ./protein_data --no-merge

Common UKB field IDs
The table below lists commonly used field IDs for reference:
| Category | Field IDs | Description |
|---|---|---|
| Basic demographics | 31, 53, 21022, 21001 | Sex, assessment date, age, BMI |
| Lifestyle | 20116, 20117, 1160 | Smoking, alcohol, sleep |
| Blood pressure | 93, 94, 4079, 4080 | Systolic/diastolic BP |
| Biomarkers | 30870, 30780, 30760, 30750 | Triglycerides, LDL, HDL, HbA1c |
| Hospital records | 41270, 41280, 41271, 41281 | ICD-10/9 diagnoses + dates |
| Death registry | 40000, 40001, 40002 | Death date, causes |
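Field IDs like these are often kept in an ID file of the kind accepted by the legacy loader, with # comments and comma- or space-separated values. As a rough base-R sketch of reading that format (not the loader's actual code; parse_field_ids() is a hypothetical name), one could write:

```r
# Hypothetical parser: read field IDs from text lines that may contain
# "#" comments and comma- or space-separated values.
parse_field_ids <- function(lines) {
  no_comments <- sub("#.*$", "", lines)             # drop everything after "#"
  tokens <- unlist(strsplit(no_comments, "[,[:space:]]+"))
  tokens <- tokens[nzchar(tokens)]                  # remove empty pieces
  unique(as.integer(tokens))
}

# Lines in the same style as the ID file example shown earlier.
lines <- c("# Demographics", "31 # Sex", "53 # Date of assessment",
           "# Comma separated", "20116, 20117, 1289")
parse_field_ids(lines)
# 31 53 20116 20117 1289
```

With a real file, the lines would come from readLines(), and the resulting integer vector could be passed as a field_id argument.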
Typical output sizes
| Data type | File | Approx. size | Rows | Columns |
|---|---|---|---|---|
| Demographics | population.csv | 10–500 MB | ~500 K | variable |
| Metabolites (all) | metabolites_all.csv | ~300 MB | ~120 K | 251 |
| Metabolites (non-ratio) | metabolites_non_ratio.csv | ~200 MB | ~120 K | 170 |
| Proteomics | protein_all_merged.csv | ~800 MB | ~50 K | ~3000 |
Troubleshooting
- dx is not found: start the workflow inside a RAP JupyterLab Terminal or use an environment where DNAnexus dx-toolkit is installed and authenticated.
- No .dataset file is found: confirm that you opened the correct RAP project and that the dataset file is visible in the project root.
- Cannot find the exported CSV: check the project file browser, dx ls, list.files(), and list.files("/mnt/project"). Table-exporter outputs are written to the RAP project rather than directly into a local R object.
- Spark initialisation fails: make sure you are running inside a RAP JupyterLab session.
- Field not found: some fields require instance/array suffixes (e.g., _i0, _a0).
- Memory issues: reduce --batch-size for protein extraction outputs.
- Timeout: increase --delay.