Chapter 4 Disease Definitions
UKBAnalytica identifies disease cases from up to five data sources:
| Source | Code field | Date field |
|---|---|---|
| ICD-10 (hospital) | p41270 | p41280_a* |
| ICD-9 (hospital) | p41271 | p41281_a* |
| Self-report | p20002_i*_a* | p20008_i*_a* |
| Death registry | p40001, p40002 | p40000 |
| Algorithm (Category 42, optional) | p{algo_source_field} or p{algo_source_field}_i0 | p{algo_date_field} or p{algo_date_field}_i0 |
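The * in the field names above is read here as a numeric instance or array index (an assumption based on the usual UKB column naming, e.g. p41280_a0, p41280_a1). As a minimal sketch, such a pattern can be turned into a regular expression to locate the matching columns. The helper ukb_pattern_to_regex() and the toy column names are hypothetical and not part of UKBAnalytica:

```r
# Hypothetical helper: convert a UKB-style column pattern such as
# "p41280_a*" into an anchored regular expression.
ukb_pattern_to_regex <- function(pattern) {
  # Replace each "*" wildcard with "one or more digits".
  paste0("^", gsub("*", "[0-9]+", pattern, fixed = TRUE), "$")
}

# Toy column names standing in for names(ukb_data).
col_names <- c("eid", "p41270", "p41280_a0", "p41280_a1",
               "p20002_i0_a0", "p20002_i1_a3", "p20008_i0_a0")

# Columns holding ICD-10 diagnosis dates and self-report codes.
icd10_date_cols  <- grep(ukb_pattern_to_regex("p41280_a*"), col_names, value = TRUE)
self_report_cols <- grep(ukb_pattern_to_regex("p20002_i*_a*"), col_names, value = TRUE)
```

Anchoring the expression with ^ and $ keeps p41280_a* from also matching unrelated columns that merely contain the same digits.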
4.1 Predefined diseases
The package ships a curated library of definitions for common cardiovascular
and metabolic conditions. Retrieve the full set with get_predefined_diseases(), or select a subset by name, for example get_predefined_diseases()[c("AA", "Hypertension")].
Each definition is a list specifying ICD-10 codes, ICD-9 codes, self-report codes, death-cause codes, and optional algorithm field IDs.
4.2 Inspecting a definition
A definition typically contains:
- icd10: character vector of ICD-10 codes (prefix-matched).
- icd9: character vector of ICD-9 codes.
- self_report: integer vector of self-report coding IDs.
- death_icd10: character vector of death-cause ICD-10 codes.
- algo_date_field: integer UKB field ID for the algorithm-defined diagnosis date (optional).
- algo_source_field: integer UKB field ID for the algorithm-defined diagnosis source (optional).
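To make "prefix-matched" concrete: a definition code such as K70 is taken to match any recorded code that begins with it (K703, for example). The following is a minimal base-R sketch, assuming a definition behaves as a plain named list. matches_any_prefix() is a hypothetical helper, not the package's own matching routine:

```r
# A toy definition laid out as a plain named list (field names from the text).
liver_def <- list(
  icd10       = c("K70", "K74"),
  icd9        = c("571"),
  self_report = c(1604L),
  death_icd10 = c("K70", "K74")
)

# Hypothetical helper: TRUE where a recorded code starts with any prefix.
matches_any_prefix <- function(codes, prefixes) {
  vapply(codes, function(code) any(startsWith(code, prefixes)),
         logical(1), USE.NAMES = FALSE)
}

# Toy recorded codes: the first two fall under the definition's prefixes.
recorded <- c("K703", "K746", "I10", "K80")
is_case_code <- matches_any_prefix(recorded, liver_def$icd10)
```

Prefix matching means a three-character code in the definition covers all of its four-character subcodes, so the definition does not need to enumerate them.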
4.3 Creating custom definitions
Use create_disease_definition() to define a new disease:
my_disease <- create_disease_definition(
  icd10 = c("K70", "K71", "K72", "K73", "K74"),  # hospital ICD-10 prefixes
  icd9 = c("571"),                                # hospital ICD-9 codes
  self_report = c(1604),                          # self-report coding IDs
  death_icd10 = c("K70", "K74")                   # death-cause ICD-10 codes
)

For diseases with UKB algorithm-defined outcomes (Category 42), you can also add algorithm date/source fields:
copd_def <- create_disease_definition(
  icd10 = c("J41", "J42", "J43", "J44"),
  icd9 = c("490", "491", "492", "496"),
  self_report = c(1112),
  death_icd10 = c("J41", "J42", "J43", "J44"),
  algo_date_field = 42016,    # algorithm-defined diagnosis date
  algo_source_field = 42017   # algorithm-defined diagnosis source
)

When parsing algorithm fields, both column naming styles are supported: p{field}_i0 and p{field}.
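The two naming styles can be handled with a small lookup. This is a sketch only: resolve_field_column() is a hypothetical helper, and preferring the bare name when both columns exist is an assumption, not documented package behaviour:

```r
# Hypothetical helper: return the column holding a UKB field, accepting
# either the bare name (p42016) or the instance-0 name (p42016_i0).
resolve_field_column <- function(col_names, field) {
  candidates <- paste0("p", field, c("", "_i0"))
  hit <- candidates[candidates %in% col_names]
  # Assumption: the bare name wins if both styles are present.
  if (length(hit) == 0) NA_character_ else hit[1]
}

# Toy column names mixing both styles.
toy_cols <- c("eid", "p42016_i0", "p42017")
date_col   <- resolve_field_column(toy_cols, 42016)
source_col <- resolve_field_column(toy_cols, 42017)
missing    <- resolve_field_column("eid", 42016)
```

Returning NA_character_ for an absent field lets the caller skip the algorithm source when the extract does not include it.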
You can combine a custom definition with the predefined diseases:
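One plausible way to do this is to append the custom definition to the named list of predefined ones. The sketch assumes that get_predefined_diseases() returns a named list (the name-based subsetting used in Section 4.5 suggests this) and that definitions behave as plain lists. Stand-in lists are used so that it runs without the package, and the Hypertension codes are placeholders rather than the package's actual definition:

```r
# Stand-in for get_predefined_diseases(); placeholder codes only.
predefined <- list(
  Hypertension = list(icd10 = c("I10"))
)

# Stand-in for the object returned by create_disease_definition().
copd_def <- list(
  icd10 = c("J41", "J42", "J43", "J44"),
  algo_date_field = 42016,
  algo_source_field = 42017
)

# Append the custom definition under a name of your choice.
all_diseases <- c(predefined, list(COPD = copd_def))
names(all_diseases)
```

Wrapping copd_def in list(COPD = ...) matters: c(predefined, copd_def) would splice the definition's individual fields into the top-level list instead of adding it as one named entry.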
4.4 Composite endpoints
Create composite endpoints by merging multiple definitions:
This merges ICD-10, ICD-9, self-report, and death codes from all input definitions into a single combined definition.
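Conceptually, such a merge amounts to taking the union of each code vector across the input definitions. Here is a minimal base-R sketch of that idea, using a hypothetical helper merge_definitions_sketch() that is not a UKBAnalytica function and toy definitions laid out as plain lists:

```r
# Hypothetical helper: union each code field across any number of definitions.
merge_definitions_sketch <- function(...) {
  defs <- list(...)
  fields <- c("icd10", "icd9", "self_report", "death_icd10")
  merged <- lapply(fields, function(f) {
    # Pool the field from every definition, then drop duplicates.
    unique(unlist(lapply(defs, function(d) d[[f]])))
  })
  names(merged) <- fields
  merged
}

# Two toy definitions with overlapping codes.
def_a <- list(icd10 = c("K70"), icd9 = "571",
              self_report = 1604, death_icd10 = "K70")
def_b <- list(icd10 = c("K74", "K70"), icd9 = "571",
              self_report = 1604, death_icd10 = "K74")

composite <- merge_definitions_sketch(def_a, def_b)
```

Deduplicating with unique() keeps a code shared by several definitions from appearing twice in the composite.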
4.5 Comparing data sources
To check how many cases each source contributes, use
compare_data_sources():
library(data.table)
library(UKBAnalytica)

# Load the participant-level extract.
ukb_data <- fread("population.csv")

# Restrict the comparison to the diseases of interest.
diseases <- get_predefined_diseases()[c("AA", "Hypertension")]
comparison <- compare_data_sources(ukb_data, diseases)
comparison

This returns a summary table showing case counts per source per disease, which is useful for justifying source selection in your manuscript.
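The idea behind per-source counts can be illustrated on toy data. The sketch below does not reproduce the output format of compare_data_sources(). It assumes a hypothetical table of per-participant flags, one logical column per source:

```r
# Toy per-participant flags: TRUE if that source recorded the disease.
flags <- data.frame(
  icd10       = c(TRUE,  TRUE,  FALSE, FALSE, TRUE),
  self_report = c(TRUE,  FALSE, TRUE,  FALSE, FALSE),
  death       = c(FALSE, FALSE, FALSE, FALSE, TRUE)
)

# Cases contributed by each source (a participant can count in several).
per_source <- colSums(flags)

# Cases found by any source, and cases found only by self-report.
any_source       <- sum(rowSums(flags) > 0)
self_report_only <- sum(flags$self_report & rowSums(flags) == 1)
```

The count of cases seen in only one source shows how many cases would be lost if that source were dropped.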