Chapter 4 Disease Definitions
UKBAnalytica identifies disease cases from up to four data sources:
| Source | Code field | Date field |
|---|---|---|
| ICD-10 (hospital) | p41270 | p41280_a* |
| ICD-9 (hospital) | p41271 | p41281_a* |
| Self-report | p20002_i*_a* | p20008_i*_a* |
| Death registry | p40001, p40002 | p40000 |
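
The `*` in field names such as p41280_a* stands for an index suffix, so a single field usually spans several columns. A minimal base-R sketch of selecting those columns by pattern follows; it assumes column names of the form `p41280_a0`, `p41280_a1`, and the toy data frame and its values are invented for illustration.

```r
# Toy data frame standing in for a UK Biobank extract (all values invented).
ukb_toy <- data.frame(
  eid       = 1:3,
  p41270    = c("I10", "E119", NA),
  p41280_a0 = as.Date(c("2010-05-01", "2012-03-15", NA)),
  p41280_a1 = as.Date(c("2014-11-20", NA, NA))
)

# Columns holding ICD-10 diagnosis dates: "p41280_a" followed by an array index.
icd10_date_cols <- grep("^p41280_a[0-9]+$", names(ukb_toy), value = TRUE)
icd10_date_cols
```

The same pattern approach would apply to the two-level self-report fields (instance and array indices), with the regular expression adjusted accordingly.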
4.1 Predefined diseases
The package ships a curated library of definitions for common cardiovascular
and metabolic conditions. Retrieve the full library with get_predefined_diseases(), or select a subset by name, for example get_predefined_diseases()[c("AA", "Hypertension")].
Each definition is a list specifying ICD-10 codes, ICD-9 codes, self-report codes, and death-cause codes.
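
Because the library is indexed by disease name, ordinary list subsetting should apply. The sketch below uses a hand-built stand-in list rather than the package's actual library; `toy_library`, its element names, and its codes are illustrative assumptions, not the shipped definitions.

```r
# Stand-in for the object returned by get_predefined_diseases().
# Codes are illustrative only and are NOT the package's curated lists.
toy_library <- list(
  Hypertension = list(
    icd10       = c("I10"),
    icd9        = c("401"),
    self_report = c(1065L),
    death_icd10 = c("I10")
  ),
  Diabetes = list(
    icd10       = c("E10", "E11"),
    icd9        = c("250"),
    self_report = c(1220L),
    death_icd10 = c("E10", "E11")
  )
)

names(toy_library)                            # available disease names
subset_defs <- toy_library["Hypertension"]    # single bracket keeps a named list
one_def     <- toy_library[["Hypertension"]]  # double bracket extracts one definition
str(one_def)
```

Single-bracket subsetting is the form to use when the result is passed on to functions that expect a named list of definitions.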
4.2 Inspecting a definition
A definition typically contains:
- icd10: character vector of ICD-10 codes (prefix-matched).
- icd9: character vector of ICD-9 codes.
- self_report: integer vector of self-report coding IDs.
- death_icd10: character vector of death-cause ICD-10 codes.
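
To make "prefix-matched" concrete, here is a minimal base-R sketch of one way such matching can work. The helper `matches_prefix()` is hypothetical and written for this illustration; the package's internal matching logic may differ.

```r
# Hypothetical helper: TRUE where a recorded code starts with any definition prefix.
matches_prefix <- function(codes, prefixes) {
  vapply(
    codes,
    function(code) !is.na(code) && any(startsWith(code, prefixes)),
    logical(1),
    USE.NAMES = FALSE
  )
}

recorded <- c("K703", "K74", "E119", "I10", NA)
is_case  <- matches_prefix(recorded, prefixes = c("K70", "K74"))
is_case
```

Under this reading, a definition entry of "K70" also captures more specific recorded codes such as "K703", so definitions can list three-character categories without enumerating every subcode.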
4.3 Creating custom definitions
Use create_disease_definition() to define a new disease:
# Custom liver disease definition built from codes in each source
my_disease <- create_disease_definition(
  icd10 = c("K70", "K71", "K72", "K73", "K74"),
  icd9 = c("571"),
  self_report = c(1604),
  death_icd10 = c("K70", "K74")
)

You can combine it with predefined diseases:
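
One way to do this, assuming definitions are plain lists and the predefined library is a named list, is to append the custom definition with `c()`. The objects below are stand-ins so the sketch runs without the package; the name `LiverDisease` and the hypertension codes are illustrative assumptions.

```r
# Stand-ins: in practice `predefined` would come from get_predefined_diseases()
# and `my_disease` from create_disease_definition().
predefined <- list(
  Hypertension = list(icd10 = "I10", icd9 = "401",
                      self_report = 1065L, death_icd10 = "I10")
)
my_disease <- list(
  icd10       = c("K70", "K71", "K72", "K73", "K74"),
  icd9        = c("571"),
  self_report = c(1604),
  death_icd10 = c("K70", "K74")
)

# Append the custom definition under a name of your choosing.
all_diseases <- c(predefined, list(LiverDisease = my_disease))
names(all_diseases)
```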
4.4 Composite endpoints
Create composite endpoints by merging multiple definitions:
This merges ICD-10, ICD-9, self-report, and death codes from all input definitions into a single combined definition.
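
The merge described above amounts to a field-by-field union of codes. A minimal base-R sketch of that idea follows; `merge_definitions()` is a hypothetical helper written for illustration, not the package's own function, and the example codes are illustrative.

```r
# Hypothetical helper (not the package's function): union each field across definitions.
merge_definitions <- function(...) {
  defs   <- list(...)
  fields <- c("icd10", "icd9", "self_report", "death_icd10")
  merged <- lapply(fields, function(f) {
    unique(unlist(lapply(defs, `[[`, f)))
  })
  setNames(merged, fields)
}

# Two illustrative definitions merged into one composite endpoint.
mi     <- list(icd10 = c("I21", "I22"), icd9 = "410",
               self_report = 1075L, death_icd10 = c("I21", "I22"))
stroke <- list(icd10 = c("I63", "I64"), icd9 = "434",
               self_report = 1081L, death_icd10 = c("I63", "I64"))

mace <- merge_definitions(mi, stroke)
str(mace)
```

Using `unique()` keeps a code from appearing twice when the component definitions overlap.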
4.5 Comparing data sources
To check how many cases each source contributes, use
compare_data_sources():
library(data.table)

# Load the participant-level extract
ukb_data <- fread("population.csv")

# Restrict the comparison to two predefined diseases
diseases <- get_predefined_diseases()[c("AA", "Hypertension")]

comparison <- compare_data_sources(ukb_data, diseases)
comparison

This returns a summary table showing case counts per source per disease, which is useful for justifying source selection in your manuscript.
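
To illustrate the kind of per-source tally such a comparison rests on, here is a base-R sketch on invented data. It does not reproduce the output format of compare_data_sources(); the `flags` data frame and the derived quantities are assumptions made for this example.

```r
# Toy per-participant flags: TRUE if that source records the disease (invented data).
flags <- data.frame(
  icd10       = c(TRUE,  TRUE,  FALSE, FALSE, TRUE),
  icd9        = c(FALSE, FALSE, FALSE, FALSE, FALSE),
  self_report = c(TRUE,  FALSE, TRUE,  FALSE, FALSE),
  death       = c(FALSE, FALSE, FALSE, FALSE, TRUE)
)

per_source <- colSums(flags)           # cases recorded by each source
any_source <- sum(rowSums(flags) > 0)  # cases recorded by at least one source

# Cases that only a single source records, counted per source.
unique_to <- vapply(names(flags), function(s) {
  sum(flags[[s]] & rowSums(flags) == 1)
}, integer(1))

per_source
any_source
unique_to
```

A source whose unique contribution is zero adds no cases beyond the others, which is the sort of evidence that can support dropping or keeping it.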