Disease Definitions
UKBAnalytica identifies disease cases from up to eight data sources:
| Source | Code field | Date field |
|---|---|---|
| ICD-10 (hospital) | p41270 | p41280_a* |
| ICD-9 (hospital) | p41271 | p41281_a* |
| OPCS-4 procedures (hospital) | p41272 | p41282_a* |
| Self-report | p20002_i*_a* | p20008_i*_a* |
| Death registry | p40001, p40002 | p40000 |
| Cancer registry | p40006_i*, optional p40011_i*, p40012_i* | p40005_i* |
| First Occurrence | p{source_field} or p{source_field}_i0 | p{date_field} or p{date_field}_i0 |
| Algorithm (Category 42, optional) | p{algo_source_field} or p{algo_source_field}_i0 | p{algo_date_field} or p{algo_date_field}_i0 |
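The `*` in names such as p41280_a* or p20002_i*_a* stands for an instance or array index. The following is a minimal sketch of how such columns could be located in a dataset, assuming the indices are numeric suffixes. The helper find_field_columns() and the toy column names are hypothetical and are not part of UKBAnalytica.

```r
# Hypothetical helper (not a UKBAnalytica function): find columns for a UKB
# field ID, allowing optional instance (_i*) and array (_a*) suffixes.
find_field_columns <- function(col_names, field_id) {
  pattern <- paste0("^p", as.integer(field_id), "(_i[0-9]+)?(_a[0-9]+)?$")
  grep(pattern, col_names, value = TRUE)
}

# Toy column names, assumed for illustration only.
cols <- c("eid", "p41270", "p41280_a0", "p41280_a1",
          "p20002_i0_a0", "p20002_i1_a0", "p412800")

date_cols <- find_field_columns(cols, 41280)  # hospital ICD-10 date columns
sr_cols   <- find_field_columns(cols, 20002)  # self-report code columns
```

The anchors in the pattern keep a column such as p412800 from being mistaken for field 41280.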
Predefined diseases
The package now provides two disease-definition layers:
- Curated definitions: 86 manually maintained UKBAnalytica definitions for commonly used endpoints, including literature-checked subtype and composite endpoints.
- Pomegranate-derived definitions: 313 definitions converted from the public Pomegranate GitHub YAML algorithms and stored in the built-in disease catalog. The large Pomegranate portal long-form audit table is not required for endpoint construction and is not shipped in the package build.
The default behavior is unchanged: get_predefined_diseases() returns the curated definitions.
library(UKBAnalytica)
diseases <- get_predefined_diseases()
> names(diseases) |> head()
[1] "AA" "TAA" "AAA" "CVD" "MI" "STEMI"
Select a subset by name:
my_diseases <- get_predefined_diseases()[
c("AA", "Arrhythmia", "Atrial_Fibrillation")
]
To inspect the raw disease-code catalog before defining an endpoint, use get_disease_catalog():
get_disease_catalog(disease = "COPD")
get_disease_catalog(disease = "asthma", source = "pomegranate")
get_disease_catalog(disease = "diabetes", code_system = "ICD-10")
get_pomegranate_source_manifest()
The catalog is a code-level table with columns such as definition_id, disease_name, source, code_system, field_id, code, code_label, match_rule, and validation_status. It is useful for checking exactly which ICD-10, self-report, OPCS4, cancer-registry, or other source codes support a target disease.
By default, source = "pomegranate" uses the GitHub YAML catalog pinned in the source manifest. If users keep a local portal audit CSV, they can load it explicitly with load_pomegranate_portal_coding(path = ...); this optional audit table is not needed for ordinary endpoint construction.
Choosing a definition source
get_predefined_diseases() can return different definition sets:
# 86 manually curated definitions
curated <- get_predefined_diseases(source = "curated")
# 313 Pomegranate-derived definitions that can be converted to UKBAnalytica
# disease-definition objects using currently supported parsers
pomegranate <- get_predefined_diseases(source = "pomegranate")
# 56 diseases matched between curated and Pomegranate; code-level intersection
both_intersection <- get_predefined_diseases(
source = "both",
merge_type = "intersection"
)
# 331 diseases in the union library:
# matched diseases use curated names with unioned codes;
# unmatched curated and unmatched Pomegranate definitions are retained
both_union <- get_predefined_diseases(
source = "both",
merge_type = "union"
)
The source = "both" mode standardizes matched disease names to the curated UKBAnalytica keys. For example, COPD remains COPD, asthma remains Asthma, and type 2 diabetes remains T2DM. This keeps the output compatible with build_survival_dataset().
The merge strategy controls the code set for matched diseases:
| merge_type | Disease-level behavior | Code-level behavior | Typical use |
|---|---|---|---|
| "intersection" | Keep diseases matched between curated and Pomegranate | Keep overlapping codes only | Conservative sensitivity analysis |
| "union" | Keep matched diseases plus unmatched diseases from both sources | Combine curated and Pomegranate codes for matched diseases | Expanded endpoint discovery |
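The code-level difference between the two merge types can be illustrated with plain set operations. This is a sketch on made-up code sets for a single matched disease; it does not reproduce the package's internal merge logic or the actual catalog contents.

```r
# Toy code sets for one matched disease. The codes are illustrative only
# and are not taken from the curated or Pomegranate catalogs.
curated_codes     <- c("J41", "J42", "J43", "J44")
pomegranate_codes <- c("J43", "J44", "J47")

# "intersection": keep only codes present in both sources.
intersection_codes <- intersect(curated_codes, pomegranate_codes)

# "union": keep every code present in either source.
union_codes <- union(curated_codes, pomegranate_codes)
```

In this toy case the intersection keeps two codes and the union keeps five, which is why the intersection suits conservative sensitivity analyses and the union suits broader endpoint discovery.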
You can also filter during retrieval:
copd_expanded <- get_predefined_diseases(
source = "both",
merge_type = "union",
disease = "COPD"
)
Each definition is a list specifying ICD-10 codes, ICD-9 codes, optional OPCS-4 procedure codes, self-report codes, death-cause codes, optional First Occurrence fields, optional cancer registry filters, and optional algorithm field IDs.
Inspecting a definition
diseases <- get_predefined_diseases()
> str(diseases[["COPD"]])
List of 13
$ name : chr "Chronic Obstructive Pulmonary Disease"
$ icd10_pattern : chr "^(J40|J41|J42|J43|J44)"
$ icd9_pattern : chr "^(491|492|4932|496)"
$ sr_codes : num [1:3] 1112 1113 1472
$ death_icd10 : chr "^(J40|J41|J42|J43|J44)"
$ opcs4_pattern : NULL
$ first_occurrence_fields : int [1:5] 131484 131486 131488 131490 131492
$ first_occurrence_source_fields: NULL
$ cancer_icd10_pattern : NULL
$ cancer_histology : NULL
$ cancer_behaviour : NULL
$ algo_date_field : num 42016
$ algo_source_field : num 42017
A definition typically contains:
- icd10_pattern – regular expression for ICD-10 codes.
- icd9_pattern – regular expression for ICD-9 codes.
- opcs4_pattern – optional regular expression for OPCS4 operative procedure codes.
- sr_codes – integer vector of self-report coding IDs.
- death_icd10 – optional regular expression for death-cause ICD-10 codes (if omitted, icd10_pattern is used).
- first_occurrence_fields – optional UKB Category 1712 date field IDs for 3-character ICD-10 First Occurrence outcomes.
- first_occurrence_source_fields – optional source field IDs. If omitted, the package uses first_occurrence_fields + 1.
- cancer_icd10_pattern – optional regular expression for cancer registry ICD-10 field p40006_i*.
- cancer_histology – optional ICD-O-3 histology code filter from p40011_i*.
- cancer_behaviour – optional tumour behaviour code filter from p40012_i*; use 3L for malignant tumours.
- algo_date_field – integer UKB field ID for algorithm-defined diagnosis date (optional).
- algo_source_field – integer UKB field ID for algorithm-defined diagnosis source (optional).
Creating custom definitions
Use create_disease_definition() to define a new disease. Surgical or procedure-defined phenotypes can add opcs4_pattern:
arrhythmia_def <- create_disease_definition(
name = "Cardiac Arrhythmia",
icd10_pattern = "^(I44|I45|I46|I47|I48|I49)",
opcs4_pattern = "^(K576|K59|K60|K61|K62|K641|K72|K73|K74)"
)
Code matching rules
ICD-10, ICD-9, OPCS4, death-cause, and cancer-registry code fields are matched with regular expressions. Internally, UKBAnalytica parses each code into a long-format table and keeps rows where grepl(pattern, code, perl = TRUE) is true.
UKB hospital ICD and OPCS codes are usually stored without decimal points. For example, ICD-10 I47.2 should be written as I472, and OPCS4 K57.6 should be written as K576.
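The long-format matching step and the decimal-point convention can be sketched in base R. This is not the package's parser: the toy data, the column names, and the "|" delimiter between codes are assumptions made for illustration.

```r
# Toy input: one row per participant, codes stored as a delimited string.
# The "|" delimiter and the column names are assumptions for this sketch.
hes <- data.frame(
  eid = c(1, 2, 3),
  icd10 = c("I47.2|J44.0", "I10", "I490|E11.9"),
  stringsAsFactors = FALSE
)

# Split each string into individual codes (long format, one code per row).
code_list <- strsplit(hes$icd10, "|", fixed = TRUE)
long <- data.frame(
  eid = rep(hes$eid, lengths(code_list)),
  code = unlist(code_list),
  stringsAsFactors = FALSE
)

# Remove decimal points so that I47.2 becomes I472.
long$code <- gsub(".", "", long$code, fixed = TRUE)

# Keep rows whose code matches the pattern, as described above.
pattern <- "^(I470|I472|I490|I493)$"
matched <- long[grepl(pattern, long$code, perl = TRUE), ]
```

Normalizing the codes before matching means a single pattern written without decimal points covers both storage styles.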
Use different regular-expression styles depending on how broad the endpoint should be:
| Goal | Example pattern | What it matches |
|---|---|---|
| Broad 3-character ICD block | "^I47" | I47, I470, I471, I472, … |
| Specific 4-character ICD codes only | "^(I470\|I472)$" | only I470 or I472 |
| Several exact ICD codes | "^(I470\|I472\|I490\|I493)$" | only the listed ICD codes |
| Broad OPCS4 family | "^K62" | all K62* procedure codes |
| Specific OPCS4 procedure codes only | "^(K576\|K641)$" | only K576 or K641 |
The start anchor ^ avoids matching a code fragment in the middle of a string. The end anchor $ avoids overmatching child codes. For example, ^I49 is appropriate for a broad I49 arrhythmia endpoint, but ^I493$ is safer when the intended endpoint is premature depolarization only.
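The effect of each anchor is easy to check directly with grepl(). In this sketch the test strings are made up to expose each case; "I4930" and "Z8I493" are not real ICD-10 codes.

```r
# Made-up strings chosen to show what each anchor does.
codes <- c("I49", "I490", "I493", "I4930", "Z8I493")

# Start anchor only: every string beginning with I49.
broad <- grepl("^I49", codes, perl = TRUE)

# Start and end anchors: exactly I493, no child codes.
exact <- grepl("^I493$", codes, perl = TRUE)

# No anchors: also matches I493 embedded in a longer string.
unanchored <- grepl("I493", codes, perl = TRUE)

data.frame(codes, broad, exact, unanchored)
```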
For ventricular arrhythmia, a narrow ICD-10 + OPCS4 definition can be written as:
ventricular_arrhythmia_def <- create_disease_definition(
name = "Ventricular Arrhythmia",
icd10_pattern = "^(I470|I472|I490|I493)$",
opcs4_pattern = "^(K576|K641)$"
)
This definition matches the requested ICD-10 codes I47.0, I47.2, I49.0, and I49.3, plus OPCS4 procedure codes K57.6 and K64.1. It does not match other I47*, I49*, K57*, or K64* codes.
For diseases with UKB algorithm-defined outcomes (Category 42), you can also add algorithm date/source fields:
copd_def <- create_disease_definition(
name = "COPD",
icd10_pattern = "^(J41|J42|J43|J44)",
icd9_pattern = "^(490|491|492|496)",
sr_codes = c(1112),
death_icd10 = "^(J41|J42|J43|J44)",
algo_date_field = 42016,
algo_source_field = 42017
)
For diseases with UKB First Occurrence fields, add the date field IDs directly. The corresponding source fields are inferred automatically because UKB stores them as the next field ID:
mi_def <- create_disease_definition(
name = "Myocardial Infarction",
icd10_pattern = "^(I21|I22)",
first_occurrence_fields = c(131298, 131300)
)
First Occurrence fields are 3-character ICD-10 outcomes. For example, 131298 is “Date I21 first reported”; 131299 is its source field. Because of this 3-character resolution, they work well for broad ICD-10 outcomes, but they should not be used to distinguish 4-character subtypes such as I714 versus I713.
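The date-to-source field relationship can be sketched as follows. The + 1 rule comes from the description above; the toy data and the choice to take the earliest date across fields are assumptions for illustration and may differ from what the package does internally.

```r
first_occurrence_fields <- c(131298L, 131300L)

# Source field = date field + 1, as described above.
first_occurrence_source_fields <- first_occurrence_fields + 1L

date_cols   <- paste0("p", first_occurrence_fields)
source_cols <- paste0("p", first_occurrence_source_fields)

# Toy data with one date column per First Occurrence field.
fo <- data.frame(
  eid = 1:3,
  p131298 = as.Date(c("2010-05-01", NA, "2015-03-10")),
  p131300 = as.Date(c("2012-01-15", NA, "2014-07-04"))
)

# Assumption: the endpoint date is the earliest non-missing date per row.
fo$first_date <- do.call(
  pmin,
  c(unname(as.list(fo[date_cols])), na.rm = TRUE)
)
```

A participant with no date in any field keeps a missing first_date, which corresponds to having no First Occurrence record for the endpoint.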
For cancer endpoints, use the cancer registry source. UKB stores cancer type as ICD-10 in field 40006, diagnosis date in 40005, tumour histology in 40011, and tumour behaviour in 40012:
lung_cancer_def <- create_disease_definition(
name = "Lung Cancer",
icd10_pattern = "^C34",
death_icd10 = "^C34",
cancer_icd10_pattern = "^C34",
cancer_behaviour = 3L
)
Deprecated aliases icd10, icd9, and self_report are still accepted for backward compatibility.
If opcs4_pattern is not supplied, operative procedure data are ignored even if OPCS4 is included in sources.
When parsing algorithm and First Occurrence fields, both column naming styles are supported: p{field}_i0 and p{field}.
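A lookup that accepts both naming styles might look like the sketch below. resolve_field_column() is a hypothetical helper, not a UKBAnalytica function, and trying p{field}_i0 before p{field} is an assumed order of preference.

```r
# Hypothetical helper: return the first existing column name for a field,
# trying p{field}_i0 and then p{field}; NA if neither is present.
resolve_field_column <- function(col_names, field_id) {
  id <- as.integer(field_id)
  candidates <- c(paste0("p", id, "_i0"), paste0("p", id))
  found <- candidates[candidates %in% col_names]
  if (length(found) == 0L) NA_character_ else found[[1L]]
}

# Toy column names, assumed for illustration only.
cols <- c("eid", "p42016", "p42017_i0")

algo_date_col   <- resolve_field_column(cols, 42016)
algo_source_col <- resolve_field_column(cols, 42017)
missing_col     <- resolve_field_column(cols, 42018)
```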
Arrhythmia example with ICD-10 + OPCS4
The predefined library now includes broad and subtype-specific arrhythmia endpoints such as Arrhythmia, Atrial_Fibrillation, Ventricular_Arrhythmia, AV_Block, Intraventricular_Block, and SVT.
rhythm_defs <- get_predefined_diseases()[
c("Arrhythmia", "Atrial_Fibrillation", "Ventricular_Arrhythmia")
]
arrhythmia_dt <- build_survival_dataset(
dt = ukb_data,
disease_definitions = rhythm_defs,
prevalent_sources = c("ICD10", "OPCS4"),
outcome_sources = c("ICD10", "OPCS4"),
primary_disease = "Arrhythmia"
)
You can combine a custom definition such as arrhythmia_def with predefined diseases:
all_diseases <- c(
get_predefined_diseases()[c("Hypertension", "Arrhythmia")],
list(CustomArrhythmia = arrhythmia_def)
)
Composite endpoints
Create composite endpoints by merging multiple definitions:
mace <- combine_disease_definitions(
get_predefined_diseases()[c("MI", "Stroke", "HF")]
)
This merges ICD-10, ICD-9, OPCS4, cancer registry, First Occurrence, self-report, and death codes from all input definitions into a single combined definition.
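One plausible way to merge pattern and code fields is sketched below: regular expressions are joined with alternation and self-report codes are pooled. This is an illustration on toy definitions with made-up values, not the actual implementation of combine_disease_definitions().

```r
# Toy definitions holding only the fields needed for this sketch.
# The patterns and codes are illustrative, not the predefined values.
defs <- list(
  MI     = list(icd10_pattern = "^(I21|I22)", sr_codes = c(1075)),
  Stroke = list(icd10_pattern = "^(I60|I61|I63|I64)", sr_codes = c(1081))
)

# Join the regular expressions with "|" so a code matching any input
# definition matches the composite.
patterns <- vapply(defs, `[[`, character(1), "icd10_pattern")
combined_icd10 <- paste0("(", patterns, ")", collapse = "|")

# Pool the self-report codes and drop duplicates.
combined_sr <- unique(unlist(lapply(defs, `[[`, "sr_codes")))

hits <- grepl(combined_icd10, c("I219", "I639", "I10"), perl = TRUE)
```

Wrapping each input pattern in parentheses before joining keeps its anchors attached to its own alternatives.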
Comparing data sources
To check how many cases each source contributes, use compare_data_sources():
library(data.table)
ukb_data <- fread("population.csv")
diseases <- get_predefined_diseases()[c("AA", "Hypertension")]
comparison <- compare_data_sources(ukb_data, diseases)
comparison
This returns a summary table showing case counts per source per disease, which is useful for justifying source selection in your manuscript.
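The idea behind a per-source summary can be sketched on a toy case table. The layout below (one row per participant and source) is an assumption for illustration; the actual output format of compare_data_sources() may differ.

```r
# Toy case table for one disease: one row per participant and source.
cases <- data.frame(
  eid = c(1, 1, 2, 3, 3, 4),
  source = c("ICD10", "Death", "ICD10", "Self-report", "ICD10", "Self-report"),
  stringsAsFactors = FALSE
)

# Count each participant once per source.
cases <- unique(cases)
per_source <- table(cases$source)

# Participants identified by at least one source.
n_any <- length(unique(cases$eid))
```

Comparing per-source counts with the number of participants found by any source shows how much each source adds beyond the others.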