Disease Definitions

UKBAnalytica identifies disease cases from up to eight data sources:

Source | Code field | Date field
ICD-10 (hospital) | p41270 | p41280_a*
ICD-9 (hospital) | p41271 | p41281_a*
OPCS-4 procedures (hospital) | p41272 | p41282_a*
Self-report | p20002_i*_a* | p20008_i*_a*
Death registry | p40001, p40002 | p40000
Cancer registry | p40006_i*, optional p40011_i*, p40012_i* | p40005_i*
First Occurrence | p{source_field} or p{source_field}_i0 | p{date_field} or p{date_field}_i0
Algorithm (Category 42, optional) | p{algo_source_field} or p{algo_source_field}_i0 | p{algo_date_field} or p{algo_date_field}_i0

Predefined diseases

The package now provides two disease-definition layers:

  • Curated definitions: 86 manually maintained UKBAnalytica definitions for commonly used endpoints, including literature-checked subtype and composite endpoints.
  • Pomegranate-derived definitions: 313 definitions converted from the public Pomegranate GitHub YAML algorithms and stored in the built-in disease catalog. The large Pomegranate portal long-form audit table is not required for endpoint construction and is not shipped in the package build.

The default behavior is unchanged: get_predefined_diseases() returns the curated definitions.

library(UKBAnalytica)

diseases <- get_predefined_diseases()

> names(diseases) |> head()
[1] "AA"    "TAA"   "AAA"   "CVD"   "MI"    "STEMI"

Select a subset by name:

my_diseases <- get_predefined_diseases()[
  c("AA", "Arrhythmia", "Atrial_Fibrillation")
]

To inspect the raw disease-code catalog before defining an endpoint, use get_disease_catalog():

get_disease_catalog(disease = "COPD")
get_disease_catalog(disease = "asthma", source = "pomegranate")
get_disease_catalog(disease = "diabetes", code_system = "ICD-10")
get_pomegranate_source_manifest()

The catalog is a code-level table with columns such as definition_id, disease_name, source, code_system, field_id, code, code_label, match_rule, and validation_status. It is useful for checking exactly which ICD-10, self-report, OPCS4, cancer-registry, or other source codes support a target disease. By default, source = "pomegranate" uses the GitHub YAML catalog pinned in the source manifest. If users keep a local portal audit CSV, they can load it explicitly with load_pomegranate_portal_coding(path = ...); this optional audit table is not needed for ordinary endpoint construction.

Choosing a definition source

get_predefined_diseases() can return different definition sets:

# 86 manually curated definitions
curated <- get_predefined_diseases(source = "curated")

# 313 Pomegranate-derived definitions that can be converted to UKBAnalytica
# disease-definition objects using currently supported parsers
pomegranate <- get_predefined_diseases(source = "pomegranate")

# 56 diseases matched between curated and Pomegranate; code-level intersection
both_intersection <- get_predefined_diseases(
  source = "both",
  merge_type = "intersection"
)

# 331 diseases in the union library:
# matched diseases use curated names with unioned codes;
# unmatched curated and unmatched Pomegranate definitions are retained
both_union <- get_predefined_diseases(
  source = "both",
  merge_type = "union"
)
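
The code-level difference between the two merge types requested above can be pictured with plain set operations. The sketch below uses base R only and two made-up code vectors (curated_codes and pomegranate_codes are hypothetical and not taken from either library); it illustrates the idea and is not the package's internal merge logic.

```r
# Hypothetical code sets for one matched disease (illustrative values only)
curated_codes     <- c("J41", "J42", "J43", "J44")
pomegranate_codes <- c("J43", "J44", "J47")

# "intersection": keep only codes present in both sources
codes_intersection <- intersect(curated_codes, pomegranate_codes)

# "union": keep every code from either source, without duplicates
codes_union <- union(curated_codes, pomegranate_codes)

codes_intersection
codes_union
```

The intersection can only shrink a code set, which is why it suits conservative sensitivity analyses, while the union can only grow it.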

The source = "both" mode standardizes matched disease names to the curated UKBAnalytica keys. For example, COPD stays COPD, asthma becomes Asthma, and type 2 diabetes becomes T2DM. This keeps the output compatible with build_survival_dataset().

The merge strategy controls the code set for matched diseases:

merge_type | Disease-level behavior | Code-level behavior | Typical use
"intersection" | Keep diseases matched between curated and Pomegranate | Keep overlapping codes only | Conservative sensitivity analysis
"union" | Keep matched diseases plus unmatched diseases from both sources | Combine curated and Pomegranate codes for matched diseases | Expanded endpoint discovery

You can also filter during retrieval:

copd_expanded <- get_predefined_diseases(
  source = "both",
  merge_type = "union",
  disease = "COPD"
)

Each definition is a list specifying ICD-10 codes, ICD-9 codes, optional OPCS-4 procedure codes, self-report codes, death-cause codes, optional First Occurrence fields, optional cancer registry filters, and optional algorithm field IDs.

Inspecting a definition

diseases <- get_predefined_diseases()

> str(diseases[["COPD"]])
List of 13
 $ name                          : chr "Chronic Obstructive Pulmonary Disease"
 $ icd10_pattern                 : chr "^(J40|J41|J42|J43|J44)"
 $ icd9_pattern                  : chr "^(491|492|4932|496)"
 $ sr_codes                      : num [1:3] 1112 1113 1472
 $ death_icd10                   : chr "^(J40|J41|J42|J43|J44)"
 $ opcs4_pattern                 : NULL
 $ first_occurrence_fields       : int [1:5] 131484 131486 131488 131490 131492
 $ first_occurrence_source_fields: NULL
 $ cancer_icd10_pattern          : NULL
 $ cancer_histology              : NULL
 $ cancer_behaviour              : NULL
 $ algo_date_field               : num 42016
 $ algo_source_field             : num 42017

A definition typically contains:

  • icd10_pattern – regular expression for ICD-10 codes.
  • icd9_pattern – regular expression for ICD-9 codes.
  • opcs4_pattern – optional regular expression for OPCS4 operative procedure codes.
  • sr_codes – integer vector of self-report coding IDs.
  • death_icd10 – optional regular expression for death-cause ICD-10 codes (if omitted, icd10_pattern is used).
  • first_occurrence_fields – optional UKB Category 1712 date field IDs for 3-character ICD-10 First Occurrence outcomes.
  • first_occurrence_source_fields – optional source field IDs. If omitted, the package uses first_occurrence_fields + 1.
  • cancer_icd10_pattern – optional regular expression for cancer registry ICD-10 field p40006_i*.
  • cancer_histology – optional ICD-O-3 histology code filter from p40011_i*.
  • cancer_behaviour – optional tumour behaviour code filter from p40012_i*; use 3L for malignant tumours.
  • algo_date_field – integer UKB field ID for algorithm-defined diagnosis date (optional).
  • algo_source_field – integer UKB field ID for algorithm-defined diagnosis source (optional).
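
The two fallback rules in the list above (death_icd10 defaulting to icd10_pattern, and source fields defaulting to the date field ID plus one) can be sketched on a plain list. This is a minimal illustration that assumes omitted elements are represented as NULL; the variable names death_pattern and fo_source_fields are hypothetical, and the package may resolve these defaults differently internally.

```r
# A stripped-down definition as a plain list. Values are copied from the COPD
# example above; death_icd10 is left NULL here to show the fallback.
def <- list(
  icd10_pattern                  = "^(J40|J41|J42|J43|J44)",
  death_icd10                    = NULL,
  first_occurrence_fields        = c(131484L, 131486L, 131488L, 131490L, 131492L),
  first_occurrence_source_fields = NULL
)

# The death-cause pattern falls back to the hospital ICD-10 pattern
death_pattern <- if (is.null(def$death_icd10)) {
  def$icd10_pattern
} else {
  def$death_icd10
}

# Source fields default to "date field ID + 1"
fo_source_fields <- if (is.null(def$first_occurrence_source_fields)) {
  def$first_occurrence_fields + 1L
} else {
  def$first_occurrence_source_fields
}

death_pattern
fo_source_fields
```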

Creating custom definitions

Use create_disease_definition() to define a new disease. Surgical or procedure-defined phenotypes can add opcs4_pattern:

arrhythmia_def <- create_disease_definition(
  name = "Cardiac Arrhythmia",
  icd10_pattern = "^(I44|I45|I46|I47|I48|I49)",
  opcs4_pattern = "^(K576|K59|K60|K61|K62|K641|K72|K73|K74)"
)

Code matching rules

ICD-10, ICD-9, OPCS4, death-cause, and cancer-registry code fields are matched with regular expressions. Internally, UKBAnalytica parses each code into a long-format table and keeps rows where grepl(pattern, code, perl = TRUE) is true.
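
The matching step described above can be reproduced on a toy table. The sketch below uses a made-up long-format data frame (codes_long and its column names are assumptions for illustration, not the package's internal structure) and the same grepl() call.

```r
# Toy long-format table: one row per participant-code pair (made-up data)
codes_long <- data.frame(
  eid  = c(1, 1, 2, 3, 3),
  code = c("I472", "E119", "J449", "I48", "I251"),
  stringsAsFactors = FALSE
)

pattern <- "^(I44|I45|I46|I47|I48|I49)"

# Keep only rows whose code matches the definition's regular expression
matched <- codes_long[grepl(pattern, codes_long$code, perl = TRUE), ]
matched
```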

UKB hospital ICD and OPCS codes are usually stored without decimal points. For example, ICD-10 I47.2 should be written as I472, and OPCS4 K57.6 should be written as K576.
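
When codes are copied from a dotted source such as a published code list, they can be converted before being placed in a pattern. The helper below, strip_decimal(), is hypothetical and not part of UKBAnalytica; it is a small base R sketch.

```r
# Hypothetical helper: upper-case the codes and drop the decimal point so they
# follow the dot-free style used in UKB hospital fields
strip_decimal <- function(codes) {
  gsub(".", "", toupper(codes), fixed = TRUE)
}

cleaned <- strip_decimal(c("I47.2", "K57.6", "i49.3"))
cleaned
```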

Use different regular-expression styles depending on how broad the endpoint should be:

Goal | Example pattern | What it matches
Broad 3-character ICD block | "^I47" | I47, I470, I471, I472, …
Specific 4-character ICD codes only | "^(I470|I472)$" | only I470 or I472
Several exact ICD codes | "^(I470|I472|I490|I493)$" | only the listed ICD codes
Broad OPCS4 family | "^K62" | all K62* procedure codes
Specific OPCS4 procedure codes only | "^(K576|K641)$" | only K576 or K641

The start anchor ^ avoids matching a code fragment in the middle of a string. The end anchor $ avoids overmatching child codes. For example, ^I49 is appropriate for a broad I49 arrhythmia endpoint, but ^I493$ is safer when the intended endpoint is ventricular premature depolarization (I49.3) only.
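
The effect of the two anchors can be checked directly with grepl() on a small vector of dot-free codes. The vector below is made up for illustration.

```r
codes <- c("I47", "I470", "I472", "I479", "I493", "I499")

# Start anchor only: every code in the I47 block matches
block_i47 <- codes[grepl("^I47", codes, perl = TRUE)]

# Start and end anchors: only the listed 4-character codes match
exact_i47 <- codes[grepl("^(I470|I472)$", codes, perl = TRUE)]

# Broad I49 block versus the single code I493
broad_i49 <- codes[grepl("^I49", codes, perl = TRUE)]
only_i493 <- codes[grepl("^I493$", codes, perl = TRUE)]

block_i47
exact_i47
broad_i49
only_i493
```

Running a candidate pattern against a short list of codes that should and should not match is a quick way to catch overmatching before building a dataset.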

For ventricular arrhythmia, a narrow ICD-10 + OPCS4 definition can be written as:

ventricular_arrhythmia_def <- create_disease_definition(
  name = "Ventricular Arrhythmia",
  icd10_pattern = "^(I470|I472|I490|I493)$",
  opcs4_pattern = "^(K576|K641)$"
)

This definition matches only the ICD-10 codes I47.0, I47.2, I49.0, and I49.3 and the OPCS4 procedure codes K57.6 and K64.1. It does not match other I47*, I49*, K57*, or K64* codes.

For diseases with UKB algorithm-defined outcomes (Category 42), you can also add algorithm date/source fields:

copd_def <- create_disease_definition(
  name = "COPD",
  icd10_pattern = "^(J41|J42|J43|J44)",
  icd9_pattern = "^(490|491|492|496)",
  sr_codes = c(1112),
  death_icd10 = "^(J41|J42|J43|J44)",
  algo_date_field = 42016,
  algo_source_field = 42017
)

For diseases with UKB First Occurrence fields, add the date field IDs directly. The corresponding source fields are inferred automatically because UKB stores them as the next field ID:

mi_def <- create_disease_definition(
  name = "Myocardial Infarction",
  icd10_pattern = "^(I21|I22)",
  first_occurrence_fields = c(131298, 131300)
)

First Occurrence fields are defined at the 3-character ICD-10 level. For example, 131298 is “Date I21 first reported” and 131299 is its source field. They therefore work well for broad ICD-10 outcomes but cannot distinguish 4-character subtypes such as I714 versus I713.
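
The loss of subtype resolution is easy to see by truncating 4-character codes to their 3-character parent, which is the level at which First Occurrence outcomes are recorded. This is a base R illustration only and does not touch any UKB field.

```r
# Two 4-character subtypes that share a 3-character parent
subtypes <- c("I713", "I714")

# At 3-character resolution both collapse to the same category
parents <- substr(subtypes, 1, 3)
unique(parents)
```

For subtype-specific endpoints, the hospital ICD-10 source with a 4-character pattern remains the appropriate choice.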

For cancer endpoints, use the cancer registry source. UKB stores cancer type as ICD-10 in field 40006, diagnosis date in 40005, tumour histology in 40011, and tumour behaviour in 40012:

lung_cancer_def <- create_disease_definition(
  name = "Lung Cancer",
  icd10_pattern = "^C34",
  death_icd10 = "^C34",
  cancer_icd10_pattern = "^C34",
  cancer_behaviour = 3L
)
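
The combined effect of cancer_icd10_pattern and cancer_behaviour can be pictured as a two-condition row filter. The sketch below uses a made-up long-format table (cancer_long and its column names are assumptions for illustration, not the package's internal representation).

```r
# Toy cancer-registry records (made-up data)
cancer_long <- data.frame(
  eid       = c(1, 2, 3, 4),
  icd10     = c("C341", "C349", "D022", "C509"),
  behaviour = c(3L, 6L, 2L, 3L),
  stringsAsFactors = FALSE
)

# Keep lung sites (C34*) whose behaviour code is 3
lung_cases <- cancer_long[
  grepl("^C34", cancer_long$icd10, perl = TRUE) & cancer_long$behaviour == 3L,
]
lung_cases$eid
```

Participant 2 has a C34 code but a different behaviour code, and participant 4 has behaviour code 3 but a non-lung site, so neither is retained.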

Deprecated aliases icd10, icd9, and self_report are still accepted for backward compatibility.

If opcs4_pattern is not supplied, operative procedure data are ignored for that disease even if "OPCS4" is listed in prevalent_sources or outcome_sources.

When parsing algorithm and First Occurrence fields, both column naming styles are supported: p{field}_i0 and p{field}.
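
Supporting both naming styles amounts to trying each candidate column name in turn. The helper below, resolve_field_column(), is hypothetical and not part of UKBAnalytica, and the order of preference (instance-suffixed name first) is an assumption made for this sketch.

```r
# Hypothetical helper: return the first column name that exists for a field ID,
# or NA if neither naming style is present
resolve_field_column <- function(field_id, available) {
  id <- as.integer(field_id)  # avoids scientific notation in the pasted name
  candidates <- c(paste0("p", id, "_i0"), paste0("p", id))
  hit <- candidates[candidates %in% available]
  if (length(hit) == 0L) NA_character_ else hit[[1L]]
}

cols <- c("eid", "p42016", "p131298_i0")

col_algo    <- resolve_field_column(42016, cols)
col_fo      <- resolve_field_column(131298, cols)
col_missing <- resolve_field_column(42017, cols)

col_algo
col_fo
col_missing
```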

Arrhythmia example with ICD-10 + OPCS4

The predefined library now includes broad and subtype-specific arrhythmia endpoints such as Arrhythmia, Atrial_Fibrillation, Ventricular_Arrhythmia, AV_Block, Intraventricular_Block, and SVT.

rhythm_defs <- get_predefined_diseases()[
  c("Arrhythmia", "Atrial_Fibrillation", "Ventricular_Arrhythmia")
]

arrhythmia_dt <- build_survival_dataset(
  dt = ukb_data,
  disease_definitions = rhythm_defs,
  prevalent_sources = c("ICD10", "OPCS4"),
  outcome_sources = c("ICD10", "OPCS4"),
  primary_disease = "Arrhythmia"
)

A custom definition, such as arrhythmia_def created earlier with create_disease_definition(), can be combined with predefined diseases:

all_diseases <- c(
  get_predefined_diseases()[c("Hypertension", "Arrhythmia")],
  list(CustomArrhythmia = arrhythmia_def)
)

Composite endpoints

Create composite endpoints by merging multiple definitions:

mace <- combine_disease_definitions(
  get_predefined_diseases()[c("MI", "Stroke", "HF")]
)

This merges ICD-10, ICD-9, OPCS4, cancer registry, First Occurrence, self-report, and death codes from all input definitions into a single combined definition.
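
One way to picture the merge is as regex alternation for pattern fields and a set union for code vectors. The sketch below works on plain lists with illustrative values (defs is made up, not the predefined library) and is an assumption about the general approach, not the actual implementation of combine_disease_definitions().

```r
# Hypothetical component definitions as plain lists (illustrative values)
defs <- list(
  MI     = list(icd10_pattern = "^(I21|I22)",         sr_codes = c(1075)),
  Stroke = list(icd10_pattern = "^(I60|I61|I63|I64)", sr_codes = c(1081, 1086))
)

# Join the regular expressions with "|" so that either component can match
patterns <- vapply(defs, function(d) d$icd10_pattern, character(1))
combined_icd10 <- paste(patterns, collapse = "|")

# Pool the self-report codes and drop duplicates
combined_sr <- sort(unique(unlist(lapply(defs, `[[`, "sr_codes"), use.names = FALSE)))

combined_icd10
combined_sr
composite_hits <- grepl(combined_icd10, c("I219", "I639", "I500"), perl = TRUE)
composite_hits
```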

Comparing data sources

To check how many cases each source contributes, use compare_data_sources():

library(data.table)
ukb_data <- fread("population.csv")
diseases <- get_predefined_diseases()[c("AA", "Hypertension")]

comparison <- compare_data_sources(ukb_data, diseases)
comparison

This returns a summary table showing case counts per source per disease, which is useful for justifying source selection in your manuscript.
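
The kind of tally such a summary contains can be illustrated with base R on a made-up table of source hits. The exact columns returned by compare_data_sources() are not shown here, so the structure below (hits, case_counts, n_cases) is an assumption for illustration only.

```r
# Toy long-format table: one row per participant-source hit (made-up data)
hits <- data.frame(
  eid     = c(1, 1, 2, 3, 3, 4),
  disease = c("AA", "AA", "AA", "Hypertension", "Hypertension", "Hypertension"),
  source  = c("ICD10", "Death", "ICD10", "ICD10", "Self-report", "Self-report"),
  stringsAsFactors = FALSE
)

# Count distinct participants per disease and source
case_counts <- aggregate(
  eid ~ disease + source,
  data = hits,
  FUN  = function(x) length(unique(x))
)
names(case_counts)[3] <- "n_cases"
case_counts
```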