Chapter 4 Disease Definitions

UKBAnalytica identifies disease cases from up to five data sources:

Source                             Code field                                       Date field
ICD-10 (hospital)                  p41270                                           p41280_a*
ICD-9 (hospital)                   p41271                                           p41281_a*
Self-report                        p20002_i*_a*                                     p20008_i*_a*
Death registry                     p40001, p40002                                   p40000
Algorithm (Category 42, optional)  p{algo_source_field} or p{algo_source_field}_i0  p{algo_date_field} or p{algo_date_field}_i0
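
The `*` in the field names above stands for an instance or array index. As a minimal sketch, assuming the index is always a non-negative integer and using made-up column names, the matching columns of an export could be located with base R regular expressions. This only illustrates the naming pattern; it is not how the package itself selects columns.

```r
# Hypothetical column names, as they might appear in a UKB export
cols <- c("eid", "p41270", "p41280_a0", "p41280_a1", "p41280_a12",
          "p20002_i0_a0", "p20002_i1_a3", "p20008_i0_a0", "p40000")

# Assumption: "*" in the table stands for a non-negative integer index
icd10_date_cols  <- grep("^p41280_a[0-9]+$", cols, value = TRUE)
self_report_cols <- grep("^p20002_i[0-9]+_a[0-9]+$", cols, value = TRUE)

icd10_date_cols
self_report_cols
```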

4.1 Predefined diseases

The package ships a curated library of definitions for common cardiovascular and metabolic conditions. Retrieve them with get_predefined_diseases():

library(UKBAnalytica)

diseases <- get_predefined_diseases()
names(diseases)

Select a subset by name:

my_diseases <- get_predefined_diseases()[
  c("AA", "Hypertension", "Diabetes")
]

Each definition is a list specifying ICD-10 codes, ICD-9 codes, self-report codes, death-cause codes, and optional algorithm field IDs.

4.2 Inspecting a definition

Use str() to print the structure of a single definition:

diseases <- get_predefined_diseases()
str(diseases[["Hypertension"]])

A definition typically contains:

  • icd10 – character vector of ICD-10 codes (prefix-matched).
  • icd9 – character vector of ICD-9 codes.
  • self_report – integer vector of self-report coding IDs.
  • death_icd10 – character vector of death-cause ICD-10 codes.
  • algo_date_field – integer UKB field ID for algorithm-defined diagnosis date (optional).
  • algo_source_field – integer UKB field ID for algorithm-defined diagnosis source (optional).
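
To illustrate what "prefix-matched" means for the icd10 element, the sketch below flags a recorded code as a hit when it starts with any code in the definition. The recorded codes are invented for the example, and the package's internal matching routine is not shown in this chapter, so treat this as one possible reading rather than the actual implementation.

```r
# Codes from a definition, and hypothetical recorded diagnosis codes
definition_codes <- c("K70", "K74")
recorded_codes   <- c("K703", "K746", "K750", "I10", "K70")

# A recorded code is a hit if it starts with any definition code
is_match <- vapply(
  recorded_codes,
  function(code) any(startsWith(code, definition_codes)),
  logical(1),
  USE.NAMES = FALSE
)

recorded_codes[is_match]
```

Under this reading, a three-character code such as "K70" also captures its four-character subcodes, so definitions can stay short.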

4.3 Creating custom definitions

Use create_disease_definition() to define a new disease:

my_disease <- create_disease_definition(
  icd10       = c("K70", "K71", "K72", "K73", "K74"),
  icd9        = c("571"),
  self_report = c(1604),
  death_icd10 = c("K70", "K74")
)

For diseases with UKB algorithm-defined outcomes (Category 42), you can also add algorithm date/source fields:

copd_def <- create_disease_definition(
  icd10 = c("J41", "J42", "J43", "J44"),
  icd9 = c("490", "491", "492", "496"),
  self_report = c(1112),
  death_icd10 = c("J41", "J42", "J43", "J44"),
  algo_date_field = 42016,
  algo_source_field = 42017
)

When parsing algorithm fields, both column naming styles are supported: p{field}_i0 and p{field}.
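
As a rough sketch of how the two naming styles might be resolved, the hypothetical helper below (resolve_field_column is not a package function) returns whichever column exists for a field ID. It assumes the p{field}_i0 form takes precedence when both are present; the package's real precedence is not documented here.

```r
# Hypothetical helper: return the first column name present for a field ID
resolve_field_column <- function(data, field_id) {
  # Assumption: prefer the instance-suffixed name when both styles exist
  candidates <- c(paste0("p", field_id, "_i0"), paste0("p", field_id))
  found <- candidates[candidates %in% names(data)]
  if (length(found) == 0L) return(NA_character_)
  found[[1L]]
}

# Toy data with only the unsuffixed date column
toy <- data.frame(eid = 1:2, p42016 = as.Date(c("2010-05-01", NA)))

resolve_field_column(toy, 42016)  # column found under the p{field} style
resolve_field_column(toy, 42017)  # NA: neither style is present
```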

You can combine custom definitions with predefined ones:

all_diseases <- c(
  get_predefined_diseases()[c("Hypertension", "Diabetes")],
  list(LiverDisease = my_disease)
)

4.4 Composite endpoints

Create composite endpoints by merging multiple definitions:

mace <- combine_disease_definitions(
  get_predefined_diseases()[c("MI", "Stroke", "HF")]
)

This merges ICD-10, ICD-9, self-report, and death codes from all input definitions into a single combined definition.
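
Conceptually, the merge can be pictured as a union of each code vector across the input definitions. The sketch below uses toy definitions written as plain lists (structure assumed from section 4.2, code values illustrative only) and a hypothetical merge_definitions() helper. Whether combine_disease_definitions() removes duplicates in the same way is an assumption.

```r
# Two toy definitions as plain lists (illustrative values only)
def_a <- list(icd10 = c("I21", "I22"), icd9 = "410",
              self_report = 1075L, death_icd10 = "I21")
def_b <- list(icd10 = c("I63", "I21"), icd9 = "434",
              self_report = 1081L, death_icd10 = "I63")

# Hypothetical merge: de-duplicated union of each code vector
merge_definitions <- function(defs) {
  fields <- c("icd10", "icd9", "self_report", "death_icd10")
  merged <- lapply(fields, function(f) unique(unlist(lapply(defs, `[[`, f))))
  names(merged) <- fields
  merged
}

combined <- merge_definitions(list(def_a, def_b))
combined$icd10
```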

4.5 Comparing data sources

To check how many cases each source contributes, use compare_data_sources():

library(data.table)
ukb_data <- fread("population.csv")
diseases <- get_predefined_diseases()[c("AA", "Hypertension")]

comparison <- compare_data_sources(ukb_data, diseases)
comparison

This returns a summary table showing case counts per source per disease, which is useful for justifying source selection in your manuscript.