Chapter 4 Target retrieval and DEG analysis

In network pharmacology analysis, disease target retrieval and screening of differentially expressed genes (DEGs) are two essential preparatory steps. This chapter introduces commonly used, freely available disease target databases and their retrieval strategies, demonstrates volcano plot visualization of DESeq2 results, and performs intersection analysis between DEGs and disease targets using Venn and UpSet plots.


4.1 Disease target databases

Multiple public databases catalogue gene–disease associations with varying evidence levels. In practice, targets from several sources are combined: taking the union improves coverage, whereas taking the intersection (the more common choice) improves reliability. Here, we introduce some commonly used and user-friendly databases for disease target retrieval in TCM network pharmacology research.

4.1.1 GeneCards

GeneCards is an integrative database that provides comprehensive information on human genes, including disease associations (Stelzer et al., 2016). Each gene–disease link is assigned a Relevance Score that summarizes evidence from multiple sources.

However, GeneCards does not offer a direct API for bulk retrieval, so the common workflow involves manual search and export.

For example, to retrieve targets for “Diabetic Nephropathy”, we can manually search GeneCards, export the results as a CSV file, and filter by Relevance Score.

[Figure: genecards_dn_targets]

Then, we can read the exported CSV and filter genes based on the Relevance Score:

## read "GeneCards-SearchResults.csv", the file exported from GeneCards
dn_raw <- read.csv("GeneCards-SearchResults.csv")

## subset genes with Relevance Score > 10
dn_targets <- dn_raw$Gene.Symbol[dn_raw$Relevance.Score > 10]
head(dn_targets)
[1] "INS"   "HNF1B" "ACE"   "HNF1A" "GCK"   "NLRP3"
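
The cutoff of 10 is a judgment call, so it can help to see how many genes survive at several cutoffs before fixing one. The following is a minimal sketch on a made-up data frame: the gene symbols are borrowed from the output above, but the scores are invented and the column names are assumed to match the export. It also shows that `which()` guards against missing scores, which would otherwise put `NA` entries into the target list.

```r
# Toy stand-in for the GeneCards export; scores are invented for illustration.
toy <- data.frame(
  Gene.Symbol     = c("INS", "HNF1B", "ACE", "HNF1A", "GCK", "NLRP3"),
  Relevance.Score = c(42.1, 30.5, 18.2, 9.7, 4.3, NA),
  stringsAsFactors = FALSE
)

# Count genes retained at each candidate cutoff, ignoring missing scores.
cutoffs <- c(1, 5, 10, 20)
n_kept <- vapply(
  cutoffs,
  function(k) sum(toy$Relevance.Score > k, na.rm = TRUE),
  integer(1)
)
names(n_kept) <- cutoffs
n_kept

# which() drops NA comparisons, so no NA symbols leak into the target list.
toy_targets <- toy$Gene.Symbol[which(toy$Relevance.Score > 10)]
toy_targets
```

The same two lines can be pointed at `dn_raw` once the actual column names of the export have been checked with `names(dn_raw)`.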

TCMDATA also provides a built-in GeneCards-derived target list for Diabetic Nephropathy, which is used in the case study below:

library(TCMDATA)
data(dn_gcds)
head(dn_gcds, 20)
#>  [1] "INS"     "ACE"     "GCK"     "HNF1A"   "HNF1B"   "KCNJ11"  "UMOD"   
#>  [8] "ABCC8"   "IL6"     "PPARG"   "HNF4A"   "WFS1"    "COL4A1"  "NLRP3"  
#> [15] "TCF7L2"  "PDX1"    "INSR"    "TNF"     "NEUROD1" "TGFBR2"

4.1.2 Open Targets Platform

The Open Targets Platform is a comprehensive and robust resource designed for therapeutic target identification and prioritization (Buniello et al., 2025). It systematically aggregates, validates, and scores evidence linking targets to diseases across heterogeneous data sources. By organizing its knowledge base around five core entities—Target, Disease/Phenotype, Variant, Study, and Drug—the platform provides a highly structured framework that facilitates hypothesis generation and evidence-based target selection in drug discovery.

Users can retrieve disease targets through either the API or the web interface. The API allows programmatic access to the data, while the web interface provides an intuitive way to explore target–disease associations.

In the web interface, “Diabetic Nephropathy” can be searched directly, and the resulting target list can be exported in JSON or TSV format for downstream analysis:

[Figure: opentargetplatform]

ot <- read.delim("OT-EFO_0000401-associated-targets-2_21_2026-v25_12.tsv")
head(ot)
  symbol globalScore   gwasCredibleSets geneBurden                eva
1    ACE   0.7654601            No data    No data 0.8841439697057125
2  AGTR1   0.5902786            No data    No data            No data
3 TCF7L2   0.5130827 0.8296366016990858    No data            No data
4    INS   0.4939346 0.7894596444737526    No data            No data
5   UMOD   0.4840121 0.7620249013066388    No data            No data
6 COL4A3   0.4821762 0.7850753632678814    No data            No data
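
Because missing evidence is written as the string "No data", the per-source columns are read as character rather than numeric. A minimal sketch of one way to handle this is given below, using a toy data frame copied from the rows printed above (scores rounded); the 0.5 cutoff on globalScore is an assumption for illustration, not a recommended threshold.

```r
# Toy rows copied from the head(ot) output above (scores rounded).
ot_toy <- data.frame(
  symbol           = c("ACE", "AGTR1", "TCF7L2", "INS", "UMOD", "COL4A3"),
  globalScore      = c(0.765, 0.590, 0.513, 0.494, 0.484, 0.482),
  gwasCredibleSets = c("No data", "No data", "0.830", "0.789", "0.762", "0.785"),
  stringsAsFactors = FALSE
)

# "No data" makes the column character; recode it to NA before converting.
gwas <- ot_toy$gwasCredibleSets
gwas[gwas == "No data"] <- NA
ot_toy$gwasCredibleSets <- as.numeric(gwas)

# Keep targets above an illustrative overall-score cutoff (0.5 is an assumption).
ot_targets <- ot_toy$symbol[ot_toy$globalScore >= 0.5]
ot_targets
```

When reading the real file, passing `na.strings = "No data"` to `read.delim()` should achieve the same recoding at import time.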

For API access, please refer to the Open Targets API documentation for detailed instructions on how to query target–disease associations programmatically.
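
As a rough illustration of what such a query might look like, the sketch below posts a GraphQL request for the disease ID that appears in the exported file name above (EFO_0000401). The endpoint URL and the field names (associatedTargets, rows, approvedSymbol, score) reflect our reading of the public GraphQL schema and should be checked against the current documentation; `fetch_ot_targets()` is a hypothetical helper, not part of TCMDATA, and it assumes the httr and jsonlite packages are installed.

```r
# GraphQL query: top-scoring targets associated with one disease.
ot_query <- '
query diseaseTargets($efoId: String!, $size: Int!) {
  disease(efoId: $efoId) {
    id
    name
    associatedTargets(page: {index: 0, size: $size}) {
      count
      rows {
        target { id approvedSymbol }
        score
      }
    }
  }
}'

# Hypothetical helper (not part of TCMDATA); needs the httr and jsonlite packages.
fetch_ot_targets <- function(efo_id, size = 50) {
  resp <- httr::POST(
    "https://api.platform.opentargets.org/api/v4/graphql",
    body   = list(query = ot_query,
                  variables = list(efoId = efo_id, size = size)),
    encode = "json"
  )
  httr::stop_for_status(resp)
  parsed <- jsonlite::fromJSON(
    httr::content(resp, as = "text", encoding = "UTF-8")
  )
  rows <- parsed$data$disease$associatedTargets$rows
  data.frame(symbol = rows$target$approvedSymbol, score = rows$score)
}

# Not run here because it needs network access:
# ot_api <- fetch_ot_targets("EFO_0000401", size = 50)
```

Only one page of results is requested; retrieving the full list would require looping over the page index.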

4.1.3 CTD (Comparative Toxicogenomics Database)

The Comparative Toxicogenomics Database (CTD) is a robust, publicly available resource that aims to advance understanding about how environmental exposures affect human health (Davis et al., 2024). It provides manually curated information about chemical–gene/protein interactions, chemical–disease, and gene–disease relationships. By integrating these data with functional and pathway annotations, CTD helps researchers develop hypotheses about the mechanisms underlying environmentally influenced diseases. It is particularly valuable for network pharmacology studies involving environmental toxins or pharmacological exposures.

CTD provides both a web interface and a RESTful API for data retrieval. The web interface allows users to perform batch queries for diseases, chemicals, or genes, and export the results manually.

For programmatic access, the CTD Batch Query API can be called with a single URL that specifies the input type (disease), the query term (Diabetic Nephropathy), the desired report type (genes_curated, genes_inferred, or genes), and the output format (csv or tsv).

The report parameter controls which association types are returned:

report value     Description                                           DN example
genes_curated    Manually curated from literature (high confidence)    46 genes
genes_inferred   Inferred via chemical–gene–disease links              ~26,000 genes
genes            All associations (curated + inferred)                 ~26,300 genes

For example, to retrieve all gene targets associated with “Diabetic Nephropathy” directly into R:

# Construct the CTD Batch Query API URL
# Change report to "genes_curated" for high-confidence curated associations only
options(timeout = 300)
ctd_url <- paste0(
  "https://ctdbase.org/tools/batchQuery.go?",
  "inputType=disease&",
  "inputTerms=Diabetic%20Nephropathy&",
  "report=genes&",
  "format=tsv"
)

lines <- readLines(ctd_url)
lines[1] <- sub("^# ", "", lines[1])
ctd <- read.delim(textConnection(lines))

# Extract unique gene symbols
ctd_targets <- unique(ctd$GeneSymbol)
cat("Total gene targets:", length(ctd_targets), "\n")
head(ctd[, c("GeneSymbol", "GeneID", "DirectEvidence", "InferenceScore")])
Total gene targets: 26330

      GeneSymbol GeneID   DirectEvidence InferenceScore
1 1700001K19RIKL 299330                            3.99
2           1-SF 100049428                            2.47
3 9530082P21RIKL 360487                            3.94
4 9930111J21RIK2 245240                            3.74
5              A  50518 marker/mechanism             NA
6              A  50518                            2.63

The InferenceScore column quantifies the strength of inferred associations — higher values indicate more atypical (and potentially more meaningful) connectivity in the chemical–gene–disease network. Rows with DirectEvidence filled and InferenceScore = NA are curated associations.
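
The distinction between curated and inferred rows can be sketched as follows, using a toy data frame that mirrors a few rows of the output above. The InferenceScore cutoff of 3 is purely illustrative, and it is assumed that empty DirectEvidence cells are read as empty strings, which is the default behaviour of read.delim() for character columns.

```r
# Toy rows mirroring the head(ctd) output above.
ctd_toy <- data.frame(
  GeneSymbol     = c("1700001K19RIKL", "1-SF", "A", "A"),
  DirectEvidence = c("", "", "marker/mechanism", ""),
  InferenceScore = c(3.99, 2.47, NA, 2.63),
  stringsAsFactors = FALSE
)

# Curated rows carry a DirectEvidence label; inferred rows carry a score.
has_evidence <- !is.na(ctd_toy$DirectEvidence) & nzchar(ctd_toy$DirectEvidence)
curated  <- unique(ctd_toy$GeneSymbol[has_evidence])
inferred <- ctd_toy[!has_evidence & !is.na(ctd_toy$InferenceScore), ]

# Rank inferred associations by score; the cutoff of 3 is purely illustrative.
inferred <- inferred[order(-inferred$InferenceScore), ]
strong_inferred <- unique(inferred$GeneSymbol[inferred$InferenceScore >= 3])

curated
strong_inferred
```

Note that the same gene can appear in both groups, as gene A does here, so the two vectors are best kept separate rather than merged without a source label.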

Further details about the CTD data retrieval and interpretation can be found in the CTD documentation.

4.1.4 Other databases

In addition to the databases detailed above, several other resources are frequently used in network pharmacology to ensure comprehensive target collection:

  • DisGeNET: One of the largest publicly available collections of genes and variants associated with human diseases. It integrates data from expert-curated repositories, GWAS catalogues, animal models, and text-mining of the scientific literature.
  • OMIM (Online Mendelian Inheritance in Man): A comprehensive, authoritative compendium of human genes and genetic phenotypes. It is highly reliable for identifying genes with strong, well-documented genetic links to specific diseases.
  • TTD (Therapeutic Target Database): Focuses on known and explored therapeutic protein and nucleic acid targets, providing detailed information about the targeted diseases, pathway information, and corresponding drugs.
  • DrugBank: While primarily a drug database, it provides extensive information on drug targets, making it useful for finding targets of existing drugs used to treat the disease of interest.

Researchers typically query multiple databases and take the intersection of the results to form a robust disease target set.
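
Combining lists from several databases comes down to a few set operations in base R. The sketch below uses three small invented target lists (the database names are only labels) to contrast the union, the strict intersection, and a middle ground that keeps genes reported by at least two sources.

```r
# Toy target lists from three hypothetical database queries.
target_sets <- list(
  GeneCards   = c("INS", "ACE", "TNF", "IL6"),
  OpenTargets = c("ACE", "INS", "UMOD"),
  CTD         = c("ACE", "TNF", "VEGFA")
)

# Union maximises coverage; intersection keeps only fully corroborated genes.
all_targets    <- Reduce(union, target_sets)
shared_targets <- Reduce(intersect, target_sets)

# Middle ground: genes reported by at least two sources.
votes <- table(unlist(lapply(target_sets, unique)))
supported <- sort(names(votes)[votes >= 2])

shared_targets
supported
```

With many databases the strict intersection can become very small, which is when a vote threshold of this kind may be a reasonable compromise.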

4.2 DEG visualization

Differential expression analysis identifies genes whose expression differs significantly between disease and control conditions. In network pharmacology analysis, these DEGs can be integrated with disease targets to prioritize genes that are both statistically significant in expression and biologically linked to the disease phenotype. The intersection of DEGs and disease targets often represents the most promising candidates for further network analysis and experimental validation.

TCMDATA includes a demo DESeq2 result[4] from GSE142025[5] (early DN vs. control) for illustration. Further details about the dataset can be found on the original GEO page.

4.2.1 Load and inspect data

data(deg_earlydn)
str(deg_earlydn)
#> 'data.frame':    27183 obs. of  8 variables:
#>  $ baseMean      : num  31.9 1128.6 61.8 11.7 260.2 ...
#>  $ log2FoldChange: num  0.8175 -0.0325 -0.2766 0.0321 0.0982 ...
#>  $ lfcSE         : num  0.219 0.133 0.266 0.421 0.178 ...
#>  $ stat          : num  3.7246 -0.2439 -1.0382 0.0763 0.5528 ...
#>  $ pvalue        : num  0.000196 0.807271 0.299167 0.939176 0.580368 ...
#>  $ padj          : num  0.00409 0.91812 0.56789 0.97605 0.78759 ...
#>  $ names         : chr  "DDX11L1" "WASH7P" "MIR6859-1" "FAM138A" ...
#>  $ g             : chr  "up" "normal" "normal" "normal" ...
table(deg_earlydn$g)
#> 
#>   down normal     up 
#>    678  25852    653
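
The cutoffs used to create the g column are not stated here. As a hedged illustration of how such labels are commonly derived from a DESeq2 table, the sketch below defines a hypothetical helper, `classify_deg()`, and applies it to the first five rows printed by str() above; both cutoffs are assumptions. The first gene (log2FoldChange of about 0.82, padj of about 0.004) is labelled "up" in the packaged data, which suggests that the original labels used a fold-change cutoff below 1 or none at all.

```r
# Hypothetical helper: label genes from fold change and adjusted p-value.
classify_deg <- function(lfc, padj, lfc_cut = 1, p_cut = 0.05) {
  label <- rep("normal", length(lfc))
  sig <- !is.na(padj) & padj < p_cut      # DESeq2 can return NA for padj
  label[sig & lfc >=  lfc_cut] <- "up"
  label[sig & lfc <= -lfc_cut] <- "down"
  label
}

# First five rows of deg_earlydn, as printed by str() above.
lfc  <- c(0.8175, -0.0325, -0.2766, 0.0321, 0.0982)
padj <- c(0.00409, 0.91812, 0.56789, 0.97605, 0.78759)

classify_deg(lfc, padj, lfc_cut = 1)    # strict: no gene passes
classify_deg(lfc, padj, lfc_cut = 0.5)  # relaxed: the first gene is "up"
```

Whichever cutoffs are chosen, it is worth using the same ones for the labels and for the volcano plot so that the two views agree.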

4.2.2 Volcano plot with ivolcano

ivolcano[6] is an R package that provides both static and interactive volcano plot visualizations for differential expression results. The interactive mode lets users explore DEGs dynamically: hover to display gene details, click to open external databases, and zoom into specific regions.

By specifying dual thresholds, ivolcano automatically applies a FigureYa-styled color scheme that clearly separates genes into significance tiers.

library(ivolcano)

p <- ivolcano(deg_earlydn,
              logFC_col  = "log2FoldChange",
              pval_col   = "padj",
              gene_col   = "names",
              pval_cutoff  = 0.05,
              logFC_cutoff = 1,
              pval_cutoff2 = 0.01,
              logFC_cutoff2 = 2,
              size_by    = "manual",
              top_n      = 10,
              onclick_fun = onclick_genecards)
print(p)

The interactive plot supports hovering to view gene details (name, logFC, adjusted P-value) and clicking to redirect to external databases. For example, onclick_genecards opens the GeneCards page for any clicked gene. Other built-in redirects include onclick_ncbi, onclick_ensembl, onclick_uniprot, and onclick_pubmed.

To generate a static ggplot2 figure (e.g., for PDF output or manuscript submission), simply set interactive = FALSE:

p_static <- ivolcano(deg_earlydn,
                     logFC_col  = "log2FoldChange",
                     pval_col   = "padj",
                     gene_col   = "names",
                     pval_cutoff  = 0.05,
                     logFC_cutoff = 1,
                     pval_cutoff2 = 0.01,
                     logFC_cutoff2 = 2,
                     size_by    = "manual",
                     top_n      = 10,
                     interactive = FALSE)
print(p_static)


4.3 Intersection analysis

Integrating DEGs with disease targets helps prioritize genes that are both statistically significant in expression and biologically linked to the disease phenotype. Here, we combine DN targets from GeneCards and Open Targets Platform with DEGs for intersection analysis using TCMDATA.

4.3.1 Prepare gene sets

TCMDATA provides three built-in datasets for the DN case study: deg_earlydn (DEGs), dn_gcds (GeneCards targets), and dn_otp (Open Targets targets).

# select DEGs
degs <- deg_earlydn$names[deg_earlydn$g != "normal"]
cat("DEGs:", length(degs), "\n")
#> DEGs: 1331
# GeneCards disease targets
data(dn_gcds)
cat("GeneCards targets:", length(dn_gcds), "\n")
#> GeneCards targets: 4760
# Open Targets Platform targets
data(dn_otp)
cat("Open Targets targets:", length(dn_otp), "\n")
#> Open Targets targets: 4148

4.3.2 Venn diagram

getvenndata() constructs a logical membership matrix for the input vectors (recommended for four or fewer sets), and ggvenn_plot() renders the corresponding Venn diagram.

venn_df <- getvenndata(degs, dn_gcds, dn_otp,
                       set_names = c("DEGs", "GeneCards", "OpenTargets"))

venn_df |> head()
#>    Element DEGs GeneCards OpenTargets
#> 1  DDX11L1 TRUE     FALSE       FALSE
#> 2 MIR12136 TRUE     FALSE       FALSE
#> 3   FAM87B TRUE     FALSE       FALSE
#> 4     HES4 TRUE     FALSE       FALSE
#> 5     TP73 TRUE     FALSE       FALSE
#> 6   GPR153 TRUE     FALSE       FALSE
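
To make the structure of this table concrete, a membership matrix of the same shape can be built in a few lines of base R. This is only a sketch of the idea on toy sets, not the implementation used by getvenndata().

```r
# Toy sets; gene symbols are chosen only for illustration.
sets <- list(
  DEGs        = c("JUN", "VCAM1", "HES4"),
  GeneCards   = c("JUN", "VCAM1", "INS"),
  OpenTargets = c("JUN", "INS", "UMOD")
)

# One row per unique element, one logical column per set.
elements <- unique(unlist(sets, use.names = FALSE))
membership <- data.frame(
  Element = elements,
  lapply(sets, function(s) elements %in% s),
  stringsAsFactors = FALSE
)
membership

# Elements present in every set.
core <- membership$Element[rowSums(membership[names(sets)]) == length(sets)]
core
```

Each row answers "which sets contain this gene?", so any region of the Venn diagram corresponds to one pattern of TRUE and FALSE values across the set columns.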

Then, the Venn diagram can be plotted with ggvenn_plot():

venn1 <- ggvenn_plot(venn_df)

venn2 <- ggvenn_plot(venn_df, set.color = c("#FF8748", "#5BAA56", "#B8BB5B"), stroke.color = "white")

aplot::plot_list(venn1, venn2, ncol = 1)

4.3.3 UpSet plot

When comparing more than four sets, an UpSet plot provides a clearer intersection overview than a Venn diagram. The function upset_plot() from the aplotExtra package takes a named list of character vectors and renders an UpSet-style visualization; the same three sets are used here for illustration:

library(aplotExtra)

gene_list <- list(
  DEGs       = degs,
  GeneCards  = dn_gcds,
  OpenTargets = dn_otp
)

upset_plot(gene_list, color.intersect.by = "Set2", color.set.by = "Dark2")

4.3.4 Extract intersection results

getvennresult() extracts all intersection subsets from the Venn membership matrix:

venn_res <- getvennresult(venn_df)
venn_res[, c("Set_Combination", "Gene_Count")]
#>              Set_Combination Gene_Count
#> 1 DEGs&GeneCards&OpenTargets        149
#> 2      GeneCards&OpenTargets       1900
#> 3           DEGs&OpenTargets         75
#> 4                OpenTargets       1820
#> 5             DEGs&GeneCards        135
#> 6                  GeneCards       2576
#> 7                       DEGs        972
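
A quick sanity check is to add up the regions that belong to each set and compare the totals with the set sizes reported in Section 4.3.1. The sketch below does this with the counts copied from the output above; it assumes only that "&" separates set names in Set_Combination.

```r
# Region sizes copied from the getvennresult() output above.
regions <- data.frame(
  Set_Combination = c("DEGs&GeneCards&OpenTargets", "GeneCards&OpenTargets",
                      "DEGs&OpenTargets", "OpenTargets", "DEGs&GeneCards",
                      "GeneCards", "DEGs"),
  Gene_Count = c(149, 1900, 75, 1820, 135, 2576, 972),
  stringsAsFactors = FALSE
)

# Split each label into its member sets, then total the regions per set.
members <- strsplit(regions$Set_Combination, "&", fixed = TRUE)
set_names <- c("DEGs", "GeneCards", "OpenTargets")
per_set <- vapply(
  set_names,
  function(s) {
    in_set <- vapply(members, function(m) s %in% m, logical(1))
    sum(regions$Gene_Count[in_set])
  },
  numeric(1)
)
per_set
```

The totals for DEGs (1331) and GeneCards (4760) match the lengths printed earlier, whereas the Open Targets total (3944) is lower than length(dn_otp) (4148). One possible explanation, offered here only as a guess, is that dn_otp contains repeated symbols, since length() counts duplicates while Venn regions count unique elements; comparing with length(unique(dn_otp)) would settle the question.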

The gene list for any combination can then be extracted. First, list the available combinations:

venn_res$Set_Combination
#> [1] "DEGs&GeneCards&OpenTargets" "GeneCards&OpenTargets"     
#> [3] "DEGs&OpenTargets"           "OpenTargets"               
#> [5] "DEGs&GeneCards"             "GeneCards"                 
#> [7] "DEGs"

For example, to extract the core targets shared by the DEGs and both disease target databases (the first row, DEGs&GeneCards&OpenTargets):

core_genes <- strsplit(venn_res$Genes[1], ",\\s*")[[1]]
cat("Core targets:", length(core_genes), "\n")
#> Core targets: 149
head(core_genes, 20)
#>  [1] "ERRFI1"  "EPHA2"   "LIN28A"  "NR0B2"   "JUN"     "GADD45A" "CCN1"   
#>  [8] "GBP2"    "VCAM1"   "MCL1"    "S100A9"  "S100A8"  "S100A4"  "FCGR3B" 
#> [15] "RXRG"    "SELE"    "PTGS2"   "RGS1"    "BTG2"    "CD55"

The core intersection genes, those shared across multiple databases and DEGs, can then be carried forward to PPI network construction (Chapter 7) and enrichment analysis (Chapter 6).


4.4 References

  1. Stelzer G, Rosen R, Plaschkes I, Zimmerman S, Twik M, Fishilevich S, Iny Stein T, Nudel R, Lieder I, Mazor Y, Kaplan S, Dahary D, Warshawsky D, Guan-Golan Y, Kohn A, Rappaport N, Safran M, and Lancet D. The GeneCards Suite: From Gene Data Mining to Disease Genome Sequence Analyses. Current Protocols in Bioinformatics (2016), 54:1.30.1–1.30.33. doi: 10.1002/cpbi.5.

  2. Buniello A, et al. Open Targets Platform: facilitating therapeutic hypotheses building in drug discovery. Nucleic Acids Research (2025). doi: 10.1093/nar/gkae1128.

  3. Davis AP, Wiegers TC, Sciaky D, Barkalow F, Strong M, Wyatt B, Wiegers J, McMorran R, Abrar S, Mattingly CJ. Comparative Toxicogenomics Database’s 20th anniversary: update 2025. Nucleic Acids Research (2024). doi: 10.1093/nar/gkae822.

  4. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology (2014), 15, 550. doi: 10.1186/s13059-014-0550-8.

  5. Fan Y, Yi Z, D’Agati VD, Sun Z, et al. Comparison of Kidney Transcriptomic Profiles of Early and Advanced Diabetic Nephropathy Reveals Potential New Mechanisms for Disease Progression. Diabetes (2019), 68:2301–2314. doi: 10.2337/db19-0204. PMID: 32086290.

  6. Yu G (2025). ivolcano: Interactive Volcano Plot. R package version 0.1.0. https://CRAN.R-project.org/package=ivolcano.

4.5 Session information

sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] aplotExtra_0.0.4   ivolcano_0.0.5     enrichplot_1.30.5  TCMDATA_0.0.0.9000
#> 
#> loaded via a namespace (and not attached):
#>   [1] DBI_1.3.0               gson_0.1.0              gridExtra_2.3          
#>   [4] rematch2_2.1.2          rlang_1.1.7             magrittr_2.0.4         
#>   [7] DOSE_4.4.0              compiler_4.5.2          RSQLite_2.4.6          
#>  [10] png_0.1-8               systemfonts_1.3.2       vctrs_0.7.1            
#>  [13] reshape2_1.4.5          stringr_1.6.0           pkgconfig_2.0.3        
#>  [16] crayon_1.5.3            fastmap_1.2.0           XVector_0.50.0         
#>  [19] labeling_0.4.3          rmarkdown_2.30          purrr_1.2.1            
#>  [22] bit_4.6.0               xfun_0.56               cachem_1.1.0           
#>  [25] aplot_0.2.9             jsonlite_2.0.0          blob_1.3.0             
#>  [28] tidydr_0.0.6            tweenr_2.0.3            BiocParallel_1.44.0    
#>  [31] cluster_2.1.8.1         parallel_4.5.2          R6_2.6.1               
#>  [34] bslib_0.10.0            stringi_1.8.7           RColorBrewer_1.1-3     
#>  [37] DNAcopy_1.84.0          jquerylib_0.1.4         GOSemSim_2.36.0        
#>  [40] Rcpp_1.1.1              Seqinfo_1.0.0           bookdown_0.46          
#>  [43] knitr_1.51              ggtangle_0.1.1          R.utils_2.13.0         
#>  [46] IRanges_2.44.0          Matrix_1.7-4            splines_4.5.2          
#>  [49] igraph_2.2.2            tidyselect_1.2.1        qvalue_2.42.0          
#>  [52] rstudioapi_0.18.0       yaml_2.3.12             codetools_0.2-20       
#>  [55] lattice_0.22-7          tibble_3.3.1            plyr_1.8.9             
#>  [58] withr_3.0.2             Biobase_2.70.0          treeio_1.34.0          
#>  [61] KEGGREST_1.50.0         S7_0.2.1                evaluate_1.0.5         
#>  [64] survival_3.8-3          gridGraphics_0.5-1      polyclip_1.10-7        
#>  [67] scatterpie_0.2.6        Biostrings_2.78.0       pillar_1.11.1          
#>  [70] ggtree_4.0.4            stats4_4.5.2            clusterProfiler_4.18.4 
#>  [73] ggfun_0.2.0             generics_0.1.4          paletteer_1.7.0        
#>  [76] S4Vectors_0.48.0        ggplot2_4.0.2           scales_1.4.0           
#>  [79] tidytree_0.4.7          glue_1.8.0              gdtools_0.5.0          
#>  [82] lazyeval_0.2.2          tools_4.5.2             ggnewscale_0.5.2       
#>  [85] data.table_1.18.2.1     fgsea_1.36.2            ggvenn_0.1.19          
#>  [88] forcats_1.0.1           ggiraph_0.9.6           maftools_2.26.0        
#>  [91] fs_1.6.7                fastmatch_1.1-8         cowplot_1.2.0          
#>  [94] grid_4.5.2              tidyr_1.3.2             ape_5.8-1              
#>  [97] ggstar_1.0.6            AnnotationDbi_1.72.0    nlme_3.1-168           
#> [100] patchwork_1.3.2         ggforce_0.5.0           cli_3.6.5              
#> [103] rappdirs_0.3.4          fontBitstreamVera_0.1.1 dplyr_1.2.0            
#> [106] gtable_0.3.6            R.methodsS3_1.8.2       yulab.utils_0.2.4      
#> [109] sass_0.4.10             digest_0.6.39           fontquiver_0.2.1       
#> [112] BiocGenerics_0.56.0     ggrepel_0.9.7           ggplotify_0.1.3        
#> [115] htmlwidgets_1.6.4       farver_2.1.2            memoise_2.0.1          
#> [118] htmltools_0.5.9         R.oo_1.27.1             lifecycle_1.0.5        
#> [121] httr_1.4.8              GO.db_3.22.0            fontLiberation_0.1.0   
#> [124] bit64_4.6.0-1           MASS_7.3-65