Chapter 4 Target retrieval and DEG analysis
In network pharmacology analysis, disease target retrieval and differential gene screening are two essential preparatory steps. This chapter introduces common and free disease target databases and their retrieval strategies, demonstrates volcano plot visualization of DESeq2 results, and performs intersection analysis between DEGs and disease targets using Venn and UpSet plots.
4.1 Disease target databases
Multiple public databases catalogue gene–disease associations with varying evidence levels. In practice, querying several sources improves coverage, while keeping only the genes supported by more than one source (often their intersection) improves reliability. Here, we introduce some commonly used and user-friendly databases for disease target retrieval in TCM network pharmacology research.
4.1.1 GeneCards
GeneCards is an integrative database that provides comprehensive information on human genes, including disease associations (Stelzer et al., 2016). Each gene–disease link is assigned a Relevance Score that summarizes evidence from multiple sources.
However, GeneCards does not offer a direct API for bulk retrieval, so the common workflow involves manual search and export.
For example, to retrieve targets for “Diabetic Nephropathy”, we can manually search GeneCards, export the results as a CSV file, and filter by Relevance Score.
Then, we can read the exported CSV and filter genes based on the Relevance Score:
## uploaded "GeneCards-SearchResults.csv" from GeneCards export
dn_raw <- read.csv("GeneCards-SearchResults.csv")
## subset genes with Relevance Score > 10
dn_targets <- dn_raw$Gene.Symbol[dn_raw$Relevance.Score > 10]
head(dn_targets)
#> [1] "INS" "HNF1B" "ACE" "HNF1A" "GCK" "NLRP3"
Notably, TCMDATA provides a built-in GeneCards-derived target list for Diabetic Nephropathy for case study demonstration:
#> [1] "INS" "ACE" "GCK" "HNF1A" "HNF1B" "KCNJ11" "UMOD"
#> [8] "ABCC8" "IL6" "PPARG" "HNF4A" "WFS1" "COL4A1" "NLRP3"
#> [15] "TCF7L2" "PDX1" "INSR" "TNF" "NEUROD1" "TGFBR2"
4.1.2 Open Targets Platform
The Open Targets Platform is a comprehensive and robust resource designed for therapeutic target identification and prioritization (Buniello et al., 2025). It systematically aggregates, validates, and scores evidence linking targets to diseases across heterogeneous data sources. By organizing its knowledge base around five core entities—Target, Disease/Phenotype, Variant, Study, and Drug—the platform provides a highly structured framework that facilitates hypothesis generation and evidence-based target selection in drug discovery.
Users can retrieve disease targets through either the API or the web interface. The API allows programmatic access to the data, while the web interface provides an intuitive way to explore target–disease associations.
In the web interface, “Diabetic Nephropathy” can be searched directly, and the resulting target list can be exported in JSON or TSV format for downstream analysis:
  symbol globalScore   gwasCredibleSets geneBurden                eva
1    ACE   0.7654601            No data    No data 0.8841439697057125
2  AGTR1   0.5902786            No data    No data            No data
3 TCF7L2   0.5130827 0.8296366016990858    No data            No data
4    INS   0.4939346 0.7894596444737526    No data            No data
5   UMOD   0.4840121 0.7620249013066388    No data            No data
6 COL4A3   0.4821762 0.7850753632678814    No data            No data
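As a minimal sketch of how such an export might be loaded into R, the block below writes a small TSV that mirrors the first rows shown above and reads it back. The file is a temporary stand-in for a real export, and the 0.5 cut-off on globalScore is an assumption chosen for illustration, not a recommended threshold.

```r
# Hypothetical stand-in for an Open Targets TSV export;
# the rows mirror the table shown above.
ot_file <- tempfile(fileext = ".tsv")
writeLines(c(
  "symbol\tglobalScore\tgwasCredibleSets\tgeneBurden\teva",
  "ACE\t0.7654601\tNo data\tNo data\t0.8841439697057125",
  "AGTR1\t0.5902786\tNo data\tNo data\tNo data",
  "TCF7L2\t0.5130827\t0.8296366016990858\tNo data\tNo data",
  "INS\t0.4939346\t0.7894596444737526\tNo data\tNo data"
), ot_file)

# Treat the literal "No data" string as NA so evidence columns parse as numeric
ot <- read.delim(ot_file, na.strings = "No data", stringsAsFactors = FALSE)

# Assumed cut-off: keep targets with an overall association score >= 0.5
ot_targets <- ot$symbol[ot$globalScore >= 0.5]
ot_targets
```

Declaring "No data" as a missing-value marker at read time avoids evidence columns being imported as character vectors, which would otherwise break numeric filtering later.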
For API access, please refer to the Open Targets API documentation for detailed instructions on how to query target–disease associations programmatically.
4.1.3 CTD (Comparative Toxicogenomics Database)
The Comparative Toxicogenomics Database (CTD) is a robust, publicly available resource that aims to advance understanding about how environmental exposures affect human health (Davis et al., 2024). It provides manually curated information about chemical–gene/protein interactions, chemical–disease, and gene–disease relationships. By integrating these data with functional and pathway annotations, CTD helps researchers develop hypotheses about the mechanisms underlying environmentally influenced diseases. It is particularly valuable for network pharmacology studies involving environmental toxins or pharmacological exposures.
CTD provides both a web interface and a RESTful API for data retrieval. The web interface allows users to perform batch queries for diseases, chemicals, or genes, and export the results manually.
For programmatic access, the CTD Batch Query API is highly efficient. You can construct a query URL specifying the input type (disease), the query term (Diabetic Nephropathy), the desired report type (genes_curated or genes_inferred), and the output format (csv or tsv).
The report parameter controls which association types are returned:
| report value | Description | DN example |
|---|---|---|
| genes_curated | Manually curated from literature (high confidence) | 46 genes |
| genes_inferred | Inferred via chemical–gene–disease links | ~26,000 genes |
| genes | All associations (curated + inferred) | ~26,300 genes |
For example, to retrieve all gene targets associated with “Diabetic Nephropathy” directly into R:
# Construct the CTD Batch Query API URL
# Change report to "genes_curated" for high-confidence curated associations only
options(timeout = 300)
ctd_url <- paste0(
"https://ctdbase.org/tools/batchQuery.go?",
"inputType=disease&",
"inputTerms=Diabetic%20Nephropathy&",
"report=genes&",
"format=tsv"
)
lines <- readLines(ctd_url)
lines[1] <- sub("^# ", "", lines[1])
ctd <- read.delim(textConnection(lines))
# Extract unique gene symbols
ctd_targets <- unique(ctd$GeneSymbol)
cat("Total gene targets:", length(ctd_targets), "\n")
head(ctd[, c("GeneSymbol", "GeneID", "DirectEvidence", "InferenceScore")])
#> Total gene targets: 26330
#>       GeneSymbol    GeneID   DirectEvidence InferenceScore
#> 1 1700001K19RIKL    299330                            3.99
#> 2           1-SF 100049428                            2.47
#> 3 9530082P21RIKL    360487                            3.94
#> 4 9930111J21RIK2    245240                            3.74
#> 5              A     50518 marker/mechanism             NA
#> 6              A     50518                            2.63
The InferenceScore column quantifies the strength of inferred associations — higher values indicate more atypical (and potentially more meaningful) connectivity in the chemical–gene–disease network. Rows with DirectEvidence filled and InferenceScore = NA are curated associations.
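The distinction between curated and inferred rows can be used to split the result into two target lists. The sketch below works on a small data frame that copies a few rows from the output above, so it runs without network access. The name ctd_demo and the InferenceScore cut-off of 3 are assumptions for illustration only.

```r
# Toy subset with the same columns as the CTD result above
ctd_demo <- data.frame(
  GeneSymbol     = c("A", "A", "1-SF", "9530082P21RIKL"),
  DirectEvidence = c("marker/mechanism", "", "", ""),
  InferenceScore = c(NA, 2.63, 2.47, 3.94),
  stringsAsFactors = FALSE
)

# Curated rows carry a non-empty DirectEvidence value
is_curated <- !is.na(ctd_demo$DirectEvidence) & nzchar(ctd_demo$DirectEvidence)
curated_genes <- unique(ctd_demo$GeneSymbol[is_curated])

# Inferred rows: rank by InferenceScore, then apply an assumed cut-off
inferred <- ctd_demo[!is_curated & !is.na(ctd_demo$InferenceScore), ]
inferred <- inferred[order(inferred$InferenceScore, decreasing = TRUE), ]
top_inferred <- unique(inferred$GeneSymbol[inferred$InferenceScore >= 3])

curated_genes
top_inferred
```

Because the full genes report returns tens of thousands of inferred genes, some form of score-based filtering is usually needed before the list is combined with other sources.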
Further details about the CTD data retrieval and interpretation can be found in the CTD documentation.
4.1.4 Other databases
In addition to the databases detailed above, several other resources are frequently used in network pharmacology to ensure comprehensive target collection:
- DisGeNET: One of the largest publicly available collections of genes and variants associated with human diseases. It integrates data from expert-curated repositories, GWAS catalogues, animal models, and text-mining of the scientific literature.
- OMIM (Online Mendelian Inheritance in Man): A comprehensive, authoritative compendium of human genes and genetic phenotypes. It is highly reliable for identifying genes with strong, well-documented genetic links to specific diseases.
- TTD (Therapeutic Target Database): Focuses on known and explored therapeutic protein and nucleic acid targets, providing detailed information about the targeted diseases, pathway information, and corresponding drugs.
- DrugBank: While primarily a drug database, it provides extensive information on drug targets, making it useful for finding targets of existing drugs used to treat the disease of interest.
Researchers typically query several databases and keep the genes shared across them (the intersection) to form a robust disease target set.
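Combining sources comes down to a few set operations in base R. The sketch below contrasts a strict intersection, a relaxed "at least two sources" rule, and the union. The GeneCards and Open Targets vectors reuse symbols shown earlier in this chapter, while the CTD vector is a hypothetical subset invented for the example.

```r
target_sets <- list(
  GeneCards   = c("INS", "HNF1B", "ACE", "HNF1A", "GCK", "NLRP3"),
  OpenTargets = c("ACE", "AGTR1", "TCF7L2", "INS", "UMOD", "COL4A3"),
  CTD         = c("ACE", "INS", "TNF", "UMOD")  # hypothetical subset
)

# Strict consensus: genes reported by every source
core_targets <- Reduce(intersect, target_sets)

# Relaxed consensus: genes reported by at least two sources
# (unique() per source prevents double counting within one list)
support <- table(unlist(lapply(target_sets, unique)))
consensus_targets <- names(support)[support >= 2]

# Union: maximal coverage, lowest stringency
all_targets <- Reduce(union, target_sets)

core_targets
consensus_targets
length(all_targets)
```

The strict intersection shrinks quickly as more databases are added, so a support-count rule is a common compromise between coverage and reliability.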
4.2 DEG visualization
Differential expression analysis identifies genes that are significantly altered between disease and control conditions. In network pharmacology analysis, these DEGs can be integrated with disease targets to prioritize genes that are both statistically significant in expression and biologically linked to the disease phenotype. The intersection of DEGs and disease targets often represents the most promising candidates for further network analysis and experimental validation.
TCMDATA includes a demo DESeq2 result[4] from GSE142025[5] (early DN vs. control) for illustration. Further details about the dataset can be found on the original GEO page.
4.2.1 Load and inspect data
#> 'data.frame': 27183 obs. of 8 variables:
#> $ baseMean : num 31.9 1128.6 61.8 11.7 260.2 ...
#> $ log2FoldChange: num 0.8175 -0.0325 -0.2766 0.0321 0.0982 ...
#> $ lfcSE : num 0.219 0.133 0.266 0.421 0.178 ...
#> $ stat : num 3.7246 -0.2439 -1.0382 0.0763 0.5528 ...
#> $ pvalue : num 0.000196 0.807271 0.299167 0.939176 0.580368 ...
#> $ padj : num 0.00409 0.91812 0.56789 0.97605 0.78759 ...
#> $ names : chr "DDX11L1" "WASH7P" "MIR6859-1" "FAM138A" ...
#> $ g : chr "up" "normal" "normal" "normal" ...
#>
#> down normal up
#> 678 25852 653
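The g column labels each gene as up, down, or normal. The thresholds behind the demo dataset are not stated here, so the sketch below only illustrates how such a label could be derived and then intersected with a disease target list. The gene names, values, and cut-offs (|log2FC| > 1, padj < 0.05) are all assumptions, not properties of GSE142025.

```r
# Toy DESeq2-style table with placeholder genes and illustrative values
deg_demo <- data.frame(
  names          = c("geneA", "geneB", "geneC", "geneD", "geneE"),
  log2FoldChange = c(1.8, -1.4, 0.3, 2.2, -0.2),
  padj           = c(0.001, 0.010, 0.200, NA, 0.800),
  stringsAsFactors = FALSE
)

lfc_cut  <- 1     # assumed fold-change threshold
padj_cut <- 0.05  # assumed adjusted p-value threshold

# DESeq2 can report NA for padj; treat those genes as not significant
sig <- !is.na(deg_demo$padj) & deg_demo$padj < padj_cut

deg_demo$g <- "normal"
deg_demo$g[sig & deg_demo$log2FoldChange >  lfc_cut] <- "up"
deg_demo$g[sig & deg_demo$log2FoldChange < -lfc_cut] <- "down"
table(deg_demo$g)

# Hypothetical disease target list; keep DEGs that are also targets
disease_targets <- c("geneA", "geneB", "geneD", "geneE")
deg_targets <- intersect(deg_demo$names[deg_demo$g != "normal"], disease_targets)
deg_targets
```

Handling NA in padj explicitly matters because a bare comparison such as padj < 0.05 returns NA for those rows, which then propagates into logical subsetting.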
4.2.2 Volcano plot with ivolcano
ivolcano[6] is an R package that provides both static and interactive volcano plot visualizations for differential expression results. In interactive mode, users can explore DEGs dynamically: hover to display gene details, click to open the corresponding entry in an external database, and zoom into specific regions.
By specifying dual thresholds, ivolcano automatically applies a FigureYa-styled color scheme that clearly separates genes into significance tiers.
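To make the idea of significance tiers concrete, the sketch below assigns tiers from two nested sets of cut-offs and draws a static volcano plot with base R graphics. It does not use ivolcano or reproduce its API or color scheme. The simulated data, the threshold values, and the reading of "dual thresholds" as a relaxed and a strict level are all assumptions.

```r
set.seed(1)
n <- 200
demo <- data.frame(
  log2FoldChange = rnorm(n, sd = 1.5),
  padj           = runif(n)^3   # simulated, skewed toward small values
)

# Assumed relaxed and strict cut-offs
lfc_cut  <- c(relaxed = 1,    strict = 2)
padj_cut <- c(relaxed = 0.05, strict = 0.01)

# TRUE where a gene passes both cut-offs at the given level
pass <- function(level) {
  abs(demo$log2FoldChange) > lfc_cut[[level]] & demo$padj < padj_cut[[level]]
}

demo$tier <- "not significant"
demo$tier[pass("relaxed")] <- "significant"
demo$tier[pass("strict")]  <- "highly significant"

tier_cols <- c("not significant"    = "grey70",
               "significant"        = "orange",
               "highly significant" = "red3")

plot(demo$log2FoldChange, -log10(demo$padj),
     col = tier_cols[demo$tier], pch = 16,
     xlab = "log2 fold change", ylab = "-log10(adjusted p-value)")
abline(v = c(-lfc_cut, lfc_cut), h = -log10(padj_cut), lty = 2, col = "grey40")
legend("topright", legend = names(tier_cols), col = tier_cols, pch = 16, bty = "n")
```

Assigning the strict tier last lets it overwrite the relaxed label, so every gene ends up in exactly one tier.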