Chapter 3 Molecule annotation

After identifying candidate herbs and their associated molecules through the approaches described in chapter 2, it is often necessary to annotate these compounds with chemical descriptors and identifiers from public databases. TCMDATA provides a suite of PubChem[1] utilities to facilitate compound annotation and similarity searching.


3.1 Resolving compound identifiers

The PubChem Compound Database assigns each chemical structure a unique identifier (CID). TCMDATA offers two functions to resolve molecule names or structural representations to CIDs.

3.1.1 Using resolve_cid()

resolve_cid() supports multiple identifier types and uses the PubChem PUG-REST API directly. It accepts CIDs, SMILES strings, InChI, InChIKey, or compound names.

library(TCMDATA)

# Get CIDs for Lingzhi (Ganoderma) molecules
herbs <- c("灵芝")
lz <- search_herb(herb = herbs, type = "Herb_cn_name")

# Sample a few molecules
lz_mol <- head(unique(lz$molecule), 5)

# Resolve molecule names to CIDs
lz_mol_cid <- resolve_cid(lz_mol, from = "name")
print(lz_mol_cid)
# [1] "3,4-Dihydroxybenzoic acid" "4-Hydroxybenzoic Acid"     "Adenine"                  
# [4] "Uracil"                    "Thymidine" 

💡 Note: Check your network connection if you encounter errors during CID resolution.

The from argument specifies the input identifier type. Supported values include "cid", "smiles", "inchi", "inchikey", and "name". For SMILES and InChI inputs, the function automatically handles URL encoding and uses POST requests to avoid length limitations.

3.1.2 Using getcid()

getcid() wraps the webchem::get_cid() function and adds built-in rate limiting (≤5 requests per second) to comply with PubChem’s usage policy. It is particularly useful when querying large batches of compound names.

# Alternative method using webchem backend
cid_results <- getcid(lz_mol, from = "name", match = "first")
head(cid_results)

3.2 Retrieving chemical properties

Once CIDs are obtained, getprops() retrieves a range of chemical descriptors from PubChem, including molecular formula, molecular weight, IUPAC name, canonical SMILES, InChIKey, and LogP values.

# Get properties for resolved CIDs
props <- getprops(lz_mol_cid, 
                  properties = c("MolecularFormula", 
                                "MolecularWeight", 
                                "CanonicalSMILES", 
                                "InChIKey"))
head(props)
# A tibble: 5 × 6
#  cid   CID   MolecularFormula MolecularWeight ConnectivitySMILES               InChIKey     
#  <chr> <chr> <chr>            <chr>           <chr>                            <chr>        
# 1 72    72    C7H6O4           154.12          C1=CC(=C(C=C1C(=O)O)O)O          YQUVCSBJEUQK…
# 2 135   135   C7H6O3           138.12          C1=CC(=CC=C1C(=O)O)O             FJKROLUGYXJW…
# 3 190   190   C5H5N5           135.13          C1=NC2=NC=NC(=C2N1)N             GFFGJBXGBJIS…
# 4 1174  1174  C4H4N2O2         112.09          C1=CNC(=O)NC1=O                  ISAKRJDGNUQO…
# 5 5789  5789  C10H14N2O5       242.23          CC1=CN(C(=O)NC1=O)C2CC(C(O2)CO)O IQFYYKKMVGJF…

The function internally uses rate limiting and retry logic to handle transient API errors. By default, it retrieves six common properties: MolecularFormula, MolecularWeight, IUPACName, CanonicalSMILES, InChIKey, and XLogP. Additional properties supported by PubChem can be specified via the properties argument.


3.3 Similarity searching

Structural similarity searches are useful for identifying compounds with similar chemical scaffolds. compound_similarity() performs Tanimoto-based 2D similarity searches on PubChem using the Fast Similarity API.

# Search for compounds similar to Quercetin (CID 5280343)
sim_hits <- compound_similarity(
  query = "5280343",
  from = "cid",
  threshold = 90,
  topn = 10,
  compute_score = TRUE
)

head(sim_hits)
# A tibble: 5 × 6
#  cid   CID   MolecularFormula MolecularWeight ConnectivitySMILES               InChIKey     
#  <chr> <chr> <chr>            <chr>           <chr>                            <chr>        
# 1 72    72    C7H6O4           154.12          C1=CC(=C(C=C1C(=O)O)O)O          YQUVCSBJEUQK…
# 2 135   135   C7H6O3           138.12          C1=CC(=CC=C1C(=O)O)O             FJKROLUGYXJW…
# 3 190   190   C5H5N5           135.13          C1=NC2=NC=NC(=C2N1)N             GFFGJBXGBJIS…
# 4 1174  1174  C4H4N2O2         112.09          C1=CNC(=O)NC1=O                  ISAKRJDGNUQO…
# 5 5789  5789  C10H14N2O5       242.23          CC1=CN(C(=O)NC1=O)C2CC(C(O2)CO)O IQFYYKKMVGJF…

The threshold parameter controls the minimum similarity percentage (0–100), and topn limits the number of returned hits. Setting compute_score = TRUE enables local Tanimoto score calculation using the rcdk package. This requires a working Java installation.

Alternative query types:

# Query by SMILES string
sim_by_smiles <- compound_similarity(
  query = "C1=CC(=C(C=C1O)O)C2=C(C(=O)C3=C(C=C(C=C3O2)O)O)O",
  from = "smiles",
  threshold = 85,
  topn = 5,
  compute_score = FALSE
)
# Query by compound name
sim_by_name <- compound_similarity(
  query = "Quercetin",
  from = "name",
  threshold = 90,
  topn = 10,
  compute_score = FALSE
)

The function returns a tibble with columns query_cid, query, hit_cid, hit_smiles, and score (if computed). This enables efficient filtering and prioritization of structurally related molecules for downstream pharmacological analysis.


3.4 References

  1. Sayers EW, Bolton EE, Fine AM, et al. (2026). “Database resources of the National Center for Biotechnology Information in 2026.” Nucleic Acids Research, 54(D1), D20–D27. doi: 10.1093/nar/gkaf1060.

3.5 Session information

sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] enrichplot_1.30.5  TCMDATA_0.0.0.9000
#> 
#> loaded via a namespace (and not attached):
#>   [1] DBI_1.3.0               gson_0.1.0              rematch2_2.1.2         
#>   [4] rlang_1.1.7             magrittr_2.0.4          DOSE_4.4.0             
#>   [7] compiler_4.5.2          RSQLite_2.4.6           png_0.1-8              
#>  [10] systemfonts_1.3.2       vctrs_0.7.1             reshape2_1.4.5         
#>  [13] stringr_1.6.0           pkgconfig_2.0.3         crayon_1.5.3           
#>  [16] fastmap_1.2.0           XVector_0.50.0          labeling_0.4.3         
#>  [19] rmarkdown_2.30          purrr_1.2.1             bit_4.6.0              
#>  [22] xfun_0.56               cachem_1.1.0            aplot_0.2.9            
#>  [25] jsonlite_2.0.0          blob_1.3.0              tidydr_0.0.6           
#>  [28] tweenr_2.0.3            BiocParallel_1.44.0     cluster_2.1.8.1        
#>  [31] parallel_4.5.2          R6_2.6.1                bslib_0.10.0           
#>  [34] stringi_1.8.7           RColorBrewer_1.1-3      jquerylib_0.1.4        
#>  [37] GOSemSim_2.36.0         Rcpp_1.1.1              Seqinfo_1.0.0          
#>  [40] bookdown_0.46           knitr_1.51              ggtangle_0.1.1         
#>  [43] R.utils_2.13.0          IRanges_2.44.0          Matrix_1.7-4           
#>  [46] splines_4.5.2           igraph_2.2.2            tidyselect_1.2.1       
#>  [49] qvalue_2.42.0           rstudioapi_0.18.0       yaml_2.3.12            
#>  [52] codetools_0.2-20        lattice_0.22-7          tibble_3.3.1           
#>  [55] plyr_1.8.9              withr_3.0.2             Biobase_2.70.0         
#>  [58] treeio_1.34.0           KEGGREST_1.50.0         S7_0.2.1               
#>  [61] evaluate_1.0.5          gridGraphics_0.5-1      polyclip_1.10-7        
#>  [64] scatterpie_0.2.6        Biostrings_2.78.0       pillar_1.11.1          
#>  [67] ggtree_4.0.4            stats4_4.5.2            clusterProfiler_4.18.4 
#>  [70] ggfun_0.2.0             generics_0.1.4          paletteer_1.7.0        
#>  [73] S4Vectors_0.48.0        ggplot2_4.0.2           scales_1.4.0           
#>  [76] tidytree_0.4.7          glue_1.8.0              gdtools_0.5.0          
#>  [79] lazyeval_0.2.2          tools_4.5.2             ggnewscale_0.5.2       
#>  [82] data.table_1.18.2.1     fgsea_1.36.2            forcats_1.0.1          
#>  [85] ggiraph_0.9.6           fs_1.6.7                fastmatch_1.1-8        
#>  [88] cowplot_1.2.0           grid_4.5.2              tidyr_1.3.2            
#>  [91] ape_5.8-1               AnnotationDbi_1.72.0    nlme_3.1-168           
#>  [94] patchwork_1.3.2         ggforce_0.5.0           cli_3.6.5              
#>  [97] rappdirs_0.3.4          fontBitstreamVera_0.1.1 dplyr_1.2.0            
#> [100] gtable_0.3.6            R.methodsS3_1.8.2       yulab.utils_0.2.4      
#> [103] sass_0.4.10             digest_0.6.39           fontquiver_0.2.1       
#> [106] BiocGenerics_0.56.0     ggrepel_0.9.7           ggplotify_0.1.3        
#> [109] htmlwidgets_1.6.4       farver_2.1.2            memoise_2.0.1          
#> [112] htmltools_0.5.9         R.oo_1.27.1             lifecycle_1.0.5        
#> [115] httr_1.4.8              GO.db_3.22.0            fontLiberation_0.1.0   
#> [118] bit64_4.6.0-1           MASS_7.3-65