Chapter 3 Molecule annotation

After identifying candidate herbs and their associated molecules through the approaches described in chapter 2, it is often necessary to annotate these compounds with chemical descriptors and identifiers from public databases. TCMDATA provides a suite of PubChem[1] utilities to facilitate compound annotation and similarity searching.


3.1 Resolving compound identifiers

The PubChem Compound Database assigns each chemical structure a unique identifier (CID). TCMDATA offers two functions to resolve molecule names or structural representations to CIDs.

3.1.1 Using resolve_cid()

resolve_cid() supports multiple identifier types and uses the PubChem PUG-REST API directly. It accepts CIDs, SMILES strings, InChI, InChIKey, or compound names.

library(TCMDATA)

# Get CIDs for Lingzhi (Ganoderma) molecules
herbs <- c("灵芝")
lz <- search_herb(herb = herbs, type = "Herb_cn_name")

# Sample a few molecules
lz_mol <- head(unique(lz$molecule), 5)

# Resolve molecule names to CIDs
lz_mol_cid <- resolve_cid(lz_mol, from = "name")
print(lz_mol_cid)
# [1] "3,4-Dihydroxybenzoic acid" "4-Hydroxybenzoic Acid"     "Adenine"                  
# [4] "Uracil"                    "Thymidine" 

💡 Note: Check your network connection if you encounter errors during CID resolution.

The from argument specifies the input identifier type. Supported values include "cid", "smiles", "inchi", "inchikey", and "name". For SMILES and InChI inputs, the function automatically handles URL encoding and uses POST requests to avoid length limitations.

3.1.2 Using getcid()

getcid() wraps the webchem::get_cid() function and adds built-in rate limiting (≤5 requests per second) to comply with PubChem’s usage policy. It is particularly useful when querying large batches of compound names.

# Alternative method using webchem backend
cid_results <- getcid(lz_mol, from = "name", match = "first")

3.2 Retrieving chemical properties

Once CIDs are obtained, getprops() retrieves a range of chemical descriptors from PubChem, including molecular formula, molecular weight, IUPAC name, canonical SMILES, InChIKey, and LogP values.

# Get properties for resolved CIDs
props <- getprops(lz_mol_cid, properties = c("MolecularFormula", 
                                             "MolecularWeight", 
                                             "CanonicalSMILES", 
                                             "InChIKey"))

head(props)
# A tibble: 5 × 6
#  cid   CID   MolecularFormula MolecularWeight ConnectivitySMILES               InChIKey        
#  <chr> <chr> <chr>            <chr>           <chr>                            <chr>           
# 1 72    72    C7H6O4           154.12          C1=CC(=C(C=C1C(=O)O)O)O          YQUVCSBJEUQKSH-…
# 2 135   135   C7H6O3           138.12          C1=CC(=CC=C1C(=O)O)O             FJKROLUGYXJWQN-…
# 3 190   190   C5H5N5           135.13          C1=NC2=NC=NC(=C2N1)N             GFFGJBXGBJISGV-…
# 4 1174  1174  C4H4N2O2         112.09          C1=CNC(=O)NC1=O                  ISAKRJDGNUQOIC-…
# 5 5789  5789  C10H14N2O5       242.23          CC1=CN(C(=O)NC1=O)C2CC(C(O2)CO)O IQFYYKKMVGJFEH-…

The function internally uses rate limiting and retry logic to handle transient API errors. By default, it retrieves six common properties: MolecularFormula, MolecularWeight, IUPACName, CanonicalSMILES, InChIKey, and XLogP. Additional properties supported by PubChem can be specified via the properties argument.


3.3 Similarity searching

Structural similarity searches are useful for identifying compounds with similar chemical scaffolds. compound_similarity() performs Tanimoto-based 2D similarity searches on PubChem using the Fast Similarity API.

# Search for compounds similar to Quercetin (CID 5280343)
sim_hits <- compound_similarity(
  query = "5280343",
  from = "cid",
  threshold = 90,
  topn = 10,
  compute_score = TRUE)

head(sim_hits)
# A tibble: 5 × 6
#  cid   CID   MolecularFormula MolecularWeight ConnectivitySMILES               InChIKey     
#  <chr> <chr> <chr>            <chr>           <chr>                            <chr>        
# 1 72    72    C7H6O4           154.12          C1=CC(=C(C=C1C(=O)O)O)O          YQUVCSBJEUQK…
# 2 135   135   C7H6O3           138.12          C1=CC(=CC=C1C(=O)O)O             FJKROLUGYXJW…
# 3 190   190   C5H5N5           135.13          C1=NC2=NC=NC(=C2N1)N             GFFGJBXGBJIS…
# 4 1174  1174  C4H4N2O2         112.09          C1=CNC(=O)NC1=O                  ISAKRJDGNUQO…
# 5 5789  5789  C10H14N2O5       242.23          CC1=CN(C(=O)NC1=O)C2CC(C(O2)CO)O IQFYYKKMVGJF…

The threshold parameter controls the minimum similarity percentage (0–100), and topn limits the number of returned hits. Setting compute_score = TRUE enables local Tanimoto score calculation using the rcdk package. This requires a working Java installation.

Alternative query types:

# Query by SMILES string
sim_by_smiles <- compound_similarity(
  query = "C1=CC(=C(C=C1O)O)C2=C(C(=O)C3=C(C=C(C=C3O2)O)O)O",
  from = "smiles",
  threshold = 85,
  topn = 5,
  compute_score = FALSE)
# Query by compound name
sim_by_name <- compound_similarity(
  query = "Quercetin",
  from = "name",
  threshold = 90,
  topn = 10,
  compute_score = FALSE)

The function returns a tibble with columns query_cid, query, hit_cid, hit_smiles, and score (if computed). This enables efficient filtering and prioritization of structurally related molecules for downstream pharmacological analysis.


3.4 Download molecular structures and format conversion

To support downstream docking (such as AutoDock Vina, usually require ligand and receptor structures) and structure-based analyses, TCMDATA also provides utilities for retrieving ligand and receptor structures and for converting molecular file formats. These functions are designed to streamline routine preprocessing steps and improve reproducibility in computational pharmacology workflows.

3.4.1 Downloading ligand structures from PubChem

download_ligand_structure() retrieves SDF files for a given PubChem CID. By default (type = "auto"), the function prioritizes 3D conformers and automatically falls back to 2D records when 3D coordinates are unavailable.

Here, we download the structure for CID 2244 (Quercetin) for demonstration. If a curated 3D structure exists in PubChem, it will be downloaded; otherwise, the function will retrieve the 2D structure.

# Download ligand structure by CID (auto: 3D if available, otherwise 2D)
lig_sdf <- download_ligand_structure(
  cid = 2244,
  type = "auto",
  destdir = "structures/ligands",
  overwrite = FALSE)

Then, you can go to structures/ligands to check the downloaded SDF file. This behavior is particularly useful in high-throughput settings, where compounds may differ in the availability of curated 3D structures.

3.4.2 Downloading receptor structures from the RCSB PDB

For receptor (protein), download_receptor_structure() downloads macromolecular structures from the RCSB Protein Data Bank using a 4-character PDB identifier. The function supports both legacy PDB format (format = "pdb") and PDBx/mmCIF format (format = "cif").

Here, we download the structure for PDB ID 4hhb (human hemoglobin) in PDB format.

# Download receptor structure by PDB ID
rec_pdb <- download_receptor_structure(
  pdb_id = "4hhb",
  format = "pdb",
  destdir = "structures/receptors",
  overwrite = FALSE)

3.4.3 Converting structure file formats

For molecular docking, different software may require specific structure formats (e.g., PDBQT for AutoDock Vina). It is neccessary to convert the downloaded structures into compatible formats.

Open Babel is a widely used open-source toolkit for chemical informatics that supports interconversion between diverse molecular file formats [2]. To use convert_structure(), users must first install Open Babel and ensure that the obabel executable is available on the system PATH.

On macOS, Open Babel can be installed with Homebrew (brew install open-babel). On Ubuntu/Debian systems, it can be installed with sudo apt install openbabel. For other platforms, please refer to the official installation guide.

After installation, verify that obabel is discoverable from the command line.

convert_structure() performs general-purpose format conversion through Open Babel (obabel) and supports common structure formats including SDF, PDB, MOL2, XYZ, and SMILES. Optional arguments allow hydrogen addition and 3D coordinate generation during conversion.

Here, we demonstrate converting a ligand SDF file to PDB format while adding hydrogens. The gen3d = FALSE argument preserves existing coordinates if available. Before running this step, ensure you have installed Open Babel and that the obabel command is accessible in your system’s PATH.

command -v obabel
# Example output on macOS:
/opt/homebrew/bin/obabel

Then, you can run the following code to convert the ligand structure. This step is essential for preparing ligand files for docking software that may require specific formats and explicit hydrogens.

# Convert ligand SDF to PDB and add hydrogens
lig_pdb <- convert_structure(
  input_file = lig_sdf,
  output_type = "pdb",
  add_hydrogens = TRUE,
  gen3d = FALSE,
  overwrite = TRUE)

3.5 References

  1. Sayers EW, Bolton EE, Fine AM, et al. (2026). “Database resources of the National Center for Biotechnology Information in 2026.” Nucleic Acids Research, 54(D1), D20–D27. doi: 10.1093/nar/gkaf1060.
  2. https://github.com/openbabel/openbabel

3.6 Session information

sessionInfo()
#> R version 4.5.3 (2026-03-11)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] enrichplot_1.30.5 TCMDATA_0.1.0    
#> 
#> loaded via a namespace (and not attached):
#>   [1] DBI_1.3.0               gson_0.1.0              rlang_1.2.0            
#>   [4] magrittr_2.0.5          DOSE_4.4.0              compiler_4.5.3         
#>   [7] RSQLite_2.4.6           png_0.1-9               systemfonts_1.3.2      
#>  [10] vctrs_0.7.3             reshape2_1.4.5          stringr_1.6.0          
#>  [13] pkgconfig_2.0.3         crayon_1.5.3            fastmap_1.2.0          
#>  [16] XVector_0.50.0          labeling_0.4.3          rmarkdown_2.31         
#>  [19] purrr_1.2.2             bit_4.6.0               xfun_0.57              
#>  [22] cachem_1.1.0            aplot_0.2.9             jsonlite_2.0.0         
#>  [25] blob_1.3.0              tidydr_0.0.6            tweenr_2.0.3           
#>  [28] BiocParallel_1.44.0     cluster_2.1.8.2         parallel_4.5.3         
#>  [31] R6_2.6.1                bslib_0.10.0            stringi_1.8.7          
#>  [34] RColorBrewer_1.1-3      jquerylib_0.1.4         GOSemSim_2.36.0        
#>  [37] Rcpp_1.1.1              Seqinfo_1.0.0           bookdown_0.46          
#>  [40] knitr_1.51              ggtangle_0.1.1          R.utils_2.13.0         
#>  [43] IRanges_2.44.0          Matrix_1.7-5            splines_4.5.3          
#>  [46] igraph_2.2.3            tidyselect_1.2.1        qvalue_2.42.0          
#>  [49] rstudioapi_0.18.0       yaml_2.3.12             codetools_0.2-20       
#>  [52] lattice_0.22-9          tibble_3.3.1            plyr_1.8.9             
#>  [55] withr_3.0.2             Biobase_2.70.0          treeio_1.34.0          
#>  [58] KEGGREST_1.50.0         S7_0.2.1                evaluate_1.0.5         
#>  [61] gridGraphics_0.5-1      polyclip_1.10-7         scatterpie_0.2.6       
#>  [64] Biostrings_2.78.0       pillar_1.11.1           ggtree_4.0.5           
#>  [67] stats4_4.5.3            clusterProfiler_4.18.4  ggfun_0.2.0            
#>  [70] generics_0.1.4          S4Vectors_0.48.1        ggplot2_4.0.2          
#>  [73] scales_1.4.0            tidytree_0.4.7          glue_1.8.0             
#>  [76] gdtools_0.5.0           lazyeval_0.2.3          tools_4.5.3            
#>  [79] ggnewscale_0.5.2        data.table_1.18.2.1     fgsea_1.36.2           
#>  [82] ggiraph_0.9.6           fs_2.0.1                fastmatch_1.1-8        
#>  [85] cowplot_1.2.0           grid_4.5.3              tidyr_1.3.2            
#>  [88] ape_5.8-1               AnnotationDbi_1.72.0    nlme_3.1-168           
#>  [91] patchwork_1.3.2         ggforce_0.5.0           cli_3.6.6              
#>  [94] rappdirs_0.3.4          fontBitstreamVera_0.1.1 dplyr_1.2.1            
#>  [97] gtable_0.3.6            R.methodsS3_1.8.2       yulab.utils_0.2.4      
#> [100] sass_0.4.10             digest_0.6.39           fontquiver_0.2.1       
#> [103] BiocGenerics_0.56.0     ggrepel_0.9.8           ggplotify_0.1.3        
#> [106] htmlwidgets_1.6.4       farver_2.1.2            memoise_2.0.1          
#> [109] htmltools_0.5.9         R.oo_1.27.1             lifecycle_1.0.5        
#> [112] httr_1.4.8              GO.db_3.22.0            fontLiberation_0.1.0   
#> [115] bit64_4.6.0-1           MASS_7.3-65