The goal of diffEnrich is to compare functional enrichment between two experimentally-derived groups of genes or proteins. Given a list of NCBI gene symbols, diffEnrich will perform differential enrichment analysis using the Kyoto Encyclopedia of Genes and Genomes (KEGG) REST API. This package provides a number of functions that are intended to be used in a pipeline (See Figure 1). Briefly, the user provides a KEGG formatted species id for either human, mouse or rat, and the package will download and store species specific ENTREZ gene IDs and map them to their respective KEGG pathways by accessing the KEGG REST API. The KEGG API is used to guarantee the most up-to-date pathway data from KEGG. Next, the user will identify significantly enriched pathways in two different gene sets, and finally, the user will identify pathways that are differentially enriched between the two gene sets. In addition to the analysis pipeline, this package also provides a plotting function.
The KEGG REST API
KEGG is a database resource for understanding high-level functions of a biological system, such as a cell, an organism and an ecosystem, from genomic and molecular-level information https://www.kegg.jp/kegg/kegg1a.html. KEGG is an integrated database resource consisting of eighteen databases that are clustered into 4 main categories: 1) systems information (e.g. hierarchies and maps), 2) genomic information (e.g. genes and proteins), 3) chemical information (e.g. biochemical reactions), and 4) health information (e.g. human disease and drugs) https://www.kegg.jp/kegg/kegg1a.html.
In 2012 KEGG released its first application programming interface (API), and has been adding features and functionality ever since. There are benefits to using an API. First, API’s, like KEGG’s, allow users to perform customized analyses with the most up-to-date versions of the data contained in the database. In addition, accessing the KEGG API is very easy using statistical programming tools like R or Python and integrating data retrieval into user’s code makes the program reproducible. To further enforce reproducibilty diffEnrich adds a date and KEGG release tag to all data files it generates from accessing the API. For update histories and release notes for the KEGG REST API please go here.
Motivating experimental design for differential enrichment
Often high throughput omics studies include a functional enrichment analysis to glean biological insight from a list of candidate genes, proteins, metabolites, etc. Functional enrichment examines whether the number of genes in the list associated with a biological function or particular pathway is more than would be expected by chance. As an example, enrichment of a particular pathway among a list of genes that are differentially expressed after an experimental manipulation may indicate that the pathway has been altered by that manipulation. This analysis is rather straight forward and many solutions have been offered (e.g., [Haung et al., 2009]https://pubmed.ncbi.nlm.nih.gov/19033363/); Kuleshov et al., 2016; Liao et al., 2019; Subramanian et al., 2005). A wide variety of databases have also been used to define these pathways (e.g., Kanehisa and Goto, 2000) and ontologies (e.g., Ashburn et al., 2000).
One key component of a statistically rigorous functional enrichment analysis is the definition of a background data set that can be used to estimate the number of candidate genes that are ‘expected’ to be associated with the pathway by chance, e.g., if 5% of genes in the background data set are associated with a pathway then 5% of candidate gene are expected to be associated with the pathway by chance. For many study designs, the background data set is relatively simple to define (e.g., RNA-Seq analyses where the background data set includes genes expressed above background).
However, for some newer omics technologies, the background data set is hard to define. For example, LC-MS analysis can be used to identify carbonylated proteins ( Peterson et al., 2018; Shearn et al., 2019; Shearn et al., 2018). With this study design, carbonylated proteins are isolated using a BH-derivation and then LC-MS is used to identify peptides in this isolated sample. The most appropriate background data set would be proteins present in that tissue, but this would require a separate analytical analysis. Furthermore, most functional enrichment analyses involve a single gene list. However, in protein modification studies, the typical experimental design compares the presence or absence of particular modified proteins between multiple groups.
When there are two or more gene lists to compare and the background gene list is not clearly defined, as is often the case in protein modification experiments, we propose a differential enrichment analysis. In this analysis, we compare the proportion of genes/proteins from one gene list associated with a particular pathway to the proportion of genes/proteins from a second gene list that are associated with that pathway. To easily execute this analysis, we have designed an R package that uses the KEGG REST API to obtain the most recent version of the KEGG PATHWAY ( Kanehisa and Goto, 2000) database to initially identify functional enrichment within a gene list using the entire KEGG transcriptome as the background data set and then to identify differentially enriched pathways between two gene lists. This R package includes a function to generate a “differential enrichment” graphic.
You can install the released version of diffEnrich from CRAN with:
First we will use the get_kegg function to access the KEGG REST API and download the data sets required to perform our downstream analysis. This function takes two arguments. The first, ‘species’ is required. Currently, diffEnrich supports three species, and the argument is a character string using the KEGG code https://www.pnas.org/doi/10.1073/pnas.0806162105: Homo sapiens (human), use ‘hsa’; Mus musculus (mouse), use ‘mmu’; and Rattus norvegicus (rat), use ‘rno’. The second, ‘path’ is also passed as a character string, and is the path of the directory in which the user would like to write the data sets downloaded from the KEGG REST API. If the user does not provide a path, the data sets will be automatically written to the current working directory using the here::here() functionality. These data sets will be tab delimited files with a name describing the data, and for reproducibility, the date they were generated and the version of KEGG when the API was accessed. In addition to these flat files, get_kegg will also create a named list in R with the three relevant KEGG data sets. The names of this list will describe the data set. For a detailed description of list elements use ?get_kegg.
## run get_kegg() using rat
kegg_rno <- get_kegg('rno')
#> 3 data sets will be written as tab delimited text files
#> File location: /tmp/RtmpaBbzjQ/Rbuildcdb5510a130/diffEnrich
#> Kegg Release: Release_112.0+_11-12_Nov_24
Here are examples of the output files:
ncbi_to_kegg2019-10-17Release_92.0+_10-17_Oct_19.txt
kegg_to_pathway2019-10-17Release_92.0+_10-17_Oct_19.txt
pathway_to_species2019-10-17Release_92.0+_10-17_Oct_19.txt
Note: Because it is assumed that the user might want to use the data sets generated by get_kegg, it is careful not to overwrite data sets with exact names. get_kegg checks the path provided for data sets generated ‘same-day/same-version’, and if it finds even one of the three, it will not re-write any of the data sets. It will still however, let the user know it is not writing out new data sets and still generate the named list object. Users can generate ‘same-day/same-version’ data sets in different directories if they so choose.
## run get_kegg() using rat
kegg_rno <- get_kegg('rno')
#> These files already exist in your working directory. New files will not be generated.
#> Kegg Release: Release_112.0+_11-12_Nov_24
## run get_kegg() using rat
kegg_rno <- get_kegg(read = TRUE, path = "/path/to/files", date = "2019-10-17", release = "92")
Here is an example of the output:
Reading in the following files:
ncbi_to_kegg2019-10-17Release_92.0+_10-17_Oct_19.txt
kegg_to_pathway2019-10-17Release_92.0+_10-17_Oct_19.txt
pathway_to_species2019-10-17Release_92.0+_10-17_Oct_19.txt
File location: ~/Desktop
get_kegg.list.object | Object.description |
---|---|
ncbi_to_kegg | ncbi gene ID <– mapped to –> KEGG gene ID |
kegg_to_pathway | KEGG gene ID <– mapped to –> KEGG pathway ID |
pathway_to_species | KEGG pathway ID <– mapped to –> KEGG pathway species description |
In this step we will use the pathEnrich function to identify KEGG pathways that are enriched (i.e. over-represented) based on a gene list of interest. User gene lists must be character vectors and be formatted as ENTREZ gene IDs. The clusterProfiler package offers a nice function (bitr) that maps gene symbols and Ensembl IDs to ENTREZ gene IDs, and an example can be seen in their vignette.
## View sample gene lists from package data
head(geneLists$list1)
#> [1] "361692" "293654" "293655" "500974" "100361529" "171434"
head(geneLists$list2)
#> [1] "315547" "315548" "315549" "315550" "50938" "58856"
This function may not always use the complete list of genes provided by the user. Specifically, it will only use the genes from the list provided that are also in the most current species list pulled from the KEGG REST API using get_kegg, or from the older KEGG data loaded by the user from a previous get_kegg call. Users can also decide which KEGG pathways should be tested based on how many genes from their gene list are contained in the KEGG pathway. Users can set this parameter by changing the ‘N’ argument. The default is N = 2. The pathEnrich function should be run once for the genes of interest in list 1 and once for the genes of interest in list2. Each pathEnrich call generates a data frame summarizing the results of an enrichment analysis in which a Fisher’s Exact test is used to identify which KEGG pathways are enriched for the user’s list of genes compared to all genes annotated to a KEGG pathway. Users can choose a multiple correction option from those supported by stats::p.adjust. The default is the False Discovery Rate ( Benjamini and Hochberg, 1995), and the default threshold to reach significance is 0.05.
# run pathEnrich using kegg_rno
## List 1
list1_pe <- pathEnrich(gk_obj = kegg, gene_list = geneLists$list1, cutoff = 0.05, N = 2)
## list2
list2_pe <- pathEnrich(gk_obj = kegg, gene_list = geneLists$list2, cutoff = 0.05, N = 2)
KEGG_PATHWAY_ID | KEGG_PATHWAY_description | KEGG_PATHWAY_cnt | KEGG_PATHWAY_in_list | KEGG_DATABASE_cnt | KEG_DATABASE_in_list | expected | enrich_p | p_adj | fold_enrichment |
---|---|---|---|---|---|---|---|---|---|
rno04530 | Tight junction - Rattus norvegicus (rat) | 170 | 19 | 8856 | 295 | 5.662827 | 0.0000035 | 0.0005387 | 3.355214 |
rno05135 | Yersinia infection - Rattus norvegicus (rat) | 128 | 16 | 8856 | 295 | 4.263776 | 0.0000049 | 0.0005387 | 3.752542 |
rno05210 | Colorectal cancer - Rattus norvegicus (rat) | 88 | 12 | 8856 | 295 | 2.931346 | 0.0000318 | 0.0023211 | 4.093683 |
rno05231 | Choline metabolism in cancer - Rattus norvegicus (rat) | 99 | 12 | 8856 | 295 | 3.297764 | 0.0001032 | 0.0044701 | 3.638829 |
rno05213 | Endometrial cancer - Rattus norvegicus (rat) | 58 | 9 | 8856 | 295 | 1.932023 | 0.0001132 | 0.0044701 | 4.658328 |
rno04144 | Endocytosis - Rattus norvegicus (rat) | 275 | 22 | 8856 | 295 | 9.160456 | 0.0001225 | 0.0044701 | 2.401627 |
pathEnrich generates a list object that contains a data frame with 9 columns described below as well as other metadata from the function call. Details also provided in pathEnrich documentation. S3 generic functions for print and summary are provided. The print function prints the results table as a tibble, and the summary function returns the number of pathways that reached statistical significance as well as their descriptions, the number of genes used from the KEGG data base, the KEGG species, and the method used for multiple testing correction.
summary(list1_pe)
#> 219 KEGG pathways were tested.
#> Only KEGG pathways that contained at least 2 genes from gene_list were tested.
#> KEGG pathway species: Rattus norvegicus (rat)
#> 8856 genes from gene_list were in the KEGG data pull.
#> p-value adjustment method: BH
#> 36 pathways reached statistical significance after multiple testing correction at a cutoff of 0.05.
#>
#> Significant pathways:
#> Tight junction
#> Yersinia infection
#> Colorectal cancer
#> Choline metabolism in cancer
#> Endometrial cancer
#> Endocytosis
#> Neurotrophin signaling pathway
#> Thermogenesis
#> Oocyte meiosis
#> VEGF signaling pathway
#> Thyroid hormone signaling pathway
#> Hippo signaling pathway
#> T cell receptor signaling pathway
#> Apoptosis
#> Hepatocellular carcinoma
#> MAPK signaling pathway
#> Focal adhesion
#> Salmonella infection
#> Non-alcoholic fatty liver disease (NAFLD)
#> ErbB signaling pathway
#> Sphingolipid signaling pathway
#> Pancreatic cancer
#> Progesterone-mediated oocyte maturation
#> Alzheimer disease
#> Endocrine resistance
#> Adrenergic signaling in cardiomyocytes
#> IL-17 signaling pathway
#> Chronic myeloid leukemia
#> Dopaminergic synapse
#> Prostate cancer
#> EGFR tyrosine kinase inhibitor resistance
#> Hepatitis C
#> Ras signaling pathway
#> Acute myeloid leukemia
#> Insulin signaling pathway
#> Fc epsilon RI signaling pathway
Column Names | Column Description |
---|---|
KEGG_PATHWAY_ID | KEGG Pathway Identifier |
KEGG_PATHWAY_description | Description of KEGG Pathway (provided by KEGG) |
KEGG_PATHWAY_cnt | Number of Genes in KEGG Pathway |
KEGG_PATHWAY_in_list | Number of Genes from gene list in KEGG Pathway |
KEGG_DATABASE_cnt | Number of Genes in KEGG Database |
KEG_DATABASE_in_list | Number of Genes from gene list in KEGG Database |
expected | Expected number of genes from list to be in KEGG pathway by chance |
enrich_p | P-value for enrichment within the KEGG pathway for list genes |
p_adj | Multiple testing adjusted enrichment p-values (default = False Discovery Rate (Benjamini and Hochberg, 1995)) |
fold_enrichment | Ratio of number of genes observed from the gene list annotated to the KEGG pathway to the number of genes expected from the gene list to be annotated to the KEGG pathway if there was no enrichment (i.e. KEGG_PATHWAY_in_list/expected) |
The diffEnrich function will merge two results from the pathEnrich calls generated above. Specifically, the data frame ‘list1_pe’ and the data frame ‘list2_pe’ will be merged by the following columns: “KEGG_PATHWAY_ID”, “KEGG_PATHWAY_description”, “KEGG_PATHWAY_cnt”, “KEGG_DATABASE_cnt”. This merged data set will then be used to perform differential enrichment using the same method and p-value calculation as described above. Users do have the option of choosing a method for multiple testing adjustment. Users can choose from those supported by stats::p.adjust. The default is the False Discovery Rate ( Benjamini and Hochberg, 1995). KEGG pathways that do not contain any genes from either gene list (i.e., list1_peenrichtableKEGG_PATHWAY_in_list for ‘rno04530’ = 0 AND list2_peenrichtableKEGG_PATHWAY_in_list for ‘rno04530’ = 0) will be removed as these cannot be tested. If this is the case a warning will be printed that tells the user how many pathways were removed. This can be avoided by setting the ‘N’ parameter to a value > 0 in the pathEnrich calls.
## Perform differential enrichment
diff_enrich <- diffEnrich(list1_pe = list1_pe, list2_pe = list2_pe, method = 'none', cutoff = 0.05)
KEGG_PATHWAY_ID | KEGG_PATHWAY_description | KEGG_PATHWAY_cnt | KEGG_DATABASE_cnt | KEGG_PATHWAY_in_list1 | KEGG_DATABASE_in_list1 | expected_list1 | enrich_p_list1 | p_adj_list1 | fold_enrichment_list1 | KEGG_PATHWAY_in_list2 | KEGG_DATABASE_in_list2 | expected_list2 | enrich_p_list2 | p_adj_list2 | fold_enrichment_list2 | odd_ratio | diff_enrich_p | diff_enrich_adjusted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
rno04530 | Tight junction - Rattus norvegicus (rat) | 170 | 8856 | 19 | 295 | 5.662827 | 0.0000035 | 0.0005387 | 3.355214 | 131 | 5308 | 101.89250 | 0.0000015 | 0.0000056 | 1.285669 | 0.3676651 | 0.0002936 | 0.0002936 |
rno05135 | Yersinia infection - Rattus norvegicus (rat) | 128 | 8856 | 16 | 295 | 4.263776 | 0.0000049 | 0.0005387 | 3.752542 | 105 | 5308 | 76.71906 | 0.0000001 | 0.0000003 | 1.368630 | 0.3520039 | 0.0005435 | 0.0005435 |
rno05210 | Colorectal cancer - Rattus norvegicus (rat) | 88 | 8856 | 12 | 295 | 2.931346 | 0.0000318 | 0.0023211 | 4.093683 | 81 | 5308 | 52.74435 | 0.0000000 | 0.0000000 | 1.535709 | 0.3655602 | 0.0032572 | 0.0032572 |
rno05213 | Endometrial cancer - Rattus norvegicus (rat) | 58 | 8856 | 9 | 295 | 1.932023 | 0.0001132 | 0.0044701 | 4.658328 | 55 | 5308 | 34.76332 | 0.0000000 | 0.0000000 | 1.582127 | 0.3328275 | 0.0058775 | 0.0058775 |
rno04660 | T cell receptor signaling pathway - Rattus norvegicus (rat) | 106 | 8856 | 11 | 295 | 3.530940 | 0.0007743 | 0.0129121 | 3.115318 | 79 | 5308 | 63.53297 | 0.0011075 | 0.0022562 | 1.243449 | 0.3901694 | 0.0072048 | 0.0072048 |
rno04657 | IL-17 signaling pathway - Rattus norvegicus (rat) | 95 | 8856 | 9 | 295 | 3.164521 | 0.0042709 | 0.0346419 | 2.844032 | 59 | 5308 | 56.93993 | 0.3737223 | 0.4591045 | 1.036180 | 0.3572935 | 0.0087538 | 0.0087538 |
Column Names | Column Description |
---|---|
KEGG_PATHWAY_ID | KEGG Pathway Identifier |
KEGG_PATHWAY_description | Description of KEGG Pathway (provided by KEGG) |
KEGG_PATHWAY_cnt | Number of Genes in KEGG Pathway |
KEGG_DATABASE_cnt | Number of Genes in KEGG Database |
KEGG_PATHWAY_in_list1 | Number of Genes from gene list 1 in KEGG Pathway |
KEGG_DATABASE_in_list1 | Number of Genes from gene list 1 in KEGG Database |
expected_list1 | Expected number of genes from list 1 to be in KEGG pathway by chance |
enrich_p_list1 | P-value for enrichment of list 1 genes related to KEGG pathway |
p_adj_list1 | Multiple testing adjusted enrichment p-values from gene list 1 (default = False Discovery Rate (Benjamini and Hochberg, 1995)) |
fold_enrichment_list1 | Ratio of number of genes observed from gene list 1 annotated to the KEGG pathway to the number of genes expected from gene list 1 annotated to the KEGG pathway if there was no enrichment (i.e. KEGG_PATHWAY_in_list1/expected_list1) |
KEGG_PATHWAY_in_list2 | Number of Genes from gene list 2 in KEGG Pathway |
KEGG_DATABASE_in_list2 | Number of Genes from gene list 2 in KEGG Database |
expected_list2 | Expected number of genes from list 2 to be in KEGG pathway by chance |
enrich_p_list2 | P-value for enrichment of list 2 genes related to KEGG pathway |
p_adj_list2 | Multiple testing adjusted enrichment p-values from gene list 2 (default = False Discovery Rate (Benjamini and Hochberg, 1995)) |
fold_enrichment_list2 | Ratio of number of genes observed from gene list 2 annotated to the KEGG pathway to the number of genes expected from gene list 2 annotated to the KEGG pathway if there was no enrichment (i.e. KEGG_PATHWAY_in_list2/expected_list2) |
odd_ratio | Odds of a gene from list 2 being from this KEGG pathway / Odds of a gene from list 1 being from this KEGG pathway |
diff_enrich_p | P-value for differential enrichment of this KEGG pathway between list 1 and list 2 |
diff_enrich_adjusted | Multiple testing adjusted differential enrichment p-values (default = False Discovery Rate (Benjamini and Hochberg, 1995)) |
The result of the diffEnrich call is a list object that contains a data frame with the estimated odds ratio generated by the Fisher’s Exact test and the associated p-value as well as other metadata from the function call. S3 generic functions for print and summary are provided. The print function prints the results table as a tibble, and the summary function returns the number of pathways that reached statistical significance as well as their descriptions, the number of genes used from the KEGG database, the KEGG species, the number of pathways that were shared (i.e. had at least 1 gene from each gene list present in the pathway based on what the user chose for N in pathEnrich) between the gene lists and the method used for multiple testing correction.
summary(diff_enrich)
#> 219 KEGG pathways were shared between gene lists and were tested.
#> KEGG pathway species: Rattus norvegicus (rat)
#> 8856 genes from gene_list were in the KEGG data pull.
#> p-value adjustment method: none
#> 19 pathways reached statistical significance after multiple testing correction at a cutoff of 0.05.
#>
#> Significant pathways:
#> Tight junction
#> Yersinia infection
#> Colorectal cancer
#> Endometrial cancer
#> T cell receptor signaling pathway
#> IL-17 signaling pathway
#> Salmonella infection
#> Choline metabolism in cancer
#> Endocytosis
#> VEGF signaling pathway
#> Oocyte meiosis
#> Thermogenesis
#> Hippo signaling pathway
#> Neurotrophin signaling pathway
#> Apoptosis
#> Hepatocellular carcinoma
#> Fc epsilon RI signaling pathway
#> Thyroid hormone signaling pathway
#> Alzheimer disease
plotFoldEnrichment generates a grouped bar plot using ggplot2 and the ggnewscale package. There are 3 arguments: 1) de_res is the dataframe generated from the diffEnrich function, 2) pval is the threshold for the adjusted p-value associated with differential enrichment that will filter which KEGG pathways to plot, and 3) after filtering based on pval N tells the function how many pathways to plot. It is important to make a note that the significance of the fold change is associated with the number of genes in the gene list. Notice that in this example the pathways in gene list 2 have smaller fold changes (shorter bars) than those in list 1, but that many of them are more significant (darker blue). This is because there are more genes in gene list 2 compared to gene list 1.
Ashburner et al. (2000) Gene ontology: tool for the unification of biology. Nat Genet. 25(1):25-29.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B,.57, 289–300
Huang DW. et al. (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37(1):1-13.
Kanehisa, M. and Goto, S. (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27-30.
Kuleshov MV. et al. (2016) Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 44(1): 90 – 97.
Liao, Y. et al. (2019) Gene set analysis toolkit with revamped UIs and APIs, Nucleic Acids Res. 47(1):199 – 205.
Petersen DR. et al. (2018) Elevated Nrf-2 responses are insufficient to mitigate protein carbonylation in hepatospecific PTEN deletion mice. PLoS One. 13(5):e0198139.
Shearn CT. et al. (2019) Cholestatic liver disease results increased production of reactive aldehydes and an atypical periportal hepatic antioxidant response. Free Radic Biol Med;143:101-114. [Epub ahead of print] PubMed PMID: 31377417.
Shearn CT. et al. (2018) Knockout of the Gsta4 Gene in Male Mice Leads to an Altered Pattern of Hepatic Protein Carbonylation and Enhanced Inflammation Following Chronic Consumption of an Ethanol Diet. Alcohol Clin Exp Res. 42(7):1192-1205.
Subramanian T. et al. (2005) Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS 102, 15545-15550.
Yu G. et al. (2012). clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 16(5), 284-287.