This notebook will demonstrate how to:
AnnotationDBI
annotation packagesmsigdbr
packageclusterProfiler
packageIn this notebook, we’ll cover a type of pathway or gene set analysis called over-representation analysis (ORA). The idea behind ORA is relatively straightforward: given a set of genes, do these genes overlap with a pathway more than we expect by chance? The simplicity of only requiring an input gene set (sort of, more on that below) can be attractive.
ORA has some limitations, outlined nicely (and more extensively!) in Khatri et al. (2012). One of the main issues with ORA is that typically all genes are treated as equal – the context of the magnitude of a change we may be measuring is removed and each gene is treated as independent, which can sometimes result in an incorrect estimate of significance.
We will use the clusterProfiler
package (Yu et
al. 2012) to perform ORA. clusterProfiler
has many
built-in functions that will run a specific type of analysis using a
specific source of pathways/gene sets automatically, but for our
purposes we’re going to keep things as general as possible. See the clusterProfiler
book for more information about the package’s full suite of
functionality.
Because different bioinformatics tools often require different types
of gene identifiers, we’ll also cover how to convert between gene
identifiers using AnnotationDbi
Bioconductor packages in this notebook. Check out the AnnotationDbi:
Introduction To Bioconductor Annotation Packages (Carlson 2020.)
vignette for more information.
clusterProfiler
, see Intro
to DGE: Functional Analysis. from Harvard Chan Bioinformatics Core
Training.WebGestaltR
is another R package that can be used for ORA.# Package we'll use for ORA
library(clusterProfiler)
clusterProfiler v4.12.0 For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
If you use clusterProfiler in published research, please cite:
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141
Attaching package: 'clusterProfiler'
The following object is masked from 'package:stats':
filter
# Package that contains MSigDB gene sets in tidy format
library(msigdbr)
# Mus musculus annotation package we'll use for gene identifier conversion
library(org.Mm.eg.db)
Loading required package: AnnotationDbi
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
tapply, union, unique, unsplit, which.max, which.min
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: IRanges
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following object is masked from 'package:clusterProfiler':
rename
The following object is masked from 'package:utils':
findMatches
The following objects are masked from 'package:base':
expand.grid, I, unname
Attaching package: 'IRanges'
The following object is masked from 'package:clusterProfiler':
slice
Attaching package: 'AnnotationDbi'
The following object is masked from 'package:clusterProfiler':
select
# We'll create a directory to specifically hold the ORA results if it doesn't
# exist yet
results_dir <- file.path("results", "leukemia")
if (!dir.exists(results_dir)) {
dir.create(results_dir, recursive = TRUE)
}
For our ORA example, we’re going to use two tables of differential gene expression (DGE) analysis results.
input_dir <- file.path("data", "leukemia")
# This file contains the DGE results for a cell population with high stem cell
# capacity as compared to a cell population
vs_low_file <- file.path(input_dir,
"high_capacity_vs_low_capacity_results.tsv")
# This file contains the DGE results for the high stem cell capacity population
# vs. unsorted cells (e.g., a mixture of capacities)
vs_unsorted_file <- file.path(input_dir,
"high_capacity_vs_unsorted_results.tsv")
We’ll save the table of ORA results (e.g., p-values).
kegg_results_file <- file.path(results_dir, "leukemia_kegg_ora_results.tsv")
We will use gene sets from the Molecular
Signatures Database (MSigDB) from the Broad Institute (Subramanian, Tamayo
et al. 2005). The msigdbr
package contains MSigDB datasets already in the tidy format required by
clusterProfiler
and supports multiple organisms.
Let’s take a look at what organisms the package supports.
msigdbr_species()
The results we’re interested in here come from mouse samples, so we
can obtain just the gene sets relevant to M. musculus with the
species
argument to msigdbr()
.
mm_msigdb_df <- msigdbr(species = "Mus musculus")
MSigDB contains 8 different gene set collections.
H: hallmark gene sets
C1: positional gene sets
C2: curated gene sets
C3: motif gene sets
C4: computational gene sets
C5: GO gene sets
C6: oncogenic signatures
C7: immunologic signatures
In this example, we will use canonical pathways which are (ref):
Gene sets from pathway databases. Usually, these gene sets are canonical representations of a biological process compiled by domain experts.
And are a subset of C2: curated gene sets
. Specifically,
we will use the KEGG (Kyoto
Encyclopedia of Genes and Genomes) pathways.
# Filter the mouse data frame to the KEGG pathways that are included in the
# curated gene sets
mm_kegg_df <- mm_msigdb_df |>
dplyr::filter(gs_cat == "C2", # curated gene sets
gs_subcat == "CP:KEGG") # KEGG pathways
Note: We could specified that we wanted the KEGG gene sets using
the category
and subcategory
arguments of
msigdbr()
, but we’re going for general steps!
colnames(mm_kegg_df)
[1] "gs_cat" "gs_subcat" "gs_name"
[4] "gene_symbol" "entrez_gene" "ensembl_gene"
[7] "human_gene_symbol" "human_entrez_gene" "human_ensembl_gene"
[10] "gs_id" "gs_pmid" "gs_geoid"
[13] "gs_exact_source" "gs_url" "gs_description"
[16] "taxon_id" "ortholog_sources" "num_ortholog_sources"
gs_cat
gs_subcat
gs_name
gene_symbol
entrez_gene
ensembl_gene
human_gene_symbol
human_entrez_gene
human_ensembl_gene
gs_id
gs_pmid
gs_geoid
gs_exact_source
gs_url
gs_description
taxon_id
ortholog_sources
num_ortholog_sources
The clusterProfiler
function we will use requires a data
frame with two columns, where one column contains the term identifier or
name and one column contains gene identifiers that match our gene lists
we want to check for enrichment. Our data frame with KEGG terms contains
Entrez IDs and gene symbols.
vs_low_df <- readr::read_tsv(vs_low_file)
Rows: 35429 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): Gene
dbl (6): baseMean, log2FoldChange, lfcSE, stat, pvalue, padj
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let’s take a peek at the top of the DGE results data frame.
head(vs_low_df)
Our data frame of DGE results contains Ensembl gene identifiers. So
we will need to convert from these identifiers into either the gene
symbols or Entrez IDs that are present in the data we extracted with
msigdbr()
.
We’re going to convert our identifiers to gene symbols because they are a bit more human readable, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!
The annotation package org.Mm.eg.db
contains information
for different identifiers. org.Mm.eg.db
is specific to
Mus musculus – this is what the Mm
in the package
name is referencing. To perform gene identifier conversion in human
(Homo sapiens) we could use org.Hs.eg.db
; we would
use org.Dr.eg.db
for zebrafish (Danio rerio).
We can see what types of IDs are available to us in an annotation
package with keytypes()
.
keytypes(org.Mm.eg.db)
[1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
[6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
[11] "GENETYPE" "GO" "GOALL" "IPI" "MGI"
[16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID"
[21] "PROSITE" "REFSEQ" "SYMBOL" "UNIPROT"
ACCNUM
ALIAS
ENSEMBL
ENSEMBLPROT
ENSEMBLTRANS
ENTREZID
ENZYME
EVIDENCE
EVIDENCEALL
GENENAME
GENETYPE
GO
GOALL
IPI
MGI
ONTOLOGY
ONTOLOGYALL
PATH
PFAM
PMID
PROSITE
REFSEQ
SYMBOL
UNIPROT
Even though we’ll use this package to convert from Ensembl gene IDs
(ENSEMBL
) to gene symbols (SYMBOL
), we could
just as easily use it to convert from an Ensembl transcript ID
(ENSEMBLTRANS
) to Entrez IDs (ENTREZID
).
The function we will use to map from Ensembl gene IDs to gene symbols
is called mapIds()
.
# This returns a named vector which we can convert to a data frame, where
# the keys (Ensembl IDs) are the names
symbols_vector <- mapIds(org.Mm.eg.db, # Specify the annotation package
# The vector of gene identifiers we want to
# map
keys = vs_low_df$Gene,
# What type of gene identifiers we're starting
# with
keytype = "ENSEMBL",
# The type of gene identifier we want returned
column = "SYMBOL",
# In the case of 1:many mappings, return the
# first one. This is default behavior!
multiVals = "first")
'select()' returned 1:many mapping between keys and columns
# We would like a data frame we can join to the DGE results
symbols_df <- data.frame(
ensembl_id = names(symbols_vector),
gene_symbol = symbols_vector
)
This message is letting us know that sometimes Ensembl gene identifiers will map to multiple gene symbols. In this case, it’s also possible that a gene symbol will map to multiple Ensembl IDs.
Now we are ready to add the gene symbols to our data frame with the
DGE results. We can use a join function from the
dplyr
package to do this, which will use the Ensembl gene
IDs in both data frames to determine how to join the rows.
Let’s do this first for the comparison to the low stem cell capacity population.
vs_low_df <- symbols_df |>
# An *inner* join will only return rows that are in both data frames
dplyr::inner_join(vs_low_df,
# The name of the column that contains the Ensembl gene IDs
# in the left data frame and right data frame
by = c("ensembl_id" = "Gene"))
NA
valuesSome of these rows have NA
values in padj
,
which can
happen for a number of reasons including when all samples have zero
counts or a gene has low mean expression.
Let’s filter to rows that do not have any NA
using a function tidyr::drop_na()
. This will also drop
genes that have an Ensembl gene identifier but no gene symbol!
# Remove rows that are not complete (e.g., contain NAs) by filtering to only
# complete rows
vs_low_df <- vs_low_df |>
tidyr::drop_na()
Now we’ll read in our data frame of DGE results from another
comparison. To save us some time during instruction, we’ve
already done the gene identifier conversion and filtering to remove
NA
values in this
notebook. We took a different series of steps to achieve the same
thing, which is often possible in R!
vs_unsorted_df <- readr::read_tsv(vs_unsorted_file)
Rows: 17151 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): Gene, gene_symbol
dbl (6): baseMean, log2FoldChange, lfcSE, stat, pvalue, padj
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
To test for over-representation, we can calculate a p-value with a hypergeometric test (ref).
\(p = 1 - \displaystyle\sum_{i = 0}^{k-1}\frac{ {M \choose i}{ {N-M} \choose {n-i} } } { {N \choose n} }\)
Where N
is the number of genes in the background
distribution, M
is the number of genes in a pathway,
n
is the number of genes we are interested in (our marker
genes), and k
is the number of genes that overlap between
the pathway and our marker genes.
Borrowing an example from clusterProfiler: universal enrichment tool for functional and comparative study (Yu ):
Example: Suppose we have 17,980 genes detected in a Microarray study and 57 genes were differentially expressed. Among the differential expressed genes, 28 are annotated to a gene set.
We’ll call genes that are differentially expressed
gene_in_interest
and genes that are in the gene set
in_gene_set
.
gene_table <- data.frame(
gene_not_interest = c(2613, 15310),
gene_in_interest = c(28, 29)
)
rownames(gene_table) <- c("in_gene_set", "not_in_gene_set")
gene_table
We can assess if the 28 overlapping genes mean that the differentially expressed genes are over-represented in the gene set with the hypergeometric distribution. This corresponds to a one-sided Fisher’s exact test.
fisher.test(gene_table, alternative = "greater")
Fisher's Exact Test for Count Data
data: gene_table
p-value = 1
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
0.110242 Inf
sample estimates:
odds ratio
0.1767937
When we test multiple pathways or gene sets, the p-values then need to be adjusted for multiple hypothesis testing.
Our DGE results are from data published as part of Sachs et al. (2014). The authors sorted populations of primary leukemia cells and examined the stem cell capacity of these cell populations. (This study may sound familiar if you’ve worked on one of our bulk RNA-seq exercise notebooks in the past!)
We compared the population that the authors identified as having high stem cell capacity to a low stem cell capacity population. We also compared the high stem cell capacity cells to a mix of populations (e.g., unsorted cells). You can see the code in here.
We’re interested in what pathways are over-represented in genes that specifically distinguish the high capacity population from the low capacity population.
Let’s generate a list of genes that have higher expression in the high stem cell capacity population compared to the low stem cell capacity population, but we’ll also want to exclude genes that show up in our other comparison to unsorted cells.
We’ll start with the high stem cell capacity vs. low stem cell capacity population comparison. Genes with positive log2 fold-changes (LFC) will be more highly expressed in the high stem cell capacity cells based on how we set up the analysis.
vs_low_genes <- vs_low_df |>
# Filter to the positive LFC and filter based on significance too (padj)
dplyr::filter(log2FoldChange > 0,
padj < 0.05) |>
# Return a vector of gene symbols
dplyr::pull(gene_symbol)
Although we’re picking a commonly used cutoff (FDR < 0.05), it’s still arbitrary and we could just as easily pick a different threshold for our LFC values. When we generate lists of genes of interest for ORA, we typically pick an arbitrary cutoff. This is one of the approach’s weaknesses – we’ve removed all other context.
Now, we’ll take the same steps for our other results.
vs_unsorted_genes <- vs_unsorted_df |>
dplyr::filter(log2FoldChange > 0,
padj < 0.05) |>
dplyr::pull(gene_symbol)
We want genes that are in the first comparison but not in the second!
We can use setdiff()
, a base R function for set operations,
to get the list that we want.
# What genes are in the first set but *not* in the second set
genes_for_ora <- setdiff(vs_low_genes, vs_unsorted_genes)
As we saw above, calculating the p-value relies on the number of
genes in the background distribution. Sometimes folks consider genes
from the entire genome to comprise the background, but in the example
borrowed from the clusterProfiler
authors, they state:
17,980 genes detected in a Microarray study
Where the key phrase is genes detected.
If we were unable to include a gene in one of our differential
expression comparisons because, for example, it had low mean expression
in our experiment and therefore was filtered out in our
tidyr::drop_na()
step, we shouldn’t include in our
background set.
We can use another function for set operations,
intersect()
, to get our background set of genes that were
included in both comparisons.
# intersect() will return the genes in both sets - we are using the entire data
# frame here (complete cases), not just the significant genes
background_set <- intersect(vs_low_df$gene_symbol,
vs_unsorted_df$gene_symbol)
# Remove anything that couldn't reliably be measured/assessed in both from the
# genes of interest list - using intersect() will drop anything in the first set
# that isn't also in the second set
genes_for_ora <- intersect(genes_for_ora, background_set)
enricher()
Now that we have our background set, our genes of interest, and our
pathway information, we’re ready to run ORA using the
enricher()
function.
kegg_ora_results <- enricher(
gene = genes_for_ora, # Genes of interest
pvalueCutoff = 0.05,
pAdjustMethod = "BH", # FDR
universe = background_set, # Background set
# The pathway information should be a data frame with a term name or
# identifier and the gene identifiers
TERM2GENE = dplyr::select(mm_kegg_df,
gs_name,
gene_symbol)
)
Note: using enrichKEGG()
is a shortcut for doing ORA
using KEGG, but the approach we covered here can be used with any gene
sets you’d like!
What is returned by enricher()
?
View(kegg_ora_results)
The information we’re most likely interested in is in the
results
slot. Let’s convert this into a data frame that we
can write to file.
kegg_result_df <- data.frame(kegg_ora_results@result)
We can use a dot plot to visualize our significant enrichment results.
enrichplot::dotplot(kegg_ora_results)
We can use an UpSet plot to visualize the overlap between the gene sets that were returned as significant.
enrichplot::upsetplot(kegg_ora_results)
We can see that some of the DNA repair pathways share genes. Gene sets or pathways aren’t independent, either! Sometimes multiple pathways that show up in our results as significant are indicative of only a handful of genes in our gene list.
We can look at the geneID
column of our results to see
what genes overlap; it’s a good idea to take a look.
kegg_result_df |>
# Use dplyr::select() - the name of the pathway is in the ID column
dplyr::select(ID, geneID)
readr::write_tsv(kegg_result_df, file = kegg_results_file)
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] org.Mm.eg.db_3.19.1 AnnotationDbi_1.66.0 IRanges_2.38.0
[4] S4Vectors_0.42.0 Biobase_2.64.0 BiocGenerics_0.50.0
[7] msigdbr_7.5.1 clusterProfiler_4.12.0 optparse_1.7.5
loaded via a namespace (and not attached):
[1] DBI_1.2.2 gson_0.1.0 shadowtext_0.1.3
[4] gridExtra_2.3 rlang_1.1.3 magrittr_2.0.3
[7] DOSE_3.30.0 compiler_4.4.1 RSQLite_2.3.6
[10] png_0.1-8 vctrs_0.6.5 reshape2_1.4.4
[13] stringr_1.5.1 pkgconfig_2.0.3 crayon_1.5.2
[16] fastmap_1.1.1 XVector_0.44.0 labeling_0.4.3
[19] ggraph_2.2.1 utf8_1.2.4 HDO.db_0.99.1
[22] rmarkdown_2.26 tzdb_0.4.0 enrichplot_1.24.0
[25] UCSC.utils_1.0.0 purrr_1.0.2 bit_4.0.5
[28] xfun_0.43 zlibbioc_1.50.0 cachem_1.0.8
[31] aplot_0.2.2 GenomeInfoDb_1.40.0 jsonlite_1.8.8
[34] blob_1.2.4 highr_0.10 BiocParallel_1.38.0
[37] tweenr_2.0.3 parallel_4.4.1 R6_2.5.1
[40] bslib_0.7.0 stringi_1.8.3 RColorBrewer_1.1-3
[43] jquerylib_0.1.4 GOSemSim_2.30.0 Rcpp_1.0.12
[46] knitr_1.46 readr_2.1.5 Matrix_1.7-0
[49] splines_4.4.1 igraph_2.0.3 tidyselect_1.2.1
[52] qvalue_2.36.0 yaml_2.3.8 viridis_0.6.5
[55] codetools_0.2-20 lattice_0.22-6 tibble_3.2.1
[58] plyr_1.8.9 treeio_1.28.0 withr_3.0.0
[61] KEGGREST_1.44.0 evaluate_0.23 gridGraphics_0.5-1
[64] scatterpie_0.2.2 getopt_1.20.4 polyclip_1.10-6
[67] ggupset_0.3.0 Biostrings_2.72.0 ggtree_3.12.0
[70] pillar_1.9.0 ggfun_0.1.4 generics_0.1.3
[73] vroom_1.6.5 hms_1.1.3 ggplot2_3.5.1
[76] tidytree_0.4.6 munsell_0.5.1 scales_1.3.0
[79] glue_1.7.0 lazyeval_0.2.2 tools_4.4.1
[82] data.table_1.15.4 fgsea_1.30.0 babelgene_22.9
[85] fs_1.6.4 graphlayouts_1.1.1 fastmatch_1.1-4
[88] tidygraph_1.3.1 cowplot_1.1.3 grid_4.4.1
[91] ape_5.8 tidyr_1.3.1 colorspace_2.1-0
[94] nlme_3.1-164 patchwork_1.2.0 GenomeInfoDbData_1.2.12
[97] ggforce_0.4.2 cli_3.6.2 fansi_1.0.6
[100] viridisLite_0.4.2
[ reached getOption("max.print") -- omitted 15 entries ]