This notebook will demonstrate how to:
msigdbr
packageclusterProfiler
packageenrichplot
packageIn this notebook, we’ll analyze the differential expression results from the last notebook.
GSEA is a functional class scoring (FCS) approach to pathway analysis that was first introduced in Subramanian et al. (2005). The rationale behind FCS approaches is that small changes in individual genes that participate in the same biological process or pathway can be significant and of biological interest.
There are 3 general steps in FCS methods (Khatri et al. 2012):
clusterProfiler
for GSEA, see
Intro
to DGE: Functional Analysis. from Harvard Chan Bioinformatics Core
Training.clusterProfiler
here uses
fgsea
(Fast Gene Set Enrichment Analysis) under the hood.
You can read more about fgsea
in Korotkevich et al.
(2021).# Package to run GSEA
library(clusterProfiler)
clusterProfiler v4.12.0 For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
If you use clusterProfiler in published research, please cite:
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141
Attaching package: 'clusterProfiler'
The following object is masked from 'package:stats':
filter
# Package that contains the MSigDB gene sets in tidy format
library(msigdbr)
# We'll use the marker genes as GSEA input
rms_analysis_dir <- file.path("analysis", "rms")
# We'll create a directory to specifically hold the pathway results if it doesn't
# exist yet
results_dir <- file.path(rms_analysis_dir, "pathway-analysis")
fs::dir_create(results_dir)
input_file <- file.path(rms_analysis_dir,
"deseq",
"rms_myoblast_deseq_results.tsv")
We’ll save our table of GSEA results as a TSV.
output_file <- file.path(results_dir,
"rms_myoblast_gsea_results.tsv")
We will use gene sets from the Molecular
Signatures Database (MSigDB) from the Broad Institute (Subramanian, Tamayo
et al. 2005). The msigdbr
package contains MSigDB datasets already in the tidy format required by
clusterProfiler
and supports multiple organisms.
Let’s take a look at what organisms the package supports.
msigdbr_species()
MSigDB contains 8 different gene set collections.
H: hallmark gene sets
C1: positional gene sets
C2: curated gene sets
C3: motif gene sets
C4: computational gene sets
C5: GO gene sets
C6: oncogenic signatures
C7: immunologic signatures
We’ll use the Hallmark collection for GSEA. Here’s an excerpt of the collection description:
Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA.
Notably, there are only 50 gene sets included in this collection. The fewer gene sets we test, the lower our multiple hypothesis testing burden.
We can retrieve only the Hallmark gene sets by specifying
category = "H"
to the msigdbr()
function.
hs_hallmarks_df <- msigdbr(species = "Homo sapiens",
category = "H")
Adapted from refine.bio examples
Figure 1. Subramanian et al. (2005).
GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic. This score reflects whether or not a gene set or pathway is over-represented at the top or bottom of the gene rankings (Subramanian et al. 2005; Yu)
Specifically, all genes are ranked from most positive to most negative based on their statistic and a running sum is calculated: Starting with the most highly ranked genes, the running sum increases for each gene in the pathway and decreases for each gene not in the pathway. The enrichment score for a pathway is the running sum’s maximum deviation from zero. GSEA also assesses statistical significance of the scores for each pathway through permutation testing. As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing (Subramanian et al. 2005; Yu).
The implementation of GSEA we use in here examples requires a gene list ordered by some statistic and input gene sets. When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked.
deseq_df <- readr::read_tsv(input_file)
Rows: 60319 Columns: 27
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): ensembl_id, gene_symbol
dbl (25): baseMean, log2FoldChange, lfcSE, pvalue, padj, SCPCL000478.mean, S...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(deseq_df)
This data frame uses Ensembl gene identifiers. We’ll need to make sure our gene sets use the same identifiers. Let’s take a look at the first few rows of the data frame that contains the hallmark gene sets.
head(hs_hallmarks_df)
We can see that the gene sets from msigdbr
have Ensembl
gene identifiers associated with them, so we don’t need to do any
conversion. However, we’ll need to pass the correct column to the
function that runs GSEA.
If we needed to do gene identifier conversion, we would likely use
the AnnotationDbi
package. You can see an example in this
Harvard Chan Bioinformatics Core Training material: https://hbctraining.github.io/DGE_workshop_salmon_online/lessons/AnnotationDbi_lesson.html
One good thing about Ensembl gene identifiers is that they are less likely to be duplicated than, for example, gene symbols. (Multiple Ensembl gene identifiers can map to the same symbol.)
The GSEA approach requires on discriminating between genes that are in a gene set and those that are not. Practically speaking, gene sets are just collections of gene identifiers! When the function we use for GSEA pre-ranked gets a list with duplicated gene identifiers, it can produce unexpected results. So, let’s check for duplicates in the data frame of DESeq2 results.
any(duplicated(deseq_df$ensembl_id))
[1] FALSE
There are no duplicates for us to worry about!
The GSEA()
function takes a pre-ranked (sorted) named
vector of statistics, where the names in the vector are gene
identifiers. This is step 1 – gene-level statistics.
lfc_vector <- deseq_df |>
# Extract a vector of `log2FoldChange` named by `ensembl_id`
dplyr::pull(log2FoldChange, name = ensembl_id)
lfc_vector <- sort(lfc_vector, decreasing = TRUE)
Let’s look at the top ranked values.
# Look at first entries of the log fold change vector
head(lfc_vector)
ENSG00000263366 ENSG00000223760 ENSG00000253377 ENSG00000265843 ENSG00000104722
11.364714 10.573752 10.476990 10.199449 10.019651
ENSG00000228835
9.898818
And the bottom of the list.
# Look at the last entries of the log fold change vector
tail(lfc_vector)
ENSG00000269186 ENSG00000268388 ENSG00000165606 ENSG00000285640 ENSG00000118432
-10.93216 -11.35119 -11.36925 -11.90034 -11.92082
ENSG00000184221
-12.12577
Now for the analysis!
We can use the GSEA()
function to perform GSEA with any
generic set of gene sets, but there are several functions for using
specific, commonly used gene sets (e.g., gseKEGG()
).
gsea_results <- GSEA(geneList = lfc_vector, # ordered ranked gene list
minGSSize = 25, # minimum gene set size
maxGSSize = 500, # maximum gene set set
pvalueCutoff = 0.05,
pAdjustMethod = "BH", # correction for multiple hypothesis testing
TERM2GENE = dplyr::select(hs_hallmarks_df,
gs_name,
ensembl_gene)) # pass the correct identifier column
using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
preparing geneSet collections...
GSEA analysis...
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (20.32% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
leading edge analysis...
done...
Let’s take a look at the GSEA results.
View(gsea_results@result |>
dplyr::arrange(dplyr::desc(NES))
)
Normalized enrichment scores (NES) are enrichment scores that are scaled to make gene sets that contain different number of genes comparable.
Pathways with significant, highly positive NES are enriched in ERMS myoblasts, whereas pathways with significant, highly negative NES are enriched in ARMS myoblasts.
Let’s write these results to file.
gsea_results@result |> readr::write_tsv(output_file)
We can visualize GSEA results for individual pathways or gene sets
using enrichplot::gseaplot()
. Let’s take a look at 3
different pathways – one with a highly positive NES, one with a highly
negative NES, and one that was not a significant result – to get more
insight into how ES are calculated.
Let’s take look at a pathway with a highly positive NES
(HALLMARK_MYC_TARGETS_V2
) using a GSEA plot.
enrichplot::gseaplot(gsea_results,
geneSetID = "HALLMARK_MYC_TARGETS_V2",
title = "HALLMARK_MYC_TARGETS_V2",
color.line = "#0066FF")
Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph indicating that they have positive gene-level scores.
The gene set HALLMARK_MYOGENESIS
had a highly negative
NES.
enrichplot::gseaplot(gsea_results,
geneSetID = "HALLMARK_MYOGENESIS",
title = "HALLMARK_MYOGENESIS",
color.line = "#0066FF")
This gene set shows the opposite pattern – genes in the pathway tend to be on the right side of the graph.
The @results
slot will only show gene sets that pass the
pvalueCutoff
threshold we supplied to GSEA()
,
but we can plot any gene set so long as we know its name. Let’s look at
HALLMARK_P53_PATHWAY
, which was not in the results we
viewed earlier.
enrichplot::gseaplot(gsea_results,
geneSetID = "HALLMARK_P53_PATHWAY",
title = "HALLMARK_P53_PATHWAY",
color.line = "#0066FF")
Genes in the pathway are distributed more evenly throughout the ranked list, resulting in a more “middling” score.
Note: The plots returned by enrichplot::gseaplot
are
ggplots, so we could use ggplot2::ggsave()
to save them to
file if we wanted to.
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] msigdbr_7.5.1 clusterProfiler_4.12.0 optparse_1.7.5
loaded via a namespace (and not attached):
[1] DBI_1.2.2 gson_0.1.0 shadowtext_0.1.3
[4] gridExtra_2.3 rlang_1.1.3 magrittr_2.0.3
[7] DOSE_3.30.0 compiler_4.4.1 RSQLite_2.3.6
[10] png_0.1-8 vctrs_0.6.5 reshape2_1.4.4
[13] stringr_1.5.1 pkgconfig_2.0.3 crayon_1.5.2
[16] fastmap_1.1.1 XVector_0.44.0 labeling_0.4.3
[19] ggraph_2.2.1 utf8_1.2.4 HDO.db_0.99.1
[22] rmarkdown_2.26 tzdb_0.4.0 enrichplot_1.24.0
[25] UCSC.utils_1.0.0 purrr_1.0.2 bit_4.0.5
[28] xfun_0.43 zlibbioc_1.50.0 cachem_1.0.8
[31] aplot_0.2.2 GenomeInfoDb_1.40.0 jsonlite_1.8.8
[34] blob_1.2.4 highr_0.10 BiocParallel_1.38.0
[37] tweenr_2.0.3 parallel_4.4.1 R6_2.5.1
[40] bslib_0.7.0 stringi_1.8.3 RColorBrewer_1.1-3
[43] jquerylib_0.1.4 GOSemSim_2.30.0 Rcpp_1.0.12
[46] knitr_1.46 readr_2.1.5 IRanges_2.38.0
[49] Matrix_1.7-0 splines_4.4.1 igraph_2.0.3
[52] tidyselect_1.2.1 qvalue_2.36.0 yaml_2.3.8
[55] viridis_0.6.5 codetools_0.2-20 lattice_0.22-6
[58] tibble_3.2.1 plyr_1.8.9 treeio_1.28.0
[61] Biobase_2.64.0 withr_3.0.0 KEGGREST_1.44.0
[64] evaluate_0.23 gridGraphics_0.5-1 scatterpie_0.2.2
[67] getopt_1.20.4 polyclip_1.10-6 Biostrings_2.72.0
[70] ggtree_3.12.0 pillar_1.9.0 stats4_4.4.1
[73] ggfun_0.1.4 generics_0.1.3 vroom_1.6.5
[76] hms_1.1.3 S4Vectors_0.42.0 ggplot2_3.5.1
[79] tidytree_0.4.6 munsell_0.5.1 scales_1.3.0
[82] glue_1.7.0 lazyeval_0.2.2 tools_4.4.1
[85] data.table_1.15.4 fgsea_1.30.0 babelgene_22.9
[88] fs_1.6.4 graphlayouts_1.1.1 fastmatch_1.1-4
[91] tidygraph_1.3.1 cowplot_1.1.3 grid_4.4.1
[94] ape_5.8 tidyr_1.3.1 AnnotationDbi_1.66.0
[97] colorspace_2.1-0 nlme_3.1-164 patchwork_1.2.0
[100] GenomeInfoDbData_1.2.12
[ reached getOption("max.print") -- omitted 20 entries ]