Objectives

This notebook will demonstrate how to:

  • Prepare tabular data of gene-level statistics for use with Gene Set Enrichment Analysis (GSEA), including how to remove duplicate gene identifiers
  • Perform GSEA with the clusterProfiler package
  • Visualize GSEA results with the enrichplot package

In this notebook, we will perform Gene Set Enrichment Analysis (GSEA) on the neuroblastoma cell line differential gene expression (DGE) results we generated during the RNA-seq module.

To refresh our memory, we analyzed data from Harenza et al. (2017) and specifically tested for DGE between with MYCN amplified cell lines and non-amplified cell lines using DESeq2. We have a table of results that contains our log2 fold changes and adjusted p-values.

GSEA is a functional class scoring (FCS) approach to pathway analysis that was first introduced in Subramanian et al. (2005). The rationale behind FCS approaches is that small changes in individual genes that participate in the same biological process or pathway can be significant and of biological interest. FCS methods are better suited for identifying these pathways that show coordinated changes than ORA. In ORA, we pick a cutoff that typically only captures genes with large individual changes.

There are 3 general steps in FCS methods (Khatri et al. 2012):

  1. Calculate a gene-level statistic (we’ll use log2 fold change from DESeq2 here)
  2. Gene-level statistics are aggregated into a pathway-level statistic
  3. Assess the statistical significance of the pathway-level statistic

Other resources

Set up

Libraries

# Package to run GSEA
library(clusterProfiler)
clusterProfiler v4.12.0  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/

If you use clusterProfiler in published research, please cite:
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141

Attaching package: 'clusterProfiler'
The following object is masked from 'package:stats':

    filter
# Package that contains the MSigDB gene sets in tidy format
library(msigdbr)

Directories and Files

Directories

# Where the DGE results are stored
input_dir <- file.path("..", "RNA-seq", "results", "NB-cell")

# We will create a directory to specifically hold our GSEA results if it does
# not yet exist
output_dir <- file.path("results", "NB-cell")
if (!dir.exists(output_dir)) {
  dir.create(output_dir, recursive = TRUE)
}

Input files

# DGE results
dge_results_file <- file.path(input_dir,
                             "NB-cell_DESeq_amplified_v_nonamplified_results.tsv")

Output files

# GSEA pathway-level scores and statistics
gsea_results_file <- file.path(output_dir,
                               "NB-cell_gsea_results.tsv")

Gene Set Enrichment Analysis

Adapted from refine.bio examples