Objectives

This notebook will demonstrate how to:

  • Identify when Gene Set Variation Analysis (GSVA) is well-suited for an analysis
  • Perform GSVA on transformed RNA-seq data with the GSVA package
  • Explore the dependence of GSVA scores on gene set size with random gene sets

So far every pathway analysis method we’ve covered relies on some information about groups of samples in our data. For over-representation analysis (ORA), we created gene sets from two different two group comparisons. In the Gene Set Enrichment Analysis (GSEA) example, we used statistics from a differential gene expression (DGE) analysis where we compared MYCN amplified cell lines to non-amplified cell lines; we needed that amplification status information.

What if we’re less sure about groups in our data or we want to analyze our data in a more unsupervised manner?

In this notebook we will cover a method called Gene Set Variation Analysis (GSVA) (Hänzelmann et al. 2013) that allows us to calculate gene set or pathway scores on a per-sample basis.

We like this quote from the GSVA paper (Hänzelmann et al. 2013) to set the stage:

While [gene set enrichment] methods are generally regarded as end points of a bioinformatic analysis, GSVA constitutes a starting point to build pathway-centric models of biology.

Rather than contextualizing some results you already have from another analysis like DGE, GSVA is designed to provide an estimate of pathway variation for each of the samples in an experiment. Note that these scores will depend on the samples included in the dataset when you run GSVA; if you added more samples and reran GSVA, you would expect the scores to change.

Set up

Libraries

# Gene Set Variation Analysis
library(GSVA)
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'HDF5Array'
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'

Directories and files

Directories

# We have some medulloblastoma data from the OpenPBTA project that we've
# prepared ahead of time
input_dir <- file.path("data", "open-pbta")

# Create a directory specifically for the results using this dataset
output_dir <- file.path("results", "open-pbta")
if (!dir.exists(output_dir)) {
  dir.create(output_dir, recursive = TRUE)
}

Input

We have VST transformed RNA-seq data, annotated with gene symbols, that has been collapsed such that there are no duplicated gene identifiers (see setup).

rnaseq_file <- file.path(input_dir, "medulloblastoma_vst_collapsed.tsv")

Output

gsva_results_file <- file.path(output_dir, "medulloblastoma_gsva_results.tsv")

Gene sets

The function that we will use to run GSVA wants the gene sets to be in a list, rather than a tidy data frame that we used with clusterProfiler (although it does accept other formats).

We’re going to take this opportunity to introduce a different format that gene sets are often distributed in called GMT (Gene Matrix Transposed).

We’re going to read in the Hallmark collection file directly from MSigDB, rather than using msigdbr like we did in earlier notebooks.

The RNA-seq data uses gene symbols, so we need gene sets that use gene symbols, too.

# R can often read in data from a URL
hallmarks_url <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.1/h.all.v7.1.symbols.gmt"

# QuSAGE is another pathway analysis method, the qusage package has a function
# for reading GMT files and turning them into a list
hallmarks_list <- qusage::read.gmt(hallmarks_url)

What does this list look like?

head(hallmarks_list)

RNA-seq data

We have VST transformed RNA-seq data, which is on a log2-like scale. These data are from the Open Pediatric Brain Tumor Atlas (OpenPBTA) OpenPBTA is a collaborative project organized by the CCDL and the Center for Data-Driven Discovery in Biomedicine (D3b) at the Children’s Hospital of Philadelphia conducted openly on GitHub.

You can read more about the project here.

We’re only working with the medulloblastoma samples in this example.

rnaseq_df <- readr::read_tsv(rnaseq_file)
Rows: 49569 Columns: 122
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr   (1): gene_symbol
dbl (121): BS_09Z7TC35, BS_1AYRM596, BS_1BWP5MCT, BS_1QXEC43H, BS_1TWCV047, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# What does the RNA-seq data frame look like?
rnaseq_df[1:5, 1:5]

For GSVA, we need a matrix.

rnaseq_mat <- rnaseq_df |>
  tibble::column_to_rownames("gene_symbol") |>
  as.matrix()

Note: If we had duplicate gene symbols here, we couldn’t set them as rownames.

GSVA