This notebook will demonstrate how to:
GSVA
packageSo far every pathway analysis method we’ve covered relies on some information about groups of samples in our data. For over-representation analysis (ORA), we created gene sets from two different two group comparisons. In the Gene Set Enrichment Analysis (GSEA) example, we used statistics from a differential gene expression (DGE) analysis where we compared MYCN amplified cell lines to non-amplified cell lines; we needed that amplification status information.
What if we’re less sure about groups in our data or we want to analyze our data in a more unsupervised manner?
In this notebook we will cover a method called Gene Set Variation Analysis (GSVA) (Hänzelmann et al. 2013) that allows us to calculate gene set or pathway scores on a per-sample basis.
We like this quote from the GSVA paper (Hänzelmann et al. 2013) to set the stage:
While [gene set enrichment] methods are generally regarded as end points of a bioinformatic analysis, GSVA constitutes a starting point to build pathway-centric models of biology.
Rather than contextualizing some results you already have from another analysis like DGE, GSVA is designed to provide an estimate of pathway variation for each of the samples in an experiment. Note that these scores will depend on the samples included in the dataset when you run GSVA; if you added more samples and reran GSVA, you would expect the scores to change.
# Gene Set Variation Analysis
library(GSVA)
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'HDF5Array'
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
# We have some medulloblastoma data from the OpenPBTA project that we've
# prepared ahead of time
input_dir <- file.path("data", "open-pbta")
# Create a directory specifically for the results using this dataset
output_dir <- file.path("results", "open-pbta")
if (!dir.exists(output_dir)) {
dir.create(output_dir, recursive = TRUE)
}
We have VST transformed RNA-seq data, annotated with gene symbols,
that has been collapsed such that there are no duplicated gene
identifiers (see setup
).
rnaseq_file <- file.path(input_dir, "medulloblastoma_vst_collapsed.tsv")
gsva_results_file <- file.path(output_dir, "medulloblastoma_gsva_results.tsv")
The function that we will use to run GSVA wants the gene sets to be
in a list, rather than a tidy data frame that we used with
clusterProfiler
(although it does accept other
formats).
We’re going to take this opportunity to introduce a different format that gene sets are often distributed in called GMT (Gene Matrix Transposed).
We’re going to read in the Hallmark collection file directly from MSigDB,
rather than using msigdbr
like we did in earlier
notebooks.
The RNA-seq data uses gene symbols, so we need gene sets that use gene symbols, too.
# R can often read in data from a URL
hallmarks_url <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.1/h.all.v7.1.symbols.gmt"
# QuSAGE is another pathway analysis method, the qusage package has a function
# for reading GMT files and turning them into a list
hallmarks_list <- qusage::read.gmt(hallmarks_url)
What does this list look like?
head(hallmarks_list)
We have VST transformed RNA-seq data, which is on a log2-like scale. These data are from the Open Pediatric Brain Tumor Atlas (OpenPBTA) OpenPBTA is a collaborative project organized by the CCDL and the Center for Data-Driven Discovery in Biomedicine (D3b) at the Children’s Hospital of Philadelphia conducted openly on GitHub.
You can read more about the project here.
We’re only working with the medulloblastoma samples in this example.
rnaseq_df <- readr::read_tsv(rnaseq_file)
Rows: 49569 Columns: 122
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): gene_symbol
dbl (121): BS_09Z7TC35, BS_1AYRM596, BS_1BWP5MCT, BS_1QXEC43H, BS_1TWCV047, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# What does the RNA-seq data frame look like?
rnaseq_df[1:5, 1:5]
For GSVA, we need a matrix.
rnaseq_mat <- rnaseq_df |>
tibble::column_to_rownames("gene_symbol") |>
as.matrix()
Note: If we had duplicate gene symbols here, we couldn’t set them as rownames.
Figure 1 from Hänzelmann et al. (2013).
You may notice that GSVA has some commonalities with GSEA. Rather than ranking genes based on some statistic we selected ahead of time, GSVA fits a model and ranks genes based on their expression level relative to the sample distribution. This is a way of asking if a gene i is highly or lowly expressed in a sample j in the context of this experiment and ranking accordingly (Hänzelmann et al. 2013). The pathway-level score calculated is a way of asking how genes within a gene set vary as compared to genes that are outside of that gene set (Malhotra. 2018). (This is sometimes called a competitive test in gene set enrichment literature.) The intuition here is that we will get pathway-level scores for each sample that indicate if genes in a pathway vary concordantly in one direction (overexpressed or underexpressed relative to the overall population) (Hänzelmann et al. 2013).
The output is a gene set by sample matrix of GSVA scores.
The main work for GSVA is done by the gsva()
function,
which can actually do a variety of different enrichment score
calculations. To specify that we want the algorithm used by Hänzelmann et al.
(2013), we will pass as the first argument a call to the
gsvaParam()
function, which is where we will put our data,
gene sets, and other parameters specific to that algorithm.
gsva_results <- gsva(
gsvaParam(
exprData = rnaseq_mat,
geneSets = hallmarks_list,
# Minimum gene set size
minSize = 15,
# Maximum gene set size
maxSize = 500,
# Kernel for estimation
# Gaussian for our transformed data on log2-like scale
kcdf = "Gaussian",
# enrichment score is the difference between largest positive and negative
maxDiff = TRUE
),
# Use 4 cores for multiprocessing
BPPARAM = BiocParallel::MulticoreParam(4)
)
Warning in .filterGenes(dataMatrix, removeConstant = removeConstant,
removeNzConstant = removeNzConstant): 945 genes with constant values throughout
the samples.
Warning in .filterGenes(dataMatrix, removeConstant = removeConstant,
removeNzConstant = removeNzConstant): Genes with constant values are discarded.
Setting parallel calculations through a MulticoreParam back-end with workers=4 and tasks=100.
Estimating GSVA scores for 50 gene sets.
Estimating ECDFs with Gaussian kernels
Estimating ECDFs in parallel on 4 cores
Note: the gsvaParam()
documentation says we can
use kcdf = "Gaussian"
if we had RNA-seq log-CPMs, log-RPKMs
or log-TPMs, but we would use kcdf = "Poisson"
on integer
counts.
# Let's explore what the output of gsva() looks like
gsva_results[1:5, 1:5]
BS_09Z7TC35 BS_1AYRM596 BS_1BWP5MCT
HALLMARK_TNFA_SIGNALING_VIA_NFKB -0.44911439 0.5208667 -0.5609193
HALLMARK_HYPOXIA -0.38297104 0.2436910 -0.5058759
HALLMARK_CHOLESTEROL_HOMEOSTASIS -0.26534735 0.1054224 -0.5180933
HALLMARK_MITOTIC_SPINDLE 0.12727006 0.2339489 -0.4338076
HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.09287646 0.1602124 -0.4427594
BS_1QXEC43H BS_1TWCV047
HALLMARK_TNFA_SIGNALING_VIA_NFKB 0.32300630 -0.1468081
HALLMARK_HYPOXIA 0.36247083 -0.2971559
HALLMARK_CHOLESTEROL_HOMEOSTASIS 0.32418657 -0.4561386
HALLMARK_MITOTIC_SPINDLE -0.07023068 -0.1861483
HALLMARK_WNT_BETA_CATENIN_SIGNALING 0.41372523 -0.1152437
Often the scores of gene set enrichment methods are not comparable between gene sets of different sizes. Let’s do an experiment using randomly generated gene sets to explore this idea a bit more.
We need to get a collection of all possible genes we will sample from to create random gene sets. Because we’re doing some random sampling, we need to set a seed for this to be reproducible.
# Use all the gene symbols in the dataset as the pool of possible genes
all_genes <- rownames(rnaseq_mat)
# Set a seed for reproducibility
set.seed(2020)
Our minimum gene set size earlier was 15 genes and our maximum gene set size was 500 genes. We’ll use the same minimum and maximum values for our random gene sets and some values in between.
# Make a list of integers that indicate the random gene set sizes
gene_set_size <- list(15, 25, 50, 100, 250, 500)
For each gene set size, we will generate 100 random gene sets.
# Set number of replicates
nreps <- 100
# Generate 100 random gene sets of each size
random_gene_sets <- rep(gene_set_size, nreps) |> # Repeat gene sizes so we run `nreps` times
purrr::map(
# Sample the vector of all genes, choosing the number of items specified
# in the element of gene set size
~ base::sample(x = all_genes,
size = .x)
)
The Hallmarks list we used earlier stored the gene set names as the
name of the list, so let’s add names to our random gene sets that
indicate what size they are and so gsva()
doesn’t get
upset.
# We will include the size of the gene set in the gene set name
# Start by taking the length of each pathway and appending "pathway_" to that
# number
lengths_vector <- random_gene_sets |>
# Get the length of each gene set (number of genes)
purrr::map(~ length(.x)) |>
# Make it "pathway_<gene set size>"
purrr::map(~ paste0("pathway_", .x)) |>
# Return a vector
purrr::flatten_chr()
# Add the names in lengths_vector to the list - "pathway_<gene set size>"
random_gene_sets <- random_gene_sets |>
# make.names() appends a "version" if something is not unique
purrr::set_names(nm = make.names(lengths_vector, unique = TRUE))
Run GSVA on our dataset with the same parameters as before, but now with random gene sets.
random_gsva_results <- gsva(
gsvaParam(
exprData = rnaseq_mat,
geneSets = random_gene_sets,
minSize = 15,
maxSize = 500,
kcdf = "Gaussian",
maxDiff = TRUE
),
# Use 4 cores for multiprocessing
BPPARAM = BiocParallel::MulticoreParam(4)
)
Warning in .filterGenes(dataMatrix, removeConstant = removeConstant,
removeNzConstant = removeNzConstant): 945 genes with constant values throughout
the samples.
Warning in .filterGenes(dataMatrix, removeConstant = removeConstant,
removeNzConstant = removeNzConstant): Genes with constant values are discarded.
Setting parallel calculations through a MulticoreParam back-end with workers=4 and tasks=100.
Estimating GSVA scores for 579 gene sets.
Estimating ECDFs with Gaussian kernels
Estimating ECDFs in parallel on 4 cores
Now let’s make a plot to look at the distribution of scores from
random gene sets. First we need to get this data in an appropriate
format for ggplot2
.
# The random results are a matrix
random_long_df <- random_gsva_results |>
data.frame() |>
# Gene set names are rownames
tibble::rownames_to_column("gene_set") |>
# Get into long format
tidyr::pivot_longer(
cols = -gene_set,
names_to = "Kids_First_Biospecimen_ID",
values_to = "gsva_score"
) |>
# Remove the .version added by make.names()
dplyr::mutate(gene_set = stringr::str_remove(gene_set, "\\..*")) |>
# Add a column that keeps track of the gene set size
dplyr::mutate(gene_set_size = stringr::word(gene_set, 2, sep = "_")) |>
# We want to plot smallest no. genes -> largest no. genes
dplyr::mutate(gene_set_size = factor(gene_set_size,
levels = c(15, 25, 50, 100, 250, 500)))
Let’s make a violin plot so we can look at the distribution of scores by gene set size.
# Violin plot comparing GSVA scores of different random gene set sizes
random_long_df |>
ggplot2::ggplot(ggplot2::aes(x = gene_set_size,
y = gsva_score)) +
# Make a violin plot that's a pretty blue!
ggplot2::geom_violin(fill = "#99CCFF", alpha = 0.5) +
# Add a point with the mean value
ggplot2::stat_summary(
geom = "point",
fun = "mean",
# Change the aesthetics of the points
size = 3,
color = "#0066CC",
shape = 18
) +
# Flip the axes
ggplot2::coord_flip() +
ggplot2::labs(title = "Random Gene Set GSVA Scores",
x = "gene set size",
y = "GSVA score") +
ggplot2::theme_bw()
What do you notice about these distributions? How might you use this information to inform your interpretation of GSVA scores?
If you did have groups information for your samples, you could test for pathway scores differences between groups. Here’s an example of that from the OpenPBTA project itself!
You can also visualize this matrix in a heatmap. Here’s a figure from the OpenPBTA project, where the middle panel is a heatmap of GSVA scores that were significantly different between histologies.
gsva_results |>
as.data.frame() |>
tibble::rownames_to_column("pathway") |>
readr::write_tsv(file = gsva_results_file)
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] GSVA_1.52.0 optparse_1.7.5
loaded via a namespace (and not attached):
[1] DBI_1.2.2 GSEABase_1.66.0
[3] rlang_1.1.3 magrittr_2.0.3
[5] matrixStats_1.3.0 compiler_4.4.1
[7] RSQLite_2.3.6 png_0.1-8
[9] vctrs_0.6.5 stringr_1.5.1
[11] pkgconfig_2.0.3 SpatialExperiment_1.14.0
[13] crayon_1.5.2 fastmap_1.1.1
[15] magick_2.8.3 XVector_0.44.0
[17] labeling_0.4.3 utf8_1.2.4
[19] rmarkdown_2.26 tzdb_0.4.0
[21] graph_1.82.0 UCSC.utils_1.0.0
[23] purrr_1.0.2 bit_4.0.5
[25] xfun_0.43 zlibbioc_1.50.0
[27] cachem_1.0.8 beachmat_2.20.0
[29] GenomeInfoDb_1.40.0 jsonlite_1.8.8
[31] blob_1.2.4 highr_0.10
[33] rhdf5filters_1.16.0 DelayedArray_0.30.0
[35] Rhdf5lib_1.26.0 BiocParallel_1.38.0
[37] irlba_2.3.5.1 parallel_4.4.1
[39] R6_2.5.1 bslib_0.7.0
[41] stringi_1.8.3 limma_3.60.0
[43] GenomicRanges_1.56.0 jquerylib_0.1.4
[45] estimability_1.5 Rcpp_1.0.12
[47] SummarizedExperiment_1.34.0 knitr_1.46
[49] readr_2.1.5 IRanges_2.38.0
[51] tidyselect_1.2.1 Matrix_1.7-0
[53] abind_1.4-5 yaml_2.3.8
[55] codetools_0.2-20 lattice_0.22-6
[57] tibble_3.2.1 withr_3.0.0
[59] Biobase_2.64.0 KEGGREST_1.44.0
[61] coda_0.19-4.1 evaluate_0.23
[63] getopt_1.20.4 Biostrings_2.72.0
[65] pillar_1.9.0 MatrixGenerics_1.16.0
[67] stats4_4.4.1 generics_0.1.3
[69] vroom_1.6.5 ggplot2_3.5.1
[71] S4Vectors_0.42.0 hms_1.1.3
[73] munsell_0.5.1 scales_1.3.0
[75] sparseMatrixStats_1.16.0 xtable_1.8-4
[77] glue_1.7.0 emmeans_1.10.1
[79] tools_4.4.1 ScaledMatrix_1.12.0
[81] annotate_1.82.0 mvtnorm_1.2-4
[83] XML_3.99-0.16.1 rhdf5_2.48.0
[85] grid_4.4.1 tidyr_1.3.1
[87] colorspace_2.1-0 AnnotationDbi_1.66.0
[89] SingleCellExperiment_1.26.0 nlme_3.1-164
[91] GenomeInfoDbData_1.2.12 BiocSingular_1.20.0
[93] qusage_2.38.0 HDF5Array_1.32.0
[95] cli_3.6.2 rsvd_1.0.5
[97] fansi_1.0.6 S4Arrays_1.4.0
[99] dplyr_1.1.4 gtable_0.3.5
[ reached getOption("max.print") -- omitted 13 entries ]