1 Purpose of this analysis

This example is one of pathway analysis module set, we recommend looking at the pathway analysis table below to help you determine which pathway analysis method is best suited for your purposes.

In this example we will cover a method called Gene Set Variation Analysis (GSVA) to calculate gene set or pathway scores on a per-sample basis (Hänzelmann et al. 2013). You can use the GSVA scores for other downstream analyses. In this analysis, we will test GSVA scores for differential expression.

⬇️ Jump to the analysis code ⬇️

1.0.1 What is pathway analysis?

Pathway analysis refers to any one of many techniques that uses predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. In the context of refine.bio, we use these techniques to analyze and interpret genome-wide gene expression experiments. The rationale for performing pathway analysis is that looking at the pathway-level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis.

We highly recommend taking a look at Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges from Khatri et al. (2012) for a more comprehensive overview. We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the Resources for further learning section.

1.0.2 How to choose a pathway analysis?

This table summarizes the pathway analyses examples in this module.

Analysis	What is required for input	What output looks like	✅ Pros	⚠️ Cons
ORA (Over-representation Analysis)	A list of gene IDs (no stats needed)	A per-pathway hypergeometric test result	- Simple - Inexpensive computationally to calculate p-values	- Requires arbitrary thresholds and ignores any statistics associated with a gene - Assumes independence of genes and pathways
GSEA (Gene Set Enrichment Analysis)	A list of genes IDs with gene-level summary statistics	A per-pathway enrichment score	- Includes all genes (no arbitrary threshold!) - Attempts to measure coordination of genes	- Permutations can be expensive - Does not account for pathway overlap - Two-group comparisons not always appropriate/feasible
GSVA (Gene Set Variation Analysis)	A gene expression matrix (like what you get from refine.bio directly)	Pathway-level scores on a per-sample basis	- Does not require two groups to compare upfront - Normally distributed scores	- Scores are not a good fit for gene sets that contain genes that go up AND down - Method doesn’t assign statistical significance itself - Recommended sample size n > 10

2 How to run this example

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

2.1 Obtain the `.Rmd` file

To run this example yourself, download the .Rmd for this analysis by clicking this link.

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
  dir.create("data")
}

# Define the file path to the plots directory
plots_dir <- "plots"

# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
  dir.create(plots_dir)
}

# Define the file path to the results directory
results_dir <- "results"

# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
  dir.create(results_dir)
}

# Define the file path to the gene_sets directory
gene_sets_dir <- "gene_sets"

# Create the gene_sets folder if it doesn't exist
if (!dir.exists(gene_sets_dir)) {
  dir.create(gene_sets_dir)
}

In the same place you put this .Rmd file, you should now have four new empty folders called data, plots, results, and gene_sets!

2.3 Obtain the dataset from refine.bio

For general information about downloading data for these examples, see our ‘Getting Started’ section.

Go to this dataset’s page on refine.bio.

Click the “Download Now” button on the right side of this screen.

Fill out the pop up window with your email and our Terms and Conditions:

It may take a few minutes for the dataset to process. You will get an email when it is ready.

2.4 About the dataset we are using for this example

For this example analysis, we will use this medulloblastoma dataset (Northcott et al. 2012). The data that we downloaded from refine.bio for this analysis has 285 microarray samples obtained from patients with medulloblastoma. Medulloblastoma is the most common childhood brain cancer and is often categorized by subgroups. We will use these subgroup labels from our metadata to perform differential expression with our GSVA scores.

2.5 Place the dataset in your new `data/` folder

refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.

For more details on the contents of this folder see these docs on refine.bio.

The GSE37382 folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235 or SRP12345.

Copy and paste the GSE37382 folder into your newly created data/ folder.

2.6 Check out our file structure!

Your new analysis folder should contain:

The example analysis .Rmd you downloaded
A folder called “data” which contains:
- The GSE37382 folder which contains:
  - The gene expression
  - The metadata TSV
A folder for plots (currently empty)
A folder for results (currently empty)
A folder for gene_sets (currently empty)

Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.

# Define the file path to the data directory
# Replace with the path of the folder the files will be in
data_dir <- file.path("data", "GSE37382")

# Declare the file path to the gene expression matrix file
# inside directory saved as `data_dir`
# Replace with the path to your dataset file
data_file <- file.path(data_dir, "GSE37382.tsv")

# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
# Replace with the path to your metadata file
metadata_file <- file.path(data_dir, "metadata_GSE37382.tsv")

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)

## [1] TRUE

# Check if the metadata file is at the file path stored in `metadata_file`
file.exists(metadata_file)

## [1] TRUE

If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.

3 Using a different refine.bio dataset with this analysis?

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.

4 Gene set variation analysis - Microarray

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

In this analysis, we will be using the GSVA package to perform GSVA and the qusage package to read in the GMT file containing the gene set data (Hänzelmann et al. 2013; Yaari et al. 2013).

We will also need the org.Hs.eg.db package to perform gene identifier conversion (Carlson 2020).

We’ll also be performing differential expression test on our GSVA scores, and for that we will use limma (Ritchie et al. 2015) and we’ll make a sina plot of the scores of our most significant pathway using a ggplot2 companion package, ggforce.

if (!("GSVA" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("GSVA", update = FALSE)
}

if (!("qusage" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("qusage", update = FALSE)
}

if (!("org.Hs.eg.db" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("org.Hs.eg.db", update = FALSE)
}

if (!("limma" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("limma", update = FALSE)
}

if (!("ggforce" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  install.packages("ggforce")
}

Attach the packages we need for this analysis.

# Attach the `qusage` library
library(qusage)

# Attach the `GSVA` library
library(GSVA)

# Human annotation package we'll use for gene identifier conversion
library(org.Hs.eg.db)

# Attach the ggplot2 package for plotting
library(ggplot2)

# We will need this so we can use the pipe: %>%
library(magrittr)

4.2 Import and set up data

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.

We stored our file paths as objects named metadata_file and data_file in this previous step.

# Read in metadata TSV file
metadata <- readr::read_tsv(metadata_file)

## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   refinebio_age = col_double(),
##   refinebio_cell_line = col_logical(),
##   refinebio_compound = col_logical(),
##   refinebio_disease = col_logical(),
##   refinebio_disease_stage = col_logical(),
##   refinebio_genetic_information = col_logical(),
##   refinebio_processed = col_logical(),
##   refinebio_race = col_logical(),
##   refinebio_source_archive_url = col_logical(),
##   refinebio_specimen_part = col_logical(),
##   refinebio_subject = col_logical(),
##   refinebio_time = col_logical(),
##   refinebio_treatment = col_logical(),
##   channel_count = col_double(),
##   `contact_zip/postal_code` = col_double(),
##   data_row_count = col_double(),
##   taxid_ch1 = col_double()
## )
## ℹ Use `spec()` for the full column specifications.

# Read in data TSV file
expression_df <- readr::read_tsv(data_file)

## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   Gene = col_character()
## )
## ℹ Use `spec()` for the full column specifications.

Let’s ensure that the metadata and data are in the same sample order.

# Make the data in the order of the metadata
expression_df <- expression_df %>%
  dplyr::select(c(Gene, metadata$geo_accession))

# Check if this is in the same order
all.equal(colnames(expression_df)[-1], metadata$geo_accession)

## [1] TRUE

4.2.1 Import Gene Sets

The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses (Subramanian et al. 2005; Liberzon et al. 2011). MSigDB contains 8 different gene set collections (Subramanian et al. 2005; Liberzon et al. 2011) that are distinguished by how they are derived (e.g., computationally mined, curated).

In this example, we will use a collection called Hallmark gene sets for GSVA (Liberzon et al. 2015). Here’s an excerpt of the collection description from MSigDB:

Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression.

The function that we will use to run GSVA requires the gene sets to be in a list. We are going to download a GMT file that contains the the Hallmark gene sets (Liberzon et al. 2015) from MSigDB (Subramanian et al. 2005; Liberzon et al. 2011) to the gene_sets directory.

Note that when downloading GMT files from MSigDB, only Homo sapiens gene sets are supported. If you’d like to use MSigDB gene sets with GSVA for a commonly studied model organism, check out our RNA-seq GSVA example.

hallmarks_gmt_url <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.2/h.all.v7.2.entrez.gmt"

We will also declare a file path to where we want this file to be downloaded to and we can use the same file path later for reading the file into R.

hallmarks_gmt_file <- file.path(
  gene_sets_dir,
  "h.all.v7.2.entrez.gmt"
)

Using the URL (hallmarks_gmt_url) and file path (hallmark_gmt_file) we can download the file and use the destfile argument to specify where it should be saved.

download.file(
  hallmarks_gmt_url,
  # The file will be saved to this location and with this name
  destfile = hallmarks_gmt_file
)

Now let’s double check that the file that contains the gene sets is in the right place.

# Check if the file exists
file.exists(hallmarks_gmt_file)

## [1] TRUE

Now we’re ready to read the file into R with qusage::read.gmt().

# QuSAGE is another pathway analysis method, the `qusage` package has a function
# for reading GMT files and turning them into a list that we can use with GSVA
hallmarks_list <- qusage::read.gmt(hallmarks_gmt_file)

What does this hallmarks_list look like?

head(hallmarks_list, n = 2)

## $HALLMARK_TNFA_SIGNALING_VIA_NFKB
##   [1] "3726"   "2920"   "467"    "4792"   "7128"   "5743"   "2919"  
##   [8] "8870"   "9308"   "6364"   "2921"   "23764"  "4791"   "7127"  
##  [15] "1839"   "1316"   "330"    "5329"   "7538"   "3383"   "3725"  
##  [22] "1960"   "3553"   "597"    "23645"  "80149"  "6648"   "4929"  
##  [29] "3552"   "5971"   "7185"   "7832"   "1843"   "1326"   "2114"  
##  [36] "2152"   "6385"   "1958"   "3569"   "7124"   "23135"  "4790"  
##  [43] "3976"   "5806"   "8061"   "3164"   "182"    "6351"   "2643"  
##  [50] "6347"   "1827"   "1844"   "10938"  "9592"   "5966"   "8837"  
##  [57] "8767"   "4794"   "8013"   "22822"  "51278"  "8744"   "2669"  
##  [64] "1647"   "3627"   "10769"  "8553"   "1959"   "9021"   "11182" 
##  [71] "5734"   "1847"   "5055"   "4783"   "5054"   "10221"  "25976" 
##  [78] "5970"   "329"    "6372"   "9516"   "7130"   "960"    "3624"  
##  [85] "5328"   "4609"   "3604"   "6446"   "10318"  "10135"  "2355"  
##  [92] "10957"  "3398"   "969"    "3575"   "1942"   "7262"   "5209"  
##  [99] "6352"   "79693"  "3460"   "8878"   "10950"  "4616"   "8942"  
## [106] "50486"  "694"    "4170"   "7422"   "5606"   "1026"   "3491"  
## [113] "10010"  "3433"   "3606"   "7280"   "3659"   "2353"   "4973"  
## [120] "388"    "374"    "4814"   "65986"  "8613"   "9314"   "6373"  
## [127] "6303"   "1435"   "1880"   "56937"  "5791"   "7097"   "57007" 
## [134] "7071"   "4082"   "3914"   "1051"   "9322"   "2150"   "687"   
## [141] "3949"   "7050"   "127544" "55332"  "2683"   "11080"  "1437"  
## [148] "5142"   "8303"   "5341"   "6776"   "23258"  "595"    "23586" 
## [155] "8877"   "941"    "25816"  "57018"  "2526"   "9034"   "80176" 
## [162] "8848"   "9334"   "150094" "23529"  "4780"   "2354"   "5187"  
## [169] "10725"  "490"    "3593"   "3572"   "9120"   "19"     "3280"  
## [176] "604"    "8660"   "6515"   "1052"   "51561"  "4088"   "6890"  
## [183] "9242"   "64135"  "3601"   "79155"  "602"    "24145"  "24147" 
## [190] "1906"   "10209"  "650"    "1846"   "10611"  "23308"  "9945"  
## [197] "10365"  "3371"   "5271"   "4084"  
## 
## $HALLMARK_HYPOXIA
##   [1] "5230"   "5163"   "2632"   "5211"   "226"    "2026"   "5236"  
##   [8] "10397"  "3099"   "230"    "2821"   "4601"   "6513"   "5033"  
##  [15] "133"    "8974"   "2023"   "5214"   "205"    "26355"  "5209"  
##  [22] "7422"   "665"    "7167"   "30001"  "55818"  "901"    "3939"  
##  [29] "2997"   "2597"   "8553"   "51129"  "3725"   "5054"   "4015"  
##  [36] "2645"   "8497"   "23764"  "54541"  "6515"   "3486"   "4783"  
##  [43] "2353"   "3516"   "3098"   "10370"  "3669"   "2584"   "26118" 
##  [50] "5837"   "6781"   "23036"  "694"    "123"    "1466"   "7436"  
##  [57] "23210"  "2131"   "2152"   "5165"   "55139"  "7360"   "229"   
##  [64] "8614"   "54206"  "2027"   "10957"  "3162"   "5228"   "26330" 
##  [71] "9435"   "55076"  "63827"  "467"    "857"    "272"    "2719"  
##  [78] "3340"   "8660"   "8819"   "2548"   "6385"   "8987"   "8870"  
##  [85] "5313"   "3484"   "5329"   "112464" "8839"   "9215"   "25819" 
##  [92] "6275"   "58528"  "7538"   "1956"   "1907"   "3423"   "1026"  
##  [99] "6095"   "1843"   "4282"   "5507"   "10570"  "11015"  "1837"  
## [106] "136"    "9957"   "284119" "2908"   "1316"   "2239"   "3491"  
## [113] "7128"   "771"    "3073"   "633"    "23645"  "55276"  "5292"  
## [120] "25824"  "55577"  "1027"   "680"    "8277"   "4493"   "538"   
## [127] "4502"   "9672"   "25976"  "5317"   "302"    "5224"   "1649"  
## [134] "5578"   "2542"   "7852"   "1944"   "1356"   "8609"   "1490"  
## [141] "9469"   "7163"   "56925"  "124872" "10891"  "596"    "2651"  
## [148] "3036"   "54800"  "949"    "6576"   "6383"   "839"    "7428"  
## [155] "2309"   "5155"   "126792" "6518"   "8406"   "1942"   "2745"  
## [162] "57007"  "5066"   "7045"   "1634"   "6478"   "51316"  "2203"  
## [169] "8459"   "5260"   "4627"   "1028"   "9380"   "5105"   "3623"  
## [176] "3309"   "8509"   "23327"  "7162"   "7511"   "3569"   "6533"  
## [183] "4214"   "3948"   "9590"   "26136"  "3798"   "3906"   "1289"  
## [190] "2817"   "3069"   "10994"  "1463"   "7052"   "2113"   "3219"  
## [197] "8991"   "2355"   "6820"   "7043"

Looks like we have a list of gene sets with associated Entrez IDs.

In our gene expression data frame, expression_df we have Ensembl gene identifiers. So we will need to convert our Ensembl IDs into Entrez IDs for GSVA.

4.2.2 Gene identifier conversion

We’re going to convert our identifiers in expression_df to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!

The annotation package org.Hs.eg.db contains information for different identifiers (Carlson 2020). org.Hs.eg.db is specific to Homo sapiens – this is what the Hs in the package name is referencing.

We can see what types of IDs are available to us in an annotation package with keytypes().

keytypes(org.Hs.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
## [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [25] "UNIGENE"      "UNIPROT"

Even though we’ll use this package to convert from Ensembl gene IDs (ENSEMBL) to Entrez IDs (ENTREZID), we could just as easily use it to convert from an Ensembl transcript ID (ENSEMBLTRANS) to gene symbols (SYMBOL).

Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: the microarray example and the RNA-seq example.

The function we will use to map from Ensembl gene IDs to Entrez gene IDs is called mapIds() and comes from the AnnotationDbi package.

Let’s create a data frame that shows the mapped Entrez IDs along with the gene expression values for the respective Ensembl IDs.

# First let's create a mapped data frame we can join to the gene expression
# values
mapped_df <- data.frame(
  "entrez_id" = mapIds(
    # Replace with annotation package for the organism relevant to your data
    org.Hs.eg.db,
    keys = expression_df$Gene,
    # Replace with the type of gene identifiers in your data
    keytype = "ENSEMBL",
    # Replace with the type of gene identifiers you would like to map to
    column = "ENTREZID",
    # This will keep only the first mapped value for each Ensembl ID
    multiVals = "first"
  )
) %>%
  # If an Ensembl gene identifier doesn't map to a Entrez gene identifier, drop
  # that from the data frame
  dplyr::filter(!is.na(entrez_id)) %>%
  # Make an `Ensembl` column to store the row names
  tibble::rownames_to_column("Ensembl") %>%
  # Now let's join the rest of the expression data
  dplyr::inner_join(expression_df, by = c("Ensembl" = "Gene"))

## 'select()' returned 1:many mapping between keys and columns

This 1:many mapping between keys and columns message means that some Ensembl gene identifiers map to multiple Entrez IDs. In this case, it’s also possible that a Entrez ID will map to multiple Ensembl IDs. For the purpose of performing GSVA later in this notebook, we keep only the first mapped IDs. For more about how to explore this, take a look at our microarray gene ID conversion example.

Let’s see a preview of mapped_df.

head(mapped_df)

We will want to keep in mind that GSVA requires that data is in a matrix with the gene identifiers as row names. In order to successfully turn our data frame into a matrix, we will need to ensure that we do not have any duplicate gene identifiers.

Let’s check to see if we have any Entrez IDs that mapped to multiple Ensembl IDs.

any(duplicated(mapped_df$entrez_id))

## [1] TRUE

Looks like we do have duplicated Entrez IDs. Let’s find out which ones.

dup_entrez_ids <- mapped_df %>%
  dplyr::filter(duplicated(entrez_id)) %>%
  dplyr::pull(entrez_id)

dup_entrez_ids

## [1] "6013" "3117"

4.2.2.1 Handling duplicate gene identifiers

As we mentioned earlier, we will not want any duplicate gene identifiers in our data frame when we convert it into a matrix in preparation for the GSVA steps later. GSVA is executed on a per sample basis so let’s keep the maximum expression value per sample associated with the duplicate Entrez gene identifiers. In other words, we will keep only the maximum expression value found across the duplicate Entrez gene identifier instances for each sample or column.

Let’s take a look at the rows associated with the duplicated Entrez IDs and see how this will play out.

mapped_df %>%
  dplyr::filter(entrez_id %in% dup_entrez_ids)

As an example using the strategy we described, for GSM917111’s data in the first column, 0.2294387 is larger than 0.1104345 so moving forward, Entrez gene 6013 will have 0.2294387 value and the 0.1104345 would be dropped from the dataset. This is just one method of handling duplicate gene identifiers. See the Gene Set Enrichment Analysis (GSEA) User guide for more information on other commonly used strategies, such as taking the median expression value.

Now, let’s implement the maximum value method for all samples and Entrez IDs using tidyverse functions.

max_dup_df <- mapped_df %>%
  # We won't be using Ensembl IDs moving forward, so we will drop this column
  dplyr::select(-Ensembl) %>%
  # Filter to include only the rows associated with the duplicate Entrez gene
  # identifiers
  dplyr::filter(entrez_id %in% dup_entrez_ids) %>%
  # Group by Entrez IDs
  dplyr::group_by(entrez_id) %>%
  # Get the maximum expression value using all values associated with each
  # duplicate Entrez ID for each column or sample in this case
  dplyr::summarize_all(max)

max_dup_df

We can see GSM917111 now has the 0.2294387 value for Entrez ID 6013 like expected. Looks like we were able to successfully get rid of the duplicate Entrez gene identifiers!

Now let’s combine our newly de-duplicated data with the rest of the mapped data!

filtered_mapped_df <- mapped_df %>%
  # We won't be using Ensembl IDs moving forward, so we will drop this column
  dplyr::select(-Ensembl) %>%
  # First let's get the data associated with the Entrez gene identifiers that
  # aren't duplicated
  dplyr::filter(!entrez_id %in% dup_entrez_ids) %>%
  # Now let's bind the rows of the maximum expression data we stored in
  # `max_dup_df`
  dplyr::bind_rows(max_dup_df)

As mentioned earlier, we need a matrix for GSVA. Let’s now convert our data frame into a matrix and prepare our object for GSVA.

filtered_mapped_matrix <- filtered_mapped_df %>%
  # We need to store our gene identifiers as row names
  tibble::column_to_rownames("entrez_id") %>%
  # Now we can convert our object into a matrix
  as.matrix()

Note that if we had duplicate gene identifiers here, we would not be able to set them as row names.

4.3 GSVA - Microarray

GSVA fits a model and ranks genes based on their expression level relative to the sample distribution (Hänzelmann et al. 2013). The pathway-level score calculated is a way of asking how genes within a gene set vary as compared to genes that are outside of that gene set (Malhotra 2018).

The idea here is that we will get pathway-level scores for each sample that indicate if genes in a pathway vary concordantly in one direction (over-expressed or under-expressed relative to the overall population) (Hänzelmann et al. 2013). This means that GSVA scores will depend on the samples included in the dataset when you run GSVA; if you added more samples and ran GSVA again, you would expect the scores to change (Hänzelmann et al. 2013).

The output is a gene set by sample matrix of GSVA scores.

4.3.1 Perform GSVA

Let’s perform GSVA using the gsva() function. See ?gsva for more options.

gsva_results <- gsva(
  filtered_mapped_matrix,
  hallmarks_list,
  method = "gsva",
  # Appropriate for our log2-transformed microarray data
  kcdf = "Gaussian",
  # Minimum gene set size
  min.sz = 15,
  # Maximum gene set size
  max.sz = 500,
  # Compute Gaussian-distributed scores
  mx.diff = TRUE,
  # Don't print out the progress bar
  verbose = FALSE
)

Let’s explore what the output of gsva() looks like.

# First 6 rows, first 10 columns
head(gsva_results[, 1:10])

##                                      GSM917111   GSM917250
## HALLMARK_TNFA_SIGNALING_VIA_NFKB    -0.2784726 -0.29221444
## HALLMARK_HYPOXIA                    -0.1907117 -0.13033725
## HALLMARK_CHOLESTEROL_HOMEOSTASIS    -0.2307863 -0.22997233
## HALLMARK_MITOTIC_SPINDLE            -0.2134439  0.09773602
## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.3061668  0.27041084
## HALLMARK_TGF_BETA_SIGNALING         -0.2285640  0.08510027
##                                       GSM917281  GSM917062
## HALLMARK_TNFA_SIGNALING_VIA_NFKB    -0.30693127 -0.2953894
## HALLMARK_HYPOXIA                    -0.24058274 -0.2658532
## HALLMARK_CHOLESTEROL_HOMEOSTASIS    -0.25341066 -0.2214914
## HALLMARK_MITOTIC_SPINDLE            -0.13886393 -0.2020978
## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.06319446 -0.2363895
## HALLMARK_TGF_BETA_SIGNALING         -0.14161796 -0.2284998
##                                       GSM917288   GSM917230
## HALLMARK_TNFA_SIGNALING_VIA_NFKB    -0.22966329 -0.20914620
## HALLMARK_HYPOXIA                     0.06741065 -0.02691280
## HALLMARK_CHOLESTEROL_HOMEOSTASIS    -0.08702648 -0.03084332
## HALLMARK_MITOTIC_SPINDLE            -0.17902098  0.05763884
## HALLMARK_WNT_BETA_CATENIN_SIGNALING  0.21274606  0.08273239
## HALLMARK_TGF_BETA_SIGNALING          0.01208862 -0.13097578
##                                      GSM917152    GSM917242
## HALLMARK_TNFA_SIGNALING_VIA_NFKB    0.33276903  0.001857506
## HALLMARK_HYPOXIA                    0.18446386 -0.118269791
## HALLMARK_CHOLESTEROL_HOMEOSTASIS    0.05273271  0.104042284
## HALLMARK_MITOTIC_SPINDLE            0.14226250 -0.052122165
## HALLMARK_WNT_BETA_CATENIN_SIGNALING 0.37981263 -0.037661623
## HALLMARK_TGF_BETA_SIGNALING         0.15915374  0.300603909
##                                      GSM917226    GSM917290
## HALLMARK_TNFA_SIGNALING_VIA_NFKB    -0.1329156 -0.385841741
## HALLMARK_HYPOXIA                    -0.2641157 -0.145480093
## HALLMARK_CHOLESTEROL_HOMEOSTASIS    -0.2136088 -0.267519873
## HALLMARK_MITOTIC_SPINDLE            -0.3753805 -0.001471942
## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.3570903 -0.006265662
## HALLMARK_TGF_BETA_SIGNALING         -0.1973818 -0.130123427

4.4 Find differentially expressed pathways

If we want to identify most differentially expressed pathways across subgroups, we can use functionality in the limma package to test the GSVA scores.

This is one approach for working with GSVA scores; the mx.diff = TRUE argument that we supplied to the gsva() function in the previous section means the GSVA output scores should be normally distributed, which can be helpful if you want to perform downstream analyses with approaches that make assumptions of normality (Hänzelmann et al. 2020).

4.4.1 Create the design matrix

limma needs a numeric design matrix to indicate which subtype of medulloblastoma a sample originates from. Now we will create a model matrix based on our subgroup variable. We are using a + 0 in the model which sets the intercept to 0 so the subgroup effects capture expression for that group, rather than difference from the first group. If you have a control group, you might want to leave off the + 0 so the model includes an intercept representing the control group expression level, with the remaining coefficients the changes relative to that expression level.

# Create the design matrix
des_mat <- model.matrix(~ subgroup + 0, data = metadata)

Let’s take a look at the design matrix we created.

# Print out part of the design matrix
head(des_mat)

##   subgroupGroup 3 subgroupGroup 4 subgroupSHH
## 1               0               1           0
## 2               0               1           0
## 3               0               1           0
## 4               1               0           0
## 5               0               1           0
## 6               0               1           0

The design matrix column names are a bit messy, so we will neaten them up by dropping the subgroup designation they all have and any spaces in names.

# Make the column names less messy
colnames(des_mat) <- stringr::str_remove(colnames(des_mat), "subgroup")

# Do a similar thing but remove spaces in names
colnames(des_mat) <- stringr::str_remove(colnames(des_mat), " ")

4.5 Perform differential expression on pathway scores

Run the linear model on each pathway (each row of gsva_results) with lmFit() and apply empirical Bayes smoothing with eBayes().

# Apply linear model to data
fit <- limma::lmFit(gsva_results, design = des_mat)

# Apply empirical Bayes to smooth standard errors
fit <- limma::eBayes(fit)

Now that we have our basic model fitting, we will want to make the contrasts among all our groups. Depending on your scientific questions, you will need to customize the next steps. Consulting the limma users guide for how to set up your model based on your hypothesis is a good idea.

In this contrasts matrix, we are comparing each subgroup to the average of the other subgroups.
We’re dividing by two in this expression so that each group is compared to the average of the other two groups (makeContrasts() doesn’t allow you to use functions like mean(); it wants a formula).

contrast_matrix <- makeContrasts(
  "G3vsOther" = Group3 - (Group4 + SHH) / 2,
  "G4vsOther" = Group4 - (Group3 + SHH) / 2,
  "SHHvsOther" = SHH - (Group3 + Group4) / 2,
  levels = des_mat
)

Side note: If you did have a control group you wanted to compare each group to, you could make each contrast to that control group, so the formulae would look like Group3 = Group3 - Control for each one. We highly recommend consulting the limma users guide for figuring out what your makeContrasts() and model.matrix() setups should look like (Ritchie et al. 2015).

Now that we have the contrasts matrix set up, we can use it to re-fit the model and re-smooth it with eBayes().

# Fit the model according to the contrasts matrix
contrasts_fit <- contrasts.fit(fit, contrast_matrix)

# Re-smooth the Bayes
contrasts_fit <- eBayes(contrasts_fit)

Here’s a nifty article and example about what the empirical Bayes smoothing is for (Robinson).

Now let’s create the results table based on the contrasts fitted model.

This step will provide the Benjamini-Hochberg multiple testing correction. The topTable() function default is to use Benjamini-Hochberg but this can be changed to a different method using the adjust.method argument (see the ?topTable help page for more about the options).

# Apply multiple testing correction and obtain stats
stats_df <- topTable(contrasts_fit, number = nrow(expression_df)) %>%
  tibble::rownames_to_column("Gene")

Let’s take a peek at our results table.

head(stats_df)

For each pathway, each group’s fold change in expression, compared to the average of the other groups is reported.

By default, results are ordered from largest F value to the smallest, which means your most differentially expressed pathways across all groups should be toward the top.

This means HALLMARK_UNFOLDED_PROTEIN_RESPONSE appears to be the pathway that contains the most significant distribution of scores across subgroups.

4.6 Visualizing Results

Let’s make a plot for our most significant pathway, HALLMARK_UNFOLDED_PROTEIN_RESPONSE.

4.6.1 Sina plot

First we need to get our GSVA scores for this pathway into a long data frame, an appropriate format for ggplot2.

Let’s look at the current format of gsva_results.

head(gsva_results[, 1:10])

##                                      GSM917111   GSM917250
## HALLMARK_TNFA_SIGNALING_VIA_NFKB    -0.2784726 -0.29221444
## HALLMARK_HYPOXIA                    -0.1907117 -0.13033725
## HALLMARK_CHOLESTEROL_HOMEOSTASIS    -0.2307863 -0.22997233
## HALLMARK_MITOTIC_SPINDLE            -0.2134439  0.09773602
## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.3061668  0.27041084
## HALLMARK_TGF_BETA_SIGNALING         -0.2285640  0.08510027
##                                       GSM917281  GSM917062
## HALLMARK_TNFA_SIGNALING_VIA_NFKB    -0.30693127 -0.2953894
## HALLMARK_HYPOXIA                    -0.24058274 -0.2658532
## HALLMARK_CHOLESTEROL_HOMEOSTASIS    -0.25341066 -0.2214914
## HALLMARK_MITOTIC_SPINDLE            -0.13886393 -0.2020978
## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.06319446 -0.2363895
## HALLMARK_TGF_BETA_SIGNALING         -0.14161796 -0.2284998
##                                       GSM917288   GSM917230
## HALLMARK_TNFA_SIGNALING_VIA_NFKB    -0.22966329 -0.20914620
## HALLMARK_HYPOXIA                     0.06741065 -0.02691280
## HALLMARK_CHOLESTEROL_HOMEOSTASIS    -0.08702648 -0.03084332
## HALLMARK_MITOTIC_SPINDLE            -0.17902098  0.05763884
## HALLMARK_WNT_BETA_CATENIN_SIGNALING  0.21274606  0.08273239
## HALLMARK_TGF_BETA_SIGNALING          0.01208862 -0.13097578
##                                      GSM917152    GSM917242
## HALLMARK_TNFA_SIGNALING_VIA_NFKB    0.33276903  0.001857506
## HALLMARK_HYPOXIA                    0.18446386 -0.118269791
## HALLMARK_CHOLESTEROL_HOMEOSTASIS    0.05273271  0.104042284
## HALLMARK_MITOTIC_SPINDLE            0.14226250 -0.052122165
## HALLMARK_WNT_BETA_CATENIN_SIGNALING 0.37981263 -0.037661623
## HALLMARK_TGF_BETA_SIGNALING         0.15915374  0.300603909
##                                      GSM917226    GSM917290
## HALLMARK_TNFA_SIGNALING_VIA_NFKB    -0.1329156 -0.385841741
## HALLMARK_HYPOXIA                    -0.2641157 -0.145480093
## HALLMARK_CHOLESTEROL_HOMEOSTASIS    -0.2136088 -0.267519873
## HALLMARK_MITOTIC_SPINDLE            -0.3753805 -0.001471942
## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.3570903 -0.006265662
## HALLMARK_TGF_BETA_SIGNALING         -0.1973818 -0.130123427

We can see that they are in a wide format with the GSVA scores for each sample spread across a row associated with each pathway.

Now let’s convert these results into a data frame and into a long format, using the tidyr::pivot_longer() function.

annotated_results_df <- gsva_results %>%
  # Make this into a data frame
  data.frame() %>%
  # Gene set names are row names
  tibble::rownames_to_column("pathway") %>%
  # Get into long format using the `tidyr::pivot_longer()` function
  tidyr::pivot_longer(
    cols = -pathway,
    names_to = "sample",
    values_to = "gsva_score"
  )

# Preview the annotated results object
head(annotated_results_df)

Now let’s filter to include only the scores associated with our most significant pathway, HALLMARK_UNFOLDED_PROTEIN_RESPONSE, and join the relevant group labels from the metadata for plotting.

top_pathway_annotated_results_df <- annotated_results_df %>%
  # Filter for only scores associated with our most significant pathway
  dplyr::filter(pathway == "HALLMARK_UNFOLDED_PROTEIN_RESPONSE") %>%
  # Join the column with the group labels that we would like to plot
  dplyr::left_join(metadata %>% dplyr::select(
    # Select the variables relevant to your data
    refinebio_accession_code,
    subgroup
  ),
  # Tell the join what columns are equivalent and should be used as a key
  by = c("sample" = "refinebio_accession_code")
  )

# Preview the filtered annotated results object
head(top_pathway_annotated_results_df)

Now let’s make a sina plot so we can look at the differences between the subgroup groups using our GSVA scores.

# Sina plot comparing GSVA scores for `HALLMARK_UNFOLDED_PROTEIN_RESPONSE`
# the `subgroup` groups in our dataset
sina_plot <-
  ggplot(
    top_pathway_annotated_results_df, # Supply our annotated data frame
    aes(
      x = subgroup, # Replace with a grouping variable relevant to your data
      y = gsva_score, # Column we previously created to store the GSVA scores
      color = subgroup # Let's make the groups different colors too
    )
  ) +
  # Add a boxplot that will have summary stats
  geom_boxplot(outlier.shape = NA) +
  # Make a sina plot that shows individual values
  ggforce::geom_sina() +
  # Rename the y-axis label
  labs(y = "HALLMARK_UNFOLDED_PROTEIN_RESPONSE_score") +
  # Adjust the plot background for better visualization
  theme_bw()

# Display plot
sina_plot

Looks like the Group 4 samples have lower GSVA scores for HALLMARK_UNFOLDED_PROTEIN_RESPONSE as compared to the SHH and Group 3 scores.

Let’s save this plot to PNG.

ggsave(
  file.path(
    plots_dir,
    "GSE37382_gsva_HALLMARK_UNFOLDED_PROTEIN_RESPONSE_sina_plot.png"
  ),
  plot = sina_plot
)

## Saving 7 x 5 in image

4.7 Write results to file

Now let’s write all of our GSVA results to file.

gsva_results %>%
  as.data.frame() %>%
  tibble::rownames_to_column("pathway") %>%
  readr::write_tsv(file.path(
    results_dir,
    "GSE37382_gsva_results.tsv"
  ))

5 Resources for further learning

GSVA Paper (Hänzelmann et al. 2013)
See this article on Decoding Gene Set Variation Analysis (Malhotra 2018)

6 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

# Print session info
sessioninfo::session_info()

## ─ Session info ─────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.5 (2021-03-31)
##  os       Ubuntu 20.04.3 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Etc/UTC                     
##  date     2022-03-01                  
## 
## ─ Packages ─────────────────────────────────────────────────────────
##  package              * version  date       lib source        
##  annotate               1.68.0   2020-10-27 [1] Bioconductor  
##  AnnotationDbi        * 1.52.0   2020-10-27 [1] Bioconductor  
##  assertthat             0.2.1    2019-03-21 [1] RSPM (R 4.0.3)
##  backports              1.2.1    2020-12-09 [1] RSPM (R 4.0.3)
##  Biobase              * 2.50.0   2020-10-27 [1] Bioconductor  
##  BiocGenerics         * 0.36.1   2021-04-16 [1] Bioconductor  
##  BiocParallel           1.24.1   2020-11-06 [1] Bioconductor  
##  bit                    4.0.4    2020-08-04 [1] RSPM (R 4.0.3)
##  bit64                  4.0.5    2020-08-30 [1] RSPM (R 4.0.3)
##  bitops                 1.0-7    2021-04-24 [1] RSPM (R 4.0.4)
##  blob                   1.2.1    2020-01-20 [1] RSPM (R 4.0.3)
##  bslib                  0.2.5    2021-05-12 [1] RSPM (R 4.0.4)
##  cachem                 1.0.5    2021-05-15 [1] RSPM (R 4.0.4)
##  cli                    2.5.0    2021-04-26 [1] RSPM (R 4.0.4)
##  coda                   0.19-4   2020-09-30 [1] RSPM (R 4.0.3)
##  colorspace             2.0-1    2021-05-04 [1] RSPM (R 4.0.4)
##  crayon                 1.4.1    2021-02-08 [1] RSPM (R 4.0.3)
##  DBI                    1.1.1    2021-01-15 [1] RSPM (R 4.0.3)
##  DelayedArray           0.16.3   2021-03-24 [1] Bioconductor  
##  digest                 0.6.27   2020-10-24 [1] RSPM (R 4.0.3)
##  dplyr                  1.0.6    2021-05-05 [1] RSPM (R 4.0.4)
##  ellipsis               0.3.2    2021-04-29 [1] RSPM (R 4.0.4)
##  emmeans                1.6.0    2021-04-24 [1] RSPM (R 4.0.4)
##  estimability           1.3      2018-02-11 [1] RSPM (R 4.0.0)
##  evaluate               0.14     2019-05-28 [1] RSPM (R 4.0.3)
##  fansi                  0.4.2    2021-01-15 [1] RSPM (R 4.0.3)
##  farver                 2.1.0    2021-02-28 [1] RSPM (R 4.0.3)
##  fastmap                1.1.0    2021-01-25 [1] RSPM (R 4.0.3)
##  fftw                   1.0-6    2020-02-24 [1] RSPM (R 4.0.2)
##  generics               0.1.0    2020-10-31 [1] RSPM (R 4.0.3)
##  GenomeInfoDb           1.26.7   2021-04-08 [1] Bioconductor  
##  GenomeInfoDbData       1.2.4    2022-03-01 [1] Bioconductor  
##  GenomicRanges          1.42.0   2020-10-27 [1] Bioconductor  
##  getopt                 1.20.3   2019-03-22 [1] RSPM (R 4.0.0)
##  ggforce                0.3.3    2021-03-05 [1] RSPM (R 4.0.3)
##  ggplot2              * 3.3.3    2020-12-30 [1] RSPM (R 4.0.3)
##  glue                   1.4.2    2020-08-27 [1] RSPM (R 4.0.3)
##  graph                  1.68.0   2020-10-27 [1] Bioconductor  
##  GSEABase               1.52.1   2020-12-11 [1] Bioconductor  
##  GSVA                 * 1.38.2   2021-02-09 [1] Bioconductor  
##  gtable                 0.3.0    2019-03-25 [1] RSPM (R 4.0.3)
##  highr                  0.9      2021-04-16 [1] RSPM (R 4.0.4)
##  hms                    1.0.0    2021-01-13 [1] RSPM (R 4.0.3)
##  htmltools              0.5.1.1  2021-01-22 [1] RSPM (R 4.0.3)
##  httr                   1.4.2    2020-07-20 [1] RSPM (R 4.0.3)
##  IRanges              * 2.24.1   2020-12-12 [1] Bioconductor  
##  jquerylib              0.1.4    2021-04-26 [1] RSPM (R 4.0.4)
##  jsonlite               1.7.2    2020-12-09 [1] RSPM (R 4.0.3)
##  knitr                  1.33     2021-04-24 [1] RSPM (R 4.0.4)
##  labeling               0.4.2    2020-10-20 [1] RSPM (R 4.0.3)
##  lattice                0.20-41  2020-04-02 [2] CRAN (R 4.0.5)
##  lifecycle              1.0.0    2021-02-15 [1] RSPM (R 4.0.3)
##  limma                * 3.46.0   2020-10-27 [1] Bioconductor  
##  magrittr             * 2.0.1    2020-11-17 [1] RSPM (R 4.0.3)
##  MASS                   7.3-53.1 2021-02-12 [2] CRAN (R 4.0.5)
##  Matrix                 1.3-2    2021-01-06 [2] CRAN (R 4.0.5)
##  MatrixGenerics         1.2.1    2021-01-30 [1] Bioconductor  
##  matrixStats            0.58.0   2021-01-29 [1] RSPM (R 4.0.3)
##  memoise                2.0.0    2021-01-26 [1] RSPM (R 4.0.3)
##  munsell                0.5.0    2018-06-12 [1] RSPM (R 4.0.3)
##  mvtnorm                1.1-1    2020-06-09 [1] RSPM (R 4.0.3)
##  nlme                   3.1-152  2021-02-04 [2] CRAN (R 4.0.5)
##  optparse             * 1.6.6    2020-04-16 [1] RSPM (R 4.0.0)
##  org.Hs.eg.db         * 3.12.0   2022-03-01 [1] Bioconductor  
##  pillar                 1.6.1    2021-05-16 [1] RSPM (R 4.0.4)
##  pkgconfig              2.0.3    2019-09-22 [1] RSPM (R 4.0.3)
##  polyclip               1.10-0   2019-03-14 [1] RSPM (R 4.0.3)
##  ps                     1.6.0    2021-02-28 [1] RSPM (R 4.0.3)
##  purrr                  0.3.4    2020-04-17 [1] RSPM (R 4.0.3)
##  qusage               * 2.24.0   2020-10-27 [1] Bioconductor  
##  R.cache                0.15.0   2021-04-30 [1] RSPM (R 4.0.4)
##  R.methodsS3            1.8.1    2020-08-26 [1] RSPM (R 4.0.3)
##  R.oo                   1.24.0   2020-08-26 [1] RSPM (R 4.0.3)
##  R.utils                2.10.1   2020-08-26 [1] RSPM (R 4.0.3)
##  R6                     2.5.0    2020-10-28 [1] RSPM (R 4.0.3)
##  Rcpp                   1.0.6    2021-01-15 [1] RSPM (R 4.0.3)
##  RCurl                  1.98-1.3 2021-03-16 [1] RSPM (R 4.0.4)
##  readr                  1.4.0    2020-10-05 [1] RSPM (R 4.0.4)
##  rematch2               2.1.2    2020-05-01 [1] RSPM (R 4.0.3)
##  rlang                  0.4.11   2021-04-30 [1] RSPM (R 4.0.4)
##  rmarkdown              2.8      2021-05-07 [1] RSPM (R 4.0.4)
##  RSQLite                2.2.7    2021-04-22 [1] RSPM (R 4.0.4)
##  rstudioapi             0.13     2020-11-12 [1] RSPM (R 4.0.3)
##  S4Vectors            * 0.28.1   2020-12-09 [1] Bioconductor  
##  sass                   0.4.0    2021-05-12 [1] RSPM (R 4.0.4)
##  scales                 1.1.1    2020-05-11 [1] RSPM (R 4.0.3)
##  sessioninfo            1.1.1    2018-11-05 [1] RSPM (R 4.0.3)
##  stringi                1.6.1    2021-05-10 [1] RSPM (R 4.0.4)
##  stringr                1.4.0    2019-02-10 [1] RSPM (R 4.0.3)
##  styler                 1.4.1    2021-03-30 [1] RSPM (R 4.0.4)
##  SummarizedExperiment   1.20.0   2020-10-27 [1] Bioconductor  
##  tibble                 3.1.2    2021-05-16 [1] RSPM (R 4.0.4)
##  tidyr                  1.1.3    2021-03-03 [1] RSPM (R 4.0.4)
##  tidyselect             1.1.1    2021-04-30 [1] RSPM (R 4.0.4)
##  tweenr                 1.0.2    2021-03-23 [1] RSPM (R 4.0.4)
##  utf8                   1.2.1    2021-03-12 [1] RSPM (R 4.0.3)
##  vctrs                  0.3.8    2021-04-29 [1] RSPM (R 4.0.4)
##  withr                  2.4.2    2021-04-18 [1] RSPM (R 4.0.4)
##  xfun                   0.23     2021-05-15 [1] RSPM (R 4.0.4)
##  XML                    3.99-0.6 2021-03-16 [1] RSPM (R 4.0.4)
##  xtable                 1.8-4    2019-04-21 [1] RSPM (R 4.0.3)
##  XVector                0.30.0   2020-10-27 [1] Bioconductor  
##  yaml                   2.2.1    2020-02-01 [1] RSPM (R 4.0.3)
##  zlibbioc               1.36.0   2020-10-27 [1] Bioconductor  
## 
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library

References

Carlson M., 2020 org.Hs.eg.db: Genome wide annotation for human. http://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html

Hänzelmann S., R. Castelo, and J. Guinney, 2013 GSVA: Gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics 14: 7. https://doi.org/10.1186/1471-2105-14-7

Hänzelmann S., R. Castelo, and J. Guinney, 2020 GSVA: The gene set variation analysis package for microarray and RNA-seq data. https://www.bioconductor.org/packages/release/bioc/vignettes/GSVA/inst/doc/GSVA.pdf

Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: Current approaches and outstanding challenges. PLOS Computational Biology 8: e1002375. https://doi.org/10.1371/journal.pcbi.1002375

Liberzon A., A. Subramanian, R. Pinchback, H. Thorvaldsdóttir, P. Tamayo, et al., 2011 Molecular signatures database (MSigDB) 3.0. Bioinformatics 27: 1739–1740. https://doi.org/10.1093/bioinformatics/btr260

Liberzon A., C. Birger, H. Thorvaldsdóttir, M. Ghandi, J. P. Mesirov, et al., 2015 The molecular signatures database hallmark gene set collection. Cell Systems 1. https://doi.org/10.1016/j.cels.2015.12.004

Malhotra S., 2018 Decoding gene set variation analysis. https://towardsdatascience.com/decoding-gene-set-variation-analysis-8193a0cfda3

Northcott P., D. Shih, J. Peacock, L. Garzia, S. Morrissy, et al., 2012 Subgroup specific structural variation across 1,000 medulloblastoma genomes. Nature 488. https://doi.org/10.1038/nature11327

Ritchie M. E., B. Phipson, D. Wu, Y. Hu, C. W. Law, et al., 2015 limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43: e47. https://doi.org/10.1093/nar/gkv007

Robinson D., Understanding empirical Bayes estimation (using baseball statistics). http://varianceexplained.org/r/empirical_bayes_baseball/

Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, et al., 2005 Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102: 15545–15550. https://doi.org/10.1073/pnas.0506580102

Yaari G., C. R. Bolen, J. Thakar, and S. H. Kleinstein, 2013 Quantitative set analysis for gene expression: A method to quantify gene set differential expression including gene-gene correlations. Nucleic Acids Research 41: e170. https://doi.org/10.1093/nar/gkt660

Gene set variation analysis - Microarray

CCDL for ALSF

November 2020

1 Purpose of this analysis

1.0.1 What is pathway analysis?

1.0.2 How to choose a pathway analysis?

2 How to run this example

2.1 Obtain the `.Rmd` file

2.2 Set up your analysis folders

2.3 Obtain the dataset from refine.bio

2.4 About the dataset we are using for this example

2.5 Place the dataset in your new `data/` folder

2.6 Check out our file structure!

3 Using a different refine.bio dataset with this analysis?

4 Gene set variation analysis - Microarray

4.1 Install libraries

4.2 Import and set up data

4.2.1 Import Gene Sets

4.2.2 Gene identifier conversion

4.2.2.1 Handling duplicate gene identifiers

4.3 GSVA - Microarray

4.3.1 Perform GSVA

4.4 Find differentially expressed pathways

4.4.1 Create the design matrix

4.5 Perform differential expression on pathway scores

4.6 Visualizing Results

4.6.1 Sina plot

4.7 Write results to file

5 Resources for further learning

6 Session info

References

Gene set variation analysis - Microarray

CCDL for ALSF

November 2020

1 Purpose of this analysis

1.0.1 What is pathway analysis?

1.0.2 How to choose a pathway analysis?

2 How to run this example

2.1 Obtain the .Rmd file

2.2 Set up your analysis folders

2.3 Obtain the dataset from refine.bio

2.4 About the dataset we are using for this example

2.5 Place the dataset in your new data/ folder

2.6 Check out our file structure!

3 Using a different refine.bio dataset with this analysis?

4 Gene set variation analysis - Microarray

4.1 Install libraries

4.2 Import and set up data

4.2.1 Import Gene Sets

4.2.2 Gene identifier conversion

4.2.2.1 Handling duplicate gene identifiers

4.3 GSVA - Microarray

4.3.1 Perform GSVA

4.4 Find differentially expressed pathways

4.4.1 Create the design matrix

4.5 Perform differential expression on pathway scores

4.6 Visualizing Results

4.6.1 Sina plot

4.7 Write results to file

5 Resources for further learning

6 Session info

References

2.1 Obtain the `.Rmd` file

2.5 Place the dataset in your new `data/` folder