This example is one of pathway analysis module set, we recommend looking at the pathway analysis table below to help you determine which pathway analysis method is best suited for your purposes.
In this example we will cover a method called Gene Set Variation Analysis (GSVA) to calculate gene set or pathway scores on a per-sample basis (Hänzelmann et al. 2013). You can use the GSVA scores for other downstream analyses. In this analysis, we will test GSVA scores for differential expression.
⬇️ Jump to the analysis code ⬇️
Pathway analysis refers to any one of many techniques that uses predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. In the context of refine.bio, we use these techniques to analyze and interpret genome-wide gene expression experiments. The rationale for performing pathway analysis is that looking at the pathway-level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis.
We highly recommend taking a look at Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges from Khatri et al. (2012) for a more comprehensive overview. We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the Resources for further learning
section.
This table summarizes the pathway analyses examples in this module.
Analysis | What is required for input | What output looks like | ✅ Pros | ⚠️ Cons |
---|---|---|---|---|
ORA (Over-representation Analysis) | A list of gene IDs (no stats needed) | A per-pathway hypergeometric test result | - Simple - Inexpensive computationally to calculate p-values |
- Requires arbitrary thresholds and ignores any statistics associated with a gene - Assumes independence of genes and pathways |
GSEA (Gene Set Enrichment Analysis) | A list of genes IDs with gene-level summary statistics | A per-pathway enrichment score | - Includes all genes (no arbitrary threshold!) - Attempts to measure coordination of genes |
- Permutations can be expensive - Does not account for pathway overlap - Two-group comparisons not always appropriate/feasible |
GSVA (Gene Set Variation Analysis) | A gene expression matrix (like what you get from refine.bio directly) | Pathway-level scores on a per-sample basis | - Does not require two groups to compare upfront - Normally distributed scores |
- Scores are not a good fit for gene sets that contain genes that go up AND down - Method doesn’t assign statistical significance itself - Recommended sample size n > 10 |
For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.
.Rmd
fileTo run this example yourself, download the .Rmd
for this analysis by clicking this link.
Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd
file to where you would like this example and its files to be stored.
You can open this .Rmd
file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd
files.)
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
dir.create("data")
}
# Define the file path to the plots directory
<- "plots"
plots_dir
# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
dir.create(plots_dir)
}
# Define the file path to the results directory
<- "results"
results_dir
# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
dir.create(results_dir)
}
# Define the file path to the gene_sets directory
<- "gene_sets"
gene_sets_dir
# Create the gene_sets folder if it doesn't exist
if (!dir.exists(gene_sets_dir)) {
dir.create(gene_sets_dir)
}
In the same place you put this .Rmd
file, you should now have four new empty folders called data
, plots
, results
, and gene_sets
!
For general information about downloading data for these examples, see our ‘Getting Started’ section.
Go to this dataset’s page on refine.bio.
Click the “Download Now” button on the right side of this screen.
Fill out the pop up window with your email and our Terms and Conditions:
It may take a few minutes for the dataset to process. You will get an email when it is ready.
For this example analysis, we will use this medulloblastoma dataset (Northcott et al. 2012). The data that we downloaded from refine.bio for this analysis has 285 microarray samples obtained from patients with medulloblastoma. Medulloblastoma is the most common childhood brain cancer and is often categorized by subgroups. We will use these subgroup
labels from our metadata to perform differential expression with our GSVA scores.
data/
folderrefine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip
. Double clicking should unzip this for you and create a folder of the same name.
For more details on the contents of this folder see these docs on refine.bio.
The GSE37382
folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235
or SRP12345
.
Copy and paste the GSE37382
folder into your newly created data/
folder.
Your new analysis folder should contain:
.Rmd
you downloadedGSE37382
folder which contains:
plots
(currently empty)results
(currently empty)gene_sets
(currently empty)Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):
In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.
First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.
# Define the file path to the data directory
# Replace with the path of the folder the files will be in
<- file.path("data", "GSE37382")
data_dir
# Declare the file path to the gene expression matrix file
# inside directory saved as `data_dir`
# Replace with the path to your dataset file
<- file.path(data_dir, "GSE37382.tsv")
data_file
# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
# Replace with the path to your metadata file
<- file.path(data_dir, "metadata_GSE37382.tsv") metadata_file
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
## [1] TRUE
# Check if the metadata file is at the file path stored in `metadata_file`
file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE
to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.
If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/
directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/
and results/
directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.
See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
In this analysis, we will be using the GSVA
package to perform GSVA and the qusage
package to read in the GMT file containing the gene set data (Hänzelmann et al. 2013; Yaari et al. 2013).
We will also need the org.Hs.eg.db
package to perform gene identifier conversion (Carlson 2020).
We’ll also be performing differential expression test on our GSVA scores, and for that we will use limma
(Ritchie et al. 2015) and we’ll make a sina plot of the scores of our most significant pathway using a ggplot2 companion package, ggforce
.
if (!("GSVA" %in% installed.packages())) {
# Install this package if it isn't installed yet
::install("GSVA", update = FALSE)
BiocManager
}
if (!("qusage" %in% installed.packages())) {
# Install this package if it isn't installed yet
::install("qusage", update = FALSE)
BiocManager
}
if (!("org.Hs.eg.db" %in% installed.packages())) {
# Install this package if it isn't installed yet
::install("org.Hs.eg.db", update = FALSE)
BiocManager
}
if (!("limma" %in% installed.packages())) {
# Install this package if it isn't installed yet
::install("limma", update = FALSE)
BiocManager
}
if (!("ggforce" %in% installed.packages())) {
# Install this package if it isn't installed yet
install.packages("ggforce")
}
Attach the packages we need for this analysis.
# Attach the `qusage` library
library(qusage)
# Attach the `GSVA` library
library(GSVA)
# Human annotation package we'll use for gene identifier conversion
library(org.Hs.eg.db)
# Attach the ggplot2 package for plotting
library(ggplot2)
# We will need this so we can use the pipe: %>%
library(magrittr)
Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.
We stored our file paths as objects named metadata_file
and data_file
in this previous step.
# Read in metadata TSV file
<- readr::read_tsv(metadata_file) metadata
##
## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_character(),
## refinebio_age = col_double(),
## refinebio_cell_line = col_logical(),
## refinebio_compound = col_logical(),
## refinebio_disease = col_logical(),
## refinebio_disease_stage = col_logical(),
## refinebio_genetic_information = col_logical(),
## refinebio_processed = col_logical(),
## refinebio_race = col_logical(),
## refinebio_source_archive_url = col_logical(),
## refinebio_specimen_part = col_logical(),
## refinebio_subject = col_logical(),
## refinebio_time = col_logical(),
## refinebio_treatment = col_logical(),
## channel_count = col_double(),
## `contact_zip/postal_code` = col_double(),
## data_row_count = col_double(),
## taxid_ch1 = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
# Read in data TSV file
<- readr::read_tsv(data_file) expression_df
##
## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_double(),
## Gene = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
Let’s ensure that the metadata and data are in the same sample order.
# Make the data in the order of the metadata
<- expression_df %>%
expression_df ::select(c(Gene, metadata$geo_accession))
dplyr
# Check if this is in the same order
all.equal(colnames(expression_df)[-1], metadata$geo_accession)
## [1] TRUE
The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses (Subramanian et al. 2005; Liberzon et al. 2011). MSigDB contains 8 different gene set collections (Subramanian et al. 2005; Liberzon et al. 2011) that are distinguished by how they are derived (e.g., computationally mined, curated).
In this example, we will use a collection called Hallmark gene sets for GSVA (Liberzon et al. 2015). Here’s an excerpt of the collection description from MSigDB:
Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression.
The function that we will use to run GSVA requires the gene sets to be in a list. We are going to download a GMT file that contains the the Hallmark gene sets (Liberzon et al. 2015) from MSigDB (Subramanian et al. 2005; Liberzon et al. 2011) to the gene_sets
directory.
Note that when downloading GMT files from MSigDB, only Homo sapiens gene sets are supported. If you’d like to use MSigDB gene sets with GSVA for a commonly studied model organism, check out our RNA-seq GSVA example.
<- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.2/h.all.v7.2.entrez.gmt" hallmarks_gmt_url
We will also declare a file path to where we want this file to be downloaded to and we can use the same file path later for reading the file into R.
<- file.path(
hallmarks_gmt_file
gene_sets_dir,"h.all.v7.2.entrez.gmt"
)
Using the URL (hallmarks_gmt_url
) and file path (hallmark_gmt_file
) we can download the file and use the destfile
argument to specify where it should be saved.
download.file(
hallmarks_gmt_url,# The file will be saved to this location and with this name
destfile = hallmarks_gmt_file
)
Now let’s double check that the file that contains the gene sets is in the right place.
# Check if the file exists
file.exists(hallmarks_gmt_file)
## [1] TRUE
Now we’re ready to read the file into R with qusage::read.gmt()
.
# QuSAGE is another pathway analysis method, the `qusage` package has a function
# for reading GMT files and turning them into a list that we can use with GSVA
<- qusage::read.gmt(hallmarks_gmt_file) hallmarks_list
What does this hallmarks_list
look like?
head(hallmarks_list, n = 2)
## $HALLMARK_TNFA_SIGNALING_VIA_NFKB
## [1] "3726" "2920" "467" "4792" "7128" "5743" "2919"
## [8] "8870" "9308" "6364" "2921" "23764" "4791" "7127"
## [15] "1839" "1316" "330" "5329" "7538" "3383" "3725"
## [22] "1960" "3553" "597" "23645" "80149" "6648" "4929"
## [29] "3552" "5971" "7185" "7832" "1843" "1326" "2114"
## [36] "2152" "6385" "1958" "3569" "7124" "23135" "4790"
## [43] "3976" "5806" "8061" "3164" "182" "6351" "2643"
## [50] "6347" "1827" "1844" "10938" "9592" "5966" "8837"
## [57] "8767" "4794" "8013" "22822" "51278" "8744" "2669"
## [64] "1647" "3627" "10769" "8553" "1959" "9021" "11182"
## [71] "5734" "1847" "5055" "4783" "5054" "10221" "25976"
## [78] "5970" "329" "6372" "9516" "7130" "960" "3624"
## [85] "5328" "4609" "3604" "6446" "10318" "10135" "2355"
## [92] "10957" "3398" "969" "3575" "1942" "7262" "5209"
## [99] "6352" "79693" "3460" "8878" "10950" "4616" "8942"
## [106] "50486" "694" "4170" "7422" "5606" "1026" "3491"
## [113] "10010" "3433" "3606" "7280" "3659" "2353" "4973"
## [120] "388" "374" "4814" "65986" "8613" "9314" "6373"
## [127] "6303" "1435" "1880" "56937" "5791" "7097" "57007"
## [134] "7071" "4082" "3914" "1051" "9322" "2150" "687"
## [141] "3949" "7050" "127544" "55332" "2683" "11080" "1437"
## [148] "5142" "8303" "5341" "6776" "23258" "595" "23586"
## [155] "8877" "941" "25816" "57018" "2526" "9034" "80176"
## [162] "8848" "9334" "150094" "23529" "4780" "2354" "5187"
## [169] "10725" "490" "3593" "3572" "9120" "19" "3280"
## [176] "604" "8660" "6515" "1052" "51561" "4088" "6890"
## [183] "9242" "64135" "3601" "79155" "602" "24145" "24147"
## [190] "1906" "10209" "650" "1846" "10611" "23308" "9945"
## [197] "10365" "3371" "5271" "4084"
##
## $HALLMARK_HYPOXIA
## [1] "5230" "5163" "2632" "5211" "226" "2026" "5236"
## [8] "10397" "3099" "230" "2821" "4601" "6513" "5033"
## [15] "133" "8974" "2023" "5214" "205" "26355" "5209"
## [22] "7422" "665" "7167" "30001" "55818" "901" "3939"
## [29] "2997" "2597" "8553" "51129" "3725" "5054" "4015"
## [36] "2645" "8497" "23764" "54541" "6515" "3486" "4783"
## [43] "2353" "3516" "3098" "10370" "3669" "2584" "26118"
## [50] "5837" "6781" "23036" "694" "123" "1466" "7436"
## [57] "23210" "2131" "2152" "5165" "55139" "7360" "229"
## [64] "8614" "54206" "2027" "10957" "3162" "5228" "26330"
## [71] "9435" "55076" "63827" "467" "857" "272" "2719"
## [78] "3340" "8660" "8819" "2548" "6385" "8987" "8870"
## [85] "5313" "3484" "5329" "112464" "8839" "9215" "25819"
## [92] "6275" "58528" "7538" "1956" "1907" "3423" "1026"
## [99] "6095" "1843" "4282" "5507" "10570" "11015" "1837"
## [106] "136" "9957" "284119" "2908" "1316" "2239" "3491"
## [113] "7128" "771" "3073" "633" "23645" "55276" "5292"
## [120] "25824" "55577" "1027" "680" "8277" "4493" "538"
## [127] "4502" "9672" "25976" "5317" "302" "5224" "1649"
## [134] "5578" "2542" "7852" "1944" "1356" "8609" "1490"
## [141] "9469" "7163" "56925" "124872" "10891" "596" "2651"
## [148] "3036" "54800" "949" "6576" "6383" "839" "7428"
## [155] "2309" "5155" "126792" "6518" "8406" "1942" "2745"
## [162] "57007" "5066" "7045" "1634" "6478" "51316" "2203"
## [169] "8459" "5260" "4627" "1028" "9380" "5105" "3623"
## [176] "3309" "8509" "23327" "7162" "7511" "3569" "6533"
## [183] "4214" "3948" "9590" "26136" "3798" "3906" "1289"
## [190] "2817" "3069" "10994" "1463" "7052" "2113" "3219"
## [197] "8991" "2355" "6820" "7043"
Looks like we have a list of gene sets with associated Entrez IDs.
In our gene expression data frame, expression_df
we have Ensembl gene identifiers. So we will need to convert our Ensembl IDs into Entrez IDs for GSVA.
We’re going to convert our identifiers in expression_df
to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!
The annotation package org.Hs.eg.db
contains information for different identifiers (Carlson 2020). org.Hs.eg.db
is specific to Homo sapiens – this is what the Hs
in the package name is referencing.
We can see what types of IDs are available to us in an annotation package with keytypes()
.
keytypes(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT"
## [5] "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE"
## [9] "EVIDENCEALL" "GENENAME" "GO" "GOALL"
## [13] "IPI" "MAP" "OMIM" "ONTOLOGY"
## [17] "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [25] "UNIGENE" "UNIPROT"
Even though we’ll use this package to convert from Ensembl gene IDs (ENSEMBL
) to Entrez IDs (ENTREZID
), we could just as easily use it to convert from an Ensembl transcript ID (ENSEMBLTRANS
) to gene symbols (SYMBOL
).
Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: the microarray example and the RNA-seq example.
The function we will use to map from Ensembl gene IDs to Entrez gene IDs is called mapIds()
and comes from the AnnotationDbi
package.
Let’s create a data frame that shows the mapped Entrez IDs along with the gene expression values for the respective Ensembl IDs.
# First let's create a mapped data frame we can join to the gene expression
# values
<- data.frame(
mapped_df "entrez_id" = mapIds(
# Replace with annotation package for the organism relevant to your data
org.Hs.eg.db,keys = expression_df$Gene,
# Replace with the type of gene identifiers in your data
keytype = "ENSEMBL",
# Replace with the type of gene identifiers you would like to map to
column = "ENTREZID",
# This will keep only the first mapped value for each Ensembl ID
multiVals = "first"
)%>%
) # If an Ensembl gene identifier doesn't map to a Entrez gene identifier, drop
# that from the data frame
::filter(!is.na(entrez_id)) %>%
dplyr# Make an `Ensembl` column to store the row names
::rownames_to_column("Ensembl") %>%
tibble# Now let's join the rest of the expression data
::inner_join(expression_df, by = c("Ensembl" = "Gene")) dplyr
## 'select()' returned 1:many mapping between keys and columns
This 1:many mapping between keys and columns
message means that some Ensembl gene identifiers map to multiple Entrez IDs. In this case, it’s also possible that a Entrez ID will map to multiple Ensembl IDs. For the purpose of performing GSVA later in this notebook, we keep only the first mapped IDs. For more about how to explore this, take a look at our microarray gene ID conversion example.
Let’s see a preview of mapped_df
.
head(mapped_df)
We will want to keep in mind that GSVA requires that data is in a matrix with the gene identifiers as row names. In order to successfully turn our data frame into a matrix, we will need to ensure that we do not have any duplicate gene identifiers.
Let’s check to see if we have any Entrez IDs that mapped to multiple Ensembl IDs.
any(duplicated(mapped_df$entrez_id))
## [1] TRUE
Looks like we do have duplicated Entrez IDs. Let’s find out which ones.
<- mapped_df %>%
dup_entrez_ids ::filter(duplicated(entrez_id)) %>%
dplyr::pull(entrez_id)
dplyr
dup_entrez_ids
## [1] "6013" "3117"
As we mentioned earlier, we will not want any duplicate gene identifiers in our data frame when we convert it into a matrix in preparation for the GSVA steps later. GSVA is executed on a per sample basis so let’s keep the maximum expression value per sample associated with the duplicate Entrez gene identifiers. In other words, we will keep only the maximum expression value found across the duplicate Entrez gene identifier instances for each sample or column.
Let’s take a look at the rows associated with the duplicated Entrez IDs and see how this will play out.
%>%
mapped_df ::filter(entrez_id %in% dup_entrez_ids) dplyr
As an example using the strategy we described, for GSM917111
’s data in the first column, 0.2294387
is larger than 0.1104345
so moving forward, Entrez gene 6013
will have 0.2294387
value and the 0.1104345
would be dropped from the dataset. This is just one method of handling duplicate gene identifiers. See the Gene Set Enrichment Analysis (GSEA) User guide for more information on other commonly used strategies, such as taking the median expression value.
Now, let’s implement the maximum value method for all samples and Entrez IDs using tidyverse functions.
<- mapped_df %>%
max_dup_df # We won't be using Ensembl IDs moving forward, so we will drop this column
::select(-Ensembl) %>%
dplyr# Filter to include only the rows associated with the duplicate Entrez gene
# identifiers
::filter(entrez_id %in% dup_entrez_ids) %>%
dplyr# Group by Entrez IDs
::group_by(entrez_id) %>%
dplyr# Get the maximum expression value using all values associated with each
# duplicate Entrez ID for each column or sample in this case
::summarize_all(max)
dplyr
max_dup_df
We can see GSM917111
now has the 0.2294387
value for Entrez ID 6013
like expected. Looks like we were able to successfully get rid of the duplicate Entrez gene identifiers!
Now let’s combine our newly de-duplicated data with the rest of the mapped data!
<- mapped_df %>%
filtered_mapped_df # We won't be using Ensembl IDs moving forward, so we will drop this column
::select(-Ensembl) %>%
dplyr# First let's get the data associated with the Entrez gene identifiers that
# aren't duplicated
::filter(!entrez_id %in% dup_entrez_ids) %>%
dplyr# Now let's bind the rows of the maximum expression data we stored in
# `max_dup_df`
::bind_rows(max_dup_df) dplyr
As mentioned earlier, we need a matrix for GSVA. Let’s now convert our data frame into a matrix and prepare our object for GSVA.
<- filtered_mapped_df %>%
filtered_mapped_matrix # We need to store our gene identifiers as row names
::column_to_rownames("entrez_id") %>%
tibble# Now we can convert our object into a matrix
as.matrix()
Note that if we had duplicate gene identifiers here, we would not be able to set them as row names.
GSVA fits a model and ranks genes based on their expression level relative to the sample distribution (Hänzelmann et al. 2013). The pathway-level score calculated is a way of asking how genes within a gene set vary as compared to genes that are outside of that gene set (Malhotra 2018).
The idea here is that we will get pathway-level scores for each sample that indicate if genes in a pathway vary concordantly in one direction (over-expressed or under-expressed relative to the overall population) (Hänzelmann et al. 2013). This means that GSVA scores will depend on the samples included in the dataset when you run GSVA; if you added more samples and ran GSVA again, you would expect the scores to change (Hänzelmann et al. 2013).
The output is a gene set by sample matrix of GSVA scores.
Let’s perform GSVA using the gsva()
function. See ?gsva
for more options.
<- gsva(
gsva_results
filtered_mapped_matrix,
hallmarks_list,method = "gsva",
# Appropriate for our log2-transformed microarray data
kcdf = "Gaussian",
# Minimum gene set size
min.sz = 15,
# Maximum gene set size
max.sz = 500,
# Compute Gaussian-distributed scores
mx.diff = TRUE,
# Don't print out the progress bar
verbose = FALSE
)
Let’s explore what the output of gsva()
looks like.
# First 6 rows, first 10 columns
head(gsva_results[, 1:10])
## GSM917111 GSM917250
## HALLMARK_TNFA_SIGNALING_VIA_NFKB -0.2784726 -0.29221444
## HALLMARK_HYPOXIA -0.1907117 -0.13033725
## HALLMARK_CHOLESTEROL_HOMEOSTASIS -0.2307863 -0.22997233
## HALLMARK_MITOTIC_SPINDLE -0.2134439 0.09773602
## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.3061668 0.27041084
## HALLMARK_TGF_BETA_SIGNALING -0.2285640 0.08510027
## GSM917281 GSM917062
## HALLMARK_TNFA_SIGNALING_VIA_NFKB -0.30693127 -0.2953894
## HALLMARK_HYPOXIA -0.24058274 -0.2658532
## HALLMARK_CHOLESTEROL_HOMEOSTASIS -0.25341066 -0.2214914
## HALLMARK_MITOTIC_SPINDLE -0.13886393 -0.2020978
## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.06319446 -0.2363895
## HALLMARK_TGF_BETA_SIGNALING -0.14161796 -0.2284998
## GSM917288 GSM917230
## HALLMARK_TNFA_SIGNALING_VIA_NFKB -0.22966329 -0.20914620
## HALLMARK_HYPOXIA 0.06741065 -0.02691280
## HALLMARK_CHOLESTEROL_HOMEOSTASIS -0.08702648 -0.03084332
## HALLMARK_MITOTIC_SPINDLE -0.17902098 0.05763884
## HALLMARK_WNT_BETA_CATENIN_SIGNALING 0.21274606 0.08273239
## HALLMARK_TGF_BETA_SIGNALING 0.01208862 -0.13097578
## GSM917152 GSM917242
## HALLMARK_TNFA_SIGNALING_VIA_NFKB 0.33276903 0.001857506
## HALLMARK_HYPOXIA 0.18446386 -0.118269791
## HALLMARK_CHOLESTEROL_HOMEOSTASIS 0.05273271 0.104042284
## HALLMARK_MITOTIC_SPINDLE 0.14226250 -0.052122165
## HALLMARK_WNT_BETA_CATENIN_SIGNALING 0.37981263 -0.037661623
## HALLMARK_TGF_BETA_SIGNALING 0.15915374 0.300603909
## GSM917226 GSM917290
## HALLMARK_TNFA_SIGNALING_VIA_NFKB -0.1329156 -0.385841741
## HALLMARK_HYPOXIA -0.2641157 -0.145480093
## HALLMARK_CHOLESTEROL_HOMEOSTASIS -0.2136088 -0.267519873
## HALLMARK_MITOTIC_SPINDLE -0.3753805 -0.001471942
## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.3570903 -0.006265662
## HALLMARK_TGF_BETA_SIGNALING -0.1973818 -0.130123427
If we want to identify most differentially expressed pathways across subgroups, we can use functionality in the limma
package to test the GSVA scores.
This is one approach for working with GSVA scores; the mx.diff = TRUE
argument that we supplied to the gsva()
function in the previous section means the GSVA output scores should be normally distributed, which can be helpful if you want to perform downstream analyses with approaches that make assumptions of normality (Hänzelmann et al. 2020).
limma
needs a numeric design matrix to indicate which subtype of medulloblastoma a sample originates from. Now we will create a model matrix based on our subgroup
variable. We are using a + 0
in the model which sets the intercept to 0 so the subgroup effects capture expression for that group, rather than difference from the first group. If you have a control group, you might want to leave off the + 0
so the model includes an intercept representing the control group expression level, with the remaining coefficients the changes relative to that expression level.
# Create the design matrix
<- model.matrix(~ subgroup + 0, data = metadata) des_mat
Let’s take a look at the design matrix we created.
# Print out part of the design matrix
head(des_mat)
## subgroupGroup 3 subgroupGroup 4 subgroupSHH
## 1 0 1 0
## 2 0 1 0
## 3 0 1 0
## 4 1 0 0
## 5 0 1 0
## 6 0 1 0
The design matrix column names are a bit messy, so we will neaten them up by dropping the subgroup
designation they all have and any spaces in names.
# Make the column names less messy
colnames(des_mat) <- stringr::str_remove(colnames(des_mat), "subgroup")
# Do a similar thing but remove spaces in names
colnames(des_mat) <- stringr::str_remove(colnames(des_mat), " ")
Run the linear model on each pathway (each row of gsva_results
) with lmFit()
and apply empirical Bayes smoothing with eBayes()
.
# Apply linear model to data
<- limma::lmFit(gsva_results, design = des_mat)
fit
# Apply empirical Bayes to smooth standard errors
<- limma::eBayes(fit) fit
Now that we have our basic model fitting, we will want to make the contrasts among all our groups. Depending on your scientific questions, you will need to customize the next steps. Consulting the limma users guide for how to set up your model based on your hypothesis is a good idea.
In this contrasts matrix, we are comparing each subgroup to the average of the other subgroups.
We’re dividing by two in this expression so that each group is compared to the average of the other two groups (makeContrasts()
doesn’t allow you to use functions like mean()
; it wants a formula).
<- makeContrasts(
contrast_matrix "G3vsOther" = Group3 - (Group4 + SHH) / 2,
"G4vsOther" = Group4 - (Group3 + SHH) / 2,
"SHHvsOther" = SHH - (Group3 + Group4) / 2,
levels = des_mat
)
Side note: If you did have a control group you wanted to compare each group to, you could make each contrast to that control group, so the formulae would look like Group3 = Group3 - Control
for each one. We highly recommend consulting the limma users guide for figuring out what your makeContrasts()
and model.matrix()
setups should look like (Ritchie et al. 2015).
Now that we have the contrasts matrix set up, we can use it to re-fit the model and re-smooth it with eBayes()
.
# Fit the model according to the contrasts matrix
<- contrasts.fit(fit, contrast_matrix)
contrasts_fit
# Re-smooth the Bayes
<- eBayes(contrasts_fit) contrasts_fit
Here’s a nifty article and example about what the empirical Bayes smoothing is for (Robinson).
Now let’s create the results table based on the contrasts fitted model.
This step will provide the Benjamini-Hochberg multiple testing correction. The topTable()
function default is to use Benjamini-Hochberg but this can be changed to a different method using the adjust.method
argument (see the ?topTable
help page for more about the options).
# Apply multiple testing correction and obtain stats
<- topTable(contrasts_fit, number = nrow(expression_df)) %>%
stats_df ::rownames_to_column("Gene") tibble
Let’s take a peek at our results table.
head(stats_df)
For each pathway, each group’s fold change in expression, compared to the average of the other groups is reported.
By default, results are ordered from largest F
value to the smallest, which means your most differentially expressed pathways across all groups should be toward the top.
This means HALLMARK_UNFOLDED_PROTEIN_RESPONSE
appears to be the pathway that contains the most significant distribution of scores across subgroups.
Let’s make a plot for our most significant pathway, HALLMARK_UNFOLDED_PROTEIN_RESPONSE
.
First we need to get our GSVA scores for this pathway into a long data frame, an appropriate format for ggplot2
.
Let’s look at the current format of gsva_results
.
head(gsva_results[, 1:10])
## GSM917111 GSM917250
## HALLMARK_TNFA_SIGNALING_VIA_NFKB -0.2784726 -0.29221444
## HALLMARK_HYPOXIA -0.1907117 -0.13033725
## HALLMARK_CHOLESTEROL_HOMEOSTASIS -0.2307863 -0.22997233
## HALLMARK_MITOTIC_SPINDLE -0.2134439 0.09773602
## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.3061668 0.27041084
## HALLMARK_TGF_BETA_SIGNALING -0.2285640 0.08510027
## GSM917281 GSM917062
## HALLMARK_TNFA_SIGNALING_VIA_NFKB -0.30693127 -0.2953894
## HALLMARK_HYPOXIA -0.24058274 -0.2658532
## HALLMARK_CHOLESTEROL_HOMEOSTASIS -0.25341066 -0.2214914
## HALLMARK_MITOTIC_SPINDLE -0.13886393 -0.2020978
## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.06319446 -0.2363895
## HALLMARK_TGF_BETA_SIGNALING -0.14161796 -0.2284998
## GSM917288 GSM917230
## HALLMARK_TNFA_SIGNALING_VIA_NFKB -0.22966329 -0.20914620
## HALLMARK_HYPOXIA 0.06741065 -0.02691280
## HALLMARK_CHOLESTEROL_HOMEOSTASIS -0.08702648 -0.03084332
## HALLMARK_MITOTIC_SPINDLE -0.17902098 0.05763884
## HALLMARK_WNT_BETA_CATENIN_SIGNALING 0.21274606 0.08273239
## HALLMARK_TGF_BETA_SIGNALING 0.01208862 -0.13097578
## GSM917152 GSM917242
## HALLMARK_TNFA_SIGNALING_VIA_NFKB 0.33276903 0.001857506
## HALLMARK_HYPOXIA 0.18446386 -0.118269791
## HALLMARK_CHOLESTEROL_HOMEOSTASIS 0.05273271 0.104042284
## HALLMARK_MITOTIC_SPINDLE 0.14226250 -0.052122165
## HALLMARK_WNT_BETA_CATENIN_SIGNALING 0.37981263 -0.037661623
## HALLMARK_TGF_BETA_SIGNALING 0.15915374 0.300603909
## GSM917226 GSM917290
## HALLMARK_TNFA_SIGNALING_VIA_NFKB -0.1329156 -0.385841741
## HALLMARK_HYPOXIA -0.2641157 -0.145480093
## HALLMARK_CHOLESTEROL_HOMEOSTASIS -0.2136088 -0.267519873
## HALLMARK_MITOTIC_SPINDLE -0.3753805 -0.001471942
## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.3570903 -0.006265662
## HALLMARK_TGF_BETA_SIGNALING -0.1973818 -0.130123427
We can see that they are in a wide format with the GSVA scores for each sample spread across a row associated with each pathway.
Now let’s convert these results into a data frame and into a long format, using the tidyr::pivot_longer()
function.
<- gsva_results %>%
annotated_results_df # Make this into a data frame
data.frame() %>%
# Gene set names are row names
::rownames_to_column("pathway") %>%
tibble# Get into long format using the `tidyr::pivot_longer()` function
::pivot_longer(
tidyrcols = -pathway,
names_to = "sample",
values_to = "gsva_score"
)
# Preview the annotated results object
head(annotated_results_df)
Now let’s filter to include only the scores associated with our most significant pathway, HALLMARK_UNFOLDED_PROTEIN_RESPONSE
, and join the relevant group labels from the metadata for plotting.
<- annotated_results_df %>%
top_pathway_annotated_results_df # Filter for only scores associated with our most significant pathway
::filter(pathway == "HALLMARK_UNFOLDED_PROTEIN_RESPONSE") %>%
dplyr# Join the column with the group labels that we would like to plot
::left_join(metadata %>% dplyr::select(
dplyr# Select the variables relevant to your data
refinebio_accession_code,
subgroup
),# Tell the join what columns are equivalent and should be used as a key
by = c("sample" = "refinebio_accession_code")
)
# Preview the filtered annotated results object
head(top_pathway_annotated_results_df)
Now let’s make a sina plot so we can look at the differences between the subgroup
groups using our GSVA scores.
# Sina plot comparing GSVA scores for `HALLMARK_UNFOLDED_PROTEIN_RESPONSE`
# the `subgroup` groups in our dataset
<-
sina_plot ggplot(
# Supply our annotated data frame
top_pathway_annotated_results_df, aes(
x = subgroup, # Replace with a grouping variable relevant to your data
y = gsva_score, # Column we previously created to store the GSVA scores
color = subgroup # Let's make the groups different colors too
)+
) # Add a boxplot that will have summary stats
geom_boxplot(outlier.shape = NA) +
# Make a sina plot that shows individual values
::geom_sina() +
ggforce# Rename the y-axis label
labs(y = "HALLMARK_UNFOLDED_PROTEIN_RESPONSE_score") +
# Adjust the plot background for better visualization
theme_bw()
# Display plot
sina_plot
Looks like the Group 4
samples have lower GSVA scores for HALLMARK_UNFOLDED_PROTEIN_RESPONSE
as compared to the SHH
and Group 3
scores.
Let’s save this plot to PNG.
ggsave(
file.path(
plots_dir,"GSE37382_gsva_HALLMARK_UNFOLDED_PROTEIN_RESPONSE_sina_plot.png"
),plot = sina_plot
)
## Saving 7 x 5 in image
Now let’s write all of our GSVA results to file.
%>%
gsva_results as.data.frame() %>%
::rownames_to_column("pathway") %>%
tibble::write_tsv(file.path(
readr
results_dir,"GSE37382_gsva_results.tsv"
))
At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
# Print session info
::session_info() sessioninfo
## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.5 (2021-03-31)
## os Ubuntu 20.04.3 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2022-03-01
##
## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
## annotate 1.68.0 2020-10-27 [1] Bioconductor
## AnnotationDbi * 1.52.0 2020-10-27 [1] Bioconductor
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.3)
## backports 1.2.1 2020-12-09 [1] RSPM (R 4.0.3)
## Biobase * 2.50.0 2020-10-27 [1] Bioconductor
## BiocGenerics * 0.36.1 2021-04-16 [1] Bioconductor
## BiocParallel 1.24.1 2020-11-06 [1] Bioconductor
## bit 4.0.4 2020-08-04 [1] RSPM (R 4.0.3)
## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.0.3)
## bitops 1.0-7 2021-04-24 [1] RSPM (R 4.0.4)
## blob 1.2.1 2020-01-20 [1] RSPM (R 4.0.3)
## bslib 0.2.5 2021-05-12 [1] RSPM (R 4.0.4)
## cachem 1.0.5 2021-05-15 [1] RSPM (R 4.0.4)
## cli 2.5.0 2021-04-26 [1] RSPM (R 4.0.4)
## coda 0.19-4 2020-09-30 [1] RSPM (R 4.0.3)
## colorspace 2.0-1 2021-05-04 [1] RSPM (R 4.0.4)
## crayon 1.4.1 2021-02-08 [1] RSPM (R 4.0.3)
## DBI 1.1.1 2021-01-15 [1] RSPM (R 4.0.3)
## DelayedArray 0.16.3 2021-03-24 [1] Bioconductor
## digest 0.6.27 2020-10-24 [1] RSPM (R 4.0.3)
## dplyr 1.0.6 2021-05-05 [1] RSPM (R 4.0.4)
## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.0.4)
## emmeans 1.6.0 2021-04-24 [1] RSPM (R 4.0.4)
## estimability 1.3 2018-02-11 [1] RSPM (R 4.0.0)
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.3)
## fansi 0.4.2 2021-01-15 [1] RSPM (R 4.0.3)
## farver 2.1.0 2021-02-28 [1] RSPM (R 4.0.3)
## fastmap 1.1.0 2021-01-25 [1] RSPM (R 4.0.3)
## fftw 1.0-6 2020-02-24 [1] RSPM (R 4.0.2)
## generics 0.1.0 2020-10-31 [1] RSPM (R 4.0.3)
## GenomeInfoDb 1.26.7 2021-04-08 [1] Bioconductor
## GenomeInfoDbData 1.2.4 2022-03-01 [1] Bioconductor
## GenomicRanges 1.42.0 2020-10-27 [1] Bioconductor
## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
## ggforce 0.3.3 2021-03-05 [1] RSPM (R 4.0.3)
## ggplot2 * 3.3.3 2020-12-30 [1] RSPM (R 4.0.3)
## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.3)
## graph 1.68.0 2020-10-27 [1] Bioconductor
## GSEABase 1.52.1 2020-12-11 [1] Bioconductor
## GSVA * 1.38.2 2021-02-09 [1] Bioconductor
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.3)
## highr 0.9 2021-04-16 [1] RSPM (R 4.0.4)
## hms 1.0.0 2021-01-13 [1] RSPM (R 4.0.3)
## htmltools 0.5.1.1 2021-01-22 [1] RSPM (R 4.0.3)
## httr 1.4.2 2020-07-20 [1] RSPM (R 4.0.3)
## IRanges * 2.24.1 2020-12-12 [1] Bioconductor
## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.0.4)
## jsonlite 1.7.2 2020-12-09 [1] RSPM (R 4.0.3)
## knitr 1.33 2021-04-24 [1] RSPM (R 4.0.4)
## labeling 0.4.2 2020-10-20 [1] RSPM (R 4.0.3)
## lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.5)
## lifecycle 1.0.0 2021-02-15 [1] RSPM (R 4.0.3)
## limma * 3.46.0 2020-10-27 [1] Bioconductor
## magrittr * 2.0.1 2020-11-17 [1] RSPM (R 4.0.3)
## MASS 7.3-53.1 2021-02-12 [2] CRAN (R 4.0.5)
## Matrix 1.3-2 2021-01-06 [2] CRAN (R 4.0.5)
## MatrixGenerics 1.2.1 2021-01-30 [1] Bioconductor
## matrixStats 0.58.0 2021-01-29 [1] RSPM (R 4.0.3)
## memoise 2.0.0 2021-01-26 [1] RSPM (R 4.0.3)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.3)
## mvtnorm 1.1-1 2020-06-09 [1] RSPM (R 4.0.3)
## nlme 3.1-152 2021-02-04 [2] CRAN (R 4.0.5)
## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
## org.Hs.eg.db * 3.12.0 2022-03-01 [1] Bioconductor
## pillar 1.6.1 2021-05-16 [1] RSPM (R 4.0.4)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.3)
## polyclip 1.10-0 2019-03-14 [1] RSPM (R 4.0.3)
## ps 1.6.0 2021-02-28 [1] RSPM (R 4.0.3)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.3)
## qusage * 2.24.0 2020-10-27 [1] Bioconductor
## R.cache 0.15.0 2021-04-30 [1] RSPM (R 4.0.4)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.3)
## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.3)
## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.3)
## R6 2.5.0 2020-10-28 [1] RSPM (R 4.0.3)
## Rcpp 1.0.6 2021-01-15 [1] RSPM (R 4.0.3)
## RCurl 1.98-1.3 2021-03-16 [1] RSPM (R 4.0.4)
## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.4)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.3)
## rlang 0.4.11 2021-04-30 [1] RSPM (R 4.0.4)
## rmarkdown 2.8 2021-05-07 [1] RSPM (R 4.0.4)
## RSQLite 2.2.7 2021-04-22 [1] RSPM (R 4.0.4)
## rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.0.3)
## S4Vectors * 0.28.1 2020-12-09 [1] Bioconductor
## sass 0.4.0 2021-05-12 [1] RSPM (R 4.0.4)
## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.3)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.3)
## stringi 1.6.1 2021-05-10 [1] RSPM (R 4.0.4)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.3)
## styler 1.4.1 2021-03-30 [1] RSPM (R 4.0.4)
## SummarizedExperiment 1.20.0 2020-10-27 [1] Bioconductor
## tibble 3.1.2 2021-05-16 [1] RSPM (R 4.0.4)
## tidyr 1.1.3 2021-03-03 [1] RSPM (R 4.0.4)
## tidyselect 1.1.1 2021-04-30 [1] RSPM (R 4.0.4)
## tweenr 1.0.2 2021-03-23 [1] RSPM (R 4.0.4)
## utf8 1.2.1 2021-03-12 [1] RSPM (R 4.0.3)
## vctrs 0.3.8 2021-04-29 [1] RSPM (R 4.0.4)
## withr 2.4.2 2021-04-18 [1] RSPM (R 4.0.4)
## xfun 0.23 2021-05-15 [1] RSPM (R 4.0.4)
## XML 3.99-0.6 2021-03-16 [1] RSPM (R 4.0.4)
## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.0.3)
## XVector 0.30.0 2020-10-27 [1] Bioconductor
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.3)
## zlibbioc 1.36.0 2020-10-27 [1] Bioconductor
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library