This example is one of pathway analysis module set, we recommend looking at the pathway analysis table below to help you determine which pathway analysis method is best suited for your purposes.
In this example we will cover a method called Gene Set Variation Analysis (GSVA) to calculate gene set or pathway scores on a per-sample basis (Hänzelmann et al. 2013a). GSVA transforms a gene by sample gene expression matrix into a gene set by sample pathway enrichment matrix (Hänzelmann et al. 2013b). We’ll make a heatmap of the enrichment matrix, but you can use the GSVA scores for a number of other downstream analyses such as differential expression analysis.
⬇️ Jump to the analysis code ⬇️
Pathway analysis refers to any one of many techniques that uses predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. In the context of refine.bio, we use these techniques to analyze and interpret genome-wide gene expression experiments. The rationale for performing pathway analysis is that looking at the pathway-level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis.
We highly recommend taking a look at Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges from Khatri et al. (2012) for a more comprehensive overview. We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the Resources for further learning
section.
This table summarizes the pathway analyses examples in this module.
Analysis | What is required for input | What output looks like | ✅ Pros | ⚠️ Cons |
---|---|---|---|---|
ORA (Over-representation Analysis) | A list of gene IDs (no stats needed) | A per-pathway hypergeometric test result | - Simple - Inexpensive computationally to calculate p-values |
- Requires arbitrary thresholds and ignores any statistics associated with a gene - Assumes independence of genes and pathways |
GSEA (Gene Set Enrichment Analysis) | A list of genes IDs with gene-level summary statistics | A per-pathway enrichment score | - Includes all genes (no arbitrary threshold!) - Attempts to measure coordination of genes |
- Permutations can be expensive - Does not account for pathway overlap - Two-group comparisons not always appropriate/feasible |
GSVA (Gene Set Variation Analysis) | A gene expression matrix (like what you get from refine.bio directly) | Pathway-level scores on a per-sample basis | - Does not require two groups to compare upfront - Normally distributed scores |
- Scores are not a good fit for gene sets that contain genes that go up AND down - Method doesn’t assign statistical significance itself - Recommended sample size n > 10 |
For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.
.Rmd
fileTo run this example yourself, download the .Rmd
for this analysis by clicking this link.
Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd
file to where you would like this example and its files to be stored.
You can open this .Rmd
file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd
files.)
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
dir.create("data")
}
# Define the file path to the plots directory
<- "plots"
plots_dir
# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
dir.create(plots_dir)
}
# Define the file path to the results directory
<- "results"
results_dir
# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
dir.create(results_dir)
}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
For general information about downloading data for these examples, see our ‘Getting Started’ section.
Go to this dataset’s page on refine.bio.
Click the “Download Now” button on the right side of this screen.
Fill out the pop up window with your email and our Terms and Conditions:
It may take a few minutes for the dataset to process. You will get an email when it is ready.
For this example analysis, we will use this acute viral bronchiolitis dataset. The data that we downloaded from refine.bio for this analysis has 62 paired peripheral blood mononuclear cell RNA-seq samples obtained from 31 patients. Samples were collected at two time points: during their first, acute bronchiolitis visit (abbreviated “AV”) and their recovery, their post-convalescence visit (abbreviated “CV”).
data/
folderrefine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip
. Double clicking should unzip this for you and create a folder of the same name.
For more details on the contents of this folder see these docs on refine.bio.
The SRP140558
folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235
or SRP12345
.
Copy and paste the SRP140558
folder into your newly created data/
folder.
Your new analysis folder should contain:
.Rmd
you downloadedSRP140558
folder which contains:
plots
(currently empty)results
(currently empty)Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):
In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.
First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.
# Define the file path to the data directory
# Replace with the path of the folder the files will be in
<- file.path("data", "SRP140558")
data_dir
# Declare the file path to the gene expression matrix file
# inside directory saved as `data_dir`
# Replace with the path to your dataset file
<- file.path(data_dir, "SRP140558.tsv")
data_file
# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
# Replace with the path to your metadata file
<- file.path(data_dir, "metadata_SRP140558.tsv") metadata_file
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
## [1] TRUE
# Check if the metadata file is at the file path stored in `metadata_file`
file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE
to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.
If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/
directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/
and results/
directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.
See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
We will be using DESeq2
to normalize and transform our RNA-seq data before running GSVA, so we will need to install that (Love et al. 2014).
In this analysis, we will be using the GSVA
package to perform GSVA and the qusage
package to read in the GMT file containing the gene set data (Hänzelmann et al. 2013a; Yaari et al. 2013).
We will also need the org.Hs.eg.db
package to perform gene identifier conversion (Carlson 2020).
We’ll create a heatmap from our pathway analysis results using pheatmap
(Slowikowski 2017).
if (!("DESeq2" %in% installed.packages())) {
# Install this package if it isn't installed yet
::install("DESeq2", update = FALSE)
BiocManager
}
if (!("GSVA" %in% installed.packages())) {
# Install this package if it isn't installed yet
::install("GSVA", update = FALSE)
BiocManager
}
if (!("qusage" %in% installed.packages())) {
# Install this package if it isn't installed yet
::install("qusage", update = FALSE)
BiocManager
}
if (!("org.Hs.eg.db" %in% installed.packages())) {
# Install this package if it isn't installed yet
::install("org.Hs.eg.db", update = FALSE)
BiocManager
}
if (!("pheatmap" %in% installed.packages())) {
# Install pheatmap
install.packages("pheatmap", update = FALSE)
}
Attach the packages we need for this analysis.
# Attach the DESeq2 library
library(DESeq2)
# Attach the `qusage` library
library(qusage)
# Attach the `GSVA` library
library(GSVA)
# Human annotation package we'll use for gene identifier conversion
library(org.Hs.eg.db)
# We will need this so we can use the pipe: %>%
library(magrittr)
Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the both TSV files and add them as data frames to your environment.
We stored our file paths as objects named metadata_file
and data_file
in this previous step.
# Read in metadata TSV file
<- readr::read_tsv(metadata_file) metadata
##
## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_logical(),
## refinebio_accession_code = col_character(),
## experiment_accession = col_character(),
## refinebio_organism = col_character(),
## refinebio_platform = col_character(),
## refinebio_source_database = col_character(),
## refinebio_subject = col_character(),
## refinebio_title = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
# Read in data TSV file
<- readr::read_tsv(data_file) %>%
expression_df # Here we are going to store the gene IDs as row names so that we can have a numeric matrix to perform calculations on later
::column_to_rownames("Gene") tibble
##
## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_double(),
## Gene = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
Let’s ensure that the metadata and data are in the same sample order.
# Make the data in the order of the metadata
<- expression_df %>%
expression_df ::select(metadata$refinebio_accession_code)
dplyr
# Check if this is in the same order
all.equal(colnames(expression_df), metadata$refinebio_accession_code)
## [1] TRUE
DESeq2
There are two things we need to do to prep our expression data for DESeq2.
First, we need to make sure all of the values in our data are converted to integers as required by a DESeq2
function we will use later.
Then, we need to filter out the genes that have not been expressed or that have low expression counts since we can not be as confident in those genes being reliably measured. We are going to do some pre-filtering to keep only genes with 50 or more reads in total across the samples.
<- expression_df %>%
expression_df # Only keep rows that have total counts above the cutoff
::filter(rowSums(.) >= 50) %>%
dplyr# The next DESeq2 functions need the values to be converted to integers
round()
We will be using the DESeq2
package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet
object. We turn the data frame (or matrix) into a DESeqDataSet
object and specify which variable labels our experimental groups using the design
argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design
argument because we are not performing a differential expression analysis.
# Create a `DESeqDataSet` object
<- DESeqDataSetFromMatrix(
dds countData = expression_df, # Our prepped data frame with counts
colData = metadata, # Data frame with annotation for our samples
design = ~1 # Here we are not specifying a model
)
## converting counts to integer mode
We often suggest normalizing and transforming your data for various applications including for GSVA. We are going to use the vst()
function from the DESeq2
package to normalize and transform the data. For more information about these transformation methods, see here.
# Normalize and transform the data in the `DESeqDataSet` object using the `vst()`
# function from the `DESEq2` R package
<- vst(dds) dds_norm
At this point, if your data set has any outlier samples, you should look into removing them as they can affect your results. For this example data set, we will skip this step (there are no obvious outliers) and proceed.
But now we are ready to format our dataset for input into gsva::gsva()
. We need to extract the normalized counts to a matrix and make it into a data frame so we can use with tidyverse functions later.
# Retrieve the normalized data from the `DESeqDataSet`
<- assay(dds_norm) %>%
vst_df as.data.frame() %>% # Make into a data frame
::rownames_to_column("ensembl_id") # Make Gene IDs into their own column tibble
The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses (Subramanian et al. 2005; Liberzon et al. 2011). MSigDB contains 8 different gene set collections (Subramanian et al. 2005; Liberzon et al. 2011) that are distinguished by how they are derived (e.g., computationally mined, curated).
In this example, we will use a collection called Hallmark gene sets for GSVA (Liberzon et al. 2015). Here’s an excerpt of the collection description from MSigDB:
Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression.
Here we are obtaining the pathway information from the main function of the msigdbr
package (Dolgalev 2020). Because we are using human data in this example, we supply the formal organism name to the species
argument. We will want only the hallmark pathways, so we use the category = "H"
argument.
<- msigdbr::msigdbr(
hallmark_gene_sets species = "Homo sapiens", # Can change this to what species you need
category = "H" # Only hallmark gene sets
)
Let’s take a look at the format of hallmarks_gene_set
.
head(hallmark_gene_sets)
We can see this object is in a tabular format; each row corresponds to a gene and gene set pair. A row exists if that gene (entrez_gene
, gene_symbol
) belongs to a gene set (gs_name
).
The function that we will use to run GSVA wants the gene sets to be in a list, where each entry in the list is a vector of genes that comprise the pathway the element is named for. In the next step, we’ll demonstrate how to go from this data frame format to a list.
For this example we will use Entrez IDs (but note that there are gene symbols we could use just as easily). The info we need is in two columns: entrez_gene
contains the gene ID and gs_name
contains the name of the pathway that the gene is a part of.
To make this into the list format we need, we can use the split()
function. We want a list where each element of the list is a vector that contains the Entrez gene IDs that are in a particular pathway set.
<- split(
hallmarks_list $entrez_gene, # The genes we want split into pathways
hallmark_gene_sets$gs_name # The pathways made as the higher levels of the list
hallmark_gene_sets )
What does this hallmarks_list
look like?
head(hallmarks_list, n = 2)
## $HALLMARK_ADIPOGENESIS
## [1] 19 11194 10449 33 34 35 47 50 51
## [10] 112 149685 9370 79602 79602 56894 9131 204 217
## [19] 226 284 51129 334 348 369 10124 64225 483
## [28] 539 11176 593 23786 604 718 847 284119 8436
## [37] 901 977 9936 948 1031 400916 400916 1147 1149
## [46] 134147 51727 1306 1282 51805 84274 57017 1337 1349
## [55] 1351 1376 1384 1431 1537 1580 1629 1652 1652
## [64] 1666 8694 8694 1717 51635 25979 1737 1738 4189
## [73] 29103 128338 1891 1891 1892 84173 79071 5168 2053
## [82] 2101 23344 2109 2167 2184 8322 9908 1647 2632
## [91] 27069 57678 137964 2820 10243 2878 2879 80273 3033
## [100] 26275 26353 3417 3419 3421 3459 10989 3679 80760
## [109] 6453 84522 3910 3952 3977 3991 10162 4023 4056
## [118] 4056 8491 56922 4191 4199 11343 4259 84895 56246
## [127] 29088 54996 23788 23788 4638 64859 4698 4706 4713
## [136] 4722 4722 28512 4836 4958 5004 27250 10400 5195
## [145] 5209 5211 5236 23187 5264 415116 123 5447 5468
## [154] 5495 84919 10935 10113 55037 5733 5860 83871 7905
## [163] 92840 56729 54884 8780 55177 26994 6239 10313 25813
## [172] 949 6342 6390 6391 6573 6510 6576 1468 376497
## [181] 8884 130814 6623 6647 10580 65124 8404 58472 8082
## [190] 6776 2040 8802 6817 6888 10010 7086 10140 7263
## [199] 7316 29979 83549 7351 29796 10975 7384 27089 7423
## [208] 7532
##
## $HALLMARK_ALLOGRAFT_REJECTION
## [1] 16 6059 10006 43 92 207 322 567 567
## [10] 586 8915 602 672 717 717 717 717 717
## [19] 717 717 822 9607 6356 6357 6363 6347 6367
## [28] 6351 6351 6351 6352 6352 6354 894 896 1230
## [37] 729230 1234 912 914 919 940 915 916 917
## [46] 920 958 959 961 924 972 973 941 942
## [55] 925 926 10225 1029 5199 56253 1435 1445 1520
## [64] 10563 4283 2833 1615 8560 8444 1956 8661 8664
## [73] 8669 8672 1984 1984 1991 1991 2000 2069 2113
## [82] 2147 2149 355 356 2213 2268 2316 2533 2589
## [91] 2634 2650 11146 8477 3001 3002 3059 9734 3091
## [100] 3105 3105 3105 3105 3105 3105 3105 3105 3108
## [109] 3108 3108 3108 3108 3108 3108 3108 3109 3109
## [118] 3109 3109 3109 3109 3109 3109 3111 3111 3111
## [127] 3111 3111 3111 3111 3112 3112 3112 3112 3112
## [136] 3112 3117 3117 3117 3117 3117 3117 3122 3122
## [145] 3122 3122 3122 3122 3122 3133 3133 3133 3133
## [154] 3133 3133 3133 3135 3135 3135 3135 3135 3135
## [163] 3135 3135 3383 23308 3455 3458 3459 3460 3460
## [172] 10261 10261 3551 3586 3589 3592 3593 3594 3596
## [181] 3600 3603 3606 8807 3553 3558 9466 9466 3559
## [190] 3560 3561 3565 3566 3569 3574 3578 3624 3625
## [199] 3662 3665 3665 3394 3683 3689 3702 3717 3824
## [208] 3848 3932 3937 3976 4050 4050 4050 4050 4050
## [217] 4050 4050 4050 4065 9450 4067 6885 11184 11184
## [226] 4153 4318 11222 4528 4689 4689 4690 9437 9437
## [235] 9437 9437 9437 9437 9437 9437 9437 9437 9437
## [244] 9437 9437 9437 9437 9437 9437 114548 4830 4843
## [253] 4869 5196 5551 5579 5582 5699 5777 5788 5788
## [262] 5917 8767 6170 6123 6133 6223 6189 6203 6203
## [271] 6203 6203 6203 6203 6203 6203 6203 6203 27240
## [280] 8651 9655 6688 5552 7903 23166 6772 6775 6890
## [289] 6890 6890 6890 6890 6890 6890 6890 6891 6891
## [298] 6891 6891 6891 6891 6891 6891 6892 6892 6892
## [307] 6892 6892 7040 7042 7070 7076 7096 7097 7098
## [316] 10333 7124 7124 7124 7124 7124 7124 7124 7124
## [325] 7163 7186 50852 7321 7334 7453 7454 7535
Looks like we have a list of gene sets with associated Entrez IDs.
In our gene expression data frame, expression_df
we have Ensembl gene identifiers. So we will need to convert our Ensembl IDs into Entrez IDs for GSVA.
We’re going to convert our identifiers in expression_df
to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!
The annotation package org.Hs.eg.db
contains information for different identifiers (Carlson 2020). org.Hs.eg.db
is specific to Homo sapiens – this is what the Hs
in the package name is referencing.
We can see what types of IDs are available to us in an annotation package with keytypes()
.
keytypes(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT"
## [5] "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE"
## [9] "EVIDENCEALL" "GENENAME" "GO" "GOALL"
## [13] "IPI" "MAP" "OMIM" "ONTOLOGY"
## [17] "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [25] "UNIGENE" "UNIPROT"
We’ use this package to convert from Ensembl gene IDs (ENSEMBL
) to Entrez IDs (ENTREZID
) – since this is the IDs we used in our hallmarks_list
in the previous step. But, we could just as easily use it to convert to gene symbols (SYMBOL
) if we had built hallmarks_list
using gene symbols.
The function we will use to map from Ensembl gene IDs to Entrez gene IDs is called mapIds()
and comes from the AnnotationDbi
package.
Let’s create a data frame that shows the mapped Entrez IDs along with the gene expression values for the respective Ensembl IDs.
# First let's create a mapped data frame we can join to the gene expression values
<- data.frame(
mapped_df "entrez_id" = mapIds(
# Replace with annotation package for the organism relevant to your data
org.Hs.eg.db,keys = vst_df$ensembl_id,
# Replace with the type of gene identifiers in your data
keytype = "ENSEMBL",
# Replace with the type of gene identifiers you would like to map to
column = "ENTREZID",
# This will keep only the first mapped value for each Ensembl ID
multiVals = "first"
)%>%
) # If an Ensembl gene identifier doesn't map to a Entrez gene identifier,
# drop that from the data frame
::filter(!is.na(entrez_id)) %>%
dplyr# Make an `Ensembl` column to store the row names
::rownames_to_column("Ensembl") %>%
tibble# Now let's join the rest of the expression data
::inner_join(vst_df, by = c("Ensembl" = "ensembl_id")) dplyr
## 'select()' returned 1:many mapping between keys and columns
This 1:many mapping between keys and columns
message means that some Ensembl gene identifiers map to multiple Entrez IDs. In this case, it’s also possible that a Entrez ID will map to multiple Ensembl IDs. For the purpose of performing GSVA later in this notebook, we keep only the first mapped IDs.
For more info on gene ID conversion, take a look at our other examples: the microarray example and the RNA-seq example.
Let’s see a preview of mapped_df
.
head(mapped_df)
We will want to keep in mind that GSVA requires that data is in a matrix with the gene identifiers as row names. In order to successfully turn our data frame into a matrix, we will need to ensure that we do not have any duplicate gene identifiers.
Let’s count up how many Entrez IDs mapped to multiple Ensembl IDs.
sum(duplicated(mapped_df$entrez_id))
## [1] 68
Looks like we have 68 duplicated Entrez IDs.
As we mentioned earlier, we will not want any duplicate gene identifiers in our data frame when we convert it into a matrix in preparation for running GSVA.
For RNA-seq processing in refine.bio, transcripts were quantified (Ensembl transcript IDs) and aggregated to the gene-level (Ensembl gene IDs). For a single Entrez ID that maps to multiple Ensembl gene IDs, we will use the values associated with the Ensembl gene ID that seems to be most highly expressed. Specifically, we’re going retain the Ensembl gene ID with maximum mean expression value. We expect that this approach may be a better reflection of the reads that were quantified than taking the mean or median of the values for multiple Ensembl gene IDs would be.
Our example doesn’t contain too many duplicates; ultimately we only are losing 68 rows of data. If you find yourself using a dataset that has large proportion of duplicates, we’d recommend exercising some caution and exploring how well values for multiple gene IDs are correlated and the identity of those genes.
First, we first need to calculate the gene means, but we’ll need to move our non-numeric variables (the gene ID columns) out of the way for that calculation.
# First let's determine the gene means
<- rowMeans(mapped_df %>% dplyr::select(-Ensembl, -entrez_id))
gene_means
# Let's add this as a column in our `mapped_df`.
<- mapped_df %>%
mapped_df # Add gene_means as a column called gene_means
::mutate(gene_means) %>%
dplyr# Reorder the columns so `gene_means` column is upfront
::select(Ensembl, entrez_id, gene_means, dplyr::everything()) dplyr
Now we can filter out the duplicate gene identifiers using the gene mean values. First, we’ll use dplyr::arrange()
by gene_means
such that the the rows will be in order of highest gene mean to lowest gene mean. For the duplicate values of entrez_id
, the row with the lower index will be the one that’s kept by dplyr::distinct()
. In practice, this means that we’ll keep the instance of the Entrez ID with the highest gene mean value as intended.
<- mapped_df %>%
filtered_mapped_df # Sort so that the highest mean expression values are at the top
::arrange(dplyr::desc(gene_means)) %>%
dplyr# Filter out the duplicated rows using `dplyr::distinct()`
::distinct(entrez_id, .keep_all = TRUE) dplyr
Let’s do our check again to see if we still have duplicates.
sum(duplicated(filtered_mapped_df$entrez_id))
## [1] 0
We now have 0
duplicates which is what we want. All set!
Now we should prep this data so GSVA can use it.
<- filtered_mapped_df %>%
filtered_mapped_matrix # GSVA can't the Ensembl IDs so we should drop this column as well as the means
::select(-Ensembl, -gene_means) %>%
dplyr# We need to store our gene identifiers as row names
::column_to_rownames("entrez_id") %>%
tibble# Now we can convert our object into a matrix
as.matrix()
Note that if we had duplicate gene identifiers here, we would not be able to set them as row names.
GSVA fits a model and ranks genes based on their expression level relative to the sample distribution (Hänzelmann et al. 2013a). The pathway-level score calculated is a way of asking how genes within a gene set vary as compared to genes that are outside of that gene set (Malhotra 2018).
The idea here is that we will get pathway-level scores for each sample that indicate if genes in a pathway vary concordantly in one direction (over-expressed or under-expressed relative to the overall population) (Hänzelmann et al. 2013a). This means that GSVA scores will depend on the samples included in the dataset when you run GSVA; if you added more samples and ran GSVA again, you would expect the scores to change (Hänzelmann et al. 2013a).
The output is a gene set by sample matrix of GSVA scores.
Let’s perform GSVA using the gsva()
function. See ?gsva
for more options.
<- gsva(
gsva_results
filtered_mapped_matrix,
hallmarks_list,method = "gsva",
# Appropriate for our vst transformed data
kcdf = "Gaussian",
# Minimum gene set size
min.sz = 15,
# Maximum gene set size
max.sz = 500,
# Compute Gaussian-distributed scores
mx.diff = TRUE,
# Don't print out the progress bar
verbose = FALSE
)
Note that the gsva()
function documentation says we can use kcdf = "Gaussian"
if we have expression values that are continuous such as log-CPMs, log-RPKMs or log-TPMs, but we would use kcdf = "Poisson"
on integer counts. Our vst()
transformed data is on a log2-like scale, so Gaussian
works for us.
Let’s explore what the output of gsva()
looks like.
# Print 6 rows,
head(gsva_results[, 1:10])
## SRR7011789 SRR7011790 SRR7011791
## HALLMARK_ADIPOGENESIS -0.24335621 -0.37787868 0.19722676
## HALLMARK_ALLOGRAFT_REJECTION -0.44920260 -0.47727915 -0.41040501
## HALLMARK_ANDROGEN_RESPONSE 0.05682655 -0.15097544 0.31512197
## HALLMARK_ANGIOGENESIS -0.33111804 -0.25529152 0.60728563
## HALLMARK_APICAL_JUNCTION -0.17877429 -0.24029588 0.10471763
## HALLMARK_APICAL_SURFACE -0.05438200 -0.05065078 0.01458023
## SRR7011792 SRR7011793 SRR7011794
## HALLMARK_ADIPOGENESIS -0.22695124 -0.26762849 -0.2412906
## HALLMARK_ALLOGRAFT_REJECTION -0.44940464 -0.43424273 -0.4802747
## HALLMARK_ANDROGEN_RESPONSE -0.13228539 0.05745514 -0.1175098
## HALLMARK_ANGIOGENESIS -0.28334284 0.44988116 -0.1788752
## HALLMARK_APICAL_JUNCTION -0.05897113 -0.29521294 -0.2117339
## HALLMARK_APICAL_SURFACE 0.19675796 -0.21997084 -0.2121956
## SRR7011795 SRR7011796 SRR7011797
## HALLMARK_ADIPOGENESIS -0.0592728 -0.1630326129 -0.11915513
## HALLMARK_ALLOGRAFT_REJECTION -0.5094029 -0.4555939338 -0.44101648
## HALLMARK_ANDROGEN_RESPONSE 0.1710897 0.0003871434 0.03102136
## HALLMARK_ANGIOGENESIS -0.2712203 -0.1053205917 0.23851735
## HALLMARK_APICAL_JUNCTION -0.1661251 -0.1911126698 -0.08816549
## HALLMARK_APICAL_SURFACE 0.0435752 -0.1026224436 -0.20066770
## SRR7011798
## HALLMARK_ADIPOGENESIS -0.1258561
## HALLMARK_ALLOGRAFT_REJECTION -0.4712004
## HALLMARK_ANDROGEN_RESPONSE -0.1825296
## HALLMARK_ANGIOGENESIS 0.2260402
## HALLMARK_APICAL_JUNCTION -0.3391836
## HALLMARK_APICAL_SURFACE -0.2036703
Let’s write all of our GSVA results to file.
%>%
gsva_results as.data.frame() %>%
::rownames_to_column("pathway") %>%
tibble::write_tsv(file.path(
readr
results_dir,"SRP140558_gsva_results.tsv"
))
Let’s make a heatmap for our pathways!
We will want our heatmap to include some information about the sample labels, but unfortunately some of the metadata for this dataset are not set up into separate, neat columns.
The most salient information for these samples is combined into one column, refinebio_title
. Let’s preview what this column looks like.
head(metadata$refinebio_title)
## [1] "AVB_006_AV_PBMC" "AVB_006_CV_PBMC" "AVB_007_AV_PBMC"
## [4] "AVB_007_CV_PBMC" "AVB_012_AV_PBMC" "AVB_012_CV_PBMC"
If we used these labels as is, it wouldn’t be very informative!
Looking at the author’s descriptions, PBMCs were collected at two time points: during the patients’ first, acute bronchiolitis visit (abbreviated “AV”) and their recovery visit, (called post-convalescence and abbreviated “CV”).
We can create a new variable, time_point
, that states this info more clearly. This new time_point
variable will have two labels: acute illness
and recovering
based on the AV
or CV
coding located in the refinebio_title
string variable.
<- metadata %>%
annot_df # We need the sample IDs and the main column that contains the metadata info
::select(
dplyr
refinebio_accession_code,
refinebio_title%>%
) # Create our `time_point` variable based on `refinebio_title`
::mutate(
dplyrtime_point = dplyr::case_when(
# Create our new variable based whether the refinebio_title column
# contains _AV_ or _CV_
::str_detect(refinebio_title, "_AV_") ~ "acute illness",
stringr::str_detect(refinebio_title, "_CV_") ~ "recovering"
stringr
)%>%
) # We don't need the older version of the variable anymore
::select(-refinebio_title) dplyr
These time point samples are paired, so you could also add the refinebio_subject
to the labels. For simplicity, we’ve left them off for now.
The pheatmap::pheatmap()
will want the annotation data frame to have matching row names to the data we supply it (which is our gsva_results
).
<- annot_df %>%
annot_df # pheatmap will want our sample names that match our data to
::column_to_rownames("refinebio_accession_code") tibble
Great! We’re all set. We can see that they are in a wide format with the GSVA scores for each sample spread across a row associated with each pathway.
<- pheatmap::pheatmap(gsva_results,
pathway_heatmap annotation_col = annot_df, # Add metadata labels!
show_colnames = FALSE, # Don't show sample labels
fontsize_row = 6 # Shrink the pathway labels a tad
)
# Print out heatmap here
pathway_heatmap
Here we’ve used clustering and can see that samples somewhat cluster by time_point
.
We can also see that some pathways that share biology seem to cluster together (e.g. HALLMARK_INTERFERON_ALPHA_RESPONSE
and HALLMARK_INTERFERON_GAMMA_RESPONSE
). Pathways may cluster together, or have similar GSVA scores, because the genes in those pathways overlap.
Taking this example, we can look at how many genes are in common for HALLMARK_INTERFERON_ALPHA_RESPONSE
and HALLMARK_INTERFERON_GAMMA_RESPONSE
.
length(intersect(
$HALLMARK_INTERFERON_ALPHA_RESPONSE,
hallmarks_list$HALLMARK_INTERFERON_GAMMA_RESPONSE
hallmarks_list ))
## [1] 73
These 73
genes out of HALLMARK_INTERFERON_ALPHA_RESPONSE
’s 139 and hallmarks_list$HALLMARK_INTERFERON_GAMMA_RESPONSE
’s 284 is probably why those cluster together.
The pathways share genes and are not independent!
Now, let’s save this plot to PNG.
# Replace file name with a relevant output plot name
<- file.path(plots_dir, "SRP140558_heatmap.png")
heatmap_png_file
# Open a PNG file - width and height arguments control the size of the output
png(heatmap_png_file, width = 1000, height = 800)
# Print your heatmap
pathway_heatmap
# Close the PNG file:
dev.off()
## png
## 2
pheatmap
package allow, see the ComplexHeatmap Complete Reference Manual (Gu et al. 2016)At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
# Print session info
::session_info() sessioninfo
## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.5 (2021-03-31)
## os Ubuntu 20.04.3 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2022-03-01
##
## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
## annotate 1.68.0 2020-10-27 [1] Bioconductor
## AnnotationDbi * 1.52.0 2020-10-27 [1] Bioconductor
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.3)
## babelgene 21.4 2021-04-26 [1] RSPM (R 4.0.4)
## backports 1.2.1 2020-12-09 [1] RSPM (R 4.0.3)
## Biobase * 2.50.0 2020-10-27 [1] Bioconductor
## BiocGenerics * 0.36.1 2021-04-16 [1] Bioconductor
## BiocParallel 1.24.1 2020-11-06 [1] Bioconductor
## bit 4.0.4 2020-08-04 [1] RSPM (R 4.0.3)
## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.0.3)
## bitops 1.0-7 2021-04-24 [1] RSPM (R 4.0.4)
## blob 1.2.1 2020-01-20 [1] RSPM (R 4.0.3)
## bslib 0.2.5 2021-05-12 [1] RSPM (R 4.0.4)
## cachem 1.0.5 2021-05-15 [1] RSPM (R 4.0.4)
## cli 2.5.0 2021-04-26 [1] RSPM (R 4.0.4)
## coda 0.19-4 2020-09-30 [1] RSPM (R 4.0.3)
## colorspace 2.0-1 2021-05-04 [1] RSPM (R 4.0.4)
## crayon 1.4.1 2021-02-08 [1] RSPM (R 4.0.3)
## DBI 1.1.1 2021-01-15 [1] RSPM (R 4.0.3)
## DelayedArray 0.16.3 2021-03-24 [1] Bioconductor
## DESeq2 * 1.30.1 2021-02-19 [1] Bioconductor
## digest 0.6.27 2020-10-24 [1] RSPM (R 4.0.3)
## dplyr 1.0.6 2021-05-05 [1] RSPM (R 4.0.4)
## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.0.4)
## emmeans 1.6.0 2021-04-24 [1] RSPM (R 4.0.4)
## estimability 1.3 2018-02-11 [1] RSPM (R 4.0.0)
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.3)
## fansi 0.4.2 2021-01-15 [1] RSPM (R 4.0.3)
## farver 2.1.0 2021-02-28 [1] RSPM (R 4.0.3)
## fastmap 1.1.0 2021-01-25 [1] RSPM (R 4.0.3)
## fftw 1.0-6 2020-02-24 [1] RSPM (R 4.0.2)
## genefilter 1.72.1 2021-01-21 [1] Bioconductor
## geneplotter 1.68.0 2020-10-27 [1] Bioconductor
## generics 0.1.0 2020-10-31 [1] RSPM (R 4.0.3)
## GenomeInfoDb * 1.26.7 2021-04-08 [1] Bioconductor
## GenomeInfoDbData 1.2.4 2022-03-01 [1] Bioconductor
## GenomicRanges * 1.42.0 2020-10-27 [1] Bioconductor
## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
## ggplot2 3.3.3 2020-12-30 [1] RSPM (R 4.0.3)
## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.3)
## graph 1.68.0 2020-10-27 [1] Bioconductor
## GSEABase 1.52.1 2020-12-11 [1] Bioconductor
## GSVA * 1.38.2 2021-02-09 [1] Bioconductor
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.3)
## highr 0.9 2021-04-16 [1] RSPM (R 4.0.4)
## hms 1.0.0 2021-01-13 [1] RSPM (R 4.0.3)
## htmltools 0.5.1.1 2021-01-22 [1] RSPM (R 4.0.3)
## httr 1.4.2 2020-07-20 [1] RSPM (R 4.0.3)
## IRanges * 2.24.1 2020-12-12 [1] Bioconductor
## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.0.4)
## jsonlite 1.7.2 2020-12-09 [1] RSPM (R 4.0.3)
## knitr 1.33 2021-04-24 [1] RSPM (R 4.0.4)
## lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.5)
## lifecycle 1.0.0 2021-02-15 [1] RSPM (R 4.0.3)
## limma * 3.46.0 2020-10-27 [1] Bioconductor
## locfit 1.5-9.4 2020-03-25 [1] RSPM (R 4.0.3)
## magrittr * 2.0.1 2020-11-17 [1] RSPM (R 4.0.3)
## Matrix 1.3-2 2021-01-06 [2] CRAN (R 4.0.5)
## MatrixGenerics * 1.2.1 2021-01-30 [1] Bioconductor
## matrixStats * 0.58.0 2021-01-29 [1] RSPM (R 4.0.3)
## memoise 2.0.0 2021-01-26 [1] RSPM (R 4.0.3)
## msigdbr 7.4.1 2021-05-05 [1] RSPM (R 4.0.4)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.3)
## mvtnorm 1.1-1 2020-06-09 [1] RSPM (R 4.0.3)
## nlme 3.1-152 2021-02-04 [2] CRAN (R 4.0.5)
## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
## org.Hs.eg.db * 3.12.0 2022-03-01 [1] Bioconductor
## pheatmap 1.0.12 2019-01-04 [1] RSPM (R 4.0.3)
## pillar 1.6.1 2021-05-16 [1] RSPM (R 4.0.4)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.3)
## ps 1.6.0 2021-02-28 [1] RSPM (R 4.0.3)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.3)
## qusage * 2.24.0 2020-10-27 [1] Bioconductor
## R.cache 0.15.0 2021-04-30 [1] RSPM (R 4.0.4)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.3)
## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.3)
## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.3)
## R6 2.5.0 2020-10-28 [1] RSPM (R 4.0.3)
## RColorBrewer 1.1-2 2014-12-07 [1] RSPM (R 4.0.3)
## Rcpp 1.0.6 2021-01-15 [1] RSPM (R 4.0.3)
## RCurl 1.98-1.3 2021-03-16 [1] RSPM (R 4.0.4)
## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.4)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.3)
## rlang 0.4.11 2021-04-30 [1] RSPM (R 4.0.4)
## rmarkdown 2.8 2021-05-07 [1] RSPM (R 4.0.4)
## RSQLite 2.2.7 2021-04-22 [1] RSPM (R 4.0.4)
## rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.0.3)
## S4Vectors * 0.28.1 2020-12-09 [1] Bioconductor
## sass 0.4.0 2021-05-12 [1] RSPM (R 4.0.4)
## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.3)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.3)
## stringi 1.6.1 2021-05-10 [1] RSPM (R 4.0.4)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.3)
## styler 1.4.1 2021-03-30 [1] RSPM (R 4.0.4)
## SummarizedExperiment * 1.20.0 2020-10-27 [1] Bioconductor
## survival 3.2-10 2021-03-16 [2] CRAN (R 4.0.5)
## tibble 3.1.2 2021-05-16 [1] RSPM (R 4.0.4)
## tidyselect 1.1.1 2021-04-30 [1] RSPM (R 4.0.4)
## utf8 1.2.1 2021-03-12 [1] RSPM (R 4.0.3)
## vctrs 0.3.8 2021-04-29 [1] RSPM (R 4.0.4)
## withr 2.4.2 2021-04-18 [1] RSPM (R 4.0.4)
## xfun 0.23 2021-05-15 [1] RSPM (R 4.0.4)
## XML 3.99-0.6 2021-03-16 [1] RSPM (R 4.0.4)
## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.0.3)
## XVector 0.30.0 2020-10-27 [1] Bioconductor
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.3)
## zlibbioc 1.36.0 2020-10-27 [1] Bioconductor
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library