1 Purpose of this analysis

This example is one of pathway analysis module set, we recommend looking at the pathway analysis table below to help you determine which pathway analysis method is best suited for your purposes.

In this example we will cover a method called Gene Set Variation Analysis (GSVA) to calculate gene set or pathway scores on a per-sample basis (Hänzelmann et al. 2013a). GSVA transforms a gene by sample gene expression matrix into a gene set by sample pathway enrichment matrix (Hänzelmann et al. 2013b). We’ll make a heatmap of the enrichment matrix, but you can use the GSVA scores for a number of other downstream analyses such as differential expression analysis.

⬇️ Jump to the analysis code ⬇️

1.0.1 What is pathway analysis?

Pathway analysis refers to any one of many techniques that uses predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. In the context of refine.bio, we use these techniques to analyze and interpret genome-wide gene expression experiments. The rationale for performing pathway analysis is that looking at the pathway-level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis.

We highly recommend taking a look at Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges from Khatri et al. (2012) for a more comprehensive overview. We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the Resources for further learning section.

1.0.2 How to choose a pathway analysis?

This table summarizes the pathway analyses examples in this module.

Analysis What is required for input What output looks like ✅ Pros ⚠️ Cons
ORA (Over-representation Analysis) A list of gene IDs (no stats needed) A per-pathway hypergeometric test result - Simple

- Inexpensive computationally to calculate p-values
- Requires arbitrary thresholds and ignores any statistics associated with a gene

- Assumes independence of genes and pathways
GSEA (Gene Set Enrichment Analysis) A list of genes IDs with gene-level summary statistics A per-pathway enrichment score - Includes all genes (no arbitrary threshold!)

- Attempts to measure coordination of genes
- Permutations can be expensive

- Does not account for pathway overlap

- Two-group comparisons not always appropriate/feasible
GSVA (Gene Set Variation Analysis) A gene expression matrix (like what you get from refine.bio directly) Pathway-level scores on a per-sample basis - Does not require two groups to compare upfront

- Normally distributed scores
- Scores are not a good fit for gene sets that contain genes that go up AND down

- Method doesn’t assign statistical significance itself

- Recommended sample size n > 10

2 How to run this example

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

2.1 Obtain the .Rmd file

To run this example yourself, download the .Rmd for this analysis by clicking this link.

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
  dir.create("data")
}

# Define the file path to the plots directory
plots_dir <- "plots"

# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
  dir.create(plots_dir)
}

# Define the file path to the results directory
results_dir <- "results"

# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
  dir.create(results_dir)
}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

2.3 Obtain the dataset from refine.bio

For general information about downloading data for these examples, see our ‘Getting Started’ section.

Go to this dataset’s page on refine.bio.

Click the “Download Now” button on the right side of this screen.

Fill out the pop up window with your email and our Terms and Conditions:

It may take a few minutes for the dataset to process. You will get an email when it is ready.

2.4 About the dataset we are using for this example

For this example analysis, we will use this acute viral bronchiolitis dataset. The data that we downloaded from refine.bio for this analysis has 62 paired peripheral blood mononuclear cell RNA-seq samples obtained from 31 patients. Samples were collected at two time points: during their first, acute bronchiolitis visit (abbreviated “AV”) and their recovery, their post-convalescence visit (abbreviated “CV”).

2.5 Place the dataset in your new data/ folder

refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.

For more details on the contents of this folder see these docs on refine.bio.

The SRP140558 folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235 or SRP12345.

Copy and paste the SRP140558 folder into your newly created data/ folder.

2.6 Check out our file structure!

Your new analysis folder should contain:

  • The example analysis .Rmd you downloaded
  • A folder called “data” which contains:
    • The SRP140558 folder which contains:
      • The gene expression
      • The metadata TSV
  • A folder for plots (currently empty)
  • A folder for results (currently empty)

Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.

# Define the file path to the data directory
# Replace with the path of the folder the files will be in
data_dir <- file.path("data", "SRP140558")

# Declare the file path to the gene expression matrix file
# inside directory saved as `data_dir`
# Replace with the path to your dataset file
data_file <- file.path(data_dir, "SRP140558.tsv")

# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
# Replace with the path to your metadata file
metadata_file <- file.path(data_dir, "metadata_SRP140558.tsv")

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
## [1] TRUE
# Check if the metadata file is at the file path stored in `metadata_file`
file.exists(metadata_file)
## [1] TRUE

If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.

3 Using a different refine.bio dataset with this analysis?

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.


 

4 Gene set variation analysis - RNA-Seq

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

We will be using DESeq2 to normalize and transform our RNA-seq data before running GSVA, so we will need to install that (Love et al. 2014).

In this analysis, we will be using the GSVA package to perform GSVA and the qusage package to read in the GMT file containing the gene set data (Hänzelmann et al. 2013a; Yaari et al. 2013).

We will also need the org.Hs.eg.db package to perform gene identifier conversion (Carlson 2020).

We’ll create a heatmap from our pathway analysis results using pheatmap (Slowikowski 2017).

if (!("DESeq2" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("DESeq2", update = FALSE)
}

if (!("GSVA" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("GSVA", update = FALSE)
}

if (!("qusage" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("qusage", update = FALSE)
}

if (!("org.Hs.eg.db" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("org.Hs.eg.db", update = FALSE)
}

if (!("pheatmap" %in% installed.packages())) {
  # Install pheatmap
  install.packages("pheatmap", update = FALSE)
}

Attach the packages we need for this analysis.

# Attach the DESeq2 library
library(DESeq2)

# Attach the `qusage` library
library(qusage)

# Attach the `GSVA` library
library(GSVA)

# Human annotation package we'll use for gene identifier conversion
library(org.Hs.eg.db)

# We will need this so we can use the pipe: %>%
library(magrittr)

4.2 Import and set up data

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the both TSV files and add them as data frames to your environment.

We stored our file paths as objects named metadata_file and data_file in this previous step.

# Read in metadata TSV file
metadata <- readr::read_tsv(metadata_file)
## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   .default = col_logical(),
##   refinebio_accession_code = col_character(),
##   experiment_accession = col_character(),
##   refinebio_organism = col_character(),
##   refinebio_platform = col_character(),
##   refinebio_source_database = col_character(),
##   refinebio_subject = col_character(),
##   refinebio_title = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
# Read in data TSV file
expression_df <- readr::read_tsv(data_file) %>%
  # Here we are going to store the gene IDs as row names so that we can have a numeric matrix to perform calculations on later
  tibble::column_to_rownames("Gene")
## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   Gene = col_character()
## )
## ℹ Use `spec()` for the full column specifications.

Let’s ensure that the metadata and data are in the same sample order.

# Make the data in the order of the metadata
expression_df <- expression_df %>%
  dplyr::select(metadata$refinebio_accession_code)

# Check if this is in the same order
all.equal(colnames(expression_df), metadata$refinebio_accession_code)
## [1] TRUE

4.2.1 Prepare data for DESeq2

There are two things we need to do to prep our expression data for DESeq2.

First, we need to make sure all of the values in our data are converted to integers as required by a DESeq2 function we will use later.

Then, we need to filter out the genes that have not been expressed or that have low expression counts since we can not be as confident in those genes being reliably measured. We are going to do some pre-filtering to keep only genes with 50 or more reads in total across the samples.

expression_df <- expression_df %>%
  # Only keep rows that have total counts above the cutoff
  dplyr::filter(rowSums(.) >= 50) %>%
  # The next DESeq2 functions need the values to be converted to integers
  round()

4.3 Create a DESeqDataset

We will be using the DESeq2 package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet object. We turn the data frame (or matrix) into a DESeqDataSet object and specify which variable labels our experimental groups using the design argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design argument because we are not performing a differential expression analysis.

# Create a `DESeqDataSet` object
dds <- DESeqDataSetFromMatrix(
  countData = expression_df, # Our prepped data frame with counts
  colData = metadata, # Data frame with annotation for our samples
  design = ~1 # Here we are not specifying a model
)
## converting counts to integer mode

4.4 Perform DESeq2 normalization and transformation

We often suggest normalizing and transforming your data for various applications including for GSVA. We are going to use the vst() function from the DESeq2 package to normalize and transform the data. For more information about these transformation methods, see here.

# Normalize and transform the data in the `DESeqDataSet` object using the `vst()`
# function from the `DESEq2` R package
dds_norm <- vst(dds)

At this point, if your data set has any outlier samples, you should look into removing them as they can affect your results. For this example data set, we will skip this step (there are no obvious outliers) and proceed.

But now we are ready to format our dataset for input into gsva::gsva(). We need to extract the normalized counts to a matrix and make it into a data frame so we can use with tidyverse functions later.

# Retrieve the normalized data from the `DESeqDataSet`
vst_df <- assay(dds_norm) %>%
  as.data.frame() %>% # Make into a data frame
  tibble::rownames_to_column("ensembl_id") # Make Gene IDs into their own column

4.4.1 Import Gene Sets

The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses (Subramanian et al. 2005; Liberzon et al. 2011). MSigDB contains 8 different gene set collections (Subramanian et al. 2005; Liberzon et al. 2011) that are distinguished by how they are derived (e.g., computationally mined, curated).

In this example, we will use a collection called Hallmark gene sets for GSVA (Liberzon et al. 2015). Here’s an excerpt of the collection description from MSigDB:

Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression.

Here we are obtaining the pathway information from the main function of the msigdbr package (Dolgalev 2020). Because we are using human data in this example, we supply the formal organism name to the species argument. We will want only the hallmark pathways, so we use the category = "H" argument.

hallmark_gene_sets <- msigdbr::msigdbr(
  species = "Homo sapiens", # Can change this to what species you need
  category = "H" # Only hallmark gene sets
)

Let’s take a look at the format of hallmarks_gene_set.

head(hallmark_gene_sets)

We can see this object is in a tabular format; each row corresponds to a gene and gene set pair. A row exists if that gene (entrez_gene, gene_symbol) belongs to a gene set (gs_name).

The function that we will use to run GSVA wants the gene sets to be in a list, where each entry in the list is a vector of genes that comprise the pathway the element is named for. In the next step, we’ll demonstrate how to go from this data frame format to a list.

For this example we will use Entrez IDs (but note that there are gene symbols we could use just as easily). The info we need is in two columns: entrez_gene contains the gene ID and gs_name contains the name of the pathway that the gene is a part of.

To make this into the list format we need, we can use the split() function. We want a list where each element of the list is a vector that contains the Entrez gene IDs that are in a particular pathway set.

hallmarks_list <- split(
  hallmark_gene_sets$entrez_gene, # The genes we want split into pathways
  hallmark_gene_sets$gs_name # The pathways made as the higher levels of the list
)

What does this hallmarks_list look like?

head(hallmarks_list, n = 2)
## $HALLMARK_ADIPOGENESIS
##   [1]     19  11194  10449     33     34     35     47     50     51
##  [10]    112 149685   9370  79602  79602  56894   9131    204    217
##  [19]    226    284  51129    334    348    369  10124  64225    483
##  [28]    539  11176    593  23786    604    718    847 284119   8436
##  [37]    901    977   9936    948   1031 400916 400916   1147   1149
##  [46] 134147  51727   1306   1282  51805  84274  57017   1337   1349
##  [55]   1351   1376   1384   1431   1537   1580   1629   1652   1652
##  [64]   1666   8694   8694   1717  51635  25979   1737   1738   4189
##  [73]  29103 128338   1891   1891   1892  84173  79071   5168   2053
##  [82]   2101  23344   2109   2167   2184   8322   9908   1647   2632
##  [91]  27069  57678 137964   2820  10243   2878   2879  80273   3033
## [100]  26275  26353   3417   3419   3421   3459  10989   3679  80760
## [109]   6453  84522   3910   3952   3977   3991  10162   4023   4056
## [118]   4056   8491  56922   4191   4199  11343   4259  84895  56246
## [127]  29088  54996  23788  23788   4638  64859   4698   4706   4713
## [136]   4722   4722  28512   4836   4958   5004  27250  10400   5195
## [145]   5209   5211   5236  23187   5264 415116    123   5447   5468
## [154]   5495  84919  10935  10113  55037   5733   5860  83871   7905
## [163]  92840  56729  54884   8780  55177  26994   6239  10313  25813
## [172]    949   6342   6390   6391   6573   6510   6576   1468 376497
## [181]   8884 130814   6623   6647  10580  65124   8404  58472   8082
## [190]   6776   2040   8802   6817   6888  10010   7086  10140   7263
## [199]   7316  29979  83549   7351  29796  10975   7384  27089   7423
## [208]   7532
## 
## $HALLMARK_ALLOGRAFT_REJECTION
##   [1]     16   6059  10006     43     92    207    322    567    567
##  [10]    586   8915    602    672    717    717    717    717    717
##  [19]    717    717    822   9607   6356   6357   6363   6347   6367
##  [28]   6351   6351   6351   6352   6352   6354    894    896   1230
##  [37] 729230   1234    912    914    919    940    915    916    917
##  [46]    920    958    959    961    924    972    973    941    942
##  [55]    925    926  10225   1029   5199  56253   1435   1445   1520
##  [64]  10563   4283   2833   1615   8560   8444   1956   8661   8664
##  [73]   8669   8672   1984   1984   1991   1991   2000   2069   2113
##  [82]   2147   2149    355    356   2213   2268   2316   2533   2589
##  [91]   2634   2650  11146   8477   3001   3002   3059   9734   3091
## [100]   3105   3105   3105   3105   3105   3105   3105   3105   3108
## [109]   3108   3108   3108   3108   3108   3108   3108   3109   3109
## [118]   3109   3109   3109   3109   3109   3109   3111   3111   3111
## [127]   3111   3111   3111   3111   3112   3112   3112   3112   3112
## [136]   3112   3117   3117   3117   3117   3117   3117   3122   3122
## [145]   3122   3122   3122   3122   3122   3133   3133   3133   3133
## [154]   3133   3133   3133   3135   3135   3135   3135   3135   3135
## [163]   3135   3135   3383  23308   3455   3458   3459   3460   3460
## [172]  10261  10261   3551   3586   3589   3592   3593   3594   3596
## [181]   3600   3603   3606   8807   3553   3558   9466   9466   3559
## [190]   3560   3561   3565   3566   3569   3574   3578   3624   3625
## [199]   3662   3665   3665   3394   3683   3689   3702   3717   3824
## [208]   3848   3932   3937   3976   4050   4050   4050   4050   4050
## [217]   4050   4050   4050   4065   9450   4067   6885  11184  11184
## [226]   4153   4318  11222   4528   4689   4689   4690   9437   9437
## [235]   9437   9437   9437   9437   9437   9437   9437   9437   9437
## [244]   9437   9437   9437   9437   9437   9437 114548   4830   4843
## [253]   4869   5196   5551   5579   5582   5699   5777   5788   5788
## [262]   5917   8767   6170   6123   6133   6223   6189   6203   6203
## [271]   6203   6203   6203   6203   6203   6203   6203   6203  27240
## [280]   8651   9655   6688   5552   7903  23166   6772   6775   6890
## [289]   6890   6890   6890   6890   6890   6890   6890   6891   6891
## [298]   6891   6891   6891   6891   6891   6891   6892   6892   6892
## [307]   6892   6892   7040   7042   7070   7076   7096   7097   7098
## [316]  10333   7124   7124   7124   7124   7124   7124   7124   7124
## [325]   7163   7186  50852   7321   7334   7453   7454   7535

Looks like we have a list of gene sets with associated Entrez IDs.

In our gene expression data frame, expression_df we have Ensembl gene identifiers. So we will need to convert our Ensembl IDs into Entrez IDs for GSVA.

4.4.2 Gene identifier conversion

We’re going to convert our identifiers in expression_df to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!

The annotation package org.Hs.eg.db contains information for different identifiers (Carlson 2020). org.Hs.eg.db is specific to Homo sapiens – this is what the Hs in the package name is referencing.

We can see what types of IDs are available to us in an annotation package with keytypes().

keytypes(org.Hs.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
## [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [25] "UNIGENE"      "UNIPROT"

We’ use this package to convert from Ensembl gene IDs (ENSEMBL) to Entrez IDs (ENTREZID) – since this is the IDs we used in our hallmarks_list in the previous step. But, we could just as easily use it to convert to gene symbols (SYMBOL) if we had built hallmarks_list using gene symbols.

The function we will use to map from Ensembl gene IDs to Entrez gene IDs is called mapIds() and comes from the AnnotationDbi package.

Let’s create a data frame that shows the mapped Entrez IDs along with the gene expression values for the respective Ensembl IDs.

# First let's create a mapped data frame we can join to the gene expression values
mapped_df <- data.frame(
  "entrez_id" = mapIds(
    # Replace with annotation package for the organism relevant to your data
    org.Hs.eg.db,
    keys = vst_df$ensembl_id,
    # Replace with the type of gene identifiers in your data
    keytype = "ENSEMBL",
    # Replace with the type of gene identifiers you would like to map to
    column = "ENTREZID",
    # This will keep only the first mapped value for each Ensembl ID
    multiVals = "first"
  )
) %>%
  # If an Ensembl gene identifier doesn't map to a Entrez gene identifier,
  # drop that from the data frame
  dplyr::filter(!is.na(entrez_id)) %>%
  # Make an `Ensembl` column to store the row names
  tibble::rownames_to_column("Ensembl") %>%
  # Now let's join the rest of the expression data
  dplyr::inner_join(vst_df, by = c("Ensembl" = "ensembl_id"))
## 'select()' returned 1:many mapping between keys and columns

This 1:many mapping between keys and columns message means that some Ensembl gene identifiers map to multiple Entrez IDs. In this case, it’s also possible that a Entrez ID will map to multiple Ensembl IDs. For the purpose of performing GSVA later in this notebook, we keep only the first mapped IDs.

For more info on gene ID conversion, take a look at our other examples: the microarray example and the RNA-seq example.

Let’s see a preview of mapped_df.

head(mapped_df)

We will want to keep in mind that GSVA requires that data is in a matrix with the gene identifiers as row names. In order to successfully turn our data frame into a matrix, we will need to ensure that we do not have any duplicate gene identifiers.

Let’s count up how many Entrez IDs mapped to multiple Ensembl IDs.

sum(duplicated(mapped_df$entrez_id))
## [1] 68

Looks like we have 68 duplicated Entrez IDs.

4.4.2.1 Handling duplicate gene identifiers

As we mentioned earlier, we will not want any duplicate gene identifiers in our data frame when we convert it into a matrix in preparation for running GSVA.

For RNA-seq processing in refine.bio, transcripts were quantified (Ensembl transcript IDs) and aggregated to the gene-level (Ensembl gene IDs). For a single Entrez ID that maps to multiple Ensembl gene IDs, we will use the values associated with the Ensembl gene ID that seems to be most highly expressed. Specifically, we’re going retain the Ensembl gene ID with maximum mean expression value. We expect that this approach may be a better reflection of the reads that were quantified than taking the mean or median of the values for multiple Ensembl gene IDs would be.

Our example doesn’t contain too many duplicates; ultimately we only are losing 68 rows of data. If you find yourself using a dataset that has large proportion of duplicates, we’d recommend exercising some caution and exploring how well values for multiple gene IDs are correlated and the identity of those genes.

First, we first need to calculate the gene means, but we’ll need to move our non-numeric variables (the gene ID columns) out of the way for that calculation.

# First let's determine the gene means
gene_means <- rowMeans(mapped_df %>% dplyr::select(-Ensembl, -entrez_id))

# Let's add this as a column in our `mapped_df`.
mapped_df <- mapped_df %>%
  # Add gene_means as a column called gene_means
  dplyr::mutate(gene_means) %>%
  # Reorder the columns so `gene_means` column is upfront
  dplyr::select(Ensembl, entrez_id, gene_means, dplyr::everything())

Now we can filter out the duplicate gene identifiers using the gene mean values. First, we’ll use dplyr::arrange() by gene_means such that the the rows will be in order of highest gene mean to lowest gene mean. For the duplicate values of entrez_id, the row with the lower index will be the one that’s kept by dplyr::distinct(). In practice, this means that we’ll keep the instance of the Entrez ID with the highest gene mean value as intended.

filtered_mapped_df <- mapped_df %>%
  # Sort so that the highest mean expression values are at the top
  dplyr::arrange(dplyr::desc(gene_means)) %>%
  # Filter out the duplicated rows using `dplyr::distinct()`
  dplyr::distinct(entrez_id, .keep_all = TRUE)

Let’s do our check again to see if we still have duplicates.

sum(duplicated(filtered_mapped_df$entrez_id))
## [1] 0

We now have 0 duplicates which is what we want. All set!

Now we should prep this data so GSVA can use it.

filtered_mapped_matrix <- filtered_mapped_df %>%
  # GSVA can't the Ensembl IDs so we should drop this column as well as the means
  dplyr::select(-Ensembl, -gene_means) %>%
  # We need to store our gene identifiers as row names
  tibble::column_to_rownames("entrez_id") %>%
  # Now we can convert our object into a matrix
  as.matrix()

Note that if we had duplicate gene identifiers here, we would not be able to set them as row names.

4.5 Gene Set Variation Analysis

GSVA fits a model and ranks genes based on their expression level relative to the sample distribution (Hänzelmann et al. 2013a). The pathway-level score calculated is a way of asking how genes within a gene set vary as compared to genes that are outside of that gene set (Malhotra 2018).

The idea here is that we will get pathway-level scores for each sample that indicate if genes in a pathway vary concordantly in one direction (over-expressed or under-expressed relative to the overall population) (Hänzelmann et al. 2013a). This means that GSVA scores will depend on the samples included in the dataset when you run GSVA; if you added more samples and ran GSVA again, you would expect the scores to change (Hänzelmann et al. 2013a).

The output is a gene set by sample matrix of GSVA scores.

4.5.1 Perform GSVA

Let’s perform GSVA using the gsva() function. See ?gsva for more options.

gsva_results <- gsva(
  filtered_mapped_matrix,
  hallmarks_list,
  method = "gsva",
  # Appropriate for our vst transformed data
  kcdf = "Gaussian",
  # Minimum gene set size
  min.sz = 15,
  # Maximum gene set size
  max.sz = 500,
  # Compute Gaussian-distributed scores
  mx.diff = TRUE,
  # Don't print out the progress bar
  verbose = FALSE
)

Note that the gsva() function documentation says we can use kcdf = "Gaussian" if we have expression values that are continuous such as log-CPMs, log-RPKMs or log-TPMs, but we would use kcdf = "Poisson" on integer counts. Our vst() transformed data is on a log2-like scale, so Gaussian works for us.

Let’s explore what the output of gsva() looks like.

# Print 6 rows,
head(gsva_results[, 1:10])
##                               SRR7011789  SRR7011790  SRR7011791
## HALLMARK_ADIPOGENESIS        -0.24335621 -0.37787868  0.19722676
## HALLMARK_ALLOGRAFT_REJECTION -0.44920260 -0.47727915 -0.41040501
## HALLMARK_ANDROGEN_RESPONSE    0.05682655 -0.15097544  0.31512197
## HALLMARK_ANGIOGENESIS        -0.33111804 -0.25529152  0.60728563
## HALLMARK_APICAL_JUNCTION     -0.17877429 -0.24029588  0.10471763
## HALLMARK_APICAL_SURFACE      -0.05438200 -0.05065078  0.01458023
##                               SRR7011792  SRR7011793 SRR7011794
## HALLMARK_ADIPOGENESIS        -0.22695124 -0.26762849 -0.2412906
## HALLMARK_ALLOGRAFT_REJECTION -0.44940464 -0.43424273 -0.4802747
## HALLMARK_ANDROGEN_RESPONSE   -0.13228539  0.05745514 -0.1175098
## HALLMARK_ANGIOGENESIS        -0.28334284  0.44988116 -0.1788752
## HALLMARK_APICAL_JUNCTION     -0.05897113 -0.29521294 -0.2117339
## HALLMARK_APICAL_SURFACE       0.19675796 -0.21997084 -0.2121956
##                              SRR7011795    SRR7011796  SRR7011797
## HALLMARK_ADIPOGENESIS        -0.0592728 -0.1630326129 -0.11915513
## HALLMARK_ALLOGRAFT_REJECTION -0.5094029 -0.4555939338 -0.44101648
## HALLMARK_ANDROGEN_RESPONSE    0.1710897  0.0003871434  0.03102136
## HALLMARK_ANGIOGENESIS        -0.2712203 -0.1053205917  0.23851735
## HALLMARK_APICAL_JUNCTION     -0.1661251 -0.1911126698 -0.08816549
## HALLMARK_APICAL_SURFACE       0.0435752 -0.1026224436 -0.20066770
##                              SRR7011798
## HALLMARK_ADIPOGENESIS        -0.1258561
## HALLMARK_ALLOGRAFT_REJECTION -0.4712004
## HALLMARK_ANDROGEN_RESPONSE   -0.1825296
## HALLMARK_ANGIOGENESIS         0.2260402
## HALLMARK_APICAL_JUNCTION     -0.3391836
## HALLMARK_APICAL_SURFACE      -0.2036703

4.6 Write results to file

Let’s write all of our GSVA results to file.

gsva_results %>%
  as.data.frame() %>%
  tibble::rownames_to_column("pathway") %>%
  readr::write_tsv(file.path(
    results_dir,
    "SRP140558_gsva_results.tsv"
  ))

4.7 Visualizing results with a heatmap

Let’s make a heatmap for our pathways!

4.7.1 Neaten up our metadata labels

We will want our heatmap to include some information about the sample labels, but unfortunately some of the metadata for this dataset are not set up into separate, neat columns.

The most salient information for these samples is combined into one column, refinebio_title. Let’s preview what this column looks like.

head(metadata$refinebio_title)
## [1] "AVB_006_AV_PBMC" "AVB_006_CV_PBMC" "AVB_007_AV_PBMC"
## [4] "AVB_007_CV_PBMC" "AVB_012_AV_PBMC" "AVB_012_CV_PBMC"

If we used these labels as is, it wouldn’t be very informative!

Looking at the author’s descriptions, PBMCs were collected at two time points: during the patients’ first, acute bronchiolitis visit (abbreviated “AV”) and their recovery visit, (called post-convalescence and abbreviated “CV”).

We can create a new variable, time_point, that states this info more clearly. This new time_point variable will have two labels: acute illness and recovering based on the AV or CV coding located in the refinebio_title string variable.

annot_df <- metadata %>%
  # We need the sample IDs and the main column that contains the metadata info
  dplyr::select(
    refinebio_accession_code,
    refinebio_title
  ) %>%
  # Create our `time_point` variable based on `refinebio_title`
  dplyr::mutate(
    time_point = dplyr::case_when(
      # Create our new variable based whether the refinebio_title column
      # contains _AV_ or _CV_
      stringr::str_detect(refinebio_title, "_AV_") ~ "acute illness",
      stringr::str_detect(refinebio_title, "_CV_") ~ "recovering"
    )
  ) %>%
  # We don't need the older version of the variable anymore
  dplyr::select(-refinebio_title)

These time point samples are paired, so you could also add the refinebio_subject to the labels. For simplicity, we’ve left them off for now.

The pheatmap::pheatmap() will want the annotation data frame to have matching row names to the data we supply it (which is our gsva_results).

annot_df <- annot_df %>%
  # pheatmap will want our sample names that match our data to
  tibble::column_to_rownames("refinebio_accession_code")

4.7.2 Set up the heatmap itself

Great! We’re all set. We can see that they are in a wide format with the GSVA scores for each sample spread across a row associated with each pathway.

pathway_heatmap <- pheatmap::pheatmap(gsva_results,
  annotation_col = annot_df, # Add metadata labels!
  show_colnames = FALSE, # Don't show sample labels
  fontsize_row = 6 # Shrink the pathway labels a tad
)

# Print out heatmap here
pathway_heatmap

Here we’ve used clustering and can see that samples somewhat cluster by time_point.

We can also see that some pathways that share biology seem to cluster together (e.g. HALLMARK_INTERFERON_ALPHA_RESPONSE and HALLMARK_INTERFERON_GAMMA_RESPONSE). Pathways may cluster together, or have similar GSVA scores, because the genes in those pathways overlap.

Taking this example, we can look at how many genes are in common for HALLMARK_INTERFERON_ALPHA_RESPONSE and HALLMARK_INTERFERON_GAMMA_RESPONSE.

length(intersect(
  hallmarks_list$HALLMARK_INTERFERON_ALPHA_RESPONSE,
  hallmarks_list$HALLMARK_INTERFERON_GAMMA_RESPONSE
))
## [1] 73

These 73 genes out of HALLMARK_INTERFERON_ALPHA_RESPONSE’s 139 and hallmarks_list$HALLMARK_INTERFERON_GAMMA_RESPONSE’s 284 is probably why those cluster together.

The pathways share genes and are not independent!

Now, let’s save this plot to PNG.

# Replace file name with a relevant output plot name
heatmap_png_file <- file.path(plots_dir, "SRP140558_heatmap.png")

# Open a PNG file - width and height arguments control the size of the output
png(heatmap_png_file, width = 1000, height = 800)

# Print your heatmap
pathway_heatmap

# Close the PNG file:
dev.off()
## png 
##   2

6 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

# Print session info
sessioninfo::session_info()
## ─ Session info ─────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.5 (2021-03-31)
##  os       Ubuntu 20.04.3 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Etc/UTC                     
##  date     2022-03-01                  
## 
## ─ Packages ─────────────────────────────────────────────────────────
##  package              * version  date       lib source        
##  annotate               1.68.0   2020-10-27 [1] Bioconductor  
##  AnnotationDbi        * 1.52.0   2020-10-27 [1] Bioconductor  
##  assertthat             0.2.1    2019-03-21 [1] RSPM (R 4.0.3)
##  babelgene              21.4     2021-04-26 [1] RSPM (R 4.0.4)
##  backports              1.2.1    2020-12-09 [1] RSPM (R 4.0.3)
##  Biobase              * 2.50.0   2020-10-27 [1] Bioconductor  
##  BiocGenerics         * 0.36.1   2021-04-16 [1] Bioconductor  
##  BiocParallel           1.24.1   2020-11-06 [1] Bioconductor  
##  bit                    4.0.4    2020-08-04 [1] RSPM (R 4.0.3)
##  bit64                  4.0.5    2020-08-30 [1] RSPM (R 4.0.3)
##  bitops                 1.0-7    2021-04-24 [1] RSPM (R 4.0.4)
##  blob                   1.2.1    2020-01-20 [1] RSPM (R 4.0.3)
##  bslib                  0.2.5    2021-05-12 [1] RSPM (R 4.0.4)
##  cachem                 1.0.5    2021-05-15 [1] RSPM (R 4.0.4)
##  cli                    2.5.0    2021-04-26 [1] RSPM (R 4.0.4)
##  coda                   0.19-4   2020-09-30 [1] RSPM (R 4.0.3)
##  colorspace             2.0-1    2021-05-04 [1] RSPM (R 4.0.4)
##  crayon                 1.4.1    2021-02-08 [1] RSPM (R 4.0.3)
##  DBI                    1.1.1    2021-01-15 [1] RSPM (R 4.0.3)
##  DelayedArray           0.16.3   2021-03-24 [1] Bioconductor  
##  DESeq2               * 1.30.1   2021-02-19 [1] Bioconductor  
##  digest                 0.6.27   2020-10-24 [1] RSPM (R 4.0.3)
##  dplyr                  1.0.6    2021-05-05 [1] RSPM (R 4.0.4)
##  ellipsis               0.3.2    2021-04-29 [1] RSPM (R 4.0.4)
##  emmeans                1.6.0    2021-04-24 [1] RSPM (R 4.0.4)
##  estimability           1.3      2018-02-11 [1] RSPM (R 4.0.0)
##  evaluate               0.14     2019-05-28 [1] RSPM (R 4.0.3)
##  fansi                  0.4.2    2021-01-15 [1] RSPM (R 4.0.3)
##  farver                 2.1.0    2021-02-28 [1] RSPM (R 4.0.3)
##  fastmap                1.1.0    2021-01-25 [1] RSPM (R 4.0.3)
##  fftw                   1.0-6    2020-02-24 [1] RSPM (R 4.0.2)
##  genefilter             1.72.1   2021-01-21 [1] Bioconductor  
##  geneplotter            1.68.0   2020-10-27 [1] Bioconductor  
##  generics               0.1.0    2020-10-31 [1] RSPM (R 4.0.3)
##  GenomeInfoDb         * 1.26.7   2021-04-08 [1] Bioconductor  
##  GenomeInfoDbData       1.2.4    2022-03-01 [1] Bioconductor  
##  GenomicRanges        * 1.42.0   2020-10-27 [1] Bioconductor  
##  getopt                 1.20.3   2019-03-22 [1] RSPM (R 4.0.0)
##  ggplot2                3.3.3    2020-12-30 [1] RSPM (R 4.0.3)
##  glue                   1.4.2    2020-08-27 [1] RSPM (R 4.0.3)
##  graph                  1.68.0   2020-10-27 [1] Bioconductor  
##  GSEABase               1.52.1   2020-12-11 [1] Bioconductor  
##  GSVA                 * 1.38.2   2021-02-09 [1] Bioconductor  
##  gtable                 0.3.0    2019-03-25 [1] RSPM (R 4.0.3)
##  highr                  0.9      2021-04-16 [1] RSPM (R 4.0.4)
##  hms                    1.0.0    2021-01-13 [1] RSPM (R 4.0.3)
##  htmltools              0.5.1.1  2021-01-22 [1] RSPM (R 4.0.3)
##  httr                   1.4.2    2020-07-20 [1] RSPM (R 4.0.3)
##  IRanges              * 2.24.1   2020-12-12 [1] Bioconductor  
##  jquerylib              0.1.4    2021-04-26 [1] RSPM (R 4.0.4)
##  jsonlite               1.7.2    2020-12-09 [1] RSPM (R 4.0.3)
##  knitr                  1.33     2021-04-24 [1] RSPM (R 4.0.4)
##  lattice                0.20-41  2020-04-02 [2] CRAN (R 4.0.5)
##  lifecycle              1.0.0    2021-02-15 [1] RSPM (R 4.0.3)
##  limma                * 3.46.0   2020-10-27 [1] Bioconductor  
##  locfit                 1.5-9.4  2020-03-25 [1] RSPM (R 4.0.3)
##  magrittr             * 2.0.1    2020-11-17 [1] RSPM (R 4.0.3)
##  Matrix                 1.3-2    2021-01-06 [2] CRAN (R 4.0.5)
##  MatrixGenerics       * 1.2.1    2021-01-30 [1] Bioconductor  
##  matrixStats          * 0.58.0   2021-01-29 [1] RSPM (R 4.0.3)
##  memoise                2.0.0    2021-01-26 [1] RSPM (R 4.0.3)
##  msigdbr                7.4.1    2021-05-05 [1] RSPM (R 4.0.4)
##  munsell                0.5.0    2018-06-12 [1] RSPM (R 4.0.3)
##  mvtnorm                1.1-1    2020-06-09 [1] RSPM (R 4.0.3)
##  nlme                   3.1-152  2021-02-04 [2] CRAN (R 4.0.5)
##  optparse             * 1.6.6    2020-04-16 [1] RSPM (R 4.0.0)
##  org.Hs.eg.db         * 3.12.0   2022-03-01 [1] Bioconductor  
##  pheatmap               1.0.12   2019-01-04 [1] RSPM (R 4.0.3)
##  pillar                 1.6.1    2021-05-16 [1] RSPM (R 4.0.4)
##  pkgconfig              2.0.3    2019-09-22 [1] RSPM (R 4.0.3)
##  ps                     1.6.0    2021-02-28 [1] RSPM (R 4.0.3)
##  purrr                  0.3.4    2020-04-17 [1] RSPM (R 4.0.3)
##  qusage               * 2.24.0   2020-10-27 [1] Bioconductor  
##  R.cache                0.15.0   2021-04-30 [1] RSPM (R 4.0.4)
##  R.methodsS3            1.8.1    2020-08-26 [1] RSPM (R 4.0.3)
##  R.oo                   1.24.0   2020-08-26 [1] RSPM (R 4.0.3)
##  R.utils                2.10.1   2020-08-26 [1] RSPM (R 4.0.3)
##  R6                     2.5.0    2020-10-28 [1] RSPM (R 4.0.3)
##  RColorBrewer           1.1-2    2014-12-07 [1] RSPM (R 4.0.3)
##  Rcpp                   1.0.6    2021-01-15 [1] RSPM (R 4.0.3)
##  RCurl                  1.98-1.3 2021-03-16 [1] RSPM (R 4.0.4)
##  readr                  1.4.0    2020-10-05 [1] RSPM (R 4.0.4)
##  rematch2               2.1.2    2020-05-01 [1] RSPM (R 4.0.3)
##  rlang                  0.4.11   2021-04-30 [1] RSPM (R 4.0.4)
##  rmarkdown              2.8      2021-05-07 [1] RSPM (R 4.0.4)
##  RSQLite                2.2.7    2021-04-22 [1] RSPM (R 4.0.4)
##  rstudioapi             0.13     2020-11-12 [1] RSPM (R 4.0.3)
##  S4Vectors            * 0.28.1   2020-12-09 [1] Bioconductor  
##  sass                   0.4.0    2021-05-12 [1] RSPM (R 4.0.4)
##  scales                 1.1.1    2020-05-11 [1] RSPM (R 4.0.3)
##  sessioninfo            1.1.1    2018-11-05 [1] RSPM (R 4.0.3)
##  stringi                1.6.1    2021-05-10 [1] RSPM (R 4.0.4)
##  stringr                1.4.0    2019-02-10 [1] RSPM (R 4.0.3)
##  styler                 1.4.1    2021-03-30 [1] RSPM (R 4.0.4)
##  SummarizedExperiment * 1.20.0   2020-10-27 [1] Bioconductor  
##  survival               3.2-10   2021-03-16 [2] CRAN (R 4.0.5)
##  tibble                 3.1.2    2021-05-16 [1] RSPM (R 4.0.4)
##  tidyselect             1.1.1    2021-04-30 [1] RSPM (R 4.0.4)
##  utf8                   1.2.1    2021-03-12 [1] RSPM (R 4.0.3)
##  vctrs                  0.3.8    2021-04-29 [1] RSPM (R 4.0.4)
##  withr                  2.4.2    2021-04-18 [1] RSPM (R 4.0.4)
##  xfun                   0.23     2021-05-15 [1] RSPM (R 4.0.4)
##  XML                    3.99-0.6 2021-03-16 [1] RSPM (R 4.0.4)
##  xtable                 1.8-4    2019-04-21 [1] RSPM (R 4.0.3)
##  XVector                0.30.0   2020-10-27 [1] Bioconductor  
##  yaml                   2.2.1    2020-02-01 [1] RSPM (R 4.0.3)
##  zlibbioc               1.36.0   2020-10-27 [1] Bioconductor  
## 
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library

References

Broad Institute Team, 2019 Gene set enrichment analysis (GSEA) user guide. https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideFrame.html
Carlson M., 2020 org.Hs.eg.db: Genome wide annotation for human. http://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html
Dolgalev I., 2020 msigdbr: MSigDB gene sets for multiple organisms in a tidy data format. https://cran.r-project.org/web/packages/msigdbr/index.html
Gu Z., R. Eils, and M. Schlesner, 2016 Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics. https://doi.org/10.1093/bioinformatics/btw313
Hänzelmann S., R. Castelo, and J. Guinney, 2013a GSVA: Gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics 14: 7. https://doi.org/10.1186/1471-2105-14-7
Hänzelmann S., R. Castelo, and J. Guinney, 2013b GSVA. https://github.com/rcastelo/GSVA/blob/master/man/gsva.Rd
Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: Current approaches and outstanding challenges. PLOS Computational Biology 8: e1002375. https://doi.org/10.1371/journal.pcbi.1002375
Liberzon A., A. Subramanian, R. Pinchback, H. Thorvaldsdóttir, P. Tamayo, et al., 2011 Molecular signatures database (MSigDB) 3.0. Bioinformatics 27: 1739–1740. https://doi.org/10.1093/bioinformatics/btr260
Liberzon A., C. Birger, H. Thorvaldsdóttir, M. Ghandi, J. P. Mesirov, et al., 2015 The molecular signatures database hallmark gene set collection. Cell Systems 1. https://doi.org/10.1016/j.cels.2015.12.004
Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8
Malhotra S., 2018 Decoding gene set variation analysis. https://towardsdatascience.com/decoding-gene-set-variation-analysis-8193a0cfda3
Slowikowski K., 2017 Make heatmaps in R with pheatmap. https://slowkow.com/notes/pheatmap-tutorial/
Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, et al., 2005 Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102: 15545–15550. https://doi.org/10.1073/pnas.0506580102
Yaari G., C. R. Bolen, J. Thakar, and S. H. Kleinstein, 2013 Quantitative set analysis for gene expression: A method to quantify gene set differential expression including gene-gene correlations. Nucleic Acids Research 41: e170. https://doi.org/10.1093/nar/gkt660