1 Purpose of this analysis

This example is one of pathway analysis module set, we recommend looking at the pathway analysis table below to help you determine which pathway analysis method is best suited for your purposes.

This particular example analysis shows how you can use Gene Set Enrichment Analysis (GSEA) to detect situations where genes in a predefined gene set or pathway change in a coordinated way between two conditions (Subramanian et al. 2005). Changes at the pathway-level may be statistically significant, and contribute to phenotypic differences, even if the changes in the expression level of individual genes are small.

⬇️ Jump to the analysis code ⬇️

1.0.1 What is pathway analysis?

Pathway analysis refers to any one of many techniques that uses predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. In the context of refine.bio, we use these techniques to analyze and interpret genome-wide gene expression experiments. The rationale for performing pathway analysis is that looking at the pathway-level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis.

We highly recommend taking a look at Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges from Khatri et al. (2012) for a more comprehensive overview. We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the Resources for further learning section.

1.0.2 How to choose a pathway analysis?

This table summarizes the pathway analyses examples in this module.

Analysis What is required for input What output looks like ✅ Pros ⚠️ Cons
ORA (Over-representation Analysis) A list of gene IDs (no stats needed) A per-pathway hypergeometric test result - Simple

- Inexpensive computationally to calculate p-values
- Requires arbitrary thresholds and ignores any statistics associated with a gene

- Assumes independence of genes and pathways
GSEA (Gene Set Enrichment Analysis) A list of genes IDs with gene-level summary statistics A per-pathway enrichment score - Includes all genes (no arbitrary threshold!)

- Attempts to measure coordination of genes
- Permutations can be expensive

- Does not account for pathway overlap

- Two-group comparisons not always appropriate/feasible
GSVA (Gene Set Variation Analysis) A gene expression matrix (like what you get from refine.bio directly) Pathway-level scores on a per-sample basis - Does not require two groups to compare upfront

- Normally distributed scores
- Scores are not a good fit for gene sets that contain genes that go up AND down

- Method doesn’t assign statistical significance itself

- Recommended sample size n > 10

2 How to run this example

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

2.1 Obtain the .Rmd file

To run this example yourself, download the .Rmd for this analysis by clicking this link.

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
  dir.create("data")
}

# Define the file path to the plots directory
plots_dir <- "plots"

# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
  dir.create(plots_dir)
}

# Define the file path to the results directory
results_dir <- "results"

# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
  dir.create(results_dir)
}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

2.3 Obtain the gene set for this example

In this example, we are using differential expression results table we obtained from an example analysis of zebrafish samples overexpressing human CREB experiment using limma (Ritchie et al. 2015). The table contains summary statistics including Ensembl gene IDs, t-statistic values, and adjusted p-values (FDR in this case).

We have provided this file for you and the code in this notebook will read in the results that are stored online, but if you’d like to follow the steps for obtaining this results file yourself, we suggest going through that differential expression analysis example.

2.4 About the dataset we are using for this example

For this example analysis, we will use this CREB overexpression zebrafish experiment (Tregnago et al. 2016). Tregnago et al. (2016) used microarrays to measure gene expression of ten zebrafish samples, five overexpressing human CREB, as well as five control samples.

2.5 Check out our file structure!

Your new analysis folder should contain:

  • The example analysis .Rmd you downloaded
  • A folder called data (currently empty)
  • A folder for plots (currently empty)
  • A folder for results (currently empty)

Your example analysis folder should contain your .Rmd and three empty folders (which won’t be empty for long!).

If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.

3 Using a different refine.bio dataset with this analysis?

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.


 

4 Gene set enrichment analysis - Microarray

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

In this analysis, we will be using clusterProfiler package to perform GSEA and the msigdbr package which contains gene sets from the Molecular Signatures Database (MSigDB) already in the tidy format required by clusterProfiler (Yu et al. 2012; Dolgalev 2020; Subramanian et al. 2005; Liberzon et al. 2011). In this analysis, we will be using clusterProfiler package to perform GSEA and the msigdbr package which contains gene sets from the Molecular Signatures Database (MSigDB) already in the tidy format required by clusterProfiler (Yu et al. 2012; Dolgalev 2020; Subramanian et al. 2005; Liberzon et al. 2011).

We will also need the org.Dr.eg.db package to perform gene identifier conversion (Carlson 2019).

if (!("clusterProfiler" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("clusterProfiler", update = FALSE)
}

if (!("msigdbr" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("msigdbr", update = FALSE)
}

if (!("org.Dr.eg.db" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("org.Dr.eg.db", update = FALSE)
}

Attach the packages we need for this analysis.

# Attach the library
library(clusterProfiler)

# Package that contains MSigDB gene sets in tidy format
library(msigdbr)

# Zebrafish annotation package we'll use for gene identifier conversion
library(org.Dr.eg.db)

# We will need this so we can use the pipe: %>%
library(magrittr)

4.2 Download data file

We will read in the differential expression results we will download from online. These results are from a zebrafish microarray experiment we used for differential expression analysis for two groups using limma (Ritchie et al. 2015). The table contains summary statistics including Ensembl gene IDs, t-statistic values, and adjusted p-values (FDR in this case).

Instead of using the URL below, you can use a file path to a TSV file with your desired gene list results. First we will assign the URL to its own variable called, dge_url.

# Define the url to your differential expression results file
dge_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/02-microarray/results/GSE71270/GSE71270_limma_results.tsv"

We will also declare a file path to where we want this file to be downloaded to and we can use the same file path later for reading the file into R.

dge_results_file <- file.path(
  results_dir,
  "GSE71270_limma_results.tsv"
)

Using the URL (dge_url) and file path (dge_results_file) we can download the file and use the destfile argument to specify where it should be saved.

download.file(
  dge_url,
  # The file will be saved to this location and with this name
  destfile = dge_results_file
)

Now let’s double check that the results file is in the right place.

# Check if the file exists
file.exists(dge_results_file)
## [1] TRUE

4.3 Import data

Read in the file that has differential expression results.

# Read in the contents of the differential expression results file
dge_df <- readr::read_tsv(dge_results_file)
## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   Gene = col_character(),
##   logFC = col_double(),
##   AveExpr = col_double(),
##   t = col_double(),
##   P.Value = col_double(),
##   adj.P.Val = col_double(),
##   B = col_double()
## )

Note that read_tsv() can also read TSV files directly from a URL and doesn’t necessarily require you download the file first. If you wanted to use that feature, you could replace the call above with readr::read_tsv(dge_url) and skip the download steps.

Let’s take a look at what these results from the differential expression analysis look like.

dge_df