1 Purpose of this analysis

This example is one of pathway analysis module set, we recommend looking at the pathway analysis table below to help you determine which pathway analysis method is best suited for your purposes.

This particular example analysis shows how you can use Gene Set Enrichment Analysis (GSEA) to detect situations where genes in a predefined gene set or pathway change in a coordinated way between two conditions (Subramanian et al. 2005). Changes at the pathway-level may be statistically significant, and contribute to phenotypic differences, even if the changes in the expression level of individual genes are small.

⬇️ Jump to the analysis code ⬇️

1.0.1 What is pathway analysis?

Pathway analysis refers to any one of many techniques that uses predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. In the context of refine.bio, we use these techniques to analyze and interpret genome-wide gene expression experiments. The rationale for performing pathway analysis is that looking at the pathway-level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis.

We highly recommend taking a look at Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges from Khatri et al. (2012) for a more comprehensive overview. We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the Resources for further learning section.

1.0.2 How to choose a pathway analysis?

This table summarizes the pathway analyses examples in this module.

Analysis	What is required for input	What output looks like	✅ Pros	⚠️ Cons
ORA (Over-representation Analysis)	A list of gene IDs (no stats needed)	A per-pathway hypergeometric test result	- Simple - Inexpensive computationally to calculate p-values	- Requires arbitrary thresholds and ignores any statistics associated with a gene - Assumes independence of genes and pathways
GSEA (Gene Set Enrichment Analysis)	A list of genes IDs with gene-level summary statistics	A per-pathway enrichment score	- Includes all genes (no arbitrary threshold!) - Attempts to measure coordination of genes	- Permutations can be expensive - Does not account for pathway overlap - Two-group comparisons not always appropriate/feasible
GSVA (Gene Set Variation Analysis)	A gene expression matrix (like what you get from refine.bio directly)	Pathway-level scores on a per-sample basis	- Does not require two groups to compare upfront - Normally distributed scores	- Scores are not a good fit for gene sets that contain genes that go up AND down - Method doesn’t assign statistical significance itself - Recommended sample size n > 10

2 How to run this example

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

2.1 Obtain the `.Rmd` file

To run this example yourself, download the .Rmd for this analysis by clicking this link.

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
  dir.create("data")
}

# Define the file path to the plots directory
plots_dir <- "plots"

# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
  dir.create(plots_dir)
}

# Define the file path to the results directory
results_dir <- "results"

# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
  dir.create(results_dir)
}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

2.3 Obtain the gene set for this example

In this example, we are using differential expression results table we obtained from an example analysis of zebrafish samples overexpressing human CREB experiment using limma (Ritchie et al. 2015). The table contains summary statistics including Ensembl gene IDs, t-statistic values, and adjusted p-values (FDR in this case).

We have provided this file for you and the code in this notebook will read in the results that are stored online, but if you’d like to follow the steps for obtaining this results file yourself, we suggest going through that differential expression analysis example.

2.4 About the dataset we are using for this example

For this example analysis, we will use this CREB overexpression zebrafish experiment (Tregnago et al. 2016). Tregnago et al. (2016) used microarrays to measure gene expression of ten zebrafish samples, five overexpressing human CREB, as well as five control samples.

2.5 Check out our file structure!

Your new analysis folder should contain:

The example analysis .Rmd you downloaded
A folder called data (currently empty)
A folder for plots (currently empty)
A folder for results (currently empty)

Your example analysis folder should contain your .Rmd and three empty folders (which won’t be empty for long!).

If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.

3 Using a different refine.bio dataset with this analysis?

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.

4 Gene set enrichment analysis - Microarray

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

In this analysis, we will be using clusterProfiler package to perform GSEA and the msigdbr package which contains gene sets from the Molecular Signatures Database (MSigDB) already in the tidy format required by clusterProfiler (Subramanian et al. 2005; Liberzon et al. 2011; Yu et al. 2012; Dolgalev 2020). In this analysis, we will be using clusterProfiler package to perform GSEA and the msigdbr package which contains gene sets from the Molecular Signatures Database (MSigDB) already in the tidy format required by clusterProfiler (Subramanian et al. 2005; Liberzon et al. 2011; Yu et al. 2012; Dolgalev 2020).

We will also need the org.Dr.eg.db package to perform gene identifier conversion (Carlson 2019).

if (!("clusterProfiler" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("clusterProfiler", update = FALSE)
}

if (!("msigdbr" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("msigdbr", update = FALSE)
}

if (!("org.Dr.eg.db" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("org.Dr.eg.db", update = FALSE)
}

Attach the packages we need for this analysis.

# Attach the library
library(clusterProfiler)

# Package that contains MSigDB gene sets in tidy format
library(msigdbr)

# Zebrafish annotation package we'll use for gene identifier conversion
library(org.Dr.eg.db)

# We will need this so we can use the pipe: %>%
library(magrittr)

4.2 Download data file

We will read in the differential expression results we will download from online. These results are from a zebrafish microarray experiment we used for differential expression analysis for two groups using limma (Ritchie et al. 2015). The table contains summary statistics including Ensembl gene IDs, t-statistic values, and adjusted p-values (FDR in this case).

Instead of using the URL below, you can use a file path to a TSV file with your desired gene list results. First we will assign the URL to its own variable called, dge_url.

# Define the url to your differential expression results file
dge_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/02-microarray/results/GSE71270/GSE71270_limma_results.tsv"

We will also declare a file path to where we want this file to be downloaded to and we can use the same file path later for reading the file into R.

dge_results_file <- file.path(
  results_dir,
  "GSE71270_limma_results.tsv"
)

Using the URL (dge_url) and file path (dge_results_file) we can download the file and use the destfile argument to specify where it should be saved.

download.file(
  dge_url,
  # The file will be saved to this location and with this name
  destfile = dge_results_file
)

Now let’s double check that the results file is in the right place.

# Check if the file exists
file.exists(dge_results_file)

## [1] TRUE

4.3 Import data

Read in the file that has differential expression results.

# Read in the contents of the differential expression results file
dge_df <- readr::read_tsv(dge_results_file)

## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   Gene = col_character(),
##   logFC = col_double(),
##   AveExpr = col_double(),
##   t = col_double(),
##   P.Value = col_double(),
##   adj.P.Val = col_double(),
##   B = col_double()
## )

Note that read_tsv() can also read TSV files directly from a URL and doesn’t necessarily require you download the file first. If you wanted to use that feature, you could replace the call above with readr::read_tsv(dge_url) and skip the download steps.

Let’s take a look at what these results from the differential expression analysis look like.

dge_df

4.4 Getting familiar with MSigDB gene sets available via `msigdbr`

We can use the msigdbr package to access these gene sets in a format compatible with the package we’ll use for analysis, clusterProfiler (Yu et al. 2012; Dolgalev 2020). The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses (Subramanian et al. 2005; Liberzon et al. 2011). We can use the msigdbr package to access these gene sets in a format compatible with the package we’ll use for analysis, clusterProfiler (Yu et al. 2012; Dolgalev 2020).

The gene sets available directly from MSigDB are applicable to human studies. msigdbr also supports commonly studied model organisms.

Let’s take a look at what organisms the package supports with msigdbr_species().

msigdbr_species()

MSigDB contains 8 different gene set collections (Subramanian et al. 2005; Liberzon et al. 2011) that are distinguished by how they are derived (e.g., computationally mined, curated).

In this example, we will use a collection called Hallmark gene sets for GSEA (Liberzon et al. 2015). Here’s an excerpt of the collection description from MSigDB:

Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA.

Notably, there are only 50 gene sets included in this collection. The fewer gene sets we test, the lower our multiple hypothesis testing burden.

The data we’re interested in here comes from zebrafish samples, so we can obtain only the Hallmarks gene sets relevant to D. rerio by specifying category = "H" and species = "Danio rerio", respectively, to the msigdbr() function.

dr_hallmark_df <- msigdbr(
  species = "Danio rerio", # Replace with species name relevant to your data
  category = "H"
)

If you run the chunk above without specifying a category to the msigdbr() function, it will return all of the MSigDB gene sets for zebrafish.

Let’s preview what’s in dr_hallmark_df.

head(dr_hallmark_df)

Looks like we have a data frame of gene sets with associated gene symbols and Entrez IDs.

In our differential expression results data frame, dge_df we have Ensembl gene identifiers. So we will need to convert our Ensembl IDs into either gene symbols or Entrez IDs for GSEA.

4.5 Gene identifier conversion

We’re going to convert our identifiers in dge_df to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!

The annotation package org.Dr.eg.db contains information for different identifiers (Carlson 2019). org.Dr.eg.db is specific to Danio rerio – this is what the Dr in the package name is referencing.

We can see what types of IDs are available to us in an annotation package with keytypes().

keytypes(org.Dr.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "ONTOLOGY"     "ONTOLOGYALL"  "PATH"        
## [17] "PFAM"         "PMID"         "PROSITE"      "REFSEQ"      
## [21] "SYMBOL"       "UNIGENE"      "UNIPROT"      "ZFIN"

Even though we’ll use this package to convert from Ensembl gene IDs (ENSEMBL) to Entrez IDs (ENTREZID), we could just as easily use it to convert from an Ensembl transcript ID (ENSEMBLTRANS) to gene symbols (SYMBOL).

Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: the microarray example and the RNA-seq example.

The function we will use to map from Ensembl gene IDs to Entrez gene IDs is called mapIds() and comes from the AnnotationDbi package.

Let’s create a data frame that shows the mapped Entrez IDs along with the differential expression stats for the respective Ensembl IDs.

# First let's create a mapped data frame we can join to the differential
# expression stats
dge_mapped_df <- data.frame(
  entrez_id = mapIds(
    # Replace with annotation package for the organism relevant to your data
    org.Dr.eg.db,
    keys = dge_df$Gene,
    # Replace with the type of gene identifiers in your data
    keytype = "ENSEMBL",
    # Replace with the type of gene identifiers you would like to map to
    column = "ENTREZID",
    # This will keep only the first mapped value for each Ensembl ID
    multiVals = "first"
  )
) %>%
  # If an Ensembl gene identifier doesn't map to a Entrez gene identifier,
  # drop that from the data frame
  dplyr::filter(!is.na(entrez_id)) %>%
  # Make an `Ensembl` column to store the rownames
  tibble::rownames_to_column("Ensembl") %>%
  # Now let's join the rest of the expression data
  dplyr::inner_join(dge_df, by = c("Ensembl" = "Gene"))

## 'select()' returned 1:many mapping between keys and columns

This 1:many mapping between keys and columns message means that some Ensembl gene identifiers map to multiple Entrez IDs. In this case, it’s also possible that a Entrez ID will map to multiple Ensembl IDs. For the purpose of performing GSEA later in this notebook, we keep only the first mapped IDs. For more about how to explore this, take a look at our microarray gene ID conversion example.

Let’s see a preview of dge_mapped_df.

head(dge_mapped_df)

4.6 Perform gene set enrichment analysis (GSEA)

The goal of GSEA is to detect situations where many genes in a gene set change in a coordinated way, even when individual changes are small in magnitude (Subramanian et al. 2005).

GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic. This score reflects whether or not a gene set or pathway is overrepresented at the top or bottom of the gene rankings (Subramanian et al. 2005; Yu 2020). Specifically, genes are ranked from most positive to most negative based on their statistic and a running sum is calculated by starting with the most highly ranked genes and increasing the score when a gene is in the pathway and decreasing the score when a gene is not. In this example, the enrichment score for a pathway is the running sum’s maximum deviation from zero. GSEA also assesses statistical significance of the scores for each pathway through permutation testing. As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing (Subramanian et al. 2005; Yu 2020).

The implementation of GSEA we use in this examples requires a gene list ordered by some statistic (here we’ll use the t-statistic calculated as part of differential gene expression analysis) and input gene sets (Hallmark collection). When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked.

4.6.1 Determine our pre-ranked genes list

The GSEA() function takes a pre-ranked and sorted named vector of statistics, where the names in the vector are gene identifiers. It requires unique gene identifiers to produce the most accurate results, so we will need to resolve any duplicates found in our dataset. (The GSEA() function will throw a warning if we do not do this ahead of time.)

Let’s check to see if we have any Entrez IDs that mapped to multiple Ensembl IDs in our data frame of differential expression results.

any(duplicated(dge_mapped_df$entrez_id))

## [1] TRUE

Looks like we do have duplicated Entrez IDs. Let’s find out which ones.

dup_entrez_ids <- dge_mapped_df %>%
  dplyr::filter(duplicated(entrez_id)) %>%
  dplyr::pull(entrez_id)

dup_entrez_ids

## [1] "336702" "57924"

Now let’s take a look at the rows associated with the duplicated Entrez IDs.

dge_mapped_df %>%
  dplyr::filter(entrez_id %in% dup_entrez_ids)

We can see that the associated values vary for each row.

As we mentioned earlier, we will want to remove duplicated gene identifiers in preparation for the GSEA() step. Let’s keep the Entrez IDs associated with the higher absolute value of the t-statistic. GSEA relies on genes’ rankings on the basis of a gene-level statistic and the enrichment score that is calculated reflects the degree to which genes in a gene set are overrepresented in the top or bottom of the rankings (Subramanian et al. 2005; Yu 2020).

Retaining the instance of the Entrez ID with the higher absolute value of a gene-level statistic means that we will retain the value that is likely to be more highly- or lowly-ranked or, put another way, the values less likely to be towards the middle of the ranked gene list. We should keep this decision in mind when interpreting our results. For example, if all the duplicate identifiers happened to be in a particular gene set, we may get an overly optimistic view of how perturbed that gene set is because we preferentially selected instances of the identifier that have a higher absolute value of the statistic used for ranking.

We are removing values for two genes here, so it is unlikely to have a considerable impact on our results.

filtered_dge_mapped_df <- dge_mapped_df %>%
  # Sort so that the highest absolute values of the t-statistic are at the top
  dplyr::arrange(dplyr::desc(abs(t))) %>%
  # Filter out the duplicated rows using `dplyr::distinct()`-- this will keep
  # the first row with the duplicated value thus keeping the row with the
  # highest absolute value of the t-statistic
  dplyr::distinct(entrez_id, .keep_all = TRUE)

Let’s check to see that we removed the duplicate Entrez IDs and kept the rows with the higher absolute value of the t-statistic.

filtered_dge_mapped_df %>%
  dplyr::filter(entrez_id %in% dup_entrez_ids)

Looks like we were able to successfully get rid of the duplicate gene identifiers and keep the observations with the higher absolute value of the t-statistic!

In the next chunk, we will create a named vector ranked based on the gene-level t-statistic values.

# Let's create a named vector ranked based on the t-statistic values
t_vector <- filtered_dge_mapped_df$t
names(t_vector) <- filtered_dge_mapped_df$entrez_id

# We need to sort the t-statistic values in descending order here
t_vector <- sort(t_vector, decreasing = TRUE)

Let’s preview our pre-ranked named vector.

# Look at first entries of the ranked t-statistic vector
head(t_vector)

##   555053   140633   407728   368924   335916   323329 
## 20.22172 14.48634 13.88657 12.45258 11.24450 10.92140

4.6.2 Run GSEA using the `GSEA()` function

Genes were ranked from most positive to most negative, weighted according to their gene-level statistic, in the previous section. In this section, we will implement GSEA to calculate the enrichment score for each gene set using our pre-ranked gene list.

The GSEA algorithm utilizes random sampling so we are going to set the seed to make our results reproducible.

# Set the seed so our results are reproducible:
set.seed(2020)

We can use the GSEA() function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., gseKEGG()).

gsea_results <- GSEA(
  geneList = t_vector, # Ordered ranked gene list
  minGSSize = 25, # Minimum gene set size
  maxGSSize = 500, # Maximum gene set set
  pvalueCutoff = 0.05, # p-value cutoff
  eps = 0, # Boundary for calculating the p-value
  seed = TRUE, # Set seed to make results reproducible
  pAdjustMethod = "BH", # Benjamini-Hochberg correction
  TERM2GENE = dplyr::select(
    dr_hallmark_df,
    gs_name,
    entrez_gene
  )
)

## preparing geneSet collections...

## GSEA analysis...

## leading edge analysis...

## done...

Significance is assessed by permuting the gene labels of the pre-ranked gene list and recomputing the enrichment scores of the gene set for the permuted data, which generates a null distribution for the enrichment score. The pAdjustMethod argument to GSEA() above specifies what method to use for adjusting the p-values to account for multiple hypothesis testing; the pvalueCutoff argument tells the function to only return pathways with adjusted p-values less than that threshold in the results slot.

Let’s take a look at the table in the result slot of gsea_results.

# We can access the results from our gseaResult object using `@result`
head(gsea_results@result)

Looks like we have gene sets returned as significant at FDR (false discovery rate) of 0.05. If we did not have results that met the pvalueCutoff condition, this table would be empty.

The NES column contains the normalized enrichment score, which normalizes for the gene set size, for that pathway.

Let’s convert the contents of result into a data frame that we can use for further analysis and write to a file later.

gsea_result_df <- data.frame(gsea_results@result)

4.7 Visualizing results

We can visualize GSEA results for individual pathways or gene sets using enrichplot::gseaplot(). Let’s take a look at 2 different pathways – one with a highly positive NES and one with a highly negative NES – to get more insight into how ES are calculated.

4.7.1 Most Positive NES

Let’s look at the 3 gene sets with the most positive NES.

gsea_result_df %>%
  # This returns the 3 rows with the largest NES values
  dplyr::slice_max(n = 3, order_by = NES)

The gene set HALLMARK_TNFA_SIGNALING_VIA_NFKB has the most positive NES score.

most_positive_nes_plot <- enrichplot::gseaplot(
  gsea_results,
  geneSetID = "HALLMARK_TNFA_SIGNALING_VIA_NFKB",
  title = "HALLMARK_TNFA_SIGNALING_VIA_NFKB",
  color.line = "#0d76ff"
)

most_positive_nes_plot

Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph indicating that they have positive gene-level scores. The red dashed line indicates the enrichment score, which is the maximum deviation from zero. As mentioned earlier, an enrichment is calculated by starting with the most highly ranked genes (according to the gene-level t-statistic values) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway.

The plots returned by enrichplot::gseaplot are ggplots, so we can use ggplot2::ggsave() to save them to file.

Let’s save to PNG.

ggplot2::ggsave(file.path(plots_dir, "GSE71270_gsea_enrich_positive_plot.png"),
  plot = most_positive_nes_plot
)

## Saving 7 x 5 in image

4.7.2 Most Negative NES

Let’s look for the 3 gene sets with the most negative NES.

gsea_result_df %>%
  # Return the 3 rows with the smallest (most negative) NES values
  dplyr::slice_min(n = 3, order_by = NES)

The gene set HALLMARK_E2F_TARGETS has the most negative NES.

most_negative_nes_plot <- enrichplot::gseaplot(
  gsea_results,
  geneSetID = "HALLMARK_E2F_TARGETS",
  title = "HALLMARK_E2F_TARGETS",
  color.line = "#0d76ff"
)

most_negative_nes_plot

This gene set shows the opposite pattern – genes in the pathway tend to be on the right side of the graph. Again, the red dashed line here indicates the maximum deviation from zero, in other words, the enrichment score. A negative enrichment score will be returned when many genes are near the bottom of the ranked list.

Let’s save this plot to PNG as well.

ggplot2::ggsave(file.path(plots_dir, "GSE71270_gsea_enrich_negative_plot.png"),
  plot = most_negative_nes_plot
)

## Saving 7 x 5 in image

4.8 Write results to file

readr::write_tsv(
  gsea_result_df,
  file.path(
    results_dir,
    "GSE71270_gsea_results.tsv"
  )
)

5 Resources for further learning

clusterProfiler paper (Yu et al. 2012).
clusterProfiler book (Yu 2020).
This handy review which summarizes the different types of pathway analysis and their limitations (Khatri et al. 2012).
See this Broad Institute page for more on GSEA and MSigDB (UC San Diego and Broad Institute Team).

6 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

# Print session info
sessioninfo::session_info()

## ─ Session info ─────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.5 (2021-03-31)
##  os       Ubuntu 20.04.3 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Etc/UTC                     
##  date     2022-03-01                  
## 
## ─ Packages ─────────────────────────────────────────────────────────
##  package         * version  date       lib source        
##  AnnotationDbi   * 1.52.0   2020-10-27 [1] Bioconductor  
##  assertthat        0.2.1    2019-03-21 [1] RSPM (R 4.0.3)
##  babelgene         21.4     2021-04-26 [1] RSPM (R 4.0.4)
##  backports         1.2.1    2020-12-09 [1] RSPM (R 4.0.3)
##  Biobase         * 2.50.0   2020-10-27 [1] Bioconductor  
##  BiocGenerics    * 0.36.1   2021-04-16 [1] Bioconductor  
##  BiocManager       1.30.15  2021-05-11 [1] RSPM (R 4.0.4)
##  BiocParallel      1.24.1   2020-11-06 [1] Bioconductor  
##  bit               4.0.4    2020-08-04 [1] RSPM (R 4.0.3)
##  bit64             4.0.5    2020-08-30 [1] RSPM (R 4.0.3)
##  blob              1.2.1    2020-01-20 [1] RSPM (R 4.0.3)
##  bslib             0.2.5    2021-05-12 [1] RSPM (R 4.0.4)
##  cachem            1.0.5    2021-05-15 [1] RSPM (R 4.0.4)
##  cli               2.5.0    2021-04-26 [1] RSPM (R 4.0.4)
##  clusterProfiler * 3.18.1   2021-02-09 [1] Bioconductor  
##  colorspace        2.0-1    2021-05-04 [1] RSPM (R 4.0.4)
##  cowplot           1.1.1    2020-12-30 [1] RSPM (R 4.0.3)
##  crayon            1.4.1    2021-02-08 [1] RSPM (R 4.0.3)
##  data.table        1.14.0   2021-02-21 [1] RSPM (R 4.0.3)
##  DBI               1.1.1    2021-01-15 [1] RSPM (R 4.0.3)
##  digest            0.6.27   2020-10-24 [1] RSPM (R 4.0.3)
##  DO.db             2.9      2022-03-01 [1] Bioconductor  
##  DOSE              3.16.0   2020-10-27 [1] Bioconductor  
##  downloader        0.4      2015-07-09 [1] RSPM (R 4.0.3)
##  dplyr             1.0.6    2021-05-05 [1] RSPM (R 4.0.4)
##  ellipsis          0.3.2    2021-04-29 [1] RSPM (R 4.0.4)
##  enrichplot        1.10.2   2021-01-28 [1] Bioconductor  
##  evaluate          0.14     2019-05-28 [1] RSPM (R 4.0.3)
##  fansi             0.4.2    2021-01-15 [1] RSPM (R 4.0.3)
##  farver            2.1.0    2021-02-28 [1] RSPM (R 4.0.3)
##  fastmap           1.1.0    2021-01-25 [1] RSPM (R 4.0.3)
##  fastmatch         1.1-0    2017-01-28 [1] RSPM (R 4.0.3)
##  fgsea             1.16.0   2020-10-27 [1] Bioconductor  
##  generics          0.1.0    2020-10-31 [1] RSPM (R 4.0.3)
##  getopt            1.20.3   2019-03-22 [1] RSPM (R 4.0.0)
##  ggforce           0.3.3    2021-03-05 [1] RSPM (R 4.0.3)
##  ggplot2           3.3.3    2020-12-30 [1] RSPM (R 4.0.3)
##  ggraph            2.0.5    2021-02-23 [1] RSPM (R 4.0.3)
##  ggrepel           0.9.1    2021-01-15 [1] RSPM (R 4.0.3)
##  glue              1.4.2    2020-08-27 [1] RSPM (R 4.0.3)
##  GO.db             3.12.1   2022-03-01 [1] Bioconductor  
##  GOSemSim          2.16.1   2020-10-29 [1] Bioconductor  
##  graphlayouts      0.7.1    2020-10-26 [1] RSPM (R 4.0.4)
##  gridExtra         2.3      2017-09-09 [1] RSPM (R 4.0.3)
##  gtable            0.3.0    2019-03-25 [1] RSPM (R 4.0.3)
##  highr             0.9      2021-04-16 [1] RSPM (R 4.0.4)
##  hms               1.0.0    2021-01-13 [1] RSPM (R 4.0.3)
##  htmltools         0.5.1.1  2021-01-22 [1] RSPM (R 4.0.3)
##  igraph            1.2.6    2020-10-06 [1] RSPM (R 4.0.3)
##  IRanges         * 2.24.1   2020-12-12 [1] Bioconductor  
##  jquerylib         0.1.4    2021-04-26 [1] RSPM (R 4.0.4)
##  jsonlite          1.7.2    2020-12-09 [1] RSPM (R 4.0.3)
##  knitr             1.33     2021-04-24 [1] RSPM (R 4.0.4)
##  labeling          0.4.2    2020-10-20 [1] RSPM (R 4.0.3)
##  lattice           0.20-41  2020-04-02 [2] CRAN (R 4.0.5)
##  lifecycle         1.0.0    2021-02-15 [1] RSPM (R 4.0.3)
##  magrittr        * 2.0.1    2020-11-17 [1] RSPM (R 4.0.3)
##  MASS              7.3-53.1 2021-02-12 [2] CRAN (R 4.0.5)
##  Matrix            1.3-2    2021-01-06 [2] CRAN (R 4.0.5)
##  memoise           2.0.0    2021-01-26 [1] RSPM (R 4.0.3)
##  msigdbr         * 7.4.1    2021-05-05 [1] RSPM (R 4.0.4)
##  munsell           0.5.0    2018-06-12 [1] RSPM (R 4.0.3)
##  optparse        * 1.6.6    2020-04-16 [1] RSPM (R 4.0.0)
##  org.Dr.eg.db    * 3.12.0   2022-03-01 [1] Bioconductor  
##  pillar            1.6.1    2021-05-16 [1] RSPM (R 4.0.4)
##  pkgconfig         2.0.3    2019-09-22 [1] RSPM (R 4.0.3)
##  plyr              1.8.6    2020-03-03 [1] RSPM (R 4.0.3)
##  polyclip          1.10-0   2019-03-14 [1] RSPM (R 4.0.3)
##  ps                1.6.0    2021-02-28 [1] RSPM (R 4.0.3)
##  purrr             0.3.4    2020-04-17 [1] RSPM (R 4.0.3)
##  qvalue            2.22.0   2020-10-27 [1] Bioconductor  
##  R.cache           0.15.0   2021-04-30 [1] RSPM (R 4.0.4)
##  R.methodsS3       1.8.1    2020-08-26 [1] RSPM (R 4.0.3)
##  R.oo              1.24.0   2020-08-26 [1] RSPM (R 4.0.3)
##  R.utils           2.10.1   2020-08-26 [1] RSPM (R 4.0.3)
##  R6                2.5.0    2020-10-28 [1] RSPM (R 4.0.3)
##  RColorBrewer      1.1-2    2014-12-07 [1] RSPM (R 4.0.3)
##  Rcpp              1.0.6    2021-01-15 [1] RSPM (R 4.0.3)
##  readr             1.4.0    2020-10-05 [1] RSPM (R 4.0.4)
##  rematch2          2.1.2    2020-05-01 [1] RSPM (R 4.0.3)
##  reshape2          1.4.4    2020-04-09 [1] RSPM (R 4.0.3)
##  rlang             0.4.11   2021-04-30 [1] RSPM (R 4.0.4)
##  rmarkdown         2.8      2021-05-07 [1] RSPM (R 4.0.4)
##  RSQLite           2.2.7    2021-04-22 [1] RSPM (R 4.0.4)
##  rstudioapi        0.13     2020-11-12 [1] RSPM (R 4.0.3)
##  rvcheck           0.1.8    2020-03-01 [1] RSPM (R 4.0.0)
##  S4Vectors       * 0.28.1   2020-12-09 [1] Bioconductor  
##  sass              0.4.0    2021-05-12 [1] RSPM (R 4.0.4)
##  scales            1.1.1    2020-05-11 [1] RSPM (R 4.0.3)
##  scatterpie        0.1.6    2021-04-23 [1] RSPM (R 4.0.4)
##  sessioninfo       1.1.1    2018-11-05 [1] RSPM (R 4.0.3)
##  shadowtext        0.0.8    2021-04-23 [1] RSPM (R 4.0.4)
##  stringi           1.6.1    2021-05-10 [1] RSPM (R 4.0.4)
##  stringr           1.4.0    2019-02-10 [1] RSPM (R 4.0.3)
##  styler            1.4.1    2021-03-30 [1] RSPM (R 4.0.4)
##  tibble            3.1.2    2021-05-16 [1] RSPM (R 4.0.4)
##  tidygraph         1.2.0    2020-05-12 [1] RSPM (R 4.0.3)
##  tidyr             1.1.3    2021-03-03 [1] RSPM (R 4.0.4)
##  tidyselect        1.1.1    2021-04-30 [1] RSPM (R 4.0.4)
##  tweenr            1.0.2    2021-03-23 [1] RSPM (R 4.0.4)
##  utf8              1.2.1    2021-03-12 [1] RSPM (R 4.0.3)
##  vctrs             0.3.8    2021-04-29 [1] RSPM (R 4.0.4)
##  viridis           0.6.1    2021-05-11 [1] RSPM (R 4.0.4)
##  viridisLite       0.4.0    2021-04-13 [1] RSPM (R 4.0.4)
##  withr             2.4.2    2021-04-18 [1] RSPM (R 4.0.4)
##  xfun              0.23     2021-05-15 [1] RSPM (R 4.0.4)
##  yaml              2.2.1    2020-02-01 [1] RSPM (R 4.0.3)
## 
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library

References

Carlson M., 2019 Genome wide annotation for zebrafish. https://bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html

Dolgalev I., 2020 msigdbr: MSigDB gene sets for multiple organisms in a tidy data format. https://cran.r-project.org/web/packages/msigdbr/index.html

Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: Current approaches and outstanding challenges. PLOS Computational Biology 8: e1002375. https://doi.org/10.1371/journal.pcbi.1002375

Liberzon A., A. Subramanian, R. Pinchback, H. Thorvaldsdóttir, P. Tamayo, et al., 2011 Molecular signatures database (MSigDB) 3.0. Bioinformatics 27: 1739–1740. https://doi.org/10.1093/bioinformatics/btr260

Liberzon A., C. Birger, H. Thorvaldsdóttir, M. Ghandi, J. P. Mesirov, et al., 2015 The molecular signatures database hallmark gene set collection. Cell Systems 1. https://doi.org/10.1016/j.cels.2015.12.004

Ritchie M. E., B. Phipson, D. Wu, Y. Hu, C. W. Law, et al., 2015 limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43: e47. https://doi.org/10.1093/nar/gkv007

Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, et al., 2005 Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102: 15545–15550. https://doi.org/10.1073/pnas.0506580102

Tregnago C., E. Manara, M. Zampini, V. Bisio, C. Borga, et al., 2016 CREB engages C/EBPδ to initiate leukemogenesis. Leukemia 30: 1887–1896. https://doi.org/10.1038/leu.2016.98

UC San Diego and Broad Institute Team, GSEA: Gene set enrichment analysis. https://www.gsea-msigdb.org/gsea/index.jsp

Yu G., L.-G. Wang, Y. Han, and Q.-Y. He, 2012 clusterProfiler: An R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16: 284–287. https://doi.org/10.1089/omi.2011.0118

Yu G., 2020 clusterProfiler: Universal enrichment tool for functional and comparative study. http://yulab-smu.top/clusterProfiler-book/index.html

Gene set enrichment analysis - Microarray

CCDL for ALSF

October 2020

1 Purpose of this analysis

1.0.1 What is pathway analysis?

1.0.2 How to choose a pathway analysis?

2 How to run this example

2.1 Obtain the `.Rmd` file

2.2 Set up your analysis folders

2.3 Obtain the gene set for this example

2.4 About the dataset we are using for this example

2.5 Check out our file structure!

3 Using a different refine.bio dataset with this analysis?

4 Gene set enrichment analysis - Microarray

4.1 Install libraries

4.2 Download data file

4.3 Import data

4.4 Getting familiar with MSigDB gene sets available via `msigdbr`

4.5 Gene identifier conversion

4.6 Perform gene set enrichment analysis (GSEA)

4.6.1 Determine our pre-ranked genes list

4.6.2 Run GSEA using the `GSEA()` function

4.7 Visualizing results

4.7.1 Most Positive NES

4.7.2 Most Negative NES

4.8 Write results to file

5 Resources for further learning

6 Session info

References

Gene set enrichment analysis - Microarray

CCDL for ALSF

October 2020

1 Purpose of this analysis

1.0.1 What is pathway analysis?

1.0.2 How to choose a pathway analysis?

2 How to run this example

2.1 Obtain the .Rmd file

2.2 Set up your analysis folders

2.3 Obtain the gene set for this example

2.4 About the dataset we are using for this example

2.5 Check out our file structure!

3 Using a different refine.bio dataset with this analysis?

4 Gene set enrichment analysis - Microarray

4.1 Install libraries

4.2 Download data file

4.3 Import data

4.4 Getting familiar with MSigDB gene sets available via msigdbr

4.5 Gene identifier conversion

4.6 Perform gene set enrichment analysis (GSEA)

4.6.1 Determine our pre-ranked genes list

4.6.2 Run GSEA using the GSEA() function

4.7 Visualizing results

4.7.1 Most Positive NES

4.7.2 Most Negative NES

4.8 Write results to file

5 Resources for further learning

6 Session info

References

2.1 Obtain the `.Rmd` file

4.4 Getting familiar with MSigDB gene sets available via `msigdbr`

4.6.2 Run GSEA using the `GSEA()` function