1 Purpose of this analysis

This example is one of pathway analysis module set, we recommend looking at the pathway analysis table below to help you determine which pathway analysis method is best suited for your purposes.

This particular example analysis shows how you can use over-representation analysis (ORA) to determine if a set of genes (e.g., those differentially expressed using some cutoff) shares more or fewer genes with gene sets/pathways than we would expect by chance.

ORA is a broadly applicable technique that may be good in scenarios where your dataset or scientific questions don’t fit the requirements of other pathway analyses methods. It also does not require any particular sample size, since the only input from your dataset is a set of genes of interest (Yaari et al. 2013).

If you have differential expression results or something with a gene-level ranking and a two-group comparison, we recommend considering GSEA for your pathway analysis questions.

⬇️ Jump to the analysis code ⬇️

1.0.1 What is pathway analysis?

Pathway analysis refers to any one of many techniques that uses predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. In the context of refine.bio, we use these techniques to analyze and interpret genome-wide gene expression experiments. The rationale for performing pathway analysis is that looking at the pathway-level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis.

We highly recommend taking a look at Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges from Khatri et al. (2012) for a more comprehensive overview. We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the Resources for further learning section.

1.0.2 How to choose a pathway analysis?

This table summarizes the pathway analyses examples in this module.

Analysis	What is required for input	What output looks like	✅ Pros	⚠️ Cons
ORA (Over-representation Analysis)	A list of gene IDs (no stats needed)	A per-pathway hypergeometric test result	- Simple - Inexpensive computationally to calculate p-values	- Requires arbitrary thresholds and ignores any statistics associated with a gene - Assumes independence of genes and pathways
GSEA (Gene Set Enrichment Analysis)	A list of genes IDs with gene-level summary statistics	A per-pathway enrichment score	- Includes all genes (no arbitrary threshold!) - Attempts to measure coordination of genes	- Permutations can be expensive - Does not account for pathway overlap - Two-group comparisons not always appropriate/feasible
GSVA (Gene Set Variation Analysis)	A gene expression matrix (like what you get from refine.bio directly)	Pathway-level scores on a per-sample basis	- Does not require two groups to compare upfront - Normally distributed scores	- Scores are not a good fit for gene sets that contain genes that go up AND down - Method doesn’t assign statistical significance itself - Recommended sample size n > 10

2 How to run this example

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

2.1 Obtain the `.Rmd` file

To run this example yourself, download the .Rmd for this analysis by clicking this link.

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
  dir.create("data")
}

# Define the file path to the plots directory
plots_dir <- "plots"

# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
  dir.create(plots_dir)
}

# Define the file path to the results directory
results_dir <- "results"

# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
  dir.create(results_dir)
}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

2.3 Obtain the gene set for this example

In this example, we are using differential expression results table we obtained from an example analysis of zebrafish samples overexpressing human CREB experiment using limma (Ritchie et al. 2015). The table contains Ensembl gene IDs, log fold-changes, and adjusted p-values (FDR in this case).

We have provided this file for you and the code in this notebook will read in the results that are stored online, but if you’d like to follow the steps for obtaining this results file yourself, we suggest going through that differential expression analysis example.

2.4 About the dataset we are using for this example

For this example analysis, we will use this CREB overexpression zebrafish experiment (Tregnago et al. 2016). Tregnago et al. (2016) used microarrays to measure gene expression of ten zebrafish samples, five overexpressing human CREB, as well as five control samples.

2.5 Check out our file structure!

Your new analysis folder should contain:

The example analysis .Rmd you downloaded
A folder called data (currently empty)
A folder for plots (currently empty)
A folder for results (currently empty)

Your example analysis folder should contain your .Rmd and three empty folders (which won’t be empty for long!).

If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.

3 Using a different refine.bio dataset with this analysis?

The file we use here has several columns of differential expression summary statistics. If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend replacing the dge_results_file with a different file path to a read in a similar table of genes with the information that you are interested in. If your gene table differs, many steps will need to be changed or deleted entirely depending on the contents of that file (particularly in the Determine our genes of interest list section).

We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.

4 Over-Representation Analysis with `clusterProfiler` - Microarray

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

In this analysis, we will be using clusterProfiler package to perform ORA and the msigdbr package which contains gene sets from the Molecular Signatures Database (MSigDB) already in the tidy format required by clusterProfiler (Subramanian et al. 2005; Liberzon et al. 2011; Yu et al. 2012; Dolgalev 2020).

We will also need the org.Dr.eg.db package to perform gene identifier conversion and ggupset to make an UpSet plot (Carlson 2019; Ahlmann-Eltze 2020).

if (!("clusterProfiler" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("clusterProfiler", update = FALSE)
}

# This is required to make one of the plots that clusterProfiler will make
if (!("ggupset" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("ggupset", update = FALSE)
}

if (!("msigdbr" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("msigdbr", update = FALSE)
}

if (!("org.Dr.eg.db" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("org.Dr.eg.db", update = FALSE)
}

Attach the packages we need for this analysis.

# Attach the library
library(clusterProfiler)

# Package that contains MSigDB gene sets in tidy format
library(msigdbr)

# Danio rerio annotation package we'll use for gene identifier conversion
library(org.Dr.eg.db)

# We will need this so we can use the pipe: %>%
library(magrittr)

4.2 Download data file

For ORA, we only need a list of gene IDs as our input, so this example can work for any situations where you have gene list and want to know more about what biological pathways it shares genes with.

For this example, we will read in results from a differential expression analysis that we have already performed. Rather than reading from a local file, we will download the results table directly from a URL. These results are from a zebrafish microarray experiment we used for differential expression analysis for two groups using limma (Ritchie et al. 2015). The table contains Ensembl gene IDs, log fold-changes for each group, and adjusted p-values (FDR in this case). We can identify differentially regulated genes by filtering these results and use this list as input to ORA.

Instead of using the URL below, you can use a file path to a TSV file with your desired gene list results. First we will assign the URL to its own variable called, dge_url.

# Define the url to your differential expression results file
dge_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/02-microarray/results/GSE71270/GSE71270_limma_results.tsv"

We will also declare a file path to where we want this file to be downloaded to and we can use the same file path later for reading the file into R.

dge_results_file <- file.path(
  results_dir,
  "GSE71270_limma_results.tsv"
)

Using the URL (dge_url) and file path (dge_results_file) we can download the file and use the destfile argument to specify where it should be saved.

download.file(
  dge_url,
  # The file will be saved to this location and with this name
  destfile = dge_results_file
)

Now let’s double check that the results file is in the right place.

# Check if the file exists
file.exists(dge_results_file)

## [1] TRUE

4.3 Import data

Read in the file that has differential expression results.

# Read in the contents of the differential expression results file
dge_df <- readr::read_tsv(dge_results_file)

## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   Gene = col_character(),
##   logFC = col_double(),
##   AveExpr = col_double(),
##   t = col_double(),
##   P.Value = col_double(),
##   adj.P.Val = col_double(),
##   B = col_double()
## )

Note that read_tsv() can also read TSV files directly from a URL and doesn’t necessarily require you download the file first. If you wanted to use that feature, you could replace the call above with readr::read_tsv(dge_url) and skip the download steps.

Let’s take a look at what these results from the differential expression analysis look like.

dge_df

4.4 Getting familiar with MSigDB gene sets available via `msigdbr`

The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses (Subramanian et al. 2005; Liberzon et al. 2011). We can use the msigdbr package to access these gene sets in a format compatible with the package we’ll use for analysis, clusterProfiler (Yu et al. 2012; Dolgalev 2020).

The gene sets available directly from MSigDB are applicable to human studies. msigdbr also supports commonly studied model organisms.

Let’s take a look at what organisms the package supports with msigdbr_species().

msigdbr_species()

The data we’re interested in here comes from zebrafish samples, so we can obtain only the gene sets relevant to D. rerio with the species argument to msigdbr().

dr_msigdb_df <- msigdbr(species = "Danio rerio")

MSigDB contains 8 different gene set collections (Subramanian et al. 2005; Liberzon et al. 2011) that are distinguished by how they are derived (e.g., computationally mined, curated). In this example, we will use pathways that are gene sets considered to be “canonical representations of a biological process compiled by domain experts” and are a subset of C2: curated gene sets (Subramanian et al. 2005; Liberzon et al. 2011).

Specifically, we will use the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways (Kanehisa and Goto 2000).

First, let’s take a look at what information is included in the data frame returned by msigdbr().

head(dr_msigdb_df)

We will need to use gs_cat and gs_subcat columns to construct a filter step that will only keep curated gene sets and KEGG pathways.

# Filter the zebrafish data frame to the KEGG pathways that are included in the
# curated gene sets
dr_kegg_df <- dr_msigdb_df %>%
  dplyr::filter(
    gs_cat == "C2", # This is to filter only to the C2 curated gene sets
    gs_subcat == "CP:KEGG" # This is because we only want KEGG pathways
  )

The clusterProfiler() function we will use requires a data frame with two columns, where one column contains the term identifier or name and one column contains gene identifiers that match our gene lists we want to check for enrichment.

Our data frame with KEGG terms contains Entrez IDs and gene symbols.

In our differential expression results data frame, dge_df we have Ensembl gene identifiers. So we will need to convert our Ensembl IDs into either gene symbols or Entrez IDs.

4.5 Gene identifier conversion

We’re going to convert our identifiers in dge_df to gene symbols because they are a bit more human readable, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!

The annotation package org.Dr.eg.db contains information for different identifiers (Carlson 2019). org.Dr.eg.db is specific to Danio rerio – this is what the Dr in the package name is referencing.

Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: the microarray example and the RNA-seq example.

We can see what types of IDs are available to us in an annotation package with keytypes().

keytypes(org.Dr.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "ONTOLOGY"     "ONTOLOGYALL"  "PATH"        
## [17] "PFAM"         "PMID"         "PROSITE"      "REFSEQ"      
## [21] "SYMBOL"       "UNIGENE"      "UNIPROT"      "ZFIN"

Even though we’ll use this package to convert from Ensembl gene IDs (ENSEMBL) to gene symbols (SYMBOL), we could just as easily use it to convert from an Ensembl transcript ID (ENSEMBLTRANS) to Entrez IDs (ENTREZID).

The function we will use to map from Ensembl gene IDs to gene symbols is called mapIds() and comes from the AnnotationDbi package.

# This returns a named vector which we can convert to a data frame, where
# the keys (Ensembl IDs) are the names
symbols_vector <- mapIds(
  # Replace with annotation package for the organism relevant to your data
  org.Dr.eg.db,
  # The vector of gene identifiers we want to map
  keys = dge_df$Gene,
  # Replace with the type of gene identifiers in your data
  keytype = "ENSEMBL",
  # Replace with the type of gene identifiers you would like to map to
  column = "SYMBOL",
  # In the case of 1:many mappings, return the
  # first one. This is default behavior!
  multiVals = "first"
)

## 'select()' returned 1:many mapping between keys and columns

This message is letting us know that sometimes Ensembl gene identifiers will map to multiple gene symbols. In this case, it’s also possible that a gene symbol will map to multiple Ensembl IDs. For more about how to explore this, take a look at our microarray gene ID conversion example.

Let’s create a two column data frame that shows the gene symbols and their Ensembl IDs side-by-side.

# We would like a data frame we can join to the differential expression stats
gene_key_df <- data.frame(
  ensembl_id = names(symbols_vector),
  gene_symbol = symbols_vector,
  stringsAsFactors = FALSE
) %>%
  # If an Ensembl gene identifier doesn't map to a gene symbol, drop that
  # from the data frame
  dplyr::filter(!is.na(gene_symbol))

Let’s see a preview of gene_key_df.

head(gene_key_df)

Now we are ready to add the gene_key_df to our data frame with the differential expression stats, dge_df. Here we’re using a dplyr::left_join() because we only want to retain the genes that have gene symbols and this will filter out anything in our dge_df that does not have gene symbols when we join using the Ensembl gene identifiers.

dge_annot_df <- gene_key_df %>%
  # Using a left join removes the rows without gene symbols because those rows
  # have already been removed in `gene_symbols_df`
  dplyr::left_join(dge_df,
    # The name of the column that contains the Ensembl gene IDs
    # in the left data frame and right data frame
    by = c("ensembl_id" = "Gene")
  )

Let’s take a look at what this data frame looks like.

# Print out a preview
head(dge_annot_df)

4.6 Over-representation Analysis (ORA)

Over-representation testing using clusterProfiler is based on a hypergeometric test (often referred to as Fisher’s exact test) (Yu 2020). For more background on hypergeometric tests, this handy tutorial explains more about how hypergeometric tests work (Puthier and van Helden 2015).

We will need to provide to clusterProfiler two genes lists:

Our genes of interest
What genes were in our total background set. (All genes that originally had an opportunity to be measured).

4.7 Determine our genes of interest list

We will use our differential expression results to get a genes of interest list. Let’s use our adjusted p values as a cutoff.

This step is highly variable depending on what your gene list is, what information it contains and what your goals are. You may want to delete this next chunk entirely if you supply an already determined list of genes OR you may need to adjust the cutoffs and column names.

# Select genes that are below a cutoff
genes_of_interest <- dge_annot_df %>%
  # Here we want the top differentially expressed genes and we will use
  # downregulated genes
  dplyr::filter(adj.P.Val < 0.05, logFC < -1) %>%
  # We are extracting the gene symbols as a vector
  dplyr::pull(gene_symbol)

There are a lot of ways we could make a genes of interest list, and using a p-value cutoff for differential expression analysis is just one way you can do this.

ORA generally requires you make some sort of arbitrary decision to obtain your genes of interest list and this is one of the approach’s weaknesses – to get to a gene list we’ve removed all other context.

Because one gene_symbol may map to multiple Ensembl IDs, we need to make sure we have no repeated gene symbols in this list.

# Reduce to only unique gene symbols
genes_of_interest <- unique(as.character(genes_of_interest))

# Let's print out some of these genes
head(genes_of_interest)

## [1] "si:ch1073-67j19.1" "ypel3"             "pdia4"            
## [4] "cst14a.2"          "viml"              "spink2.1"

4.8 Determine our background set gene list

Sometimes folks consider genes from the entire genome to comprise the background, but for our microarray data, we should consider all genes that were measured as our background set. In other words, if we are unable to detect a gene, it should not be in our background set.

We can obtain our detected genes list from our data frame, dge_annot_df (which we haven’t done filtering on).

background_set <- unique(as.character(dge_annot_df$gene_symbol))

4.9 Run ORA using the `enricher()` function

Now that we have our background set, our genes of interest, and our pathway information, we’re ready to run ORA using the enricher() function.

kegg_ora_results <- enricher(
  gene = genes_of_interest, # A vector of your genes of interest
  pvalueCutoff = 0.1, # Can choose a FDR cutoff
  pAdjustMethod = "BH", # Method to be used for multiple testing correction
  universe = background_set, # A vector containing your background set genes
  # The pathway information should be a data frame with a term name or
  # identifier and the gene identifiers
  TERM2GENE = dplyr::select(
    dr_kegg_df,
    gs_name,
    gene_symbol
  )
)

Note: using enrichKEGG() is a shortcut for doing ORA using KEGG, but the approach we covered here can be used with any gene sets you’d like!

The information we’re most likely interested in is in the results slot. Let’s convert this into a data frame that we can write to file.

kegg_result_df <- data.frame(kegg_ora_results@result)

Let’s print out a sneak peek of it here and take a look at how many sets do we have that fit our cutoff of 0.1 FDR?

kegg_result_df %>%
  dplyr::filter(p.adjust < 0.1)

Looks like there are four KEGG sets returned as significant at FDR of 0.1.

4.10 Visualizing results

We can use a dot plot to visualize our significant enrichment results. The enrichplot::dotplot() function will only plot gene sets that are significant according to the multiple testing corrected p values (in the p.adjust column) and the pvalueCutoff you provided in the enricher() step.

enrich_plot <- enrichplot::dotplot(kegg_ora_results)

## wrong orderBy parameter; set to default `orderBy = "x"`

# Print out the plot here
enrich_plot

Use ?enrichplot::dotplot to see the help page for more about how to use this function.

This plot is arguably more useful when we have a large number of significant pathways.

Let’s save it to a PNG.

ggplot2::ggsave(file.path(plots_dir, "GSE71270_ora_enrich_plot.png"),
  plot = enrich_plot
)

## Saving 7 x 5 in image

We can use an UpSet plot to visualize the overlap between the gene sets that were returned as significant.

upset_plot <- enrichplot::upsetplot(kegg_ora_results)

# Print out the plot here
upset_plot

See that KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION and KEGG_LYSOSOME have all their genes in common. Gene sets or pathways aren’t independent!

Let’s also save this to a PNG.

ggplot2::ggsave(file.path(plots_dir, "GSE71270_ora_upset_plot.png"),
  plot = upset_plot
)

## Saving 7 x 5 in image

4.11 Write results to file

readr::write_tsv(
  kegg_result_df,
  file.path(
    results_dir,
    "GSE71270_pathway_analysis_results.tsv"
  )
)

5 Resources for further learning

Hypergeometric test exercises(Puthier and van Helden 2015).
clusterProfiler ORA tutorial
clusterProfiler paper (Yu et al. 2012).
clusterProfiler book (Yu 2020).
This handy review which summarizes the different types of pathway analysis and their limitations (Khatri et al. 2012).

6 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

# Print session info
sessioninfo::session_info()

## ─ Session info ─────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.5 (2021-03-31)
##  os       Ubuntu 20.04.3 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Etc/UTC                     
##  date     2022-03-01                  
## 
## ─ Packages ─────────────────────────────────────────────────────────
##  package         * version  date       lib source        
##  AnnotationDbi   * 1.52.0   2020-10-27 [1] Bioconductor  
##  assertthat        0.2.1    2019-03-21 [1] RSPM (R 4.0.3)
##  babelgene         21.4     2021-04-26 [1] RSPM (R 4.0.4)
##  backports         1.2.1    2020-12-09 [1] RSPM (R 4.0.3)
##  Biobase         * 2.50.0   2020-10-27 [1] Bioconductor  
##  BiocGenerics    * 0.36.1   2021-04-16 [1] Bioconductor  
##  BiocManager       1.30.15  2021-05-11 [1] RSPM (R 4.0.4)
##  BiocParallel      1.24.1   2020-11-06 [1] Bioconductor  
##  bit               4.0.4    2020-08-04 [1] RSPM (R 4.0.3)
##  bit64             4.0.5    2020-08-30 [1] RSPM (R 4.0.3)
##  blob              1.2.1    2020-01-20 [1] RSPM (R 4.0.3)
##  bslib             0.2.5    2021-05-12 [1] RSPM (R 4.0.4)
##  cachem            1.0.5    2021-05-15 [1] RSPM (R 4.0.4)
##  cli               2.5.0    2021-04-26 [1] RSPM (R 4.0.4)
##  clusterProfiler * 3.18.1   2021-02-09 [1] Bioconductor  
##  colorspace        2.0-1    2021-05-04 [1] RSPM (R 4.0.4)
##  cowplot           1.1.1    2020-12-30 [1] RSPM (R 4.0.3)
##  crayon            1.4.1    2021-02-08 [1] RSPM (R 4.0.3)
##  data.table        1.14.0   2021-02-21 [1] RSPM (R 4.0.3)
##  DBI               1.1.1    2021-01-15 [1] RSPM (R 4.0.3)
##  digest            0.6.27   2020-10-24 [1] RSPM (R 4.0.3)
##  DO.db             2.9      2022-03-01 [1] Bioconductor  
##  DOSE              3.16.0   2020-10-27 [1] Bioconductor  
##  downloader        0.4      2015-07-09 [1] RSPM (R 4.0.3)
##  dplyr             1.0.6    2021-05-05 [1] RSPM (R 4.0.4)
##  ellipsis          0.3.2    2021-04-29 [1] RSPM (R 4.0.4)
##  enrichplot        1.10.2   2021-01-28 [1] Bioconductor  
##  evaluate          0.14     2019-05-28 [1] RSPM (R 4.0.3)
##  fansi             0.4.2    2021-01-15 [1] RSPM (R 4.0.3)
##  farver            2.1.0    2021-02-28 [1] RSPM (R 4.0.3)
##  fastmap           1.1.0    2021-01-25 [1] RSPM (R 4.0.3)
##  fastmatch         1.1-0    2017-01-28 [1] RSPM (R 4.0.3)
##  fgsea             1.16.0   2020-10-27 [1] Bioconductor  
##  generics          0.1.0    2020-10-31 [1] RSPM (R 4.0.3)
##  getopt            1.20.3   2019-03-22 [1] RSPM (R 4.0.0)
##  ggforce           0.3.3    2021-03-05 [1] RSPM (R 4.0.3)
##  ggplot2           3.3.3    2020-12-30 [1] RSPM (R 4.0.3)
##  ggraph            2.0.5    2021-02-23 [1] RSPM (R 4.0.3)
##  ggrepel           0.9.1    2021-01-15 [1] RSPM (R 4.0.3)
##  ggupset           0.3.0    2020-05-05 [1] RSPM (R 4.0.0)
##  glue              1.4.2    2020-08-27 [1] RSPM (R 4.0.3)
##  GO.db             3.12.1   2022-03-01 [1] Bioconductor  
##  GOSemSim          2.16.1   2020-10-29 [1] Bioconductor  
##  graphlayouts      0.7.1    2020-10-26 [1] RSPM (R 4.0.4)
##  gridExtra         2.3      2017-09-09 [1] RSPM (R 4.0.3)
##  gtable            0.3.0    2019-03-25 [1] RSPM (R 4.0.3)
##  highr             0.9      2021-04-16 [1] RSPM (R 4.0.4)
##  hms               1.0.0    2021-01-13 [1] RSPM (R 4.0.3)
##  htmltools         0.5.1.1  2021-01-22 [1] RSPM (R 4.0.3)
##  igraph            1.2.6    2020-10-06 [1] RSPM (R 4.0.3)
##  IRanges         * 2.24.1   2020-12-12 [1] Bioconductor  
##  jquerylib         0.1.4    2021-04-26 [1] RSPM (R 4.0.4)
##  jsonlite          1.7.2    2020-12-09 [1] RSPM (R 4.0.3)
##  knitr             1.33     2021-04-24 [1] RSPM (R 4.0.4)
##  labeling          0.4.2    2020-10-20 [1] RSPM (R 4.0.3)
##  lattice           0.20-41  2020-04-02 [2] CRAN (R 4.0.5)
##  lifecycle         1.0.0    2021-02-15 [1] RSPM (R 4.0.3)
##  magrittr        * 2.0.1    2020-11-17 [1] RSPM (R 4.0.3)
##  MASS              7.3-53.1 2021-02-12 [2] CRAN (R 4.0.5)
##  Matrix            1.3-2    2021-01-06 [2] CRAN (R 4.0.5)
##  memoise           2.0.0    2021-01-26 [1] RSPM (R 4.0.3)
##  msigdbr         * 7.4.1    2021-05-05 [1] RSPM (R 4.0.4)
##  munsell           0.5.0    2018-06-12 [1] RSPM (R 4.0.3)
##  optparse        * 1.6.6    2020-04-16 [1] RSPM (R 4.0.0)
##  org.Dr.eg.db    * 3.12.0   2022-03-01 [1] Bioconductor  
##  pillar            1.6.1    2021-05-16 [1] RSPM (R 4.0.4)
##  pkgconfig         2.0.3    2019-09-22 [1] RSPM (R 4.0.3)
##  plyr              1.8.6    2020-03-03 [1] RSPM (R 4.0.3)
##  polyclip          1.10-0   2019-03-14 [1] RSPM (R 4.0.3)
##  ps                1.6.0    2021-02-28 [1] RSPM (R 4.0.3)
##  purrr             0.3.4    2020-04-17 [1] RSPM (R 4.0.3)
##  qvalue            2.22.0   2020-10-27 [1] Bioconductor  
##  R.cache           0.15.0   2021-04-30 [1] RSPM (R 4.0.4)
##  R.methodsS3       1.8.1    2020-08-26 [1] RSPM (R 4.0.3)
##  R.oo              1.24.0   2020-08-26 [1] RSPM (R 4.0.3)
##  R.utils           2.10.1   2020-08-26 [1] RSPM (R 4.0.3)
##  R6                2.5.0    2020-10-28 [1] RSPM (R 4.0.3)
##  RColorBrewer      1.1-2    2014-12-07 [1] RSPM (R 4.0.3)
##  Rcpp              1.0.6    2021-01-15 [1] RSPM (R 4.0.3)
##  readr             1.4.0    2020-10-05 [1] RSPM (R 4.0.4)
##  rematch2          2.1.2    2020-05-01 [1] RSPM (R 4.0.3)
##  reshape2          1.4.4    2020-04-09 [1] RSPM (R 4.0.3)
##  rlang             0.4.11   2021-04-30 [1] RSPM (R 4.0.4)
##  rmarkdown         2.8      2021-05-07 [1] RSPM (R 4.0.4)
##  RSQLite           2.2.7    2021-04-22 [1] RSPM (R 4.0.4)
##  rstudioapi        0.13     2020-11-12 [1] RSPM (R 4.0.3)
##  rvcheck           0.1.8    2020-03-01 [1] RSPM (R 4.0.0)
##  S4Vectors       * 0.28.1   2020-12-09 [1] Bioconductor  
##  sass              0.4.0    2021-05-12 [1] RSPM (R 4.0.4)
##  scales            1.1.1    2020-05-11 [1] RSPM (R 4.0.3)
##  scatterpie        0.1.6    2021-04-23 [1] RSPM (R 4.0.4)
##  sessioninfo       1.1.1    2018-11-05 [1] RSPM (R 4.0.3)
##  shadowtext        0.0.8    2021-04-23 [1] RSPM (R 4.0.4)
##  stringi           1.6.1    2021-05-10 [1] RSPM (R 4.0.4)
##  stringr           1.4.0    2019-02-10 [1] RSPM (R 4.0.3)
##  styler            1.4.1    2021-03-30 [1] RSPM (R 4.0.4)
##  tibble            3.1.2    2021-05-16 [1] RSPM (R 4.0.4)
##  tidygraph         1.2.0    2020-05-12 [1] RSPM (R 4.0.3)
##  tidyr             1.1.3    2021-03-03 [1] RSPM (R 4.0.4)
##  tidyselect        1.1.1    2021-04-30 [1] RSPM (R 4.0.4)
##  tweenr            1.0.2    2021-03-23 [1] RSPM (R 4.0.4)
##  utf8              1.2.1    2021-03-12 [1] RSPM (R 4.0.3)
##  vctrs             0.3.8    2021-04-29 [1] RSPM (R 4.0.4)
##  viridis           0.6.1    2021-05-11 [1] RSPM (R 4.0.4)
##  viridisLite       0.4.0    2021-04-13 [1] RSPM (R 4.0.4)
##  withr             2.4.2    2021-04-18 [1] RSPM (R 4.0.4)
##  xfun              0.23     2021-05-15 [1] RSPM (R 4.0.4)
##  yaml              2.2.1    2020-02-01 [1] RSPM (R 4.0.3)
## 
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library

References

Ahlmann-Eltze C., 2020 ggupset: Combination matrix axis for ’ggplot2’ to create ’UpSet’ plots. https://github.com/const-ae/ggupset

Carlson M., 2019 Genome wide annotation for zebrafish. https://bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html

Dolgalev I., 2020 msigdbr: MSigDB gene sets for multiple organisms in a tidy data format. https://cran.r-project.org/web/packages/msigdbr/index.html

Kanehisa M., and S. Goto, 2000 KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research 28: 27–30. https://doi.org/10.1093/nar/28.1.27

Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: Current approaches and outstanding challenges. PLOS Computational Biology 8: e1002375. https://doi.org/10.1371/journal.pcbi.1002375

Liberzon A., A. Subramanian, R. Pinchback, H. Thorvaldsdóttir, P. Tamayo, et al., 2011 Molecular signatures database (MSigDB) 3.0. Bioinformatics 27: 1739–1740. https://doi.org/10.1093/bioinformatics/btr260

Puthier D., and J. van Helden, 2015 Statistics for Bioinformatics - Practicals - Gene enrichment statistics. https://dputhier.github.io/ASG/practicals/go_statistics_td/go_statistics_td_2015.html

Ritchie M. E., B. Phipson, D. Wu, Y. Hu, C. W. Law, et al., 2015 limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43: e47. https://doi.org/10.1093/nar/gkv007

Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, et al., 2005 Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102: 15545–15550. https://doi.org/10.1073/pnas.0506580102

Tregnago C., E. Manara, M. Zampini, V. Bisio, C. Borga, et al., 2016 CREB engages C/EBPδ to initiate leukemogenesis. Leukemia 30: 1887–1896. https://doi.org/10.1038/leu.2016.98

Yaari G., C. R. Bolen, J. Thakar, and S. H. Kleinstein, 2013 Quantitative set analysis for gene expression: A method to quantify gene set differential expression including gene-gene correlations. Nucleic Acids Research 41: e170. https://doi.org/10.1093/nar/gkt660

Yu G., L.-G. Wang, Y. Han, and Q.-Y. He, 2012 clusterProfiler: An R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16: 284–287. https://doi.org/10.1089/omi.2011.0118

Yu G., 2020 clusterProfiler: Universal enrichment tool for functional and comparative study. http://yulab-smu.top/clusterProfiler-book/index.html

Over-representation analysis - Microarray

CCDL for ALSF

November 2020

1 Purpose of this analysis

1.0.1 What is pathway analysis?

1.0.2 How to choose a pathway analysis?

2 How to run this example

2.1 Obtain the `.Rmd` file

2.2 Set up your analysis folders

2.3 Obtain the gene set for this example

2.4 About the dataset we are using for this example

2.5 Check out our file structure!

3 Using a different refine.bio dataset with this analysis?

4 Over-Representation Analysis with `clusterProfiler` - Microarray

4.1 Install libraries

4.2 Download data file

4.3 Import data

4.4 Getting familiar with MSigDB gene sets available via `msigdbr`

4.5 Gene identifier conversion

4.6 Over-representation Analysis (ORA)

4.7 Determine our genes of interest list

4.8 Determine our background set gene list

4.9 Run ORA using the `enricher()` function

4.10 Visualizing results

4.11 Write results to file

5 Resources for further learning

6 Session info

References

Over-representation analysis - Microarray

CCDL for ALSF

November 2020

1 Purpose of this analysis

1.0.1 What is pathway analysis?

1.0.2 How to choose a pathway analysis?

2 How to run this example

2.1 Obtain the .Rmd file

2.2 Set up your analysis folders

2.3 Obtain the gene set for this example

2.4 About the dataset we are using for this example

2.5 Check out our file structure!

3 Using a different refine.bio dataset with this analysis?

4 Over-Representation Analysis with clusterProfiler - Microarray

4.1 Install libraries

4.2 Download data file

4.3 Import data

4.4 Getting familiar with MSigDB gene sets available via msigdbr

4.5 Gene identifier conversion

4.6 Over-representation Analysis (ORA)

4.7 Determine our genes of interest list

4.8 Determine our background set gene list

4.9 Run ORA using the enricher() function

4.10 Visualizing results

4.11 Write results to file

5 Resources for further learning

6 Session info

References

2.1 Obtain the `.Rmd` file

4 Over-Representation Analysis with `clusterProfiler` - Microarray

4.4 Getting familiar with MSigDB gene sets available via `msigdbr`

4.9 Run ORA using the `enricher()` function