1 Purpose of this analysis

This example is one of a set of pathway analysis modules. We recommend looking at the pathway analysis table below to help you determine which pathway analysis method is best suited for your purposes.

This particular example analysis shows how you can use over-representation analysis (ORA) to determine if a set of genes shares more or fewer genes with gene sets/pathways than we would expect by chance.

ORA is a broadly applicable technique that may be good in scenarios where your dataset or scientific questions don’t fit the requirements of other pathway analysis methods. It also does not require any particular sample size, since the only input from your dataset is a set of genes of interest (Yaari et al. 2013).

If you have differential expression results or something with a gene-level ranking and a two-group comparison, we recommend considering GSEA for your pathway analysis questions.

1.0.1 What is pathway analysis?

Pathway analysis refers to any one of many techniques that use predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. In the context of refine.bio, we use these techniques to analyze and interpret genome-wide gene expression experiments. The rationale for performing pathway analysis is that looking at the pathway level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome, and these small changes may go undetected in differential gene expression analysis.

We highly recommend taking a look at Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges from Khatri et al. (2012) for a more comprehensive overview. We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the Resources for further learning section.

1.0.2 How to choose a pathway analysis?

This table summarizes the pathway analysis examples in this module.

| Analysis | What is required for input | What output looks like | ✅ Pros | ⚠️ Cons |
|---|---|---|---|---|
| ORA (Over-representation Analysis) | A list of gene IDs (no stats needed) | A per-pathway hypergeometric test result | Simple; computationally inexpensive to calculate p-values | Requires arbitrary thresholds and ignores any statistics associated with a gene; assumes independence of genes and pathways |
| GSEA (Gene Set Enrichment Analysis) | A list of gene IDs with gene-level summary statistics | A per-pathway enrichment score | Includes all genes (no arbitrary threshold!); attempts to measure coordination of genes | Permutations can be expensive; does not account for pathway overlap; two-group comparisons not always appropriate/feasible |
| GSVA (Gene Set Variation Analysis) | A gene expression matrix (like what you get from refine.bio directly) | Pathway-level scores on a per-sample basis | Does not require two groups to compare upfront; normally distributed scores | Scores are not a good fit for gene sets that contain genes that go up AND down; method doesn’t assign statistical significance itself; recommended sample size n > 10 |

2 How to run this example

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

2.1 Obtain the `.Rmd` file

To run this example yourself, download the .Rmd for this analysis by clicking this link.

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
  dir.create("data")
}

# Define the file path to the plots directory
plots_dir <- "plots"

# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
  dir.create(plots_dir)
}

# Define the file path to the results directory
results_dir <- "results"

# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
  dir.create(results_dir)
}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!
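The three nearly identical blocks above can also be written as a loop. The following is only a sketch of that idea, not the notebook's own code: it works inside a temporary directory (an assumption made here so it does not touch your real project folder), and `base_dir` and `dir_names` are illustrative names.

```r
# Work in a throwaway location so this sketch has no side effects on a project
base_dir <- file.path(tempdir(), "ora_example")

# The three folder names used in this notebook
dir_names <- c("data", "plots", "results")

for (dir_name in dir_names) {
  dir_path <- file.path(base_dir, dir_name)
  # recursive = TRUE also creates `base_dir` itself if it is missing
  if (!dir.exists(dir_path)) {
    dir.create(dir_path, recursive = TRUE)
  }
}

# Confirm that all three folders now exist
all_created <- all(dir.exists(file.path(base_dir, dir_names)))
all_created
```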

2.3 About the dataset we are using for this example

For this example analysis, we will use this acute viral bronchiolitis dataset. The data that we downloaded from refine.bio for this analysis include 62 paired peripheral blood mononuclear cell RNA-seq samples obtained from 31 patients. Samples were collected at two time points: during the patients’ first, acute bronchiolitis visit (abbreviated “AV”) and during their recovery, at a post-convalescence visit (abbreviated “CV”).

We used this dataset to identify modules of co-expressed genes in an example analysis using WGCNA (Langfelder and Horvath 2008).

We have provided the resulting gene-to-module results file for you, and the code in this notebook will read in those results, which are stored online. If you’d like to follow the steps for creating this results file from the refine.bio data, we suggest going through that WGCNA example.

Module 19 was the most differentially expressed between the dataset’s two time points (during illness and recovering from illness).

The heatmap below summarizes the expression of the genes that make up module 19.

Each row is a gene that is a member of module 19, and the composite expression of these genes, called an eigengene, is shown in the barplot below. This plot demonstrates how these genes, together as a module, are differentially expressed between the two time points.

2.4 Check out our file structure!

Your new analysis folder should contain:

The example analysis .Rmd you downloaded
A folder called data (currently empty)
A folder for plots (currently empty)
A folder for results (currently empty)

Your example analysis folder should contain your .Rmd and three empty folders (which won’t be empty for long!).

If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.

3 Using a different refine.bio dataset with this analysis?

The file we use here has two columns from our WGCNA module results: the ID of each gene and the module it is part of. If you’d like to adapt this example analysis to use a different dataset from refine.bio, we recommend replacing gene_module_url with a file path so you can read in a similar table of genes with the information that you are interested in. If your gene table differs, many steps will need to be changed or deleted entirely depending on the contents of that file (particularly in the Determine our genes of interest list section).
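As a rough illustration of what a compatible table looks like, the sketch below writes a made-up two-column TSV to a temporary file and reads it back with base R. The gene and module values are placeholders rather than real identifiers, and `toy_file` and `my_gene_df` are hypothetical names; with your own data you would point `read.delim()` (or `readr::read_tsv()`) at your file instead.

```r
# Made-up placeholder table with the same two columns as the WGCNA results file
toy_gene_module_df <- data.frame(
  gene = c("gene_a", "gene_b", "gene_c", "gene_d"),
  module = c("ME1", "ME1", "ME2", "ME2"),
  stringsAsFactors = FALSE
)

# Write it to a temporary TSV file that stands in for your own file
toy_file <- tempfile(fileext = ".tsv")
write.table(
  toy_gene_module_df,
  toy_file,
  sep = "\t",
  quote = FALSE,
  row.names = FALSE
)

# Read the table back in and check for the columns that later steps rely on
my_gene_df <- read.delim(toy_file, stringsAsFactors = FALSE)
has_expected_columns <- all(c("gene", "module") %in% colnames(my_gene_df))
has_expected_columns
```

If the check returns FALSE for your own file, the later filtering steps will likely need their column names adjusted.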

We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.

4 Over-Representation Analysis with `clusterProfiler` - RNA-seq

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

In this analysis, we will be using the clusterProfiler package to perform ORA and the msigdbr package, which contains gene sets from the Molecular Signatures Database (MSigDB) already in the tidy format required by clusterProfiler (Subramanian et al. 2005; Liberzon et al. 2011; Yu et al. 2012; Dolgalev 2020).

We will also need the org.Hs.eg.db package (Carlson 2020) to perform gene identifier conversion and ggupset to make an UpSet plot (Ahlmann-Eltze 2020).

if (!("clusterProfiler" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("clusterProfiler", update = FALSE)
}

# This is required to make one of the plots that clusterProfiler will make
if (!("ggupset" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("ggupset", update = FALSE)
}

if (!("msigdbr" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("msigdbr", update = FALSE)
}

if (!("org.Hs.eg.db" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("org.Hs.eg.db", update = FALSE)
}

Attach the packages we need for this analysis.

# Attach the library
library(clusterProfiler)

# Package that contains MSigDB gene sets in tidy format
library(msigdbr)

# Homo sapiens annotation package we'll use for gene identifier conversion
library(org.Hs.eg.db)

# We will need this so we can use the pipe: %>%
library(magrittr)

4.2 Download data file

For ORA, we only need a list of genes of interest and a background gene set as our input, so this example can work for any situation where you have a gene list and want to know more about which biological pathways are significantly over-represented.

For this example, we will read in results from a co-expression network analysis that we have already performed. Rather than reading from a local file, we will download the results table directly from a URL. These results are from an acute bronchiolitis experiment we used for an example WGCNA analysis (Langfelder and Horvath 2008).

The table contains two columns, one with Ensembl gene IDs, and the other with the name of the module they are a part of. We will perform ORA on one of the modules we identified in the WGCNA analysis but the rest of the genes will be used as “background genes.”

Instead of using the URL below, you can use a file path to a TSV file with your desired gene list. First we will assign the URL to its own variable called gene_module_url.

# Define the url to your gene list file
gene_module_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/04-advanced-topics/results/SRP140558_wgcna_gene_to_module.tsv"

We will also declare a file path to where we want this file to be downloaded to and we can use the same file path later for reading the file into R.

gene_module_file <- file.path(
  results_dir,
  "SRP140558_wgcna_gene_to_module.tsv"
)

Using the URL (gene_module_url) and file path (gene_module_file) we can download the file and use the destfile argument to specify where it should be saved.

download.file(
  gene_module_url,
  # The file will be saved to this location and with this name
  destfile = gene_module_file
)

Now let’s double check that the results file is in the right place.

# Check if the file exists
file.exists(gene_module_file)

## [1] TRUE
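If you re-run the notebook often, you may prefer not to download the file again each time. One possible approach is sketched below; `download_if_missing()` is a hypothetical helper written for this illustration (it is not part of any package), and the URL in the demonstration is a placeholder that is never actually contacted because the destination file already exists.

```r
# Hypothetical helper: download only when the destination file is absent
download_if_missing <- function(url, destfile) {
  if (file.exists(destfile)) {
    message("File already exists, skipping download: ", destfile)
    return(invisible(FALSE))
  }
  download.file(url, destfile = destfile)
  invisible(TRUE)
}

# Demonstration without network access: the destination already exists,
# so no download is attempted and the helper returns FALSE
existing_file <- tempfile(fileext = ".tsv")
file.create(existing_file)
was_downloaded <- download_if_missing(
  "https://example.org/placeholder.tsv", # Placeholder URL, never requested here
  existing_file
)
was_downloaded
```

In the notebook itself, the equivalent call would use gene_module_url and gene_module_file as the two arguments.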

4.3 Import data

Read in the file that has WGCNA gene and module results.

# Read in the contents of the WGCNA gene modules file
gene_module_df <- readr::read_tsv(gene_module_file)

## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   gene = col_character(),
##   module = col_character()
## )

Note that read_tsv() can also read TSV files directly from a URL and doesn’t necessarily require you to download the file first. If you wanted to use that feature, you could replace the call above with readr::read_tsv(gene_module_url) and skip the download steps.

Let’s take a look at this gene module table.

gene_module_df
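Beyond printing the table, it can help to tally how many genes fall into each module before picking one to analyze. The sketch below does this on a small made-up table (the gene names and module labels are placeholders); the same `table()` call should work on the module column of the real data frame.

```r
# Made-up stand-in for the gene-to-module table
toy_gene_module_df <- data.frame(
  gene = paste0("gene_", 1:6),
  module = c("ME1", "ME1", "ME1", "ME2", "ME2", "ME3"),
  stringsAsFactors = FALSE
)

# Count the genes assigned to each module
module_sizes <- table(toy_gene_module_df$module)

# Sort so the largest module is listed first
module_sizes <- sort(module_sizes, decreasing = TRUE)
module_sizes
```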

4.4 Getting familiar with MSigDB gene sets available via `msigdbr`

The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses (Subramanian et al. 2005; Liberzon et al. 2011). We can use the msigdbr package to access these gene sets in a format compatible with the package we’ll use for analysis, clusterProfiler (Yu et al. 2012; Dolgalev 2020).

The gene sets available directly from MSigDB are applicable to human studies. msigdbr also supports commonly studied model organisms.

Let’s take a look at what organisms the package supports with msigdbr_species().

msigdbr_species()

The data we’re interested in here comes from human samples, so we can obtain only the gene sets relevant to H. sapiens with the species argument to msigdbr().

hs_msigdb_df <- msigdbr(species = "Homo sapiens")

MSigDB contains 8 different gene set collections (Subramanian et al. 2005; Liberzon et al. 2011) that are distinguished by how they are derived (e.g., computationally mined, curated). In this example, we will use pathways that are gene sets considered to be “canonical representations of a biological process compiled by domain experts” and are a subset of C2: curated gene sets (Subramanian et al. 2005; Liberzon et al. 2011).

Specifically, we will use the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways (Kanehisa and Goto 2000).

First, let’s take a look at what information is included in this data frame.

head(hs_msigdb_df)

We will need to use gs_cat and gs_subcat columns to construct a filter step that will only keep curated gene sets and KEGG pathways.

# Filter the human data frame to the KEGG pathways that are included in the
# curated gene sets
hs_kegg_df <- hs_msigdb_df %>%
  dplyr::filter(
    gs_cat == "C2", # This is to filter only to the C2 curated gene sets
    gs_subcat == "CP:KEGG" # This is because we only want KEGG pathways
  )

The clusterProfiler function we will use, enricher(), requires a data frame with two columns, where one column contains the term identifier or name and one column contains gene identifiers that match the gene lists we want to check for enrichment.

Our data frame with KEGG terms contains Entrez IDs and gene symbols.

In our WGCNA gene module data frame, gene_module_df, we have Ensembl gene identifiers, so we will need to convert our Ensembl IDs into either gene symbols or Entrez IDs.
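To make the identifier mismatch concrete, here is a small sketch with made-up values. It builds a two-column table in the term/gene shape described above and checks what fraction of a gene list can be found in it. All IDs are placeholders, and `fraction_matched` is an illustrative name; a low fraction on real data would suggest the gene list and the gene sets use different identifier types.

```r
# Made-up two-column table in the term/gene shape described above
toy_term2gene <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_B", "SET_B"),
  human_entrez_gene = c("101", "102", "102", "103"),
  stringsAsFactors = FALSE
)

# Made-up gene list in which one entry is still in a different ID system
toy_genes <- c("101", "103", "ENSG_PLACEHOLDER")

# Fraction of our genes that appear anywhere in the gene sets
fraction_matched <- mean(toy_genes %in% toy_term2gene$human_entrez_gene)
fraction_matched
```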

4.5 Gene identifier conversion

We’re going to convert our identifiers in gene_module_df to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!

The annotation package org.Hs.eg.db contains information for different identifiers (Carlson 2020). org.Hs.eg.db is specific to Homo sapiens – this is what the Hs in the package name is referencing.

Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: the microarray example and the RNA-seq example.

We can see what types of IDs are available to us in an annotation package with keytypes().

keytypes(org.Hs.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
## [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [25] "UNIGENE"      "UNIPROT"

Even though we’ll use this package to convert from Ensembl gene IDs (ENSEMBL) to Entrez IDs (ENTREZID), we could just as easily use it to convert from an Ensembl transcript ID (ENSEMBLTRANS) to gene symbols (SYMBOL).

The function we will use to map from Ensembl gene IDs to Entrez IDs is called mapIds() and comes from the AnnotationDbi package.

# This returns a named vector which we can convert to a data frame, where
# the keys (Ensembl IDs) are the names
entrez_vector <- mapIds(
  # Replace with annotation package for the organism relevant to your data
  org.Hs.eg.db,
  # The vector of gene identifiers we want to map
  keys = gene_module_df$gene,
  # Replace with the type of gene identifiers in your data
  keytype = "ENSEMBL",
  # Replace with the type of gene identifiers you would like to map to
  column = "ENTREZID",
  # In the case of 1:many mappings, return the
  # first one. This is default behavior!
  multiVals = "first"
)

## 'select()' returned 1:many mapping between keys and columns

This message is letting us know that sometimes Ensembl gene identifiers will map to multiple Entrez IDs. In this case, it’s also possible that an Entrez ID will map to multiple Ensembl IDs. For more about how to explore this, take a look at our RNA-seq gene ID conversion example.
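One simple way to see whether several Ensembl IDs share an Entrez ID is to look for repeated values after mapping. The sketch below uses a made-up mapping table with placeholder IDs; the column names `ensembl_id` and `entrez_id` are chosen for illustration.

```r
# Made-up mapping: two Ensembl IDs share one Entrez ID, and one is unmapped
toy_map_df <- data.frame(
  ensembl_id = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"),
  entrez_id = c("11", "22", "22", NA),
  stringsAsFactors = FALSE
)

# Entrez IDs that appear more than once, ignoring unmapped genes
mapped_ids <- toy_map_df$entrez_id[!is.na(toy_map_df$entrez_id)]
repeated_entrez <- unique(mapped_ids[duplicated(mapped_ids)])

# Show the rows involved in those many-to-one mappings
toy_map_df[toy_map_df$entrez_id %in% repeated_entrez, ]
```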

Let’s create a two-column data frame that shows the Ensembl IDs and their Entrez IDs side by side.

# We would like a data frame we can join to the gene module table
gene_key_df <- data.frame(
  ensembl_id = names(entrez_vector),
  entrez_id = entrez_vector,
  stringsAsFactors = FALSE
) %>%
  # If an Ensembl gene identifier doesn't map to an Entrez ID, drop that
  # from the data frame
  dplyr::filter(!is.na(entrez_id))

Let’s see a preview of gene_key_df.

head(gene_key_df)

Now we are ready to add the gene_key_df to our data frame with the module labels, gene_module_df. Here we’re using a dplyr::left_join() because we only want to retain the genes that have Entrez IDs and this will filter out anything in our gene_module_df that does not have an Entrez ID when we join using the Ensembl gene identifiers.

module_annot_df <- gene_key_df %>%
  # Using a left join removes the rows without Entrez IDs because those rows
  # have already been removed in `gene_key_df`
  dplyr::left_join(gene_module_df,
    # The name of the column that contains the Ensembl gene IDs
    # in the left data frame and right data frame
    by = c("ensembl_id" = "gene")
  )

Let’s take a look at what this data frame looks like.

# Print out a preview
head(module_annot_df)
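The effect of the join direction can be seen on a toy example. The sketch below uses base R's `merge()` with `all.x = TRUE` as a stand-in for `dplyr::left_join()` (a substitution made here so the example has no package dependencies); the IDs are placeholders. Because the left table only contains mapped genes, the unmapped gene in the right table does not appear in the result.

```r
# Left table: only genes that mapped to an Entrez ID (placeholder IDs)
toy_key_df <- data.frame(
  ensembl_id = c("ENSG_A", "ENSG_B"),
  entrez_id = c("11", "22"),
  stringsAsFactors = FALSE
)

# Right table: all genes with module labels, including one unmapped gene
toy_module_df <- data.frame(
  gene = c("ENSG_A", "ENSG_B", "ENSG_C"),
  module = c("ME1", "ME2", "ME2"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of the left table, like a left join
toy_annot_df <- merge(
  toy_key_df,
  toy_module_df,
  by.x = "ensembl_id",
  by.y = "gene",
  all.x = TRUE
)
toy_annot_df
```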

4.6 Over-representation Analysis (ORA)

Over-representation testing using clusterProfiler is based on a hypergeometric test (equivalent to a one-sided Fisher’s exact test) (Yu 2020). For more background, this handy tutorial explains how hypergeometric tests work (Puthier and van Helden 2015).

We will need to provide clusterProfiler with two gene lists:

Our genes of interest
The genes in our total background set (all genes that originally had an opportunity to be measured)
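To give a feel for the test itself, the sketch below computes a single over-representation p-value with base R. The four counts are invented for illustration and are not taken from this dataset. It also shows that a one-sided Fisher's exact test on the corresponding 2 x 2 table gives the same answer.

```r
# Invented counts for one hypothetical pathway
universe_size <- 1000 # Genes in the background set
pathway_size <- 50 # Background genes that belong to the pathway
interest_size <- 100 # Genes of interest
overlap <- 12 # Genes of interest that belong to the pathway

# By chance alone we would expect about 100 * 50 / 1000 = 5 shared genes.
# phyper() returns P(X <= q), so P(X >= overlap) uses q = overlap - 1
# together with lower.tail = FALSE
p_hyper <- phyper(
  overlap - 1,
  pathway_size,
  universe_size - pathway_size,
  interest_size,
  lower.tail = FALSE
)

# The same question posed as a one-sided Fisher's exact test.
# Rows: in / not in genes of interest; columns: in / not in pathway
contingency <- matrix(
  c(
    overlap,
    pathway_size - overlap,
    interest_size - overlap,
    universe_size - pathway_size - interest_size + overlap
  ),
  nrow = 2
)
p_fisher <- fisher.test(contingency, alternative = "greater")$p.value

c(hypergeometric = p_hyper, fisher = p_fisher)
```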

4.7 Determine our genes of interest list

This step is highly variable depending on the format of your gene table, what information it contains and what your goals are. You may want to delete this next chunk entirely if you supply an already determined list of genes OR you may need to introduce cutoffs and filters that we don’t need here, given the nature of our data.

Here, we will focus on one module, module 19, to identify pathways associated with it. We previously identified this module as differentially expressed between our dataset’s two time points (during acute illness and during recovery). See the previous section for more background on the structure and content of the data table we are using.

module_19_genes <- module_annot_df %>%
  dplyr::filter(module == "ME19") %>%
  dplyr::pull("entrez_id")

Because one entrez_id may map to multiple Ensembl IDs, we need to make sure we have no repeated Entrez IDs in this list.

# Reduce to only unique Entrez IDs
genes_of_interest <- unique(as.character(module_19_genes))

# Let's print out some of these genes
head(genes_of_interest)

## [1] "5704"  "578"   "23471" "5255"  "4171"  "8898"

4.8 Determine our background set gene list

Sometimes folks consider genes from the entire genome to comprise the background, but for this RNA-seq example, we will consider all detectable genes as our background set. The dataset that these genes were selected from already had unreliably detected, low count genes removed. Because of this, we can obtain our detected genes list from our data frame, module_annot_df (which we have not done any further filtering on in this notebook).

# Remove any duplicated entrez_ids
background_set <- unique(as.character(module_annot_df$entrez_id))
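Before running the test, two quick sanity checks can catch common problems: every gene of interest should be in the background set, and neither list should contain repeats. The sketch below shows these checks on made-up vectors; on real data you would substitute genes_of_interest and background_set.

```r
# Made-up stand-ins for the two gene lists
toy_genes_of_interest <- c("11", "22")
toy_background_set <- c("11", "22", "33", "44")

# Every gene of interest should also be part of the background
interest_in_background <- all(toy_genes_of_interest %in% toy_background_set)

# Neither list should contain repeated identifiers
no_duplicates <- !any(duplicated(toy_genes_of_interest)) &&
  !any(duplicated(toy_background_set))

c(
  interest_in_background = interest_in_background,
  no_duplicates = no_duplicates
)
```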

4.9 Run ORA using the `enricher()` function

Now that we have our background set, our genes of interest, and our pathway information, we’re ready to run ORA using the enricher() function.

kegg_ora_results <- enricher(
  gene = genes_of_interest, # A vector of your genes of interest
  pvalueCutoff = 0.1, # Can choose an FDR cutoff
  pAdjustMethod = "BH", # Method to be used for multiple testing correction
  universe = background_set, # A vector containing your background set genes
  # The pathway information should be a data frame with a term name or
  # identifier and the gene identifiers
  TERM2GENE = dplyr::select(
    hs_kegg_df,
    gs_name,
    human_entrez_gene
  )
)

Note: using enrichKEGG() is a shortcut for doing ORA using KEGG, but the approach we covered here can be used with any gene sets you’d like!
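For intuition about what an ORA run involves, the sketch below hand-rolls the core idea in base R on made-up gene sets: one hypergeometric test per gene set, followed by Benjamini-Hochberg adjustment. This is a simplified illustration and not a reimplementation of enricher(), which applies additional steps (for example, gene set size limits via its minGSSize and maxGSSize arguments), so its numbers should not be expected to match.

```r
# Made-up gene sets in a two-column term/gene layout
toy_term2gene <- data.frame(
  gs_name = rep(c("SET_A", "SET_B"), times = c(4, 3)),
  gene = c("1", "2", "3", "4", "4", "5", "6"),
  stringsAsFactors = FALSE
)

# Made-up background set and genes of interest
toy_universe <- as.character(1:20)
toy_interest <- c("1", "2", "3", "7")

# One vector of genes per gene set
gene_sets <- split(toy_term2gene$gene, toy_term2gene$gs_name)

# One hypergeometric test per gene set: P(overlap at least as large as observed)
toy_pvalues <- vapply(gene_sets, function(set_genes) {
  set_genes <- intersect(set_genes, toy_universe)
  overlap <- length(intersect(set_genes, toy_interest))
  phyper(
    overlap - 1,
    length(set_genes),
    length(toy_universe) - length(set_genes),
    length(toy_interest),
    lower.tail = FALSE
  )
}, numeric(1))

# Benjamini-Hochberg correction across all tested gene sets
toy_padj <- p.adjust(toy_pvalues, method = "BH")

data.frame(
  gs_name = names(toy_pvalues),
  pvalue = toy_pvalues,
  p.adjust = toy_padj,
  row.names = NULL
)
```

Here SET_A shares three of its four genes with the genes of interest and gets a small p-value, while SET_B shares none.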

The information we’re most likely interested in is in the result slot. Let’s convert this into a data frame that we can write to file.

kegg_result_df <- data.frame(kegg_ora_results@result)

Let’s print out a sneak peek of the results here and take a look at how many gene sets we have using an FDR cutoff of 0.1.

kegg_result_df %>%
  dplyr::filter(p.adjust < 0.1)

Looks like there are four KEGG sets returned as significant at FDR of 0.1.
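Counting and ordering the significant gene sets can also be done with base R subsetting. The sketch below uses a made-up results table with invented adjusted p-values; only the `p.adjust` column name is taken from the results above.

```r
# Made-up results table with invented adjusted p-values
toy_result_df <- data.frame(
  ID = c("SET_A", "SET_B", "SET_C"),
  p.adjust = c(0.2, 0.01, 0.08),
  stringsAsFactors = FALSE
)

# Keep gene sets below the FDR cutoff
significant_df <- toy_result_df[toy_result_df$p.adjust < 0.1, ]

# Order them from most to least significant
significant_df <- significant_df[order(significant_df$p.adjust), ]

# How many gene sets pass the cutoff?
n_significant <- nrow(significant_df)
n_significant
```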

4.10 Visualizing results

We can use a dot plot to visualize our significant enrichment results. The enrichplot::dotplot() function will only plot gene sets that are significant according to the multiple testing corrected p values (in the p.adjust column) and the pvalueCutoff you provided in the enricher() step.

enrich_plot <- enrichplot::dotplot(kegg_ora_results)

## wrong orderBy parameter; set to default `orderBy = "x"`

# Print out the plot here
enrich_plot

Use ?enrichplot::dotplot to see the help page for more about how to use this function.

This plot is arguably more useful when we have a large number of significant pathways.

Let’s save it to a PNG.

ggplot2::ggsave(file.path(plots_dir, "SRP140558_ora_enrich_plot_module_19.png"),
  plot = enrich_plot
)

## Saving 7 x 5 in image

We can use an UpSet plot to visualize the overlap between the gene sets that were returned as significant.

upset_plot <- enrichplot::upsetplot(kegg_ora_results)

# Print out the plot here
upset_plot

Notice that KEGG_CELL_CYCLE and KEGG_OOCYTE_MEIOSIS have genes in common, as do KEGG_CELL_CYCLE and KEGG_DNA_REPLICATION. Gene sets or pathways aren’t independent! Based on the context of your samples, you may be able to narrow down which ones make sense. In this instance, we are dealing with PBMCs, so the oocyte meiosis pathway is not relevant to the biology of the samples at hand, and all of the identified genes in that pathway are also part of the cell cycle pathway.
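The kind of overlap an UpSet plot displays can also be checked directly. The sketch below uses made-up gene sets with placeholder IDs (not the KEGG sets above) to count shared genes for every pair of sets and to test whether one set's genes are entirely contained in another.

```r
# Made-up gene sets with placeholder IDs
toy_sets <- list(
  SET_A = c("1", "2", "3", "4"),
  SET_B = c("3", "4"),
  SET_C = c("4", "5")
)

# Genes shared between two specific sets
shared_a_b <- intersect(toy_sets$SET_A, toy_sets$SET_B)

# TRUE when every gene in SET_B is also in SET_A
b_within_a <- all(toy_sets$SET_B %in% toy_sets$SET_A)

# Number of shared genes for every pair of sets (diagonal = set sizes)
overlap_counts <- sapply(toy_sets, function(set_x) {
  sapply(toy_sets, function(set_y) length(intersect(set_x, set_y)))
})
overlap_counts
```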

Let’s also save this to a PNG.

ggplot2::ggsave(file.path(plots_dir, "SRP140558_ora_upset_plot_module_19.png"),
  plot = upset_plot
)

## Saving 7 x 5 in image

4.11 Write results to file

readr::write_tsv(
  kegg_result_df,
  file.path(
    results_dir,
    "SRP140558_module_19_pathway_analysis_results.tsv"
  )
)

5 Resources for further learning

Hypergeometric test exercises (Puthier and van Helden 2015).
clusterProfiler ORA tutorial
clusterProfiler paper (Yu et al. 2012).
clusterProfiler book (Yu 2020).
This handy review which summarizes the different types of pathway analysis and their limitations (Khatri et al. 2012).

6 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

# Print session info
sessioninfo::session_info()

## ─ Session info ─────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.5 (2021-03-31)
##  os       Ubuntu 20.04.3 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Etc/UTC                     
##  date     2022-03-01                  
## 
## ─ Packages ─────────────────────────────────────────────────────────
##  package         * version  date       lib source        
##  AnnotationDbi   * 1.52.0   2020-10-27 [1] Bioconductor  
##  assertthat        0.2.1    2019-03-21 [1] RSPM (R 4.0.3)
##  babelgene         21.4     2021-04-26 [1] RSPM (R 4.0.4)
##  backports         1.2.1    2020-12-09 [1] RSPM (R 4.0.3)
##  Biobase         * 2.50.0   2020-10-27 [1] Bioconductor  
##  BiocGenerics    * 0.36.1   2021-04-16 [1] Bioconductor  
##  BiocManager       1.30.15  2021-05-11 [1] RSPM (R 4.0.4)
##  BiocParallel      1.24.1   2020-11-06 [1] Bioconductor  
##  bit               4.0.4    2020-08-04 [1] RSPM (R 4.0.3)
##  bit64             4.0.5    2020-08-30 [1] RSPM (R 4.0.3)
##  blob              1.2.1    2020-01-20 [1] RSPM (R 4.0.3)
##  bslib             0.2.5    2021-05-12 [1] RSPM (R 4.0.4)
##  cachem            1.0.5    2021-05-15 [1] RSPM (R 4.0.4)
##  cli               2.5.0    2021-04-26 [1] RSPM (R 4.0.4)
##  clusterProfiler * 3.18.1   2021-02-09 [1] Bioconductor  
##  colorspace        2.0-1    2021-05-04 [1] RSPM (R 4.0.4)
##  cowplot           1.1.1    2020-12-30 [1] RSPM (R 4.0.3)
##  crayon            1.4.1    2021-02-08 [1] RSPM (R 4.0.3)
##  data.table        1.14.0   2021-02-21 [1] RSPM (R 4.0.3)
##  DBI               1.1.1    2021-01-15 [1] RSPM (R 4.0.3)
##  digest            0.6.27   2020-10-24 [1] RSPM (R 4.0.3)
##  DO.db             2.9      2022-03-01 [1] Bioconductor  
##  DOSE              3.16.0   2020-10-27 [1] Bioconductor  
##  downloader        0.4      2015-07-09 [1] RSPM (R 4.0.3)
##  dplyr             1.0.6    2021-05-05 [1] RSPM (R 4.0.4)
##  ellipsis          0.3.2    2021-04-29 [1] RSPM (R 4.0.4)
##  enrichplot        1.10.2   2021-01-28 [1] Bioconductor  
##  evaluate          0.14     2019-05-28 [1] RSPM (R 4.0.3)
##  fansi             0.4.2    2021-01-15 [1] RSPM (R 4.0.3)
##  farver            2.1.0    2021-02-28 [1] RSPM (R 4.0.3)
##  fastmap           1.1.0    2021-01-25 [1] RSPM (R 4.0.3)
##  fastmatch         1.1-0    2017-01-28 [1] RSPM (R 4.0.3)
##  fgsea             1.16.0   2020-10-27 [1] Bioconductor  
##  generics          0.1.0    2020-10-31 [1] RSPM (R 4.0.3)
##  getopt            1.20.3   2019-03-22 [1] RSPM (R 4.0.0)
##  ggforce           0.3.3    2021-03-05 [1] RSPM (R 4.0.3)
##  ggplot2           3.3.3    2020-12-30 [1] RSPM (R 4.0.3)
##  ggraph            2.0.5    2021-02-23 [1] RSPM (R 4.0.3)
##  ggrepel           0.9.1    2021-01-15 [1] RSPM (R 4.0.3)
##  ggupset           0.3.0    2020-05-05 [1] RSPM (R 4.0.0)
##  glue              1.4.2    2020-08-27 [1] RSPM (R 4.0.3)
##  GO.db             3.12.1   2022-03-01 [1] Bioconductor  
##  GOSemSim          2.16.1   2020-10-29 [1] Bioconductor  
##  graphlayouts      0.7.1    2020-10-26 [1] RSPM (R 4.0.4)
##  gridExtra         2.3      2017-09-09 [1] RSPM (R 4.0.3)
##  gtable            0.3.0    2019-03-25 [1] RSPM (R 4.0.3)
##  highr             0.9      2021-04-16 [1] RSPM (R 4.0.4)
##  hms               1.0.0    2021-01-13 [1] RSPM (R 4.0.3)
##  htmltools         0.5.1.1  2021-01-22 [1] RSPM (R 4.0.3)
##  igraph            1.2.6    2020-10-06 [1] RSPM (R 4.0.3)
##  IRanges         * 2.24.1   2020-12-12 [1] Bioconductor  
##  jquerylib         0.1.4    2021-04-26 [1] RSPM (R 4.0.4)
##  jsonlite          1.7.2    2020-12-09 [1] RSPM (R 4.0.3)
##  knitr             1.33     2021-04-24 [1] RSPM (R 4.0.4)
##  labeling          0.4.2    2020-10-20 [1] RSPM (R 4.0.3)
##  lattice           0.20-41  2020-04-02 [2] CRAN (R 4.0.5)
##  lifecycle         1.0.0    2021-02-15 [1] RSPM (R 4.0.3)
##  magrittr        * 2.0.1    2020-11-17 [1] RSPM (R 4.0.3)
##  MASS              7.3-53.1 2021-02-12 [2] CRAN (R 4.0.5)
##  Matrix            1.3-2    2021-01-06 [2] CRAN (R 4.0.5)
##  memoise           2.0.0    2021-01-26 [1] RSPM (R 4.0.3)
##  msigdbr         * 7.4.1    2021-05-05 [1] RSPM (R 4.0.4)
##  munsell           0.5.0    2018-06-12 [1] RSPM (R 4.0.3)
##  optparse        * 1.6.6    2020-04-16 [1] RSPM (R 4.0.0)
##  org.Hs.eg.db    * 3.12.0   2022-03-01 [1] Bioconductor  
##  pillar            1.6.1    2021-05-16 [1] RSPM (R 4.0.4)
##  pkgconfig         2.0.3    2019-09-22 [1] RSPM (R 4.0.3)
##  plyr              1.8.6    2020-03-03 [1] RSPM (R 4.0.3)
##  polyclip          1.10-0   2019-03-14 [1] RSPM (R 4.0.3)
##  ps                1.6.0    2021-02-28 [1] RSPM (R 4.0.3)
##  purrr             0.3.4    2020-04-17 [1] RSPM (R 4.0.3)
##  qvalue            2.22.0   2020-10-27 [1] Bioconductor  
##  R.cache           0.15.0   2021-04-30 [1] RSPM (R 4.0.4)
##  R.methodsS3       1.8.1    2020-08-26 [1] RSPM (R 4.0.3)
##  R.oo              1.24.0   2020-08-26 [1] RSPM (R 4.0.3)
##  R.utils           2.10.1   2020-08-26 [1] RSPM (R 4.0.3)
##  R6                2.5.0    2020-10-28 [1] RSPM (R 4.0.3)
##  RColorBrewer      1.1-2    2014-12-07 [1] RSPM (R 4.0.3)
##  Rcpp              1.0.6    2021-01-15 [1] RSPM (R 4.0.3)
##  readr             1.4.0    2020-10-05 [1] RSPM (R 4.0.4)
##  rematch2          2.1.2    2020-05-01 [1] RSPM (R 4.0.3)
##  reshape2          1.4.4    2020-04-09 [1] RSPM (R 4.0.3)
##  rlang             0.4.11   2021-04-30 [1] RSPM (R 4.0.4)
##  rmarkdown         2.8      2021-05-07 [1] RSPM (R 4.0.4)
##  RSQLite           2.2.7    2021-04-22 [1] RSPM (R 4.0.4)
##  rstudioapi        0.13     2020-11-12 [1] RSPM (R 4.0.3)
##  rvcheck           0.1.8    2020-03-01 [1] RSPM (R 4.0.0)
##  S4Vectors       * 0.28.1   2020-12-09 [1] Bioconductor  
##  sass              0.4.0    2021-05-12 [1] RSPM (R 4.0.4)
##  scales            1.1.1    2020-05-11 [1] RSPM (R 4.0.3)
##  scatterpie        0.1.6    2021-04-23 [1] RSPM (R 4.0.4)
##  sessioninfo       1.1.1    2018-11-05 [1] RSPM (R 4.0.3)
##  shadowtext        0.0.8    2021-04-23 [1] RSPM (R 4.0.4)
##  stringi           1.6.1    2021-05-10 [1] RSPM (R 4.0.4)
##  stringr           1.4.0    2019-02-10 [1] RSPM (R 4.0.3)
##  styler            1.4.1    2021-03-30 [1] RSPM (R 4.0.4)
##  tibble            3.1.2    2021-05-16 [1] RSPM (R 4.0.4)
##  tidygraph         1.2.0    2020-05-12 [1] RSPM (R 4.0.3)
##  tidyr             1.1.3    2021-03-03 [1] RSPM (R 4.0.4)
##  tidyselect        1.1.1    2021-04-30 [1] RSPM (R 4.0.4)
##  tweenr            1.0.2    2021-03-23 [1] RSPM (R 4.0.4)
##  utf8              1.2.1    2021-03-12 [1] RSPM (R 4.0.3)
##  vctrs             0.3.8    2021-04-29 [1] RSPM (R 4.0.4)
##  viridis           0.6.1    2021-05-11 [1] RSPM (R 4.0.4)
##  viridisLite       0.4.0    2021-04-13 [1] RSPM (R 4.0.4)
##  withr             2.4.2    2021-04-18 [1] RSPM (R 4.0.4)
##  xfun              0.23     2021-05-15 [1] RSPM (R 4.0.4)
##  yaml              2.2.1    2020-02-01 [1] RSPM (R 4.0.3)
## 
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library

References

Ahlmann-Eltze C., 2020 ggupset: Combination matrix axis for ’ggplot2’ to create ’UpSet’ plots. https://github.com/const-ae/ggupset

Carlson M., 2020 org.Hs.eg.db: Genome wide annotation for human. http://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html

Dolgalev I., 2020 msigdbr: MSigDB gene sets for multiple organisms in a tidy data format. https://cran.r-project.org/web/packages/msigdbr/index.html

Kanehisa M., and S. Goto, 2000 KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research 28: 27–30. https://doi.org/10.1093/nar/28.1.27

Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: Current approaches and outstanding challenges. PLOS Computational Biology 8: e1002375. https://doi.org/10.1371/journal.pcbi.1002375

Langfelder P., and S. Horvath, 2008 WGCNA: An R package for weighted correlation network analysis. BMC Bioinformatics 9. https://doi.org/10.1186/1471-2105-9-559

Liberzon A., A. Subramanian, R. Pinchback, H. Thorvaldsdóttir, P. Tamayo, et al., 2011 Molecular signatures database (MSigDB) 3.0. Bioinformatics 27: 1739–1740. https://doi.org/10.1093/bioinformatics/btr260

Puthier D., and J. van Helden, 2015 Statistics for Bioinformatics - Practicals - Gene enrichment statistics. https://dputhier.github.io/ASG/practicals/go_statistics_td/go_statistics_td_2015.html

Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, et al., 2005 Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102: 15545–15550. https://doi.org/10.1073/pnas.0506580102

Yaari G., C. R. Bolen, J. Thakar, and S. H. Kleinstein, 2013 Quantitative set analysis for gene expression: A method to quantify gene set differential expression including gene-gene correlations. Nucleic Acids Research 41: e170. https://doi.org/10.1093/nar/gkt660

Yu G., L.-G. Wang, Y. Han, and Q.-Y. He, 2012 clusterProfiler: An R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16: 284–287. https://doi.org/10.1089/omi.2011.0118

Yu G., 2020 clusterProfiler: Universal enrichment tool for functional and comparative study. http://yulab-smu.top/clusterProfiler-book/index.html

Over-representation analysis - RNA-Seq

CCDL for ALSF

December 2020

