Objectives

This notebook will demonstrate how to:


In this notebook, we’ll cover a type of pathway or gene set analysis called over-representation analysis (ORA). The idea behind ORA is relatively straightforward: given a set of genes, do these genes overlap with a pathway more than we expect by chance? The simplicity of only requiring an input gene set (sort of, more on that below) can be attractive.

ORA has some limitations, outlined nicely (and more extensively!) in Khatri et al. (2012). One of the main issues with ORA is that typically all genes are treated as equal – the context of the magnitude of a change we may be measuring is removed and each gene is treated as independent, which can sometimes result in an incorrect estimate of significance.

We will use the clusterProfiler package (Yu et al. 2012) to perform ORA. clusterProfiler has many built-in functions that will run a specific type of analysis using a specific source of pathways/gene sets automatically, but for our purposes we’re going to keep things as general as possible. See the clusterProfiler book for more information about the package’s full suite of functionality.

Because different bioinformatics tools often require different types of gene identifiers, we’ll also cover how to convert between gene identifiers using AnnotationDbi Bioconductor packages in this notebook. Check out the AnnotationDbi: Introduction To Bioconductor Annotation Packages (Carlson 2020.) vignette for more information.

Other resources

Set up

Libraries

# Package we'll use for ORA
library(clusterProfiler)
clusterProfiler v4.12.0  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/

If you use clusterProfiler in published research, please cite:
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141

Attaching package: 'clusterProfiler'
The following object is masked from 'package:stats':

    filter
# Package that contains MSigDB gene sets in tidy format
library(msigdbr)
# Mus musculus annotation package we'll use for gene identifier conversion
library(org.Mm.eg.db)
Loading required package: AnnotationDbi
Loading required package: stats4
Loading required package: BiocGenerics

Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
    tapply, union, unique, unsplit, which.max, which.min
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: IRanges
Loading required package: S4Vectors

Attaching package: 'S4Vectors'
The following object is masked from 'package:clusterProfiler':

    rename
The following object is masked from 'package:utils':

    findMatches
The following objects are masked from 'package:base':

    expand.grid, I, unname

Attaching package: 'IRanges'
The following object is masked from 'package:clusterProfiler':

    slice

Attaching package: 'AnnotationDbi'
The following object is masked from 'package:clusterProfiler':

    select

Directories and files

Directories

# We'll create a directory to specifically hold the ORA results if it doesn't
# exist yet
results_dir <- file.path("results", "leukemia")
if (!dir.exists(results_dir)) {
  dir.create(results_dir, recursive = TRUE)
}

Input files

For our ORA example, we’re going to use two tables of differential gene expression (DGE) analysis results.

input_dir <- file.path("data", "leukemia")

# This file contains the DGE results for a cell population with high stem cell
# capacity as compared to a cell population
vs_low_file <- file.path(input_dir,
                         "high_capacity_vs_low_capacity_results.tsv")

# This file contains the DGE results for the high stem cell capacity population
# vs. unsorted cells (e.g., a mixture of capacities)
vs_unsorted_file <- file.path(input_dir,
                              "high_capacity_vs_unsorted_results.tsv")

Output files

We’ll save the table of ORA results (e.g., p-values).

kegg_results_file <- file.path(results_dir, "leukemia_kegg_ora_results.tsv")

Gene sets

We will use gene sets from the Molecular Signatures Database (MSigDB) from the Broad Institute (Subramanian, Tamayo et al. 2005). The msigdbr package contains MSigDB datasets already in the tidy format required by clusterProfiler and supports multiple organisms.

Let’s take a look at what organisms the package supports.

msigdbr_species()

The results we’re interested in here come from mouse samples, so we can obtain just the gene sets relevant to M. musculus with the species argument to msigdbr().

mm_msigdb_df <- msigdbr(species = "Mus musculus")

MSigDB contains 8 different gene set collections.

H: hallmark gene sets
C1: positional gene sets
C2: curated gene sets
C3: motif gene sets
C4: computational gene sets
C5: GO gene sets
C6: oncogenic signatures
C7: immunologic signatures

In this example, we will use canonical pathways which are (ref):

Gene sets from pathway databases. Usually, these gene sets are canonical representations of a biological process compiled by domain experts.

And are a subset of C2: curated gene sets. Specifically, we will use the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways.

# Filter the mouse data frame to the KEGG pathways that are included in the
# curated gene sets
mm_kegg_df <- mm_msigdb_df |>
  dplyr::filter(gs_cat == "C2",  # curated gene sets
                gs_subcat == "CP:KEGG")  # KEGG pathways

Note: We could specified that we wanted the KEGG gene sets using the category and subcategory arguments of msigdbr(), but we’re going for general steps!

colnames(mm_kegg_df)
 [1] "gs_cat"               "gs_subcat"            "gs_name"             
 [4] "gene_symbol"          "entrez_gene"          "ensembl_gene"        
 [7] "human_gene_symbol"    "human_entrez_gene"    "human_ensembl_gene"  
[10] "gs_id"                "gs_pmid"              "gs_geoid"            
[13] "gs_exact_source"      "gs_url"               "gs_description"      
[16] "taxon_id"             "ortholog_sources"     "num_ortholog_sources"
gs_cat
gs_subcat
gs_name
gene_symbol
entrez_gene
ensembl_gene
human_gene_symbol
human_entrez_gene
human_ensembl_gene
gs_id
gs_pmid
gs_geoid
gs_exact_source
gs_url
gs_description
taxon_id
ortholog_sources
num_ortholog_sources

The clusterProfiler function we will use requires a data frame with two columns, where one column contains the term identifier or name and one column contains gene identifiers that match our gene lists we want to check for enrichment. Our data frame with KEGG terms contains Entrez IDs and gene symbols.

Read in DGE results and prep

vs_low_df <- readr::read_tsv(vs_low_file)
Rows: 35429 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): Gene
dbl (6): baseMean, log2FoldChange, lfcSE, stat, pvalue, padj

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Let’s take a peek at the top of the DGE results data frame.

head(vs_low_df)

Gene identifier conversion

Our data frame of DGE results contains Ensembl gene identifiers. So we will need to convert from these identifiers into either the gene symbols or Entrez IDs that are present in the data we extracted with msigdbr().

We’re going to convert our identifiers to gene symbols because they are a bit more human readable, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!

The annotation package org.Mm.eg.db contains information for different identifiers. org.Mm.eg.db is specific to Mus musculus – this is what the Mm in the package name is referencing. To perform gene identifier conversion in human (Homo sapiens) we could use org.Hs.eg.db; we would use org.Dr.eg.db for zebrafish (Danio rerio).

We can see what types of IDs are available to us in an annotation package with keytypes().

keytypes(org.Mm.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MGI"         
[16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
[21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIPROT"     
ACCNUM
ALIAS
ENSEMBL
ENSEMBLPROT
ENSEMBLTRANS
ENTREZID
ENZYME
EVIDENCE
EVIDENCEALL
GENENAME
GENETYPE
GO
GOALL
IPI
MGI
ONTOLOGY
ONTOLOGYALL
PATH
PFAM
PMID
PROSITE
REFSEQ
SYMBOL
UNIPROT

Even though we’ll use this package to convert from Ensembl gene IDs (ENSEMBL) to gene symbols (SYMBOL), we could just as easily use it to convert from an Ensembl transcript ID (ENSEMBLTRANS) to Entrez IDs (ENTREZID).

The function we will use to map from Ensembl gene IDs to gene symbols is called mapIds().

# This returns a named vector which we can convert to a data frame, where
# the keys (Ensembl IDs) are the names
symbols_vector <- mapIds(org.Mm.eg.db,  # Specify the annotation package
                         # The vector of gene identifiers we want to
                         # map
                         keys = vs_low_df$Gene,
                         # What type of gene identifiers we're starting
                         # with
                         keytype = "ENSEMBL",
                         # The type of gene identifier we want returned
                         column = "SYMBOL",
                         # In the case of 1:many mappings, return the
                         # first one. This is default behavior!
                         multiVals = "first")
'select()' returned 1:many mapping between keys and columns
# We would like a data frame we can join to the DGE results
symbols_df <- data.frame(
  ensembl_id = names(symbols_vector),
  gene_symbol = symbols_vector
)

This message is letting us know that sometimes Ensembl gene identifiers will map to multiple gene symbols. In this case, it’s also possible that a gene symbol will map to multiple Ensembl IDs.

Now we are ready to add the gene symbols to our data frame with the DGE results. We can use a join function from the dplyr package to do this, which will use the Ensembl gene IDs in both data frames to determine how to join the rows.

Let’s do this first for the comparison to the low stem cell capacity population.

vs_low_df <- symbols_df |>
  # An *inner* join will only return rows that are in both data frames
  dplyr::inner_join(vs_low_df,
                    # The name of the column that contains the Ensembl gene IDs
                    # in the left data frame and right data frame
                    by = c("ensembl_id" = "Gene"))

Drop NA values

Some of these rows have NA values in padj, which can happen for a number of reasons including when all samples have zero counts or a gene has low mean expression.

Let’s filter to rows that do not have any NA using a function tidyr::drop_na(). This will also drop genes that have an Ensembl gene identifier but no gene symbol!

# Remove rows that are not complete (e.g., contain NAs) by filtering to only
# complete rows
vs_low_df <- vs_low_df |>
  tidyr::drop_na()

Now we’ll read in our data frame of DGE results from another comparison. To save us some time during instruction, we’ve already done the gene identifier conversion and filtering to remove NA values in this notebook. We took a different series of steps to achieve the same thing, which is often possible in R!

vs_unsorted_df <- readr::read_tsv(vs_unsorted_file)
Rows: 17151 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): Gene, gene_symbol
dbl (6): baseMean, log2FoldChange, lfcSE, stat, pvalue, padj

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Over-representation Analysis (ORA)

To test for over-representation, we can calculate a p-value with a hypergeometric test (ref).

\(p = 1 - \displaystyle\sum_{i = 0}^{k-1}\frac{ {M \choose i}{ {N-M} \choose {n-i} } } { {N \choose n} }\)

Where N is the number of genes in the background distribution, M is the number of genes in a pathway, n is the number of genes we are interested in (our marker genes), and k is the number of genes that overlap between the pathway and our marker genes.

Borrowing an example from clusterProfiler: universal enrichment tool for functional and comparative study (Yu ):

Example: Suppose we have 17,980 genes detected in a Microarray study and 57 genes were differentially expressed. Among the differential expressed genes, 28 are annotated to a gene set.

We’ll call genes that are differentially expressed gene_in_interest and genes that are in the gene set in_gene_set.

gene_table <- data.frame(
  gene_not_interest = c(2613, 15310),
  gene_in_interest = c(28, 29)
)
rownames(gene_table) <- c("in_gene_set", "not_in_gene_set")

gene_table

We can assess if the 28 overlapping genes mean that the differentially expressed genes are over-represented in the gene set with the hypergeometric distribution. This corresponds to a one-sided Fisher’s exact test.

fisher.test(gene_table, alternative = "greater")

    Fisher's Exact Test for Count Data

data:  gene_table
p-value = 1
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 0.110242      Inf
sample estimates:
odds ratio 
 0.1767937 

When we test multiple pathways or gene sets, the p-values then need to be adjusted for multiple hypothesis testing.

High stem cell capacity ORA

Our DGE results are from data published as part of Sachs et al. (2014). The authors sorted populations of primary leukemia cells and examined the stem cell capacity of these cell populations. (This study may sound familiar if you’ve worked on one of our bulk RNA-seq exercise notebooks in the past!)

We compared the population that the authors identified as having high stem cell capacity to a low stem cell capacity population. We also compared the high stem cell capacity cells to a mix of populations (e.g., unsorted cells). You can see the code in here.

We’re interested in what pathways are over-represented in genes that specifically distinguish the high capacity population from the low capacity population.

Let’s generate a list of genes that have higher expression in the high stem cell capacity population compared to the low stem cell capacity population, but we’ll also want to exclude genes that show up in our other comparison to unsorted cells.

We’ll start with the high stem cell capacity vs. low stem cell capacity population comparison. Genes with positive log2 fold-changes (LFC) will be more highly expressed in the high stem cell capacity cells based on how we set up the analysis.

vs_low_genes <- vs_low_df |>
  # Filter to the positive LFC and filter based on significance too (padj)
  dplyr::filter(log2FoldChange > 0,
                padj < 0.05) |>
  # Return a vector of gene symbols
  dplyr::pull(gene_symbol)

Although we’re picking a commonly used cutoff (FDR < 0.05), it’s still arbitrary and we could just as easily pick a different threshold for our LFC values. When we generate lists of genes of interest for ORA, we typically pick an arbitrary cutoff. This is one of the approach’s weaknesses – we’ve removed all other context.

Now, we’ll take the same steps for our other results.

vs_unsorted_genes <- vs_unsorted_df |>
  dplyr::filter(log2FoldChange > 0,
                padj < 0.05) |>
  dplyr::pull(gene_symbol)

We want genes that are in the first comparison but not in the second! We can use setdiff(), a base R function for set operations, to get the list that we want.

# What genes are in the first set but *not* in the second set
genes_for_ora <- setdiff(vs_low_genes, vs_unsorted_genes)

Background set

As we saw above, calculating the p-value relies on the number of genes in the background distribution. Sometimes folks consider genes from the entire genome to comprise the background, but in the example borrowed from the clusterProfiler authors, they state:

17,980 genes detected in a Microarray study

Where the key phrase is genes detected.

If we were unable to include a gene in one of our differential expression comparisons because, for example, it had low mean expression in our experiment and therefore was filtered out in our tidyr::drop_na() step, we shouldn’t include in our background set.

We can use another function for set operations, intersect(), to get our background set of genes that were included in both comparisons.

# intersect() will return the genes in both sets - we are using the entire data
# frame here (complete cases), not just the significant genes
background_set <- intersect(vs_low_df$gene_symbol,
                            vs_unsorted_df$gene_symbol)

# Remove anything that couldn't reliably be measured/assessed in both from the
# genes of interest list - using intersect() will drop anything in the first set
# that isn't also in the second set
genes_for_ora <- intersect(genes_for_ora, background_set)

Run enricher()

Now that we have our background set, our genes of interest, and our pathway information, we’re ready to run ORA using the enricher() function.

kegg_ora_results <- enricher(
  gene = genes_for_ora,  # Genes of interest
  pvalueCutoff = 0.05,
  pAdjustMethod = "BH",  # FDR
  universe = background_set,  # Background set
  # The pathway information should be a data frame with a term name or
  # identifier and the gene identifiers
  TERM2GENE = dplyr::select(mm_kegg_df,
                            gs_name,
                            gene_symbol)
)

Note: using enrichKEGG() is a shortcut for doing ORA using KEGG, but the approach we covered here can be used with any gene sets you’d like!

What is returned by enricher()?

View(kegg_ora_results)

The information we’re most likely interested in is in the results slot. Let’s convert this into a data frame that we can write to file.

kegg_result_df <- data.frame(kegg_ora_results@result)

Visualizing results

We can use a dot plot to visualize our significant enrichment results.

enrichplot::dotplot(kegg_ora_results)