Adapted from the
AUCell vignette and the
cell-type-ewings
module that is part of the Open
Single-cell Pediatric Cancer Atlas project.
AUCell
R packageIn this notebook, we’ll demonstrate how to use the AUCell method, introduced in Aibar et al. 2017..
We can use AUCell when we are interested in a gene set’s relative expression or activity in an individual cell. Gene sets can come from a curated collection of prior knowledge, like the Hallmark collection we used in the last notebook, or we can use our own custom gene sets (e.g., a set of marker genes for a cell type of interest).
A nice feature of AUCell is that it is based on ranking genes from highest to lowest expression value in an individual cell, which is helpful in the following ways (AUCell vignette):
AUCell calculates the area under the recovery curve (AUC), which “represents the proportion of expressed genes in the signature and their relative expression value compared to the other genes within the cell” (Aibar et al. 2017.). We will visualize some recovery curves in the notebook to give you a better intuition about the AUC and its meaning.
The AUC values we get out of AUCell can be used in a number of ways (Aibar et al. 2017.):
We will use an snRNA-seq of a Ewing sarcoma sample from the SCPCP000015
project on the Single-cell Pediatric Cancer Atlas Portal and two
relevant gene sets from the Molecular Signatures Database (MSigDB) to
demonstrate this method.
# We will be loading a SingleCellExperiment object into our environment but don't need to see the startup messages
suppressPackageStartupMessages({
library(SingleCellExperiment)
})
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
# Library we'll use for the gene set analysis itself
library(AUCell)
# Libraries for accessing and working with gene sets
library(GSEABase)
Loading required package: annotate
Loading required package: AnnotationDbi
Loading required package: XML
Loading required package: graph
Attaching package: 'graph'
The following object is masked from 'package:XML':
addNode
library(msigdbr)
# Input data
ewing_data_dir <- fs::path("data", "ewing-sarcoma")
processed_dir <- fs::path(ewing_data_dir, "processed")
# Directory for holding pathway analysis results
analysis_dir <- fs::path("analysis", "ewing-sarcoma", "pathway-analysis")
# Create if it doesn't exist yet
fs::dir_create(analysis_dir)
The input will be a SingleCellExperiment
for an
individual Ewing sarcoma library.
sce_file <- fs::path(processed_dir,
"SCPCS000490",
"SCPCL000822_processed.rds")
We will save the AUCell results as a table in the analysis directory.
output_file <- fs::path(analysis_dir,
"ewing_sarcoma_aucell_results.tsv")
The source()
function allows us to load in custom
functions we saved in an .R
file.
source(fs::path("util", "aucell_functions.R"))
This loads one custom function, called
plot_recovery_curve()
, into our environment. This function
is adapted from the
AUCell vignette.
We are going to use two gene sets pertaining to Ewing sarcoma.
ZHANG_TARGETS_OF_EWSR1_FLI1_FUSION
,
which are genes that were highly expressed in a rhabdomyosarcoma cell
line engineered to express the EWSR1-FLI1 fusion.RIGGI_EWING_SARCOMA_PROGENITOR_UP
,
which are genes that were highly expressed in mesenchymal stem cells
engineered to express the EWS-FLI1 fusion protein.We would expect both of these gene sets to have high expression in tumor cells.
# Create a named vector with the relevant gene set names
ewing_gene_set_names <- c(zhang = "ZHANG_TARGETS_OF_EWSR1_FLI1_FUSION",
riggi = "RIGGI_EWING_SARCOMA_PROGENITOR_UP")
ewing_gene_set_names
zhang riggi
"ZHANG_TARGETS_OF_EWSR1_FLI1_FUSION" "RIGGI_EWING_SARCOMA_PROGENITOR_UP"
ZHANG_TARGETS_OF_EWSR1_FLI1_FUSION
RIGGI_EWING_SARCOMA_PROGENITOR_UP
These gene sets come from the C2 gene set collection from MSigDB.
Let’s retrieve them using msigdbr()
.
ewing_gene_sets_df <- msigdbr(species = "Homo sapiens",
category = "C2",
subcategory = "CGP") |>
dplyr::filter(gs_name %in% ewing_gene_set_names)
AUCell
uses gene sets in a particular format that comes
from the GSEABase
package. We need to create a
GeneSetCollection
.
ewing_gene_set_collection <- ewing_gene_set_names |>
purrr::map(
# For each gene set
\(gene_set_name) {
ewing_gene_sets_df |>
# Subset to the rows in that gene set
dplyr::filter(gs_name == gene_set_name) |>
# Grab the Ensembl gene identifiers
dplyr::pull(ensembl_gene) |>
# Create a GeneSet object
GeneSet(setName = gene_set_name,
geneIdType = ENSEMBLIdentifier())
}
) |>
# Turn the list of GeneSet objects into a GeneSet collection
GeneSetCollection()
sce <- readr::read_rds(sce_file)
The AUCell
functions takes an expression matrix with
genes as rows and cells as column. We can extract a counts matrix in
sparse format for use with AUCell
.
# Extract counts matrix
counts_matrix <- counts(sce)
There may be genes in our gene set that do not appear in the
SingleCellExperiment object. We can remove them using the
subsetGeneSets()
function.
# Remove genes from gene sets if they are not in the SCE
ewing_gene_set_collection <- subsetGeneSets(ewing_gene_set_collection,
rownames(counts_matrix))
AUCell relies on ranking genes from highest to lowest expression value to calculate the AUC. The AUC is the area under the recovery curve, which captures the number of genes in a gene set that are present in the rankings above some threshold (i.e., it is the area under the curve to the left of this gene rank). By default, the top 5% of genes are used as the threshold.
Some genes will not be detected (i.e., have 0 counts). Genes can also have the same expression level (i.e., ties). These undetected genes and ties will be randomly ordered in our ranking. To make our rankings – and therefore results – reproducible, we will set a seed.
set.seed(2024)
The first step in AUCell is to rank genes for each cell from highest
to lowest expression value. We can do this using the
AUCell_buildRankings()
function, which will output a
visualization showing the distribution of the number of genes detected
in the cells in our SingleCellExperiment object.
cell_rankings <- AUCell_buildRankings(counts_matrix)
Quantiles for the number of genes detected by cell:
(Non-detected genes are shuffled at the end of the ranking. Keep it in mind when choosing the threshold for calculating the AUC).
min 1% 5% 10% 50% 100%
213.00 465.76 795.80 1102.20 2238.00 6608.00
The AUCell authors recommend making sure most cells have at least the number of genes we will use as the max rank to calculate the AUC.
The AUC max rank value tells AUCell the cutoff in the gene rankings to use for calculating AUC; we will visualize this curve and max rank in just a moment. If we picked a max rank higher than the number of genes detected in most cells, the non-detected genes that are randomly ordered would play an outsized role in our AUC values.
By default, the max rank is the top 5% highest expressed genes. We can calculate the default max rank by taking into account the number of genes.
nrow(cell_rankings) * 0.05
[1] 3015.95
This number is probably too high, given the distribution of the
number of genes detected by cell we visualized with
AUCell_buildRankings()
.
What if we chose a 1% threshold?
nrow(cell_rankings) * 0.01
[1] 603.19
That is probably a more reasonable choice for this dataset.
We can use a function called ceiling()
to round this and
save it to a variable for later use.
auc_max_rank <- ceiling(nrow(cell_rankings) * 0.01)
The AUC values we get out of AUCell are the area under a recovery curve and estimate the proportion of genes in the gene set that are highly expressed (i.e., highly ranked).
Let’s plot the recovery curve for a cell with high AUC and a cell
with low AUC to get a better intuition about AUC values. Earlier, we
loaded a custom function we adapted from the
AUCell vignette called plot_recovery_curve()
with
source()
.
First, we’ll start with a cell with a high AUC. We picked this barcode ahead of time when we wrote the notebook.
plot_recovery_curve(cell_rankings,
ewing_gene_set_collection,
gene_set_name = "ZHANG_TARGETS_OF_EWSR1_FLI1_FUSION",
barcode = "CTGAGCGGTCTTTATC",
auc_max_rank = auc_max_rank) # 1% threshold
The x-axis is the gene ranks for all genes. The y-axis is the number of genes in the signature at a given point in the gene ranking – the line will rise when a gene in the gene set is encountered in the ranking from highest to lowest. The AUC is the area under this recovery curve at the max rank threshold chosen for this dataset.
Now, let’s look at an example with a low AUC.
plot_recovery_curve(cell_rankings,
ewing_gene_set_collection,
gene_set_name = "ZHANG_TARGETS_OF_EWSR1_FLI1_FUSION",
barcode = "AGATAGAGTCACAATC",
auc_max_rank = auc_max_rank) # 1% threshold
Far fewer genes in the gene set are ranked above the threshold, yielding a lower AUC value.
Once we have the rankings, we can calculate the AUC scores for both
gene sets in all cells with the AUCell_calcAUC()
function.
cell_auc <- AUCell_calcAUC(geneSets = ewing_gene_set_collection,
rankings = cell_rankings,
aucMaxRank = auc_max_rank)
This function returns an aucellResults
object.
str(cell_auc)
Formal class 'aucellResults' [package "AUCell"] with 6 slots
..@ nGenesDetected : num(0)
..@ colData :Formal class 'DFrame' [package "S4Vectors"] with 6 slots
.. .. ..@ rownames : chr [1:4277] "AAGCATCTCGTTGCCT" "CTGTATTTCCAAGAGG" "CAGCAGCTCCTCAGAA" "AGGTCTAAGGGACTGT" ...
.. .. ..@ nrows : int 4277
.. .. ..@ elementType : chr "ANY"
.. .. ..@ elementMetadata: NULL
.. .. ..@ metadata : list()
.. .. ..@ listData : Named list()
..@ assays :Formal class 'SimpleAssays' [package "SummarizedExperiment"] with 1 slot
.. .. ..@ data:Formal class 'SimpleList' [package "S4Vectors"] with 4 slots
.. .. .. .. ..@ listData :List of 1
.. .. .. .. .. ..$ AUC: num [1:2, 1:4277] 0.1103 0.1206 0.0352 0.0538 0.1671 ...
.. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. .. .. .. ..$ gene sets: chr [1:2] "ZHANG_TARGETS_OF_EWSR1_FLI1_FUSION" "RIGGI_EWING_SARCOMA_PROGENITOR_UP"
.. .. .. .. .. .. .. ..$ cells : chr [1:4277] "AAGCATCTCGTTGCCT" "CTGTATTTCCAAGAGG" "CAGCAGCTCCTCAGAA" "AGGTCTAAGGGACTGT" ...
.. .. .. .. ..@ elementType : chr "ANY"
.. .. .. .. ..@ elementMetadata: NULL
.. .. .. .. ..@ metadata : list()
..@ NAMES : chr [1:2] "ZHANG_TARGETS_OF_EWSR1_FLI1_FUSION" "RIGGI_EWING_SARCOMA_PROGENITOR_UP"
..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
.. .. ..@ rownames : NULL
.. .. ..@ nrows : int 2
.. .. ..@ elementType : chr "ANY"
.. .. ..@ elementMetadata: NULL
.. .. ..@ metadata : list()
.. .. ..@ listData : Named list()
..@ metadata : list()
It can be much more convenient to work with this in a tabular format.
# Extract AUC
auc_df <- cell_auc@assays@data$AUC |>
# Transpose
t() |>
# Convert to data frame
as.data.frame() |>
# Make the barcodes a column
tibble::rownames_to_column("barcodes")
# Look at first few rows
head(auc_df)
AUCell can assign cells as having an active gene set or not by
picking a threshold automatically. We’ll explore these in a later plot,
but for now, let’s calculate the threshold and assign cells with
AUCell_exploreThresholds()
.
auc_assignments <- AUCell_exploreThresholds(cell_auc,
plotHist = FALSE,
assignCells = TRUE)
We’re going to plot the distribution of AUC values with
ggplot2
, so we will want the AUC values in a longer
format.
auc_plotting_df <- auc_df |>
tidyr::pivot_longer(!barcodes,
names_to = "gene_set",
values_to = "auc") |>
dplyr::mutate(
# Create a new logical column called assigned
assigned = dplyr::case_when(
# For Zhang gene set rows, set to TRUE when the barcode is in the
# assignment list
gene_set == ewing_gene_set_names[["zhang"]] &
barcodes %in% auc_assignments[[ewing_gene_set_names[["zhang"]]]]$assignment ~ TRUE,
# For Riggi gene set rows, set to TRUE when the barcode is in the
# assignment list
gene_set == ewing_gene_set_names[["riggi"]] &
barcodes %in% auc_assignments[[ewing_gene_set_names[["riggi"]]]]$assignment ~ TRUE,
# Otherwise, set to FALSE
.default = FALSE
)
)
auc_plotting_df
To draw vertical lines representing the automatically chosen threshold, we can create a separate data frame.
auc_threshold_df <- data.frame(
gene_set = ewing_gene_set_names,
# Grab thresholds associated with each gene set from assignements object
threshold = c(auc_assignments[[ewing_gene_set_names["zhang"]]]$aucThr$selected,
auc_assignments[[ewing_gene_set_names["riggi"]]]$aucThr$selected)
)
auc_threshold_df
Now let’s make a density plot, plotting the density of the assigned and unassigned cells separately and drawing a vertical line for the threshold.
auc_plotting_df |>
ggplot2::ggplot(
ggplot2::aes(
x = auc, # AUC values
color = assigned, # Group by assignment
fill = assigned, # Group by assignment
)
) +
ggplot2::geom_density(alpha = 0.2) +
# Draw a vertical dotted line showing the threshold for each gene set
ggplot2::geom_vline(data = auc_threshold_df,
mapping = ggplot2::aes(xintercept = threshold),
lty = 2) +
# Plot each gene set in its own facet
ggplot2::facet_grid(cols = ggplot2::vars(gene_set)) +
# Use a built-in theme
ggplot2::theme_bw()
For these particular gene sets, the AUC values appear to be bimodally distributed, and we can easily identify cells where the genes are highly expressed.
Let’s write this table to the output file.
auc_plotting_df |>
readr::write_tsv(output_file)
colData
We can also add the AUC values back into the SingleCellExperiment for
convenience, e.g., for plotting. We’ll add it to the existing
colData
.
First, let’s rename the gene set columns to something more easily typed.
auc_df <- auc_df |>
# Use shorter names
dplyr::rename(zhang_auc = ewing_gene_set_names[["zhang"]],
riggi_auc = ewing_gene_set_names[["riggi"]])
And join it to the existing colData
.
# Extract the existing colData, and left join it with the AUC values by the
# barcodes
coldata_df <- colData(sce) |>
as.data.frame() |>
dplyr::left_join(
auc_df,
by = "barcodes"
)
Now, we’re ready to add it back to the object.
# We need to save this as a DataFrame
colData(sce) <- DataFrame(
coldata_df,
row.names = colData(sce)$barcodes
)
We can use the plotUMAP()
function from the
scater
package to plot a UMAP with the points colored by
the AUC value
scater::plotUMAP(sce, colour_by = "zhang_auc") +
# Use the gene set name, replacing underscores with spaces
ggplot2::ggtitle(stringr::str_replace_all(ewing_gene_set_names[["zhang"]],
"\\_",
" "))
Let’s color the points by the AUC values for the other gene set.
scater::plotUMAP(sce, colour_by = "riggi_auc") +
ggplot2::ggtitle(stringr::str_replace_all(ewing_gene_set_names[["riggi"]],
"\\_",
" "))
We would want to do something more formal to confirm, but it seems like the same cells have high AUC values for both gene sets!
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] msigdbr_7.5.1 GSEABase_1.66.0
[3] graph_1.82.0 annotate_1.82.0
[5] XML_3.99-0.16.1 AnnotationDbi_1.66.0
[7] AUCell_1.26.0 SingleCellExperiment_1.26.0
[9] SummarizedExperiment_1.34.0 Biobase_2.64.0
[11] GenomicRanges_1.56.0 GenomeInfoDb_1.40.0
[13] IRanges_2.38.0 S4Vectors_0.42.0
[15] BiocGenerics_0.50.0 MatrixGenerics_1.16.0
[17] matrixStats_1.3.0 optparse_1.7.5
loaded via a namespace (and not attached):
[1] jsonlite_1.8.8 magrittr_2.0.3
[3] ggbeeswarm_0.7.2 farver_2.1.1
[5] rmarkdown_2.26 fs_1.6.4
[7] zlibbioc_1.50.0 vctrs_0.6.5
[9] memoise_2.0.1 DelayedMatrixStats_1.26.0
[11] htmltools_0.5.8.1 S4Arrays_1.4.0
[13] BiocNeighbors_1.22.0 SparseArray_1.4.0
[15] sass_0.4.9 bslib_0.7.0
[17] htmlwidgets_1.6.4 plotly_4.10.4
[19] cachem_1.0.8 lifecycle_1.0.4
[21] pkgconfig_2.0.3 rsvd_1.0.5
[23] Matrix_1.7-0 R6_2.5.1
[25] fastmap_1.1.1 GenomeInfoDbData_1.2.12
[27] digest_0.6.35 colorspace_2.1-0
[29] scater_1.32.0 irlba_2.3.5.1
[31] RSQLite_2.3.6 beachmat_2.20.0
[33] labeling_0.4.3 fansi_1.0.6
[35] httr_1.4.7 abind_1.4-5
[37] compiler_4.4.1 bit64_4.0.5
[39] withr_3.0.0 BiocParallel_1.38.0
[41] viridis_0.6.5 DBI_1.2.2
[43] highr_0.10 R.utils_2.12.3
[45] MASS_7.3-60.2 DelayedArray_0.30.0
[47] tools_4.4.1 vipor_0.4.7
[49] beeswarm_0.4.0 R.oo_1.26.0
[51] glue_1.7.0 nlme_3.1-164
[53] grid_4.4.1 generics_0.1.3
[55] gtable_0.3.5 tzdb_0.4.0
[57] R.methodsS3_1.8.2 tidyr_1.3.1
[59] data.table_1.15.4 hms_1.1.3
[61] BiocSingular_1.20.0 ScaledMatrix_1.12.0
[63] utf8_1.2.4 XVector_0.44.0
[65] ggrepel_0.9.5 pillar_1.9.0
[67] stringr_1.5.1 babelgene_22.9
[69] vroom_1.6.5 splines_4.4.1
[71] dplyr_1.1.4 getopt_1.20.4
[73] lattice_0.22-6 survival_3.5-8
[75] bit_4.0.5 tidyselect_1.2.1
[77] Biostrings_2.72.0 scuttle_1.14.0
[79] knitr_1.46 gridExtra_2.3
[81] xfun_0.43 mixtools_2.0.0
[83] stringi_1.8.3 UCSC.utils_1.0.0
[85] lazyeval_0.2.2 yaml_2.3.8
[87] evaluate_0.23 codetools_0.2-20
[89] kernlab_0.9-33 tibble_3.2.1
[91] cli_3.6.2 xtable_1.8-4
[93] segmented_2.1-3 munsell_0.5.1
[95] jquerylib_0.1.4 Rcpp_1.0.12
[97] png_0.1-8 parallel_4.4.1
[99] ggplot2_3.5.1 readr_2.1.5
[ reached getOption("max.print") -- omitted 9 entries ]