This notebook will demonstrate how to:
SingleR
SingleR
classification to groups of cellsIn this notebook, we will attempt to annotate cell types to each of the cells in a dataset, using some of the automated tools that are available within the Bioconductor universe.
Much of the material in this notebook is directly inspired by, and draws heavily on, material presented in the book Orchestrating Single Cell Analysis with Bioconductor.
The data we will use for this notebook is derived from a 10x
Genomics dataset of human peripheral blood mononuclear cells
(PBMCs). These data include both single cell RNA-seq counts and
quantification of antibody-derived tags (ADTs) performed by sequencing
short DNA barcodes attached to specific antibodies. This type of ADT
sequencing with single cells is commonly known as CITE-seq, after the
protocol developed by Stoeckius et al.
(2017).
The antibodies used here are the The
TotalSeq™-B Human TBNK Cocktail, a set of antibodies designed to
react with immune cell surface markers.
The data here have already been filtered, normalized, and had dimension reductions calculated for the single-cell RNA-seq data. The ADT data has also been separately filtered and normalized. For details about how to perform these tasks with data that has been processed with Cell Ranger, you may want to look at the “Integrating with protein abundance” chapter of OSCA.
The processed gene expression and ADT data were saved into a combined
SingleCellExperiment
(SCE) object, and we will start with
that object for our exploration here.
To start, we will load some of the libraries we will need later, and set a random number seed for reproducibility.
# Load libraries
library(ggplot2) # plotting functions
library(SingleCellExperiment) # Bioconductor single-cell data class
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats
Attaching package: 'MatrixGenerics'
The following objects are masked from 'package:matrixStats':
colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
colWeightedMeans, colWeightedMedians, colWeightedSds,
colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
rowWeightedSds, rowWeightedVars
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':
findMatches
The following objects are masked from 'package:base':
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':
rowMedians
The following objects are masked from 'package:matrixStats':
anyMissing, rowMedians
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
# Setting the seed for reproducibility
set.seed(12345)
As mentioned, our input file here is a single normalized and
processed SCE object, stored as an rds
file. That should be
all we need to read in!
Our output will be a table of per-cell information, which will
include the cell type assignments we have made throughout this notebook.
We aren’t planning any significant modifications of the underlying data,
so we won’t bother re-saving the whole SCE object as a new
.rds
file this time.
# directory for the input data
data_dir <- file.path("data",
"PBMC-TotalSeqB",
"normalized")
# the input file itself
sce_file <- file.path(data_dir,
"PBMC_TotalSeqB_normalized_sce.rds")
# A directory to store outputs
analysis_dir <- file.path("analysis",
"PBMC-TotalSeqB")
# Create directory if it doesn't exist
fs::dir_create(analysis_dir)
# output table path
cellinfo_file <- file.path(analysis_dir,
"PBMC_TotalSeqB_cellinfo.tsv")
SingleCellExperiment
Now that the preliminary setup is out of the way, we can get started.
First we will read in the SingleCellExperiment
from the
input file we defined earlier.
# read in the SCE file
sce <- readr::read_rds(sce_file)
# print a summary of the SCE
sce
class: SingleCellExperiment
dim: 36601 7924
metadata(1): Samples
assays(2): counts logcounts
rownames(36601): ENSG00000243485 ENSG00000237613 ... ENSG00000278817
ENSG00000277196
rowData names(3): ID Symbol Type
colnames: NULL
colData names(13): Sample Barcode ... prob_compromised sizeFactor
reducedDimNames(2): PCA UMAP
mainExpName: Gene Expression
altExpNames(1): ADT
This should look similar to the SCE objects that we have seen before,
containing counts
and logcounts
assays where
each cell is a column and each row is a gene. We also have some of the
rowData
, colData
and reduced dimension
matrices that we have seen before.
But where are the data from the ADTs? We wouldn’t necessarily want those stored in the main data matrices, as the characteristics of ADT barcode data is going to be quite different from gene expression data.
To keep the ADT data separate from the RNA gene expression data, we
have split this data off into an alternative experiment
(altExp
) slot. You can see the name of this
altExp
on the line altExpNames
above. We
could have more than one type of alternative experiment (such
as spike-in or ATAC-seq), but in this case, just the one.
To access the contents of the altExp
slot, we can use
the altExp()
function. Let’s look at what we have in that
slot:
# print a summary of the 'ADT' altExp
altExp(sce, "ADT")
class: SingleCellExperiment
dim: 10 7924
metadata(1): Samples
assays(2): counts logcounts
rownames(10): CD3 CD4 ... CD45 IgG1
rowData names(3): ID Symbol Type
colnames: NULL
colData names(1): sizeFactor
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
It is another SingleCellExperiment
! Inception! Let’s
look at that embedded SCE more closely.
The first thing to note is that this altExp
has the same
number of columns as did the main SCE object. Those corresponded to the
individual cells before, and still do!
There are only 10 rows, however, and these correspond to the ADTs
that were assayed by this particular experiment. Just as we did with the
full SCE, we can use rowData()
to view the table containing
metadata associated with each of these rows. We’ll add the
altExp()
function to point it to the embedded object we are
interested in. Since there is only one altExp
, we don’t
need the second (name) argument ("ADT"
) that we used above;
the default behavior of altExp()
is to just give us the
first altExp
, and that is the one (and only) that we
need.
# What proteins were assayed?
rowData(altExp(sce))
DataFrame with 10 rows and 3 columns
ID Symbol Type
<character> <character> <character>
CD3 CD3 CD3 Antibody Capture
CD4 CD4 CD4 Antibody Capture
CD8 CD8 CD8 Antibody Capture
CD11c CD11c CD11c Antibody Capture
CD14 CD14 CD14 Antibody Capture
CD16 CD16 CD16 Antibody Capture
CD19 CD19 CD19 Antibody Capture
CD56 CD56 CD56 Antibody Capture
CD45 CD45 CD45 Antibody Capture
IgG1 IgG1 IgG1_control_TotalSeqC Antibody Capture
You can see here the names and symbols of the tags used, along with
the designation that all have an “Antibody Capture” type (as opposed to
“Gene Expression” for the RNA data). One you might note looks different
is the IgG1
control, which is actually a mouse antibody
used as a negative control.
While dimension reduction was performed on this data, we have not yet performed any clustering.
Let’s assign some clusters to our cells, using graph-based clustering and default parameters, taking as input the PCA matrix that was previously calculated. Note that this PCA matrix and the UMAP built from it were derived from the gene expression data, so the clustering is going to reflect the gene expression data only. While we have the ADT data, it is not being used for this stage of the analysis.
# perform clustering
nn_clusters <- bluster::clusterRows(
# PCA input
reducedDim(sce, "PCA"),
# graph clustering & parameters
bluster::NNGraphParam()
)
# add clusters to colData
sce$nn_cluster <- nn_clusters
Now we can plot the clusters we have identified with
scater::plotUMAP()
. This is a shortcut for
scater::plotReducedDim(dimred = "UMAP", ...)
, which can
save us a lot of typing as we do this repeatedly!
# plot clusters
scater::plotUMAP(sce, color_by = "nn_cluster") +
# rename the legend
guides(color = guide_legend(title = "Cluster"))
But what are these clusters, really? Do they correspond to particular cell types that we are interested in?
Does it bother you that we just used the default nearest-neighbor graph clustering parameters? Do you know what those were?
The first way we will identify cell types of individual cells is to use the ADT normalized counts. These antibody markers were (hopefully) chosen for their relevance to the sequenced cell population.
The first marker we will look at is CD3
, which is a
protein complex that is found on the surface of T cells. We can again
use the plotUMAP()
function to color cells by
CD3
ADT levels.
Note that this function can plot data from the colData
table (as we used it above when plotting clusters), in the main gene
expression matrix (as we used it in the previous notebook), AND
in altExp
tables and matrices! So to color by the ADT
levels (as normalized in the logcounts
matrix) we only need
to provide the tag name that we want to plot in the
color_by
argument.
# plot CD3 expression
scater::plotUMAP(sce, color_by = "CD3")
It appears that we have a number of potential T cells down in the lower left!
Let’s look at a couple of other markers to try to break those up more specifically.
Two other markers of relevance to the T cells are CD4
and CD8
. The CD4
complex is present in helper
T cells (hence their other common name, CD4+ T cells). By contrast, the
CD8
complex is found on killer T cells (CD8+ cells).
Let’s plot the ADT results for those two markers as well below:
# plot CD4 marker
scater::plotUMAP(sce,
color_by = "CD4")
# plot CD8 marker
scater::plotUMAP(sce,
color_by = "CD8")
Plotting the levels of the ADTs provides a nice visual representation, but what we really want to do is to turn these values into specific cell-type assignments for each cell. Such classification could be considered as analogous to a cell-sorter assay, where we would set up some rules to look at a few markers for each cell and use those to assign a cell type. The simplest type of rule might be one where we use a threshold to call a marker as present or absent, and then use the presence of a marker to indicate a specific cell type.
To do this, we will need to make some decisions, such as the thresholds we should use to determine whether a cell is or is not expressing a particular marker. In general, markers that are useful for this cell-typing approach will have a bimodal distribution of expression levels which can be used to separate the population into two groups of cells. One group of cells will have only a background level signal for each marker (due to non-specific binding or other factors), while the other group, those that express the protein, will have a much higher level of binding and higher counts.
To assess whether the ADTs we have chosen have a useful distribution of expression values, and to identify thresholds we might use, we would like to plot each ADT tag. To do this, we will pull out the expression values for these markers from the SCE object and do some data wrangling.
We are interested in the normalized counts for the ADT tags, which
are stored in the logcounts
assay of the
altExp
. If you recall, this matrix is stored with the
columns as cells and rows as markers, but we really want it with each
row a cell and each column a marker. So we will first transpose the
data, then convert it to a data frame for our next steps. Because the
SCE object stores the assay data matrices in a specialized format, we
have to do one extra step convert it first to a “regular” R matrix or R
won’t know how to convert it to a data frame.
# convert logcounts data to a data frame
adt_df <- logcounts(altExp(sce)) |>
t() |> # transpose
as.matrix() |> # convert to matrix
as.data.frame() # convert to data frame
# view the data frame
head(adt_df)
If we just wanted to plot one of these tags, we could do so right
away, but with a bit more data wrangling, we can convert these results
into a “tidier” format, that will allow us to take full advantage of
tidyverse
tools! In particular, it will let us plot them
all at once with ggplot2
faceting.
Right now the data is in a “wide” format, such that each column is a different tag. But the data in all of the columns is the same type, and measures something similar: the normalized count of an ADT. One could even argue that each row contains 10 different observations, where the “tidy” data ideal, as espoused by Wickham (2014), requires a single observation per row, a “long” format. This long format will have one column that tells us which ADT was measured and a second column with the measurement value itself.
We can perform this conversion using the tidyr::pivot_longer()
function, which allows us to convert our data frame with one column per
tag into a data frame with separate columns for the tag id
(ADT
) and the expression value (logcount
).
Following conversion, we will filter to just the ADTs that we care
about.
adt_df_long <- adt_df |>
# pivot to long format
tidyr::pivot_longer(
everything(), # use all columns
names_to = "ADT", # convert row names to a column called "ADT"
values_to = "logcount" # name the value column "logcount"
) |>
# filter to tags we are interested in
dplyr::filter(ADT %in% c("CD3", "CD4", "CD8"))
# look at the resulting df
head(adt_df_long)
Now we can make a density plot with ggplot2
for all
three ADTs we are interested in at once.
# plot logcounts by ADT
ggplot(adt_df_long, aes(x = logcount, fill = ADT)) +
geom_density() + # density plot
facet_grid(rows = vars(ADT)) + # facet by ADT
theme_bw() + # nicer theme
theme(legend.position = "none") # no legend needed
These look pretty good! Each of these markers has a bimodal distribution: A lower peak consisting of cells that do not express the protein but which still have a background level of antibody binding, and an upper peak of cells that do express the protein of interest. The background level does vary by antibody marker, so we will need a different threshold value for each one.
We can now use the values from these plots to construct a set of rules to classify the T cells. We will do this using the “wide” data frame from earlier.
The thresholds we are using here were identified just “by eye”, so
this is not a particularly principled method of cell type assignment,
but it can be fairly effective. Here we are assigning only three cell
types; cells that do not fit any of these criteria will be set as
NA
.
# add cell type column by thresholding
adt_df <- adt_df |>
dplyr::mutate(
celltype = dplyr::case_when(
CD3 > 6.7 & CD4 > 8 ~ "CD4+ T-cell",
CD3 > 6.7 & CD8 > 6 ~ "CD8+ T-cell",
CD3 > 6.7 ~ "T-cell"
)
)
adt_df
Now we will want to add the cell types we have assigned back to our
original SCE object. We can do that by defining a new column name,
threshold_celltype
that will be added to the
colData
object. Creating and assigning values to this
column can be done with the $
shortcut, and then we can
plot our results with the plotUMAP()
function as
before.
sce$threshold_celltype <- adt_df$celltype
scater::plotUMAP(sce,
color_by = "threshold_celltype") +
guides(color = guide_legend(title = "Cell type"))
How did we do?
Note that while we applied this technique to assign cell types using the ADT data, we could use the same type of procedure using gene expression data alone, or a combination of gene expression data and tag data.
However, what we did here was very ad-hoc and quite manual! We didn’t calculate any statistics, and we had to look at every tag we were interested in to pick thresholds. A different dataset might have different background levels, which would require different thresholds.
While this technique might be good for some simple experiments, and can be useful for manual curation, it might not translate well to more complex datasets with multiple samples. We also looked at each marker separately, which might not be the most efficient or robust method of analysis.
For a more principled approach that allows identification of cell
types by looking at the expression of sets of genes that are known to
characterize each cell type, you might look at the AUCell
package. For more on that method, the OSCA section Assigning
cell labels from gene sets is a very good reference.
SingleR
An alternative approach to using known marker genes for classification is to instead classify cells by comparing them to a reference expression dataset. To do this, we will find a well-curated gene expression dataset that contains samples with known cell types. We can then train a model based on this dataset and look at each of the cells in our new dataset to determine which (if any) of the known cell types has the most similar expression pattern. The details of how such a model may be constructed and trained will vary by the specific method, but this overall approach is widely applied.
For this section, we will focus on the SingleR
package
and its methods, which are described in detail in The SingleR
Book.
Selecting a reference dataset is one of the more critical steps for this enterprise. At the most basic level, if the reference dataset does not include the types of cells that we expect to see in our sample, it won’t be useful. So we will want a reference dataset that has as many as possible of the cell types that we expect to find in our dataset, at a level of granularity that aligns with our goals.
For SingleR
that reference data can be from bulk RNA
sequencing or from other single-cell experiments. SingleR
is also fairly robust to the method used for gene expression
quantification, which means that we can use either RNA-seq datasets or
microarrays, if those are more readily available.
One convenient source of cell reference data is the
celldex
package, which is what we will use here. This
package includes functions to download a variety of well-annotated
reference datasets in a common format.
For more information on the datasets available, you will want to refer
to the
celldex
summary vignette.
We will start by using a reference dataset of sorted immune cells from GSE107011 (Monaco et al. 2019). This particular reference was chosen because it is well-suited to PBMC datasets, with a good level of granularity.
The celldex
functions also have a convenient option to
convert gene symbols to Ensembl ids, which we will use here so that our
reference data uses the same gene identifiers as the single-cell
data.
# Bioconductor "Hub" packages provide the option to cache
# downloads, but the interactive prompt can be annoying
# when working with notebooks.
# These options disable the prompt by giving permission
# to create the cache automatically
ExperimentHub::setExperimentHubOption("ASK", FALSE)
AnnotationHub::setAnnotationHubOption("ASK", FALSE)
# Get Monaco 2019 data from celldex with Ensembl ids.
monaco_ref <- celldex::MonacoImmuneData(ensembl = TRUE)
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'HDF5Array'
downloading 1 resources
retrieving 1 resource
loading from cache
require("ensembldb")
What is this monaco_ref
object?
monaco_ref
class: SummarizedExperiment
dim: 46077 114
metadata(0):
assays(1): logcounts
rownames(46077): ENSG00000121410 ENSG00000268895 ... ENSG00000159840
ENSG00000074755
rowData names(0):
colnames(114): DZQV_CD8_naive DZQV_CD8_CM ... G4YW_Neutrophils
G4YW_Basophils
colData names(3): label.main label.fine label.ont
A SummarizedExperiment
is very similar to a
SingleCellExperiment
, except rather than having one column
per cell, each column is a sample. Otherwise, the components
are very similar: each row is still a gene, for example, and additional
data about the samples are stored in the colData
. In fact,
the SingleCellExperiment
object is derived from a
SummarizedExperiment
, with some extra slots that are more
relevant to single-cell data.
What information do we have for the samples?
colData(monaco_ref)
DataFrame with 114 rows and 3 columns
label.main label.fine label.ont
<character> <character> <character>
DZQV_CD8_naive CD8+ T cells Naive CD8 T cells CL:0000900
DZQV_CD8_CM CD8+ T cells Central memory CD8 T.. CL:0000907
DZQV_CD8_EM CD8+ T cells Effector memory CD8 .. CL:0000913
DZQV_CD8_TE CD8+ T cells Terminal effector CD.. CL:0001062
DZQV_MAIT T cells MAIT cells CL:0000940
... ... ... ...
G4YW_NK NK cells Natural killer cells CL:0000623
G4YW_pDC Dendritic cells Plasmacytoid dendrit.. CL:0000784
G4YW_mDC Dendritic cells Myeloid dendritic ce.. CL:0000782
G4YW_Neutrophils Neutrophils Low-density neutroph.. CL:0000096
G4YW_Basophils Basophils Low-density basophils CL:0000043
There are three main columns for the sample data:
label.main
is a more general cell type
assignment.
label.fine
is a fine-level cell type with more
specific labels. The exact level of granularity of these
main
and fine
designations (and indeed the
label names themselves) will vary among datasets, so it is important to
look at the reference to see whether it is suitable for your
application.
label.ont
is a standardized Cell Ontology
identifier. Using the cell ontology can allow for more complex
representations of the relationships among different cell types, but
investigating that is beyond the scope of this workshop.
Another component we would like to explore is how many of each of
these cell types we have in the reference dataset. A bit of quick
dplyr
wrangling can give us the answer.
colData(monaco_ref) |>
as.data.frame() |>
dplyr::count(label.main, label.fine)
This is pretty good! Most cell types have 4 replicates, which is more replicates than we often find.
SingleR
do?As mentioned earlier, SingleR
builds a model from a set
of training data, and then uses that model to classify cells (or groups
of cells) in new datasets.
SingleR
works by first identifying a set of marker genes
that can be used to differentiate among the cell types in the reference
dataset. It does this by performing pairwise comparisons among all of
the cell types, and retaining the top set of genes differentiating each
pair. The idea is that this set of genes will be the most informative
for differentiating cell types.
Then, for each cell, SingleR
calculates the Spearman
correlation between expression of that cell and each cell type (using
the only the genes chosen earlier). Notably, this is a non-parametric
correlation, so the scaling and normalization that we apply (or don’t)
should not matter! Note that if you used a single-cell technology that
produces full-length transcripts (i.e., SMART-seq), you will probably
want to convert your counts to Transcripts per Million (TPM), to allow
more consistent ranking among transcripts of different lengths.
The reference cell type with the highest correlation is then chosen as the cell type assignment for that cell. If there are multiple cell types with high scores, an optional fine-tuning step repeats the process using only the most relevant genes for those cell types.
SingleR
For our first run, we will do the marker gene selection (training)
and classification in a single step, using the convenience function
SingleR::SingleR()
. For this we need only supply three main
arguments: Our SCE object, a reference matrix (here in
SummarizedExperiment
format), and the labels for each of
the samples in the reference that we want to use. We also need to be
sure that our sample and the reference data use the same gene IDs, which
is why we requested the Ensembl IDs when getting the reference
dataset.
Because this function is doing many repetitive calculations (lots of
correlations!), we can speed it up by including the BPPARAM
argument. This is a common argument in Bioconductor
packages where BP
stands for the BiocParallel
package, which provides multiprocessing capabilities to many
Bioconductor functions. In this case, we will use the argument
BiocParallel::MulticoreParam(4)
to specify we want to use
local multicore processing with 4 “workers”.
# calculate SingleR results in one step
singler_result <- SingleR::SingleR(
sce, # our query SCE
ref = monaco_ref, # reference dataset
labels = monaco_ref$label.main, # reference labels to use
BPPARAM = BiocParallel::MulticoreParam(4) # multiprocessing
)
SingleR
provides a few nice visualizations for
evaluating the statistics it calculated and the assignments it makes.
One is a heatmap of the scores for each cell, arranged by the cell type
that was assigned to each. This is created with the
SingleR::plotScoreHeatmap()
function.
SingleR::plotScoreHeatmap(singler_result)
We can also pull out individual components of the results object for
plotting in the context of our input SCE object. Here we will save the
pruned labels (where low-quality assignments have been given an
NA
label), storing them back in our SCE object
(specifically to a new column of the colData
table).
sce$celltype_main <- singler_result$pruned.labels
Now we can plot the cell type assignments onto our UMAP to see how they compare to the patterns we saw there before.
scater::plotUMAP(sce, color_by = "celltype_main")
Annoyingly, the NA
and T cells
labels are
quite close in color, and the scater
and
SingleR
packages don’t agree on color choices. Luckily,
since plotUMAP()
returns a ggplot
object, we
can modify the color palette using ggplot2
functions. Still
annoyingly, however, when we change the palette, the legend title
defaults to the uninformative name "colour_by"
, so we’ll
also specify a matching legend title with our new color palette.
scater::plotUMAP(sce, color_by = "celltype_main") +
scale_color_brewer(name = "Cell type", # legend title
palette = "Dark2", # color palette
na.value = "gray80") # use light gray for NA values
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
We seem to have a pretty good set of cell type assignments, with most falling into groupings consistent with what we see in the UMAP plot.
We can thank the fact that this is a PBMC sample and that we have a good reference dataset for these cell types for the cleanliness of this plot. Quite often with other kinds of samples (especially cancer cells!) things will be much less clean!
We can also look to see how the cell type assignments are distributed
using the base R function table()
. Since we like to keep
track of the cells that ended up as NA
in the pruned
labels, we will include the useNA = "ifany"
argument.
table(singler_result$pruned.labels, useNA = "ifany")
B cells CD4+ T cells CD8+ T cells Dendritic cells Monocytes
692 1291 904 232 3622
NK cells Progenitors T cells <NA>
345 47 646 145
In the previous cell typing, we used the label.main
column, but we also had label.fine
, so let’s use that to
explore the dataset in a bit more detail.
We will also take this time to dive a bit deeper into the steps that
SingleR
performed. As mentioned, the first step is training
the model, during which we identify the genes that will be used for the
correlation analysis later. While this step is not particularly slow, if
we were classifying multiple samples, we would not want to have to
repeat it for every sample.
To do the training, we will use the trainSingleR()
function. For this we will start with our reference and the labels we
want to train the model with.
We can then specify the method used to select the genes that will be
used for classification. The default method is "de"
, which
performs a differential expression analysis for each pair of labels, but
we could also use "sd"
to select the genes which are most
variable across labels, or "all"
to use all genes. If we
want to get really fancy, we could even provide a specific list of genes
to use.
We should note here that the reference dataset for
SingleR
does not need to be from a compendium like
celldex
! If you have any well-classified dataset that you
want to use as a reference, you can, as long as you can create a gene by
sample expression matrix and a vector of cell types for each sample. You
will want to ensure that the cell types you expect to see in your sample
are present in the reference dataset, and data should be normalized, but
otherwise the method can be quite flexible. You can even use a
previously-annotated SingleCellExperiment
as a reference
for a new dataset. For more details about custom references, see the OSCA
chapter on cell type annotation
We do want to be sure that the genes selected for the model will be
among those present in our SCE object, so we will use the
restrict
argument with a vector of the genes in our SCE.
This step would happen automatically with the
SingleR::SingleR()
function, but we need to add it manually
for this use case.
# build fine model
singler_finemodel <- SingleR::trainSingleR(
monaco_ref, # reference dataset
labels = monaco_ref$label.fine, # labels for training dataset
# use DE to select genes (default)
genes = "de",
# only use genes in the sce object
restrict = rownames(sce),
# parallel processing
BPPARAM = BiocParallel::MulticoreParam(4)
)
Now we can perform the classification step, using our SCE object and
the SingleR
model that we just created.
# classify with fine model
singler_result_fine <- SingleR::classifySingleR(
sce, # our SCE object
singler_finemodel, # the trained model object
# perform fine tuning (default)
fine.tune = TRUE,
# parallel processing
BPPARAM = BiocParallel::MulticoreParam(4)
)
What labels were assigned, and how many of each?
table(singler_result_fine$pruned.labels, useNA = "ifany")
Central memory CD8 T cells Classical monocytes
121 2926
Effector memory CD8 T cells Exhausted B cells
31 48
Follicular helper T cells Intermediate monocytes
135 424
Low-density basophils MAIT cells
5 112
Myeloid dendritic cells Naive B cells
162 311
Naive CD4 T cells Naive CD8 T cells
600 752
Natural killer cells Non classical monocytes
320 189
Non-switched memory B cells Non-Vd2 gd T cells
250 163
Plasmablasts Plasmacytoid dendritic cells
3 93
Progenitor cells Switched memory B cells
36 81
T regulatory cells Terminal effector CD4 T cells
163 39
Terminal effector CD8 T cells Th1 cells
115 97
Th1/Th17 cells Th17 cells
135 133
Th2 cells Vd2 gd T cells
144 146
<NA>
190
# add fine labels to SCE
sce$celltype_fine <- singler_result_fine$pruned.labels
# plot UMAP with fine labels
scater::plotUMAP(sce, color_by = "celltype_fine")
Warning: Removed 190 rows containing missing values or values outside the scale range
(`geom_point()`).
That’s a pretty messy plot. Mostly that is because there are
lots of cell types here, and not enough colors to represent
them all. The NA
cells also got taken off completely, which
is not ideal.
One thing we can do is to use some functions from the
tidyverse
package forcats
, which can
be very handy for dealing with categorical variables like these cell
types.
We will use two of these functions in the chunk below: First we will
use fct_collapse
to take some of the finer labels that we
might not be as interested in and collapse them into logical groupings
(in this case, the main
label that they were part of).
After that, we will use fct_relevel
to put the remaining
factor levels in the order we would like them to appear for
plotting.
collapsed_labels <- singler_result_fine$pruned.labels |>
forcats::fct_collapse(
"Monocytes" = c(
"Classical monocytes",
"Intermediate monocytes",
"Non classical monocytes"),
"Dendritic cells" = c(
"Myeloid dendritic cells",
"Plasmacytoid dendritic cells"),
"T cells" = c(
"MAIT cells",
"Non-Vd2 gd T cells",
"Vd2 gd T cells"),
"Helper T cells" = c(
"Th1 cells",
"Th1/Th17 cells",
"Th17 cells",
"Th2 cells",
"Follicular helper T cells"),
"B cells" = c(
"Naive B cells",
"Switched memory B cells",
"Non-switched memory B cells",
"Exhausted B cells",
"Plasmablasts"
)
) |>
# order for plotting
forcats::fct_relevel(
"Helper T cells",
"T regulatory cells",
"Naive CD4 T cells",
"Terminal effector CD4 T cells",
"Naive CD8 T cells",
"Central memory CD8 T cells",
"Effector memory CD8 T cells",
"Terminal effector CD8 T cells",
"T cells",
"Natural killer cells",
"B cells",
"Monocytes",
"Dendritic cells",
"Progenitor cells",
"Low-density basophils"
)
Now that we have that set up, we can plot using our collapsed and ordered cell type labels.
sce$celltype_collapsed <- collapsed_labels
scater::plotUMAP(sce,
color_by = "celltype_collapsed")
Let’s look at how the cell type assignments we obtained using
SingleR
compare to the clusters that we found using the
unsupervised clustering at the start of this notebook.
To do this, we will again use the table()
function, but
now with two vectors as input, to build a contingency table of the cell
types and clusters that each cell was classified with.
# create a table of clusters & cell type counts
type_cluster_tab <- table(sce$celltype_fine, sce$nn_cluster, useNA = "ifany")
# look at the top corner of the results
type_cluster_tab[1:5, 1:5]
1 2 3 4 5
Central memory CD8 T cells 81 0 0 0 0
Classical monocytes 0 0 2195 698 0
Effector memory CD8 T cells 26 0 0 0 0
Exhausted B cells 0 0 0 0 0
Follicular helper T cells 93 0 0 0 0
As you can see, this produced a table with rows for each cell type and columns for each cluster number. The values are the count of cells for each cluster/cell type combination. However, these raw counts are not quite what we’ll want for visualization. Since the total number of cells differs across clusters, we’d like to convert these counts into the proportions of each cell type in each cluster.
We’ll do this by going through the table column by column and
dividing each value by the sum for that cluster. This will give us
normalized values where the values in each column now sum to 1. To do
that, we will use the apply
function, which allows us to
operate on a matrix row by row or column by column, applying a function
to each “slice”. Since the function we want to apply is very short, we
will use R’s new (as of version 4.1) anonymous function shorthand:
\(x) ...
can be used to define a function that that takes
as input values x
(where the ...
is where you
would put the expression to calculate). Here we will apply the
expression x/sum(x)
, which will divide each element of a
vector x
by the sum of its values.
# normalize by the number of cells in each cluster (columns)
type_cluster_tab <- apply(
type_cluster_tab,
2, # apply function to columns
\(x) x/sum(x) # function to apply
)
# print the normalized values
type_cluster_tab[1:5, 1:5]
1 2 3 4 5
Central memory CD8 T cells 0.08617021 0 0.0000000 0.000000 0
Classical monocytes 0.00000000 0 0.9825425 0.656015 0
Effector memory CD8 T cells 0.02765957 0 0.0000000 0.000000 0
Exhausted B cells 0.00000000 0 0.0000000 0.000000 0
Follicular helper T cells 0.09893617 0 0.0000000 0.000000 0
Now we can plot these results as a heatmap, using the
pheatmap
package. There is a lot of customization we could
do here, but pheatmap
(pretty heatmap) has good defaults,
so we won’t spend too much time on it for now.
# plot with pheatmap
pheatmap::pheatmap(type_cluster_tab)
We can see that most of our clusters are indeed defined by a single cell type, though there are some clusters (e.g., 1 & 9) that have a number of (related) cell types within them. There are also some places where single cell types are spread across a few different clusters (Classical monocytes, for example).
While most of the time we will want to classify single cells, sometimes the sparseness of the data may mean that individual cells do not provide reliable estimates of cell types.
An alternative approach is to classify the clusters as a whole, assuming that the clusters we have identified represent a single cell state. If that is the case, then we should be able to combine the data for all cells across each cluster, then apply our cell typing method to this group of cells. This is similar to an approach we will return to later in the context of differential expression.
The first step here is to create a new matrix where we sum the counts
across cells that are from the same type according to our clustering.
Because SingleR
is a non-parametric approach, we can
perform this step with the raw counts matrix. There are a few different
ways to do this, but we will use the function
DelayedArray::colsum()
, which can work directly on the
sparse matrices that are often found in SCE objects. We will provide it
with the matrix we need, and then a vector of the cluster assignments
for each column of the matrix. The function will then sum expression
values for each gene across all of the columns that have that value.
# sum count matrix by cluster
cluster_mat <- DelayedArray::colsum(counts(sce), sce$nn_cluster)
# print new dimensions
dim(cluster_mat)
[1] 36601 20
You can see that the resulting matrix still has the same number of rows we have seen before, but now only has as many columns as the number of clusters that the cells were assigned to.
Now we can apply the same SingleR
model to these
results, using the new matrix as input along with the previously trained
model. As there are only 20 clusters to classify, this will be very
quick, and we don’t need to parallelize it!
# run SingleR classification with previously trained model
singler_cluster <- SingleR::classifySingleR(
cluster_mat, # cluster expression matrix
singler_finemodel # pre-trained model
)
# view results
head(singler_cluster)
DataFrame with 6 rows and 4 columns
scores labels delta.next
<matrix> <character> <numeric>
1 0.612108:0.431222:0.615571:... Th1/Th17 cells 0.01424091
2 0.310607:0.468907:0.306530:... Naive B cells 0.55307985
3 0.275063:0.805908:0.308321:... Classical monocytes 0.35835004
4 0.276687:0.776140:0.317954:... Classical monocytes 0.08904561
5 0.289710:0.708883:0.317147:... Myeloid dendritic ce.. 0.07394915
6 0.462245:0.395730:0.451555:... Naive B cells 0.00785926
pruned.labels
<character>
1 Th1/Th17 cells
2 Naive B cells
3 Classical monocytes
4 Classical monocytes
5 Myeloid dendritic ce..
6 Naive B cells
The result is a fairly small table of results, but we are most
interested in the labels, which we would like to associate with each
cell in our SCE object for visualization. Since the cluster labels are
the row names of that table, we can perform a cute little trick to
assign labels back to each cell based on the name of the cluster that it
was assigned to. (In this case the cluster names are all numbers, but
that might not always be the case.) We’ll select values repeatedly from
the singler_cluster
table, using the cluster assignment to
pick a row, and then always picking the pruned.labels
column.
sce$celltype_cluster <- singler_cluster[sce$nn_cluster, "pruned.labels"]
Now we can plot these cluster-based cell type assignments using the
now familiar plotUMAP()
function.
scater::plotUMAP(sce, color_by = "celltype_cluster")
This sure looks nice and clean, but what have we really done here? We are assuming that each cluster has only a single cell type, which is a pretty bold assumption, as we really aren’t sure that the clusters we created were correct. You may recall that clustering algorithms are quite sensitive to parameter choice, so a different parameter choice could quite likely give a different result.
As a middle ground between the potentially messy single-cell cell type assignment and the almost-certainly overconfident cluster-based assignment above, we can take approach inspired by Baran et al. (2019) using something they called metacells. The idea is that we can perform fine-scaled clustering to identify groups of very similar cells, then sum the counts within those clusters as “metacells” to use for further analysis. The original paper includes a number of optimizations to make sure that the metacell clusters have desirable properties for downstream analysis. We won’t go into that depth here, but we can apply similar ideas.
To begin, we will perform some fine-scale clustering, using a simpler
clustering algorithm: K-means clustering. We will use the same
bluster
package, clustering based on the PCA results we
have from earlier, but this algorithm allows us to specify the number of
clusters we want to end up with. We have about 8000 cells, so let’s
cluster those into groups of approximately 80 cells, which works out to
100 clusters. While this is almost certainly more clusters than are
“real” in this dataset, our goal here is not to find differences among
clusters, just to get homogeneous groups of cells.
# perform k-means clustering
kclusters <- bluster::clusterRows(
reducedDim(sce, "PCA"),
bluster::KmeansParam(
centers = 100, # the number of clusters
iter.max = 100 # more iterations to be sure of convergence
)
)
Now we can apply exactly the same approach we did when we had the 20 clusters we had identified with the earlier graph-based clustering.
# create a "metacell" matrix by summing fine-scale clusters
metacell_mat <- DelayedArray::colsum(counts(sce), kclusters)
# apply SingleR model to metacell matrix
metacell_singler <- SingleR::classifySingleR(
metacell_mat,
singler_finemodel
)
# apply metacell cell type assignments to individual cells
sce$celltype_metacell <- metacell_singler[kclusters, "pruned.labels"]
Now we can plot the results as we have done before.
scater::plotUMAP(sce, color_by = "celltype_metacell")
Warning: Removed 208 rows containing missing values or values outside the scale range
(`geom_point()`).
What do you think of this plot? Is this more or less useful than the original cell-based clustering?
To save disk space (and time), we won’t write out the whole SCE
object, as we haven’t changed any of the core data there. Instead we
will just write out the cell information table (colData
) as
a TSV file.
colData(sce) |>
as.data.frame() |>
readr::write_tsv(file = cellinfo_file)
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] ensembldb_2.28.0 AnnotationFilter_1.28.0
[3] GenomicFeatures_1.56.0 AnnotationDbi_1.66.0
[5] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
[7] Biobase_2.64.0 GenomicRanges_1.56.0
[9] GenomeInfoDb_1.40.0 IRanges_2.38.0
[11] S4Vectors_0.42.0 BiocGenerics_0.50.0
[13] MatrixGenerics_1.16.0 matrixStats_1.3.0
[15] ggplot2_3.5.1 optparse_1.7.5
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 jsonlite_1.8.8
[3] magrittr_2.0.3 ggbeeswarm_0.7.2
[5] gypsum_1.0.0 farver_2.1.1
[7] rmarkdown_2.26 BiocIO_1.14.0
[9] fs_1.6.4 zlibbioc_1.50.0
[11] vctrs_0.6.5 Rsamtools_2.20.0
[13] memoise_2.0.1 DelayedMatrixStats_1.26.0
[15] RCurl_1.98-1.14 forcats_1.0.0
[17] htmltools_0.5.8.1 S4Arrays_1.4.0
[19] AnnotationHub_3.12.0 curl_5.2.1
[21] BiocNeighbors_1.22.0 Rhdf5lib_1.26.0
[23] SparseArray_1.4.0 rhdf5_2.48.0
[25] sass_0.4.9 alabaster.base_1.4.0
[27] bslib_0.7.0 httr2_1.0.1
[29] cachem_1.0.8 GenomicAlignments_1.40.0
[31] igraph_2.0.3 mime_0.12
[33] lifecycle_1.0.4 pkgconfig_2.0.3
[35] rsvd_1.0.5 Matrix_1.7-0
[37] R6_2.5.1 fastmap_1.1.1
[39] GenomeInfoDbData_1.2.12 digest_0.6.35
[41] colorspace_2.1-0 paws.storage_0.5.0
[43] scater_1.32.0 irlba_2.3.5.1
[45] ExperimentHub_2.12.0 RSQLite_2.3.6
[47] beachmat_2.20.0 filelock_1.0.3
[49] labeling_0.4.3 fansi_1.0.6
[51] httr_1.4.7 abind_1.4-5
[53] compiler_4.4.1 bit64_4.0.5
[55] withr_3.0.0 BiocParallel_1.38.0
[57] viridis_0.6.5 DBI_1.2.2
[59] highr_0.10 alabaster.ranges_1.4.0
[61] HDF5Array_1.32.0 alabaster.schemas_1.4.0
[63] rappdirs_0.3.3 DelayedArray_0.30.0
[65] rjson_0.2.21 bluster_1.14.0
[67] tools_4.4.1 vipor_0.4.7
[69] beeswarm_0.4.0 glue_1.7.0
[71] restfulr_0.0.15 rhdf5filters_1.16.0
[73] grid_4.4.1 cluster_2.1.6
[75] generics_0.1.3 gtable_0.3.5
[77] tzdb_0.4.0 tidyr_1.3.1
[79] hms_1.1.3 xml2_1.3.6
[81] BiocSingular_1.20.0 ScaledMatrix_1.12.0
[83] utf8_1.2.4 XVector_0.44.0
[85] ggrepel_0.9.5 BiocVersion_3.19.1
[87] pillar_1.9.0 stringr_1.5.1
[89] vroom_1.6.5 dplyr_1.1.4
[91] getopt_1.20.4 BiocFileCache_2.12.0
[93] lattice_0.22-6 rtracklayer_1.64.0
[95] bit_4.0.5 tidyselect_1.2.1
[97] paws.common_0.7.2 Biostrings_2.72.0
[99] scuttle_1.14.0 knitr_1.46
[ reached getOption("max.print") -- omitted 35 entries ]