This notebook will demonstrate how to:
# Load libraries
library(ggplot2)
library(scater)
Loading required package: SingleCellExperiment
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats
Attaching package: 'MatrixGenerics'
The following objects are masked from 'package:matrixStats':
colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
colWeightedMeans, colWeightedMedians, colWeightedSds,
colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
rowWeightedSds, rowWeightedVars
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':
findMatches
The following objects are masked from 'package:base':
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':
rowMedians
The following objects are masked from 'package:matrixStats':
anyMissing, rowMedians
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
Loading required package: scuttle
library(scran)
# clustering tools
library(bluster)
# Setting the seed for reproducibility
set.seed(12345)
# main data directory
data_dir <- file.path("data", "hodgkins")
# normalized data file
normalized_rds <- file.path(data_dir, "normalized", "normalized_hodgkins_sce.rds")
# Output directory for markers
marker_dir <- file.path("analysis", "hodgkins", "markers")
fs::dir_create(marker_dir)
hodgkins_sce <- readr::read_rds(normalized_rds)
When we performed dimensionality reduction on our single cell data, we could see visually that the cells tended cluster together into groups. To the extent that such clustering is a real biological phenomenon, representing cells with similar patterns of gene expression, we might like to identify distinct groups that we can name and assign a label to. Ultimately, we would hope that these labels correspond to previously identified (or newly identified!) cell types, and that we can use that information to provide more insight into the results of our experiment.
There are a number of methods to identify clusters and assign cells to those in multidimensional data like the single cell data we have. We will explore a couple of the more common methods here.
The first method we will try is k-means clustering. The
k
here refers to the number of clusters that we will
create, and must be chosen before we start. This clustering method seeks
to find a way to divide the cells into k
clusters such that
the cells within each cluster are as similar as possible and the
differences among clusters is as large as possible.
It turns out that is a pretty hard problem to solve exactly, but we can do pretty well with an algorithm that starts with a random guess at where the clusters will be:
You might wonder: How many clusters should we use? That is a hard question! There are some heuristics we can use for deciding the “correct” number of clusters, but we will not be exploring those right now.
For an intuitive visualization of the general k-means method, you might find this StatQuest video useful, and for more discussion of the method in a single-cell context, the Orchestrating Single-Cell Analysis book section on k-means is a good reference.
We are going to use the function clusterRows()
from the
Bioconductor bluster
package for our clustering. This
function takes a matrix where each sample (cell in our case) is a row
and each column is a feature. The matrix of counts (or normalized
counts) by cells in our SingleCellExperiment
object is the
wrong orientation, so at a minimum we would have to transpose that
matrix before proceeding.
However, clustering algorithms like k-means can be a bit slow with as many features as the number of genes that we have in our data set, so we would rather not use the raw data. There is also a potential concern that noise in the raw data might disrupt the clustering algorithm, so it would be best to use some kind of dimensionality reduction algorithm first. We still want to maintain a good number of dimensions, so our old friend PCA is a good (and very standard) choice.
Thankfully, we already computed and stored a matrix with
reduced dimensions with the runPCA()
function. We will
extract that from the SingleCellExperiment
object with the
reducedDim()
function, which conveniently returns a matrix
with the cells as rows, so we can use that directly!
The other argument we need for clusterRows()
will tell
it which clustering algorithm to use, and any additional parameters
associated with that algorithm. This has to be made with a special kind
of function from the bluster
package. In this case, we will
use KmeansParam()
to specify that we want k-means
clustering, with the centers
parameter to set how many
clusters we will assign (k
).
# set the number of clusters
k <- 7
# extract the principal components matrix
hodgkins_pca <- reducedDim(hodgkins_sce, "PCA")
# perform the clustering
kclusters <- clusterRows(hodgkins_pca, KmeansParam(centers = k))
The clusterRows()
function returned a vector of cluster
assignments as integers, but the numerical values have no inherent
meaning. For plotting we will want to convert those to a factor, so R is
not tempted to treat them as a continuous variable.
We can also store them back into the column (cell) information table of the original object for convenient storage and later use.
# save clusters in the SCE object as a factor
hodgkins_sce$kcluster <- factor(kclusters)
Now we can plot the results and see how the clustering looks, using
the scater
function plotReducedDim()
that we
have used before, coloring the points by our clustering results. We will
start by using the UMAP coordinates for the plot. Note that this does
require that the cluster identities were stored in the
SingleCellExperiment
object, as we just did.
# plot clustering results
plotReducedDim(hodgkins_sce, "UMAP", color_by = "kcluster")
PCA
or TSNE
coordinates?You will have time to explore questions like these in the exercise notebooks. One thing worth noting right away though is that cluster numbers here and in the future are assigned arbitrarily. Even if we got exactly the same logical clusters across runs (unlikely!), we wouldn’t expect the label numbers to be the same or stable.
Another common type of clustering method for single cell data is graph-based clustering. This algorithm follows the following general steps:
There is a lot of hidden detail in those three steps!
To apply this clustering algorithm, we will use the same
bluster::clusterRows()
function as before, but we will
change the second argument from KmeansParam()
to
NNGraphParam()
to tell it that we want to use a
nearest-neighbor graph-based method. We can then supply additional
parameters to NNGraphParam()
to adjust the details of the
algorithm. Here we will use k
to specify the number of
neighbors to use when building the graph and cluster.fun
to
specify the algorithm for identifying the clusters within the graph.
Despite sharing a letter, k
here and the one from
k-means clustering are not the same thing! In this case, we are telling
the algorithm how many neighbor connections to make for each cell, not
the final number of clusters, which will be determined by the algorithm
we use for the cluster building step.
The options for cluster.fun
describe the algorithm
for the cluster building step described above. These include
walktrap
(the default), leiden
, and
louvain
, which is the default algorithm in Seurat
, another
common package for single cell analysis that you may have seen.
In the example below, we will use the default values for these two arguments.
# run the clustering algorithm
nnclusters <- clusterRows(
hodgkins_pca,
NNGraphParam(k = 10,
cluster.fun = "walktrap")
)
# store cluster results in the SCE object
hodgkins_sce$nncluster <- factor(nnclusters)
Now we can plot the results of our graph-based clustering. This time
we will also use the text_by
argument to include the
cluster ids directly on the plot.
plotReducedDim(hodgkins_sce,
"UMAP",
color_by = "nncluster",
text_by = "nncluster")
Again, you will have time to explore these more in the exercise notebook, and of course with your own data! Sadly, there are not always good answers to which set of inferred clusters is best! Which method and parameters you use may depend on the kind of question you are trying to answer.
For more detailed information about the methods presented here, including some ways to assess the “quality” of the clustering, I encourage you to explore at the relevant chapter of the Orchestrating Single-Cell Analysis book. A recent review by Kislev et al. (2019) also goes into some depth about the differences among algorithms and the general challenges associated with clustering single cell data.
Assigning clusters is nice for visualization, but we would also like to be able to move toward a biological interpretation of the clusters and identifying the cell types in each cluster. To that end, we can identify marker genes that are differentially expressed among clusters.
It is worth noting here that the statistical calculations here are more than a bit circular: we identified clusters first based on gene expression, then we are using those same clusters to find differences in gene expression. The result is that even if there were no true clusters, we would always find marker genes! For a much more technical exploration of this circularity (and a method to correct for it), see a preprint by Gao et al. (2020). In light of this, it is better to think about marker gene identification as an aid in interpreting the clustering results (and possibly extending insights to new data sets), rather than results that should be interpreted on their own, and we should be extremely wary of justifying cluster assignments solely based on these results! With that caveat, let’s proceed.
To identify marker genes, we will use the
scran::findMarkers()
function, which will rank genes by
their differential expression by calculating pairwise statistics among
clusters. We have a few options for how to determine the gene rankings
and marker gene list for each cluster. At one end could include genes
that are differentially expressed in any pairwise comparison
against our focal cluster, or at the other we could only include genes
that are differentially expressed in all comparisons with that
cluster. We could also do something in between, including genes that
differentiate the focal cluster from some fraction of the other
clusters. For now, we will use the findMarkers()
function
to rank the genes in each cluster by their combined scores against
all other clusters, using the pval.type
argument.
findMarkers()
will return a list (technically a
list-like object) of tables, one for each cell type, with statistics for
each gene showing how well it differentiates that cell type against
other types.
# use `findMarkers()` to calculate how well each gene
# differentiates each cluster from *all* other clusters
markers <- scran::findMarkers(hodgkins_sce,
groups = hodgkins_sce$nncluster,
pval.type = "all")
Next we can look at one of those tables. We will start with the first
cluster, which we will select from the list using the R standard double
bracket [[1]]
notation. We also doing a bit of
transformation here to pull the gene name into a column of its own.
markers[[1]] |>
as.data.frame() |> # convert to a data frame
tibble::rownames_to_column("gene") # make gene a column
You can see that this table includes values for all genes, so we would like to make a shorter list.
Because we tend to like tidy data, here we use
a tidyverse
function from the purrr
package to
apply the same operations as above to every element of the
markers
list. We will introduce purrr
briefly
here, but if you want more information and background, we recommend the
purrr
cheatsheet (PDF) and Jenny Bryan’s great purrr
tutorial.
The main functions in purrr
are the map()
functions, which take as their main arguments a list
and a function to apply to each element of the list.
The main function is purrr::map()
; if you are familiar with
the base R lapply()
function, it is very similar, but with
some different defaults. We will use it to get the top rows from each
table by applying the head()
function to each element of
the list. The results are returned as a new list.
purrr::map(
as.list(markers[1:3]), # select the first 3 clusters and convert to a 'regular' list for purrr
head # the function to apply (note no parenthesis)
)
This returns a list of data frames, which isn’t quite what we want.
There is no built-in function that will give us just the first few
row names, so we will have to define one. As of version 4.1, R
introduced a new approach to defining anonymous functions -
that is, functions you can quickly define “on-the-fly” without formally
assigning them to a function name. They are handy when you need to do a
very short task that requires a function, but it isn’t really a function
you need beyond this context. This new anonymous syntax looks like this:
\(x)...
(or for slightly longer code, use curly braces as
in \(x) {...}
). This defines a function that takes one
argument, x
, with ...
indicating where you
would put the expression to calculate.
purrr::map()
will then apply the expression in our
anonymous function to each element of the list, and return the results
as a new list.
# Get the first few row names of each table with a purrr function.
purrr::map(
# convert markers to a 'regular' list for purrr
as.list(markers),
# our custom function:
\(x) head( rownames(x) )
)
$`1`
[1] "ENSG00000153064" "ENSG00000247982" "ENSG00000042980" "ENSG00000224137"
[5] "ENSG00000211898" "ENSG00000163534"
$`2`
[1] "ENSG00000213809" "ENSG00000172543" "ENSG00000153563" "ENSG00000104660"
[5] "ENSG00000164120" "ENSG00000182871"
$`3`
[1] "ENSG00000124882" "ENSG00000137462" "ENSG00000136689" "ENSG00000103569"
[5] "ENSG00000123689" "ENSG00000043462"
$`4`
[1] "ENSG00000120875" "ENSG00000115935" "ENSG00000100629" "ENSG00000136754"
[5] "ENSG00000082074" "ENSG00000163599"
$`5`
[1] "ENSG00000229117" "ENSG00000231500" "ENSG00000147403" "ENSG00000109475"
[5] "ENSG00000105372" "ENSG00000156508"
$`6`
[1] "ENSG00000266088" "ENSG00000235576" "ENSG00000204866" "ENSG00000156234"
[5] "ENSG00000174946" "ENSG00000111796"
$`7`
[1] "ENSG00000111678" "ENSG00000275385" "ENSG00000173369" "ENSG00000164754"
[5] "ENSG00000159189" "ENSG00000197249"
$`8`
[1] "ENSG00000164236" "ENSG00000054219" "ENSG00000128487" "ENSG00000086758"
[5] "ENSG00000181163" "ENSG00000133112"
$`9`
[1] "ENSG00000156508" "ENSG00000205542" "ENSG00000171863" "ENSG00000181163"
[5] "ENSG00000186468" "ENSG00000145592"
$`10`
[1] "ENSG00000008517" "ENSG00000128340" "ENSG00000026025" "ENSG00000092820"
[5] "ENSG00000102879" "ENSG00000054267"
$`11`
[1] "ENSG00000141753" "ENSG00000085063" "ENSG00000148175" "ENSG00000115306"
[5] "ENSG00000182871" "ENSG00000085733"
$`12`
[1] "ENSG00000132465" "ENSG00000185507" "ENSG00000156675" "ENSG00000051108"
[5] "ENSG00000135916" "ENSG00000101057"
ENSG00000153064
ENSG00000247982
ENSG00000042980
ENSG00000224137
ENSG00000211898
ENSG00000163534
ENSG00000213809
ENSG00000172543
ENSG00000153563
ENSG00000104660
ENSG00000164120
ENSG00000182871
ENSG00000124882
ENSG00000137462
ENSG00000136689
ENSG00000103569
ENSG00000123689
ENSG00000043462
ENSG00000120875
ENSG00000115935
ENSG00000100629
ENSG00000136754
ENSG00000082074
ENSG00000163599
ENSG00000229117
ENSG00000231500
ENSG00000147403
ENSG00000109475
ENSG00000105372
ENSG00000156508
ENSG00000266088
ENSG00000235576
ENSG00000204866
ENSG00000156234
ENSG00000174946
ENSG00000111796
ENSG00000111678
ENSG00000275385
ENSG00000173369
ENSG00000164754
ENSG00000159189
ENSG00000197249
ENSG00000164236
ENSG00000054219
ENSG00000128487
ENSG00000086758
ENSG00000181163
ENSG00000133112
ENSG00000156508
ENSG00000205542
ENSG00000171863
ENSG00000181163
ENSG00000186468
ENSG00000145592
ENSG00000008517
ENSG00000128340
ENSG00000026025
ENSG00000092820
ENSG00000102879
ENSG00000054267
ENSG00000141753
ENSG00000085063
ENSG00000148175
ENSG00000115306
ENSG00000182871
ENSG00000085733
ENSG00000132465
ENSG00000185507
ENSG00000156675
ENSG00000051108
ENSG00000135916
ENSG00000101057
Another variant is purrr::imap()
, which allows us to use
the names of the list elements in our function. (Try
names(markers)
to see the names for the list we are working
with now.) We will use that here to name output files where we will
print each of the marker tables, one for each cell type. We are again
defining a custom function within the call to purrr:imap()
using the \(x)
syntax, but this time we need two variables:
we will use table
for the list elements (each a table of
results) and id
for their names. So, we’ll actually start
by defining the function as \(table, id)
, since there will
be two input arguments. Because we don’t know the identities of the
clusters we identified, these are just the cluster numbers for now.
Making file names from numbers can be a a bit fraught, as we really
want them to sort in numerical order, but many systems will sort by
alphabetical order. Unfortunately, that would tend to sort 10-19 before
2, 20-29 before 3, etc. To solve this, we are using the
sprintf()
function, which allows us to specify the format
of a printed string. In this case, we are using the formatting syntax of
%02d
to tell it that we will want to insert
(%
) a number (d
), with two digits and leading
zeros. To see what this does a bit more concretely, let’s look at a
simple example:
sprintf("%02d", 1:10)
[1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10"
01
02
03
04
05
06
07
08
09
10
In addition to writing the tables out, we are saving the data frames we created as a new list that we can use in the next step.
marker_df_list <- purrr::imap(
as.list(markers), # convert markers to a 'regular' list for purrr
# purrr function: x is the list element, y is the element name (number here)
\(table, id) {
as.data.frame(table) |> # first convert to a data frame
tibble::rownames_to_column("gene") |> # make genes a column
dplyr::arrange(FDR) |> # sort to be sure small FDR genes are first
readr::write_tsv( # write each data frame to a file
file.path(
marker_dir, # construct the output path
sprintf("cluster%02d_markers.tsv", as.integer(id)) # format cluster numbers in file names with leading zeros
)
)
}
)
One thing we can do with this list of marker genes is to see how they
look across the cells and clusters. The
scater::plotReducedDim()
function makes this easy! We have
earlier colored points by some cell statistic, like the number of
expressed genes, but it is just as easy to color by the expression of a
single gene by using the gene identifier as the color_by
argument.
The first step is to get the gene information for the genes we might be interested in.
# get gene ids for top 10 cluster 1 markers
gene_ids <- marker_df_list[[1]] |>
head(n = 10) |>
dplyr::pull(gene)
# look at the gene info for these
gene_info <- rowData(hodgkins_sce)[gene_ids, ]
data.frame(gene_info)
Now we can pick one of the genes for plotting and go!
# get gene id and gene symbol for nicer plotting
rank <- 1
gene_id <- gene_info$ID[rank]
symbol <- gene_info$Symbol[rank]
# Plot UMAP results colored by expression
plotReducedDim(hodgkins_sce, "UMAP",
color_by = gene_id) +
# label the guide with the gene symbol
guides(color = guide_colorbar(title = symbol))
Hopefully that expression pattern aligns at least in part with your expectations!
So far we have identified clusters of cells (if you believe them), and found some genes that are associated with each cluster. What you might want to know at this point is what cell types comprise each cluster. Setting aside the thorny question of “what is a cell type?”, this is still a challenging problem, and we’ll explore some approaches to perform cell type annotation in the next notebook!
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] bluster_1.14.0 scran_1.32.0
[3] scater_1.32.0 scuttle_1.14.0
[5] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
[7] Biobase_2.64.0 GenomicRanges_1.56.0
[9] GenomeInfoDb_1.40.0 IRanges_2.38.0
[11] S4Vectors_0.42.0 BiocGenerics_0.50.0
[13] MatrixGenerics_1.16.0 matrixStats_1.3.0
[15] ggplot2_3.5.1 optparse_1.7.5
loaded via a namespace (and not attached):
[1] gridExtra_2.3 rlang_1.1.3
[3] magrittr_2.0.3 compiler_4.4.1
[5] DelayedMatrixStats_1.26.0 vctrs_0.6.5
[7] stringr_1.5.1 pkgconfig_2.0.3
[9] crayon_1.5.2 fastmap_1.1.1
[11] XVector_0.44.0 labeling_0.4.3
[13] utf8_1.2.4 rmarkdown_2.26
[15] tzdb_0.4.0 UCSC.utils_1.0.0
[17] ggbeeswarm_0.7.2 bit_4.0.5
[19] purrr_1.0.2 xfun_0.43
[21] zlibbioc_1.50.0 cachem_1.0.8
[23] beachmat_2.20.0 jsonlite_1.8.8
[25] highr_0.10 DelayedArray_0.30.0
[27] BiocParallel_1.38.0 irlba_2.3.5.1
[29] parallel_4.4.1 cluster_2.1.6
[31] R6_2.5.1 bslib_0.7.0
[33] stringi_1.8.3 limma_3.60.0
[35] jquerylib_0.1.4 Rcpp_1.0.12
[37] knitr_1.46 readr_2.1.5
[39] Matrix_1.7-0 igraph_2.0.3
[41] tidyselect_1.2.1 abind_1.4-5
[43] yaml_2.3.8 viridis_0.6.5
[45] codetools_0.2-20 lattice_0.22-6
[47] tibble_3.2.1 withr_3.0.0
[49] evaluate_0.23 getopt_1.20.4
[51] pillar_1.9.0 generics_0.1.3
[53] vroom_1.6.5 hms_1.1.3
[55] sparseMatrixStats_1.16.0 munsell_0.5.1
[57] scales_1.3.0 glue_1.7.0
[59] metapod_1.12.0 tools_4.4.1
[61] BiocNeighbors_1.22.0 ScaledMatrix_1.12.0
[63] locfit_1.5-9.9 fs_1.6.4
[65] cowplot_1.1.3 grid_4.4.1
[67] edgeR_4.2.0 colorspace_2.1-0
[69] GenomeInfoDbData_1.2.12 beeswarm_0.4.0
[71] BiocSingular_1.20.0 vipor_0.4.7
[73] cli_3.6.2 rsvd_1.0.5
[75] fansi_1.0.6 S4Arrays_1.4.0
[77] viridisLite_0.4.2 dplyr_1.1.4
[79] gtable_0.3.5 sass_0.4.9
[81] digest_0.6.35 SparseArray_1.4.0
[83] ggrepel_0.9.5 dqrng_0.3.2
[85] farver_2.1.1 htmltools_0.5.8.1
[87] lifecycle_1.0.4 httr_1.4.7
[89] statmod_1.5.0 bit64_4.0.5