Objectives

This notebook will demonstrate how to:

  • Identify clusters of cells in single-cell data
  • Compare results from different clustering methods
  • Select putative marker genes that can be used to differentiate clusters

Set Up

Load libraries

# Load libraries
library(ggplot2)
library(scater)
Loading required package: SingleCellExperiment
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats

Attaching package: 'MatrixGenerics'
The following objects are masked from 'package:matrixStats':

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds,
    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
    rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
    rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    rowWeightedSds, rowWeightedVars
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics

Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors

Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':

    expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':

    rowMedians
The following objects are masked from 'package:matrixStats':

    anyMissing, rowMedians
Loading required package: scuttle
library(scran)

# clustering tools
library(bluster)

# Setting the seed for reproducibility
set.seed(12345)
# main data directory
data_dir <- file.path("data", "hodgkins")

# normalized data file
normalized_rds <- file.path(data_dir, "normalized", "normalized_hodgkins_sce.rds")

# Output directory for markers
marker_dir <- file.path("analysis", "hodgkins", "markers")
fs::dir_create(marker_dir)
hodgkins_sce <- readr::read_rds(normalized_rds)

Assigning cell clusters

Roadmap: Cluster

When we performed dimensionality reduction on our single cell data, we could see visually that the cells tended cluster together into groups. To the extent that such clustering is a real biological phenomenon, representing cells with similar patterns of gene expression, we might like to identify distinct groups that we can name and assign a label to. Ultimately, we would hope that these labels correspond to previously identified (or newly identified!) cell types, and that we can use that information to provide more insight into the results of our experiment.

There are a number of methods to identify clusters and assign cells to those in multidimensional data like the single cell data we have. We will explore a couple of the more common methods here.

k-means clustering

The first method we will try is k-means clustering. The k here refers to the number of clusters that we will create, and must be chosen before we start. This clustering method seeks to find a way to divide the cells into k clusters such that the cells within each cluster are as similar as possible and the differences among clusters is as large as possible.

It turns out that is a pretty hard problem to solve exactly, but we can do pretty well with an algorithm that starts with a random guess at where the clusters will be:

  1. We start by picking random center locations (how we do this can vary)
  2. Then, we assign cells to clusters by finding which center is closest to each cell.
  3. Next we find the centers of these new clusters
  4. Go back to step 2 with these new centers, repeating until the cluster assignments stop changing.

You might wonder: How many clusters should we use? That is a hard question! There are some heuristics we can use for deciding the “correct” number of clusters, but we will not be exploring those right now.

For an intuitive visualization of the general k-means method, you might find this StatQuest video useful, and for more discussion of the method in a single-cell context, the Orchestrating Single-Cell Analysis book section on k-means is a good reference.

We are going to use the function clusterRows() from the Bioconductor bluster package for our clustering. This function takes a matrix where each sample (cell in our case) is a row and each column is a feature. The matrix of counts (or normalized counts) by cells in our SingleCellExperiment object is the wrong orientation, so at a minimum we would have to transpose that matrix before proceeding.

However, clustering algorithms like k-means can be a bit slow with as many features as the number of genes that we have in our data set, so we would rather not use the raw data. There is also a potential concern that noise in the raw data might disrupt the clustering algorithm, so it would be best to use some kind of dimensionality reduction algorithm first. We still want to maintain a good number of dimensions, so our old friend PCA is a good (and very standard) choice.

Thankfully, we already computed and stored a matrix with reduced dimensions with the runPCA() function. We will extract that from the SingleCellExperiment object with the reducedDim() function, which conveniently returns a matrix with the cells as rows, so we can use that directly!

The other argument we need for clusterRows() will tell it which clustering algorithm to use, and any additional parameters associated with that algorithm. This has to be made with a special kind of function from the bluster package. In this case, we will use KmeansParam() to specify that we want k-means clustering, with the centers parameter to set how many clusters we will assign (k).

# set the number of clusters
k <- 7

# extract the principal components matrix
hodgkins_pca <- reducedDim(hodgkins_sce, "PCA")

# perform the clustering
kclusters <- clusterRows(hodgkins_pca, KmeansParam(centers = k))

The clusterRows() function returned a vector of cluster assignments as integers, but the numerical values have no inherent meaning. For plotting we will want to convert those to a factor, so R is not tempted to treat them as a continuous variable.

We can also store them back into the column (cell) information table of the original object for convenient storage and later use.

# save clusters in the SCE object as a factor
hodgkins_sce$kcluster <- factor(kclusters)

Now we can plot the results and see how the clustering looks, using the scater function plotReducedDim() that we have used before, coloring the points by our clustering results. We will start by using the UMAP coordinates for the plot. Note that this does require that the cluster identities were stored in the SingleCellExperiment object, as we just did.

# plot clustering results
plotReducedDim(hodgkins_sce, "UMAP", color_by = "kcluster")

  • Do those clusters line up with what you might have expected if you were doing this by eye?
  • If we repeat this, do we get the same cluster assignments?
  • What happens if we change the number of clusters?
  • What do the results look like if you plot with the PCA or TSNE coordinates?

You will have time to explore questions like these in the exercise notebooks. One thing worth noting right away though is that cluster numbers here and in the future are assigned arbitrarily. Even if we got exactly the same logical clusters across runs (unlikely!), we wouldn’t expect the label numbers to be the same or stable.

Graph-based clustering

Another common type of clustering method for single cell data is graph-based clustering. This algorithm follows the following general steps:

  1. Identifying a set of nearest neighbors for each cell that have similar expression profiles to that cell.
  2. Connect each cell to its neighbors in a network graph, weighting the connections by how similar the connected cells are.
  3. Break the network up by identifying clusters of cells that are more connected to each other than they are to cells outside the clusters.

There is a lot of hidden detail in those three steps!

To apply this clustering algorithm, we will use the same bluster::clusterRows() function as before, but we will change the second argument from KmeansParam() to NNGraphParam() to tell it that we want to use a nearest-neighbor graph-based method. We can then supply additional parameters to NNGraphParam() to adjust the details of the algorithm. Here we will use k to specify the number of neighbors to use when building the graph and cluster.fun to specify the algorithm for identifying the clusters within the graph.

  • Despite sharing a letter, k here and the one from k-means clustering are not the same thing! In this case, we are telling the algorithm how many neighbor connections to make for each cell, not the final number of clusters, which will be determined by the algorithm we use for the cluster building step.

  • The options for cluster.fun describe the algorithm for the cluster building step described above. These include walktrap (the default), leiden, and louvain, which is the default algorithm in Seurat, another common package for single cell analysis that you may have seen.

In the example below, we will use the default values for these two arguments.

# run the clustering algorithm
nnclusters <- clusterRows(
  hodgkins_pca,
  NNGraphParam(k = 10,
               cluster.fun = "walktrap")
  )
# store cluster results in the SCE object
hodgkins_sce$nncluster <- factor(nnclusters)

Now we can plot the results of our graph-based clustering. This time we will also use the text_by argument to include the cluster ids directly on the plot.

plotReducedDim(hodgkins_sce,
               "UMAP",
               color_by = "nncluster",
               text_by = "nncluster")

  • How do these results compare to the k-means clustering result?
  • How sensitive is this to the parameters we choose?
  • How do the numbers of clusters change with different parameters?

Again, you will have time to explore these more in the exercise notebook, and of course with your own data! Sadly, there are not always good answers to which set of inferred clusters is best! Which method and parameters you use may depend on the kind of question you are trying to answer.

For more detailed information about the methods presented here, including some ways to assess the “quality” of the clustering, I encourage you to explore at the relevant chapter of the Orchestrating Single-Cell Analysis book. A recent review by Kislev et al. (2019) also goes into some depth about the differences among algorithms and the general challenges associated with clustering single cell data.

Identifying marker genes

Roadmap: Find markers

Assigning clusters is nice for visualization, but we would also like to be able to move toward a biological interpretation of the clusters and identifying the cell types in each cluster. To that end, we can identify marker genes that are differentially expressed among clusters.

It is worth noting here that the statistical calculations here are more than a bit circular: we identified clusters first based on gene expression, then we are using those same clusters to find differences in gene expression. The result is that even if there were no true clusters, we would always find marker genes! For a much more technical exploration of this circularity (and a method to correct for it), see a preprint by Gao et al. (2020). In light of this, it is better to think about marker gene identification as an aid in interpreting the clustering results (and possibly extending insights to new data sets), rather than results that should be interpreted on their own, and we should be extremely wary of justifying cluster assignments solely based on these results! With that caveat, let’s proceed.

To identify marker genes, we will use the scran::findMarkers() function, which will rank genes by their differential expression by calculating pairwise statistics among clusters. We have a few options for how to determine the gene rankings and marker gene list for each cluster. At one end could include genes that are differentially expressed in any pairwise comparison against our focal cluster, or at the other we could only include genes that are differentially expressed in all comparisons with that cluster. We could also do something in between, including genes that differentiate the focal cluster from some fraction of the other clusters. For now, we will use the findMarkers() function to rank the genes in each cluster by their combined scores against all other clusters, using the pval.type argument.

findMarkers() will return a list (technically a list-like object) of tables, one for each cell type, with statistics for each gene showing how well it differentiates that cell type against other types.

# use `findMarkers()` to calculate how well each gene
#  differentiates each cluster from *all* other clusters
markers <- scran::findMarkers(hodgkins_sce,
                              groups = hodgkins_sce$nncluster,
                              pval.type = "all")

Next we can look at one of those tables. We will start with the first cluster, which we will select from the list using the R standard double bracket [[1]] notation. We also doing a bit of transformation here to pull the gene name into a column of its own.

markers[[1]] |>
  as.data.frame() |> # convert to a data frame
  tibble::rownames_to_column("gene") # make gene a column

You can see that this table includes values for all genes, so we would like to make a shorter list.

Because we tend to like tidy data, here we use a tidyverse function from the purrr package to apply the same operations as above to every element of the markers list. We will introduce purrr briefly here, but if you want more information and background, we recommend the purrr cheatsheet (PDF) and Jenny Bryan’s great purrr tutorial.

The main functions in purrr are the map() functions, which take as their main arguments a list and a function to apply to each element of the list. The main function is purrr::map(); if you are familiar with the base R lapply() function, it is very similar, but with some different defaults. We will use it to get the top rows from each table by applying the head() function to each element of the list. The results are returned as a new list.

purrr::map(
  as.list(markers[1:3]), # select the first 3 clusters and convert to a 'regular' list for purrr
  head # the function to apply (note no parenthesis)
  )

This returns a list of data frames, which isn’t quite what we want.

There is no built-in function that will give us just the first few row names, so we will have to define one. As of version 4.1, R introduced a new approach to defining anonymous functions - that is, functions you can quickly define “on-the-fly” without formally assigning them to a function name. They are handy when you need to do a very short task that requires a function, but it isn’t really a function you need beyond this context. This new anonymous syntax looks like this: \(x)... (or for slightly longer code, use curly braces as in \(x) {...}). This defines a function that takes one argument, x, with ... indicating where you would put the expression to calculate.

purrr::map() will then apply the expression in our anonymous function to each element of the list, and return the results as a new list.

# Get the first few row names of each table with a purrr function.
purrr::map(
  # convert markers to a 'regular' list for purrr
  as.list(markers),
  # our custom function:
  \(x) head( rownames(x) )
)
$`1`
[1] "ENSG00000153064" "ENSG00000247982" "ENSG00000042980" "ENSG00000224137"
[5] "ENSG00000211898" "ENSG00000163534"

$`2`
[1] "ENSG00000124882" "ENSG00000136689" "ENSG00000137462" "ENSG00000123689"
[5] "ENSG00000103569" "ENSG00000043462"

$`3`
[1] "ENSG00000213809" "ENSG00000153563" "ENSG00000020633" "ENSG00000265972"
[5] "ENSG00000182871" "ENSG00000113088"

$`4`
[1] "ENSG00000115935" "ENSG00000100629" "ENSG00000136754" "ENSG00000065882"
[5] "ENSG00000107742" "ENSG00000163599"

$`5`
[1] "ENSG00000198034" "ENSG00000114942" "ENSG00000172757" "ENSG00000136942"
[5] "ENSG00000198755" "ENSG00000196262"

$`6`
[1] "ENSG00000235576" "ENSG00000266088" "ENSG00000204866" "ENSG00000156234"
[5] "ENSG00000111796" "ENSG00000165434"

$`7`
[1] "ENSG00000116016" "ENSG00000085063" "ENSG00000141753" "ENSG00000182871"
[5] "ENSG00000155366" "ENSG00000148175"

$`8`
[1] "ENSG00000164236" "ENSG00000128487" "ENSG00000054219" "ENSG00000086758"
[5] "ENSG00000119977" "ENSG00000068912"

$`9`
[1] "ENSG00000136942" "ENSG00000198034" "ENSG00000114942" "ENSG00000198755"
[5] "ENSG00000172757" "ENSG00000196262"

$`10`
[1] "ENSG00000181163" "ENSG00000125691" "ENSG00000265681" "ENSG00000204628"
[5] "ENSG00000136810" "ENSG00000198034"

$`11`
[1] "ENSG00000275385" "ENSG00000173369" "ENSG00000159189" "ENSG00000025708"
[5] "ENSG00000197249" "ENSG00000216490"

$`12`
[1] "ENSG00000008517" "ENSG00000092820" "ENSG00000128340" "ENSG00000054267"
[5] "ENSG00000198786" "ENSG00000102879"

$`13`
[1] "ENSG00000132465" "ENSG00000185507" "ENSG00000156675" "ENSG00000051108"
[5] "ENSG00000101057" "ENSG00000135916"
ENSG00000153064

ENSG00000247982

ENSG00000042980

ENSG00000224137

ENSG00000211898

ENSG00000163534
ENSG00000124882

ENSG00000136689

ENSG00000137462

ENSG00000123689

ENSG00000103569

ENSG00000043462
ENSG00000213809

ENSG00000153563

ENSG00000020633

ENSG00000265972

ENSG00000182871

ENSG00000113088
ENSG00000115935

ENSG00000100629

ENSG00000136754

ENSG00000065882

ENSG00000107742

ENSG00000163599
ENSG00000198034

ENSG00000114942

ENSG00000172757

ENSG00000136942

ENSG00000198755

ENSG00000196262
ENSG00000235576

ENSG00000266088

ENSG00000204866

ENSG00000156234

ENSG00000111796

ENSG00000165434
ENSG00000116016

ENSG00000085063

ENSG00000141753

ENSG00000182871

ENSG00000155366

ENSG00000148175
ENSG00000164236

ENSG00000128487

ENSG00000054219

ENSG00000086758

ENSG00000119977

ENSG00000068912
ENSG00000136942

ENSG00000198034

ENSG00000114942

ENSG00000198755

ENSG00000172757

ENSG00000196262
ENSG00000181163

ENSG00000125691

ENSG00000265681

ENSG00000204628

ENSG00000136810

ENSG00000198034
ENSG00000275385

ENSG00000173369

ENSG00000159189

ENSG00000025708

ENSG00000197249

ENSG00000216490
ENSG00000008517

ENSG00000092820

ENSG00000128340

ENSG00000054267

ENSG00000198786

ENSG00000102879
ENSG00000132465

ENSG00000185507

ENSG00000156675

ENSG00000051108

ENSG00000101057

ENSG00000135916

Another variant is purrr::imap(), which allows us to use the names of the list elements in our function. (Try names(markers) to see the names for the list we are working with now.) We will use that here to name output files where we will print each of the marker tables, one for each cell type. We are again defining a custom function within the call to purrr:imap() using the \(x) syntax, but this time we need two variables: we will use table for the list elements (each a table of results) and id for their names. So, we’ll actually start by defining the function as \(table, id), since there will be two input arguments. Because we don’t know the identities of the clusters we identified, these are just the cluster numbers for now.

Making file names from numbers can be a a bit fraught, as we really want them to sort in numerical order, but many systems will sort by alphabetical order. Unfortunately, that would tend to sort 10-19 before 2, 20-29 before 3, etc. To solve this, we are using the sprintf() function, which allows us to specify the format of a printed string. In this case, we are using the formatting syntax of %02d to tell it that we will want to insert (%) a number (d), with two digits and leading zeros. To see what this does a bit more concretely, let’s look at a simple example:

sprintf("%02d", 1:10)
 [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10"
01

02

03

04

05

06

07

08

09

10

In addition to writing the tables out, we are saving the data frames we created as a new list that we can use in the next step.

marker_df_list <- purrr::imap(
  as.list(markers), # convert markers to a 'regular' list for purrr
  # purrr function: x is the list element, y is the element name (number here)
  \(table, id) {
    as.data.frame(table) |> # first convert to a data frame
      tibble::rownames_to_column("gene") |> # make genes a column
      dplyr::arrange(FDR) |> # sort to be sure small FDR genes are first
      readr::write_tsv( # write each data frame to a file
        file.path(
          marker_dir, # construct the output path
          sprintf("cluster%02d_markers.tsv", as.integer(id)) # format cluster numbers in file names with leading zeros
        )
      )
  }
)

Plotting marker gene expression

One thing we can do with this list of marker genes is to see how they look across the cells and clusters. The scater::plotReducedDim() function makes this easy! We have earlier colored points by some cell statistic, like the number of expressed genes, but it is just as easy to color by the expression of a single gene by using the gene identifier as the color_by argument.

The first step is to get the gene information for the genes we might be interested in.

# get gene ids for top 10 cluster 1 markers
gene_ids <- marker_df_list[[1]] |>
  head(n = 10) |>
  dplyr::pull(gene)

# look at the gene info for these
gene_info <- rowData(hodgkins_sce)[gene_ids, ]
data.frame(gene_info)

Now we can pick one of the genes for plotting and go!

# get gene id and gene symbol for nicer plotting
rank <- 1
gene_id <- gene_info$ID[rank]
symbol <- gene_info$Symbol[rank]

# Plot UMAP results colored by expression
plotReducedDim(hodgkins_sce, "UMAP",
               color_by = gene_id) +
  # label the guide with the gene symbol
  guides(color = guide_colorbar(title = symbol))

Hopefully that expression pattern aligns at least in part with your expectations!

Next steps

So far we have identified clusters of cells (if you believe them), and found some genes that are associated with each cluster. What you might want to know at this point is what cell types comprise each cluster. Setting aside the thorny question of “what is a cell type?”, this is still a challenging problem, and we’ll explore some approaches to perform cell type annotation in the next notebook!

Session Info

sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] bluster_1.8.0               scran_1.26.2               
 [3] scater_1.26.1               scuttle_1.8.4              
 [5] SingleCellExperiment_1.20.1 SummarizedExperiment_1.28.0
 [7] Biobase_2.58.0              GenomicRanges_1.50.2       
 [9] GenomeInfoDb_1.34.9         IRanges_2.32.0             
[11] S4Vectors_0.36.2            BiocGenerics_0.44.0        
[13] MatrixGenerics_1.10.0       matrixStats_0.63.0         
[15] ggplot2_3.4.2               optparse_1.7.3             

loaded via a namespace (and not attached):
 [1] bitops_1.0-7              fs_1.6.1                 
 [3] bit64_4.0.5               tools_4.2.3              
 [5] bslib_0.4.2               utf8_1.2.3               
 [7] R6_2.5.1                  irlba_2.3.5.1            
 [9] vipor_0.4.5               colorspace_2.1-0         
[11] withr_2.5.0               tidyselect_1.2.0         
[13] gridExtra_2.3             bit_4.0.5                
[15] compiler_4.2.3            cli_3.6.1                
[17] BiocNeighbors_1.16.0      DelayedArray_0.24.0      
[19] labeling_0.4.2            sass_0.4.5               
[21] scales_1.2.1              readr_2.1.4              
[23] stringr_1.5.0             digest_0.6.31            
[25] rmarkdown_2.21            XVector_0.38.0           
[27] pkgconfig_2.0.3           htmltools_0.5.5          
[29] sparseMatrixStats_1.10.0  highr_0.10               
[31] fastmap_1.1.1             limma_3.54.2             
[33] rlang_1.1.0               DelayedMatrixStats_1.20.0
[35] farver_2.1.1              jquerylib_0.1.4          
[37] generics_0.1.3            jsonlite_1.8.4           
[39] vroom_1.6.1               BiocParallel_1.32.6      
[41] dplyr_1.1.2               RCurl_1.98-1.12          
[43] magrittr_2.0.3            BiocSingular_1.14.0      
[45] GenomeInfoDbData_1.2.9    Matrix_1.5-4             
[47] Rcpp_1.0.10               ggbeeswarm_0.7.1         
[49] munsell_0.5.0             fansi_1.0.4              
[51] viridis_0.6.2             lifecycle_1.0.3          
[53] stringi_1.7.12            yaml_2.3.7               
[55] edgeR_3.40.2              zlibbioc_1.44.0          
[57] grid_4.2.3                parallel_4.2.3           
[59] ggrepel_0.9.3             dqrng_0.3.0              
[61] crayon_1.5.2              lattice_0.21-8           
[63] cowplot_1.1.1             beachmat_2.14.2          
[65] hms_1.1.3                 locfit_1.5-9.7           
[67] metapod_1.6.0             knitr_1.42               
[69] pillar_1.9.0              igraph_1.4.2             
[71] codetools_0.2-19          ScaledMatrix_1.6.0       
[73] glue_1.6.2                evaluate_0.20            
[75] vctrs_0.6.2               tzdb_0.3.0               
[77] purrr_1.0.1               gtable_0.3.3             
[79] getopt_1.20.3             cachem_1.0.7             
[81] xfun_0.39                 rsvd_1.0.5               
[83] viridisLite_0.4.1         tibble_3.2.1             
[85] beeswarm_0.4.0            cluster_2.1.4            
[87] statmod_1.5.0            
---
title: "Clustering cells and finding marker genes from scRNA-seq data"
author: CCDL for ALSF
date: 2021
output:
  html_notebook:
    toc: true
    toc_float: true
---

## Objectives

This notebook will demonstrate how to:

- Identify clusters of cells in single-cell data
- Compare results from different clustering methods
- Select putative marker genes that can be used to differentiate clusters

---

## Set Up

### Load libraries
```{r setup}
# Load libraries
library(ggplot2)
library(scater)
library(scran)

# clustering tools
library(bluster)

# Setting the seed for reproducibility
set.seed(12345)
```

```{r filepaths}
# main data directory
data_dir <- file.path("data", "hodgkins")

# normalized data file
normalized_rds <- file.path(data_dir, "normalized", "normalized_hodgkins_sce.rds")

# Output directory for markers
marker_dir <- file.path("analysis", "hodgkins", "markers")
fs::dir_create(marker_dir)
```

```{r read_data, live = TRUE}
hodgkins_sce <- readr::read_rds(normalized_rds)
```



## Assigning cell clusters

![Roadmap: Cluster](diagrams/roadmap_single_cluster.png)

When we performed dimensionality reduction on our single cell data, we could see visually that the cells tended cluster together into groups.
To the extent that such clustering is a real biological phenomenon, representing cells with similar patterns of gene expression, we might like to identify distinct groups that we can name and assign a label to.
Ultimately, we would hope that these labels correspond to previously identified (or newly identified!) cell types, and that we can use that information to provide more insight into the results of our experiment.

There are a number of methods to identify clusters and assign cells to those in multidimensional data like the single cell data we have.
We will explore a couple of the more common methods here.

### k-means clustering

The first method we will try is k-means clustering.
The `k` here refers to the number of clusters that we will create, and must be chosen before we start.
This clustering method seeks to find a way to divide the cells into `k` clusters such that the cells within each cluster are as similar as possible and the differences among clusters is as large as possible.

It turns out that is a pretty hard problem to solve exactly, but we can do pretty well with an algorithm that starts with a random guess at where the clusters will be:

1. We start by picking random center locations (how we do this can vary)
2. Then, we assign cells to clusters by finding which center is closest to each cell.
3. Next we find the centers of these new clusters
4. Go back to step 2 with these new centers, repeating until the cluster assignments stop changing.

You might wonder: How many clusters should we use?
That is a hard question!
There are some heuristics we can use for deciding the "correct" number of clusters, but we will not be exploring those right now.

For an intuitive visualization of the general k-means method, you might find [this StatQuest video](https://www.youtube.com/watch?v=4b5d3muPQmA) useful, and for more discussion of the method in a single-cell context, the [Orchestrating Single-Cell Analysis book section on k-means](https://bioconductor.org/books/3.16/OSCA.basic/clustering.html#vector-quantization-with-k-means) is a good reference.

We are going to use the function `clusterRows()` from the  Bioconductor `bluster` package for our clustering.
This function takes a matrix where each sample (cell in our case) is a row and each column is a feature.
The matrix of counts (or normalized counts) by cells in our `SingleCellExperiment` object is the wrong orientation, so at a minimum we would have to transpose that matrix before proceeding.

However, clustering algorithms like k-means can be a bit slow with as many features as the number of genes that we have in our data set, so we would rather not use the raw data.
There is also a potential concern that noise in the raw data might disrupt the clustering algorithm, so it would be best to use some kind of dimensionality reduction algorithm first.
We still want to maintain a good number of dimensions, so our old friend PCA is a good (and very standard) choice.

Thankfully, we already computed *and stored* a matrix with reduced dimensions with the `runPCA()` function.
We will extract that from the `SingleCellExperiment` object with the `reducedDim()` function, which conveniently returns a matrix with the cells as rows, so we can use that directly!

The other argument we need for `clusterRows()` will tell it which clustering algorithm to use, and any additional parameters associated with that algorithm.
This has to be made with a special kind of function from the `bluster` package.
In this case, we will use `KmeansParam()` to specify that we want k-means clustering, with the `centers` parameter to set how many clusters we will assign (`k`).

```{r kmeans_7, live = TRUE}
# set the number of clusters
k <- 7

# extract the principal components matrix
hodgkins_pca <- reducedDim(hodgkins_sce, "PCA")

# perform the clustering
kclusters <- clusterRows(hodgkins_pca, KmeansParam(centers = k))
```

The `clusterRows()` function returned a vector of cluster assignments as integers, but the numerical values have no inherent meaning.
For plotting we will want to convert those to a factor, so R is not tempted to treat them as a continuous variable.

We can also store them back into the column (cell) information table of the original object for convenient storage and later use.

```{r store_kclusters, live = TRUE}
# save clusters in the SCE object as a factor
hodgkins_sce$kcluster <- factor(kclusters)
```

Now we can plot the results and see how the clustering looks, using the `scater` function `plotReducedDim()` that we have used before, coloring the points by our clustering results.
We will start by using the UMAP coordinates for the plot.
Note that this does require that the cluster identities were stored in the `SingleCellExperiment` object, as we just did.

```{r plot_k, live = TRUE}
# plot clustering results
plotReducedDim(hodgkins_sce, "UMAP", color_by = "kcluster")
```

- Do those clusters line up with what you might have expected if you were doing this by eye?
- If we repeat this, do we get the same cluster assignments?
- What happens if we change the number of clusters?
- What do the results look like if you plot with the `PCA` or `TSNE` coordinates?

You will have time to explore questions like these in the exercise notebooks.
One thing worth noting right away though is that cluster numbers here and in the future are assigned arbitrarily.
Even if we got exactly the same logical clusters across runs (unlikely!), we wouldn't expect the label numbers to be the same or stable.

### Graph-based clustering

Another common type of clustering method for single cell data is graph-based clustering.
This algorithm follows the following general steps:

1. Identifying a set of nearest neighbors for each cell that have similar expression profiles to that cell.
2. Connect each cell to its neighbors in a network graph, weighting the connections by how similar the connected cells are.
3. Break the network up by identifying clusters of cells that are more connected to each other than they are to cells outside the clusters.

There is a lot of hidden detail in those three steps!

To apply this clustering algorithm, we will use the same `bluster::clusterRows()` function as before, but we will change the second argument from `KmeansParam()` to `NNGraphParam()` to tell it that we want to use a nearest-neighbor graph-based method.
We can then supply additional parameters to `NNGraphParam()` to adjust the details of the algorithm.
Here we will use `k` to specify the number of neighbors to use when building the graph and `cluster.fun` to specify the algorithm for identifying the clusters within the graph.

- Despite sharing a letter, `k` here and the one from k-means clustering are not the same thing!
In this case, we are telling the algorithm how many neighbor connections to make for each cell, not the final number of clusters, which will be determined by the algorithm we use for the cluster building step.

- The options for `cluster.fun` describe the algorithm for the cluster building step described above. These include `walktrap` (the default), `leiden`, and `louvain`, which is the default algorithm in [`Seurat`](https://satijalab.org/seurat/), another common package for single cell analysis that you may have seen.

In the example below, we will use the default values for these two arguments.

```{r nnclust, live = TRUE}
# run the clustering algorithm
nnclusters <- clusterRows(
  hodgkins_pca,
  NNGraphParam(k = 10,
               cluster.fun = "walktrap")
  )
# store cluster results in the SCE object
hodgkins_sce$nncluster <- factor(nnclusters)
```

Now we can plot the results of our graph-based clustering.
This time we will also use the `text_by` argument to include the cluster ids directly on the plot.

```{r plot_nnclust, live = TRUE}
plotReducedDim(hodgkins_sce,
               "UMAP",
               color_by = "nncluster",
               text_by = "nncluster")
```

- How do these results compare to the k-means clustering result?
- How sensitive is this to the parameters we choose?
- How do the numbers of clusters change with different parameters?

Again, you will have time to explore these more in the exercise notebook, and of course with your own data!
Sadly, there are not always good answers to which set of inferred clusters is best!
Which method and parameters you use may depend on the kind of question you are trying to answer.

For more detailed information about the methods presented here, including some ways to assess the "quality" of the clustering, I encourage you to explore at the relevant chapter of the [Orchestrating Single-Cell Analysis book](https://bioconductor.org/books/3.16/OSCA.basic/clustering.html#clustering-graph).
A recent review by [Kislev *et al.* (2019)](https://doi.org/10.1038/s41576-018-0088-9) also goes into some depth about the differences among algorithms and the general challenges associated with clustering single cell data.

## Identifying marker genes

![Roadmap: Find markers](diagrams/roadmap_single_findmarkers.png)

Assigning clusters is nice for visualization, but we would also like to be able to move toward a biological interpretation of the clusters and identifying the cell types in each cluster.
To that end, we can identify marker genes that are differentially expressed among clusters.

It is worth noting here that the statistical calculations here are more than a bit circular: we identified clusters first based on gene expression, then we are using those same clusters to find differences in gene expression.
The result is that even if there were no *true* clusters, we would always find marker genes!
For a much more technical exploration of this circularity (and a method to correct for it), see a preprint by [Gao et al. (2020)](https://arxiv.org/abs/2012.02936).
In light of this, it is better to think about marker gene identification as an aid in interpreting the clustering results (and possibly extending insights to new data sets), rather than results that should be interpreted on their own, and we should be extremely wary of justifying cluster assignments solely based on these results!
With that caveat, let's proceed.

To identify marker genes, we will use the `scran::findMarkers()` function, which will rank genes by their differential expression by calculating pairwise statistics among clusters.
We have a few options for how to determine the gene rankings and marker gene list for each cluster.
At one end could include genes that are differentially expressed in *any* pairwise comparison against our focal cluster, or at the other we could only include genes that are differentially expressed in *all* comparisons with that cluster.
We could also do something in between, including genes that differentiate the focal cluster from *some* fraction of the other clusters.
For now, we will use the `findMarkers()` function to rank the genes in each cluster by their combined scores against *all* other clusters, using the `pval.type` argument.

`findMarkers()` will return a list (technically a list-like object) of tables, one for each cell type, with statistics for each gene showing how well it differentiates that cell type against other types.


```{r find_markers, live = TRUE}
# use `findMarkers()` to calculate how well each gene
#  differentiates each cluster from *all* other clusters
markers <- scran::findMarkers(hodgkins_sce,
                              groups = hodgkins_sce$nncluster,
                              pval.type = "all")
```

Next we can look at one of those tables.
We will start with the first cluster, which we will select from the list using the R standard double bracket `[[1]]` notation.
We also doing a bit of transformation here to pull the gene name into a column of its own.

```{r marker_table}
markers[[1]] |>
  as.data.frame() |> # convert to a data frame
  tibble::rownames_to_column("gene") # make gene a column
```

You can see that this table includes values for all genes, so we would like to make a shorter list.

Because we tend to like [tidy data](https://r4ds.had.co.nz/tidy-data.html), here we use a `tidyverse` function from the [`purrr` package](https://purrr.tidyverse.org) to apply the same operations as above to every element of the `markers` list.
We will introduce `purrr` briefly here, but if you want more information and background, we recommend the [`purrr` cheatsheet (PDF)](https://github.com/rstudio/cheatsheets/raw/main/purrr.pdf) and Jenny Bryan's great [`purrr` tutorial](https://jennybc.github.io/purrr-tutorial/index.html).


The main functions in `purrr` are the `map()` functions, which take as their main arguments a **list** and a **function** to apply to each element of the list.
The main function is `purrr::map()`; if you are familiar with the base R `lapply()` function, it is very similar, but with some different defaults.
We will use it to get the top rows from each table by applying the `head()` function to each element of the list.
The results are returned as a new list.

```{r head_markers, eval = FALSE}
purrr::map(
  as.list(markers[1:3]), # select the first 3 clusters and convert to a 'regular' list for purrr
  head # the function to apply (note no parenthesis)
  )
```

This returns a list of data frames, which isn't quite what we want.

There is no built-in function that will give us just the first few _row names_, so we will have to define one.
As of version 4.1, R introduced a new approach to defining _anonymous functions_ - that is, functions you can quickly define "on-the-fly" without formally assigning them to a function name.
They are handy when you need to do a very short task that requires a function, but it isn't really a function you need beyond this context.
This new anonymous syntax looks like this: `\(x)...` (or for slightly longer code, use curly braces as in `\(x) {...}`).
This defines a function that takes one argument, `x`, with `...` indicating where you would put the expression to calculate.

`purrr::map()` will then apply the expression in our anonymous function to each element of the list, and return the results as a new list.

```{r head_markernames, live = TRUE}
# Get the first few row names of each table with a purrr function.
purrr::map(
  # convert markers to a 'regular' list for purrr
  as.list(markers),
  # our custom function:
  \(x) head( rownames(x) )
)
```

Another variant is `purrr::imap()`, which allows us to use the names of the list elements in our function.
(Try `names(markers)` to see the names for the list we are working with now.)
We will use that here to name output files where we will print each of the marker tables, one for each cell type.
We are again defining a custom function within the call to `purrr:imap()` using the `\(x)` syntax, but this time we need two variables: we will use `table` for the list elements (each a table of results) and `id` for their names.
So, we'll actually start by defining the function as `\(table, id)`, since there will be two input arguments.
Because we don't know the identities of the clusters we identified, these are just the cluster numbers for now.

Making file names from numbers can be a a bit fraught, as we really want them to sort in numerical order, but many systems will sort by alphabetical order.
Unfortunately, that would tend to sort 10-19 before 2, 20-29 before 3, etc.
To solve this, we are using the `sprintf()` function, which allows us to specify the format of a printed string.
In this case, we are using the formatting syntax of `%02d` to tell it that we will want to insert (`%`) a number (`d`), with two digits and leading zeros.
To see what this does a bit more concretely, let's look at a simple example:

```{r sprintf}
sprintf("%02d", 1:10)
```


In addition to writing the tables out, we are saving the data frames we created as a new list that we can use in the next step.

```{r write_tables}
marker_df_list <- purrr::imap(
  as.list(markers), # convert markers to a 'regular' list for purrr
  # purrr function: x is the list element, y is the element name (number here)
  \(table, id) {
    as.data.frame(table) |> # first convert to a data frame
      tibble::rownames_to_column("gene") |> # make genes a column
      dplyr::arrange(FDR) |> # sort to be sure small FDR genes are first
      readr::write_tsv( # write each data frame to a file
        file.path(
          marker_dir, # construct the output path
          sprintf("cluster%02d_markers.tsv", as.integer(id)) # format cluster numbers in file names with leading zeros
        )
      )
  }
)
```


### Plotting marker gene expression

One thing we can do with this list of marker genes is to see how they look across the cells and clusters.
The `scater::plotReducedDim()` function makes this easy!
We have earlier colored points by some cell statistic, like the number of expressed genes, but it is just as easy to color by the expression of a single gene by using the gene identifier as the `color_by` argument.

The first step is to get the gene information for the genes we might be interested in.

```{r marker_info, live = TRUE}
# get gene ids for top 10 cluster 1 markers
gene_ids <- marker_df_list[[1]] |>
  head(n = 10) |>
  dplyr::pull(gene)

# look at the gene info for these
gene_info <- rowData(hodgkins_sce)[gene_ids, ]
data.frame(gene_info)
```

Now we can pick one of the genes for plotting and go!

```{r plot_marker_expression, live = TRUE}
# get gene id and gene symbol for nicer plotting
rank <- 1
gene_id <- gene_info$ID[rank]
symbol <- gene_info$Symbol[rank]

# Plot UMAP results colored by expression
plotReducedDim(hodgkins_sce, "UMAP",
               color_by = gene_id) +
  # label the guide with the gene symbol
  guides(color = guide_colorbar(title = symbol))
```


Hopefully that expression pattern aligns at least in part with your expectations!

## Next steps

So far we have identified clusters of cells (if you believe them), and found some genes that are associated with each cluster.
What you might want to know at this point is what *cell types* comprise each cluster.
Setting aside the thorny question of "what is a cell type?", this is still a challenging problem, and we'll explore some approaches to perform cell type annotation in the next notebook!


## Session Info

```{r session}
sessionInfo()
```
