This notebook will demonstrate how to:
emptyDropsCellRanger()
miQC()
In this notebook, we will review basic processing for single-cell
RNA-seq data, starting with the output from Cell Ranger, and proceeding
through filtering, quality control, normalization, and dimension
reduction. We will perform these tasks using tools from the Bioconductor project, in particular
SingleCellExperiment
objects and functions that work with those objects. Much of the
material in this notebook is directly inspired by, and draws heavily on,
material presented in the book Orchestrating Single
Cell Analysis with Bioconductor (OSCA).
The data we will use for this notebook is derived from a human glioblastoma specimen. The sample was processed by 10x Genomics using a 3’ RNA kit (v3.1), sequenced, and quantified with Cell Ranger 6.0. Further details about the sample and processing can be found on the 10x website.
To start, we will load some of the libraries we will need later, and set a random number seed for reproducibility.
# Load libraries
# Plotting functions
library(ggplot2)
# The main class we will use for Single Cell data
library(SingleCellExperiment)
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats
Attaching package: 'MatrixGenerics'
The following objects are masked from 'package:matrixStats':
colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
colWeightedMeans, colWeightedMedians, colWeightedSds,
colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
rowWeightedSds, rowWeightedVars
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':
findMatches
The following objects are masked from 'package:base':
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':
rowMedians
The following objects are masked from 'package:matrixStats':
anyMissing, rowMedians
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
# Setting the seed for reproducibility
set.seed(12345)
Before we get too far, we like to define the input and output files that the notebook will use near the top of the document. While you might not know the names of all of the files you will need or create output files when you start an analysis, we have found it helpful to keep all file and directory names in a single place near the top of the document. This makes it easier for somebody coming to the code later to quickly see what files are needed as input and what will be produced as output. More often than not, that somebody is you!
The gene expression data were processed to create a gene-by-cell
expression matrix of counts for using Cell Ranger 6.0. We have provided
the raw data directory, raw_feature_bc_matrix
, which is
usually produced by Cell Ranger and placed in its outs
directory. This directory usually contains three files: -
barcodes.tsv.gz
, a table of the cell barcodes that 10x
uses, corresponding to the columns of the count matrix. -
features.tsv.gz
, a table of the features (genes in this
case) for which expression was quantified. This will usually also
include a bit of metadata about the features, including gene symbols (if
the features are genes) and the type of data they represent (e.g., gene
expression or antibody capture). - matrix.mtx.gz
, The
counts themselves, stored in a sparse “Matrix Exchange”
format.
Cell Ranger will also export these data in a single HDF5
format file with a .h5
extension, which can also be
imported with the same commands we will use below. However, we have
found that processing large .h5
files is often
much less efficient in R, so we prefer to start with the matrix
files when possible. In particular, we would not recommend working with
.h5
files for raw data; the filtering steps we will use
below can sometimes take hours when using those files as input.
We will also need a table of mitochondrial genes, which we have
stored in the data/reference/
directory.
Finally, we will set up the our output directory, creating it if it does not yet exist, and define the name for the files we will save after all of our initial processing is complete.
# Inputs --------------------------------------
# main data directory
data_dir <- file.path("data", "glioblastoma-10x")
# Path to the Cell Ranger matrix file
raw_matrix_dir <- file.path(data_dir, "raw_feature_bc_matrix")
# reference data directory
ref_dir <- file.path("data", "reference")
# Path to mitochondrial genes table
mito_file <- file.path(ref_dir, "hs_mitochondrial_genes.tsv")
# Outputs ------------------------------------
# Directory and file to save output
normalized_dir <- file.path(data_dir, "normalized")
# create the directory if it does not exist
fs::dir_create(normalized_dir)
# output RDS file for normalized data
output_sce_file <- file.path(normalized_dir,
"glioblastoma_normalized_sce.rds")
Whether the 10x Cell Ranger data is in Matrix Exchange format or in
an HDF5 file, we can use the read10xCounts()
function from
the DropletUtils
package to read the data into R and create
a SingleCellExperiment
object. (Though again, we do not
recommend using the .h5
file if you can avoid it,
especially for raw (unfiltered) data.)
If you used something other than Cell Ranger to process the raw data,
you would need to use a different function to read it in and create the
SingleCellExperiment
object. Some of these functions for
other common data formats are discussed in [Chapter 3 of OSCA] (http://bioconductor.org/books/3.16/OSCA.intro/getting-scrna-seq-datasets.html#reading-counts-into-r).
# read SCE from matrix directory
raw_sce <- DropletUtils::read10xCounts(
raw_matrix_dir,
col.names = TRUE # ensure barcodes are set as column names in the SCE object
)
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'HDF5Array'
Let’s look at the contents of the object after reading it in:
# view SCE object
raw_sce
class: SingleCellExperiment
dim: 36601 734492
metadata(1): Samples
assays(1): counts
rownames(36601): ENSG00000243485 ENSG00000237613 ... ENSG00000278817
ENSG00000277196
rowData names(3): ID Symbol Type
colnames(734492): AAACCCAAGAAACCCA-1 AAACCCAAGAAATTCG-1 ...
TTTGTTGTCTTTCTAG-1 TTTGTTGTCTTTGCAT-1
colData names(2): Sample Barcode
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
We can see from this summary that this
SingleCellExperiment
(SCE) object contains 36,601 rows,
which correspond to the features (genes) that were analyzed, and 734,492
columns, which correspond to the possible barcode tags that were used in
the experiment. Note that not all of these barcode tags will have been
used, and many of the features may never have been seen either. One of
our first steps will be to filter out barcodes that were never seen, or
that may have only been seen in a droplet that did not contain a cell
(an “empty droplet”).
SingleCellExperiment
objectIn addition to the main counts
matrix, listed as an
assay
in the SCE summary above, the SCE object can contain
a number of other tables and matrices, each stored in a “slot” with a
particular format. The overall structure of the object can be seen in
the figure below, which comes from an OSCA
Introduction chapter.
We have just mentioned the main assay
slot, which
contains full matrices of data (such as transcript counts) with each row
a feature and each column a cell. There are also a couple of tables for
metadata, and a slot to store reduced-dimension representations (e.g.,
PCA and/or UMAP) of the expression data.
We’ll start with the rowData
slot, which is a table of
metadata for each feature in our object. For now that contains the
contents of the features.tsv.gz
file that we discussed
earlier. If we had read the data from something other than Cell Ranger
output, we might have different contents, but each row would still
correspond to a single feature of the SCE object.
Let’s look at this table, extracting it from the SCE object with the
rowData()
function and using head()
to view
only the first 6 rows.
# view rowData (features)
head(rowData(raw_sce))
DataFrame with 6 rows and 3 columns
ID Symbol Type
<character> <character> <character>
ENSG00000243485 ENSG00000243485 MIR1302-2HG Gene Expression
ENSG00000237613 ENSG00000237613 FAM138A Gene Expression
ENSG00000186092 ENSG00000186092 OR4F5 Gene Expression
ENSG00000238009 ENSG00000238009 AL627309.1 Gene Expression
ENSG00000239945 ENSG00000239945 AL627309.3 Gene Expression
ENSG00000239906 ENSG00000239906 AL627309.2 Gene Expression
You can see that this table includes an ID
for each
feature, which is usually the Ensembl gene ID, as well as the
corresponding gene symbol in the Symbol
column. Finally
there is a column for Type
, which in this case is always
“Gene Expression”, as all of the features in this data set are genes. If
there were another modality of data that had been assayed in this
experiment, there might be other values in this column, such as
“Antibody Capture” for CITE-seq experiments.
The second slot is the colData
table, which now
corresponds to the barcodes.tsv.gz
file, containing one row
per cell barcode, or, more generally, one row per column of the
counts
assay. We can look at this table using the
colData()
function (and head()
again to
prevent printing the whole table):
# view colData (cell barcodes)
head(colData(raw_sce))
DataFrame with 6 rows and 2 columns
Sample Barcode
<character> <character>
AAACCCAAGAAACCCA-1 data/glioblastoma-10.. AAACCCAAGAAACCCA-1
AAACCCAAGAAATTCG-1 data/glioblastoma-10.. AAACCCAAGAAATTCG-1
AAACCCAAGAACTGAT-1 data/glioblastoma-10.. AAACCCAAGAACTGAT-1
AAACCCAAGAAGAACG-1 data/glioblastoma-10.. AAACCCAAGAAGAACG-1
AAACCCAAGAAGCGCT-1 data/glioblastoma-10.. AAACCCAAGAAGCGCT-1
AAACCCAAGAAGGATG-1 data/glioblastoma-10.. AAACCCAAGAAGGATG-1
Here we see that there are currently two columns:
Sample
column has the path of the file that we read
in (you may not see the whole path in this display); this should be
identical in all rows from a single sample.Barcode
column contains the sequence that was used
to identify each potential droplet for sequencing (and a numeric tag, in
this case). These will be unique within a sample.As we proceed to calculate per-cell statistics, we will be adding new data to this table.
Most of the barcodes in any given 10x experiment will not be seen at all, so our first step can be to filter this raw data to only the cells where there is at least one transcript that was counted with that barcode.
To do this, we will use the colSums()
function to
quickly add up all the counts that correspond to each possible cell
barcode, then filter our raw_sce
down to just those columns
where there are non-zero total counts. We will need to extract the
counts
matrix from our SCE object, which we can do using
the counts()
function, conveniently enough.
# sum columns from counts matrix
barcode_counts <- colSums(counts(raw_sce))
# filter SCE object to only rows with counts > 0
raw_sce <- raw_sce[, which(barcode_counts > 0)]
Now we can look at how our SCE object has changed:
raw_sce
class: SingleCellExperiment
dim: 36601 462027
metadata(1): Samples
assays(1): counts
rownames(36601): ENSG00000243485 ENSG00000237613 ... ENSG00000278817
ENSG00000277196
rowData names(3): ID Symbol Type
colnames(462027): AAACCCAAGAAACCCA-1 AAACCCAAGAAATTCG-1 ...
TTTGTTGTCTTTCTAG-1 TTTGTTGTCTTTGCAT-1
colData names(2): Sample Barcode
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
But barcodes with zero counts are not the only ones that correspond to droplets without cells in them! Even if a droplet does not have a cell in it, there will often be spurious reads from RNA sequences that were present in the extracellular solution, whether from the original sample or from cells that were damaged during single-cell library preparation.
We could identify these barcodes simply as those with low transcript counts. Or, we can be a bit more clever! We can look at the transcript counts from the lowest-count droplets to create an expected distribution of transcripts in droplets that don’t contain cells. Then we can test each droplet to determine whether or not its transcript distribution deviates from that expectation. If it does, then we have pretty good evidence that there is a cell in there.
This test was first proposed by Lun et al.
(2019) and implemented as emptyDrops()
in the
DropletUtils
package. This method was then adopted, with
some modifications, as the default cell filtering method used by Cell
Ranger. Here we will use the emptyDropsCellRanger()
function to perform filtering that more closely matches the Cell
Ranger implementation.
# create a table of statistics using emptyDropsCellRanger
droplet_df <- DropletUtils::emptyDropsCellRanger(raw_sce)
Most values in this table are NA
, because individual
statistics were not calculated for the low-count droplets that were used
to generate the background distribution. (Most droplets don’t have
cells, so this makes some sense!)
We can look at just the rows without NA
values by
selected the ones where the FDR (which we will use again soon), is not
NA
.
# view rows where FDR is not `NA`
droplet_df[!is.na(droplet_df$FDR), ]
DataFrame with 2176 rows and 5 columns
Total LogProb PValue Limited FDR
<integer> <numeric> <numeric> <logical> <numeric>
AAACCCACATGTGTCA-1 15622 NA NA NA 0.0000000
AAACCCAGTGGTAATA-1 589 -1614.42 0.6597340 FALSE 0.7664609
AAACGAAAGAAACTAC-1 507 -1343.92 0.9993001 FALSE 1.0000000
AAACGAAAGAGAACCC-1 19899 NA NA NA 0.0000000
AAACGAATCCCTTCCC-1 881 -2262.85 0.0237976 FALSE 0.0311013
... ... ... ... ... ...
TTTGGTTTCCCGGTAG-1 867 -2361.80 0.00009999 TRUE 0.000139922
TTTGGTTTCCGGACTG-1 1245 -2840.48 0.04199580 FALSE 0.054297601
TTTGGTTTCTGTCCCA-1 770 -2046.57 0.05659434 FALSE 0.072654445
TTTGTTGGTCAGGCAA-1 1011 -2876.13 0.00009999 TRUE 0.000139922
TTTGTTGTCAGATTGC-1 9987 NA NA NA 0.000000000
You will notice that some cells with high counts also have
NA
values for many statistics. In those cases,
NA
values are actually present because of the high
counts - emptyDropsCellRanger()
automatically assumed cells
were present, so they were also not tested.
Now we can filter our raw_sce
object by column
to only keep the cells with a small FDR: those that are quite unlikely
to be empty droplets.
# filter droplets using `which` to prevent NA trouble
cells_to_retain <- which(droplet_df$FDR < 0.01)
filtered_sce <- raw_sce[, cells_to_retain]
How many cells do we have now?
filtered_sce
class: SingleCellExperiment
dim: 36601 1626
metadata(1): Samples
assays(1): counts
rownames(36601): ENSG00000243485 ENSG00000237613 ... ENSG00000278817
ENSG00000277196
rowData names(3): ID Symbol Type
colnames(1626): AAACCCACATGTGTCA-1 AAACGAAAGAGAACCC-1 ...
TTTGTTGGTCAGGCAA-1 TTTGTTGTCAGATTGC-1
colData names(2): Sample Barcode
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
In addition to filtering out empty droplets, we also will want to filter out cells that may have been damaged during library preparation. These will often be characterized by a high proportion of mitochondrial transcripts and a smaller overall number of unique transcripts. When a cell ruptures, cytoplasmic transcripts will leak out, but mitochondrial transcripts, still protected by the mitochondrial membrane, may remain. As a consequence, there will be an over-abundance of mitochondrial reads, and fewer unique transcripts expressed.
Our first step then, is create a vector of the mitochondrial genes
that are present in our dataset. The mitochondrial file we defined
during setup (mito_file
) is a TSV file containing all of
the human mitochondrial genes with additional annotation information for
each gene, such as the gene location and alternative names. (For more
detail on the steps we took to create this file, you can look at one
of our setup notebooks)
All we need now is the gene_id
, and only for the genes
that are present in our SCE, so we will do some filtering with
dplyr
to pull out a vector with just those ids.
# read in a table of mitochondrial genes and extract ids
mito_genes <- readr::read_tsv(mito_file) |>
# filter to only the genes that are found in our dataset
dplyr::filter(gene_id %in% rownames(filtered_sce)) |>
# create a vector from the gene_id column
dplyr::pull(gene_id)
Rows: 37 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (9): gene_id, gene_name, seqnames, strand, gene_biotype, seq_coord_syste...
dbl (4): start, end, width, entrezid
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We can now use the scuttle
function
addPerCellQC()
to calculate some statistics based on the
counts matrix, which will be added to the colData
table.
In addition to calculating statistics like the total read count for
each cell and the number of transcripts that are detected, we can also
calculate those statistics for defined subsets of genes. In this case,
we will use our mito_genes
vector to define a subset called
mito
. The mito
name is important in that it is
the name that will be expected by a later function. (We could define
more subsets, but for now this one will do.)
filtered_sce <- scuttle::addPerCellQC(filtered_sce,
subsets = list(mito = mito_genes))
Now we can look at the colData to see what was added:
head(colData(filtered_sce))
DataFrame with 6 rows and 8 columns
Sample Barcode sum
<character> <character> <numeric>
AAACCCACATGTGTCA-1 data/glioblastoma-10.. AAACCCACATGTGTCA-1 15622
AAACGAAAGAGAACCC-1 data/glioblastoma-10.. AAACGAAAGAGAACCC-1 19899
AAACGCTAGATTAGAC-1 data/glioblastoma-10.. AAACGCTAGATTAGAC-1 2858
AAACGCTGTACGCTAT-1 data/glioblastoma-10.. AAACGCTGTACGCTAT-1 31363
AAAGAACAGTACAGCG-1 data/glioblastoma-10.. AAAGAACAGTACAGCG-1 12636
AAAGGATAGATTGTGA-1 data/glioblastoma-10.. AAAGGATAGATTGTGA-1 13489
detected subsets_mito_sum subsets_mito_detected
<integer> <numeric> <integer>
AAACCCACATGTGTCA-1 2822 1699 13
AAACGAAAGAGAACCC-1 3503 1141 11
AAACGCTAGATTAGAC-1 1367 108 10
AAACGCTGTACGCTAT-1 4354 1887 13
AAAGAACAGTACAGCG-1 2780 1008 11
AAAGGATAGATTGTGA-1 2533 1145 13
subsets_mito_percent total
<numeric> <numeric>
AAACCCACATGTGTCA-1 10.87569 15622
AAACGAAAGAGAACCC-1 5.73396 19899
AAACGCTAGATTAGAC-1 3.77887 2858
AAACGCTGTACGCTAT-1 6.01664 31363
AAAGAACAGTACAGCG-1 7.97721 12636
AAAGGATAGATTGTGA-1 8.48840 13489
We can also plot some of these statistics, here using the
plotMetrics()
function from the miQC
package
to plot the percent of reads that are mitochondrial (the
subsets_mito_percent
column) against the number of unique
genes detected (the detected
column) for each cell.
# use miQC::plotMetrics()
miQC::plotMetrics(filtered_sce) + theme_bw()
We can see that there is a range of mitochondrial percentages, and it does also seem that cells with high percentages of mitochondrial genes don’t seem to contain very many unique genes.
How do we filter with this information? One option is to define a cutoff for the mitochondrial percentage above which we call a cell compromised and exclude it from further analysis. However, choosing that cutoff can be a bit fraught, as the expected percentage of mitochondrial reads can vary depending on the cell type and library preparation methods. So it might be nice to have a method to determine that cutoff from the data itself.
Determining mitochondrial cutoffs is exactly what the
miQC
package does (Hippen et
al. 2021)! In truth, it does something possibly even a bit
better: it fits a mixture model to the data that consists of
distributions of healthy cells and compromised cells. Then we can
calculate whether each cell is more likely to belong to the healthy or
compromised distribution. We can then exclude the cells that are more
likely to be compromised.
To use miQC
, we first fit a model to the data in our SCE
object:
# fit the miQC model
miqc_model <- miQC::mixtureModel(filtered_sce)
Now we can plot the model results using the plotModel()
function to see how it corresponds to our data. We should expect to see
two fit lines:
This plot will also show, for each cell, the posterior probability that the cell is derived from the compromised distribution; a higher score indicates that a cell is more likely to be compromised.
It is also critical to note that this model can and does fail at times. Plotting the results as we have done here is not a step to skip. Always look at your data!
# plot the miQC model
miQC::plotModel(filtered_sce, miqc_model) +
theme_bw()
We can now filter our data based on the probability compromised as
calculated from the model. But before we do that, we might want to
quickly plot to see what would be filtered out with a given cutoff,
using the plotFiltering()
function. The default is to
exclude cells that have a posterior probability of 0.75 or greater of
being compromised. We stick with that default, but for clarity, we will
also include it in our code!
# look at miQC filtering
miQC::plotFiltering(filtered_sce, miqc_model,
posterior_cutoff = 0.75) +
theme_bw()
In this case, the line between the cells to be kept and those that will be removed seems to correspond to a mitochondrial percentage of about 12.5%, but note that this will not always be constant. The cutoff point can vary for different numbers of unique genes within a sample, and it will certainly vary among samples!
At this point, we can perform the actual filtering using the
filterCells()
function, giving us a further filtered SCE
object.
# perform miQC filtering
qcfiltered_sce <- miQC::filterCells(filtered_sce,
model = miqc_model)
Removing 387 out of 1626 cells.
While the miQC filtering is pretty good, you may have noticed that it still left some cells that had very low numbers of unique genes. While these cells may not be compromised, the information from them is also not likely to be useful, so we will filter those as well. We will only keep cells that have at least 200 unique genes.
# filter cells by unique gene count (`detected`)
qcfiltered_sce <- qcfiltered_sce[, which(qcfiltered_sce$detected >= 200)]
qcfiltered_sce
class: SingleCellExperiment
dim: 36601 1233
metadata(1): Samples
assays(1): counts
rownames(36601): ENSG00000243485 ENSG00000237613 ... ENSG00000278817
ENSG00000277196
rowData names(3): ID Symbol Type
colnames(1233): AAACCCACATGTGTCA-1 AAACGAAAGAGAACCC-1 ...
TTTGGTTTCATCTACT-1 TTTGTTGTCAGATTGC-1
colData names(9): Sample Barcode ... total prob_compromised
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
Now that we have done our filtering, we can start analyzing the expression counts for the remaining cells.
The next step at this point is to convert the raw counts into a measure that accounts for differences in sequencing depth between cells, and to convert the distribution of expression values from the skewed distribution we expect to see in raw counts to one that is more normally distributed.
We will do this using functions from the scran
and
scuttle
packages. The procedure we will use here is derived
from the OSCA
chapter on normalization. The idea is that the varying expression
patterns that different cell types exhibit will affect the scaling
factors that we would apply. To account for that variation, we first do
a rough clustering of cells by their expression with
scran::quickCluster()
, then use that clustering to
calculate the scaling factor for each cell within the clusters using
scran::computeSumFactors()
. Finally, we apply the scaling
factor to the expression values for each cell and calculate the
log-scaled expression values using the
scuttle::logNormCounts()
function.
# Perform rough clustering
qclust <- scran::quickCluster(qcfiltered_sce)
# use clusters to compute scaling factors and add to SCE object
qcfiltered_sce <- scran::computeSumFactors(qcfiltered_sce,
clusters = qclust)
# perform normalization using scaling factors
# and save as a new SCE object
normalized_sce <- scuttle::logNormCounts(qcfiltered_sce)
This creates a new “assay” in the normalized_sce
object,
logcounts
, which contains the normalized count values for
each cell and gene. (The data here are not actually the log of
the counts, since we also applied the scaling factors, but that name is
used for historical reasons.)
Let’s take a look:
normalized_sce
class: SingleCellExperiment
dim: 36601 1233
metadata(1): Samples
assays(2): counts logcounts
rownames(36601): ENSG00000243485 ENSG00000237613 ... ENSG00000278817
ENSG00000277196
rowData names(3): ID Symbol Type
colnames(1233): AAACCCACATGTGTCA-1 AAACGAAAGAGAACCC-1 ...
TTTGGTTTCATCTACT-1 TTTGTTGTCAGATTGC-1
colData names(10): Sample Barcode ... prob_compromised sizeFactor
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
Now that we have normalized expression values, we would like to produce some reduced-dimension representations of the data. These will allow us to perform some downstream calculations more quickly, reduce some of the noise in the data, and allow us to visualize overall relationships among cells more easily (though with many caveats!).
While we could calculate the reduced dimensions using all of the genes that we have assayed, in practice most of the genes will have very little variation in expression, so doing so will not provide much additional signal. Reducing the number of genes we include will also speed up some of the calculations.
To identify the most variable genes, we will use functions from the
scran
package. The first function,
modelGeneVar()
, attempts to divide the variation observed
for each gene into a biological and technical component, with the
intuition that genes with lower mean expression tend to have lower
variance for purely technical reasons. We then provide the
modelGeneVar()
output to the getTopHVGs()
function to identify the genes with the highest biological
variation, which is what we are most interested in.
# identify 2000 genes
num_genes <- 2000
# model variance, partitioning into biological and technical variation
gene_variance <- scran::modelGeneVar(normalized_sce)
# get the most variable genes
hv_genes <- scran::getTopHVGs(gene_variance,
n = num_genes)
The result is a vector of gene ids (ordered from most to least variable):
head(hv_genes)
[1] "ENSG00000118785" "ENSG00000130203" "ENSG00000204287" "ENSG00000275302"
[5] "ENSG00000204389" "ENSG00000019582"
ENSG00000118785
ENSG00000130203
ENSG00000204287
ENSG00000275302
ENSG00000204389
ENSG00000019582
Now that we have selected the genes we would like to use for the
reduced-dimension representations of the expression data, we can start
to calculate them. First we will use the scater::runPCA()
function to calculate the principal components from the expression
matrix. This representation is fast and fairly robust, but the result is
still quite multidimensional. We want keep a fair number of components
(dimensions) in order to accurately represent the variation in the data,
but doing so means that plotting only a few of these dimensions (in 2D)
is not likely to provide a full view of the data.
The default number of components is 50, which we will stick with, but let’s enter it manually just for the record.
# calculate and save PCA results
normalized_sce <- scater::runPCA(
normalized_sce,
ncomponents = 50, # how many components to keep
subset_row = hv_genes # use only the variable genes we chose
)
These reduced-dimension results will be stored in a
reducedDim
slot in the SCE object. We can see the names of
the reducedDim
s that we have by looking at the object
summary:
normalized_sce
class: SingleCellExperiment
dim: 36601 1233
metadata(1): Samples
assays(2): counts logcounts
rownames(36601): ENSG00000243485 ENSG00000237613 ... ENSG00000278817
ENSG00000277196
rowData names(3): ID Symbol Type
colnames(1233): AAACCCACATGTGTCA-1 AAACGAAAGAGAACCC-1 ...
TTTGGTTTCATCTACT-1 TTTGTTGTCAGATTGC-1
colData names(10): Sample Barcode ... prob_compromised sizeFactor
reducedDimNames(1): PCA
mainExpName: NULL
altExpNames(0):
If we want to extract the PCA results, we can do that with the
reducedDim()
function: Note that for these
reduced-dimensionality matrices, the rows are the cells and the columns
are the PC dimensions.
# extract the PCA matrix
pca_matrix <- reducedDim(normalized_sce, "PCA")
# look at the shape of the matrix
dim(pca_matrix)
[1] 1233 50
Finally, we will calculate a UMAP (Uniform Manifold Approximation and Projection) representation of our data. This is a machine-learning-based method that is useful for performing dimensionality reduction suitable for visualization. It’s goal is to provide a representation of the data in two dimensions (typically, more are possible) that preserves as much of the distance relationships among cells as possible. While this does make for visually appealing and useful plots, it is important not to overinterpret the results! In particular, while you will often see some apparent clustering of cells in the resulting output, those clusters may not be particularly valid, and the spacing within or between clusters may not reflect true distances.
In many ways this is analogous to the problem of projecting a map of the earth onto a flat surface; any choice will result in some distortions. However, with UMAP, we rarely know exactly what choices were made and what distortions might have resulted. The UMAP coordinates themselves should never be used for downstream analysis.
Since the UMAP procedure would be slow to calculate with the full
data, so the runUMAP()
function first calculates a PCA
matrix and then uses that to calculate the UMAP. Since we
already have a PCA matrix, we will tell the function use that instead of
recalculating it.
normalized_sce <- scater::runUMAP(normalized_sce,
dimred = "PCA")
As before, we could extract the UMAP matrix from our SCE object with
the reducedDim()
function. We can also visualize the UMAP
results using the plotReducedDim()
function.
# plot the UMAP
scater::plotReducedDim(normalized_sce,
"UMAP",
# color by the most variable gene
color_by = hv_genes[1])
As a final analysis step at this stage, we will return to the PCA results to perform unsupervised clustering. Here we will use a graph-based clustering method, which starts by identifying cells that are close together in the multidimensional space. It then identifies “communities” of highly connected cells, and breaks them apart by regions of lower connection.
There are a number of algorithms that can perform this clustering, each with parameters that can affect how many clusters are identified and which cells belong to each cluster. It is also worth noting that these clusters may or may not correspond to “cell types” by whatever definition you might prefer to use. Interpretation of these clusters, or other measures of cell type, are something that will require more careful and likely more customized analysis.
We will perform our clustering using the bluster
package, which can perform many different types of clustering. As
mentioned earlier, we are using “graph” clustering, which we define
using the NNGraphParam()
function. Within that are a number
of further options, such as the weighting used for building the network
graph and the algorithm used for dividing the graph into clusters.
Modifying these parameters can result in quite different cluster
assignments! For the clustering below we will use Jaccard weighting and
Louvain clustering, which correspond more closely to the default methods
used by Seurat
than the default parameters. It is also
worth noting that the the cluster assignments are somewhat stochastic.
In particular, the names/numbers of the clusters can be quite
inconsistent between runs!
# perform graph-based clustering
nn_clusters <- bluster::clusterRows(
pca_matrix,
bluster::NNGraphParam(
# number of neighbors to use in network graph
k = 20,
# weighting scheme for building the network graph
# default is "rank"
type = "jaccard",
# cluster detection algorithm
# default is "walktrap"
cluster.fun = "louvain"
)
)
We can save the cluster assignments back into the
colData
of the SCE object with a little shortcut: the
$
followed by the name of the new column we want to
add.
# save clusters to SCE colData
normalized_sce$nn_cluster <- nn_clusters
Now we can plot the UMAP again, this time colored by the cluster
assignments that we just created. Here rather than the general
plotReducedDim()
function, we will use
plotUMAP()
, which is exactly the same, except it always
plots from the reducedDim
slot named UMAP
, so
we can skip that argument.
# plot UMAP with assigned clusters
scater::plotUMAP(normalized_sce,
color_by = "nn_cluster")
What do you see in these results?
What would you want to do next?
We will now save our filtered and normalized object, including the
dimension reduction and clustering results to an RDS
file,
using the file path that we defined at the start of the notebook. If we
were to want to return to this data, we could load this file directly
into a new R session and not have to repeat the processing that we have
done up to this point.
The data in these objects tends to be quite large, but very
compressible. To save space on disk (at the expense of time), we will
make sure that the data is compressed internally before writing it out
to a file. Note that the file we write is still going to be an
.rds
file with no additional extension. (Further note: The
base R function saveRDS()
uses compression by default, but
the tidyverse
function readr::write_rds()
does
not.)
# write RDS file with compression
readr::write_rds(normalized_sce, file = output_sce_file, compress = "gz")
As is our habit at the Data Lab, we will save information about the
computing environment, the packages we have used in this notebook, and
their versions using the sessionInfo()
command.
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
[3] Biobase_2.64.0 GenomicRanges_1.56.0
[5] GenomeInfoDb_1.40.0 IRanges_2.38.0
[7] S4Vectors_0.42.0 BiocGenerics_0.50.0
[9] MatrixGenerics_1.16.0 matrixStats_1.3.0
[11] ggplot2_3.5.1 optparse_1.7.5
loaded via a namespace (and not attached):
[1] gridExtra_2.3 rlang_1.1.3
[3] magrittr_2.0.3 scater_1.32.0
[5] compiler_4.4.1 flexmix_2.3-19
[7] DelayedMatrixStats_1.26.0 vctrs_0.6.5
[9] stringr_1.5.1 pkgconfig_2.0.3
[11] crayon_1.5.2 fastmap_1.1.1
[13] XVector_0.44.0 scuttle_1.14.0
[15] labeling_0.4.3 utf8_1.2.4
[17] rmarkdown_2.26 tzdb_0.4.0
[19] ggbeeswarm_0.7.2 UCSC.utils_1.0.0
[21] bit_4.0.5 bluster_1.14.0
[23] xfun_0.43 modeltools_0.2-23
[25] zlibbioc_1.50.0 cachem_1.0.8
[27] beachmat_2.20.0 jsonlite_1.8.8
[29] highr_0.10 rhdf5filters_1.16.0
[31] DelayedArray_0.30.0 Rhdf5lib_1.26.0
[33] BiocParallel_1.38.0 cluster_2.1.6
[35] irlba_2.3.5.1 parallel_4.4.1
[37] R6_2.5.1 bslib_0.7.0
[39] stringi_1.8.3 limma_3.60.0
[41] jquerylib_0.1.4 Rcpp_1.0.12
[43] knitr_1.46 R.utils_2.12.3
[45] FNN_1.1.4 readr_2.1.5
[47] igraph_2.0.3 Matrix_1.7-0
[49] splines_4.4.1 nnet_7.3-19
[51] tidyselect_1.2.1 viridis_0.6.5
[53] abind_1.4-5 yaml_2.3.8
[55] codetools_0.2-20 lattice_0.22-6
[57] tibble_3.2.1 withr_3.0.0
[59] evaluate_0.23 getopt_1.20.4
[61] pillar_1.9.0 generics_0.1.3
[63] vroom_1.6.5 hms_1.1.3
[65] sparseMatrixStats_1.16.0 munsell_0.5.1
[67] scales_1.3.0 miQC_1.12.0
[69] glue_1.7.0 metapod_1.12.0
[71] tools_4.4.1 BiocNeighbors_1.22.0
[73] ScaledMatrix_1.12.0 locfit_1.5-9.9
[75] fs_1.6.4 scran_1.32.0
[77] cowplot_1.1.3 rhdf5_2.48.0
[79] grid_4.4.1 DropletUtils_1.24.0
[81] edgeR_4.2.0 colorspace_2.1-0
[83] GenomeInfoDbData_1.2.12 beeswarm_0.4.0
[85] BiocSingular_1.20.0 HDF5Array_1.32.0
[87] vipor_0.4.7 rsvd_1.0.5
[89] cli_3.6.2 fansi_1.0.6
[91] viridisLite_0.4.2 S4Arrays_1.4.0
[93] dplyr_1.1.4 uwot_0.2.2
[95] gtable_0.3.5 R.methodsS3_1.8.2
[97] sass_0.4.9 digest_0.6.35
[99] ggrepel_0.9.5 SparseArray_1.4.0
[ reached getOption("max.print") -- omitted 8 entries ]