This notebook will demonstrate how to:
In this notebook, we’ll continue with processing the same dataset that we have been working with, moving onto normalization of scRNA-seq count data that we have already done quality-control analyses of.
For this tutorial, we will be using a pair of single-cell analysis
specific R packages: scater
and scran
to work
with our data. This tutorial is in part based on the scran
tutorial.
Load the libraries we will be using, and set the random number generation seed value for reproducibility.
# Set seed for reproducibility
set.seed(1234)
# GGPlot2 for the plots
library(ggplot2)
# Packages for single cell processing
library(scater)
Loading required package: SingleCellExperiment
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats
Attaching package: 'MatrixGenerics'
The following objects are masked from 'package:matrixStats':
colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
colWeightedMeans, colWeightedMedians, colWeightedSds,
colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
rowWeightedSds, rowWeightedVars
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':
findMatches
The following objects are masked from 'package:base':
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':
rowMedians
The following objects are masked from 'package:matrixStats':
anyMissing, rowMedians
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
Loading required package: scuttle
library(scran)
Now let’s set up the files we will be using:
# main data directory
data_dir <- file.path("data", "tabula-muris")
# Filtered count matrix file from previous notebook
filtered_sce_file <- file.path(data_dir, "filtered", "filtered_sce.rds")
# Metadata file location
metadata_file <- file.path(data_dir, "TM_droplet_metadata.csv")
# Output directory for normalized data
norm_dir <- file.path(data_dir, "normalized")
fs::dir_create(norm_dir)
bladder_sce <- readr::read_rds(filtered_sce_file)
sc_metadata <- readr::read_csv(metadata_file)
Rows: 70118 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): cell, channel, mouse.id, tissue, subtissue, mouse.sex, cell_ontolog...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Because the Tabula Muris project is a well-studied data set, we actually have some cell type information for this data set that we can refer to.
Note that we would normally NOT have this
information until later in the analysis pipeline! Nonetheless, adding it
here will be useful for visualizing the results of our normalization
(and demonstrating how one might add metadata to the
SingleCellExperiment
object).
# get the column (cell) metadata (this includes earlier QC stats!)
# and convert to a data frame
cell_info <- data.frame(colData(bladder_sce)) |>
# convert the row names of this data frame to a separate column
tibble::rownames_to_column("barcode")
cell_metadata <- sc_metadata |>
# filter to just the sample we are working with
dplyr::filter(channel == "10X_P4_3") |>
# extract the 16 nt cell barcodes from the `cell` column
dplyr::mutate(barcode = stringr::str_sub(cell, start= -16)) |>
# choose only the columns we want to add
dplyr::select(barcode, cell_ontology_class, free_annotation)
# Join the tables together, using `left_join()` to preserve all rows in cell_info
cell_info <- cell_info |>
dplyr::left_join(cell_metadata)
Joining with `by = join_by(barcode)`
Check that the sample info accession ids are still the same as the columns of our data.
all.equal(cell_info$barcode, colnames(bladder_sce))
[1] TRUE
Now we can add that data back to the
SingleCellExperiment
object. To keep with the format of
that object, we have to convert our table to a DataFrame
object in order for this to work. Just to keep things confusing, a
DataFrame
is not the same as a data.frame
that
we have been using throughout. We also need to be sure to include the
row.names
argument to keep those properly attached.
Note that this will replace all of the previous column (cell) metadata, which is part of the reason that we pulled out all previous column data content first.
# add new metadata data back to `bladder_sce`
colData(bladder_sce) <- DataFrame(cell_info, row.names = cell_info$barcode)
In whatever data we are working with, we are always looking to maximize biological variance and minimize technical variance. A primary source of technical variation we are concerned with is the variation in library sizes among our samples. While different cells may have different total transcript counts, it seems more likely that the primary source of variation that we see is due to library construction, amplification, and sequencing.
This is where normalization methods usually come into the workflow. The distribution of the counts that we saw in the previous notebook, and in particular the fact that the count data is noisy with many zero counts, makes normalization particularly tricky. To handle this noise, we normalize cells in groups with other cells like them; a method introduced in Lun et al. (2016).
Briefly, we first cluster the cells to find groups of similar cells,
then compute normalization factors based on the sums of expression in
those groups. The group normalization is then applied back to the
individual cells within the group to create a normalized count matrix.
In this case, we will also log-transform the normalized counts to get a
less skewed distribution of expression measures. Note that because of
the zero counts, the logNormCounts()
function will add a
pseudocount of 1 to each value before computing the log.
# Step 1) Group cells with other like cells by clustering.
qclust <- scran::quickCluster(bladder_sce)
Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
collapsing to unique 'x' values
# Step 2) Compute sum factors for each cell cluster grouping.
bladder_sce <- scran::computeSumFactors(bladder_sce, clusters = qclust)
# Step 3) Normalize using these pooled sum factors and log transform.
bladder_sce <- scater::logNormCounts(bladder_sce)
One way to determine whether our normalization yields biologically relevant results is to plot it and see if similarly labeled samples and cells end up together. Because plotting expression for thousands genes together isn’t practical, we will reduce the dimensions of our data using Principal Components Analysis (PCA).
We will also make the same plot with our unnormalized data, to visualize the effect of normalization on our sample. We’ll do this comparison twice:
Before plotting the unnormalized data, we will log transform the raw
counts to make their scaling more comparable to the normalized data. To
do this we will use the log1p()
function, which is
specifically designed for the case where we want to add 1 to all of our
values before taking the log, as we do here. (We could do something like
log(counts + 1)
, but this is both more efficient and more
accurate.)
# Use PCA for dimension reduction of cells' scran normalized data
norm_pca <- scater::calculatePCA(bladder_sce)
# PCA on the raw counts, log transformed
log_pca <- counts(bladder_sce) |> # get the raw counts
log1p() |> # log transform to make these more comparable to the normalized values
scater::calculatePCA() # calculate PCA scores
Note that we are using scater::calculatePCA()
two
different ways here: once on the full bladder_sce
object,
and once on just the counts
matrix. When we use
calculatePCA()
on the object, it automatically uses the log
normalized matrix from inside the object.
Next we will arrange the PCA scores for plotting, adding a column for each of the total UMI counts and the cell type labels so we can color each point of the plot.
# Set up the PCA scores for plotting
norm_pca_scores <- data.frame(norm_pca,
geo_accession = rownames(norm_pca),
total_umi = bladder_sce$sum,
cell_type = bladder_sce$cell_ontology_class)
log_pca_scores <- data.frame(log_pca,
geo_accession = rownames(log_pca),
total_umi = bladder_sce$sum,
cell_type = bladder_sce$cell_ontology_class)
First, we will plot the unnormalized PCA scores with their total UMI counts:
# Now plot counts pca
ggplot(log_pca_scores, aes(x = PC1, y = PC2, color = total_umi)) +
geom_point() +
labs(title = "Log counts (unnormalized) PCA scores",
color = "Total UMI count") +
scale_color_viridis_c() +
theme_bw()
We’ve plotted the unnormalized data for you. Knowing that we want the same graph, but different data, use the above template to plot the normalized data. Feel free to customize the plot with a different theme or color scheme!
Let’s plot the norm_pca_scores
data:
ggplot(norm_pca_scores, aes(x = PC1, y = PC2, color = total_umi)) +
geom_point() +
labs(title = "Normalized log counts PCA scores",
color = "Total UMI count") +
scale_color_viridis_c() +
theme_bw()
Do you see an effect from the normalization when comparing these plots?
Now, let’s plot these two sets of PCA scores again, but colored by cell type. Do you see an effect from the normalization when comparing these plots?
# First, plot the normalized pca
ggplot(norm_pca_scores, aes(x = PC1, y = PC2, color = cell_type)) +
geom_point() +
labs(title = "Normalized log counts PCA scores",
color = "Cell Type") +
scale_color_brewer(palette = "Dark2", na.value = "grey70") + # add a visually distinct color palette
theme_bw()
# Next, plot log count pca
ggplot(log_pca_scores, aes(x = PC1, y = PC2, color = cell_type)) +
geom_point() +
labs(title = "Log counts (unnormalized) PCA scores",
color = "Cell Type") +
scale_color_brewer(palette = "Dark2", na.value = "grey70") + # add a visually distinct color palette
theme_bw()
In case we wanted to return to this data later, let’s save the
normalized data to a tsv file. In order to do this we need to extract
our normalized counts from bladder_sce
. Refer back to the
SingleCellExperiment
figure above to determine why we are
using this logcounts()
function.
Recall that readr::write_tsv
requires a data frame so we
need to convert the logcounts
matrix to a data frame. We
will actually have to do this in two steps: first by making the sparse
matrix to a standard R matrix, then converting that to a data frame.
# Save this gene matrix to a tsv file
logcounts(bladder_sce) |>
as.matrix() |>
as.data.frame() |>
readr::write_tsv(file.path(norm_dir, "scran_norm_gene_matrix.tsv"))
We may want to return to our normalized bladder_sce
object in the future, so we will also save our data in an RDS file so
that we can re-load it into our R environment as a
SingleCellExperiment
object.
# Save the data as an RDS
readr::write_rds(bladder_sce, file.path(norm_dir, "normalized_bladder_sce.rds"))
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] scran_1.32.0 scater_1.32.0
[3] scuttle_1.14.0 SingleCellExperiment_1.26.0
[5] SummarizedExperiment_1.34.0 Biobase_2.64.0
[7] GenomicRanges_1.56.0 GenomeInfoDb_1.40.0
[9] IRanges_2.38.0 S4Vectors_0.42.0
[11] BiocGenerics_0.50.0 MatrixGenerics_1.16.0
[13] matrixStats_1.3.0 ggplot2_3.5.1
[15] optparse_1.7.5
loaded via a namespace (and not attached):
[1] gridExtra_2.3 rlang_1.1.3
[3] magrittr_2.0.3 compiler_4.4.1
[5] DelayedMatrixStats_1.26.0 vctrs_0.6.5
[7] stringr_1.5.1 pkgconfig_2.0.3
[9] crayon_1.5.2 fastmap_1.1.1
[11] XVector_0.44.0 labeling_0.4.3
[13] utf8_1.2.4 rmarkdown_2.26
[15] tzdb_0.4.0 UCSC.utils_1.0.0
[17] ggbeeswarm_0.7.2 bit_4.0.5
[19] xfun_0.43 bluster_1.14.0
[21] zlibbioc_1.50.0 cachem_1.0.8
[23] beachmat_2.20.0 jsonlite_1.8.8
[25] highr_0.10 DelayedArray_0.30.0
[27] BiocParallel_1.38.0 irlba_2.3.5.1
[29] parallel_4.4.1 cluster_2.1.6
[31] R6_2.5.1 RColorBrewer_1.1-3
[33] bslib_0.7.0 stringi_1.8.3
[35] limma_3.60.0 jquerylib_0.1.4
[37] Rcpp_1.0.12 knitr_1.46
[39] readr_2.1.5 Matrix_1.7-0
[41] igraph_2.0.3 tidyselect_1.2.1
[43] abind_1.4-5 yaml_2.3.8
[45] viridis_0.6.5 codetools_0.2-20
[47] lattice_0.22-6 tibble_3.2.1
[49] withr_3.0.0 evaluate_0.23
[51] getopt_1.20.4 pillar_1.9.0
[53] generics_0.1.3 vroom_1.6.5
[55] hms_1.1.3 sparseMatrixStats_1.16.0
[57] munsell_0.5.1 scales_1.3.0
[59] glue_1.7.0 metapod_1.12.0
[61] tools_4.4.1 BiocNeighbors_1.22.0
[63] ScaledMatrix_1.12.0 locfit_1.5-9.9
[65] fs_1.6.4 grid_4.4.1
[67] edgeR_4.2.0 colorspace_2.1-0
[69] GenomeInfoDbData_1.2.12 beeswarm_0.4.0
[71] BiocSingular_1.20.0 vipor_0.4.7
[73] cli_3.6.2 rsvd_1.0.5
[75] fansi_1.0.6 S4Arrays_1.4.0
[77] viridisLite_0.4.2 dplyr_1.1.4
[79] gtable_0.3.5 sass_0.4.9
[81] digest_0.6.35 SparseArray_1.4.0
[83] ggrepel_0.9.5 dqrng_0.3.2
[85] farver_2.1.1 htmltools_0.5.8.1
[87] lifecycle_1.0.4 httr_1.4.7
[89] statmod_1.5.0 bit64_4.0.5