This notebook will demonstrate how to:
fastMNN
and
harmony
purrr::map()
functions for iterating over
listsIn this notebook, we’ll perform integration on scRNA-seq datasets
from the Single-cell
Pediatric Cancer Atlas (ScPCA
), a database of
uniformly-processed pediatric scRNA-seq data built and maintained by the
Data Lab. The ScPCA
database currently hosts single-cell
pediatric cancer transcriptomic data generated by ALSF-funded labs, with
the goal of making this data easily accessible to investigators (like
you!). The expression data in ScPCA
were mapping and
quantified with alevin-fry
,
followed by processing with Bioconductor tools using the same general
procedures that we have covered in this workshop. The processing
pipeline used emptyDropsCellRanger()
and miQC
to filter the raw counts matrix, scuttle
to log-normalize
the counts, and scater
for dimension reduction. The
processed data are stored as .rds
files containing
SingleCellExperiment
objects. You can read more about how
data in the ScPCA
is processed in the associated
documentation.
To learn about integration, we’ll have a look at four samples from
the SCPCP000005
project (Patel et
al. 2022), an investigation of pediatric solid tumors led by
the Dyer and
Chen
labs at St. Jude Children’s Research Hospital. The particular libraries
we’ll integrate come from two rhabdomyosarcoma (RMS) patients, with two
samples from each of two patients, all sequenced with 10x Chromium v3
technology. Each library is from a separate biological sample.
We’ll be integrating these samples with two different tools, fastMNN
(Haghverdi et al.
2018) and harmony
(Korsunsky et
al. 2019). Integration corrects for batch effects that arise
from different library preparations, genetic backgrounds, and other
sample-specific factors, so that datasets can be jointly analyzed at the
cell level. fastMNN
corrects for batch effects using a
faster variant of the mutual-nearest neighbors algorithm, the technical
details of which you can learn more about from this vignette
by Lun (2019). harmony
, on the other hand, corrects for
batch effects using an iterative clustering approach, and unlike
fastMNN
, it is also able to consider additional covariates
beyond just the batch groupings.
Regardless of which integration tool is used, the
SingleCellExperiment
(SCE) objects first need to be
reformatted and merged into a single (uncorrected!) SCE object that
contains all cells from all samples. This merged SCE can then be used
for integration to obtain a formally batch-corrected SCE object.
# Load libraries
library(ggplot2) # plotting tools
library(SingleCellExperiment) # work with SCE objects
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats
Attaching package: 'MatrixGenerics'
The following objects are masked from 'package:matrixStats':
colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
colWeightedMeans, colWeightedMedians, colWeightedSds,
colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
rowWeightedSds, rowWeightedVars
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':
findMatches
The following objects are masked from 'package:base':
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':
rowMedians
The following objects are masked from 'package:matrixStats':
anyMissing, rowMedians
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
# Set the seed for reproducibility
set.seed(12345)
We have already prepared count data for the four samples we’ll be
integrating (i.e., filtered cells, normalized counts, and calculated PCA
& UMAP). These SCE objects, stored as RDS files, are available in
the data/rms/processed/
directory and are named according
to their ScPCA
library ids :
SCPCL000479.rds
(Patient A)SCPCL000480.rds
(Patient A)SCPCL000481.rds
(Patient B)SCPCL000482.rds
(Patient B)To begin, let’s set up our directories and files:
# Define directory where processed SCE objects to be integrated are stored
input_dir <- file.path("data", "rms", "processed")
# Define directory to save integrated SCE object to
output_dir <- file.path("data", "rms", "integrated")
# Create output directory if it doesn't exist
fs::dir_create(output_dir)
# Define output file name for the integrated object
integrated_sce_file <- file.path(output_dir, "rms_integrated_subset.rds")
We can use the dir()
function to list all contents of a
given directory, for example to see all the files in our
input_dir
:
dir(input_dir)
[1] "SCPCL000478.rds" "SCPCL000479.rds" "SCPCL000480.rds" "SCPCL000481.rds"
[5] "SCPCL000482.rds" "SCPCL000483.rds" "SCPCL000484.rds" "SCPCL000488.rds"
[9] "SCPCL000491.rds" "SCPCL000492.rds" "SCPCL000495.rds" "SCPCL000498.rds"
[13] "SCPCL000502.rds" "SCPCL000510.rds" "SCPCL000516.rds"
SCPCL000478.rds
SCPCL000479.rds
SCPCL000480.rds
SCPCL000481.rds
SCPCL000482.rds
SCPCL000483.rds
SCPCL000484.rds
SCPCL000488.rds
SCPCL000491.rds
SCPCL000492.rds
SCPCL000495.rds
SCPCL000498.rds
SCPCL000502.rds
SCPCL000510.rds
SCPCL000516.rds
We want to read in just four of these files, as listed previously. To
read in these files, we could use the readr::read_rds()
function (or the base R readRDS()
) four times, once for
each of the files. We could also use a for
loop, which is
the approach that many programming languages would lean toward. A
different and more modular coding approach to reading in these files
(and more!) is to leverage the purrr
tidyverse
package, which provides a convenient set of
functions for operating on lists. You can read more about the
purrr
functions and their power and utility in R in the “Functionals”
chapter of the Advanced R e-book.
Of particular interest is the purrr::map()
family of functions, which can be used to run a given function on each
element of a list (or vector) in one call. The general syntax for
purrr::map()
and friends is:
# Syntax for using the map function:
purrr::map(<input list or vector>,
<function to apply to each item in the input>,
<any additional arguments to the function can go here>,
<and also here if there are even more arguments, and so on>)
The output from running purrr::map()
is always a list
(but note that there are other purrr::map()
relatives which
return other object types, as you can read about in the
purrr::map()
documentation). If this concept sounds a
little familiar to you, that’s because it probably is! Base R’s
lapply()
function can provide similar utility, and the
purrr::map()
family of functions can (in part) be thought
of as an alternative to some of the base R apply
functions,
with more consistent behavior.
Let’s see a very simple example of purrr::map()
in
action, inspired by cancer groups the Data Lab has analyzed through the
OpenPBTA
project:
# Define a list of cancer histologies
histologies <- list(
"low-grade gliomas" = c("SEGA", "PA", "GNG", "PXA"),
"high-grade gliomas" = c("DMG", "DIPG"),
"embryonal tumors" = c("MB", "ATRT", "ETMR")
)
# The overall length of the list is 3
length(histologies)
[1] 3
# How can we run `length()` on each item of the list?
# We can use our new friend purrr::map():
purrr::map(histologies, length)
$`low-grade gliomas`
[1] 4
$`high-grade gliomas`
[1] 2
$`embryonal tumors`
[1] 3
One other new coding strategy we’ll learn in this notebook is using
the glue
package
to combine strings. This package offers a convenient function
glue::glue()
that can be used instead of the base R
paste()
function.
# Define a variable for example:
org_name <- "Data Lab"
# We can use paste to combine strings and variables:
paste("Welcome to the", org_name, "workshop on Advanced scRNA-seq!")
[1] "Welcome to the Data Lab workshop on Advanced scRNA-seq!"
Welcome to the Data Lab workshop on Advanced scRNA-seq!
We can use glue::glue()
to accomplish the same goal with
some different syntax:
# glue::glue takes a single string argument (only one set of quotes!), and
# variables can easily be included inside {curly braces}
glue::glue("Welcome to the {org_name} workshop on Advanced scRNA-seq!")
Welcome to the Data Lab workshop on Advanced scRNA-seq!
Welcome to the Data Lab workshop on Advanced scRNA-seq!
(Note that even though the glue::glue()
output isn’t in
quotes, it still behaves like a string!)
Alright, time for the good stuff! Let’s use purrr::map()
to read in our SCE objects so that they are immediately stored together
in a list.
We’ll first need to define a vector of the file paths to read in. We’ll start by creating a vector of sample names themselves and then formatting them into the correct paths. This way (foreshadowing!) we also have a stand-alone vector of just sample names, which will come in handy!
# Vector of all the samples to read in:
sample_names <- c("SCPCL000479",
"SCPCL000480",
"SCPCL000481",
"SCPCL000482")
# Now, convert these to file paths: <input_dir>/<sample_name>.rds
sce_paths <- file.path(input_dir,
glue::glue("{sample_names}.rds")
)
# Print the sce_paths vector
sce_paths
[1] "data/rms/processed/SCPCL000479.rds" "data/rms/processed/SCPCL000480.rds"
[3] "data/rms/processed/SCPCL000481.rds" "data/rms/processed/SCPCL000482.rds"
data/rms/processed/SCPCL000479.rds
data/rms/processed/SCPCL000480.rds
data/rms/processed/SCPCL000481.rds
data/rms/processed/SCPCL000482.rds
We can now read these files in and create a list of four SCE objects.
Since readr::read_rds()
can only operate on one input at a
time, we’ll need to use purrr::map()
to run it on all input
file paths in one command. Although sce_paths
is a vector
(not a list), it will still work as input to purrr:map()
.
The output from this code will still be a list, since that’s what
purrr::map()
always returns.
# Use purrr::map() to read all files into a list at once
sce_list <- purrr::map(
sce_paths,
readr::read_rds
)
Let’s have a look at our list of SCE objects:
# Print sce_list
sce_list
[[1]]
class: SingleCellExperiment
dim: 60319 1918
metadata(14): salmon_version reference_index ... filtering_method
miQC_model
assays(2): counts logcounts
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(3): gene_symbol mean detected
colnames(1918): GGGACCTCAAGCGGAT CACAGATAGTGAGTGC ... GTTGTCCCACGTACAT
TCCGATCGTCGTGCCA
colData names(12): sum detected ... celltype_fine celltype_broad
reducedDimNames(2): PCA UMAP
mainExpName: NULL
altExpNames(0):
[[2]]
class: SingleCellExperiment
dim: 60319 4428
metadata(14): salmon_version reference_index ... filtering_method
miQC_model
assays(2): counts logcounts
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(3): gene_symbol mean detected
colnames(4428): TAGGGTTAGCAAGCCA AACTTCTTCCCTCAAC ... AGGGAGTAGCCTCATA
TCGGATACATTGCAAC
colData names(12): sum detected ... celltype_fine celltype_broad
reducedDimNames(2): PCA UMAP
mainExpName: NULL
altExpNames(0):
[[3]]
class: SingleCellExperiment
dim: 60319 5236
metadata(14): salmon_version reference_index ... filtering_method
miQC_model
assays(2): counts logcounts
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(3): gene_symbol mean detected
colnames(5236): TCATGAGAGGCCCACT GGGTATTTCGTTGTGA ... AAAGAACCACTTCAAG
CAGCAGCTCGTGCATA
colData names(12): sum detected ... celltype_fine celltype_broad
reducedDimNames(2): PCA UMAP
mainExpName: NULL
altExpNames(0):
[[4]]
class: SingleCellExperiment
dim: 60319 4372
metadata(14): salmon_version reference_index ... filtering_method
miQC_model
assays(2): counts logcounts
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(3): gene_symbol mean detected
colnames(4372): GAGAAATCAGATTAAG CAACCTCTCCGATCGG ... TGATTTCCACAAGTTC
ACCAAACGTTCTCAGA
colData names(12): sum detected ... celltype_fine celltype_broad
reducedDimNames(2): PCA UMAP
mainExpName: NULL
altExpNames(0):
We now have a list of length four, where each item is a processed SCE object! However, we’ll need to keep track of which sample each item is, so it’s helpful to add names to this list representing the relevant sample names.
# Assign the sample names as the names for sce_list
names(sce_list) <- sample_names
# Print the list to see it with names
sce_list
$SCPCL000479
class: SingleCellExperiment
dim: 60319 1918
metadata(14): salmon_version reference_index ... filtering_method
miQC_model
assays(2): counts logcounts
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(3): gene_symbol mean detected
colnames(1918): GGGACCTCAAGCGGAT CACAGATAGTGAGTGC ... GTTGTCCCACGTACAT
TCCGATCGTCGTGCCA
colData names(12): sum detected ... celltype_fine celltype_broad
reducedDimNames(2): PCA UMAP
mainExpName: NULL
altExpNames(0):
$SCPCL000480
class: SingleCellExperiment
dim: 60319 4428
metadata(14): salmon_version reference_index ... filtering_method
miQC_model
assays(2): counts logcounts
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(3): gene_symbol mean detected
colnames(4428): TAGGGTTAGCAAGCCA AACTTCTTCCCTCAAC ... AGGGAGTAGCCTCATA
TCGGATACATTGCAAC
colData names(12): sum detected ... celltype_fine celltype_broad
reducedDimNames(2): PCA UMAP
mainExpName: NULL
altExpNames(0):
$SCPCL000481
class: SingleCellExperiment
dim: 60319 5236
metadata(14): salmon_version reference_index ... filtering_method
miQC_model
assays(2): counts logcounts
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(3): gene_symbol mean detected
colnames(5236): TCATGAGAGGCCCACT GGGTATTTCGTTGTGA ... AAAGAACCACTTCAAG
CAGCAGCTCGTGCATA
colData names(12): sum detected ... celltype_fine celltype_broad
reducedDimNames(2): PCA UMAP
mainExpName: NULL
altExpNames(0):
$SCPCL000482
class: SingleCellExperiment
dim: 60319 4372
metadata(14): salmon_version reference_index ... filtering_method
miQC_model
assays(2): counts logcounts
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(3): gene_symbol mean detected
colnames(4372): GAGAAATCAGATTAAG CAACCTCTCCGATCGG ... TGATTTCCACAAGTTC
ACCAAACGTTCTCAGA
colData names(12): sum detected ... celltype_fine celltype_broad
reducedDimNames(2): PCA UMAP
mainExpName: NULL
altExpNames(0):
If you look closely at the printed SCE objects, you may notice that
they all contain colData
table columns
celltype_fine
and celltype_broad
. These
columns (which we added to SCE objects during pre-processing)
contain putative cell type annotations as assigned in Patel et
al. (2022). We will end up leveraging these cell type
annotations to explore how successful our integration is; after
integration, we expect cell types from different samples to group
together, rather than being separated by batches.
That said, the integration methods we will be applying do not actually use these cell type annotations. If we have annotations, they are a helpful “bonus” for assessing the integration’s success, but they are not part of the integration itself.
Now that we have a list of processed SCE objects, we need to merge the objects into one overall SCE object for input to integration. A word of caution before we begin: This merged SCE object is NOT an integrated SCE! Merging SCEs does not perform any batch correction, but just reorganizes the data to allow us to proceed to integration next.
To merge SCE objects, we do need to do some wrangling and bookkeeping to ensure compatibility and that we don’t lose important information. Overall we’ll want to take care of these items:
rowData
. Which sample is a given
feature statistic from?colData
for
each SCE object should have the same column names.We’ll begin by taking some time to thoroughly explore our SCE objects and figure out what wrangling steps we need to take for these specific data. Don’t skip this exploration! Bear in mind that the exact wrangling shown here will not be the same for other SCE objects you work with, but the same general principles apply.
How will we be able to tell which sample a given cell came from?
The best way to do this is simply to add a colData
column with the sample information, so that we can know which sample
each row came from.
In addition, we want to pay some attention to the SCE object’s column names (the cell ids), which must remain unique after merging since duplicate ids will cause an R error. In this case, the SCE column names are barcodes (which is usually but not always the case in SCE objects), which are only guaranteed to be unique within a sample but may be repeated across samples. So, after merging, it’s technically possible that multiple cells will have the same barcode. This would be a problem for two reasons: First, the cell id would not be able to point us back to cell’s originating sample. Second, it would literally cause an error in R, which does not allow duplicate column names.
One way to ensure that cell ids remain unique even after merging is
to actually modify them by prepending the relevant sample name.
For example, consider these barcodes for the SCPCL000479
sample:
# Look at the column names for the `SCPCL000479` sample, for example
colnames(sce_list$SCPCL000479) |>
# Only print out the first 6 for convenience
head()
[1] "GGGACCTCAAGCGGAT" "CACAGATAGTGAGTGC" "TGTGGCGGTGAATTGA" "GCCGATGGTACATACC"
[5] "ATTATCCCAGTTGGTT" "TCCGAAATCACACCGG"
GGGACCTCAAGCGGAT
CACAGATAGTGAGTGC
TGTGGCGGTGAATTGA
GCCGATGGTACATACC
ATTATCCCAGTTGGTT
TCCGAAATCACACCGG
These ids will be updated to
SCPCL000479-GGGACCTCAAGCGGAT
,
SCPCL000479-CACAGATAGTGAGTGC
, and so on, thereby ensuring
fully unique ids for all cells across all samples.
The rowData
table in SCE objects will often contain both
“general” and “library-specific” information, for example:
rowData(sce_list$SCPCL000479) |>
head()
DataFrame with 6 rows and 3 columns
gene_symbol mean detected
<character> <numeric> <numeric>
ENSG00000000003 TSPAN6 0.01772018 1.639778
ENSG00000000005 TNMD 0.00264480 0.158688
ENSG00000000419 DPM1 0.07299656 6.109495
ENSG00000000457 SCYL3 0.02300979 1.983602
ENSG00000000460 C1orf112 0.08119545 6.426871
ENSG00000000938 FGR 0.00317376 0.238032
Here, the rownames are Ensembl gene ids, and columns are
gene_symbol
, mean
, and detected
.
The gene_symbol
column is general information about all
genes, not specific to any library or experiment, but mean
and detected
are library-specific gene statistics. So,
gene_symbol
does not need to be traced back to its
originating sample, but mean
and detected
do.
To this end, we can take a similar approach to what we’ll do for cell
ids: We can change the sample-specific rowData
column names
by prepending the sample name. For example, rather than being called
mean
, this column will be named
SCPCL000479-mean
for the SCPCL000479
sample.
All our SCE objects have the same rowData
columns (as we
can see in the next chunk), so we’ll perform this renaming across all
SCEs.
# Use `purrr::map()` to quickly extract rowData column names for all SCEs
purrr::map(sce_list,
\(x) colnames(rowData(x)))
$SCPCL000479
[1] "gene_symbol" "mean" "detected"
$SCPCL000480
[1] "gene_symbol" "mean" "detected"
$SCPCL000481
[1] "gene_symbol" "mean" "detected"
$SCPCL000482
[1] "gene_symbol" "mean" "detected"
gene_symbol
mean
detected
gene_symbol
mean
detected
gene_symbol
mean
detected
gene_symbol
mean
detected
colData
Finally, we’ll need to have the same column names across all SCE
colData
tables, so let’s look at all those column names. We
can use similar syntax here to what we used to look at all the
rowData
column names.
purrr::map(sce_list,
\(x) colnames(colData(x)) )
$SCPCL000479
[1] "sum" "detected" "subsets_mito_sum"
[4] "subsets_mito_detected" "subsets_mito_percent" "total"
[7] "prob_compromised" "miQC_pass" "sizeFactor"
[10] "barcode" "celltype_fine" "celltype_broad"
$SCPCL000480
[1] "sum" "detected" "subsets_mito_sum"
[4] "subsets_mito_detected" "subsets_mito_percent" "total"
[7] "prob_compromised" "miQC_pass" "sizeFactor"
[10] "barcode" "celltype_fine" "celltype_broad"
$SCPCL000481
[1] "sum" "detected" "subsets_mito_sum"
[4] "subsets_mito_detected" "subsets_mito_percent" "total"
[7] "prob_compromised" "miQC_pass" "sizeFactor"
[10] "barcode" "celltype_fine" "celltype_broad"
$SCPCL000482
[1] "sum" "detected" "subsets_mito_sum"
[4] "subsets_mito_detected" "subsets_mito_percent" "total"
[7] "prob_compromised" "miQC_pass" "sizeFactor"
[10] "barcode" "celltype_fine" "celltype_broad"
sum
detected
subsets_mito_sum
subsets_mito_detected
subsets_mito_percent
total
prob_compromised
miQC_pass
sizeFactor
barcode
celltype_fine
celltype_broad
sum
detected
subsets_mito_sum
subsets_mito_detected
subsets_mito_percent
total
prob_compromised
miQC_pass
sizeFactor
barcode
celltype_fine
celltype_broad
sum
detected
subsets_mito_sum
subsets_mito_detected
subsets_mito_percent
total
prob_compromised
miQC_pass
sizeFactor
barcode
celltype_fine
celltype_broad
sum
detected
subsets_mito_sum
subsets_mito_detected
subsets_mito_percent
total
prob_compromised
miQC_pass
sizeFactor
barcode
celltype_fine
celltype_broad
It looks like the column names are all already matching among SCEs, so there’s no specific preparation we’ll need to do there.
As you can see, there’s a lot of moving parts to consider! Again, these moving parts may (will!) differ for SCEs that you are working with, so you have to explore your own SCEs in depth to prepare for merging.
Based on our exploration, here is a schematic of how one of the SCE objects will ultimately be modified into the final merged SCE:
We’ll write a custom function (seen in the chunk below)
tailored to our wrangling steps that prepares a single SCE object for
merging. We’ll then use our new purrr::map()
programming
skills to run this function over the sce_list
. This will
give us a new list of formatted SCEs that we can proceed to merge. It’s
important to remember that the format_sce()
function
written below is not a function for general use – it’s been precisely
written to match the processing we need to do for these SCEs,
and different SCEs you work with will require different types of
processing.
format_sce <- function(sce, sample_name) {
# Input arguments:
## sce: An SCE object to format
## sample_name: The SCE object's name
# This function returns a formatted SCE object.
###### Ensure that we can identify the originating sample information ######
# Add a column called `sample` that stores this information
# This will be stored in `colData`
sce$sample <- sample_name
###### Ensure cell ids will be unique ######
# Update the SCE object column names (cell ids) by prepending `sample_name`
colnames(sce) <- glue::glue("{sample_name}-{colnames(sce)}")
###### Ensure gene-level statistics can be identified in `rowData` ######
# We want to rename the columns `mean` and `detected` to contain the `sample_name`
# Recall the names are: "gene_symbol", "mean", "detected"
colnames(rowData(sce)) <- c("gene_symbol",
glue::glue("{sample_name}-mean"),
glue::glue("{sample_name}-detected"))
# Return the formatted SCE object
return(sce)
}
To run this function, we’ll use the purrr::map2()
function, a relative of purrr::map()
that allows you to
loop over two input lists/vectors. In our case, we want to run
format_sce()
over paired sce_list
items and
sce_list
names.
# We can use `purrr::map2()` to loop over two list/vector arguments
sce_list_formatted <- purrr::map2(
# Each "iteration" will march down the first two
# arguments `sce_list` and `names(sce_list)` in order
sce_list,
names(sce_list),
# Name of the function to run
format_sce
)
# Print resulting list
sce_list_formatted
$SCPCL000479
class: SingleCellExperiment
dim: 60319 1918
metadata(14): salmon_version reference_index ... filtering_method
miQC_model
assays(2): counts logcounts
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(3): gene_symbol SCPCL000479-mean SCPCL000479-detected
colnames(1918): SCPCL000479-GGGACCTCAAGCGGAT
SCPCL000479-CACAGATAGTGAGTGC ... SCPCL000479-GTTGTCCCACGTACAT
SCPCL000479-TCCGATCGTCGTGCCA
colData names(13): sum detected ... celltype_broad sample
reducedDimNames(2): PCA UMAP
mainExpName: NULL
altExpNames(0):
$SCPCL000480
class: SingleCellExperiment
dim: 60319 4428
metadata(14): salmon_version reference_index ... filtering_method
miQC_model
assays(2): counts logcounts
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(3): gene_symbol SCPCL000480-mean SCPCL000480-detected
colnames(4428): SCPCL000480-TAGGGTTAGCAAGCCA
SCPCL000480-AACTTCTTCCCTCAAC ... SCPCL000480-AGGGAGTAGCCTCATA
SCPCL000480-TCGGATACATTGCAAC
colData names(13): sum detected ... celltype_broad sample
reducedDimNames(2): PCA UMAP
mainExpName: NULL
altExpNames(0):
$SCPCL000481
class: SingleCellExperiment
dim: 60319 5236
metadata(14): salmon_version reference_index ... filtering_method
miQC_model
assays(2): counts logcounts
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(3): gene_symbol SCPCL000481-mean SCPCL000481-detected
colnames(5236): SCPCL000481-TCATGAGAGGCCCACT
SCPCL000481-GGGTATTTCGTTGTGA ... SCPCL000481-AAAGAACCACTTCAAG
SCPCL000481-CAGCAGCTCGTGCATA
colData names(13): sum detected ... celltype_broad sample
reducedDimNames(2): PCA UMAP
mainExpName: NULL
altExpNames(0):
$SCPCL000482
class: SingleCellExperiment
dim: 60319 4372
metadata(14): salmon_version reference_index ... filtering_method
miQC_model
assays(2): counts logcounts
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(3): gene_symbol SCPCL000482-mean SCPCL000482-detected
colnames(4372): SCPCL000482-GAGAAATCAGATTAAG
SCPCL000482-CAACCTCTCCGATCGG ... SCPCL000482-TGATTTCCACAAGTTC
SCPCL000482-ACCAAACGTTCTCAGA
colData names(13): sum detected ... celltype_broad sample
reducedDimNames(2): PCA UMAP
mainExpName: NULL
altExpNames(0):
(Psst, like purrr
and want to dive deeper? Check out the
purrr::imap()
function!)
At long last, we are ready to merge the SCEs, which we’ll do using
the R function cbind()
. The cbind()
function
is often used to combine data frames or matrices by column, i.e. “stack”
them next to each other. The same principle applies here, but when run
on SCE objects, cbind()
will create a new SCE object by
combining counts
and logcounts
matrices by
column. Following that structure, other SCE slots (colData
,
rowData
, reduced dimensions, and other metadata) are
combined appropriately.
Since we need to apply cbind()
to a list of
objects, we need to use some slightly-gnarly syntax: We’ll use the
function do.call()
, which allows the cbind()
input to be a list of objects to combine.
# Merge SCE objects
merged_sce <- do.call(cbind, sce_list_formatted)
# Print the merged_sce object
merged_sce
class: SingleCellExperiment
dim: 60319 15954
metadata(56): salmon_version reference_index ... filtering_method
miQC_model
assays(2): counts logcounts
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(9): gene_symbol SCPCL000479-mean ... SCPCL000482-mean
SCPCL000482-detected
colnames(15954): SCPCL000479-GGGACCTCAAGCGGAT
SCPCL000479-CACAGATAGTGAGTGC ... SCPCL000482-TGATTTCCACAAGTTC
SCPCL000482-ACCAAACGTTCTCAGA
colData names(13): sum detected ... celltype_broad sample
reducedDimNames(2): PCA UMAP
mainExpName: NULL
altExpNames(0):
We now have a single SCE object that contains all cells from all samples we’d like to integrate.
Let’s take a peek at some of the innards of this new SCE object:
# What are the unique values in the `sample` column?
unique( colData(merged_sce)$sample )
[1] "SCPCL000479" "SCPCL000480" "SCPCL000481" "SCPCL000482"
SCPCL000479
SCPCL000480
SCPCL000481
SCPCL000482
# What are the new cell ids (column names)?
head( colnames(merged_sce) )
[1] "SCPCL000479-GGGACCTCAAGCGGAT" "SCPCL000479-CACAGATAGTGAGTGC"
[3] "SCPCL000479-TGTGGCGGTGAATTGA" "SCPCL000479-GCCGATGGTACATACC"
[5] "SCPCL000479-ATTATCCCAGTTGGTT" "SCPCL000479-TCCGAAATCACACCGG"
SCPCL000479-GGGACCTCAAGCGGAT
SCPCL000479-CACAGATAGTGAGTGC
SCPCL000479-TGTGGCGGTGAATTGA
SCPCL000479-GCCGATGGTACATACC
SCPCL000479-ATTATCCCAGTTGGTT
SCPCL000479-TCCGAAATCACACCGG
# What does rowData look like?
head( rowData(merged_sce) )
DataFrame with 6 rows and 9 columns
gene_symbol SCPCL000479-mean SCPCL000479-detected
<character> <numeric> <numeric>
ENSG00000000003 TSPAN6 0.01772018 1.639778
ENSG00000000005 TNMD 0.00264480 0.158688
ENSG00000000419 DPM1 0.07299656 6.109495
ENSG00000000457 SCYL3 0.02300979 1.983602
ENSG00000000460 C1orf112 0.08119545 6.426871
ENSG00000000938 FGR 0.00317376 0.238032
SCPCL000480-mean SCPCL000480-detected SCPCL000481-mean
<numeric> <numeric> <numeric>
ENSG00000000003 0.05209841 4.828312 0.06379838
ENSG00000000005 0.00855151 0.828838 0.00548350
ENSG00000000419 0.08919879 8.301539 0.24011389
ENSG00000000457 0.05262465 5.065123 0.13972372
ENSG00000000460 0.10906460 9.656624 0.26911315
ENSG00000000938 0.00197342 0.197342 0.00400717
SCPCL000481-detected SCPCL000482-mean SCPCL000482-detected
<numeric> <numeric> <numeric>
ENSG00000000003 5.989666 0.0289113 2.659838
ENSG00000000005 0.495624 0.0100776 0.759954
ENSG00000000419 20.056944 0.1391046 11.217578
ENSG00000000457 12.411684 0.0827689 7.054353
ENSG00000000460 21.206369 0.3092681 19.824880
ENSG00000000938 0.358536 0.0173468 1.387742
So far, we’ve created a merged_sce
object which is
(almost!) ready for integration.
The integration methods we’ll be using here actually perform batch
correction on a reduced dimension representation of the normalized gene
expression values, which is more efficient. fastMNN
and
harmony
specifically use PCA for this, but be aware that
different integration methods may use other kinds of reduced
dimensions.
You’ll notice that the merged SCE object object already contains PCA and UMAP reduced dimensions, which were calculated during our pre-processing:
# Print the reducedDimNames of the merged_sce
reducedDimNames(merged_sce)
[1] "PCA" "UMAP"
PCA
UMAP
These represent the original dimension reductions that were performed on each individual SCE before merging, but we actually need to calculate PCA (and UMAP for visualization) from the merged object directly.
Why can’t we use the sample-specific PCA and UMAP matrices? Part of these calculations themselves involves scaling the raw data to center the mean. When samples are separately centered but plotting together, you will see samples “overlapping” in space, but this placement is actually just an artifact of the individual centering. In addition, the mathematical relationship between the original expression data and reduced dimension version of that data will differ across samples, meaning we can’t interpret them all together. To see how this looks, let’s look at the UMAP when calculated from individual samples:
# Plot UMAP calculated from individual samples with separate scaling
scater::plotReducedDim(merged_sce,
dimred = "UMAP",
color_by = "sample",
point_size = 0.5,
point_alpha = 0.2) +
scale_color_brewer(palette = "Dark2", name = "sample") + # Use a CVD-friendly color scheme and specify legend name
guides(color = guide_legend(override.aes = list(size = 3, alpha = 1))) + # Modify the legend key with larger, easier to see points
ggtitle("UMAP calculated on each sample separately")
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
As we see in this UMAP, all samples are centered at zero and all overlapping. This visual artifact can give the incorrect impression that data is integrated - to be clear, this data is NOT integrated!
For input to integration, we’ll want the reduced dimension
calculations to consider normalized gene expression values from all
samples simultaneously. So we’ll need to recalculate PCA (and UMAP for
visualization) on the merged object. We’ll also save these new reduced
dimensions with different names, merged_PCA
and
merged_UMAP
, to distinguish them from already-present
PCA
and UMAP
.
First, as usual, we’ll determine the high-variance genes to use for
PCA from the merged_sce
object. For this, we’ll need to
provide the argument block = merged_sce$sample
when
modeling gene variance, which tells scran::modelGeneVar()
to first model variance separately for each batch and then combine those
modeling statistics.
# Specify the number of genes to identify
num_genes <- 2000
# Calculate variation for each gene
gene_variance <- scran::modelGeneVar(merged_sce,
# specify the grouping column:
block = merged_sce$sample)
# Get the top `num_genes` high-variance genes to use for dimension reduction
hv_genes <- scran::getTopHVGs(gene_variance,
n = num_genes)
To calculate the PCA matrix itself, we’ll use an approach from the
batchelor
package, which is the R package that contains the
fastMNN
method. The batchelor::multiBatchPCA()
function calculates a batch-weighted PCA matrix. This weighting ensures
that all batches, which may have very different numbers of cells,
contribute equally to the overall scaling.
# Use batchelor to calculate PCA for merged_sce, considering only
# the high-variance genes
# We'll need to include the argument `preserve.single = TRUE` to get
# a single matrix with all samples and not separate matrices for each sample
merged_pca <- batchelor::multiBatchPCA(merged_sce,
subset.row = hv_genes,
batch = merged_sce$sample,
preserve.single = TRUE)
Let’s have a look at the output:
# This output is not very interesting!
merged_pca
List of length 1
We can use indexing [[1]]
to see the PCA matrix
calculated, looking at a small subset for convenience:
merged_pca[[1]][1:5,1:5]
[,1] [,2] [,3] [,4] [,5]
SCPCL000479-GGGACCTCAAGCGGAT -8.553693 -7.508773 -8.286024 3.9288287 -3.353172
SCPCL000479-CACAGATAGTGAGTGC -9.741452 -7.712208 -10.310277 4.7186301 -4.271634
SCPCL000479-TGTGGCGGTGAATTGA -9.578361 -6.841931 -6.185011 2.7051343 -3.083831
SCPCL000479-GCCGATGGTACATACC -8.287781 -7.592143 -10.407848 0.2007613 -8.194913
SCPCL000479-ATTATCCCAGTTGGTT -7.035384 -4.938422 -6.042540 0.4756373 -2.014974
We can now include this PCA matrix in our merged_sce
object:
# add PCA results to merged SCE object
reducedDim(merged_sce, "merged_PCA") <- merged_pca[[1]]
Now that we have the PCA matrix, we can proceed to calculate UMAP to visualize the uncorrected merged data.
We’ll calculate UMAP as “usual”, but in this case we’ll specify two additional arguments:
dimred = "merged_PCA"
, which specifies which existing
reduced dimension should be used for the calculation. We want to use the
batch-weighted PCA, which we named above as
"merged_PCA"
.name = "merged_UMAP"
, which names the final UMAP that
this function calculates. This argument will prevent us from overwriting
the existing UMAP which is already named “UMAP” and instead create a
separate "merged_UMAP"
.# add merged_UMAP from merged_PCA
merged_sce <- scater::runUMAP(merged_sce,
dimred = "merged_PCA",
name = "merged_UMAP")
Now, let’s see how this new merged_UMAP
looks compared
to the UMAP
calculated from individual samples:
# UMAPs scaled together when calculated from the merged SCE
scater::plotReducedDim(merged_sce,
dimred = "merged_UMAP",
color_by = "sample",
# Some styling to help us see the points:
point_size = 0.5,
point_alpha = 0.2) +
scale_color_brewer(palette = "Dark2", name = "sample") +
guides(color = guide_legend(override.aes = list(size = 3, alpha = 1))) +
ggtitle("UMAP calculated on merged_sce")
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Samples are now separated, which more reasonably reflects that this data is not yet batch-corrected. We can think of this UMAP as our “before” UMAP, and we can compare this to the “after” UMAP we see post-integration.
Let’s discuss a little first: What visual differences do you think the UMAP on the integrated version of data will have? What similarities do you think the integrated UMAP will have to this plot?
fastMNN
Finally, we’re ready to integrate! To start, we’ll use the
fastMNN
approach from the Bioconductor batchelor
package.
fastMNN
takes as input the merged_sce
object to integrate, and the first step it performs is actually to run
batchelor::multiBatchPCA()
on that SCE. It then uses that
batch-weighted PCA matrix to perform the actual batch correction. The
batch
argument is used to specify the different groupings
within the merged_sce
(i.e. the original sample that each
cell belongs to), and the subset.row
argument can
optionally be used to provide a vector of high-variance genes that
should be considered for this PCA calculation. fastMNN
will
return an SCE object that contains a batch-corrected PCA. Let’s run it
and save the result to a variable called
integrated_sce
.
# integrate with fastMNN, again specifying only our high-variance genes
integrated_sce <- batchelor::fastMNN(
merged_sce,
batch = merged_sce$sample,
subset.row = hv_genes
)
Let’s have a look at the result:
# Print the integrated_sce object
integrated_sce
class: SingleCellExperiment
dim: 2000 15954
metadata(2): merge.info pca.info
assays(1): reconstructed
rownames(2000): ENSG00000278996 ENSG00000157168 ... ENSG00000227232
ENSG00000258984
rowData names(1): rotation
colnames(15954): SCPCL000479-GGGACCTCAAGCGGAT
SCPCL000479-CACAGATAGTGAGTGC ... SCPCL000482-TGATTTCCACAAGTTC
SCPCL000482-ACCAAACGTTCTCAGA
colData names(1): batch
reducedDimNames(1): corrected
mainExpName: NULL
altExpNames(0):
There are couple pieces of information here of interest:
corrected
reduced dimension represents the
batch-corrected PCA that fastMNN
calculated.reconstructed
assay represents the batch-corrected
normalized expression values, which fastMNN
“back-calculated” from the batch-corrected PCA (corrected
).
Generally speaking, these expression values are not stand-alone values
that you should use for other applications like differential gene
expression, as described in Orchestrating
Single Cell Analyses. If the subset.row
argument
is provided (as it was here), only genes present in
subset.row
will be included in these reconstructed
expression values, but this setting can be overridden so that all genes
have reconstructed expression with the argument
correct.all = TRUE
.We’re mostly interested in the PCA that fastMNN
calculated, so let’s save that information (with an informative and
unique name) into our merged_sce
object:
# Make a new reducedDim named fastmnn_PCA from the corrected reducedDim in integrated_sce
reducedDim(merged_sce, "fastmnn_PCA") <- reducedDim(integrated_sce, "corrected")
Finally, we’ll calculate UMAP from these corrected PCA matrix for visualization.
# Calculate UMAP
merged_sce <- scater::runUMAP(
merged_sce,
dimred = "fastmnn_PCA",
name = "fastmnn_UMAP"
)
First, let’s plot the integrated UMAP highlighting the different batches. A well-integrated dataset will show batch mixing, but a poorly-integrated dataset will show more separation among batches, similar to the uncorrected UMAP. Note that this is a more qualitative way to assess the success of integration, but there are formal metrics one can use to assess batch mixing, which you can read more about in this chapter of OSCA.
scater::plotReducedDim(merged_sce,
# plot the fastMNN coordinates
dimred = "fastmnn_UMAP",
# color by sample
color_by = "sample",
# Some styling to help us see the points:
point_size = 0.5,
point_alpha = 0.2) +
scale_color_brewer(palette = "Dark2", name = "sample") +
guides(color = guide_legend(override.aes = list(size = 3, alpha = 1))) +
ggtitle("UMAP after integration with fastMNN")
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
This fastmnn_UMAP
certainly looks different from the one
we made from merged_UMAP
! What different trends do you see?
Do all samples look “equally well” integrated, from a first look?
Importantly, one reason that batches may still appear separated in the corrected UMAP is if they should be separated - for example, maybe two batches contain very different cell types, have very different diagnoses, or may be from different patients.
Recall from earlier that we conveniently have cell type annotations in our SCEs, so we can explore those here! Let’s take a quick detour to see what kinds of cell types are in this data by making a barplot of the cell types across samples:
# Cell types are in the `celltype_broad` and `celltype_fine` columns
merged_sce_df <- as.data.frame(colData(merged_sce))
# Use ggplot2 to make a barplot the cell types across samples
ggplot(merged_sce_df,
aes(x = sample,
fill = celltype_broad)) +
# Barplot of celltype proportions
geom_bar(position = "fill") +
# Use a CVD-friendly color scheme
scale_fill_brewer(palette = "Dark2", na.value = "grey80") +
# nicer theme
theme_bw()
We see that Tumor cell types are by far the most prevalent across all
samples, and normal tissue cell types are not very common. We see also
that SCPCL000481
has a larger Tumor_Myocyte
population, while all other samples have larger
Tumor_Mesoderm
populations. This difference may
explain why we observe that SCPCL000481
is somewhat more
separated from the other samples in the fastMNN
UMAP.
Let’s re-plot this UMAP to highlight cell types:
scater::plotReducedDim(merged_sce,
dimred = "fastmnn_UMAP",
# color by broad celltypes
color_by = "celltype_broad",
point_size = 0.5,
point_alpha = 0.2) +
# include argument to specify color of NA values
scale_color_brewer(palette = "Dark2", name = "Broad celltype", na.value = "grey80") +
guides(color = guide_legend(override.aes = list(size = 3, alpha = 1))) +
ggtitle("UMAP after integration with fastMNN")
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
This UMAP shows that the normal tissue cell types (mostly vascular endothelium, muscle cells, and monocytes) tend to cluster together and are generally separated from the tumor cell types, which is an encouraging pattern! Tumor cell types from different samples are all also clustering together, which is even more encouraging that we had successful integration.
However, it’s a bit challenging to see all the points given the
amount of overlap in the plot. One way we can see all the points a bit
better is to facet the plot by sample, using facet_wrap()
from the ggplot2
package (which we can do because
scater::plotReducedDim()
returns a ggplot2
object):
scater::plotReducedDim(merged_sce,
dimred = "fastmnn_UMAP",
color_by = "celltype_broad",
point_size = 0.5,
point_alpha = 0.2,
# Allow for faceting by a variable using `other_fields`:
other_fields = "sample") +
scale_color_brewer(palette = "Dark2", name = "Broad celltype", na.value = "grey80") +
guides(color = guide_legend(override.aes = list(size = 3, alpha = 1))) +
ggtitle("UMAP after integration with fastMNN") +
# Facet by sample
facet_wrap(vars(sample)) +
# Use a theme with background grid to more easily compare panel coordinates
theme_bw()
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
What trends do you observe between tumor and healthy tissues among these integrated samples?
harmony
fastMNN
is only one of many approaches to perform
integration, and different methods have different capabilities and may
give different results. For example, some methods can accommodate
additional covariates (e.g., technology, patient, diagnosis, etc.) that
can influence integration. In fact the data we are using has a known
patient covariate; SCPCL000479
and
SCPCL000480
are from the first patient, and
SCPCL000481
and SCPCL000482
are from the
second patient.
So, let’s perform integration with a method that can use this
information - harmony
!
To begin setting up for harmony
integration, we need to
add explicit patient information into our merged SCE. We’ll create a new
column patient
whose value is either “A” or “B” depending
on the given sample name, using the dplyr::case_when()
function. We provide this function with a set of logical expressions and
each assigned value is designated by ~
. The expressions are
evaluated in order, stopping at the first one that evaluates as
TRUE
and returning the associated value.
# Create patient column with values "A" or "B" for the two patients
merged_sce$patient <- dplyr::case_when(
merged_sce$sample %in% c("SCPCL000479", "SCPCL000480") ~ "A",
merged_sce$sample %in% c("SCPCL000481", "SCPCL000482") ~ "B",
)
Unlike fastMNN
, harmony
does not calculate
corrected expression values nor does it return an SCE object. Like
fastMNN
, harmony
performs integration on a
merged PCA matrix. However, unlike fastMNN
,
harmony
does not “back-calculate” corrected expression from
the corrected PCA matrix and it only returns the corrected PCA matrix
itself. For input, harmony
needs a couple pieces of
information:
harmony
takes a batch-weighted PCA matrix to
perform integration. We already calculated a batch-weighted PCA matrix
(our merged_PCA
reduced dimension), we’ll provide this as
the the input.harmony
about the covariates to
use - sample
and patient
. To do this, we
provide two arguments:
meta_data
, a data frame that contains covariates across
samples. We can simply specify the SCE colData
here since
it contains sample
and patient
columns.vars_use
, a vector of which column names in
meta_data
should actually be used as covariates. Other
columns in meta_data
which are not in vars_use
are ignored.Let’s go!
# Run harmony integration
harmony_pca <- harmony::RunHarmony(
data_mat = reducedDim(merged_sce, "merged_PCA"),
meta_data = colData(merged_sce),
vars_use = c("sample", "patient")
)
Transposing data matrix
Initializing state using k-means centroids initialization
Harmony 1/10
Harmony 2/10
Harmony converged after 2 iterations
The result is a PCA matrix. Let’s print a subset of this matrix to see it:
# Print the harmony result
harmony_pca[1:5, 1:5]
[,1] [,2] [,3] [,4] [,5]
SCPCL000479-GGGACCTCAAGCGGAT -7.032044 -7.271764 -3.713022 5.477643 0.6781986
SCPCL000479-CACAGATAGTGAGTGC -8.237485 -7.436515 -5.579578 6.101144 -0.1577978
SCPCL000479-TGTGGCGGTGAATTGA -7.893770 -6.595934 -2.042903 4.706149 0.9176450
SCPCL000479-GCCGATGGTACATACC -6.741593 -7.537079 -5.997807 2.609384 -4.1100267
SCPCL000479-ATTATCCCAGTTGGTT -5.534487 -4.887054 -1.897409 2.588823 1.5458433
As we did with fastMNN
results, let’s store this PCA
matrix directly in our merged_sce
object with an
informative name that won’t overwrite any of the existing PCA matrices.
We’ll also calculate UMAP from it.
# Store PCA as `harmony_PCA`
reducedDim(merged_sce, "harmony_PCA") <- harmony_pca
# As before, calculate UMAP on this PCA matrix with appropriate names
merged_sce <- scater::runUMAP(merged_sce,
dimred = "harmony_PCA",
name = "harmony_UMAP")
Let’s see how the harmony
UMAP, colored by sample, looks
compared to the fastMNN
UMAP:
scater::plotReducedDim(merged_sce,
dimred = "harmony_UMAP",
color_by = "sample",
point_size = 0.5,
point_alpha = 0.2) +
scale_color_brewer(palette = "Dark2", name = "sample") +
guides(color = guide_legend(override.aes = list(size = 3, alpha = 1))) +
ggtitle("UMAP after integration with harmony")
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
How do you think this harmony
UMAP compares to that from
fastMNN
integration?
Let’s see how this UMAP looks colored by cell type, and faceted for visibility:
scater::plotReducedDim(merged_sce,
dimred = "harmony_UMAP",
color_by = "celltype_broad",
point_size = 0.5,
point_alpha = 0.2,
# Specify variable for faceting
other_fields = "sample") +
scale_color_brewer(palette = "Dark2", name = "Broad celltype", na.value = "grey80") +
guides(color = guide_legend(override.aes = list(size = 3))) +
ggtitle("UMAP after integration with harmony") +
facet_wrap(vars(sample))
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
What do you now notice in this faceted view that wasn’t clear
previously? Are there other patterns you see that are similar or
different from the fastMNN
UMAP? How do you think
fastMNN
vs. harmony
performed in integrating
these samples?
Finally, we’ll export the final SCE object with both
fastMNN
and harmony
integration to a file.
Since this object is very large (over 1 GB!), we’ll export it to a file
with some compression, which, in this case, will reduce the final size
to a smaller ~360 MB. This will take a couple minutes to save while
compression is performed.
# Export to RDS file with "gz" compression
readr::write_rds(merged_sce,
integrated_sce_file,
compress = "gz")
As always, we’ll print the session info to be transparent about what packages, and which versions, were used during this R session.
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
[3] Biobase_2.64.0 GenomicRanges_1.56.0
[5] GenomeInfoDb_1.40.0 IRanges_2.38.0
[7] S4Vectors_0.42.0 BiocGenerics_0.50.0
[9] MatrixGenerics_1.16.0 matrixStats_1.3.0
[11] ggplot2_3.5.1 optparse_1.7.5
loaded via a namespace (and not attached):
[1] gridExtra_2.3 rlang_1.1.3
[3] magrittr_2.0.3 scater_1.32.0
[5] RcppAnnoy_0.0.22 compiler_4.4.1
[7] flexmix_2.3-19 DelayedMatrixStats_1.26.0
[9] vctrs_0.6.5 stringr_1.5.1
[11] pkgconfig_2.0.3 crayon_1.5.2
[13] fastmap_1.1.1 XVector_0.44.0
[15] scuttle_1.14.0 labeling_0.4.3
[17] utf8_1.2.4 rmarkdown_2.26
[19] tzdb_0.4.0 UCSC.utils_1.0.0
[21] ggbeeswarm_0.7.2 purrr_1.0.2
[23] xfun_0.43 modeltools_0.2-23
[25] bluster_1.14.0 zlibbioc_1.50.0
[27] cachem_1.0.8 beachmat_2.20.0
[29] jsonlite_1.8.8 highr_0.10
[31] DelayedArray_0.30.0 BiocParallel_1.38.0
[33] irlba_2.3.5.1 parallel_4.4.1
[35] cluster_2.1.6 R6_2.5.1
[37] bslib_0.7.0 stringi_1.8.3
[39] RColorBrewer_1.1-3 limma_3.60.0
[41] jquerylib_0.1.4 Rcpp_1.0.12
[43] knitr_1.46 readr_2.1.5
[45] batchelor_1.20.0 Matrix_1.7-0
[47] splines_4.4.1 nnet_7.3-19
[49] igraph_2.0.3 tidyselect_1.2.1
[51] abind_1.4-5 yaml_2.3.8
[53] viridis_0.6.5 codetools_0.2-20
[55] lattice_0.22-6 tibble_3.2.1
[57] withr_3.0.0 evaluate_0.23
[59] getopt_1.20.4 pillar_1.9.0
[61] generics_0.1.3 hms_1.1.3
[63] sparseMatrixStats_1.16.0 munsell_0.5.1
[65] scales_1.3.0 miQC_1.12.0
[67] RhpcBLASctl_0.23-42 glue_1.7.0
[69] metapod_1.12.0 tools_4.4.1
[71] BiocNeighbors_1.22.0 ScaledMatrix_1.12.0
[73] locfit_1.5-9.9 fs_1.6.4
[75] scran_1.32.0 cowplot_1.1.3
[77] grid_4.4.1 edgeR_4.2.0
[79] colorspace_2.1-0 GenomeInfoDbData_1.2.12
[81] beeswarm_0.4.0 BiocSingular_1.20.0
[83] vipor_0.4.7 cli_3.6.2
[85] rsvd_1.0.5 fansi_1.0.6
[87] S4Arrays_1.4.0 viridisLite_0.4.2
[89] dplyr_1.1.4 uwot_0.2.2
[91] ResidualMatrix_1.14.0 gtable_0.3.5
[93] sass_0.4.9 digest_0.6.35
[95] SparseArray_1.4.0 ggrepel_0.9.5
[97] dqrng_0.3.2 farver_2.1.1
[99] htmltools_0.5.8.1 lifecycle_1.0.4
[ reached getOption("max.print") -- omitted 3 entries ]