This notebook will demonstrate how to:
fastMNN and
harmonypurrr::map() functions for iterating over
listsIn this notebook, we’ll perform integration on scRNA-seq datasets
from the Single-cell
Pediatric Cancer Atlas (ScPCA), a database of
uniformly-processed pediatric scRNA-seq data built and maintained by the
Data Lab. The ScPCA database currently hosts single-cell
pediatric cancer transcriptomic data generated by ALSF-funded labs, with
the goal of making this data easily accessible to investigators (like
you!). The expression data in ScPCA were mapping and
quantified with alevin-fry,
followed by processing with Bioconductor tools using the same general
procedures that we have covered in this workshop. The processing
pipeline used emptyDropsCellRanger() and miQC
to filter the raw counts matrix, scuttle to log-normalize
the counts, and scater for dimension reduction. The
processed data are stored as .rds files containing
SingleCellExperiment objects. You can read more about how
data in the ScPCA is processed in the associated
documentation.
To learn about integration, we’ll have a look at four samples from
the SCPCP000005
project (Patel et
al. 2022), an investigation of pediatric solid tumors led by
the Dyer and
Chen
labs at St. Jude Children’s Research Hospital. The particular libraries
we’ll integrate come from two rhabdomyosarcoma (RMS) patients, with two
samples from each of two patients, all sequenced with 10x Chromium v3
technology. Each library is from a separate biological sample.
We’ll be integrating these samples with two different tools, fastMNN
(Haghverdi et al.
2018) and harmony
(Korsunsky et
al. 2019). Integration corrects for batch effects that arise
from different library preparations, genetic backgrounds, and other
sample-specific factors, so that datasets can be jointly analyzed at the
cell level. fastMNN corrects for batch effects using a
faster variant of the mutual-nearest neighbors algorithm, the technical
details of which you can learn more about from this vignette
by Lun (2019). harmony, on the other hand, corrects for
batch effects using an iterative clustering approach, and unlike
fastMNN, it is also able to consider additional covariates
beyond just the batch groupings.
Regardless of which integration tool is used, the
SingleCellExperiment (SCE) objects first need to be
reformatted and merged into a single (uncorrected!) SCE object that
contains all cells from all samples. This merged SCE can then be used
for integration to obtain a formally batch-corrected SCE object.
# Load libraries
library(ggplot2) # plotting tools
library(SingleCellExperiment) # work with SCE objects
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats
Attaching package: 'MatrixGenerics'
The following objects are masked from 'package:matrixStats':
colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
colWeightedMeans, colWeightedMedians, colWeightedSds,
colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
rowWeightedSds, rowWeightedVars
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':
findMatches
The following objects are masked from 'package:base':
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':
rowMedians
The following objects are masked from 'package:matrixStats':
anyMissing, rowMedians
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
# Set the seed for reproducibility
set.seed(12345)
We have already prepared count data for the four samples we’ll be
integrating (i.e., filtered cells, normalized counts, and calculated PCA
& UMAP). These SCE objects, stored as RDS files, are available in
the data/rms/processed/ directory and are named according
to their ScPCA library ids :
SCPCL000479.rds (Patient A)SCPCL000480.rds (Patient A)SCPCL000481.rds (Patient B)SCPCL000482.rds (Patient B)Both Patient A (18 year old male) and Patient B (4 year old female) had recurrent embryonal rhabdomyosarcoma when samples were taken.
To begin, let’s set up our directories and files:
# Define directory where processed SCE objects to be integrated are stored
input_dir <- file.path("data", "rms", "processed")
# Define directory to save integrated SCE object to
output_dir <- file.path("data", "rms", "integrated")
# Create output directory if it doesn't exist
fs::dir_create(output_dir)
# Define output file name for the integrated object
integrated_sce_file <- file.path(output_dir, "rms_integrated_subset.rds")
We can use the dir() function to list all contents of a
given directory, for example to see all the files in our
input_dir:
dir(input_dir)
[1] "SCPCL000478.rds" "SCPCL000479.rds" "SCPCL000480.rds" "SCPCL000481.rds"
[5] "SCPCL000482.rds" "SCPCL000483.rds" "SCPCL000484.rds" "SCPCL000488.rds"
[9] "SCPCL000491.rds" "SCPCL000492.rds" "SCPCL000495.rds" "SCPCL000498.rds"
[13] "SCPCL000502.rds" "SCPCL000510.rds" "SCPCL000516.rds"
SCPCL000478.rds
SCPCL000479.rds
SCPCL000480.rds
SCPCL000481.rds
SCPCL000482.rds
SCPCL000483.rds
SCPCL000484.rds
SCPCL000488.rds
SCPCL000491.rds
SCPCL000492.rds
SCPCL000495.rds
SCPCL000498.rds
SCPCL000502.rds
SCPCL000510.rds
SCPCL000516.rds
We want to read in just four of these files, as listed previously. To
read in these files, we could use the readr::read_rds()
function (or the base R readRDS()) four times, once for
each of the files. We could also use a for loop, which is
the approach that many programming languages would lean toward. A
different and more modular coding approach to reading in these files
(and more!) is to leverage the purrr
tidyverse package, which provides a convenient set of
functions for operating on lists. You can read more about the
purrr functions and their power and utility in R in the “Functionals”
chapter of the Advanced R e-book.
Of particular interest is the purrr::map()
family of functions, which can be used to run a given function on each
element of a list (or vector) in one call. The general syntax for
purrr::map() and friends is:
# Syntax for using the map function:
purrr::map(<input list or vector>,
<function to apply to each item in the input>,
<any additional arguments to the function can go here>,
<and also here if there are even more arguments, and so on>)
The output from running purrr::map() is always a list
(but note that there are other purrr::map() relatives which
return other object types, as you can read about in the
purrr::map() documentation). If this concept sounds a
little familiar to you, that’s because it probably is! Base R’s
lapply() function can provide similar utility, and the
purrr::map() family of functions can (in part) be thought
of as an alternative to some of the base R apply functions,
with more consistent behavior.
Let’s see a very simple example of purrr::map() in
action, inspired by cancer groups the Data Lab has analyzed through the
OpenPBTA
project:
# Define a list of cancer histologies
histologies <- list(
"low-grade gliomas" = c("SEGA", "PA", "GNG", "PXA"),
"high-grade gliomas" = c("DMG", "DIPG"),
"embryonal tumors" = c("MB", "ATRT", "ETMR")
)
# The overall length of the list is 3
length(histologies)
[1] 3
# How can we run `length()` on each item of the list?
# We can use our new friend purrr::map():
purrr::map(histologies, length)
$`low-grade gliomas`
[1] 4
$`high-grade gliomas`
[1] 2
$`embryonal tumors`
[1] 3
Let’s use purrr::map() to read in our SCE objects so
that they are immediately stored together in a list.
We’ll first need to define a vector of the file paths to read in. We’ll start by creating a vector of sample names themselves and then formatting them into the correct paths. This way (foreshadowing!) we also have a stand-alone vector of just sample names, which will come in handy!
# Vector of all the samples to read in:
sample_names <- c("SCPCL000479",
"SCPCL000480",
"SCPCL000481",
"SCPCL000482")
# Now, convert these to file paths: <input_dir>/<sample_name>.rds
sce_paths <- file.path(input_dir,
glue::glue("{sample_names}.rds")
)
# Print the sce_paths vector
sce_paths
[1] "data/rms/processed/SCPCL000479.rds" "data/rms/processed/SCPCL000480.rds"
[3] "data/rms/processed/SCPCL000481.rds" "data/rms/processed/SCPCL000482.rds"
data/rms/processed/SCPCL000479.rds
data/rms/processed/SCPCL000480.rds
data/rms/processed/SCPCL000481.rds
data/rms/processed/SCPCL000482.rds
Let’s make this a named vector using the sample names. This will help us keep track of which objects are which after we read the SCE objects in:
# Assign the sample names as the names for sce_paths
names(sce_paths) <- sample_names
We can now read these files in and create a list of four SCE objects.
Since readr::read_rds() can only operate on one input at a
time, we’ll need to use purrr::map() to run it on all input
file paths in one command. Although sce_paths is a vector
(not a list), it will still work as input to purrr:map().
The output from this code will still be a list, since that’s what
purrr::map() always returns, and it will retain the sample
names as the list names for convenient bookkeeping:
# Use purrr::map() to read all files into a list at once
sce_list <- purrr::map(
sce_paths,
readr::read_rds
)
Let’s have a look at our named list of SCE objects:
# Print sce_list
sce_list
$SCPCL000479
class: SingleCellExperiment
dim: 60319 1918
metadata(14): salmon_version reference_index ... filtering_method
miQC_model
assays(2): counts logcounts
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(3): gene_symbol mean detected
colnames(1918): GGGACCTCAAGCGGAT CACAGATAGTGAGTGC ... GTTGTCCCACGTACAT
TCCGATCGTCGTGCCA
colData names(12): sum detected ... celltype_fine celltype_broad
reducedDimNames(2): PCA UMAP
mainExpName: NULL
altExpNames(0):
$SCPCL000480
class: SingleCellExperiment
dim: 60319 4428
metadata(14): salmon_version reference_index ... filtering_method
miQC_model
assays(2): counts logcounts
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(3): gene_symbol mean detected
colnames(4428): TAGGGTTAGCAAGCCA AACTTCTTCCCTCAAC ... AGGGAGTAGCCTCATA
TCGGATACATTGCAAC
colData names(12): sum detected ... celltype_fine celltype_broad
reducedDimNames(2): PCA UMAP
mainExpName: NULL
altExpNames(0):
$SCPCL000481
class: SingleCellExperiment
dim: 60319 5236
metadata(14): salmon_version reference_index ... filtering_method
miQC_model
assays(2): counts logcounts
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(3): gene_symbol mean detected
colnames(5236): TCATGAGAGGCCCACT GGGTATTTCGTTGTGA ... AAAGAACCACTTCAAG
CAGCAGCTCGTGCATA
colData names(12): sum detected ... celltype_fine celltype_broad
reducedDimNames(2): PCA UMAP
mainExpName: NULL
altExpNames(0):
$SCPCL000482
class: SingleCellExperiment
dim: 60319 4372
metadata(14): salmon_version reference_index ... filtering_method
miQC_model
assays(2): counts logcounts
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(3): gene_symbol mean detected
colnames(4372): GAGAAATCAGATTAAG CAACCTCTCCGATCGG ... TGATTTCCACAAGTTC
ACCAAACGTTCTCAGA
colData names(12): sum detected ... celltype_fine celltype_broad
reducedDimNames(2): PCA UMAP
mainExpName: NULL
altExpNames(0):
If you look closely at the printed SCE objects, you may notice that
they all contain colData table columns
celltype_fine and celltype_broad. These
columns (which we added to SCE objects during pre-processing)
contain putative cell type annotations as assigned by Patel et
al. (2022):
For each cell subset identified by clustering, we used a combination of
SingleRversion 1.0.1 (Aran et al., 2019) and manual inspection of differentially expressed genes to annotate whether a cluster belongs to stromal, immune or malignant subpopulations. Malignant cells were confirmed in patient tumor data by inference of copy-number variation usinginferCNVversion 1.1.3 of the TrinityCTAT Project (https://github.com/broadinstitute/infercnv).
We will end up leveraging these cell type annotations to explore the integration results; after integration, we expect cell types from different samples to group together, rather than being separated by batches.
That said, the integration methods we will be applying do not actually use any existing cell type annotations. If we have annotations, they are a helpful “bonus” for assessing the integration’s performance, but they are not part of the integration itself.
Now that we have a list of processed SCE objects, we need to merge the objects into one overall SCE object for input to integration. A word of caution before we begin: This merged SCE object is NOT an integrated SCE! Merging SCEs does not perform any batch correction, but just reorganizes the data to allow us to proceed to integration next.
To merge SCE objects, we do need to do some wrangling and bookkeeping to ensure compatibility and that we don’t lose important information. Overall, we’ll want to make sure that:
colData slot columns, in the same
orderrowData slot column names and
metadata field names (if you care to keep this information
around - today, we will!)colData
slotWe’ll approach this merge in two parts:
When merging objects on your own, don’t skip these data exploration steps! The steps we take to prepare our SCEs will probably be different from the steps you need to take with other SCEs, and only by carefully exploring the objects can you figure out what steps you’ll need to take to meet all of our conditions.
As part of the custom function we’ll write, we’ll include a step to
create unique cell identifiers by attaching sample names to the SCE
column names (cell barcodes). For example, we would update the column
name for a cell from Sample1 with the barcode
ACGT to Sample1-ACGT.
When merging, there can’t be any duplicate column names (barcodes) across all the objects or R will throw an error. While you’re guaranteed to have unique barcodes in a given SCE object, there is no guarantee that they are unique across multiple samples - it is absolutely possible to have cells from two different samples share the same barcode (and we’ve seen it happen!).
Adding the sample id to the column names (barcodes) is therefore a crucial step in our merging bookkeeping.
First, we’ll compare the object’s genes (aka, their row names). We
can use some purrr magic to help us find the set of shared
genes among all objects:
# Define vector of shared genes
shared_genes <- sce_list |>
# get rownames (genes) for each SCE in sce_list
purrr::map(rownames) |>
# reduce to the _intersection_ among lists
purrr::reduce(intersect)
# How many shared genes are there?
length(shared_genes)
[1] 60319
That’s quite a lot! In fact, because these objects were all uniformly processed by the same workflow (which did not filter out any genes!), we expect them to all have the same genes. We can map over the list to confirm that indeed, they have the same number of rows (genes):
# The number of genes in an SCE corresponds to its number of rows:
sce_list |>
purrr::map(nrow)
$SCPCL000479
[1] 60319
$SCPCL000480
[1] 60319
$SCPCL000481
[1] 60319
$SCPCL000482
[1] 60319
Even though we know the genes already match, we need to also be sure
they are in the same order among all objects. So, we’ll hold
onto that shared_genes variable we defined and use it soon
in our custom formatting function to make sure all objects fully
match.
It’s worth noting that the intersection isn’t the only option here,
though! Using the intersection means a lot of genes will get discarded
if the objects have different genes. We could instead take the
union of genes so nothing gets thrown out. In this case, you’d
need to create “dummy” assay rows for genes that a given SCE doesn’t
have and fill it with NA expression values. You’ll still
have to make sure the SCEs have the same rows in the same order before
merging, so you may need to do a decent bit of matrix wrangling.
colData column namesNext up, we’ll check the colData columns: we need these
to be the same, and in the same order. Let’s print out each object’s
colData column name to see where we stand:
sce_list |>
purrr::map(
\(sce) colnames(colData(sce))
)
$SCPCL000479
[1] "sum" "detected" "subsets_mito_sum"
[4] "subsets_mito_detected" "subsets_mito_percent" "total"
[7] "prob_compromised" "miQC_pass" "sizeFactor"
[10] "barcode" "celltype_fine" "celltype_broad"
$SCPCL000480
[1] "sum" "detected" "subsets_mito_sum"
[4] "subsets_mito_detected" "subsets_mito_percent" "total"
[7] "prob_compromised" "miQC_pass" "sizeFactor"
[10] "barcode" "celltype_fine" "celltype_broad"
$SCPCL000481
[1] "sum" "detected" "subsets_mito_sum"
[4] "subsets_mito_detected" "subsets_mito_percent" "total"
[7] "prob_compromised" "miQC_pass" "sizeFactor"
[10] "barcode" "celltype_fine" "celltype_broad"
$SCPCL000482
[1] "sum" "detected" "subsets_mito_sum"
[4] "subsets_mito_detected" "subsets_mito_percent" "total"
[7] "prob_compromised" "miQC_pass" "sizeFactor"
[10] "barcode" "celltype_fine" "celltype_broad"
sum
detected
subsets_mito_sum
subsets_mito_detected
subsets_mito_percent
total
prob_compromised
miQC_pass
sizeFactor
barcode
celltype_fine
celltype_broad
sum
detected
subsets_mito_sum
subsets_mito_detected
subsets_mito_percent
total
prob_compromised
miQC_pass
sizeFactor
barcode
celltype_fine
celltype_broad
sum
detected
subsets_mito_sum
subsets_mito_detected
subsets_mito_percent
total
prob_compromised
miQC_pass
sizeFactor
barcode
celltype_fine
celltype_broad
sum
detected
subsets_mito_sum
subsets_mito_detected
subsets_mito_percent
total
prob_compromised
miQC_pass
sizeFactor
barcode
celltype_fine
celltype_broad
We see the same columns all around in the same order, which is great!
But what if there were different columns across objects, or they were
differently ordered? In that case, we could find the intersection of
column names like we did above for genes, and use that to re-order and
subset all colData slots in our custom formatting
function.
Next, we’ll make sure that all objects share the same assays:
# print all the assay names
sce_list |>
purrr::map(assayNames)
$SCPCL000479
[1] "counts" "logcounts"
$SCPCL000480
[1] "counts" "logcounts"
$SCPCL000481
[1] "counts" "logcounts"
$SCPCL000482
[1] "counts" "logcounts"
counts
logcounts
counts
logcounts
counts
logcounts
counts
logcounts
Again, all objects are compatible already with both having a
counts and logcounts assay.
In your own data exploration, if you encounter SCEs to merge that
have extraneous assays that you don’t need, you can remove them by
setting them to NULL in your custom formatting function,
e.g. assay(sce, "assay_to_remove") <- NULL.
rowData contentsOne of the other items we said we’d need to think about is the
rowData, which contains gene metadata. This slot is
interesting because some of its columns are specific to the given
sample, while others are general:
sce_list |>
purrr::map(
\(sce) head(rowData(sce), 3) # only print 3 rows for space!
)
$SCPCL000479
DataFrame with 3 rows and 3 columns
gene_symbol mean detected
<character> <numeric> <numeric>
ENSG00000000003 TSPAN6 0.0177202 1.639778
ENSG00000000005 TNMD 0.0026448 0.158688
ENSG00000000419 DPM1 0.0729966 6.109495
$SCPCL000480
DataFrame with 3 rows and 3 columns
gene_symbol mean detected
<character> <numeric> <numeric>
ENSG00000000003 TSPAN6 0.05209841 4.828312
ENSG00000000005 TNMD 0.00855151 0.828838
ENSG00000000419 DPM1 0.08919879 8.301539
$SCPCL000481
DataFrame with 3 rows and 3 columns
gene_symbol mean detected
<character> <numeric> <numeric>
ENSG00000000003 TSPAN6 0.0637984 5.989666
ENSG00000000005 TNMD 0.0054835 0.495624
ENSG00000000419 DPM1 0.2401139 20.056944
$SCPCL000482
DataFrame with 3 rows and 3 columns
gene_symbol mean detected
<character> <numeric> <numeric>
ENSG00000000003 TSPAN6 0.0289113 2.659838
ENSG00000000005 TNMD 0.0100776 0.759954
ENSG00000000419 DPM1 0.1391046 11.217578
The column gene_symbol is not sample-specific - it just
provides the corresponding gene symbol to the Ensembl ids seen here as
row names. The columns mean and detected,
however, are sample-specific - they contain sample-specific statistics
about gene expression.
This means we definitely need to update the column names
mean and detected to include the sample id.
But, we don’t need a separate gene_symbol column for each
sample, so we can leave that one alone as just gene_symbol.
Once we eventually merge, only one gene_symbol column will
be left in the final object since it is the same across all the
SCEs.
We’ll show one way to do this in our custom function, but it’s worth
noting there’s nothing wrong with also adding the sample id to
the gene_symbol column; you’ll just end up with a bunch of
redundant gene symbol columns.
As you can see, there’s a lot of moving parts to consider! Again, these moving parts may (will!) differ for SCEs that you are working with, so you have to explore your own SCEs in depth to prepare for merging.
Based on our exploration, here is a schematic of how one of the SCE objects will ultimately be modified into the final merged SCE:
We’ll write a custom function (seen in the chunk below)
tailored to our wrangling steps that prepares a single SCE object for
merging. We’ll then use our new purrr::map() programming
skills to run this function over the sce_list. This will
give us a new list of formatted SCEs that we can proceed to merge. It’s
important to remember that the format_sce() function
written below is not a function for general use – it’s been precisely
written to match the processing we need to do for these SCEs,
and different SCEs you work with will require different types of
processing.
We also include roxygen-style comments for this function, which can be a helpful consistent way to document your code if you like it - we’ve even written a blog post about it :) (https://www.ccdatalab.org/blog/dont-make-me-write-tips-for-avoiding-typing-in-rstudio).
#' Custom function to format an SCE before merging
#'
#' @param sce SCE object to format
#' @param sample_name Name of the sample
#' @param shared_genes Vector of shared genes across all SCE objects
#'
#' @returns An updated SCE object ready for merging
format_sce <- function(
sce,
sample_name,
shared_genes
) {
### Remove the single-sample reduced dimensions
# We do this first since it makes the object a lot smaller for the rest of this code!
reducedDims(sce) <- NULL
### Add dedicated sample indicator column to the colData slot
# Recall, the `sce$` shortcut points to the colData
sce$sample <- sample_name
### Ensure objects have the same genes in the same order
# Use the shared_genes vector to index genes to the intersection
# Doing this both subsets to just those genes, and reorders!
sce <- sce[shared_genes, ]
### There is no additional wrangling to do for the colData column names or assays.
### But if there were, you could add your custom code to do so here.
### Your custom function may need additional arguments for this, too.
### Ensure cell ids are identifiable and fully unique
# Update the SCE object column names (cell ids) by prepending the `sample_name`
colnames(sce) <- glue::glue("{sample_name}-{colnames(sce)}")
### Ensure the rowData columns can be identified
# Recall, we want to leave `gene_symbol` alone, but add the `sample_name` to the rest
rowdata_names <- colnames(rowData(sce))
# prefix rowData names with the sample name, except for gene symbols
new_rowdata_names <- ifelse(
rowdata_names == "gene_symbol",
"gene_symbol",
glue::glue("{sample_name}-{rowdata_names}")
)
colnames(rowData(sce)) <- new_rowdata_names
### Ensure metadata slot fields can be identified
# We'll simply prepend the `sample_name` to all fields for this slot
names(metadata(sce)) <- glue::glue("{sample_name}-{names(metadata(sce))}")
### Finally, we can return the formatted SCE object
return(sce)
}
To run this function, we’ll use the purrr::map2()
function, a relative of purrr::map() that allows you to
loop over two input lists/vectors. In our case, we want to run
format_sce() over paired sce_list items and
sce_list names.
# We can use `purrr::map2()` to loop over two list/vector arguments
sce_list_formatted <- purrr::map2(
# Each "iteration" will march down the first two
# arguments `sce_list` and `names(sce_list)` in order
sce_list,
names(sce_list),
\(sce, sample_name) format_sce(sce, sample_name, shared_genes)
)
# Print formatted SCE list
sce_list_formatted
$SCPCL000479
class: SingleCellExperiment
dim: 60319 1918
metadata(14): SCPCL000479-salmon_version SCPCL000479-reference_index
... SCPCL000479-filtering_method SCPCL000479-miQC_model
assays(2): counts logcounts
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(3): gene_symbol SCPCL000479-mean SCPCL000479-detected
colnames(1918): SCPCL000479-GGGACCTCAAGCGGAT
SCPCL000479-CACAGATAGTGAGTGC ... SCPCL000479-GTTGTCCCACGTACAT
SCPCL000479-TCCGATCGTCGTGCCA
colData names(13): sum detected ... celltype_broad sample
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
$SCPCL000480
class: SingleCellExperiment
dim: 60319 4428
metadata(14): SCPCL000480-salmon_version SCPCL000480-reference_index
... SCPCL000480-filtering_method SCPCL000480-miQC_model
assays(2): counts logcounts
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(3): gene_symbol SCPCL000480-mean SCPCL000480-detected
colnames(4428): SCPCL000480-TAGGGTTAGCAAGCCA
SCPCL000480-AACTTCTTCCCTCAAC ... SCPCL000480-AGGGAGTAGCCTCATA
SCPCL000480-TCGGATACATTGCAAC
colData names(13): sum detected ... celltype_broad sample
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
$SCPCL000481
class: SingleCellExperiment
dim: 60319 5236
metadata(14): SCPCL000481-salmon_version SCPCL000481-reference_index
... SCPCL000481-filtering_method SCPCL000481-miQC_model
assays(2): counts logcounts
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(3): gene_symbol SCPCL000481-mean SCPCL000481-detected
colnames(5236): SCPCL000481-TCATGAGAGGCCCACT
SCPCL000481-GGGTATTTCGTTGTGA ... SCPCL000481-AAAGAACCACTTCAAG
SCPCL000481-CAGCAGCTCGTGCATA
colData names(13): sum detected ... celltype_broad sample
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
$SCPCL000482
class: SingleCellExperiment
dim: 60319 4372
metadata(14): SCPCL000482-salmon_version SCPCL000482-reference_index
... SCPCL000482-filtering_method SCPCL000482-miQC_model
assays(2): counts logcounts
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(3): gene_symbol SCPCL000482-mean SCPCL000482-detected
colnames(4372): SCPCL000482-GAGAAATCAGATTAAG
SCPCL000482-CAACCTCTCCGATCGG ... SCPCL000482-TGATTTCCACAAGTTC
SCPCL000482-ACCAAACGTTCTCAGA
colData names(13): sum detected ... celltype_broad sample
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
(Psst, like purrr and want to dive deeper? Check out the
purrr::imap() function!)
At long last, we are ready to merge the SCEs, which we’ll do using
the R function cbind(). The cbind() function
is often used to combine data frames or matrices by column, i.e. “stack”
them next to each other. The same principle applies here, but when run
on SCE objects, cbind() will create a new SCE object by
combining counts and logcounts matrices by
column. Following that structure, other SCE slots (colData,
rowData, reduced dimensions, and other metadata) are
combined appropriately.
Since we need to apply cbind() to a list of
objects, we need to use some slightly-gnarly syntax: We’ll use the
function do.call(), which allows the cbind()
input to be a list of objects to combine.
# Merge SCE objects
merged_sce <- do.call(cbind, sce_list_formatted)
# Print the merged_sce object
merged_sce
class: SingleCellExperiment
dim: 60319 15954
metadata(56): SCPCL000479-salmon_version SCPCL000479-reference_index
... SCPCL000482-filtering_method SCPCL000482-miQC_model
assays(2): counts logcounts
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(9): gene_symbol SCPCL000479-mean ... SCPCL000482-mean
SCPCL000482-detected
colnames(15954): SCPCL000479-GGGACCTCAAGCGGAT
SCPCL000479-CACAGATAGTGAGTGC ... SCPCL000482-TGATTTCCACAAGTTC
SCPCL000482-ACCAAACGTTCTCAGA
colData names(13): sum detected ... celltype_broad sample
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
We now have a single merged SCE object that contains all cells from all samples we’d like to integrate.
Let’s take a peek at some of the innards of this new SCE object:
# How many samples, and cells per sample?
table(colData(merged_sce)$sample)
SCPCL000479 SCPCL000480 SCPCL000481 SCPCL000482
1918 4428 5236 4372
# What are the new cell ids (column names)?
head(colnames(merged_sce))
[1] "SCPCL000479-GGGACCTCAAGCGGAT" "SCPCL000479-CACAGATAGTGAGTGC"
[3] "SCPCL000479-TGTGGCGGTGAATTGA" "SCPCL000479-GCCGATGGTACATACC"
[5] "SCPCL000479-ATTATCCCAGTTGGTT" "SCPCL000479-TCCGAAATCACACCGG"
SCPCL000479-GGGACCTCAAGCGGAT
SCPCL000479-CACAGATAGTGAGTGC
SCPCL000479-TGTGGCGGTGAATTGA
SCPCL000479-GCCGATGGTACATACC
SCPCL000479-ATTATCCCAGTTGGTT
SCPCL000479-TCCGAAATCACACCGG
tail(colnames(merged_sce))
[1] "SCPCL000482-GATCACACAGCTAACT" "SCPCL000482-GACGCTGAGACTCTAC"
[3] "SCPCL000482-GTGAGGAGTCAACCTA" "SCPCL000482-ATTCCTAGTGTACATC"
[5] "SCPCL000482-TGATTTCCACAAGTTC" "SCPCL000482-ACCAAACGTTCTCAGA"
SCPCL000482-GATCACACAGCTAACT
SCPCL000482-GACGCTGAGACTCTAC
SCPCL000482-GTGAGGAGTCAACCTA
SCPCL000482-ATTCCTAGTGTACATC
SCPCL000482-TGATTTCCACAAGTTC
SCPCL000482-ACCAAACGTTCTCAGA
# What does rowData look like?
head(rowData(merged_sce))
DataFrame with 6 rows and 9 columns
gene_symbol SCPCL000479-mean SCPCL000479-detected
<character> <numeric> <numeric>
ENSG00000000003 TSPAN6 0.01772018 1.639778
ENSG00000000005 TNMD 0.00264480 0.158688
ENSG00000000419 DPM1 0.07299656 6.109495
ENSG00000000457 SCYL3 0.02300979 1.983602
ENSG00000000460 C1orf112 0.08119545 6.426871
ENSG00000000938 FGR 0.00317376 0.238032
SCPCL000480-mean SCPCL000480-detected SCPCL000481-mean
<numeric> <numeric> <numeric>
ENSG00000000003 0.05209841 4.828312 0.06379838
ENSG00000000005 0.00855151 0.828838 0.00548350
ENSG00000000419 0.08919879 8.301539 0.24011389
ENSG00000000457 0.05262465 5.065123 0.13972372
ENSG00000000460 0.10906460 9.656624 0.26911315
ENSG00000000938 0.00197342 0.197342 0.00400717
SCPCL000481-detected SCPCL000482-mean SCPCL000482-detected
<numeric> <numeric> <numeric>
ENSG00000000003 5.989666 0.0289113 2.659838
ENSG00000000005 0.495624 0.0100776 0.759954
ENSG00000000419 20.056944 0.1391046 11.217578
ENSG00000000457 12.411684 0.0827689 7.054353
ENSG00000000460 21.206369 0.3092681 19.824880
ENSG00000000938 0.358536 0.0173468 1.387742
So far, we’ve created a merged_sce object which is
(almost!) ready for integration.
The integration methods we’ll be using here actually perform batch
correction on a reduced dimension representation of the normalized gene
expression values, which is more efficient. fastMNN and
harmony specifically use PCA for this, but be aware that
different integration methods may use other kinds of reduced
dimensions.
Before merging, our objects had reduced dimension representations calculated on each individual SCE, and we removed them when preparing for merge. We removed them because we don’t actually want to use them anymore! This is because part of their calculation involves scaling the raw data to center the mean. When samples are separately centered, all of them will be centered at zero, making it look like the datasets are already pretty overlapping when you plot their UMAPs together. But, this is just a mathematical artifact of how dimension reduction is performed.
So, we’ll begin by re-calculating PCA and UMAP on the merged object in a way that takes batches into consideration. For input to integration, we’ll want the reduced dimension calculations to consider normalized gene expression values from all samples simultaneously. So we’ll need to recalculate PCA (and UMAP for visualization) on the merged object.
First, as usual, we’ll determine the high-variance genes to use for
PCA from the merged_sce object. For this, we’ll need to
provide the argument block = merged_sce$sample when
modeling gene variance, which tells scran::modelGeneVar()
to first model variance separately for each batch and then combine those
modeling statistics. (Psst: isn’t it handy we created that
sample column when merging?!)
# Specify the number of genes to identify
num_genes <- 2000
# Calculate variation for each gene
gene_variance <- scran::modelGeneVar(merged_sce,
# specify the grouping column:
block = merged_sce$sample)
# Get the top `num_genes` high-variance genes to use for dimension reduction
hv_genes <- scran::getTopHVGs(gene_variance,
n = num_genes)
To calculate the PCA matrix itself, we’ll use an approach from the
batchelor package, which is the R package that contains the
fastMNN method. The batchelor::multiBatchPCA()
function calculates a batch-weighted PCA matrix. This weighting ensures
that all batches, which may have very different numbers of cells,
contribute equally to the overall scaling.
# Use batchelor to calculate PCA for merged_sce, considering only
# the high-variance genes
# We'll need to include the argument `preserve.single = TRUE` to get
# a single matrix with all samples and not separate matrices for each sample
merged_pca <- batchelor::multiBatchPCA(merged_sce,
subset.row = hv_genes,
batch = merged_sce$sample,
preserve.single = TRUE)
Let’s have a look at the output:
# This output is not very interesting!
merged_pca
List of length 1
We can use indexing [[1]] to see the PCA matrix
calculated, looking at a small subset for convenience:
merged_pca[[1]][1:5,1:5]
[,1] [,2] [,3] [,4] [,5]
SCPCL000479-GGGACCTCAAGCGGAT -8.553693 -7.508773 -8.286024 3.9288287 -3.353172
SCPCL000479-CACAGATAGTGAGTGC -9.741452 -7.712208 -10.310277 4.7186301 -4.271634
SCPCL000479-TGTGGCGGTGAATTGA -9.578361 -6.841931 -6.185011 2.7051343 -3.083831
SCPCL000479-GCCGATGGTACATACC -8.287781 -7.592143 -10.407848 0.2007613 -8.194913
SCPCL000479-ATTATCCCAGTTGGTT -7.035384 -4.938422 -6.042540 0.4756373 -2.014974
We can now include this PCA matrix in our merged_sce
object:
# add PCA results to merged SCE object
reducedDim(merged_sce, "PCA") <- merged_pca[[1]]
Now that we have the PCA matrix, we can proceed to calculate UMAP to visualize the uncorrected merged data.
merged_sce <- scater::runUMAP(merged_sce)
# UMAPs scaled together when calculated from the merged SCE
scater::plotUMAP(
merged_sce,
color_by = "sample",
# Some styling to help us see the points:
point_size = 0.5,
point_alpha = 0.2
) +
scale_color_brewer(palette = "Dark2", name = "sample") +
guides(color = guide_legend(override.aes = list(size = 3, alpha = 1))) +
ggtitle("UMAP calculated on merged_sce")
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
We see (mostly) four separate clumps representing the four different merged but not yet integrated samples. We can think of this UMAP as our “before” UMAP, and we can compare this to the “after” UMAP we see post-integration.
Let’s discuss a little first: What visual differences do you think the UMAP on the integrated version of data will have? What similarities do you think the integrated UMAP will have to this plot?
fastMNNFinally, we’re ready to integrate! To start, we’ll use the
fastMNN approach from the Bioconductor batchelor
package.
fastMNN takes as input the merged_sce
object to integrate, and the first step it performs is actually to run
batchelor::multiBatchPCA() on that SCE. It then uses that
batch-weighted PCA matrix to perform the actual batch correction. The
batch argument is used to specify the different groupings
within the merged_sce (i.e. the original sample that each
cell belongs to), and the subset.row argument can
optionally be used to provide a vector of high-variance genes that
should be considered for this PCA calculation. fastMNN will
return an SCE object that contains a batch-corrected PCA. Let’s run it
and save the result to a variable called
integrated_sce.
# integrate with fastMNN, again specifying only our high-variance genes
integrated_sce <- batchelor::fastMNN(
merged_sce,
batch = merged_sce$sample,
subset.row = hv_genes
)
Let’s have a look at the result:
# Print the integrated_sce object
integrated_sce
class: SingleCellExperiment
dim: 2000 15954
metadata(2): merge.info pca.info
assays(1): reconstructed
rownames(2000): ENSG00000278996 ENSG00000157168 ... ENSG00000227232
ENSG00000258984
rowData names(1): rotation
colnames(15954): SCPCL000479-GGGACCTCAAGCGGAT
SCPCL000479-CACAGATAGTGAGTGC ... SCPCL000482-TGATTTCCACAAGTTC
SCPCL000482-ACCAAACGTTCTCAGA
colData names(1): batch
reducedDimNames(1): corrected
mainExpName: NULL
altExpNames(0):
There are couple pieces of information here of interest:
corrected reduced dimension represents the
batch-corrected PCA that fastMNN calculated.reconstructed assay represents the batch-corrected
normalized expression values, which fastMNN
“back-calculated” from the batch-corrected PCA (corrected).
Generally speaking, these expression values are not stand-alone values
that you should use for other applications like differential gene
expression, as described in Orchestrating
Single Cell Analyses. If the subset.row argument
is provided (as it was here), only genes present in
subset.row will be included in these reconstructed
expression values, but this setting can be overridden so that all genes
have reconstructed expression with the argument
correct.all = TRUE.We’re mostly interested in the PCA that fastMNN
calculated, so let’s save that information (with an informative and
unique name) into our merged_sce object:
# Make a new reducedDim named fastmnn_PCA from the corrected reducedDim in integrated_sce
reducedDim(merged_sce, "fastmnn_PCA") <- reducedDim(integrated_sce, "corrected")
Finally, we’ll calculate UMAP from these corrected PCA matrix for visualization. In this case we need to specify two additional arguments since we’re working with non-standard reduced dimension names:
dimred = "fastmnn_PCA", which specifies the existing
reduced dimension to use for the calculationname = "fastmnn_UMAP", which names the final UMAP that
this function calculates# Calculate UMAP
merged_sce <- scater::runUMAP(
merged_sce,
dimred = "fastmnn_PCA",
name = "fastmnn_UMAP"
)
First, let’s plot the integrated UMAP highlighting the different batches. A well-integrated dataset will show batch mixing, but a poorly-integrated dataset will show more separation among batches, similar to the uncorrected UMAP. Note that this is a more qualitative way to assess the success of integration, but there are formal metrics one can use to assess batch mixing, which you can read more about in this chapter of OSCA.
scater::plotReducedDim(merged_sce,
# plot the fastMNN coordinates
dimred = "fastmnn_UMAP",
# color by sample
color_by = "sample",
# Some styling to help us see the points:
point_size = 0.5,
point_alpha = 0.2) +
scale_color_brewer(palette = "Dark2", name = "sample") +
guides(color = guide_legend(override.aes = list(size = 3, alpha = 1))) +
ggtitle("UMAP after integration with fastMNN")
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
This fastmnn_UMAP certainly looks different from the one
we made before integrating! What different trends do you see? Do all
samples look “equally well” integrated, from a first look?
Importantly, one reason that batches may still appear separated in the corrected UMAP is if they should be separated - for example, maybe two batches contain very different cell types, have very different diagnoses, or may be from different patients.
Recall from earlier that we conveniently have cell type annotations in our SCEs, so we can explore those here! Let’s take a quick detour to see what kinds of cell types are in this data by making a barplot of the cell types across samples:
# Cell types are in the `celltype_broad` and `celltype_fine` columns
merged_sce_df <- as.data.frame(colData(merged_sce))
# Use ggplot2 to make a barplot the cell types across samples
ggplot(merged_sce_df,
aes(x = sample,
fill = celltype_broad)) +
# Barplot of celltype proportions
geom_bar(position = "fill") +
# Use a CVD-friendly color scheme
scale_fill_brewer(palette = "Dark2", na.value = "grey80") +
# customize y-axis label
labs(y = "Proportion") +
# nicer theme
theme_bw()
We see that Tumor cell types are by far the most prevalent across all
samples, and normal tissue cell types are not very common. We see also
that SCPCL000481 has a larger Tumor_Myocyte
population, while all other samples have larger
Tumor_Mesoderm populations. This difference may
explain why we observe that SCPCL000481 is somewhat more
separated from the other samples in the fastMNN UMAP.
Let’s re-plot this UMAP to highlight cell types:
scater::plotReducedDim(merged_sce,
dimred = "fastmnn_UMAP",
# color by broad celltypes
color_by = "celltype_broad",
point_size = 0.5,
point_alpha = 0.2) +
# include argument to specify color of NA values
scale_color_brewer(palette = "Dark2", name = "Broad celltype", na.value = "grey80") +
guides(color = guide_legend(override.aes = list(size = 3, alpha = 1))) +
ggtitle("UMAP after integration with fastMNN")
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
This UMAP shows that the normal tissue cell types (mostly vascular endothelium, muscle cells, and monocytes) tend to cluster together and are generally separated from the tumor cell types, which is an encouraging pattern! Tumor cell types from different samples are all also clustering together, which is even more encouraging that we had successful integration.
However, it’s a bit challenging to see all the points given the
amount of overlap in the plot. One way we can see all the points a bit
better is to facet the plot by sample, using facet_wrap()
from the ggplot2 package (which we can do because
scater::plotReducedDim() returns a ggplot2
object):
scater::plotReducedDim(merged_sce,
dimred = "fastmnn_UMAP",
color_by = "celltype_broad",
point_size = 0.5,
point_alpha = 0.2,
# Allow for faceting by a variable using `other_fields`:
other_fields = "sample") +
scale_color_brewer(palette = "Dark2", name = "Broad celltype", na.value = "grey80") +
guides(color = guide_legend(override.aes = list(size = 3, alpha = 1))) +
ggtitle("UMAP after integration with fastMNN") +
# Facet by sample
facet_wrap(vars(sample)) +
# Use a theme with background grid to more easily compare panel coordinates
theme_bw()
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
What trends do you observe between tumor and healthy tissues among these integrated samples?
harmonyfastMNN is only one of many approaches to perform
integration, and different methods have different capabilities and may
give different results. For example, some methods can accommodate
additional covariates (e.g., technology, patient, diagnosis, etc.) that
can influence integration. In fact the data we are using has a known
patient covariate; SCPCL000479 and
SCPCL000480 are from Patient A, and
SCPCL000481 and SCPCL000482 are from Patient
B.
So, let’s perform integration with a method that can use this
information - harmony!
To begin setting up for harmony integration, we need to
add explicit patient information into our merged SCE. We’ll create a new
column patient whose value is either “A” or “B” depending
on the given sample name, using the dplyr::case_when()
function. We provide this function with a set of logical expressions and
each assigned value is designated by ~. The expressions are
evaluated in order, stopping at the first one that evaluates as
TRUE and returning the associated value.
# Create patient column with values "A" or "B" for the two patients
merged_sce$patient <- dplyr::case_when(
merged_sce$sample %in% c("SCPCL000479", "SCPCL000480") ~ "A",
merged_sce$sample %in% c("SCPCL000481", "SCPCL000482") ~ "B",
)
Unlike fastMNN, harmony does not calculate
corrected expression values nor does it return an SCE object. Like
fastMNN, harmony performs integration on a
merged PCA matrix. However, unlike fastMNN,
harmony does not “back-calculate” corrected expression from
the corrected PCA matrix and it only returns the corrected PCA matrix
itself. For input, harmony needs a couple pieces of
information:
harmony takes a batch-weighted PCA matrix to
perform integration. We already calculated a batch-weighted PCA matrix
so we’ll provide this as the the input.harmony about the covariates to
use - sample and patient. To do this, we
provide two arguments:
meta_data, a data frame that contains covariates across
samples. We can simply specify the SCE colData here since
it contains sample and patient columns.vars_use, a vector of which column names in
meta_data should actually be used as covariates. Other
columns in meta_data which are not in vars_use
are ignored.Let’s go!
# Run harmony integration
harmony_pca <- harmony::RunHarmony(
data_mat = reducedDim(merged_sce, "PCA"),
meta_data = colData(merged_sce),
vars_use = c("sample", "patient")
)
Transposing data matrix
Initializing state using k-means centroids initialization
Harmony 1/10
Harmony 2/10
Harmony converged after 2 iterations
The result is a PCA matrix. Let’s print a subset of this matrix to see it:
# Print the harmony result
harmony_pca[1:5, 1:5]
[,1] [,2] [,3] [,4] [,5]
SCPCL000479-GGGACCTCAAGCGGAT -7.036850 -7.280253 -3.825712 5.564675 0.5984375
SCPCL000479-CACAGATAGTGAGTGC -8.232417 -7.422768 -5.641379 6.128005 -0.2208036
SCPCL000479-TGTGGCGGTGAATTGA -7.928602 -6.591079 -2.140173 4.775223 0.8639596
SCPCL000479-GCCGATGGTACATACC -6.821359 -7.465958 -5.872961 2.313765 -4.1492218
SCPCL000479-ATTATCCCAGTTGGTT -5.611779 -4.916862 -1.894365 2.626925 1.5535017
As we did with fastMNN results, let’s store this PCA
matrix directly in our merged_sce object with an
informative name that won’t overwrite any of the existing PCA matrices.
We’ll also calculate UMAP from it.
# Store PCA as `harmony_PCA`
reducedDim(merged_sce, "harmony_PCA") <- harmony_pca
# As before, calculate UMAP on this PCA matrix with appropriate names
merged_sce <- scater::runUMAP(merged_sce,
dimred = "harmony_PCA",
name = "harmony_UMAP")
Let’s see how the harmony UMAP, colored by sample, looks
compared to the fastMNN UMAP:
scater::plotReducedDim(merged_sce,
dimred = "harmony_UMAP",
color_by = "sample",
point_size = 0.5,
point_alpha = 0.2) +
scale_color_brewer(palette = "Dark2", name = "sample") +
guides(color = guide_legend(override.aes = list(size = 3, alpha = 1))) +
ggtitle("UMAP after integration with harmony")
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
How do you think this harmony UMAP compares to that from
fastMNN integration?
Let’s see how this UMAP looks colored by cell type, and faceted for visibility:
scater::plotReducedDim(merged_sce,
dimred = "harmony_UMAP",
color_by = "celltype_broad",
point_size = 0.5,
point_alpha = 0.2,
# Specify variable for faceting
other_fields = "sample") +
scale_color_brewer(palette = "Dark2", name = "Broad celltype", na.value = "grey80") +
guides(color = guide_legend(override.aes = list(size = 3, alpha = 1))) +
ggtitle("UMAP after integration with harmony") +
facet_wrap(vars(sample))
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
What do you now notice in this faceted view that wasn’t clear
previously? Are there other patterns you see that are similar or
different from the fastMNN UMAP? How do you think
fastMNN vs. harmony performed in integrating
these samples?
Finally, we’ll export the final SCE object with both
fastMNN and harmony integration to a file.
Since this object is very large (over 1 GB!), we’ll export it to a file
with some compression, which, in this case, will reduce the final size
to a smaller ~360 MB. This will take a couple minutes to save while
compression is performed.
# Export to RDS file with "gz" compression
readr::write_rds(
merged_sce,
integrated_sce_file,
compress = "gz"
)
As always, we’ll print the session info to be transparent about what packages, and which versions, were used during this R session.
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
[3] Biobase_2.64.0 GenomicRanges_1.56.0
[5] GenomeInfoDb_1.40.0 IRanges_2.38.0
[7] S4Vectors_0.42.0 BiocGenerics_0.50.0
[9] MatrixGenerics_1.16.0 matrixStats_1.3.0
[11] ggplot2_3.5.1 optparse_1.7.5
loaded via a namespace (and not attached):
[1] gridExtra_2.3 rlang_1.1.3
[3] magrittr_2.0.3 scater_1.32.0
[5] RcppAnnoy_0.0.22 compiler_4.4.1
[7] flexmix_2.3-19 DelayedMatrixStats_1.26.0
[9] vctrs_0.6.5 stringr_1.5.1
[11] pkgconfig_2.0.3 crayon_1.5.2
[13] fastmap_1.1.1 XVector_0.44.0
[15] labeling_0.4.3 scuttle_1.14.0
[17] utf8_1.2.4 rmarkdown_2.26
[19] tzdb_0.4.0 UCSC.utils_1.0.0
[21] ggbeeswarm_0.7.2 purrr_1.0.2
[23] xfun_0.43 modeltools_0.2-23
[25] bluster_1.14.0 zlibbioc_1.50.0
[27] cachem_1.0.8 beachmat_2.20.0
[29] jsonlite_1.8.8 highr_0.10
[31] DelayedArray_0.30.0 BiocParallel_1.38.0
[33] irlba_2.3.5.1 parallel_4.4.1
[35] cluster_2.1.6 R6_2.5.1
[37] RColorBrewer_1.1-3 bslib_0.7.0
[39] stringi_1.8.3 limma_3.60.0
[41] jquerylib_0.1.4 Rcpp_1.0.12
[43] knitr_1.46 readr_2.1.5
[45] batchelor_1.20.0 Matrix_1.7-0
[47] splines_4.4.1 nnet_7.3-19
[49] igraph_2.0.3 tidyselect_1.2.1
[51] abind_1.4-5 yaml_2.3.8
[53] viridis_0.6.5 codetools_0.2-20
[55] lattice_0.22-6 tibble_3.2.1
[57] withr_3.0.0 evaluate_0.23
[59] getopt_1.20.4 pillar_1.9.0
[61] generics_0.1.3 hms_1.1.3
[63] sparseMatrixStats_1.16.0 munsell_0.5.1
[65] scales_1.3.0 miQC_1.12.0
[67] RhpcBLASctl_0.23-42 glue_1.7.0
[69] metapod_1.12.0 tools_4.4.1
[71] BiocNeighbors_1.22.0 ScaledMatrix_1.12.0
[73] locfit_1.5-9.9 fs_1.6.4
[75] scran_1.32.0 cowplot_1.1.3
[77] grid_4.4.1 edgeR_4.2.0
[79] colorspace_2.1-0 GenomeInfoDbData_1.2.12
[81] beeswarm_0.4.0 BiocSingular_1.20.0
[83] vipor_0.4.7 cli_3.6.2
[85] rsvd_1.0.5 fansi_1.0.6
[87] S4Arrays_1.4.0 viridisLite_0.4.2
[89] dplyr_1.1.4 uwot_0.2.2
[91] ResidualMatrix_1.14.0 gtable_0.3.5
[93] sass_0.4.9 digest_0.6.35
[95] SparseArray_1.4.0 ggrepel_0.9.5
[97] dqrng_0.3.2 farver_2.1.1
[99] htmltools_0.5.8.1 lifecycle_1.0.4
[ reached getOption("max.print") -- omitted 3 entries ]