This notebook will demonstrate how to:
DESeq2
packageggplot2
and EnhancedVolcano
to
visualize gene expression changes across cell types and samplesJust like bulk RNA-seq, it is likely that one of the goals when
performing scRNA-seq will be to compare the gene expression of multiple
samples to each other. Unlike bulk RNA-seq analysis, scRNA-seq analysis
allows us to identify and annotate cell types or subpopulations of cells
present in each of our samples. This means that we can account for
differences in cell type composition across samples and specifically
focus on cell types or populations of interest when performing
differential expression (DE) analysis. In this notebook, we will work
with multiple samples to identify differentially expressed genes across
cell types of interest using the DESeq2
package.
We will continue working with samples from the SCPCP000005
project, an investigation of pediatric solid tumors led by the Dyer
and Chen labs at St. Jude Children’s Research Hospital. This particular
dataset contains 10 different samples that have been integrated using
fastMNN
, following the same procedure we outlined in
02-dataset_integration.Rmd
. These 10 samples represent two
different types of rhabdomyosarcoma (RMS): embryonal rhabdomyosarcoma
(ERMS) and alveolar rhabdomyosarcoma (ARMS). These two subtypes are
distinguished by the presence of the PAX3/PAX7-FOXO1
fusion
gene, which is present only in ARMS patients. Additionally, cells found
in ARMS tumors tend to have an increased mutational burden with cells in
a more differentiated state compared to ERMS tumor cells (Shern et
al. 2014; Stewart et
al. 2018). RMS tumors, regardless of subtype, are made up of
cells typically associated with development of skeletal muscle:
mesoderm, myoblasts, and myocytes (Sebire and Malone 2003).
Patel et
al. (2022) tested the hypothesis that cell types have distinct
gene expression patterns in ARMS vs. ERMS samples. Here we will look at
a subset of the samples they sequenced and identify differentially
expressed genes in tumor cells between ARMS and ERMS samples.
# set seed for reproducibility
set.seed(2022)
# load libraries
library(ggplot2) # plotting functions
library(SingleCellExperiment)
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
# package used for differential expression analysis
library(DESeq2)
We will start by reading in a SingleCellExperiment
(SCE)
object that contains both the uncorrected (merged but not integrated)
and corrected (integrated) gene expression data for all 10 samples.
Prior to integration, all 10 samples went through the same filtering,
normalization, and dimensionality reduction. These 10 samples were then
merged into one SingleCellExperiment
object following the
same steps outlined in 03-dataset_integration.Rmd
. The
merged object was then integrated with fastMNN
to obtain a
corrected gene expression assay and corrected reduced dimensionality
results. The final SCE object was stored in
data/rms/integrated/rms_all_sce.rds
.
We also have provided a metadata file,
data/rms/annotations/rms_sample_metadata.tsv
, that contains
information from each sample, such as diagnosis, sex, age, etc. Each row
in this file corresponds to a sample found in the integrated SCE
object.
To begin, let’s set up our directories and files:
# set up file paths
# data directory for RMS data
data_dir <- file.path("data", "rms")
# integrated file containing samples to use for DE analysis
integrated_sce_file <- file.path(data_dir,
"integrated",
"rms_all_sce.rds")
# sample metadata to set up DE analysis
sample_metadata_file <- file.path(data_dir,
"annotations",
"rms_sample_metadata.tsv")
# directory to store output
deseq_dir <- file.path("analysis", "rms", "deseq")
fs::dir_create(deseq_dir)
# results file to output from DE analysis
deseq_output_file <- file.path(deseq_dir,
"rms_myoblast_deseq_results.tsv")
# output integrated sce object
output_sce_file <- file.path(data_dir,
"integrated",
"rms_subset_sce.rds")
We can go ahead and read in the SCE object and the metadata file.
# read in the SCE object that has already been integrated
integrated_sce <- readr::read_rds(integrated_sce_file)
# read in sample metadata file
sample_metadata <- readr::read_tsv(sample_metadata_file)
Rows: 10 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (10): project_id, submitter, library_id, sample_id, diagnosis, technolog...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Before we dive into differential expression, let’s explore our integrated SCE object and the dataset a little more.
We’ll start by looking at what’s inside the object. Here we should have both the original (uncorrected) data and the integrated (corrected) data for both the gene expression and the reduced dimensionality results. How are those stored in our object?
# print out entire object
integrated_sce
class: SingleCellExperiment
dim: 60319 39332
metadata(141): salmon_version reference_index ... miQC_model
combined_hvg
assays(3): counts logcounts fastmnn_corrected
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(21): gene_symbol SCPCL000478-mean ... SCPCL000498-mean
SCPCL000498-detected
colnames(39332): SCPCL000478-CCTTTGGCACTACCCT
SCPCL000478-GCAACCGGTCTTAGTG ... SCPCL000498-CGTCAAATCGAGAATA
SCPCL000498-GGAACCCCACGACAAG
colData names(13): sum detected ... celltype_broad sample
Loading required package: BiocSingular
reducedDimNames(4): PCA UMAP fastmnn_PCA fastmnn_UMAP
mainExpName: NULL
altExpNames(0):
# look at the assay names in our object
assayNames(integrated_sce)
[1] "counts" "logcounts" "fastmnn_corrected"
counts
logcounts
fastmnn_corrected
When we look at the assay names we should see that there are 3
matrices, counts
, logcounts
, and
fastmnn_corrected
. The counts
and
logcounts
assays correspond to the uncorrected gene
expression data that has been merged but NOT integrated. The
fastmnn_corrected
data contains the corrected gene
expression data obtained from integration. For this exercise we will not
be using the fastmnn_corrected
data (more on why not once
we get to setting up the differential expression), but we need to be
aware that it is present and be able to distinguish it from our
uncorrected data.
# look at the names of the dimension reductions
reducedDimNames(integrated_sce)
[1] "PCA" "UMAP" "fastmnn_PCA" "fastmnn_UMAP"
PCA
UMAP
fastmnn_PCA
fastmnn_UMAP
In the reducedDim
slots you should see PCA
and UMAP
, which were both calculated from the combined data
before integration. You should also see
fastmnn_PCA
and fastmnn_UMAP
reduced
dimensions, which correspond to the integrated results.
Just like in the integration notebook, this dataset also contains the
cell type annotations found in the celltype_fine
and
celltype_broad
columns of the colData
. These
cell types were originally assigned in Patel et
al. (2022). We will use these cell type assignments to set up
the DE analysis below, but they are not required for DE analysis itself.
It’s important to note that DE analysis can be applied to any
subpopulation of interest that is shared across samples besides just
cell types.
Because we are going to be doing DE analysis between ARMS and ERMS
samples, let’s start by labeling cells in the integrated dataset based
on their RMS subtype. To do this we will need to be sure that the
subtype is present in the colData
of the integrated SCE
object. If it’s not there, we need to add it in.
# look at the head of the coldata
head(colData(integrated_sce)) |>
as.data.frame()
Uh oh, it looks like the RMS subtype is not found in the SCE object. Fortunately we also have the sample metadata table that we read in earlier, which contains information about each of the samples present in the dataset.
# print out sample metadata
head(sample_metadata)
Looking at this sample table, we see a column named
subdiagnosis
which accounts for the RMS subtype, ARMS or
ERMS. We also see other columns that contain information about each
specific sample.
We can incorporate the information in this sample metadata table into
the colData
of the integrated SCE object. This will allow
us to match each of the samples in the SCE object with the RMS subtype
and also allow us to use any of the columns in the sample metadata for
plotting.
# add sample metadata to colData from the integrated SCE object
coldata_df <- colData(integrated_sce) |>
# convert from DataFrame to data.frame
as.data.frame() |>
# merge with sample metadata
dplyr::left_join(sample_metadata, by = c("sample" = "library_id")) |>
# create new columns
# cell_id is a combination of barcode and sample
dplyr::mutate(cell_id = glue::glue("{sample}-{barcode}"),
# simplify subdiagnosis
diagnosis_group = forcats::fct_recode(
subdiagnosis,
"ARMS" = "Alveolar rhabdomyosarcoma",
"ERMS" = "Embryonal rhabdomyosarcoma"
))
# add modified data frame back to SCE as DataFrame
colData(integrated_sce) <- DataFrame(coldata_df,
row.names = coldata_df$cell_id)
Now when we look at the colData
of the SCE object we
should see new columns, including the diagnosis_group
column which indicates if each cell comes from an ERMS or ARMS
sample.
# take a look at the new modified colData
head(colData(integrated_sce)) |>
as.data.frame()
We can now use that column to label any UMAP plots (or other plot types) that we make. In the chunk below we will start by taking a look at our integration results and color our cells by RMS subtype.
Reminder: You should always use the batch-corrected dimensionality reduction results for visualizing datasets containing multiple libraries or samples.
# UMAP of all samples, separating by diagnosis group
scater::plotReducedDim(integrated_sce,
dimred = "fastmnn_UMAP",
color_by = "diagnosis_group",
point_size= 0.5,
point_alpha = 0.2)
Interestingly, it looks like samples from the ARMS and ERMS subtypes tend to group with samples of the same subtype rather than all together.
In the integration notebook we also looked at the distribution of cell types after integration. In that notebook, we discussed that cells of the same cell type are expected to integrate with other cells of the same type. Is that the case with this dataset?
A word of caution when evaluating the cell type results for this
dataset: The cell types for this dataset were assigned in a two stage
process as described in Patel et
al. (2022). The first stage assigned cells as tumor or
non-tumor. The next stage further classified tumor cells into one of
three types of tumor cells: myoblast, myocyte, or mesoderm. Some samples
could not be further classified, so all of their tumor cells are denoted
Tumor
. The samples which could be further classified have a
mix of Tumor_Mesoderm
, Tumor_Myoblast
, and
Tumor_Myocyte
.
# UMAP of all samples labeled by cell type
scater::plotReducedDim(integrated_sce,
dimred = "fastmnn_UMAP",
# color each point by cell type
color_by = "celltype_broad",
point_size= 0.5,
point_alpha = 0.4)
Unlike with the previous datasets we have seen where all cells of the same cell type always grouped together, this dataset shows some slightly different patterns and not all cells of the same cell type cluster together. One reason is that tumor data can be heterogeneous and every tumor is unique. Depending on the tumor type we may not expect every sample to integrate perfectly and more heterogeneous tumor types will be more difficult to integrate together. In this particular case we are looking at two subtypes of RMS that have distinct mutation burdens and differentiation states, so it’s likely that those differences contribute to how well they integrate.
To explore whether cells are grouping together both by cell type and
by RMS subtype, we can create a plot that incorporates both pieces of
metadata. We will take advantage of the facet_grid()
function from ggplot2
to look at two variables in the
colData
at once - the cell type and the subdiagnosis. In
the below plot we will color our cells by cell type while also using
facet_grid()
so that cells from different subdiagnoses will
be in their own plot panel.
# UMAP of all samples
# separating by diagnosis group and labeling cell type
scater::plotReducedDim(integrated_sce,
dimred = "fastmnn_UMAP",
# color each point by cell type
color_by = "celltype_broad",
point_size= 0.5,
point_alpha = 0.4,
# tell scater to use diagnosis_group for plotting
other_fields = "diagnosis_group") +
# include each diagnosis group as its own column
facet_grid(cols = vars(diagnosis_group))
As expected, we see that cell types are separated, most likely due to different RMS subtypes.
We can also use a stacked barplot to look at the distribution of cell types across each sample, which will require a bit of wrangling first.
# filter coldata to only include tumor cells
tumor_cells_df <- coldata_df |>
# find rows where the cell type name contains the string "Tumor"
dplyr::filter(stringr::str_detect(celltype_broad, "Tumor"))
# create a stacked barplot
ggplot(tumor_cells_df, aes(x = sample, fill = celltype_broad)) +
geom_bar(position = "fill", color = "black", size = 0.2) +
labs(
x = "Sample",
y = "Proportion of cells",
fill = "Cell type"
) +
scale_fill_brewer(palette = "Dark2") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))+
# facet by diagnosis group
facet_grid(cols = vars(diagnosis_group),
# only show non-NA values on x-axis
scales = "free_x",
space = "free_x")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
Similar to the UMAP, this plot shows that ARMS and ERMS share a lot of the same cell types.
We also see that only 6 of these libraries have tumor cells that have
been further classified into mesoderm, myoblast, and myocyte. 3
libraries contain cells that are only classified as tumor or non-tumor,
and tumor cells are not further classified, and the remaining library is
not even present in our plot because it was not assigned any cell types
(all are NA
). We will continue our analysis only using the
6 libraries with fully classified cell types, removing the other 4
before we proceed with differential expression.
The reason we want to pare down our list of samples to consider is that we want to ensure that the cell types (or subpopulations) that we are interested in are present in all samples included in our DE analysis. We want to remove any samples that do not contain our cell population(s) of interest as they have no counts to contribute to the DE analysis.
# define samples to keep
library_ids <- c(
"SCPCL000479",
"SCPCL000480",
"SCPCL000481",
"SCPCL000484",
"SCPCL000488",
"SCPCL000491"
)
# subset sce to only contain samples of interest
samples_to_keep <- integrated_sce$sample %in% library_ids
rms_sce <- integrated_sce[, samples_to_keep]
# print out our new SCE
rms_sce
class: SingleCellExperiment
dim: 60319 26033
metadata(141): salmon_version reference_index ... miQC_model
combined_hvg
assays(3): counts logcounts fastmnn_corrected
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(21): gene_symbol SCPCL000478-mean ... SCPCL000498-mean
SCPCL000498-detected
colnames(26033): SCPCL000479-GGGACCTCAAGCGGAT
SCPCL000479-CACAGATAGTGAGTGC ... SCPCL000491-TCGCACTAGGAACGTC
SCPCL000491-TTGCATTTCAACGCTA
colData names(24): sum detected ... cell_id diagnosis_group
reducedDimNames(4): PCA UMAP fastmnn_PCA fastmnn_UMAP
mainExpName: NULL
altExpNames(0):
Before we move on, we’ll remove the original integrated object from our environment to save some memory.
rm(integrated_sce)
We will also save our new object in case we want to use it for other analysis later on.
# write RDS file with compression
readr::write_rds(rms_sce, file = output_sce_file, compress = "gz")
We now have an updated SCE object that contains 6 samples that were obtained from a mix of ARMS and ERMS patients. We can then ask the question, do specific tumor cell types contain sets of differentially expressed genes between ARMS and ERMS samples?
We should make sure that we have enough biological replicates from each group to set up our experiment. It is imperative to consider good experimental design and ensure that we have enough biological replicates (at least 3 for each group) when performing differential gene expression analysis.
If we look back at our stacked barplot we see that we picked 3 ARMS
and 3 ERMS samples. We can also see that the majority of cells are tumor
cells, in particular the largest population of cells appears to be the
Tumor_Myoblast
. For this example we will focus on
identifying DE genes in these Tumor_Myoblast
cells, but the
principles applied below can be applied to any cell types or
subpopulations of interest.
Now we are ready to start preparing for our DE analysis, where we will compare the gene expression of tumor myoblast cells between ARMS and ERMS samples.
Throughout the notebook we have been working with an integrated
dataset that contains corrected gene expression data
(fastmnn_corrected
assay) and a corrected UMAP. As a
reminder, the uncorrected gene expression data, found in the
counts
and logcounts
assays, correspond to
data that has been merged (the first step we walked through prior to
integration) into the same SCE but not yet integrated. We do not want to
use corrected gene expression values for differential expression;
DESeq2
expects the original raw counts as input so we will
be using data found in the counts
assay of the
SingleCellExperiment
object.
It is advised to only use the corrected values for any analyses being performed at the cell level, e.g., dimensionality reduction. In contrast, it is not advised to use corrected values for any analyses that are gene-based, such as differential expression or marker gene detection, because within-batch and between-batch gene expression differences are no longer preserved. The reason for this is two-fold – many of the DE models will expect uncorrected counts because they will account for between-sample variation within the model, and we want to ensure we are preserving variation that is present so as not to artificially inflate differences between populations. See the OSCA chapter on Using the corrected values for more insight.
Before we can compare the gene expression profiles of myoblasts in ARMS vs. ERMS samples, we will need to “pseudo-bulk” the gene counts. Pseudo-bulking creates a new counts matrix that contains the sum of the counts from all cells with a given label (e.g., cell type) for each sample (Tung et al. 2017). If we were to keep each cell’s counts separate, they would be treated as replicates, leading to inflated statistics. By pseudo-bulking first, we will now have one count for each gene for each sample and we can take advantage of well-established methods for differential expression with bulk RNA-seq.
Pseudo-bulking is implemented prior to differential expression analysis on single-cell data because it:
Before we apply pseudo-bulking to our dataset, let’s look at a simple example of how pseudo-bulking works. We’ll start by creating a fake matrix of counts.
# create an example counts matrix
counts_mtx <- matrix(
1:12,
ncol = 4,
dimnames = list(c("geneA", "geneB", "geneC"),
c("A-cell1", "A-cell2", "B-cell1", "B-cell2"))
)
counts_mtx
A-cell1 A-cell2 B-cell1 B-cell2
geneA 1 4 7 10
geneB 2 5 8 11
geneC 3 6 9 12
Next we will create a pseudo-bulked version of this matrix with only
2 columns: 1 for group A
and 1 for group B
. To
do this we will use the DelayedArray::colsum()
function,
which allows us to sum the counts for each row across groups of
columns.
# define the group that each column belongs to
groups <- c("A", "A", "B", "B")
# sum counts across cells (columns) by group label
pb_counts <- DelayedArray::colsum(counts_mtx,
groups)
pb_counts
A B
geneA 5 17
geneB 7 19
geneC 9 21
Looking at this output, you should see that the original 4 columns
have been condensed to only 2 columns: 1 column to represent all cells
from group A
, and 1 column to represent all cells from
group B
.
Now the actual pseudo-bulking for our dataset!
We will use the scuttle::aggregateAcrossCells()
function to pseudo-bulk our dataset. This function takes as input an
SCE object and the grouping assignments for each cell. The output will
be an SCE object that contains only the pseudo-bulked counts for all
genes across all specified groups, rather than across all cells. We can
then subset this SCE to just include our cell type of interest (tumor
myoblasts) for input to the DE analysis.
We can pseudo-bulk using any grouping that we are interested in. For right now, we are interested in looking at gene expression across cell types, so we want to group the pseudo-bulked counts matrix by both cell type and original sample.
# first subset the coldata
# to only have the columns we care about for pseudo-bulking
pb_groups <- colData(rms_sce)[, c("celltype_broad", "sample")]
# create a new SCE object that contains
# the pseudo-bulked counts across the provided groups
pb_sce <- scuttle::aggregateAcrossCells(rms_sce,
id = pb_groups)
# column names aren't automatically added to the pseudo-bulked sce,
# so let's add them in
colnames(pb_sce) <- glue::glue(
"{pb_sce$celltype_broad}_{pb_sce$sample}"
)
pb_sce
class: SingleCellExperiment
dim: 60319 37
metadata(141): salmon_version reference_index ... miQC_model
combined_hvg
assays(1): counts
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(21): gene_symbol SCPCL000478-mean ... SCPCL000498-mean
SCPCL000498-detected
colnames(37): Fibroblast_SCPCL000488 Lymphocyte_SCPCL000484 ...
Vascular Endothelium_SCPCL000488 Vascular Endothelium_SCPCL000491
colData names(27): sum detected ... sample ncells
reducedDimNames(4): PCA UMAP fastmnn_PCA fastmnn_UMAP
mainExpName: NULL
altExpNames(0):
How does the new pseudo-bulked SingleCellExperiment
look
different? How many columns does it have?
Let’s take a look at what the colData
looks like in the
pseudo-bulked SCE object.
# note the new column with number of cells per group
head(colData(pb_sce)) |>
as.data.frame()
You should see that columns such as sum
,
detected
, subsets_mito_sum
, and other columns
that typically contain per cell QC statistics now contain
NA
rather than numeric values. This is because these values
were initially calculated on a per cell level (we did this using
scuttle::addPerCellQCMetrics()
), but we no longer have a
single column per cell. Instead, each column now represents a
group of cells, in this case comprised of cells of a given cell
type and sample combination. Therefore, the values that we calculated on
a per-cell level are no longer applicable to this pseudo-bulked SCE
object.
You should also see a new column that wasn’t present previously, the
ncells
column. This column was added during pseudo-bulking
and indicates the total number of cells that were summed together to
form each column of the SCE object.
Before we proceed we will want to filter out any columns that have a low number of cells. A low number of cells will usually result in small counts that can cause issues with the statistical approximations made during differential expression analysis. This is equivalent to filtering out any libraries in bulk RNA-seq analysis that have low library sizes.
We can set a threshold for the number of cells required to continue with our analysis and remove any groups that do not meet the minimum threshold. Here we will use 10, but the threshold you use for your dataset can vary depending on the composition of cell types.
# remove any groups with fewer than 10 cells
filter_pb_sce <- pb_sce[, pb_sce$ncells >= 10]
We can then take a look and see how many cell type-sample columns we removed, if any.
# print out dimensions of unfiltered pseudobulk sce
dim(pb_sce)
[1] 60319 37
# dimensions of filtered pseudobulk sce
dim(filter_pb_sce)
[1] 60319 36
It looks like we only got rid of one group. We can do a quick check to see which group was removed by finding which column is no longer present in the filtered object.
# find removed columns
removed_cols <- !(colnames(pb_sce) %in% colnames(filter_pb_sce))
# print out missing columns
colnames(pb_sce)[removed_cols]
[1] "Lymphocyte_SCPCL000484"
Lymphocyte_SCPCL000484
The last step we want to do to prepare our dataset for DE is to
subset the pseudo-bulked SCE object to contain only the cell type that
we are interested in comparing across the two RMS subtypes. As mentioned
previously, we are specifically interested in the
Tumor_Myoblast
cell type.
# logical vector indicating if cells are tumor myoblast or not
myoblast_cells <- filter_pb_sce$celltype_broad == "Tumor_Myoblast"
# create a new sce with only the tumor myoblasts
tumor_myoblast_sce <- filter_pb_sce[, myoblast_cells]
After filtering for our cell type of interest we should have a
dataset with 6 columns, 1 for each group of Tumor_Myoblast
cells in each of our 6 samples.
DESeq2
Now we will use the DESeq2
package to perform
differential expression (DE) analysis on our pseudo-bulked SCE object.
From this point, we can proceed in the same way we would if we had a
bulk RNA-seq dataset with 6 samples. We will start with the unnormalized
raw counts in the counts
assay of the pseudo-bulked SCE and
do the following with DESeq2
:
DESeqDataSet
objectYou can also refer to our materials
from our previous workshops covering bulk RNA-seq for more
information on using DESeq
.
DESeqDataSet
objectTo create the DESeqDataSet
object we will need the
unnormalized counts matrix, the metadata associated with the samples,
and a design formula. The first two items are already stored in our SCE
object, so we can create a DESeqDataSet
object directly
from that object using the DESeqDataSet()
function. The
design formula is used to indicate which columns of the metadata need to
be considered in the DE comparison. For our experiment we are comparing
gene expression between different RMS subtypes. The subtype information
is stored in the diagnosis_group
column of the
colData
in the pseudo-bulked SCE.
# set up the deseq object, group by diagnosis
deseq_object <- DESeq2::DESeqDataSet(tumor_myoblast_sce,
design = ~ diagnosis_group)
converting counts to integer mode
The pseudo-bulked SCE object contains only one assay: the
counts
assay. This is because DESeq2
expects
raw counts. When we run DESeq2
on our dataset, raw counts
will first be normalized using size factors to account for differences
in total sample counts. Therefore we don’t have to do any normalization
on our own – we’ll let DESeq2
do all the work for us.
However, before we dive into DE analysis, we can do some initial exploration and visualization of our data to see if our samples separate by our known factor of interest, RMS subtype. In particular, we can use principal component analysis (PCA) of our pseudo-bulked dataset to visualize any variation between samples. If there is variation between RMS subtypes, we expect their respective samples to separate in PC space, likely indicating presence of differentially expressed genes. We can evaluate this by plotting PC1 and PC2.
In order to create our PCA plot, we will first need to normalize our
data to account for any technical variations across samples. As a
reminder, this is NOT required for running DESeq2
analysis;
we are just using it to visualize our data prior to DE analysis.
# estimate size factors first
deseq_object <- DESeq2::estimateSizeFactors(deseq_object)
# normalize and log transform to use for visualization
normalized_object <- DESeq2::rlog(deseq_object,
blind = TRUE)
normalized_object
class: DESeqTransform
dim: 60319 6
metadata(142): salmon_version reference_index ... combined_hvg version
assays(1): ''
rownames(60319): ENSG00000000003 ENSG00000000005 ... ENSG00000288724
ENSG00000288725
rowData names(28): gene_symbol SCPCL000478-mean ... dispFit
rlogIntercept
colnames(6): Tumor_Myoblast_SCPCL000479 Tumor_Myoblast_SCPCL000480 ...
Tumor_Myoblast_SCPCL000488 Tumor_Myoblast_SCPCL000491
colData names(27): sum detected ... sample ncells
We now have a normalized and transformed object that can be directly
input to the DESeq2::plotPCA()
function, which will both
calculate and plot the PC results.
DESeq2::plotPCA(normalized_object, intgroup = "diagnosis_group")
using ntop=500 top features by variance
As expected we see that samples group together based on RMS subtype and are separated along the PC1 axis, the PC contributing the highest amount of variation.
DESeq
We’ll now use the convenience function DESeq()
to
perform our differential expression analysis. This function calculates
normalization factors, estimates gene-wise dispersions, fits a negative
binomial model and performs hypothesis testing using Wald
statistics.
# run DESeq
deseq_object <- DESeq2::DESeq(deseq_object)
using pre-existing size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
We can evaluate how well the model fit our data by looking at the dispersion estimates. We expect to see the dispersion estimates decrease as means are increasing and follow the line of best fit.
plotDispEsts(deseq_object)
Now we can extract the results from the object, specifying the p-value threshold that we would like to use.
# extract the results as a DataFrame
deseq_results <- DESeq2::results(deseq_object, alpha = 0.05)
But we aren’t done yet!
The estimates of log2 fold change calculated by DESeq()
are not corrected for expression level. This means that when counts are
small, we are likely to end up with some large fold change values that
overestimate the true extent of the change between conditions.
We can correct this by applying a “shrinkage” procedure, which will adjust large values with small counts downward, while preserving values with larger counts, which are likely to be more accurate.
To do this, we will use the lfcShrink()
function, but
first we need to know the name and/or position of the “coefficient” that
was calculated by DESeq()
, which we can do with the
resultsNames()
function.
# identify position of coefficient
DESeq2::resultsNames(deseq_object)
[1] "Intercept" "diagnosis_group_ERMS_vs_ARMS"
Intercept
diagnosis_group_ERMS_vs_ARMS
# appyly logFC shrinkage using the default model
shrink_results <- DESeq2::lfcShrink(
deseq_object,
res = deseq_results,
coef = 2,
type = "apeglm"
)
using 'apeglm' for LFC shrinkage. If used in published research, please cite:
Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
sequence count data: removing the noise and preserving large differences.
Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
head(shrink_results)
log2 fold change (MAP): diagnosis group ERMS vs ARMS
Wald test p-value: diagnosis group ERMS vs ARMS
DataFrame with 6 rows and 5 columns
baseMean log2FoldChange lfcSE pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000000003 101.05115 0.4108102 0.508644 0.314263 0.653799
ENSG00000000005 15.06975 0.7449261 0.992470 0.106750 0.368312
ENSG00000000419 445.70945 -0.1982544 0.443807 0.594405 0.850099
ENSG00000000457 287.05639 -0.4654800 0.530626 0.262554 0.600069
ENSG00000000460 553.73970 -0.0980086 0.428254 0.785786 0.932283
ENSG00000000938 5.64924 0.1550990 0.655715 0.705489 0.900066
If you look at our shrink_results
object, we see that
the genes are labeled with the Ensembl gene identifiers, as those were
the row names of the pseudo-bulked SCE we used as input to build our
DESeq2
object. Although some of us may have all of the
identifiers memorized by heart, it can be useful to have a human
readable symbol in our results. Before we save the results as a file, we
will grab the gene symbols from the rowData
of our original
SCE object and add them as a new column.
deseq_results <- shrink_results |>
# directly add Ensembl id as a column
# converting results into a data frame
tibble::as_tibble(rownames = "ensembl_id")
# convert rowdata to data frame
sce_rowdata_df <- rowData(tumor_myoblast_sce) |>
# create a column with rownames stored as ensembl id
# use for joining with deseq results
tibble::as_tibble(rownames = "ensembl_id")
# combine deseq results with rowdata by ensembl id
deseq_results <- deseq_results |>
dplyr::left_join(sce_rowdata_df, by = "ensembl_id")
head(deseq_results)
We can save the new data frame that we have created with the Ensembl
identifiers, gene symbols, and the DESeq2
results as a tab
separated (tsv
) file.
# save our results as tsv
readr::write_tsv(deseq_results, deseq_output_file)
The last thing that we will do is take a look at how many genes are
significant. Here we will want to use the adjusted p-value, found in the
padj
column of the results, as this accounts for multiple
test correction.
# first look at the significant results
deseq_results_sig <- deseq_results |>
# filter based on adjusted pvalue
dplyr::filter(padj <= 0.05)
head(deseq_results_sig)
Now that we have identified a set of genes that are differentially
expressed in the tumor myoblasts between ARMS and ERMS subtypes, lets
actually take a look at them and see if we can make some informative
plots. The first plot we’ll make is a volcano plot using the EnhancedVolcano
package. This package automatically colors the points by cutoffs for
both significance and fold change and labels many of the significant
genes (subject to spacing). EnhancedVolcano
has many, many
options, which is a good thing if you don’t like all of its default
settings. Even better, it outputs a ggplot2
object, so if
we want to customize the plot further, we can use the same
ggplot2
commands we have used before.
EnhancedVolcano::EnhancedVolcano(deseq_results,
x = 'log2FoldChange', # fold change statistic to plot
y = 'pvalue', # significance values
lab = deseq_results$gene_symbol, # labels for points
pCutoff = 1e-05, # p value cutoff (default)
FCcutoff = 1, # fold change cutoff (default)
title = NULL, # no title
subtitle = NULL, # or subtitle
caption = NULL, # or caption
drawConnectors = TRUE, # add some fun arrows
labSize = 3 # smaller labels
) +
# change the overall theme
theme_bw() +
# move the legend to the bottom
theme(legend.position = "bottom")
Warning: ggrepel: 543 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
We can also return back to the SCE object that we used to create our
pseudo-bulked SCE and look at gene expression of some of the significant
genes. We can create UMAP plots as we did previously, but instead of
labeling each cell with metadata, we can color cells by a specified
gene’s expression levels. We will also use some of the
ggplot2
skills we picked up earlier, like
facet_grid()
to plot cells from different RMS subtypes
separately. This can help us validate the DESeq2
results so
that we can visualize gene expression changes across our cell type of
interest on a single-cell level.
# filter to just myoblast cells and remove any NA's before plotting
myoblast_combined_sce <- rms_sce[, which(rms_sce$celltype_broad == "Tumor_Myoblast")]
# plot PTPRT (ENSG00000196090) expression in ARMS vs. ERMS
scater::plotReducedDim(myoblast_combined_sce,
dimred = "fastmnn_UMAP",
color_by = "ENSG00000196090", #PTPRT
point_size= 0.5,
point_alpha = 0.4,
other_fields = "diagnosis_group") +
facet_grid(cols = vars(diagnosis_group))
In the above plot we only plotted the tumor myoblast cells that we used in our DE analysis. However, we might be interested to see the expression of genes that are differentially expressed in other cell types present in our samples.
# let's compare gene expression across some other cell types
# look at all tumor cells and pick one normal cell type
celltypes <- c("Tumor_Myoblast",
"Tumor_Mesoderm",
"Tumor_Myocyte",
"Vascular Endothelium")
# subset to just celltypes that we are interested in
tumor_sce <- rms_sce[, which(rms_sce$celltype_broad %in% celltypes)]
Next we will look at a few DE genes that we identified, one up
regulated gene and one down regulated gene, and compare their expression
in myoblasts to other cell types in ARMS and ERMS samples. We will use
the scater::plotExpression()
function to create a violin
plot with RMS subtype on the x-axis and gene expression on the y-axis.
We can continue using facet_grid()
to show separate panels
for each cell type. Because we want to show multiple genes here, we are
going to add an additional option to facet_grid()
to
include multiple rows in our plot grid, one for each gene of interest.
One neat trick of the scater::plotExpression()
function is
that it actually creates a Feature
column which corresponds
to the features (in this case genes) being used in plotting. We can then
directly reference that Feature
column when plotting,
instead of using the other_fields
option we used
previously.
# pick a couple genes to look at
genes_to_plot <- c("ENSG00000196090", #PTPRT
"ENSG00000148935") #GAS2
# create a violin plot
scater::plotExpression(tumor_sce,
# a vector of genes to plot
features = genes_to_plot,
x = "diagnosis_group",
color_by = "diagnosis_group",
other_fields = "celltype_broad",
point_size = 0.1) +
# each celltype is its own column
facet_grid(cols = vars(celltype_broad),
# each feature (gene) is its own row
rows = vars(Feature)) +
# change the font size of the facet labels
theme(strip.text = element_text(size = 7)) +
guides(color = guide_legend(
title = "Subtype", # update the legend title
# change the size of the legend colors
override.aes = list(size = 3, alpha = 1))
)
How do the expression of these genes change across cell types and RMS subtypes?
Go ahead and explore some genes on your own! Feel free to plot any of the genes that are identified as significant, found in the DE results table, or your favorite gene. Remember, you need to use the Ensembl gene identifier to refer to each gene.
# now do some exploration of other genes on your own!
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] BiocSingular_1.20.0 DESeq2_1.44.0
[3] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
[5] Biobase_2.64.0 GenomicRanges_1.56.0
[7] GenomeInfoDb_1.40.0 IRanges_2.38.0
[9] S4Vectors_0.42.0 BiocGenerics_0.50.0
[11] MatrixGenerics_1.16.0 matrixStats_1.3.0
[13] ggplot2_3.5.1 optparse_1.7.5
loaded via a namespace (and not attached):
[1] gridExtra_2.3 rlang_1.1.3
[3] magrittr_2.0.3 scater_1.32.0
[5] compiler_4.4.1 flexmix_2.3-19
[7] DelayedMatrixStats_1.26.0 vctrs_0.6.5
[9] stringr_1.5.1 pkgconfig_2.0.3
[11] crayon_1.5.2 fastmap_1.1.1
[13] XVector_0.44.0 scuttle_1.14.0
[15] labeling_0.4.3 utf8_1.2.4
[17] rmarkdown_2.26 tzdb_0.4.0
[19] UCSC.utils_1.0.0 ggbeeswarm_0.7.2
[21] bit_4.0.5 xfun_0.43
[23] modeltools_0.2-23 zlibbioc_1.50.0
[25] cachem_1.0.8 beachmat_2.20.0
[27] jsonlite_1.8.8 highr_0.10
[29] DelayedArray_0.30.0 BiocParallel_1.38.0
[31] irlba_2.3.5.1 parallel_4.4.1
[33] R6_2.5.1 RColorBrewer_1.1-3
[35] bslib_0.7.0 stringi_1.8.3
[37] numDeriv_2016.8-1.1 jquerylib_0.1.4
[39] Rcpp_1.0.12 knitr_1.46
[41] readr_2.1.5 Matrix_1.7-0
[43] splines_4.4.1 nnet_7.3-19
[45] tidyselect_1.2.1 abind_1.4-5
[47] yaml_2.3.8 viridis_0.6.5
[49] codetools_0.2-20 plyr_1.8.9
[51] lattice_0.22-6 tibble_3.2.1
[53] withr_3.0.0 coda_0.19-4.1
[55] evaluate_0.23 getopt_1.20.4
[57] pillar_1.9.0 generics_0.1.3
[59] vroom_1.6.5 emdbook_1.3.13
[61] hms_1.1.3 sparseMatrixStats_1.16.0
[63] munsell_0.5.1 scales_1.3.0
[65] miQC_1.12.0 glue_1.7.0
[67] apeglm_1.26.0 tools_4.4.1
[69] BiocNeighbors_1.22.0 ScaledMatrix_1.12.0
[71] locfit_1.5-9.9 forcats_1.0.0
[73] mvtnorm_1.2-4 fs_1.6.4
[75] cowplot_1.1.3 grid_4.4.1
[77] bbmle_1.0.25.1 bdsmatrix_1.3-7
[79] colorspace_2.1-0 GenomeInfoDbData_1.2.12
[81] beeswarm_0.4.0 vipor_0.4.7
[83] cli_3.6.2 rsvd_1.0.5
[85] fansi_1.0.6 S4Arrays_1.4.0
[87] viridisLite_0.4.2 dplyr_1.1.4
[89] gtable_0.3.5 EnhancedVolcano_1.22.0
[91] sass_0.4.9 digest_0.6.35
[93] SparseArray_1.4.0 ggrepel_0.9.5
[95] farver_2.1.1 htmltools_0.5.8.1
[97] lifecycle_1.0.4 httr_1.4.7
[99] MASS_7.3-60.2 bit64_4.0.5