This notebook will demonstrate how to:
DESeq2
data set from a
SummarizedExperiment
In this notebook, we’ll import the gastric cancer data and do some
exploratory analyses and visual inspection. We’ll use the DESeq2
package for this.
DESeq2
also has an excellent
vignette from Love, Anders, and Huber from which this is adapted
(see also: Love,
Anders, and Huber. Genome Biology. 2014.).
# Load the DESeq2 library
library(DESeq2)
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
tapply, union, unique, unsplit, which.max, which.min
Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':
findMatches
The following objects are masked from 'package:base':
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats
Attaching package: 'MatrixGenerics'
The following objects are masked from 'package:matrixStats':
colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
colWeightedMeans, colWeightedMedians, colWeightedSds,
colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
rowWeightedSds, rowWeightedVars
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':
rowMedians
The following objects are masked from 'package:matrixStats':
anyMissing, rowMedians
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
# Main data directory
data_dir <- file.path("data", "gastric-cancer")
# directory with the tximeta processed data
txi_dir <- file.path(data_dir, "txi")
txi_file <- file.path(txi_dir, "gastric-cancer_tximeta.rds")
We’ll create a directory to hold our plots.
# Create a plots directory if it does not exist yet
plots_dir <- file.path("plots", "gastric-cancer")
fs::dir_create(plots_dir)
Output
# We will save a PDF copy of the PCA plot to the plots directory
# and name the file "gastric-cancer_PC_scatter.pdf"
pca_plot_file <- file.path(plots_dir, "gastric-cancer_PC_scatter.pdf")
First, let’s read in the data we processed with
tximeta
.
# Read in the RDS file we created in the last notebook
gene_summarized <- readr::read_rds(txi_file)
We use the tissue of origin in the design formula because that will allow us to model this variable of interest.
ddset <- DESeqDataSet(gene_summarized, design = ~tissue)
using counts and average transcript lengths from tximeta
Warning in DESeqDataSet(gene_summarized, design = ~tissue): some variables in
design formula are characters, converting to factors
Raw count data is not usually suitable for the algorithms we use for dimensionality reduction, clustering, or heatmaps. To improve this, we will transform the count data to create an expression measure that is better suited for these analyses. The core transformation will map the expression to a log2 scale, while accounting for some of the expected variation among samples and genes.
Since different samples are usually sequenced to different depths, we want to transform our RNA-seq count data to make different samples more directly comparable. We also want to deal with the fact that genes with low counts are also likely to have higher variance (on the log2 scale), as that could bias our clustering. To handle both of these considerations, we can calculate a Variance Stabilizing Transformation of the count data, and work with that transformed data for our analysis.
See this
section of the DESeq2
vignette for more on this
topic.
vst_data <- vst(ddset)
using 'avgTxLength' from assays(dds), correcting for library size
Principal component analysis (PCA) is a dimensionality reduction technique that allows us to identify the largest components of variation in a complex dataset. Our expression data can be thought of as mapping each sample in a multidimensional space defined by the expression level of each gene. The expression of many of those genes are correlated, so we can often get a better, simpler picture of the data by combining the information from those correlated genes.
PCA rotates and transforms this space so that each axis is now a combination of multiple correlated genes, ordered so the first axes capture the most variation from the data. These new axes are the “principal components.” If we look at the first few components, we can often get a nice overview of relationships among the samples in the data.
The plotPCA()
function we will use from the
DESeq2
package calculates and plots the first two principal
components (PC1 and PC2). Visualizing PC1 and PC2 can give us insight
into how different variables (e.g., tissue source) affect our dataset
and help us spot any technical effects (more on that below).
# DESeq2 built in function is called plotPCA and we want to
# color points by tissue
plotPCA(vst_data, intgroup = "tissue")
using ntop=500 top features by variance
Save the most recent plot to file with ggsave
from
ggplot2
# Save plot as a PDF file
# Note that `last_plot()` is the default, but we are making it explicit here
ggplot2::ggsave(pca_plot_file, plot = ggplot2::last_plot())
Saving 7 x 5 in image
We don’t have batch information (i.e., when the samples were run) for
this particular experiment, but let’s imagine that
SRR585574
and SRR585576
were run separately
from all other samples. We’ll add this as a new “toy” column in the
sample data (colData
).
# Extract colData
sample_info <- colData(vst_data)
# Print out preview
sample_info
DataFrame with 8 rows and 3 columns
names tissue title
<character> <factor> <character>
SRR585570 SRR585570 gastric_normal Gastric normal (CGC-..
SRR585571 SRR585571 gastric_normal Gastric normal (CGC-..
SRR585572 SRR585572 primary_gastric_tumor Primary gastric tumo..
SRR585573 SRR585573 primary_gastric_tumor Primary gastric tumo..
SRR585574 SRR585574 primary_gastric_tumor Primary gastric tumo..
SRR585575 SRR585575 gastric_cancer_cell_line SNU484
SRR585576 SRR585576 gastric_cancer_cell_line SNU601
SRR585577 SRR585577 gastric_cancer_cell_line SNU668
Now we can add a new column with toy batch information and re-store
the colData()
.
# Add batch information
sample_info$batch <- c("batch1", "batch1", "batch1", "batch1",
"batch2", "batch1", "batch2", "batch1")
If this batch information were real we would have included it with
the sample metadata when we made the original
SummarizedExperiment
object with tximeta
. We
would then include it in the model stored in our DESeq2 object using the
design
argument (design = ~ tissue + batch
)
and we would re-run the DESeqDataSet()
and
vst()
steps we did above. Here we will take a bit of a
shortcut and add it directly to the colData()
for our
vst()
-transformed data.
# Add updated colData() with batch info to vst_data
colData(vst_data) <- sample_info
# PCA plot - tissue *and* batch
# We want plotPCA to return the data so we can have more control about the plot
pca_data <- plotPCA(
vst_data,
intgroup = c("tissue", "batch"),
returnData = TRUE
)
using ntop=500 top features by variance
# Here we are setting up the percent variance that we are extracting from the `pca_data` object
percent_var <- round(100 * attr(pca_data, "percentVar"))
Let’s use ggplot to visualize the first two principal components.
# Color points by "batch" and use shape to indicate the tissue of origin
ggplot2::ggplot(pca_data,
ggplot2::aes(
x = PC1,
y = PC2,
color = batch,
shape = tissue
)
) +
ggplot2::geom_point(size = 3) +
ggplot2::xlab(paste0("PC1: ", percent_var[1], "% variance")) +
ggplot2::ylab(paste0("PC2: ", percent_var[2], "% variance")) +
ggplot2::coord_fixed()
Record session info for reproducibility & provenance purposes.
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] DESeq2_1.44.0 SummarizedExperiment_1.34.0
[3] Biobase_2.64.0 MatrixGenerics_1.16.0
[5] matrixStats_1.3.0 GenomicRanges_1.56.0
[7] GenomeInfoDb_1.40.0 IRanges_2.38.0
[9] S4Vectors_0.42.0 BiocGenerics_0.50.0
[11] optparse_1.7.5
loaded via a namespace (and not attached):
[1] gtable_0.3.5 xfun_0.43 bslib_0.7.0
[4] ggplot2_3.5.1 lattice_0.22-6 tzdb_0.4.0
[7] vctrs_0.6.5 tools_4.4.1 generics_0.1.3
[10] parallel_4.4.1 getopt_1.20.4 tibble_3.2.1
[13] fansi_1.0.6 highr_0.10 pkgconfig_2.0.3
[16] Matrix_1.7-0 lifecycle_1.0.4 GenomeInfoDbData_1.2.12
[19] farver_2.1.1 compiler_4.4.1 stringr_1.5.1
[22] textshaping_0.3.7 munsell_0.5.1 codetools_0.2-20
[25] htmltools_0.5.8.1 sass_0.4.9 yaml_2.3.8
[28] pillar_1.9.0 crayon_1.5.2 jquerylib_0.1.4
[31] BiocParallel_1.38.0 DelayedArray_0.30.0 cachem_1.0.8
[34] abind_1.4-5 locfit_1.5-9.9 tidyselect_1.2.1
[37] digest_0.6.35 stringi_1.8.3 dplyr_1.1.4
[40] labeling_0.4.3 fastmap_1.1.1 grid_4.4.1
[43] colorspace_2.1-0 cli_3.6.2 SparseArray_1.4.0
[46] magrittr_2.0.3 S4Arrays_1.4.0 utf8_1.2.4
[49] withr_3.0.0 readr_2.1.5 UCSC.utils_1.0.0
[52] scales_1.3.0 rmarkdown_2.26 XVector_0.44.0
[55] httr_1.4.7 ragg_1.3.0 hms_1.1.3
[58] evaluate_0.23 knitr_1.46 rlang_1.1.3
[61] Rcpp_1.0.12 glue_1.7.0 jsonlite_1.8.8
[64] R6_2.5.1 systemfonts_1.0.6 fs_1.6.4
[67] zlibbioc_1.50.0