This notebook will demonstrate how to:
DESeq2
EnhancedVolcano
packageIn this notebook, we’ll perform an analysis to identify the genes that are differentially expressed in MYCN amplified vs. nonamplified neuroblastoma cell lines.
These RNA-seq data are from Harenza, et al. (2017).
More information about DESeq2 can be found in the excellent vignette from Love, Anders, and Huber from which this is adapted (see also: Love, et al. (2014)).
DESeq2 takes unnormalized counts or estimated counts and does the following:
# Load the DESeq2 library
library(DESeq2)
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
tapply, union, unique, unsplit, which.max, which.min
Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':
findMatches
The following objects are masked from 'package:base':
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats
Attaching package: 'MatrixGenerics'
The following objects are masked from 'package:matrixStats':
colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
colWeightedMeans, colWeightedMedians, colWeightedSds,
colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
rowWeightedSds, rowWeightedVars
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':
rowMedians
The following objects are masked from 'package:matrixStats':
anyMissing, rowMedians
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
# We will be making fancy volcano plots
library(EnhancedVolcano)
Loading required package: ggplot2
Loading required package: ggrepel
Input
# directory with the tximeta processed data
txi_dir <- file.path("data", "NB-cell", "txi")
txi_file <- file.path(txi_dir, "NB-cell_tximeta.rds")
Output
We’ll create a results directory to hold our results.
# Create a results directory if it doesn't already exist
results_dir <- file.path("results", "NB-cell")
fs::dir_create(results_dir)
We will also need a directory to store our plots.
# Create a plots directory if it doesn't already exist
plots_dir <- file.path("plots", "NB-cell")
fs::dir_create(plots_dir)
# RDS for the output of DESeq analysis
deseq_file <- file.path(results_dir,
"NB-cell_DESeq_amplified_v_nonamplified.rds")
# DESeq2 results table
deseq_df_file <- file.path(results_dir,
"NB-cell_DESeq_amplified_v_nonamplified_results.tsv")
# PNG of the volcano plot
volcano_file <- file.path(plots_dir, "NB-cell_volcano.png")
First, let’s read in the data we processed with
tximeta
.
# Read in the RDS file we created in the last notebook
gene_summarized <- readr::read_rds(txi_file)
We’re most interested in MYCN amplification, which we had
stored in the status
column of the sample metadata of
gene_summarized
. While the sample metadata is stored
internally in the colData
slot, the
SummarizedExperiment
object makes it easy for us to access
it as if it were just a column of a data frame, using the familiar
$
syntax.
gene_summarized$status
[1] "Amplified" "Amplified" "Amplified" "Amplified" "Amplified"
[6] "Amplified" "Amplified" "Amplified" "Nonamplified" "Nonamplified"
[11] "Amplified" "Amplified" "Amplified" "Nonamplified" "Amplified"
[16] "Amplified" "Amplified" "Amplified" "Nonamplified" "Amplified"
[21] "Amplified" "Amplified" "Nonamplified" "Amplified" "Nonamplified"
[26] "Amplified" "Amplified" "Amplified" "Nonamplified" "Nonamplified"
[31] "Nonamplified" "Amplified" "Amplified" "Amplified" "Nonamplified"
[36] "Nonamplified" "Amplified" "Amplified" "Nonamplified"
Amplified
Amplified
Amplified
Amplified
Amplified
Amplified
Amplified
Amplified
Nonamplified
Nonamplified
Amplified
Amplified
Amplified
Nonamplified
Amplified
Amplified
Amplified
Amplified
Nonamplified
Amplified
Amplified
Amplified
Nonamplified
Amplified
Nonamplified
Amplified
Amplified
Amplified
Nonamplified
Nonamplified
Nonamplified
Amplified
Amplified
Amplified
Nonamplified
Nonamplified
Amplified
Amplified
Nonamplified
This is stored as a character
type, but to give a bit
more information to DESeq
, we will convert this to a
factor
.
gene_summarized$status <- as.factor(gene_summarized$status)
We’ll want to use the “Nonamplified” samples as our
reference. Let’s look at the levels
of
status
.
levels(gene_summarized$status)
[1] "Amplified" "Nonamplified"
Amplified
Nonamplified
We can see that these are in alphabetical order, so “Amplified”
samples would be the reference. We can use the relevel()
function to remedy this.
gene_summarized$status <- relevel(gene_summarized$status, ref = "Nonamplified")
# Check what the levels are now
levels(gene_summarized$status)
[1] "Nonamplified" "Amplified"
Nonamplified
Amplified
# Create a DESeq2 dataset from `gene_summarized`
# remember that `status` is the variable of interest here
ddset <- DESeqDataSet(gene_summarized,
design = ~status)
using counts and average transcript lengths from tximeta
Genes that have very low counts are not likely to yield reliable differential expression results, so we will do some light pre-filtering. We will keep only genes with total counts of at least 10 across all samples.
# create a vector of TRUE and FALSE values where
# TRUE corresponds to genes with counts of at least 10
genes_to_keep <- rowSums(counts(ddset)) >= 10
# use which() to prevent any NAs sneaking through
ddset <- ddset[which(genes_to_keep), ]
DESeq()
functionWe’ll now use the wrapper function DESeq()
to perform
our differential expression analysis. As mentioned earlier, this
performs a number of steps, including an outlier
removal procedure. For this particular dataset, there is a pretty
large number of outliers, which can be a bit of a red flag, but we will
proceed for now.
deseq_object <- DESeq(ddset)
estimating size factors
using 'avgTxLength' from assays(dds), correcting for library size
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 2895 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
Let’s save this to our results file.
# Save the results as an RDS
readr::write_rds(deseq_object, file = deseq_file)
Now we will have a look at the results table.
deseq_results <- results(deseq_object)
deseq_results
log2 fold change (MLE): status Amplified vs Nonamplified
Wald test p-value: status Amplified vs Nonamplified
DataFrame with 24912 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000000003 1148.278399 0.921536 0.424309 2.171849 0.029867
ENSG00000000005 0.627406 1.672285 2.247996 0.743900 0.456937
ENSG00000000419 1680.109464 -0.176649 0.215485 -0.819775 0.412344
ENSG00000000457 962.907631 -0.257752 0.166387 -1.549110 0.121355
ENSG00000000460 1595.937423 -0.133821 0.197230 -0.678504 0.497452
... ... ... ... ... ...
ENSG00000285976 1874.02776 0.0285397 0.183730 0.155335 0.876557
ENSG00000285978 1.40743 -1.1452465 0.874165 -1.310103 0.190161
ENSG00000285982 90.93868 0.1131803 0.493040 0.229556 0.818437
ENSG00000285990 13.77859 0.3673226 0.456293 0.805015 0.420811
ENSG00000285991 17.07491 0.0709553 0.333191 0.212957 0.831361
padj
<numeric>
ENSG00000000003 0.133479
ENSG00000000005 NA
ENSG00000000419 0.656626
ENSG00000000457 0.326981
ENSG00000000460 0.721065
... ...
ENSG00000285976 0.946696
ENSG00000285978 0.427379
ENSG00000285982 0.918545
ENSG00000285990 0.662821
ENSG00000285991 0.926078
How many genes were differentially expressed (FDR < 0.05)?
summary(deseq_results, alpha = 0.05)
out of 24799 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 1071, 4.3%
LFC < 0 (down) : 1798, 7.3%
outliers [1] : 0, 0%
low counts [2] : 3478, 14%
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
The estimates of log2 fold change calculated by DESeq()
are not corrected for expression level. This means that when counts are
small, we are likely to end up with some large fold change values that
overestimate the true extent of the change between conditions.
We can correct this by applying a “shrinkage” procedure, which will adjust large values with small counts downward, while preserving values with larger counts, which are likely to be more accurate.
To do this, we will use the lfcShrink()
function, but
first we need to know the name and/or position of the “coefficient” that
was calculated by DESeq()
, which we can do with the
resultsNames()
function
# get the deseq coefficient names:
resultsNames(deseq_object)
[1] "Intercept" "status_Amplified_vs_Nonamplified"
Intercept
status_Amplified_vs_Nonamplified
We are interested in the status
coefficient, which is in
position 2.
There are a few options for the shrinkage estimation. The default is
apeglm
(Zhu et al.
2018), but we have found that this can be sensitive to extreme
outliers, which are definitely a factor in this data set. So for this
data set we will be using ashr
(Stephens
2017)
# calculate shrunken log2 fold change estimates
deseq_shrunken <- lfcShrink(deseq_object,
coef = 2, # the coefficient we want to reestimate
type = "ashr" # We will use `ashr` for estimation
)
using 'ashr' for LFC shrinkage. If used in published research, please cite:
Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
https://doi.org/10.1093/biostatistics/kxw041
Let’s compare the log2 fold change estimates from the two results tables by creating a plot.
First we will combine the results into a new data frame.
comparison_df <- data.frame(
lfc_original = deseq_results$log2FoldChange,
lfc_shrunken = deseq_shrunken$log2FoldChange,
logmean = log10(deseq_results$baseMean)
)
Now we can plot the original and shrunken log2 fold change values to see what happened after shrinkage.
ggplot(comparison_df,
aes(
x = lfc_original,
y = lfc_shrunken,
color = logmean
)
) +
geom_point(alpha = 0.1) +
theme_bw() +
scale_color_viridis_c() +
coord_cartesian(xlim = c(-10, 10), ylim = c(-10, 10)) # zoom in on the middle
We will now do a bit of manipulation to store the results in a data frame and add the gene symbols.
# this is of class DESeqResults -- we want a data frame
deseq_df <- deseq_shrunken |>
# convert to a data frame
as.data.frame() |>
# the gene ids were stored as row names -- let's them a column for easy display
tibble::rownames_to_column(var = "gene_id") |>
# add on the gene symbols from the original deseq object
dplyr::mutate(gene_symbol = rowData(deseq_object)$gene_name)
Let’s print out the results table, sorted by log2 fold change. The highest values should be genes more expressed in the MYCN amplified cell lines.
# Print the table sorted by log2FoldChange
deseq_df |>
dplyr::arrange(dplyr::desc(log2FoldChange))
Now let’s write the full results table to a file.
readr::write_tsv(deseq_df, file = deseq_df_file)
With these shrunken effect sizes, we will draw a volcano plot, using
the EnhancedVolcano
package to make it a bit easier. This package automatically color
codes the points by cutoffs for both significance and fold change and
labels many of the significant genes (subject to spacing).
EnhancedVolcano
has many, many options, which is a good
thing if you don’t like all of it’s default settings. Even better, it
outputs a ggplot2
object, so if we want to customize it
further, we can do that with the same ggplot2
commands we
have used before.
EnhancedVolcano(deseq_df,
x = "log2FoldChange", # fold change statistic to plot
y = "pvalue", # significance values
lab = deseq_df$gene_symbol, # labels for points
pCutoff = 1e-05, # The p value cutoff we will use (default)
FCcutoff = 1, # The fold change cutoff (default)
title = NULL, # no title
subtitle = NULL, # or subtitle
caption = NULL, # or caption
labSize = 3 # smaller labels
) +
# change the overall theme
theme_classic() +
# move the legend to the bottom
theme(legend.position = "bottom")
We will save this plot to a file as well:
ggsave(volcano_file, plot = last_plot())
Saving 7 x 5 in image
Record session info for reproducibility & provenance purposes.
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] EnhancedVolcano_1.22.0 ggrepel_0.9.5
[3] ggplot2_3.5.1 DESeq2_1.44.0
[5] SummarizedExperiment_1.34.0 Biobase_2.64.0
[7] MatrixGenerics_1.16.0 matrixStats_1.3.0
[9] GenomicRanges_1.56.0 GenomeInfoDb_1.40.0
[11] IRanges_2.38.0 S4Vectors_0.42.0
[13] BiocGenerics_0.50.0 optparse_1.7.5
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 viridisLite_0.4.2 dplyr_1.1.4
[4] farver_2.1.1 fastmap_1.1.1 digest_0.6.35
[7] lifecycle_1.0.4 invgamma_1.1 magrittr_2.0.3
[10] compiler_4.4.1 rlang_1.1.3 sass_0.4.9
[13] tools_4.4.1 utf8_1.2.4 yaml_2.3.8
[16] knitr_1.46 S4Arrays_1.4.0 labeling_0.4.3
[19] bit_4.0.5 DelayedArray_0.30.0 abind_1.4-5
[22] BiocParallel_1.38.0 withr_3.0.0 grid_4.4.1
[25] fansi_1.0.6 colorspace_2.1-0 scales_1.3.0
[28] cli_3.6.2 rmarkdown_2.26 crayon_1.5.2
[31] ragg_1.3.0 generics_0.1.3 httr_1.4.7
[34] tzdb_0.4.0 getopt_1.20.4 cachem_1.0.8
[37] stringr_1.5.1 zlibbioc_1.50.0 parallel_4.4.1
[40] XVector_0.44.0 vctrs_0.6.5 Matrix_1.7-0
[43] jsonlite_1.8.8 hms_1.1.3 mixsqp_0.3-54
[46] bit64_4.0.5 irlba_2.3.5.1 systemfonts_1.0.6
[49] locfit_1.5-9.9 jquerylib_0.1.4 glue_1.7.0
[52] codetools_0.2-20 stringi_1.8.3 gtable_0.3.5
[55] UCSC.utils_1.0.0 munsell_0.5.1 tibble_3.2.1
[58] pillar_1.9.0 htmltools_0.5.8.1 GenomeInfoDbData_1.2.12
[61] truncnorm_1.0-9 R6_2.5.1 textshaping_0.3.7
[64] vroom_1.6.5 evaluate_0.23 lattice_0.22-6
[67] readr_2.1.5 highr_0.10 SQUAREM_2021.1
[70] ashr_2.2-63 bslib_0.7.0 Rcpp_1.0.12
[73] SparseArray_1.4.0 xfun_0.43 fs_1.6.4
[76] pkgconfig_2.0.3