Objectives

This notebook will demonstrate how to:

  • Perform differential expression analysis with DESeq2
  • Apply a shrinkage algorithm to improve estimates of expression changes
  • Draw a volcano plot with the EnhancedVolcano package

In this notebook, we’ll perform an analysis to identify the genes that are differentially expressed in MYCN amplified vs. nonamplified neuroblastoma cell lines.

These RNA-seq data are from Harenza, et al. (2017).

More information about DESeq2 can be found in the excellent vignette from Love, Anders, and Huber from which this is adapted (see also: Love, et al. (2014)).

DESeq2 takes unnormalized counts or estimated counts and does the following:

Libraries and functions

# Load the DESeq2 library
library(DESeq2)
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics

Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
    tapply, union, unique, unsplit, which.max, which.min

Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':

    findMatches
The following objects are masked from 'package:base':

    expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats

Attaching package: 'MatrixGenerics'
The following objects are masked from 'package:matrixStats':

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds,
    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
    rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
    rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    rowWeightedSds, rowWeightedVars
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':

    rowMedians
The following objects are masked from 'package:matrixStats':

    anyMissing, rowMedians
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
# We will be making fancy volcano plots
library(EnhancedVolcano)
Loading required package: ggplot2
Loading required package: ggrepel

Directories and files

Input

# directory with the tximeta processed data
txi_dir <- file.path("data", "NB-cell", "txi")
txi_file <- file.path(txi_dir, "NB-cell_tximeta.rds")

Output

We’ll create a results directory to hold our results.

# Create a results directory if it doesn't already exist
results_dir <- file.path("results", "NB-cell")
fs::dir_create(results_dir)

We will also need a directory to store our plots.

# Create a plots directory if it doesn't already exist
plots_dir <- file.path("plots", "NB-cell")
fs::dir_create(plots_dir)
# RDS for the output of DESeq analysis
deseq_file <- file.path(results_dir,
                        "NB-cell_DESeq_amplified_v_nonamplified.rds")

# DESeq2 results table
deseq_df_file <- file.path(results_dir,
                           "NB-cell_DESeq_amplified_v_nonamplified_results.tsv")

# PNG of the volcano plot
volcano_file <- file.path(plots_dir, "NB-cell_volcano.png")

DESeq2

Creating a DESeq2 dataset from tximeta object

First, let’s read in the data we processed with tximeta.

Preparation

# Read in the RDS file we created in the last notebook
gene_summarized <- readr::read_rds(txi_file)

We’re most interested in MYCN amplification, which we had stored in the status column of the sample metadata of gene_summarized. While the sample metadata is stored internally in the colData slot, the SummarizedExperiment object makes it easy for us to access it as if it were just a column of a data frame, using the familiar $ syntax.

gene_summarized$status
 [1] "Amplified"    "Amplified"    "Amplified"    "Amplified"    "Amplified"   
 [6] "Amplified"    "Amplified"    "Amplified"    "Nonamplified" "Nonamplified"
[11] "Amplified"    "Amplified"    "Amplified"    "Nonamplified" "Amplified"   
[16] "Amplified"    "Amplified"    "Amplified"    "Nonamplified" "Amplified"   
[21] "Amplified"    "Amplified"    "Nonamplified" "Amplified"    "Nonamplified"
[26] "Amplified"    "Amplified"    "Amplified"    "Nonamplified" "Nonamplified"
[31] "Nonamplified" "Amplified"    "Amplified"    "Amplified"    "Nonamplified"
[36] "Nonamplified" "Amplified"    "Amplified"    "Nonamplified"
Amplified
Amplified
Amplified
Amplified
Amplified
Amplified
Amplified
Amplified
Nonamplified
Nonamplified
Amplified
Amplified
Amplified
Nonamplified
Amplified
Amplified
Amplified
Amplified
Nonamplified
Amplified
Amplified
Amplified
Nonamplified
Amplified
Nonamplified
Amplified
Amplified
Amplified
Nonamplified
Nonamplified
Nonamplified
Amplified
Amplified
Amplified
Nonamplified
Nonamplified
Amplified
Amplified
Nonamplified

This is stored as a character type, but to give a bit more information to DESeq, we will convert this to a factor.

gene_summarized$status <- as.factor(gene_summarized$status)

We’ll want to use the “Nonamplified” samples as our reference. Let’s look at the levels of status.

levels(gene_summarized$status)
[1] "Amplified"    "Nonamplified"
Amplified
Nonamplified

We can see that these are in alphabetical order, so “Amplified” samples would be the reference. We can use the relevel() function to remedy this.

gene_summarized$status <- relevel(gene_summarized$status, ref = "Nonamplified")
# Check what the levels are now
levels(gene_summarized$status)
[1] "Nonamplified" "Amplified"   
Nonamplified
Amplified

DESeq Dataset creation

# Create a DESeq2 dataset from `gene_summarized`
# remember that `status` is the variable of interest here
ddset <- DESeqDataSet(gene_summarized,
                      design = ~status)
using counts and average transcript lengths from tximeta

Differential expression analysis

Filtering low-expressed genes

Genes that have very low counts are not likely to yield reliable differential expression results, so we will do some light pre-filtering. We will keep only genes with total counts of at least 10 across all samples.

# create a vector of TRUE and FALSE values where
# TRUE corresponds to genes with counts of at least 10 
genes_to_keep <- rowSums(counts(ddset)) >= 10
# use which() to prevent any NAs sneaking through
ddset <- ddset[which(genes_to_keep), ]

The DESeq() function

We’ll now use the wrapper function DESeq() to perform our differential expression analysis. As mentioned earlier, this performs a number of steps, including an outlier removal procedure. For this particular dataset, there is a pretty large number of outliers, which can be a bit of a red flag, but we will proceed for now.

deseq_object <- DESeq(ddset)
estimating size factors
using 'avgTxLength' from assays(dds), correcting for library size
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 2895 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing

Let’s save this to our results file.

# Save the results as an RDS
readr::write_rds(deseq_object, file = deseq_file)

Now we will have a look at the results table.

deseq_results <- results(deseq_object)
deseq_results
log2 fold change (MLE): status Amplified vs Nonamplified 
Wald test p-value: status Amplified vs Nonamplified 
DataFrame with 24912 rows and 6 columns
                   baseMean log2FoldChange     lfcSE      stat    pvalue
                  <numeric>      <numeric> <numeric> <numeric> <numeric>
ENSG00000000003 1148.278399       0.921536  0.424309  2.171849  0.029867
ENSG00000000005    0.627406       1.672285  2.247996  0.743900  0.456937
ENSG00000000419 1680.109464      -0.176649  0.215485 -0.819775  0.412344
ENSG00000000457  962.907631      -0.257752  0.166387 -1.549110  0.121355
ENSG00000000460 1595.937423      -0.133821  0.197230 -0.678504  0.497452
...                     ...            ...       ...       ...       ...
ENSG00000285976  1874.02776      0.0285397  0.183730  0.155335  0.876557
ENSG00000285978     1.40743     -1.1452465  0.874165 -1.310103  0.190161
ENSG00000285982    90.93868      0.1131803  0.493040  0.229556  0.818437
ENSG00000285990    13.77859      0.3673226  0.456293  0.805015  0.420811
ENSG00000285991    17.07491      0.0709553  0.333191  0.212957  0.831361
                     padj
                <numeric>
ENSG00000000003  0.133479
ENSG00000000005        NA
ENSG00000000419  0.656626
ENSG00000000457  0.326981
ENSG00000000460  0.721065
...                   ...
ENSG00000285976  0.946696
ENSG00000285978  0.427379
ENSG00000285982  0.918545
ENSG00000285990  0.662821
ENSG00000285991  0.926078

How many genes were differentially expressed (FDR < 0.05)?

summary(deseq_results, alpha = 0.05)

out of 24799 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)       : 1071, 4.3%
LFC < 0 (down)     : 1798, 7.3%
outliers [1]       : 0, 0%
low counts [2]     : 3478, 14%
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Shrinking log2 fold change estimates

The estimates of log2 fold change calculated by DESeq() are not corrected for expression level. This means that when counts are small, we are likely to end up with some large fold change values that overestimate the true extent of the change between conditions.

We can correct this by applying a “shrinkage” procedure, which will adjust large values with small counts downward, while preserving values with larger counts, which are likely to be more accurate.

To do this, we will use the lfcShrink() function, but first we need to know the name and/or position of the “coefficient” that was calculated by DESeq(), which we can do with the resultsNames() function

# get the deseq coefficient names:
resultsNames(deseq_object)
[1] "Intercept"                        "status_Amplified_vs_Nonamplified"
Intercept
status_Amplified_vs_Nonamplified

We are interested in the status coefficient, which is in position 2.

There are a few options for the shrinkage estimation. The default is apeglm (Zhu et al. 2018), but we have found that this can be sensitive to extreme outliers, which are definitely a factor in this data set. So for this data set we will be using ashr (Stephens 2017)

# calculate shrunken log2 fold change estimates
deseq_shrunken <- lfcShrink(deseq_object,
  coef = 2, # the coefficient we want to reestimate
  type = "ashr" # We will use `ashr` for estimation
)
using 'ashr' for LFC shrinkage. If used in published research, please cite:
    Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
    https://doi.org/10.1093/biostatistics/kxw041

Let’s compare the log2 fold change estimates from the two results tables by creating a plot.

First we will combine the results into a new data frame.

comparison_df <- data.frame(
  lfc_original = deseq_results$log2FoldChange,
  lfc_shrunken = deseq_shrunken$log2FoldChange,
  logmean = log10(deseq_results$baseMean)
)

Now we can plot the original and shrunken log2 fold change values to see what happened after shrinkage.

ggplot(comparison_df,
  aes(
    x = lfc_original,
    y = lfc_shrunken,
    color = logmean
  )
) +
  geom_point(alpha = 0.1) +
  theme_bw() +
  scale_color_viridis_c() +
  coord_cartesian(xlim = c(-10, 10), ylim = c(-10, 10)) # zoom in on the middle