We’ll use limma
to identify genes that are differentially expressed between microarray and RNA-seq samples in the test compendium.
Note that I had to roll back the R version here to get Rsamtools
to install. This will be captured by sessionInfo()
below.
`%>%` <- dplyr::`%>%`
library(limma)
library(GenomicRanges)
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:parallel’:
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following object is masked from ‘package:limma’:
plotMA
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, append, as.data.frame, cbind, colMeans, colnames, colSums, do.call, duplicated, eval, evalq, Filter, Find, get,
grep, grepl, intersect, is.unsorted, lapply, lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort, table, tapply, union, unique, unsplit,
which, which.max, which.min
Loading required package: S4Vectors
Attaching package: ‘S4Vectors’
The following object is masked from ‘package:base’:
expand.grid
Loading required package: IRanges
Loading required package: GenomeInfoDb
library(rtracklayer)
library(Rsamtools)
Loading required package: Biostrings
Loading required package: XVector
Attaching package: ‘Biostrings’
The following object is masked from ‘package:base’:
strsplit
results_dir <- "results"
dir.create(results_dir, showWarnings = FALSE)
Test compendium expression matrix
exprs_mat_file <- file.path("data", "larger_compendium_exprs_mat.RDS")
exprs_mat <- readRDS(exprs_mat_file)
Generate technology labels using the sample names
sample_names <- colnames(exprs_mat)
compendium_technology_labels <- rep("MICROARRAY", length(sample_names))
compendium_technology_labels[grep("SRR|ERR|DRR", sample_names)] <- "RNASEQ"
Test for genes that are differentially expressed between technologies
design_matrix <- model.matrix(~ compendium_technology_labels)
fit <- lmFit(exprs_mat, design = design_matrix)
fit2 <- eBayes(fit)
diff_exp_stats <- topTable(fit2, number = nrow(exprs_mat)) %>%
as.data.frame() %>%
tibble::rownames_to_column("Gene")
Removing intercept from test coefficients
Save to file
readr::write_tsv(diff_exp_stats,
file.path(results_dir,
"larger_compendium_diff_exprs_technologies.tsv"))
Let’s take a peak at the “top” differentially expressed genes.
diff_exp_stats %>%
dplyr::filter(adj.P.Val < 0.00001) %>%
dplyr::arrange(`t`) %>%
head()
data.frame(
expression_value = exprs_mat["ENSDARG00000098608", ],
technology = compendium_technology_labels
) %>%
ggplot2::ggplot(ggplot2::aes(x = technology, y = expression_value)) +
ggplot2::geom_boxplot() +
ggplot2::labs(title = "ENSDARG00000098608", subtitle = "t = -234.7375") +
ggplot2::theme_bw()
Genes with negative t
values have lower values in RNA-seq samples.
Let’s take a look at the genes that have higher values in RNA-seq samples (positive t
).
diff_exp_stats %>%
dplyr::filter(adj.P.Val < 0.00001) %>%
dplyr::arrange(`t`) %>%
tail()
data.frame(
expression_value = exprs_mat["ENSDARG00000060767", ],
technology = compendium_technology_labels
) %>%
ggplot2::ggplot(ggplot2::aes(x = technology, y = expression_value)) +
ggplot2::geom_boxplot() +
ggplot2::labs(title = "ENSDARG00000060767", subtitle = "t = 165.2244") +
ggplot2::theme_bw()
Is there any relationship between the length of a gene and its t
statistic?
GRCz11
related files were obtained in 00-data_download.sh
Adapted from: https://github.com/dpryan79/Answers/blob/d2d7366fec907fd92ccf03e77c2d53cfe45b02c9/SEQanswers_42420/GTF2LengthGC.R
Non-duplicated exons from each gene are summed up to get the length (“union gene model”).
# obtained in 00-data_download.sh
gtf_file <- file.path("data", "GRCz11", "Danio_rerio.GRCz11.95.gtf")
# Load the annotation and reduce it -- filtering for exons
gtf <- import.gff(gtf_file, format = "gtf", genome = "GRCz11.95",
feature.type = "exon")
grl <- reduce(split(gtf, elementMetadata(gtf)$gene_id))
reduced_gtf <- unlist(grl, use.names = TRUE)
elementMetadata(reduced_gtf)$gene_id <- rep(names(grl), elementNROWS(grl))
# add the widths to the metadata slot -- number of values in the range
elementMetadata(reduced_gtf)$widths <- width(reduced_gtf)
output <- t(sapply(
split(reduced_gtf, elementMetadata(reduced_gtf)$gene_id),
# for each gene sum exon widths
function(x) sum(elementMetadata(x)$widths)))
output[, 1:5]
ENSDARG00000000001 ENSDARG00000000002 ENSDARG00000000018 ENSDARG00000000019 ENSDARG00000000068
2476 2604 3013 2758 2397
gene_lengths_df <- data.frame(Gene = colnames(output),
gene_length = output[1, ])
readr::write_tsv(gene_lengths_df,
file.path("data",
"Danio_rerio.GRCz11_gene_lengths_exons.tsv"))
# if one already had the gene lengths file, it could be loaded:
# gene_lengths_df <-
# readr::read_tsv(file.path("data",
# "Danio_rerio.GRCz11_gene_names_and_lengths.tsv"),
# col_names = FALSE) %>%
# dplyr::mutate(X1 = sub("gene:", "", X1))
Join the t-statistic with the gene lengths and make a scatterplot
dplyr::inner_join(diff_exp_stats, gene_lengths_df,
by = "Gene") %>%
dplyr::select(c("Gene", "t", "gene_length")) %>%
ggplot2::ggplot(ggplot2::aes(x = `t`, y = gene_length)) +
ggplot2::geom_point(alpha = 0.2) +
ggplot2::geom_smooth() +
ggplot2::theme_bw() +
ggplot2::labs(x = "t statistic", y = "gene length")
Column `Gene` joining character vector and factor, coercing into character vector
As noted above:
Genes with negative
t
values have lower values in RNA-seq samples.
So longer genes tend to have higher expression values in RNA-seq data.
scatter_file <- file.path("plots", "larger_compendium_tstat_length_scatter.png")
ggplot2::ggsave(filename = scatter_file,
plot = ggplot2::last_plot())
Saving 7 x 7 in image
sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)
Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=C LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 Rsamtools_1.30.0 Biostrings_2.46.0 XVector_0.18.0 rtracklayer_1.38.3 GenomicRanges_1.30.3
[7] GenomeInfoDb_1.14.0 IRanges_2.12.0 S4Vectors_0.16.0 BiocGenerics_0.24.0 limma_3.34.9
loaded via a namespace (and not attached):
[1] SummarizedExperiment_1.8.1 lattice_0.20-35 colorspace_1.3-2 htmltools_0.3.6 yaml_2.1.18
[6] mgcv_1.8-23 base64enc_0.1-3 XML_3.98-1.11 rlang_0.2.0 pillar_1.2.1
[11] glue_1.2.0 BiocParallel_1.12.0 matrixStats_0.54.0 GenomeInfoDbData_1.0.0 bindr_0.1.1
[16] plyr_1.8.4 stringr_1.3.0 zlibbioc_1.24.0 munsell_0.4.3 gtable_0.2.0
[21] evaluate_0.10.1 labeling_0.3 Biobase_2.38.0 knitr_1.20 Rcpp_0.12.16
[26] readr_1.1.1 scales_0.5.0 backports_1.1.2 DelayedArray_0.4.1 jsonlite_1.5
[31] ggplot2_2.2.1 hms_0.4.2 digest_0.6.15 stringi_1.1.7 dplyr_0.7.4
[36] grid_3.4.4 rprojroot_1.3-2 tools_3.4.4 bitops_1.0-6 magrittr_1.5
[41] lazyeval_0.2.1 RCurl_1.95-4.11 tibble_1.4.2 pkgconfig_2.0.1 Matrix_1.2-12
[46] assertthat_0.2.0 rmarkdown_1.9 R6_2.2.2 GenomicAlignments_1.14.2 nlme_3.1-131.1
[51] compiler_3.4.4