In 7-technology_diff_exp
, we observed that genes that are shorter tend to have lower expression values in RNA-seq data in the test compendium.
Here, we’re interested in if short genes tend to be unobserved (e.g., have counts/count-scale data that are zero) in the RNA-seq data used in the test compendium. We’ll go back to the lengthScaledTPM.tsv
that has not yet been through the aggregation process to explore this. Note that we will only be using RNA-seq files that were used in an earlier analysis of imputation performance. These will be ~400 RNA-seq samples that were included in a smaller test compendium and thus, for later analyses in this notebook we’ll use the smaller test compendium.
`%>%` <- dplyr::`%>%`
# we'll need ggfortify to do the PCA plots
library(ggfortify)
Loading required package: ggplot2
These are a series of wrapper functions for plotting, as we typically make plots for all genes and then a subset of genes (such as those that are in the compendium). They generally expect certain objects to be in the global environment; see the documentation for the individual functions.
zero_proportion_point_wrapper <- function(df, plot.subtitle) {
# Make the point plots (with geom_smooth) for the proportion of samples that
# have a value of 0 for a gene vs. that gene's length
#
# Args:
# df: a data.frame that includes 'gene_length' and 'Zero_proportion'
# plot.subtitle: subtitle for the plot; character
#
# Returns:
# a ggplot as described above
if(!all(c("gene_length" %in% colnames(df),
"Zero_proportion" %in% colnames(df)))) {
stop("df must have colnames 'gene_length' and 'Zero_proportion'")
}
df %>%
ggplot(aes(x = gene_length, y = Zero_proportion)) +
geom_point(alpha = 0.2) +
geom_smooth() +
theme_bw() +
labs(x = "gene length", y = "proportion of samples with zero count",
subtitle = plot.subtitle, title = "Gene length and observations")
}
pca_plot_wrapper <- function(prcomp_results, plot.title) {
# given the output of prcomp, make a scatterplot of PC1 and PC2 with
# ggfortify::autoplot where the samples are colored by technology (e.g.,
# microarray, RNA-seq)
# requires technology_df to be in the global environment
# Args:
# prcomp_results: output of prcomp, run on a transposed expression matrix
# plot.title: title of the plot; character
#
# Returns
# ggplot as described above
autoplot(prcomp_results, data = technology_df, alpha = 0.5,
colour = 'compendium_technology_labels', scale = 0) +
scale_color_manual(labels = c("MICROARRAY", "RNA-SEQ"),
values = c("#E69F00", "#56B4E9")) +
theme_bw() +
ggtitle(plot.title) +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
guides(colour = guide_legend(title = "technology"))
}
pca_pairs_plot_wrapper <- function(prcomp_results, plot.title) {
# given the output of prcomp, make a pairs plot of PC1-5 with where the
# samples are colored by technology (e.g., microarray, RNA-seq)
# requires compendium_technology_labels to be in the global environment
#
# Args:
# prcomp_results: output of prcomp, run on a transposed expression matrix
# plot.title: title of the plot; character
#
# Returns
# plot as described above
pairs(prcomp_results$x[, 1:5],
col = dplyr::recode(compendium_technology_labels,
`MICROARRAY` = "#E69F00",
`RNASEQ` = "#56B4E9"),
main = plot.title)
}
# directory to hold the plots
plot_dir <- "plots"
lengthScaledTPM.tsv
fileslist_of_rnaseq_files <- list.files(file.path("..", "select_imputation_method",
"data", "rnaseq"),
full.names = TRUE)
Read in the files such that they are individual data.frame
in a list
rnaseq_df_list <- lapply(list_of_rnaseq_files, readr::read_tsv)
# bind all the individual sample data.frame together
rnaseq_df <- dplyr::bind_cols(rnaseq_df_list)
# we end up with a bunch of "extra" gene columns this way, but they should be
# in the same order
all.equal(rnaseq_df$Gene1, rnaseq_df$Gene112)
[1] TRUE
# drop the extra gene columns and set the rownames to gene identifiers -- this
# will help with calculating the proportion of zeros, etc.
rnaseq_df <- rnaseq_df %>%
as.data.frame() %>%
tibble::column_to_rownames("Gene") %>%
dplyr::select(-dplyr::contains("Gene"))
# remove the list of individual sample data.frame
rm(rnaseq_df_list)
# read in gene lengths
gene_lengths_df <-
readr::read_tsv(file.path("data",
"Danio_rerio.GRCz11_gene_lengths_exons.tsv"))
Parsed with column specification:
cols(
Gene = col_character(),
gene_length = col_integer()
)
We will explore this gene lengths-observations/expression levels relationship in all genes and also subset to only the genes in the test compendium, so we need to pull those gene identifiers out of the test compendium expression matrix.
# read in smaller test compendium and convert to expression matrix
compendium_exprs_file <- file.path("refinebio-data", "DANIO_RERIO",
"DANIO_RERIO.tsv")
exprs_mat <- readr::read_tsv(compendium_exprs_file, progress = FALSE) %>%
as.data.frame() %>%
tibble::column_to_rownames("X1") %>%
as.matrix()
Missing column names filled in: 'X1' [1]Parsed with column specification:
cols(
.default = col_double(),
X1 = col_character()
)
See spec(...) for full column specifications.
# derive the technology labels from the sample names
sample_names <- colnames(exprs_mat)
compendium_technology_labels <- rep("MICROARRAY", length(sample_names))
compendium_technology_labels[grep("lengthScaledTPM", sample_names)] <- "RNASEQ"
# get the technology labels as a data.frame for plotting later
technology_df <- as.data.frame(compendium_technology_labels)
# get the list of compendium genes
compendium_genes <- rownames(exprs_mat)
# calculate the proportion of samples a gene is zero for
gene_zeroes_proportion <- data.frame(
"Gene" = rownames(rnaseq_df),
"Zero_proportion" = apply(rnaseq_df, 1,
function(x) sum(x == 0) / length(x))
)
# bind to lengths, this will be used for plotting the relationship between
# gene length and number of zeroes
zeroes_df <- dplyr::inner_join(gene_lengths_df, gene_zeroes_proportion,
by = "Gene")
Column `Gene` joining character vector and factor, coercing into character vector
# there are some really long genes that will probably make it difficult to
# visualize what is going on
summary(zeroes_df$gene_length)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10 896 1794 2384 3155 98586
# filtering genes based on length to get a clearer picture
zeroes_df %>%
dplyr::filter(gene_length <= 20000) %>%
zero_proportion_point_wrapper(plot.subtitle = "All genes")
ggsave(filename = file.path(plot_dir,
"gene_length_vs_zero_count_all_genes.png"),
plot = last_plot())
Saving 7 x 7 in image
# subset just to genes that are in the compendium
zeroes_df %>%
dplyr::filter(Gene %in% compendium_genes,
gene_length <= 20000) %>%
zero_proportion_point_wrapper(plot.subtitle = "Test compendium genes only")
ggsave(filename = file.path(plot_dir,
"gene_length_vs_zero_count_compendium_genes.png"),
plot = last_plot())
Saving 7 x 7 in image
Shorter genes do tend to have a larger proportion of samples where they have zero values. It stands to reason that if we do not observe shorter genes, we can not correct for length-bias effectively. (Recall that lengthScaledTPM
are calculated using the average tx length observed across samples.)
Does removing shorter genes from the species compendium remove evidence of a technology bias?
pca_results <- prcomp(t(exprs_mat))
pca_plot_wrapper(prcomp_results = pca_results,
plot.title = "Test Compendium by Technology: All genes")
ggsave(filename = file.path(plot_dir,
"small_test_compendium_PCA.png"),
plot = last_plot())
Saving 7 x 7 in image
png(file.path(plot_dir, "small_test_compendium_pairs_plot.png"))
pca_pairs_plot_wrapper(prcomp_results = pca_results,
plot.title = "Test Compendium by Technology: All genes")
dev.off()
null device
1
# what genes are short? using 25th percentile
short_genes <- zeroes_df %>%
dplyr::filter(gene_length <= quantile(zeroes_df$gene_length, 0.25)) %>%
dplyr::pull(Gene)
length_filtered_exprs_mat <- exprs_mat[!rownames(exprs_mat) %in% short_genes, ]
dim(length_filtered_exprs_mat)
[1] 8650 830
# PCA on the length filtered exprs mat
filtered_pca_results <- prcomp(t(length_filtered_exprs_mat))
pca_plot_wrapper(prcomp_results = filtered_pca_results,
plot.title = paste("Test Compendium by Technology:",
"Short genes removed"))
ggsave(filename = file.path(plot_dir,
"gene_lengths_test_compendium_PCA.png"),
plot = last_plot())
Saving 7 x 7 in image
png(file.path(plot_dir, "gene_lengths_test_compendium_pairs_plot.png"))
pca_pairs_plot_wrapper(prcomp_results = filtered_pca_results,
plot.title = paste("Test Compendium by Technology:",
"Short genes removed"))
dev.off()
null device
1
Removing short genes from the test species compendium does not remove technology bias, but there are some considerations:
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)
Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=C LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 ggfortify_0.4.5 ggplot2_3.0.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.18 pillar_1.3.0 compiler_3.5.1 plyr_1.8.4 bindr_0.1.1 tools_3.5.1 digest_0.6.17 tibble_1.4.2
[9] gtable_0.2.0 nlme_3.1-137 lattice_0.20-35 mgcv_1.8-24 pkgconfig_2.0.2 rlang_0.2.2 Matrix_1.2-14 rstudioapi_0.7
[17] yaml_2.2.0 gridExtra_2.3 withr_2.1.2 dplyr_0.7.6 stringr_1.3.1 knitr_1.20 hms_0.4.2 grid_3.5.1
[25] tidyselect_0.2.4 glue_1.3.0 R6_2.2.2 purrr_0.2.5 tidyr_0.8.1 readr_1.1.1 magrittr_1.5 scales_1.0.0
[33] assertthat_0.2.0 colorspace_1.3-2 labeling_0.3 stringi_1.2.4 lazyeval_0.2.1 munsell_0.5.0 crayon_1.3.4