We observed a slight “platform bias” in 05-pca_test_compendium – do the lowly expressed genes account for the difference in platforms?

Set up

`%>%` <- dplyr::`%>%`

Function for calculating the mode from https://stackoverflow.com/questions/2547402/is-there-a-built-in-function-for-finding-the-mode

calculate_mode <- function(x) {
  # where x is a numeric vector, return the numeric value of the mode 
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

Read in data

exprs_mat <- readRDS(file.path("data", "larger_compendium_exprs_mat.RDS"))
# inferring sample technology from column name
sample_names <- colnames(exprs_mat)
compendium_technology_labels <- rep("MICROARRAY", length(sample_names))
compendium_technology_labels[grep("SRR|ERR|DRR", sample_names)] <- "RNA-SEQ"
table(compendium_technology_labels)
compendium_technology_labels
MICROARRAY    RNA-SEQ 
      2077      11513 

Split by technology, they have different distributions. More on that below!

seq_mat <- exprs_mat[, which(compendium_technology_labels == "RNA-SEQ")]
array_mat <- exprs_mat[, which(compendium_technology_labels == "MICROARRAY")]

Explore distributions

This test compendium is quantile normalized (QN’d). RNA-seq data, even after all the other processing steps, likely has many zeroes and we can spot that in the data post-QN. Let’s look at the distribution of a single RNA-seq sample for illustrative purposes.

# selecting a single sample for illustrative purposes -- there is no reason for
# using this sample in particular
random_index <- 34
example_ecdf_plot <- data.frame(x = seq_mat[, random_index]) %>%
  ggplot2::ggplot(ggplot2::aes(x)) +
  ggplot2::stat_ecdf(geom = "point", size = 0.75, alpha = 0.75) +
  ggplot2::labs(y = "ecdf(x)", x = "expression value (x)",
                title = "RNA-seq sample") +
  ggplot2::theme_bw()
example_ecdf_plot

We can see that because of the quantile normalization there is a “run” of values at the lowest express value in the sample but it’s not zero.

I think the expression values that get filled in for the zero values can be determined by finding the mode for each sample rather than the minimum value for each sample because there’s nothing forcing the lowest value in a RNA-seq sample to be positive. We’re assuming that the mode for an RNA-seq sample before we QN is zero, which is probably a pretty safe assumption; we can dig into this more later.

Let’s take a look at our example, adding a dotted line for the mode.

example_mode <- calculate_mode(seq_mat[, random_index])
example_ecdf_plot + ggplot2::geom_vline(xintercept = example_mode,
                                        lty = "dotted")

And another example.

another_random_index <- 387
data.frame(x = seq_mat[, another_random_index]) %>%
  ggplot2::ggplot(ggplot2::aes(x)) +
  ggplot2::stat_ecdf(geom = "point", size = 0.75, alpha = 0.75) +
  ggplot2::labs(y = "ecdf(x)", x = "expression value (x)") +
  ggplot2::theme_bw() +
  ggplot2::geom_vline(xintercept = 
                        calculate_mode(seq_mat[, another_random_index]),
                      lty = "dotted")

Picking an array sample at random we can look at the two ECDFs.

data.frame(
  expression_value = c(array_mat[, another_random_index],
                       seq_mat[, another_random_index]),
  technology = c(rep("MICROARRAY", nrow(array_mat)),
                 rep("RNA-SEQ", nrow(seq_mat)))
) %>% 
  ggplot2::ggplot(ggplot2::aes(x = expression_value,
                               colour = technology)) +
  ggplot2::stat_ecdf(geom = "point", size = 0.75, alpha = 0.2) +
  ggplot2::labs(y = "ecdf(x)", x = "expression value (x)") +
  ggplot2::theme_bw() +
  ggplot2::scale_color_manual(labels = c("MICROARRAY", "RNA-SEQ"), 
                              values = c("#E69F00", "#56B4E9"))

Consistent with expectations based on what we’ve laid out above: the microarray sample doesn’t have a run of repeated values at the low end.

Explore lowly expressed genes

The differences between platforms could be driven, in part, by these “used-to-be-zero” values in the RNA-seq data. So what are they and do they tend to be the same genes in many RNA-seq samples?

RNA-seq

Find the gene identifiers of each of the “used-to-be-zero” values

zero_gene_list <- apply(seq_mat, 2, 
                       function(x) { 
                         mode_value <- calculate_mode(x)
                         mode_index <- which(x == mode_value)
                         rownames(seq_mat)[mode_index]
                       })

Is it the same genes over and over again? Let’s count.

zero_counts_per_gene <- data.frame(table(unlist(zero_gene_list)))
colnames(zero_counts_per_gene)[1] <- "Gene"
head(zero_counts_per_gene)

What does the distribution of these counts look like?

zero_counts_per_gene %>%
  ggplot2::ggplot(ggplot2::aes(x = Freq)) +
  ggplot2::geom_bar() +
  ggplot2::theme_bw() +
  ggplot2::labs(x = "number of samples",
                title = "Frequency of zero values") +
  ggplot2::xlim(c(0, ncol(seq_mat)))

This tells use that there a few genes that are missing (zero) in many of the RNA-seq samples we’ve tested (but none that are missing in all samples) and lots of genes that are missing in only a few samples.

Okay, do the genes that are missing in a bunch of samples tend to have lower expression values in the test compendium than all the other genes?

high_zero_count_genes <- zero_counts_per_gene %>%
  dplyr::filter(Freq >= quantile(Freq, 0.75)) %>%
  dplyr::pull(Gene) %>%
  as.character()

Get the average expression for each gene for the microarray samples

array_avg_exprs_df <- data.frame(average_expression = rowMeans(array_mat),
                                 gene = rownames(array_mat)) %>%
  dplyr::mutate(gene_category = dplyr::case_when(
    gene %in% high_zero_count_genes ~ "high zero count",
    TRUE ~ "other"
  ))

Plot the distributions

array_avg_exprs_df %>%
  ggplot2::ggplot(ggplot2::aes(x = average_expression,
                               fill = gene_category)) +
  ggplot2::geom_density(alpha = 0.4) +
  ggplot2::theme_bw() +
  ggplot2::labs(x = "average expression in microarray samples") +
  ggplot2::guides(fill = ggplot2::guide_legend(title = "gene category in RNA-seq")) +
  ggplot2::scale_fill_manual(values = c("#000000", "#FFFFFF"))

plot_name <- file.path("plots", "average_expression_array_zero_count_genes.png")
ggplot2::ggsave(plot_name)
Saving 7 x 7 in image

Session info

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C              LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2.2

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18     rstudioapi_0.7   knitr_1.20       bindr_0.1.1      magrittr_1.5     tidyselect_0.2.4 munsell_0.5.0   
 [8] colorspace_1.3-2 R6_2.2.2         rlang_0.2.2      plyr_1.8.4       dplyr_0.7.6      tools_3.5.1      grid_3.5.1      
[15] gtable_0.2.0     digest_0.6.17    yaml_2.2.0       lazyeval_0.2.1   assertthat_0.2.0 tibble_1.4.2     crayon_1.3.4    
[22] purrr_0.2.5      ggplot2_3.0.0    glue_1.3.0       labeling_0.3     compiler_3.5.1   pillar_1.3.0     scales_1.0.0    
[29] pkgconfig_2.0.2 
---
title: "Lowly expressed genes"
output:   
  html_notebook: 
    toc: true
    toc_float: true
author: J. Taroni for CCDL
date: 2019
---

We observed a slight "platform bias" in `05-pca_test_compendium` --
do the lowly expressed genes account for the difference in platforms?

## Set up

```{r}
`%>%` <- dplyr::`%>%`
```

Function for calculating the mode from 
https://stackoverflow.com/questions/2547402/is-there-a-built-in-function-for-finding-the-mode

```{r}
calculate_mode <- function(x) {
  # where x is a numeric vector, return the numeric value of the mode 
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
```

## Read in data

```{r}
exprs_mat <- readRDS(file.path("data", "larger_compendium_exprs_mat.RDS"))
```

```{r}
# inferring sample technology from column name
sample_names <- colnames(exprs_mat)
compendium_technology_labels <- rep("MICROARRAY", length(sample_names))
compendium_technology_labels[grep("SRR|ERR|DRR", sample_names)] <- "RNA-SEQ"
```

```{r}
table(compendium_technology_labels)
```

Split by technology, they have different distributions.
More on that below!

```{r}
seq_mat <- exprs_mat[, which(compendium_technology_labels == "RNA-SEQ")]
array_mat <- exprs_mat[, which(compendium_technology_labels == "MICROARRAY")]
```

## Explore distributions

This test compendium is quantile normalized (QN'd). 
RNA-seq data, even after all the other processing steps, likely has many 
zeroes and we can spot that in the data post-QN.
Let's look at the distribution of a single RNA-seq sample for illustrative
purposes.

```{r}
# selecting a single sample for illustrative purposes -- there is no reason for
# using this sample in particular
random_index <- 34
```

```{r}
example_ecdf_plot <- data.frame(x = seq_mat[, random_index]) %>%
  ggplot2::ggplot(ggplot2::aes(x)) +
  ggplot2::stat_ecdf(geom = "point", size = 0.75, alpha = 0.75) +
  ggplot2::labs(y = "ecdf(x)", x = "expression value (x)",
                title = "RNA-seq sample") +
  ggplot2::theme_bw()
example_ecdf_plot
```

We can see that because of the quantile normalization there is a "run" of
values at the lowest express value in the sample _but_ it's not zero.

I _think_ the expression values that get filled in for the zero values can be 
determined by finding the mode for each sample rather than the minimum value
for each sample because there's nothing forcing the lowest value in a RNA-seq
sample to be _positive_.
We're assuming that the _mode_ for an RNA-seq sample before we QN is zero, which
is probably a pretty safe assumption;
we can dig into this more later.

Let's take a look at our example, adding a dotted line for the mode.

```{r}
example_mode <- calculate_mode(seq_mat[, random_index])
example_ecdf_plot + ggplot2::geom_vline(xintercept = example_mode,
                                        lty = "dotted")
```

And another example.

```{r}
another_random_index <- 387
data.frame(x = seq_mat[, another_random_index]) %>%
  ggplot2::ggplot(ggplot2::aes(x)) +
  ggplot2::stat_ecdf(geom = "point", size = 0.75, alpha = 0.75) +
  ggplot2::labs(y = "ecdf(x)", x = "expression value (x)") +
  ggplot2::theme_bw() +
  ggplot2::geom_vline(xintercept = 
                        calculate_mode(seq_mat[, another_random_index]),
                      lty = "dotted")
```

Picking an array sample at random we can look at the two ECDFs.

```{r}
data.frame(
  expression_value = c(array_mat[, another_random_index],
                       seq_mat[, another_random_index]),
  technology = c(rep("MICROARRAY", nrow(array_mat)),
                 rep("RNA-SEQ", nrow(seq_mat)))
) %>% 
  ggplot2::ggplot(ggplot2::aes(x = expression_value,
                               colour = technology)) +
  ggplot2::stat_ecdf(geom = "point", size = 0.75, alpha = 0.2) +
  ggplot2::labs(y = "ecdf(x)", x = "expression value (x)") +
  ggplot2::theme_bw() +
  ggplot2::scale_color_manual(labels = c("MICROARRAY", "RNA-SEQ"), 
                              values = c("#E69F00", "#56B4E9"))
```

Consistent with expectations based on what we've laid out above: the microarray
sample doesn't have a run of repeated values at the low end.

## Explore lowly expressed genes

The differences between platforms could be driven, in part, by these 
"used-to-be-zero" values in the RNA-seq data.
So what are they and do they tend to be the same genes in many RNA-seq samples?

### RNA-seq

Find the gene identifiers of each of the "used-to-be-zero" values

```{r}
zero_gene_list <- apply(seq_mat, 2, 
                        function(x) { 
                          mode_value <- calculate_mode(x)
                          mode_index <- which(x == mode_value)
                          rownames(seq_mat)[mode_index]
                        })
```

Is it the same genes over and over again? 
Let's count.

```{r}
zero_counts_per_gene <- data.frame(table(unlist(zero_gene_list)))
colnames(zero_counts_per_gene)[1] <- "Gene"
head(zero_counts_per_gene)
```

What does the distribution of these counts look like?

```{r}
zero_counts_per_gene %>%
  ggplot2::ggplot(ggplot2::aes(x = Freq)) +
  ggplot2::geom_bar() +
  ggplot2::theme_bw() +
  ggplot2::labs(x = "number of samples",
                title = "Frequency of zero values") +
  ggplot2::xlim(c(0, ncol(seq_mat)))
```

This tells use that there a few genes that are missing (zero) in many of the 
RNA-seq samples we've tested (but none that are missing in all samples) and lots
of genes that are missing in only a few samples.

Okay, do the genes that are missing in a bunch of samples tend to have lower
expression values in the test compendium than all the other genes?

```{r}
high_zero_count_genes <- zero_counts_per_gene %>%
  dplyr::filter(Freq >= quantile(Freq, 0.75)) %>%
  dplyr::pull(Gene) %>%
  as.character()
```

Get the average expression for each gene for the microarray samples

```{r}
array_avg_exprs_df <- data.frame(average_expression = rowMeans(array_mat),
                                 gene = rownames(array_mat)) %>%
  dplyr::mutate(gene_category = dplyr::case_when(
    gene %in% high_zero_count_genes ~ "high zero count",
    TRUE ~ "other"
  ))
```

Plot the distributions

```{r}
array_avg_exprs_df %>%
  ggplot2::ggplot(ggplot2::aes(x = average_expression,
                               fill = gene_category)) +
  ggplot2::geom_density(alpha = 0.4) +
  ggplot2::theme_bw() +
  ggplot2::labs(x = "average expression in microarray samples") +
  ggplot2::guides(fill = ggplot2::guide_legend(title = "gene category in RNA-seq")) +
  ggplot2::scale_fill_manual(values = c("#000000", "#FFFFFF"))
```

```{r}
plot_name <- file.path("plots", "average_expression_array_zero_count_genes.png")
ggplot2::ggsave(plot_name)
```

## Session info

```{r}
sessionInfo()
```

