This was an exploration of how TMB is affected by using nonsynonymous filters. It addresses part 1 of this OpenPBTA issue As far as the TMB comparisons go, it doesn’t matter if we use Variant_Classification
filters or not. They are highly correlated no matter what and the same general TCGA to PBTA differences seem to persist.
If both run_caller_consensus_analysis-tcga.sh
and run_caller_consensus_analysis-pbta.sh
have been run, you can run this command. This will prepare the files needed for this notebook and run this notebook:
# bash run_explorations.sh
Table of Contents generated with DocToc
# Magrittr pipe
`%>%` <- dplyr::`%>%`
source(file.path("..", "..", "tmb-compare", "util", "cdf-plot-function.R"))
dir.create("plots", showWarnings = FALSE)
Import the tmb calculations that are not filtered by nonsynonymous classifications.
tmb_pbta_no_filter <- data.table::fread(file.path(
"..",
"results",
"no_filter",
"pbta-snv-mutation-tmb-coding.tsv"
)) %>%
dplyr::filter(experimental_strategy != "Panel") %>%
dplyr::mutate(filter = "no filter")
Repeat same steps for original tmb calculations (with filter).
tmb_pbta_with_filter <- data.table::fread(file.path(
"..",
"results",
"consensus",
"pbta-snv-mutation-tmb-coding.tsv"
)) %>%
dplyr::filter(experimental_strategy != "Panel") %>%
dplyr::mutate(filter = "filter")
Combine the unfilter and filtered data sets into one.
tmb_pbta <- dplyr::inner_join(tmb_pbta_no_filter,
dplyr::select(tmb_pbta_with_filter, Tumor_Sample_Barcode, tmb),
by = "Tumor_Sample_Barcode",
suffix = c("_no_filter", "_filter")
)
Repeat the same steps for the TCGA data.
tmb_tcga_no_filter <- data.table::fread(file.path(
"..",
"results",
"no_filter",
"tcga-snv-mutation-tmb-coding.tsv"
)) %>%
dplyr::mutate(filter = "no filter")
tmb_tcga_with_filter <- data.table::fread(file.path(
"..",
"results",
"consensus",
"tcga-snv-mutation-tmb-coding.tsv"
)) %>%
dplyr::mutate(filter = "filter")
Join these together.
tmb_tcga <- dplyr::inner_join(tmb_tcga_no_filter,
dplyr::select(tmb_tcga_with_filter, Tumor_Sample_Barcode, tmb),
by = "Tumor_Sample_Barcode",
suffix = c("_no_filter", "_filter")
)
Correlate first with Pearson’s.
cor.test(tmb_pbta$tmb_filter, tmb_pbta$tmb_no_filter)
Pearson's product-moment correlation
data: tmb_pbta$tmb_filter and tmb_pbta$tmb_no_filter
t = 2488.9, df = 934, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9999143 0.9999337
sample estimates:
cor
0.9999246
pbta_cor_plot <- ggplot2::ggplot(
tmb_pbta,
ggplot2::aes(x = tmb_no_filter, y = tmb_filter, color = short_histology)
) +
ggplot2::geom_point() +
ggplot2::theme_classic() +
ggplot2::theme(legend.position = "none")
pbta_cor_plot
This data is very skewed, let’s try a nonparametric correlation.
cor.test(tmb_pbta$tmb_filter, tmb_pbta$tmb_no_filter, method = "spearman")
Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: tmb_pbta$tmb_filter and tmb_pbta$tmb_no_filter
S = 3031423, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.9778195
Still pretty good.
cor.test(tmb_tcga$tmb_filter, tmb_tcga$tmb_no_filter)
Pearson's product-moment correlation
data: tmb_tcga$tmb_filter and tmb_tcga$tmb_no_filter
t = 4296.2, df = 316, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9999893 0.9999931
sample estimates:
cor
0.9999914
tcga_cor_plot <- ggplot2::ggplot(
tmb_tcga,
ggplot2::aes(x = tmb_no_filter, y = tmb_filter, color = short_histology)
) +
ggplot2::geom_point() +
ggplot2::xlim(0, 5) +
ggplot2::ylim(0, 5) +
ggplot2::theme_classic() +
ggplot2::theme(legend.position = "none")
tcga_cor_plot
Correlate with nonparametric correlation since we did that for PBTA.
cor.test(tmb_tcga$tmb_filter, tmb_tcga$tmb_no_filter, method = "spearman")
Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: tmb_tcga$tmb_filter and tmb_tcga$tmb_no_filter
S = 60671, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.9886798
Let’s make the same plots but by short_histology
.
Plotting PBTA data, but setting the scales to “free” since the histologies have very different tmb ranges.
pbta_cor_plot +
ggplot2::facet_wrap(~short_histology, scales = "free")