OpenPBTA-analysis

Focal copy number file preparation

Module authors: Chante Bethell (@cbethell), Joshua Shapiro (@jashapiro), and Jaclyn Taroni (@jaclyn-taroni)

The copy number data from OpenPBTA are provided as ranges or segments. The purpose of this module is to map from those ranges to gene identifiers for consumption by downstream analyses (e.g., OncoPrint plotting).
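
The range-to-gene mapping described above comes down to interval overlap. The sketch below is a toy illustration of that idea using awk; it is not the module's implementation. The function name `annotate_segments`, the column layouts, and all sample IDs, coordinates, and gene names are made up for the example.

```shell
#!/bin/bash
# Toy illustration only: report which genes each copy number segment overlaps.
# All file layouts, coordinates, and names here are hypothetical.
annotate_segments() {
  # $1: segments TSV with columns sample, chrom, start, end, copy_number
  # $2: genes TSV with columns gene, chrom, start, end
  awk -F'\t' -v OFS='\t' '
    # First file on the command line (genes): store every gene interval.
    NR == FNR {
      gene[FNR] = $1; gchr[FNR] = $2
      gstart[FNR] = $3 + 0; gend[FNR] = $4 + 0
      n = FNR
      next
    }
    # Second file (segments): compare each segment against every gene.
    {
      for (i = 1; i <= n; i++) {
        # Two intervals overlap when each one starts before the other ends.
        if ($2 == gchr[i] && ($3 + 0) <= gend[i] && ($4 + 0) >= gstart[i]) {
          print $1, gene[i], $5
        }
      }
    }
  ' "$2" "$1"
}

tmpdir="$(mktemp -d)"
printf 'sample_A\tchr1\t100\t500\t3\nsample_A\tchr2\t50\t80\t1\n' > "$tmpdir/segments.tsv"
printf 'GENE1\tchr1\t400\t600\nGENE2\tchr1\t700\t900\nGENE3\tchr2\t10\t60\n' > "$tmpdir/genes.tsv"

# Prints one line per (segment, overlapping gene) pair: sample, gene, copy number.
annotate_segments "$tmpdir/segments.tsv" "$tmpdir/genes.tsv"
```

In this toy input the chr1 segment overlaps GENE1 but not GENE2, and the chr2 segment overlaps GENE3, so two lines are printed. The nested loop is quadratic and only suitable for small examples.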

Running this analysis

This analysis requires roughly 24 GB of RAM or more to run to completion.

To run this analysis on the consensus SEG file only while running the molecular-subtyping modules for a release, set OPENPBTA_BASE_SUBTYPING=1. The module then uses pbta-histologies-base.tsv from the data folder and reads the consensus SEG file from the relative path copy_number_consensus_call/results/pbta-cnv-consensus.seg.gz:

OPENPBTA_BASE_SUBTYPING=1 bash analyses/focal-cn-file-preparation/run-prepare-cn.sh

By default, the module uses pbta-histologies.tsv and the files downloaded from the data release:

bash analyses/focal-cn-file-preparation/run-prepare-cn.sh

Note: run-prepare-cn.sh calls the run-bedtools.snakemake script to run the bedtools coverage steps between the UCSC cytoband file and the samples in the copy number files produced in 02-add-ploidy-consensus.Rmd. These steps currently take a while to run and therefore slow down run-prepare-cn.sh as a whole.

Running the following from the root directory of the repository will run the steps for the original copy number call files (CNVkit and ControlFreeC) in addition to the consensus SEG file:

RUN_ORIGINAL=1 bash analyses/focal-cn-file-preparation/run-prepare-cn.sh
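
The environment variables used in the commands above behave as on/off switches. The following is a minimal sketch of how a shell script can read such switches with a default of 0 (off). It illustrates the general pattern and is not the contents of run-prepare-cn.sh; the function name `describe_run` and the printed labels are hypothetical.

```shell
#!/bin/bash
# Hypothetical illustration: read optional switches from the environment,
# treating an unset variable as 0 (off).
describe_run() {
  local run_original="${RUN_ORIGINAL:-0}"
  local base_subtyping="${OPENPBTA_BASE_SUBTYPING:-0}"

  # Choose the histologies file based on the subtyping switch.
  if [ "$base_subtyping" = "1" ]; then
    echo "histologies: pbta-histologies-base.tsv"
  else
    echo "histologies: pbta-histologies.tsv"
  fi

  # The consensus SEG file is always processed; the original caller
  # files are added only when RUN_ORIGINAL=1.
  if [ "$run_original" = "1" ]; then
    echo "callers: consensus cnvkit controlfreec"
  else
    echo "callers: consensus"
  fi
}

describe_run
```

Because the variables are set on the command line in front of `bash`, they apply to that single invocation only and do not persist in the calling shell.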

Scripts and notebooks

Output files for downstream consumption

Note: The output files from 04-prepare-cn-file.R have neutral calls filtered out to reduce file size.

results
├── cnvkit_annotated_cn_autosomes.tsv.gz
├── cnvkit_annotated_cn_x_and_y.tsv.gz
├── consensus_seg_annotated_cn_autosomes.tsv.gz
├── consensus_seg_annotated_cn_x_and_y.tsv.gz
├── consensus_seg_most_focal_cn_status.tsv.gz
├── consensus_seg_recurrent_focal_cn_units.tsv
├── consensus_seg_with_ucsc_cytoband_status.tsv.gz
├── controlfreec_annotated_cn_autosomes.tsv.gz
└── controlfreec_annotated_cn_x_and_y.tsv.gz
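
As a rough picture of what the note about neutral calls implies, the sketch below drops rows labeled neutral from a small tab-separated table. It assumes a header row with a column named `status` whose values include `neutral`; that column name, the other columns, and the function name `drop_neutral` are assumptions for illustration and may not match the actual output files or the filtering done in the module.

```shell
#!/bin/bash
# Hypothetical illustration: keep the header and every row whose
# "status" column is not "neutral". Column names are assumptions.
drop_neutral() {
  awk -F'\t' '
    # Header line: locate the status column, then print the header.
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == "status") col = i
      print
      next
    }
    # Data lines: print only when a status column was found and the
    # call is not neutral.
    col && $col != "neutral"
  '
}

printf 'biospecimen_id\tstatus\tgene_symbol\nS1\tgain\tGENE1\nS1\tneutral\tGENE2\nS2\tloss\tGENE3\n' \
  | drop_neutral
```

For a gzipped table, the same function could read from `gzip -dc some_file.tsv.gz`, where the file name is a placeholder.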

Folder Structure

focal-cn-file-preparation
├── 01-add-ploidy-cnvkit.Rmd
├── 01-add-ploidy-cnvkit.nb.html
├── 02-add-ploidy-consensus.Rmd
├── 02-add-ploidy-consensus.nb.html
├── 03-add-cytoband-status-consensus.Rmd
├── 03-add-cytoband-status-consensus.nb.html
├── 04-prepare-cn-file.R
├── 05-define-most-focal-cn-units.Rmd
├── 05-define-most-focal-cn-units.nb.html
├── 06-find-recurrent-calls.Rmd
├── 06-find-recurrent-calls.nb.html
├── README.md
├── annotation_files
│   └── txdb_from_gencode.v27.gtf.db
├── display-plots.md
├── driver-lists
├── gistic-results
│   └── pbta-cnv-cnvkit-gistic
│       ├── D.cap1.5.mat
│       ├── all_data_by_genes.txt
│       ├── all_lesions.conf_90.txt
│       ├── all_thresholded.by_genes.txt
│       ├── amp_genes.conf_90.txt
│       ├── amp_qplot.pdf
│       ├── amp_qplot.png
│       ├── broad_data_by_genes.txt
│       ├── broad_gistic_plot.pdf
│       ├── broad_significance_results.txt
│       ├── broad_values_by_arm.txt
│       ├── del_genes.conf_90.txt
│       ├── del_qplot.pdf
│       ├── del_qplot.png
│       ├── focal_dat.0.98.mat
│       ├── focal_data_by_genes.txt
│       ├── freqarms_vs_ngenes.pdf
│       ├── gistic_inputs.mat
│       ├── peak_regs.mat
│       ├── perm_ads.mat
│       ├── raw_copy_number.pdf
│       ├── raw_copy_number.png
│       ├── regions_track.conf_90.bed
│       ├── sample_cutoffs.txt
│       ├── sample_seg_counts.txt
│       ├── scores.0.98.mat
│       ├── scores.gistic
│       └── wide_peak_regs.mat
├── plots
│   ├── cnvkit_annotated_cn_autosomes_polya_loss_cor_plot.png
│   ├── cnvkit_annotated_cn_autosomes_polya_stacked_plot.png
│   ├── cnvkit_annotated_cn_autosomes_polya_zero_cor_plot.png
│   ├── cnvkit_annotated_cn_autosomes_stranded_loss_cor_plot.png
│   ├── cnvkit_annotated_cn_autosomes_stranded_stacked_plot.png
│   ├── cnvkit_annotated_cn_autosomes_stranded_zero_cor_plot.png
│   ├── cnvkit_annotated_cn_x_and_y_polya_loss_cor_plot.png
│   ├── cnvkit_annotated_cn_x_and_y_polya_stacked_plot.png
│   ├── cnvkit_annotated_cn_x_and_y_polya_zero_cor_plot.png
│   ├── cnvkit_annotated_cn_x_and_y_stranded_loss_cor_plot.png
│   ├── cnvkit_annotated_cn_x_and_y_stranded_stacked_plot.png
│   ├── cnvkit_annotated_cn_x_and_y_stranded_zero_cor_plot.png
│   ├── consensus_seg_annotated_cn_autosomes_polya_loss_cor_plot.png
│   ├── consensus_seg_annotated_cn_autosomes_polya_stacked_plot.png
│   ├── consensus_seg_annotated_cn_autosomes_polya_zero_cor_plot.png
│   ├── consensus_seg_annotated_cn_autosomes_stranded_loss_cor_plot.png
│   ├── consensus_seg_annotated_cn_autosomes_stranded_stacked_plot.png
│   ├── consensus_seg_annotated_cn_autosomes_stranded_zero_cor_plot.png
│   ├── consensus_seg_annotated_cn_x_and_y_polya_loss_cor_plot.png
│   ├── consensus_seg_annotated_cn_x_and_y_polya_stacked_plot.png
│   ├── consensus_seg_annotated_cn_x_and_y_polya_zero_cor_plot.png
│   ├── consensus_seg_annotated_cn_x_and_y_stranded_loss_cor_plot.png
│   ├── consensus_seg_annotated_cn_x_and_y_stranded_stacked_plot.png
│   ├── consensus_seg_annotated_cn_x_and_y_stranded_zero_cor_plot.png
│   ├── controlfreec_annotated_cn_autosomes_polya_loss_cor_plot.png
│   ├── controlfreec_annotated_cn_autosomes_polya_stacked_plot.png
│   ├── controlfreec_annotated_cn_autosomes_polya_zero_cor_plot.png
│   ├── controlfreec_annotated_cn_autosomes_stranded_loss_cor_plot.png
│   ├── controlfreec_annotated_cn_autosomes_stranded_stacked_plot.png
│   ├── controlfreec_annotated_cn_autosomes_stranded_zero_cor_plot.png
│   ├── controlfreec_annotated_cn_x_and_y_polya_loss_cor_plot.png
│   ├── controlfreec_annotated_cn_x_and_y_polya_stacked_plot.png
│   ├── controlfreec_annotated_cn_x_and_y_polya_zero_cor_plot.png
│   ├── controlfreec_annotated_cn_x_and_y_stranded_loss_cor_plot.png
│   ├── controlfreec_annotated_cn_x_and_y_stranded_stacked_plot.png
│   └── controlfreec_annotated_cn_x_and_y_stranded_zero_cor_plot.png
├── results
│   ├── cnvkit_annotated_cn_autosomes.tsv.gz
│   ├── cnvkit_annotated_cn_x_and_y.tsv.gz
│   ├── consensus_seg_annotated_cn_autosomes.tsv.gz
│   ├── consensus_seg_annotated_cn_x_and_y.tsv.gz
│   ├── consensus_seg_most_focal_cn_status.tsv.gz
│   ├── consensus_seg_recurrent_focal_cn_units.tsv
│   ├── consensus_seg_with_ucsc_cytoband_status.tsv.gz
│   ├── controlfreec_annotated_cn_autosomes.tsv.gz
│   └── controlfreec_annotated_cn_x_and_y.tsv.gz
├── rna-expression-validation.R
├── run-bedtools.snakemake
├── run-prepare-cn.sh
└── util
    └── rna-expression-functions.R