OpenPBTA-analysis

Data Formats in Data Download

The release notes for each release are provided in the release-notes.md file that accompanies the data files. A table with brief descriptions for each data file is provided in the data-files-description.md file included in the download.

PBTA Data Files

PBTA data files are all files derived from samples (e.g., tumors, cell lines) that are processed upstream of this repository and are not the product of any analysis code in the AlexsLemonade/OpenPBTA-analysis repository.

Quality Control Data

MendQC output files *readDist.txt and *bam_umend_qc.tsv, along with a manifest mapping filename to biospecimen, are provided.

Somatic Single Nucleotide Variant (SNV) Data

Somatic Single Nucleotide Variant (SNV) data are provided in Annotated MAF format files for each of the applied software packages and denoted with the pbta-snv prefix. Briefly, VCFs are VEP annotated and converted to MAF format via vcf2maf to produce the following files.

TCGA

A subset of TCGA brain tumor MAF files are also provided and denoted with the pbta-tcga-snv prefix:

The manifest file pbta-tcga-manifest.tsv contains primary diagnosis information as well as columns denoting which BED files correspond to each sample. Capture_Kit refers to the BED file(s) obtained from GDC (hg19). BED_In_Use is the file used for mutation calling and the one that should be used for downstream analyses (lifted over to hg38). For some samples, the capture kit was not uniquely identified by the GDC (denoted by | in Capture_Kit) and the intersection of multiple BEDs was used. These new BED files are included in the data download.
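
As a rough illustration of how the Capture_Kit and BED_In_Use columns might be read, the sketch below parses a small made-up manifest and flags samples whose capture kit was not uniquely identified. Only those two column names come from the description above; the `sample` column name, the BED file names, and the rows are invented for this example.

```python
import csv
import io

# Toy stand-in for pbta-tcga-manifest.tsv. Only Capture_Kit and BED_In_Use
# are column names taken from the text; "sample" and all values are made up.
MANIFEST_TSV = (
    "sample\tCapture_Kit\tBED_In_Use\n"
    "S1\tkit_a.bed\tkit_a.hg38.bed\n"
    "S2\tkit_a.bed|kit_b.bed\tkit_a_kit_b_intersect.hg38.bed\n"
)


def read_manifest(handle):
    """Yield one dict per sample, splitting Capture_Kit on '|'."""
    for row in csv.DictReader(handle, delimiter="\t"):
        kits = row["Capture_Kit"].split("|")
        yield {
            "sample": row["sample"],
            "capture_kits": kits,
            # More than one kit means the GDC did not uniquely identify it.
            "ambiguous_kit": len(kits) > 1,
            # BED_In_Use (hg38) is the file to use downstream.
            "bed_in_use": row["BED_In_Use"],
        }


records = list(read_manifest(io.StringIO(MANIFEST_TSV)))
ambiguous = [r["sample"] for r in records if r["ambiguous_kit"]]
print(ambiguous)
```

For downstream work, the value in `bed_in_use` is the one to carry forward, whether or not the capture kit was ambiguous.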

Somatic Copy Number Variant (CNV) Data

Somatic Copy Number Variant (CNV) data are provided in a modified SEG format for each of the applied software packages and denoted with the pbta-cnv prefix. Somatic copy number data are only generated for whole genome sequencing (WGS) samples.

A Note on Ploidy

The copy number in the CNVkit SEG file is annotated with respect to ploidy 2. The status in the ControlFreeC TSV file, however, is annotated with respect to the ploidy inferred by the algorithm, which is recorded in the pbta-histologies.tsv file. See the table below for examples of possible interpretations.

| Ploidy | Copy Number | Gain/Loss Interpretation |
|--------|-------------|--------------------------|
| 2 | 0 | Deep deletion; homozygous deletion |
| 2 | 1 | Loss; hemizygous deletion |
| 2 | 2 | Copy neutral |
| 2 | 3 | Gain; one copy gain |
| 2 | 4 | Gain; two copy gain |
| 2 | 5+ | Amplification; possible amplification |
| 3 | 0 | Deep deletion; 3 copy loss |
| 3 | 1 | Loss; 2 copy loss |
| 3 | 2 | Loss; 1 copy loss |
| 3 | 3 | Copy neutral |
| 3 | 4 | Gain; one copy gain |
| 3 | 5 | Gain; two copy gain |
| 3 | 6 | Gain; three copy gain |
| 3 | 7+ | Amplification; possible amplification |
| 4 | 0 | Deep deletion; 4 copy loss |
| 4 | 1 | Loss; 3 copy loss |
| 4 | 2 | Loss; 2 copy loss |
| 4 | 3 | Loss; 1 copy loss |
| 4 | 4 | Copy neutral |
| 4 | 5 | Gain; one copy gain |
| 4 | 6 | Gain; two copy gain |
| 4 | 7 | Gain; three copy gain |
| 4 | 8+ | Amplification; possible amplification |
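
The table above can be read as a small lookup. The following is a minimal sketch that encodes only the rows shown, so it covers ploidy 2, 3, and 4 and nothing else. The function name and the short labels are illustrative, not part of any released file or upstream tool.

```python
# Lowest copy number labeled "Amplification" for each ploidy, copied from the
# table above. Ploidies outside the table are deliberately not handled.
AMPLIFICATION_FROM = {2: 5, 3: 7, 4: 8}


def interpret_copy_number(ploidy, copy_number):
    """Return the table's gain/loss label for a ploidy and copy number."""
    if ploidy not in AMPLIFICATION_FROM:
        raise ValueError(f"ploidy {ploidy} is not covered by the table")
    if copy_number < 0:
        raise ValueError("copy number cannot be negative")
    if copy_number == 0:
        return "deep deletion"
    if copy_number < ploidy:
        return "loss"
    if copy_number == ploidy:
        return "copy neutral"
    if copy_number < AMPLIFICATION_FROM[ploidy]:
        return "gain"
    return "amplification"


# The same copy number reads differently depending on the reference ploidy.
for ploidy in (2, 3, 4):
    print(ploidy, interpret_copy_number(ploidy, 3))
```

A copy number of 3 is a gain at ploidy 2, copy neutral at ploidy 3, and a loss at ploidy 4, which is why the inferred ploidy matters when comparing ControlFreeC status with CNVkit copy number.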

Somatic Structural Variant (SV) Data

Somatic Structural Variant (SV) data are provided in the Annotated Manta TSV format produced by the applied software packages and denoted with the pbta-sv prefix. Somatic structural variant data are only generated for whole genome sequencing (WGS) samples.

Gene Expression Data

Gene expression estimates from the applied software packages are provided as a feature (e.g., gene or transcript) by sample matrix. For each method/measure (e.g., RSEM TPM, RSEM isoform counts), two matrices are provided: one for each library selection method (poly-A, stranded). See this notebook for more information about why this was necessary. Gene expression estimates are available in multiple forms in the following files:

See the data description file for more information about the individual gene expression files.

If your analysis requires de-duplicated gene symbols as row names, please use the collapsed matrices provided as part of the data download (see below).

STAR *Log.final.out files, along with a manifest mapping filename to biospecimen, are provided (see section 4.1 here).

Gene Fusion Data

Gene fusions produced by the applied software packages are provided as Arriba TSV and STARFusion TSV files. These files are denoted with the prefix pbta-fusion.

Harmonized Clinical Data

Harmonized clinical data are released as tab separated values in the following file:

Mapping Between DNA-seq and RNA-seq Data for the Same Sample

Many analyses rely on examining DNA-seq (e.g., WGS/WXS/Panel) and RNA-seq data together. The identifiers in the Kids_First_Biospecimen_ID column of the pbta-histologies.tsv file will be non-overlapping for different experimental strategies, as these identifiers map to a library or assay. Analysts should use the identifiers in the sample_id column to connect DNA-seq and RNA-seq assays from the same sample.

For an example, see the sample_id mapping step of the OncoPrint pipeline.
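
A minimal sketch of the same idea follows, using a made-up excerpt rather than the real file. Kids_First_Biospecimen_ID and sample_id are the columns named above; the `experimental_strategy` column name, its values, and all identifiers are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for pbta-histologies.tsv. Kids_First_Biospecimen_ID and
# sample_id come from the text; the experimental_strategy column name and
# every value below are assumptions made for this example.
HISTOLOGIES_TSV = (
    "Kids_First_Biospecimen_ID\tsample_id\texperimental_strategy\n"
    "BS_DNA1\tsample-1\tWGS\n"
    "BS_RNA1\tsample-1\tRNA-Seq\n"
    "BS_DNA2\tsample-2\tWXS\n"
)


def pair_dna_and_rna(handle):
    """Map each sample_id to its DNA-seq and RNA-seq biospecimen IDs."""
    by_sample = defaultdict(lambda: {"dna": [], "rna": []})
    for row in csv.DictReader(handle, delimiter="\t"):
        assay = "rna" if row["experimental_strategy"] == "RNA-Seq" else "dna"
        by_sample[row["sample_id"]][assay].append(row["Kids_First_Biospecimen_ID"])
    # Keep only samples that have both kinds of assay.
    return {s: ids for s, ids in by_sample.items() if ids["dna"] and ids["rna"]}


paired = pair_dna_and_rna(io.StringIO(HISTOLOGIES_TSV))
print(paired)
```

Here only `sample-1` has both a DNA-seq and an RNA-seq biospecimen, so it is the only sample returned.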

Note that some individual participants (tracked via the Kids_First_Participant_ID column) will have multiple samples included in the PBTA dataset. Please see the independent specimens files section for more information.

Analysis Files

Analysis files are created by a script in analyses/*. They can be viewed as derivatives of PBTA data files.

Consensus Mutation Files

Consensus mutation files are products of the analyses/snv-callers analysis module.

Mutation hotspots

Mutation calls that overlap hotspots from the MSKCC cancer hotspot database or the TERT promoter region are retained even if called by only one caller. Vardict-only calls are excluded because Vardict uniquely calls a large number (~39 million) of very low VAF mutations, as discussed here, which suggests these could be false calls.
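
The retention rule can be expressed as a small predicate. This is a sketch of the rule as worded above, not the code used in the analyses/snv-callers module; the function name, its arguments, and the placeholder caller name are hypothetical, and the hotspot overlap test itself is not shown.

```python
def rescued_by_hotspot_rule(callers, overlaps_hotspot):
    """Return True if a call is retained by the hotspot rule described above.

    callers: names of the callers that reported the mutation.
    overlaps_hotspot: whether the call overlaps an MSKCC hotspot or the
        TERT promoter region (the overlap test itself is not shown here).
    """
    caller_set = {name.lower() for name in callers}
    if not overlaps_hotspot or not caller_set:
        return False  # the rule does not apply; other filters decide
    # A single caller is enough, unless that single caller is Vardict.
    return caller_set != {"vardict"}


# "caller_x" is a placeholder name, not a real caller.
print(rescued_by_hotspot_rule(["caller_x"], True))
print(rescued_by_hotspot_rule(["Vardict"], True))
print(rescued_by_hotspot_rule(["Vardict", "caller_x"], True))
```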

Collapsed Expression Matrices

Collapsed expression matrices are products of the analyses/collapse-rnaseq analysis module. In cases where more than one Ensembl gene identifier maps to the same gene symbol, the instance of the gene symbol with the maximum mean FPKM in the RSEM FPKM file is retained to produce the following files:
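
To make the selection rule concrete, here is a small sketch on a made-up matrix. It is not the code from analyses/collapse-rnaseq; the `gene_id` and `gene_symbol` column names, the identifiers, and the FPKM values are all invented for illustration.

```python
import csv
import io
from statistics import mean

# Toy gene-by-sample FPKM matrix. The gene_id and gene_symbol column names,
# the identifiers and the values are all made up for this example.
FPKM_TSV = (
    "gene_id\tgene_symbol\tsample_1\tsample_2\n"
    "ENSG_A\tGENE1\t1.0\t3.0\n"
    "ENSG_B\tGENE1\t5.0\t7.0\n"
    "ENSG_C\tGENE2\t2.0\t2.0\n"
)


def collapse_by_symbol(handle):
    """Keep, for each gene symbol, the row with the highest mean FPKM."""
    best = {}  # gene symbol -> (mean FPKM, gene ID, per-sample values)
    for row in csv.DictReader(handle, delimiter="\t"):
        gene_id = row.pop("gene_id")
        symbol = row.pop("gene_symbol")
        values = {sample: float(v) for sample, v in row.items()}
        row_mean = mean(values.values())
        if symbol not in best or row_mean > best[symbol][0]:
            best[symbol] = (row_mean, gene_id, values)
    return {
        symbol: {"gene_id": gene_id, "fpkm": values}
        for symbol, (_, gene_id, values) in best.items()
    }


collapsed = collapse_by_symbol(io.StringIO(FPKM_TSV))
print(collapsed["GENE1"]["gene_id"])
```

In this toy input, two identifiers map to GENE1, and the one with the higher mean FPKM (6.0 versus 2.0) is kept.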

Independent Specimen Lists

Independent specimen lists are the products of the analyses/independent-samples analysis module. For participants with multiple tumor specimens, independent specimen lists are provided as TSV files with columns for participant ID (Kids_First_Participant_ID) and specimen ID (Kids_First_Biospecimen_ID). The methods are described here. These files are used for analyses such as mutation co-occurrence, where repeated samples might cause bias.

Note that these independent specimen files do not address the issue of participants with multiple tumor specimens in RNA-seq data at this time.
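
One way such a list might be applied is to restrict a per-specimen table to the listed biospecimens before counting anything. The sketch below uses the two column names given above; the identifiers, the `gene` field, and the mutation records are made up for this example.

```python
import csv
import io

# Toy independent specimen list with the two columns named in the text;
# the identifiers are made up.
INDEPENDENT_TSV = (
    "Kids_First_Participant_ID\tKids_First_Biospecimen_ID\n"
    "PT_1\tBS_A\n"
    "PT_2\tBS_C\n"
)

# Hypothetical per-specimen records, e.g. rows of a mutation table.
mutations = [
    {"Kids_First_Biospecimen_ID": "BS_A", "gene": "GENE1"},
    {"Kids_First_Biospecimen_ID": "BS_B", "gene": "GENE1"},  # not on the list
    {"Kids_First_Biospecimen_ID": "BS_C", "gene": "GENE2"},
]

reader = csv.DictReader(io.StringIO(INDEPENDENT_TSV), delimiter="\t")
independent_ids = {row["Kids_First_Biospecimen_ID"] for row in reader}

# Keep only records from specimens on the independent list.
independent_mutations = [
    m for m in mutations if m["Kids_First_Biospecimen_ID"] in independent_ids
]
print([m["Kids_First_Biospecimen_ID"] for m in independent_mutations])
```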

Derived Fusion Files

The filtered and prioritized fusion and downstream files are a product of the analyses/fusion_filtering analysis module.

Derived Copy Number Files

Consensus Copy Number File

Copy number consensus calls from the copy number and structural variant callers are a product of the analyses/copy_number_consensus_call analysis module.

Focal Copy Number Files

Focal copy number files map the consensus calls (genomic segments) above to genes for downstream analysis and are a product of the analyses/focal-cn-file-preparation analysis module. Note: these files contain only biospecimens and genes with copy number changes; neutral regions are excluded.

GISTIC Output File Formats

pbta-cnv-cnvkit-gistic.zip is the output of running GISTIC 2.0 on the CNVkit results (pbta-cnv-cnvkit.seg). pbta-cnv-consensus-gistic.zip is the output of running GISTIC 2.0 on the CNV consensus calls (pbta-cnv-consensus.seg.gz), described above. The scripts used to run GISTIC are linked here: CNVkit and Consensus calls.

Note that GISTIC is run on the entire cohort and therefore the output reflects regions that are significantly amplified or deleted across the entire cohort.

The GISTIC output data files below, which are commonly leveraged for downstream analyses, are described in more detail on the Broad Institute’s GenePattern website.

Additional relevant output files are described below:
Use cases for these files include:

Data Caveats

The clinical manifest will be updated and versioned as molecular subgroups are identified based on genomic analyses.

Analyses related to molecular subtyping are as follows: