OpenPBTA-analysis

General scripts

Scripts required to prepare for a data release

The overall steps for preparing a data release are as follows:

  1. Start a release (termed release-vX-YYYYMMDD below) that contains all of the PBTA data files (i.e., upstream files).
  2. Run scripts/generate-analysis-files-for-release.sh using the PBTA data files in release-vX-YYYYMMDD and commit any changes to files tracked in the repository.
  3. Add the analysis files in scratch/analysis_files_for_release to release-vX-YYYYMMDD.
  4. Run scripts/run-for-subtyping.sh using the PBTA data files and analysis files in release-vX-YYYYMMDD and commit any changes to files tracked in the repository.
  5. Add pbta-histologies.tsv to release-vX-YYYYMMDD.

For definitions of the kinds of files in data releases, please see this documentation.

Analysis file generation

Running the following from this directory will generate all analysis files that are included in data releases and compile them in scratch/analysis_files_for_release for convenience:

bash generate-analysis-files-for-release.sh

This script also generates a file that contains the MD5 checksums for the analysis files (scratch/analysis_files_for_release/analysis_files_md5sum.txt).
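
The checksum file can be used to confirm that the analysis files are intact after they are copied into a release. The following is a minimal sketch, not part of the repository: it assumes the checksum file is in the standard output format of GNU md5sum with paths relative to its own directory, and the function name verify_release_checksums is hypothetical.

```shell
# Sketch: verify analysis files against the recorded MD5 checksums.
# Assumption: analysis_files_md5sum.txt holds standard md5sum output
# ("<hash>  <filename>" per line) with paths relative to its directory.
verify_release_checksums() {
  # Default path assumes the function is called from the root of the repository.
  local dir="${1:-scratch/analysis_files_for_release}"
  # Use a subshell so the caller's working directory is unchanged.
  (cd "$dir" && md5sum -c analysis_files_md5sum.txt)
}
```

Calling verify_release_checksums with the path to the directory returns a non-zero exit status if any listed file is missing or does not match its recorded checksum.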

Notes

RUN_LOCAL=1 bash generate-analysis-files-for-release.sh

Molecular subtyping

Molecular subtyping as part of data release can be run with the following from this directory:

bash run-for-subtyping.sh

This will re-run subtyping for the following broad histologies:

It will also run any analysis steps used for subtyping that do not generate files included in a release, followed by the molecular-subtyping-pathology and molecular-subtyping-integrate modules, which generate the compiled_molecular_subtypes_with_clinical_pathology_feedback.tsv file containing the molecular_subtype column.

Adding summary analyses to run-for-subtyping.sh

For an analysis to be run for subtyping, it must use pbta-histologies-base.tsv as input, and it should not depend on the molecular_subtype or integrated_diagnosis columns for molecular-subtyping-* modules.

Please use OPENPBTA_BASE_SUBTYPING=1 as the condition for running code with pbta-histologies-base.tsv.

Here is an example from the TP53 classifier module (assumes it is run from the root of the repository):

OPENPBTA_BASE_SUBTYPING=1 bash analyses/tp53_nf1_score/run_classifier.sh
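
Inside a module's run script, the condition might look something like the sketch below. This is an illustration rather than the code of any existing module: the data directory default, the function name select_histologies_file, and the HISTOLOGIES_FILE variable are all assumptions.

```shell
# Sketch: choose the histologies file based on OPENPBTA_BASE_SUBTYPING.
# Assumption: histologies files live in a directory passed as the first
# argument ("data" by default); this location is hypothetical.
select_histologies_file() {
  local data_dir="${1:-data}"
  # Treat an unset variable as 0, i.e., a regular (non-subtyping) run.
  if [[ "${OPENPBTA_BASE_SUBTYPING:-0}" == "1" ]]; then
    echo "${data_dir}/pbta-histologies-base.tsv"
  else
    echo "${data_dir}/pbta-histologies.tsv"
  fi
}

HISTOLOGIES_FILE="$(select_histologies_file)"
echo "Using histologies file: ${HISTOLOGIES_FILE}"
```

Defaulting the variable to 0 keeps the module's ordinary behavior unchanged when it is run outside of run-for-subtyping.sh.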

Generating analysis files for the manuscript

Once a new data release has been cut, analysis modules should be re-run with it. Specifically, this covers non-deprecated analyses that appear in the manuscript, as well as certain analyses run in generate-analysis-files-for-release.sh that either export output files to scratch/ that are needed for figure generation or require disease label information from the released histologies file. Note that subtyping modules do not need to be re-run, since subtyping was performed to create the data release itself. The script run-manuscript-analyses.sh can be used for this purpose:

bash run-manuscript-analyses.sh

By default, this script will run all relevant analyses as described. However, some of those analyses require more memory than is generally available on local machines. To run only the analyses that can be run locally, set RUN_LOCAL=1:

RUN_LOCAL=1 bash run-manuscript-analyses.sh
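
As a rough illustration of how a flag like RUN_LOCAL can gate memory-intensive steps, consider the sketch below. It is not taken from run-manuscript-analyses.sh: the module names and the function list_modules_to_run are hypothetical placeholders.

```shell
# Sketch: skip memory-intensive steps when RUN_LOCAL=1.
# The module names below are hypothetical placeholders, not real modules.
list_modules_to_run() {
  # Steps that are assumed to fit on a local machine always run.
  echo "lightweight-module"
  # Treat an unset RUN_LOCAL as 0, i.e., run everything.
  if [[ "${RUN_LOCAL:-0}" != "1" ]]; then
    echo "memory-intensive-module"
  fi
}

list_modules_to_run
```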

Other scripts