On this page, we’ve assembled some resources you may find helpful for working with bulk transcriptomic data, including bulk RNA-seq and microarray data.

Obtaining practice datasets from refine.bio

The Childhood Cancer Data Lab built and maintains refine.bio, a resource of uniformly processed transcriptomic data obtained from publicly available sources. You can read more about how we process data in refine.bio in our documentation.

If you’d like to practice some of the skills we cover in training or gain some additional ones like making highly customizable heatmaps with the ComplexHeatmap R package, obtaining processed data from refine.bio is a great starting point. You may find our examples for working with data from refine.bio helpful as you look to practice and expand your skills. In those examples, we use R Notebooks, which you will be familiar with from this workshop! See the “Getting Started” section for more information on utilizing our example notebooks.

You can start by searching refine.bio for keywords relevant to your scientific questions and filtering to the organism and technology (e.g., microarray vs. RNA-seq; refine.bio contains both) you’re interested in.

Bulk RNA-seq data on refine.bio

The format of the RNA-seq data you can download from the refine.bio web interface will be slightly different from what we cover in training. We summarize our data to the gene level with tximport (docs), instead of tximeta like we do in training, before you download it. When downloading your data from refine.bio, we recommend checking the box that says “Skip quantile normalization for RNA-seq samples” to obtain the non-quantile normalized data (docs). You will receive a TSV file that you can use as the counts matrix input for a DESeqDataSet. Note that we recommend using non-quantile normalized data because the DESeqDataSetFromMatrix() function requires a counts matrix and not a matrix of normalized or corrected values like TPMs. See this nice DESeq2 vignette for more information (Love et al., 2014). You can read more about using DESeq2 with refine.bio data here.

If you identify an RNA-seq experiment from refine.bio that you’d like to use with DESeq2 (specifically with DESeqDataSetFromMatrix()), you can begin by following the instructions in the “Obtain the dataset from refine.bio” section of any of our RNA-seq refine.bio example notebooks and continue following the steps up until the “Create a DESeqDataset” section, as these steps remain pretty much the same across notebooks. Note that you will also need the associated metadata file, which is included in your download in a TSV file that starts with metadata_, to create a DESeqDataSet object.
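Before loading the two files into R, it can help to confirm that they describe the same samples. The following is a minimal sketch rather than part of the refine.bio example notebooks. It assumes that the first column of the expression TSV holds gene identifiers, that every other column header is a sample ID, and that the metadata TSV stores sample IDs in a column named refinebio_accession_code. Check your own download before relying on that column name; the function name check_samples is ours.

```shell
#!/usr/bin/env bash
# Compare the sample IDs in an expression TSV header with a metadata column.
# Assumptions (not stated in the text above): column 1 of the expression
# file is the gene ID, and the metadata sample IDs live in the column named
# by the third argument (default: refinebio_accession_code).

check_samples() {
  local expr_file="$1" meta_file="$2" id_col="${3:-refinebio_accession_code}"

  # Sample IDs from the expression header: every field after the first.
  local expr_ids
  expr_ids=$(head -n 1 "$expr_file" | tr -d '\r' | tr '\t' '\n' | tail -n +2 | sort)

  # Sample IDs from the metadata: find the ID column by name, then print it.
  local meta_ids
  meta_ids=$(tr -d '\r' < "$meta_file" | awk -F '\t' -v col="$id_col" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
    c { print $c }' | sort)

  if [ "$expr_ids" = "$meta_ids" ]; then
    echo "OK: sample IDs match"
  else
    echo "MISMATCH: sample IDs differ between files"
    return 1
  fi
}
```

You would call it as check_samples followed by the path to the expression TSV and the path to the metadata_ TSV. The check compares the two sets of IDs only; DESeq2 also expects the columns of the counts matrix and the rows of the sample metadata to be in the same order, so you still need to align them in R.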

Microarray data on refine.bio

In this version of our workshop, we won’t work with microarray data, but there are hundreds of thousands of microarray samples available from refine.bio. The microarray datasets you can download from the refine.bio web interface are quantile normalized and are distributed as TSV files you can read into R using functions we cover in training. The metadata is included in your download in a TSV file that starts with metadata_. You may find our microarray example notebooks for working with refine.bio data helpful with your differential expression, dimension reduction, or GSEA pathway analyses, to name a few. Note that our training material is largely RNA-seq specific, so if you obtain microarray data from refine.bio, you should not expect to use the exact same code as we do in training.

Transcriptome indices for common organisms

We have prepared transcriptome indices for select organisms frequently used in childhood cancer studies, including human, mouse, zebrafish, and dog. Note that for most of these files, you will need to perform a few extra steps to read in the quantification data with tximeta after performing quantification. Please see the notebook RNA-seq/00c-tximeta_other_species.Rmd for details on how to set this up.

If you have RNA-seq data for an organism that is not listed, please post in the training-specific Slack channel and let your instructors know.

Homo sapiens

Ensembl GRCh38 (hg38) v95

| File description | File use | File path |
|---|---|---|
| Human Salmon index -k 23 | Salmon index for use with salmon quant; appropriate for reads shorter than 75bp or for increased sensitivity with --validateMappings (docs) | ~/shared-data/reference/refgenie/hg38_cdna/salmon_index/short |
| Human Salmon index -k 31 | Salmon index for use with salmon quant; appropriate for reads 75bp or longer (docs) | ~/shared-data/reference/refgenie/hg38_cdna/salmon_index/long |

Mus musculus

Ensembl GRCm38 (mm10) v95

| File description | File use | File path |
|---|---|---|
| Mouse Salmon index -k 23 | Salmon index for use with salmon quant; appropriate for reads shorter than 75bp or for increased sensitivity with --validateMappings (docs) | ~/shared-data/reference/refgenie/mm10_cdna/salmon_index/short |
| Mouse Salmon index -k 31 | Salmon index for use with salmon quant; appropriate for reads 75bp or longer (docs) | ~/shared-data/reference/refgenie/mm10_cdna/salmon_index/long |

Danio rerio

Ensembl GRCz11 v95

| File description | File use | File path |
|---|---|---|
| Zebrafish Salmon index -k 23 | Salmon index for use with salmon quant; appropriate for reads shorter than 75bp or for increased sensitivity with --validateMappings (docs) | ~/shared-data/reference/refgenie/z11_cdna/salmon_index/short |
| Zebrafish Salmon index -k 31 | Salmon index for use with salmon quant; appropriate for reads 75bp or longer (docs) | ~/shared-data/reference/refgenie/z11_cdna/salmon_index/long |

Canis lupus familiaris

Ensembl CanFam3.1 v95

| File description | File use | File path |
|---|---|---|
| Dog Salmon index -k 23 | Salmon index for use with salmon quant; appropriate for reads shorter than 75bp or for increased sensitivity with --validateMappings (docs) | ~/shared-data/reference/refgenie/CanFam3p1_cdna/salmon_index/short |
| Dog Salmon index -k 31 | Salmon index for use with salmon quant; appropriate for reads 75bp or longer (docs) | ~/shared-data/reference/refgenie/CanFam3p1_cdna/salmon_index/long |
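
The read-length rule in the tables above can be sketched as a small shell helper. This is an illustration under stated assumptions, not a command from the training notebooks: the function name pick_index, the FASTQ file names, the output directory, and the choice of -l A (automatic library type detection) are all placeholders you would adapt. The salmon command is only echoed here as a dry run; remove the leading echo to run it.

```shell
#!/usr/bin/env bash
# Choose a Salmon index directory from an organism prefix and a read length,
# following the tables: shorter than 75 bp -> short (-k 23) index,
# 75 bp or longer -> long (-k 31) index.
# Valid prefixes from the tables: hg38, mm10, z11, CanFam3p1.

index_root="${HOME}/shared-data/reference/refgenie"

pick_index() {
  local prefix="$1" read_length="$2"
  if [ "$read_length" -lt 75 ]; then
    echo "${index_root}/${prefix}_cdna/salmon_index/short"
  else
    echo "${index_root}/${prefix}_cdna/salmon_index/long"
  fi
}

# Hypothetical paired-end human sample with 100 bp reads.
# The FASTQ names and output directory are placeholders.
index_dir=$(pick_index hg38 100)

# Dry run: print the command instead of executing it.
echo salmon quant -i "$index_dir" -l A \
  -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
  --validateMappings -o quant/sample
```

The helper encodes only the read-length rule. As the tables note, the short index is also an option when you want increased sensitivity with --validateMappings, in which case you would pass its path directly.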