Data analyses are generally not “one size fits all”; this is particularly true when with approaches used to analyze RNA-seq and microarray data. The characteristics of the data produced by these two technologies can be quite different. This tutorial has example analyses organized by technology so you can follow examples that are more closely tailored to the nature of the data at hand.

Table of Contents generated with DocToc

0.1 Introduction to RNA-seq technology

RNA-sequencing measures gene expression by direct high-throughput sequencing methods after the RNA has been isolated from a sample.

(based on diagram from “An introduction to RNA-Seq methods, applications, experimental design, and technical challenges” 2015)

The diagram above provides a brief overview of the RNA-seq process, which includes extracting the total RNA from a tissue population, isolating the specific RNA species, converting the RNA to cDNA, and constructing a sequencing library to perform PCR amplification and sequencing. As with all experimental methods, RNA-seq has strengths and limitations that you should consider in regards to your scientific questions.

0.1.1 RNA-seq data strengths

  • RNA-seq can assay unknown transcripts, as it is not bound to a pre-determined set of probes like microarrays (Wang et al. 2009).
  • Its values are considered more dynamic than microarray values which are constrained to a smaller range based on background signal and probe sets being saturated (Wang et al. 2009).

0.1.2 RNA-seq data limitations/biases

The nature of sequencing introduces several different kinds of biases:

  • GC bias: higher GC content sequences are less likely to be observed.
  • 3’ bias (positional bias): for most sequencing methods, the 3 prime end of transcripts are more likely to be observed.
  • Complexity bias: some sequences are easier to be bound and amplified than others.
  • Library size or sequencing depth: the total number of reads is not always equivalent between samples.
  • Gene length: longer genes are more likely to be observed.

Love (2016) discusses these biases in this blog post which includes this handy figure.

Most normalization methods, including refine.bio’s processing methods, attempt to mitigate these biases, but these biases can never be fully negated. Some of these biases have been addressed to the extent that they can by our refine.bio processing methods so you don’t have to worry too much about them. In brief, refine.bio data is quantified by Salmon using their correction algorithms: --seqbias and --gcbias.

0.1.3 About quantile normalization

refine.bio data is available for you quantile normalized, which can address some library size biases. But more often than not, our example modules will recommend using the option for downloading non-quantile normalized data (note that this is RNA-seq specific, and microarray data does not have this download option).

See here for more about the quantile normalization process in refine.bio.

0.2 About DESeq2

DESeq2 is an R package that is built for differential expression analysis but also has other useful functions for prepping RNA-seq counts data (Love et al. 2014). Our refine.bio data is summarized to the gene-level with tximport before you download it (Soneson et al. 2015) which is also from the same creators as DESeq2, so they are made to play well together. We generally like DESeq2 because it has great documentation and helpful tutorials.

0.2.1 DESeq2 objects

Many R Bioconductor packages have specialized object types they want your data to be formatted as. For DESeq2, before we can use a lot the special functions, we need to get our data into a DESeqDataSet object. DESeqDataSet objects not only store your data, but additional transformations of your data, model information, etc.

From our refine.bio datasets, we will use a function DESeqDataSetFromMatrix() to create our DESeqDataSet objects. This DESeq2 function requires you provide counts and not a normalized or corrected value like TPMs. Which is why our examples advise downloading non-quantile normalized from refine.bio.

0.2.2 DESeq2 transformation methods

Our examples recommend using DESeq2 for normalizing your RNA-seq data. You may have heard about or worked with FPKM, TPM, RPKMs; how does DESeq2’s normalization compare? This handy table from an online Harvard Bioinformatics Core course nicely summarizes and compares these different methods (Harvard Chan Bioinformatics Core). For more about the steps behind DESeq2 normalization, we highly recommend this StatQuest video which explains it quite nicely (Starmer 2017a).

To normalize and transform our data with DESeq2, we generally use vst() (variance stabilizing transformation) or rlog() (regularized logarithm transformation). Both methods are very similar. Both normalize your data by correcting for library size differences but they also transform your data removing the dependence of the variance on the mean, meaning that low mean genes won’t have inflated variance from just one or a few samples having higher values than the rest (Michael I. Love and Huber 2020). Of the two methods, rlog() takes a bit longer to run (Michael I. Love and Huber 2019). If you end up using a larger dataset and rlog() transformation takes a bit too long, you can switch to using vst() with confidence since they yield similar results given the dataset is large enough (Michael I. Love and Huber 2019).

0.3 Common questions

0.3.0.1 Why isn’t the gene I care about in a refine.bio dataset?

If a gene is not detected in any of the samples in a set through our processing, it is still kept in the Gene column, but it will be reported as 0. But if the gene you are interested in does not have an Ensembl ID according to the version of the annotation we used, it will not be reported in any refine.bio download.

0.3.0.2 What about edgeR?

In short, both edgeR and DESeq2 are good options and we at the CCDL just went with one of our preferences! See this blog that summarizes these – by one of the creators of DESeq2 – he agrees edgeR is also great.

If you have strong preferences for edgeR, you can definitely use your refine.bio data with it, but we currently do not have examples of that. In this case, we’d refer you to edgeR’s section of this example analysis and wish you the best of luck on your data adventures (Hansen)!

0.3.0.3 What if I care about isoforms?

Unfortunately at this time, all download-ready refine.bio data is summarized to the gene level, and there’s no great way to examine isoforms with this data. If your research needs to know transcript isoform information, you may need to look elsewhere. This paper discusses some tools for these kinds of questions (Zhang et al. 2017).

References

An introduction to RNA-Seq methods, applications, experimental design, and technical challenges, 2015. https://rna-seqblog.com/an-introduction-to-rna-seq-methods-applications-experimental-design-and-technical-challenges/
Hadfield J., 2016 An introduction to RNA-seq. https://bitesizebio.com/13542/what-everyone-should-know-about-rna-seq/
Hansen K. D., S. E. Brenner, and S. Dudoit, 2010 Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research 38: e131. https://doi.org/10.1093/nar/gkq224
Hansen K. D., Count based RNA-seq analysis. https://kasperdanielhansen.github.io/genbioconductor/html/Count_Based_RNAseq.html
Harvard Chan Bioinformatics Core, Introduction to DGE - count normalization. https://hbctraining.github.io/DGE_workshop_salmon/lessons/02_DGE_count_normalization.html
Harvard Chan Bioinformatics Core, Introduction to DGE - DESeq2 analysis. https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html
Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8
Love M. I., 2016 RNA-seq fragment sequence bias. https://mikelove.wordpress.com/2016/09/26/rna-seq-fragment-sequence-bias/
Love M. I., J. B. Hogenesch, and R. A. Irizarry, 2016 Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation. Nature Biotechnology 34: 1287–1291. https://doi.org/10.1038/nbt.3682
Michael I. Love Simon Anders, and W. Huber, 2014 Beginner’s guide to using the DESeq2 package. https://bioc.ism.ac.jp/packages/2.14/bioc/vignettes/DESeq2/inst/doc/beginner.pdf
Michael I. Love V. K. Simon Anders, and W. Huber, 2019 RNA-seq workflow: Gene-level exploratory analysis and differential expression. http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#the-variance-stabilizing-transformation-and-the-rlog
Michael I. Love Simon Anders, and W. Huber, 2020 Analyzing RNA-seq data with DESeq2. https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
Pepke S., B. Wold, and A. Mortazavi, 2009 Computation for ChIP-seq and RNA-seq studies. Nature Methods 6: 22–32. https://doi.org/10.1038/nmeth.1371
Soneson C., M. I. Love, and M. D. Robinson, 2015 Differential analyses for RNA-seq: Transcript-level estimates improve gene-level inferences. F1000Research 4. https://doi.org/10.12688/f1000research.7563.2
Starmer J., 2017a StatQuest: DESeq2, part 1, library normalization. https://www.youtube.com/watch?v=UFB993xufUU
Starmer J., 2017b StatQuest: A gentle introduction to RNA-seq. https://www.youtube.com/watch?v=tlf6wYJrwKY
Wang Z., M. Gerstein, and M. Snyder, 2009 RNA-Seq: A revolutionary tool for transcriptomics. Nature Reviews Genetics 10: 57–63. https://doi.org/10.1038/nrg2484
Zhang C., B. Zhang, L. L. Lin, and S. Zhao, 2017 Evaluation and comparison of computational tools for RNA-seq isoform quantification. BMC Genomics 18: 583. https://doi.org/10.1186/s12864-017-4002-1