Data analyses are generally not “one size fits all”; this is particularly true when with approaches used to analyze RNA-seq and microarray data. The characteristics of the data produced by these two technologies can be quite different. This tutorial has example analyses organized by technology so you can follow examples that are more closely tailored to the nature of the data at hand.
Table of Contents generated with DocToc
RNA-sequencing measures gene expression by direct high-throughput sequencing methods after the RNA has been isolated from a sample.
(based on diagram from “An introduction to RNA-Seq methods, applications, experimental design, and technical challenges” 2015)
The diagram above provides a brief overview of the RNA-seq process, which includes extracting the total RNA from a tissue population, isolating the specific RNA species, converting the RNA to cDNA, and constructing a sequencing library to perform PCR amplification and sequencing. As with all experimental methods, RNA-seq has strengths and limitations that you should consider in regards to your scientific questions.
The nature of sequencing introduces several different kinds of biases:
Love (2016) discusses these biases in this blog post which includes this handy figure.
Most normalization methods, including refine.bio’s processing methods, attempt to mitigate these biases, but these biases can never be fully negated. Some of these biases have been addressed to the extent that they can by our refine.bio processing methods so you don’t have to worry too much about them. In brief, refine.bio data is quantified by Salmon using their correction algorithms: --seqbias
and --gcbias
.
refine.bio data is available for you quantile normalized, which can address some library size biases. But more often than not, our example modules will recommend using the option for downloading non-quantile normalized data (note that this is RNA-seq specific, and microarray data does not have this download option).
See here for more about the quantile normalization process in refine.bio.
DESeq2 is an R package that is built for differential expression analysis but also has other useful functions for prepping RNA-seq counts data (Love et al. 2014). Our refine.bio data is summarized to the gene-level with tximport before you download it (Soneson et al. 2015) which is also from the same creators as DESeq2, so they are made to play well together. We generally like DESeq2 because it has great documentation and helpful tutorials.
Many R Bioconductor packages have specialized object types they want your data to be formatted as. For DESeq2, before we can use a lot the special functions, we need to get our data into a DESeqDataSet
object. DESeqDataSet
objects not only store your data, but additional transformations of your data, model information, etc.
From our refine.bio datasets, we will use a function DESeqDataSetFromMatrix()
to create our DESeqDataSet
objects. This DESeq2 function requires you provide counts and not a normalized or corrected value like TPMs. Which is why our examples advise downloading non-quantile normalized from refine.bio.
Our examples recommend using DESeq2 for normalizing your RNA-seq data. You may have heard about or worked with FPKM, TPM, RPKMs; how does DESeq2’s normalization compare? This handy table from an online Harvard Bioinformatics Core course nicely summarizes and compares these different methods (Harvard Chan Bioinformatics Core). For more about the steps behind DESeq2 normalization, we highly recommend this StatQuest video which explains it quite nicely (Starmer 2017a).
To normalize and transform our data with DESeq2, we generally use vst()
(variance stabilizing transformation) or rlog()
(regularized logarithm transformation). Both methods are very similar. Both normalize your data by correcting for library size differences but they also transform your data removing the dependence of the variance on the mean, meaning that low mean genes won’t have inflated variance from just one or a few samples having higher values than the rest (Michael I. Love and Huber 2020). Of the two methods, rlog()
takes a bit longer to run (Michael I. Love and Huber 2019). If you end up using a larger dataset and rlog()
transformation takes a bit too long, you can switch to using vst()
with confidence since they yield similar results given the dataset is large enough (Michael I. Love and Huber 2019).
If a gene is not detected in any of the samples in a set through our processing, it is still kept in the Gene
column, but it will be reported as 0
. But if the gene you are interested in does not have an Ensembl ID according to the version of the annotation we used, it will not be reported in any refine.bio download.
In short, both edgeR and DESeq2 are good options and we at the CCDL just went with one of our preferences! See this blog that summarizes these – by one of the creators of DESeq2 – he agrees edgeR is also great.
If you have strong preferences for edgeR, you can definitely use your refine.bio data with it, but we currently do not have examples of that. In this case, we’d refer you to edgeR’s section of this example analysis and wish you the best of luck on your data adventures (Hansen)!
Unfortunately at this time, all download-ready refine.bio data is summarized to the gene level, and there’s no great way to examine isoforms with this data. If your research needs to know transcript isoform information, you may need to look elsewhere. This paper discusses some tools for these kinds of questions (Zhang et al. 2017).