This notebook will demonstrate how to:
We will first learn how to process RNA-seq data at the command line using samples that were assayed with paired-end sequencing.
These samples come from a project (PRJNA178120
)
that includes 8 samples from normal gastric tissue, gastric cancer cell
lines and primary gastric tumor cell cultures.
Here we will perform quality control checks, trimming, and estimate the transcript abundances for a single sample, SRR585570.
Later, we will use the full dataset (n = 8) to explore how to summarize estimates to the gene level and do some exploratory data analyses with data the course directors have processed ahead of time.
We’ll first want to set our working directory to the top-level of the RNA-seq folder.
Copy and paste the text in the code blocks below into your
Terminal
window in RStudio. It should be in the lower left
hand corner as a tab next to Console
.
Set current directory to the top-level of the RNA-seq module:
cd ~/training-modules/RNA-seq
Here ~/
refers to your home directory on the
RStudio Server, which is the base folder in which your files live,
including most of the materials for training. This is also the default
working directory when you open a new RStudio session. A home directory
is specific to you as a user on the RStudio Server; each user has their
own folder to store their files.
Because these steps are computationally time intensive, we’ve prepared a script to start running things. Once we start running the script, we will give a short lecture to introduce this module and then walk through and explain each of the individual steps that the script is executing.
Enter the following in the Terminal to start running the script:
bash scripts/run_SRR585570.sh
Note: Don’t worry if the Salmon step does not complete by the time we move on to the next notebook. This is a time and resource intensive step, so we have prepared the required output in case we need it.
The raw data FASTQ files (fastq.gz
) for this sample,
SRR585570, are in data/gastric-cancer/fastq/SRR585570
. The
first two directories, data/
and
gastric-cancer/
, tell us that these files are data and
which experiment or dataset these data are from. We’ll be working with
an additional dataset later in the module, so this latter distinction
will become more important. The third directory, fastq/
,
tells us that this is where we will be storing fastq files.
The final directory, SRR585570
, is specific to the
sample we are working with. The use of the SRR585570
folder
might seem unnecessary because we are only processing a single sample
here, but keeping files for individual samples in their own folder helps
keep things organized for multi-sample workflows. (You can peek ahead
and look at the data/NB-cell/quant
folder for such an
example.)
There is no “one size fits all” approach for project organization. It’s most important that it’s consistent, easy for you and others to find the files you need quickly, and minimizes the likelihood for errors (e.g., writing over files accidentally).
The first thing our script does is use FastQC for quality control in command line mode. Here’s a link to the FastQC documentation: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/
Let’s take a look at some example reports from the authors of FastQC:
FastQC runs a series of quality checks on sequencing data and provides an HTML report. As the authors point out in the docs:
It is important to stress that although the analysis results appear to give a pass/fail result, these evaluations must be taken in the context of what you expect from your library.
The documentation for individual modules/analyses in FastQC is a great resource!
To save time, our script only runs one FASTQ file for SRR585570 with the following commands:
mkdir -p QC/gastric-cancer/fastqc/SRR585570
mkdir
allows us to create a new folder in the
QC/gastric-cancer/fastqc
directory specifically to hold the
report information that will be generated by FastQC for this
sample. The -p
allows us to create parent
directories and will prevent an error if the directory we specify
already exists.
# In the interest of time, we'll run one of the fastq files through FastQC
fastqc data/gastric-cancer/fastq/SRR585570/SRR585570_1.fastq.gz \
-o QC/gastric-cancer/fastqc/SRR585570
-o
The -o
flag allows us to specify where the output of
FastQC is saved. Note that this is saved in a separate place than the
raw data files and in a directory specifically for quality control
information.
For comparison to the report for SRR585570_1.fastq.gz
we
generate with our script, we’ve prepared a FastQC report for one of the
sets of reads for another sample in the experiment. It can be found at
QC/gastric-cancer/fastqc/SRR585574/SRR585574_1_fastqc.html
.
Let’s look at the reports for both samples.
We use fastp to preprocess the FASTQ files (Chen et al. Bioinformatics. 2018.). Note that fastp has quality control functionality and many different options for preprocessing (see all options on GitHub), most of which we will not cover. Here, we focus on adapter trimming, quality filtering, and length filtering.
Below, we discuss the commands we used in the script.
# Create a directory to hold the trimmed fastq files
mkdir -p data/gastric-cancer/fastq-trimmed/SRR585570
# Create a directory to hold the QC output from fastp
mkdir -p QC/gastric_cancer/fastp/SRR585570
As we’ll cover below, fastp essentially has two kinds of output: trimmed and filtered FASTQ files (data) and reports (quality control).
# Run the adapter and quality trimming step -- also produces QC report
fastp -i data/gastric-cancer/fastq/SRR585570/SRR585570_1.fastq.gz \
-I data/gastric-cancer/fastq/SRR585570/SRR585570_2.fastq.gz \
-o data/gastric-cancer/fastq-trimmed/SRR585570/SRR585570_fastp_1.fastq.gz \
-O data/gastric-cancer/fastq-trimmed/SRR585570/SRR585570_fastp_2.fastq.gz \
--qualified_quality_phred 15 \
--length_required 20 \
--report_title "SRR585570" \
--json QC/gastric-cancer/fastp/SRR585570/SRR585570_fastp.json \
--html QC/gastric-cancer/fastp/SRR585570/SRR585570_fastp.html
Below, we’ll walk through the arguments/options we used to run
fastp
. By default, fastp performs adapter trimming, which
you can read more about here. For
paired-end data like the data we have for SRR585570, adapters can be
detected automatically without specifying an adapter sequence.
-i
and -I
These arguments specify the read1 input and read2 (sometimes called left and right) input, respectively.
-o
and -O
These arguments specify the read1 output and read2 output,
respectively. Note that the output is being placed in
data/gastric-cancer/fastq-trimmed/SRR585570/
, so the
processed FASTQ files will be kept separate from from the original
files. It is generally good practice to treat your “raw” data and its
directories as fixed and separate from any processing and analysis that
you do, to prevent accidentally modification of those original files.
And in the event that you accidentally do modify the originals, you know
exactly which files and directories to reset.
--qualified_quality_phred
Phred scores are the quality information included in a FASTQ file and the values indicate the chances that a base is called incorrectly. Let’s look at a screenshot of the Per Base Sequence Quality module from FastQC bad Illumina example we linked to above.