Objectives

This notebook will demonstrate how to:

  • Navigate the terminal interface
  • Organize an analysis project
  • Apply FastQC for quality control analysis of Illumina sequencing data
  • Preprocess sequencing reads with fastp
  • Quantify RNA-seq expression with Salmon

We will first learn how to process RNA-seq data at the command line using samples that were assayed with paired-end sequencing.

These samples come from a project (PRJNA178120) that includes 8 samples from normal gastric tissue, gastric cancer cell lines and primary gastric tumor cell cultures.

Here we will perform quality control checks, trimming, and estimate the transcript abundances for a single sample, SRR585570.

Later, we will use the full dataset (n = 8) to explore how to summarize estimates to the gene level and do some exploratory data analyses with data the course directors have processed ahead of time.


We’ll first want to set our working directory to the top-level of the RNA-seq folder.

Copy and paste the text in the code blocks below into your Terminal window in RStudio. It should be in the lower left hand corner as a tab next to Console.

Set current directory to the top-level of the RNA-seq module:

cd ~/training-modules/RNA-seq

Here ~/ refers to your home directory on the RStudio Server, which is the base folder in which your files live, including most of the materials for training. This is also the default working directory when you open a new RStudio session. A home directory is specific to you as a user on the RStudio Server; each user has their own folder to store their files.

Because these steps are computationally time intensive, we’ve prepared a script to start running things. Once we start running the script, we will give a short lecture to introduce this module and then walk through and explain each of the individual steps that the script is executing.

Enter the following in the Terminal to start running the script:

bash scripts/run_SRR585570.sh

Note: Don’t worry if the Salmon step does not complete by the time we move on to the next notebook. This is a time and resource intensive step, so we have prepared the required output in case we need it.

Input files

The raw data FASTQ files (fastq.gz) for this sample, SRR585570, are in data/gastric-cancer/fastq/SRR585570. The first two directories, data/ and gastric-cancer/, tell us that these files are data and which experiment or dataset these data are from. We’ll be working with an additional dataset later in the module, so this latter distinction will become more important. The third directory, fastq/, tells us that this is where we will be storing fastq files.

The final directory, SRR585570, is specific to the sample we are working with. The use of the SRR585570 folder might seem unnecessary because we are only processing a single sample here, but keeping files for individual samples in their own folder helps keep things organized for multi-sample workflows. (You can peek ahead and look at the data/NB-cell/quant folder for such an example.)

There is no “one size fits all” approach for project organization. It’s most important that it’s consistent, easy for you and others to find the files you need quickly, and minimizes the likelihood for errors (e.g., writing over files accidentally).

Quality control with FastQC

The first thing our script does is use FastQC for quality control in command line mode. Here’s a link to the FastQC documentation: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/

Let’s take a look at some example reports from the authors of FastQC:

FastQC runs a series of quality checks on sequencing data and provides an HTML report. As the authors point out in the docs:

It is important to stress that although the analysis results appear to give a pass/fail result, these evaluations must be taken in the context of what you expect from your library.

The documentation for individual modules/analyses in FastQC is a great resource!

To save time, our script only runs one FASTQ file for SRR585570 with the following commands:

mkdir -p QC/gastric-cancer/fastqc/SRR585570

mkdir allows us to create a new folder in the QC/gastric-cancer/fastqc directory specifically to hold the report information that will be generated by FastQC for this sample. The -p allows us to create parent directories and will prevent an error if the directory we specify already exists.

# In the interest of time, we'll run one of the fastq files through FastQC
fastqc data/gastric-cancer/fastq/SRR585570/SRR585570_1.fastq.gz \
    -o QC/gastric-cancer/fastqc/SRR585570

-o

The -o flag allows us to specify where the output of FastQC is saved. Note that this is saved in a separate place than the raw data files and in a directory specifically for quality control information.

For comparison to the report for SRR585570_1.fastq.gz we generate with our script, we’ve prepared a FastQC report for one of the sets of reads for another sample in the experiment. It can be found at QC/gastric-cancer/fastqc/SRR585574/SRR585574_1_fastqc.html.

Let’s look at the reports for both samples.

Preprocessing with fastp

We use fastp to preprocess the FASTQ files (Chen et al. Bioinformatics. 2018.). Note that fastp has quality control functionality and many different options for preprocessing (see all options on GitHub), most of which we will not cover. Here, we focus on adapter trimming, quality filtering, and length filtering.

Below, we discuss the commands we used in the script.

Create output directories

# Create a directory to hold the trimmed fastq files
mkdir -p data/gastric-cancer/fastq-trimmed/SRR585570
# Create a directory to hold the QC output from fastp
mkdir -p QC/gastric_cancer/fastp/SRR585570

As we’ll cover below, fastp essentially has two kinds of output: trimmed and filtered FASTQ files (data) and reports (quality control).

fastp

# Run the adapter and quality trimming step -- also produces QC report
fastp -i data/gastric-cancer/fastq/SRR585570/SRR585570_1.fastq.gz \
    -I data/gastric-cancer/fastq/SRR585570/SRR585570_2.fastq.gz \
    -o data/gastric-cancer/fastq-trimmed/SRR585570/SRR585570_fastp_1.fastq.gz \
    -O data/gastric-cancer/fastq-trimmed/SRR585570/SRR585570_fastp_2.fastq.gz \
    --qualified_quality_phred 15 \
    --length_required 20 \
    --report_title "SRR585570" \
    --json QC/gastric-cancer/fastp/SRR585570/SRR585570_fastp.json \
    --html QC/gastric-cancer/fastp/SRR585570/SRR585570_fastp.html

Below, we’ll walk through the arguments/options we used to run fastp. By default, fastp performs adapter trimming, which you can read more about here. For paired-end data like the data we have for SRR585570, adapters can be detected automatically without specifying an adapter sequence.

Input: -i and -I

These arguments specify the read1 input and read2 (sometimes called left and right) input, respectively.

fastq output: -o and -O

These arguments specify the read1 output and read2 output, respectively. Note that the output is being placed in data/gastric-cancer/fastq-trimmed/SRR585570/, so the processed FASTQ files will be kept separate from from the original files. It is generally good practice to treat your “raw” data and its directories as fixed and separate from any processing and analysis that you do, to prevent accidentally modification of those original files. And in the event that you accidentally do modify the originals, you know exactly which files and directories to reset.

--qualified_quality_phred

Phred scores are the quality information included in a FASTQ file and the values indicate the chances that a base is called incorrectly. Let’s look at a screenshot of the Per Base Sequence Quality module from FastQC bad Illumina example we linked to above.