This notebook illustrates one way that you can use RNA-seq data from refine.bio to perform Principal Component Analysis (PCA) and plot the scores using ggplot2
.
For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.
.Rmd
file of this analysisTo run this example yourself, download the .Rmd
for this analysis by clicking this link.
Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd
file to where you would like this example and its files to be stored.
You can open this .Rmd
file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd
files.)
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
dir.create("data")
}
# Define the file path to the plots directory
<- "plots"
plots_dir
# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
dir.create(plots_dir)
}
# Define the file path to the results directory
<- "results"
results_dir
# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
dir.create(results_dir)
}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
For general information about downloading data for these examples, see our ‘Getting Started’ section.
Go to this dataset’s page on refine.bio.
Click the “Download Now” button on the right side of this screen.
Fill out the pop up window with your email and our Terms and Conditions:
We are going to use non-quantile normalized data for this analysis. To get this data, you will need to check the box that says “Skip quantile normalization for RNA-seq samples.” Note that this option will only be available for RNA-seq datasets.
It may take a few minutes for the dataset to process. You will get an email when it is ready.
For this example analysis, we will use this prostate cancer dataset.
The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer. Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens.
data/
folderrefine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip
. Double clicking should unzip this for you and create a folder of the same name.
For more details on the contents of this folder see these docs on refine.bio.
The <experiment_accession_id>
folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235
or SRP12345
.
Copy and paste the SRP133573
folder into your newly created data/
folder.
Your new analysis folder should contain:
.Rmd
you downloadedSRP133573
folder which contains:
plots
(currently empty)results
(currently empty)In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.
First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.
# Define the file path to the data directory
# Replace with the path of the folder the files will be in
<- file.path("data", "SRP133573")
data_dir
# Declare the file path to the gene expression matrix file
# inside directory saved as `data_dir`
# Replace with the path to your dataset file
<- file.path(data_dir, "SRP133573.tsv")
data_file
# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
# Replace with the path to your metadata file
<- file.path(data_dir, "metadata_SRP133573.tsv") metadata_file
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
## [1] TRUE
# Check if the metadata file is at the file path stored in `metadata_file`
file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE
to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.
If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/
directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots to the plots/
and results/
directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.
See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
In this analysis, we will be using the R package DESeq2
(Love et al. 2014) for normalization and production of PCA values and the R package ggplot2
(Prabhakaran 2016) for plotting the PCA values.
if (!("DESeq2" %in% installed.packages())) {
# Install DESeq2
::install("DESeq2", update = FALSE)
BiocManager }
Attach the DESeq2
and ggplot2
libraries:
# Attach the `DESeq2` library
library(DESeq2)
# Attach the `ggplot2` library for plotting
library(ggplot2)
# We will need this so we can use the pipe: %>%
library(magrittr)
# Set the seed so our results are reproducible:
set.seed(12345)
Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the both TSV files and add them as data frames to your environment.
We stored our file paths as objects named metadata_file
and data_file
in this previous step.
# Read in metadata TSV file
<- readr::read_tsv(metadata_file) metadata
##
## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_character(),
## refinebio_age = col_logical(),
## refinebio_cell_line = col_logical(),
## refinebio_compound = col_logical(),
## refinebio_disease_stage = col_logical(),
## refinebio_genetic_information = col_logical(),
## refinebio_processed = col_logical(),
## refinebio_sex = col_logical(),
## refinebio_source_archive_url = col_logical(),
## refinebio_specimen_part = col_logical(),
## refinebio_time = col_logical()
## )
## ℹ Use `spec()` for the full column specifications.
# Read in data TSV file
<- readr::read_tsv(data_file) %>%
expression_df # Tuck away the gene ID column as row names, leaving only numeric values
::column_to_rownames("Gene") tibble
##
## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_double(),
## Gene = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
Let’s ensure that the metadata and data are in the same sample order.
# Make the sure the columns (samples) are in the same order as the metadata
<- expression_df %>%
expression_df ::select(metadata$refinebio_accession_code)
dplyr
# Check if this is in the same order
all.equal(colnames(expression_df), metadata$refinebio_accession_code)
## [1] TRUE
Now we are going to use a combination of functions from the DESeq2
and ggplot2
packages to perform and visualize the results of the Principal Component Analysis (PCA) dimension reduction technique on our pre-ADT and post-ADT samples.
DESEq2
We need to make sure all of the metadata column variables that we would like to use to annotate our plot, are converted into factors.
# convert the columns we will be using for annotation into factors
<- metadata %>%
metadata ::mutate(
dplyrrefinebio_treatment = factor(
refinebio_treatment,# specify the possible levels in the order we want them to appear
levels = c("pre-adt", "post-adt")
),refinebio_disease = as.factor(refinebio_disease)
)
We want to filter out the genes that have not been expressed or that have low expression counts since these genes are likely to add noise rather than useful signal to our analysis. We are going to do some pre-filtering to keep only genes with 10 or more reads total. This threshold might vary depending on the number of samples and expression patterns across your data set. Note that rows represent gene data and the columns represent sample data in our dataset.
# Define a minimum counts cutoff and filter the data to include
# only rows (genes) that have total counts above the cutoff
<- expression_df %>%
filtered_expression_df ::filter(rowSums(.) >= 10) dplyr
We also need our counts to be rounded before we can use them with the DESeqDataSetFromMatrix()
function.
# The `DESeqDataSetFromMatrix()` function needs the values to be integers
<- round(filtered_expression_df) filtered_expression_df
We will be using the DESeq2
package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet
object. We turn the data frame (or matrix) into a DESeqDataSet
object and specify which variable labels our experimental groups using the design
argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design
argument because we are not performing a differential expression analysis.
# Create a `DESeqDataSet` object
<- DESeqDataSetFromMatrix(
dds countData = filtered_expression_df, # the counts values for all samples in our dataset
colData = metadata, # annotation data for the samples in the counts data frame
design = ~1 # Here we are not specifying a model
# Replace with an appropriate design variable for your analysis
)
## converting counts to integer mode
We are going to use the vst()
function from the DESeq2
package to normalize and transform the data. For more information about these transformation methods, see here.
# Normalize and transform the data in the `DESeqDataSet` object
# using the `vst()` function from the `DESeq2` R package
<- vst(dds) dds_norm
DESeq2 has built-in functions to calculate and plot PCA values, which we will use here. The plotPCA()
function allows us to specify our group of interest with the intgroup
argument, which will be used to color the points in our plot. In this code chunk, we are using refinebio_treatment
as the grouping variable, as part of the goal of the experiment was to analyze the sample transcriptional responses to androgen deprivation therapy (ADT).
plotPCA(
dds_norm,intgroup = "refinebio_treatment"
)
In the next chunk, we are going to add another variable to our plot for annotation.
Now we’ll plot the PCA using both refinebio_treatment
and refinebio_disease
variables for labels since they are central to the androgen deprivation therapy (ADT) based hypothesis in the original paper (Sharma et al. 2018).
plotPCA(
dds_norm,intgroup = c("refinebio_treatment", "refinebio_disease")
# We are able to add another variable to the intgroup argument
# by providing a vector of the variable names with `c()` function
)
In the plot above, it is hard to distinguish the different refinebio_treatment
values which contain the data on whether or not samples have been treated with ADT versus the refinebio_disease
values which refer to the method by which the samples were obtained from patients (i.e. biopsy).
Let’s use the ggplot2
package functionality to customize our plot further and make the annotation labels better distinguishable.
First let’s use plotPCA()
to receive and store the PCA values for plotting.
# We first have to save the results of the `plotPCA()` function for use with `ggplot2`
<-
pca_results plotPCA(
dds_norm,intgroup = c("refinebio_treatment", "refinebio_disease"),
returnData = TRUE # This argument tells R to return the PCA values
)
Now let’s plot our pca_results
using ggplot2
functionality.
# Plot using `ggplot()` function and save to an object
<- ggplot(
annotated_pca_plot
pca_results,aes(
x = PC1,
y = PC2,
# plot points with different colors for each `refinebio_treatment` group
color = refinebio_treatment,
# plot points with different shapes for each `refinebio_disease` group
shape = refinebio_disease
)+
) # Make a scatter plot
geom_point()
# display annotated plot
annotated_pca_plot
You can easily switch this to save to a JPEG or TIFF by changing the file name within the ggsave()
function to the respective file suffix.
# Save plot using `ggsave()` function
ggsave(
file.path(plots_dir, "SRP133573_pca_plot.png"),
# Replace with a file name relevant your plotted data
plot = annotated_pca_plot # the plot object that we want saved to file
)
## Saving 7 x 5 in image
At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
# Print session info
::session_info() sessioninfo
## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.5 (2021-03-31)
## os Ubuntu 20.04.3 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2022-03-01
##
## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
## annotate 1.68.0 2020-10-27 [1] Bioconductor
## AnnotationDbi 1.52.0 2020-10-27 [1] Bioconductor
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.3)
## backports 1.2.1 2020-12-09 [1] RSPM (R 4.0.3)
## Biobase * 2.50.0 2020-10-27 [1] Bioconductor
## BiocGenerics * 0.36.1 2021-04-16 [1] Bioconductor
## BiocParallel 1.24.1 2020-11-06 [1] Bioconductor
## bit 4.0.4 2020-08-04 [1] RSPM (R 4.0.3)
## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.0.3)
## bitops 1.0-7 2021-04-24 [1] RSPM (R 4.0.4)
## blob 1.2.1 2020-01-20 [1] RSPM (R 4.0.3)
## bslib 0.2.5 2021-05-12 [1] RSPM (R 4.0.4)
## cachem 1.0.5 2021-05-15 [1] RSPM (R 4.0.4)
## cli 2.5.0 2021-04-26 [1] RSPM (R 4.0.4)
## colorspace 2.0-1 2021-05-04 [1] RSPM (R 4.0.4)
## crayon 1.4.1 2021-02-08 [1] RSPM (R 4.0.3)
## DBI 1.1.1 2021-01-15 [1] RSPM (R 4.0.3)
## DelayedArray 0.16.3 2021-03-24 [1] Bioconductor
## DESeq2 * 1.30.1 2021-02-19 [1] Bioconductor
## digest 0.6.27 2020-10-24 [1] RSPM (R 4.0.3)
## dplyr 1.0.6 2021-05-05 [1] RSPM (R 4.0.4)
## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.0.4)
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.3)
## fansi 0.4.2 2021-01-15 [1] RSPM (R 4.0.3)
## farver 2.1.0 2021-02-28 [1] RSPM (R 4.0.3)
## fastmap 1.1.0 2021-01-25 [1] RSPM (R 4.0.3)
## genefilter 1.72.1 2021-01-21 [1] Bioconductor
## geneplotter 1.68.0 2020-10-27 [1] Bioconductor
## generics 0.1.0 2020-10-31 [1] RSPM (R 4.0.3)
## GenomeInfoDb * 1.26.7 2021-04-08 [1] Bioconductor
## GenomeInfoDbData 1.2.4 2022-03-01 [1] Bioconductor
## GenomicRanges * 1.42.0 2020-10-27 [1] Bioconductor
## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
## ggplot2 * 3.3.3 2020-12-30 [1] RSPM (R 4.0.3)
## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.3)
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.3)
## highr 0.9 2021-04-16 [1] RSPM (R 4.0.4)
## hms 1.0.0 2021-01-13 [1] RSPM (R 4.0.3)
## htmltools 0.5.1.1 2021-01-22 [1] RSPM (R 4.0.3)
## httr 1.4.2 2020-07-20 [1] RSPM (R 4.0.3)
## IRanges * 2.24.1 2020-12-12 [1] Bioconductor
## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.0.4)
## jsonlite 1.7.2 2020-12-09 [1] RSPM (R 4.0.3)
## knitr 1.33 2021-04-24 [1] RSPM (R 4.0.4)
## labeling 0.4.2 2020-10-20 [1] RSPM (R 4.0.3)
## lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.5)
## lifecycle 1.0.0 2021-02-15 [1] RSPM (R 4.0.3)
## locfit 1.5-9.4 2020-03-25 [1] RSPM (R 4.0.3)
## magrittr * 2.0.1 2020-11-17 [1] RSPM (R 4.0.3)
## Matrix 1.3-2 2021-01-06 [2] CRAN (R 4.0.5)
## MatrixGenerics * 1.2.1 2021-01-30 [1] Bioconductor
## matrixStats * 0.58.0 2021-01-29 [1] RSPM (R 4.0.3)
## memoise 2.0.0 2021-01-26 [1] RSPM (R 4.0.3)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.3)
## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
## pillar 1.6.1 2021-05-16 [1] RSPM (R 4.0.4)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.3)
## ps 1.6.0 2021-02-28 [1] RSPM (R 4.0.3)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.3)
## R.cache 0.15.0 2021-04-30 [1] RSPM (R 4.0.4)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.3)
## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.3)
## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.3)
## R6 2.5.0 2020-10-28 [1] RSPM (R 4.0.3)
## RColorBrewer 1.1-2 2014-12-07 [1] RSPM (R 4.0.3)
## Rcpp 1.0.6 2021-01-15 [1] RSPM (R 4.0.3)
## RCurl 1.98-1.3 2021-03-16 [1] RSPM (R 4.0.4)
## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.4)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.3)
## rlang 0.4.11 2021-04-30 [1] RSPM (R 4.0.4)
## rmarkdown 2.8 2021-05-07 [1] RSPM (R 4.0.4)
## RSQLite 2.2.7 2021-04-22 [1] RSPM (R 4.0.4)
## rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.0.3)
## S4Vectors * 0.28.1 2020-12-09 [1] Bioconductor
## sass 0.4.0 2021-05-12 [1] RSPM (R 4.0.4)
## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.3)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.3)
## stringi 1.6.1 2021-05-10 [1] RSPM (R 4.0.4)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.3)
## styler 1.4.1 2021-03-30 [1] RSPM (R 4.0.4)
## SummarizedExperiment * 1.20.0 2020-10-27 [1] Bioconductor
## survival 3.2-10 2021-03-16 [2] CRAN (R 4.0.5)
## tibble 3.1.2 2021-05-16 [1] RSPM (R 4.0.4)
## tidyselect 1.1.1 2021-04-30 [1] RSPM (R 4.0.4)
## utf8 1.2.1 2021-03-12 [1] RSPM (R 4.0.3)
## vctrs 0.3.8 2021-04-29 [1] RSPM (R 4.0.4)
## withr 2.4.2 2021-04-18 [1] RSPM (R 4.0.4)
## xfun 0.23 2021-05-15 [1] RSPM (R 4.0.4)
## XML 3.99-0.6 2021-03-16 [1] RSPM (R 4.0.4)
## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.0.3)
## XVector 0.30.0 2020-10-27 [1] Bioconductor
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.3)
## zlibbioc 1.36.0 2020-10-27 [1] Bioconductor
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library