Sequence Read Archive (SRA) is a public repository of sequencing data, including many RNA-seq datasets.
SRAdb is an R package on Bioconductor that can help you retrieve metadata associated with samples on SRA, which is what we’ll use it for here. It also can help you look up URLs for files you might want to retrieve.
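For example, once you have a connection to the SRAmetadb.sqlite database (we create that connection, sra_con, later in this notebook), SRAdb’s listSRAfile() function can list the SRA file locations associated with a study accession. The commented-out code below is only a sketch for reference; we won’t run it here, and the sra_file_info name is just an example.
# For reference only: look up SRA file locations for a study accession.
# This requires the sra_con connection we create further down.
# sra_file_info <- SRAdb::listSRAfile("SRP045496", sra_con)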
In this example, we will show you how to retrieve metadata from SRAdb for a project or study ID (e.g., SRP123456). The authors of SRAdb provide an sqlite file, SRAmetadb.sqlite, that contains the database that we will query to get the metadata we need.
The Tidyverse includes the dbplyr package, which is made for working with database files like sqlite using much of the same dplyr syntax that we have already been using. We will use dbplyr (an extension of dplyr functions) to extract sample information related to our SRA project ID (SRP) of interest. You will recognize a lot of the functions from our Intro to the Tidyverse notebook.
SRAmetadb.sqlite is already downloaded for you in ~/shared-data/SRAdb. We are including this code for your reference so you know how to obtain this file outside of our RStudio Server.
Note that this file is very large (24 GB!), so please be sure to avoid downloading another copy to our RStudio Server.
# Install the SRAdb package if it is not installed.
# if (!("SRAdb" %in% installed.packages())) {
# BiocManager::install("SRAdb")
# }
# Declare directory to hold the SRAdb file
# sra_db_dir <- <DIRECTORY NAME HERE>
# Create the directory where the SRAdb file will be downloaded to
# if (!dir.exists(sra_db_dir)) {
#   dir.create(sra_db_dir, recursive = TRUE)
# }
# Downloading this file will take some time, so we will only download it if
# it doesn't exist. (sqlite_file, the path to SRAmetadb.sqlite, is declared
# in the next section.)
# if (!file.exists(sqlite_file)) {
#   # Use SRAdb's function to download the file
#   SRAdb::getSRAdbFile(destdir = sra_db_dir)
# }
For this example, we will use SRP045496, an RNA-seq medulloblastoma mouse model experiment.
For more on SRA IDs and more than you’ll need to know about the SRA database, see this knowledge base.
# Declare the project id you are interested in.
study_id <- "SRP045496"
# magrittr pipe
`%>%` <- dplyr::`%>%`
Set up SRA input directory and file path.
We have already downloaded this SRAdb file to the shared-data folder for you.
# Declare SRA directory
# Here ~ refers to your home directory
sra_db_dir <- file.path("~", "shared-data", "SRAdb")
# Declare database file path
sqlite_file <- file.path(sra_db_dir, "SRAmetadb.sqlite")
Set up output directory and file path.
# Declare output directory
output_dir <- file.path("SRA_metadata")
# Declare output file path using the project ID
output_metadata_file <- file.path(output_dir, paste0(study_id, "_metadata.tsv"))
# Create the directory if it doesn't exist.
if (!dir.exists(output_dir)) {
dir.create(output_dir, recursive = TRUE)
}
Connect to the sqlite file.
# Make connection to sqlite database file
sra_con <- DBI::dbConnect(RSQLite::SQLite(), sqlite_file)
sqlite databases are made up of a set of tables, much like the data frames we have been using. We need to create variables that point to each table that we want to use with dplyr.
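If you’re curious which tables this database contains, you can list them with DBI’s dbListTables() function. This is optional and just for exploring; the rest of this notebook only uses the sra, study, and sample tables.
DBI::dbListTables(sra_con)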
Using dplyr::tbl(), create an object that refers to the sra table from sra_con, our connection to the sqlite file sqlite_file.
sra_table <- dplyr::tbl(sra_con, "sra")
Create an object that refers to the study table from the same sqlite connection/file.
study_table <- dplyr::tbl(sra_con, "study")
Use the dbplyr extension of dplyr functions to collect information related to our declared study_id. In this context, the transformation steps that you’ll recognize (filter(), inner_join()) are working on the sqlite database directly (using dbplyr), which is why the as.data.frame() step is required at the end to bring the data into the R environment as a standard data frame.
# Use the sqlite connection table to apply functions to it
sra_df <- sra_table %>%
# Filter to the study_id we are looking for
dplyr::filter(study_accession == study_id) %>%
# Inner join to the study table that has more specific info about the study
dplyr::inner_join(study_table, by = "study_accession") %>%
# We need to do this so the dbplyr queries are transformed to a data frame
as.data.frame()
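If you’d like to see what is happening behind the scenes, dbplyr translates these steps into a SQL query that is run against the sqlite file. You can preview that query with dplyr::show_query() before bringing anything into R; this is optional, and the exact SQL it prints can differ between dbplyr versions.
sra_table %>%
  dplyr::filter(study_accession == study_id) %>%
  dplyr::inner_join(study_table, by = "study_accession") %>%
  dplyr::show_query()
You may also see dplyr::collect() used instead of as.data.frame(); both bring the results of the query into your R environment as a local data frame.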
Create a vector of sample IDs that are related to our study_id.
# Pull out sample accessions for the corresponding study
sample_accessions <- sra_df %>%
dplyr::pull(sample_accession)
Connect to the sample table in our sqlite connection, sra_con.
sample_table <- dplyr::tbl(sra_con, "sample")
Use dbplyr functions to filter the sample_table to only the samples related to our study_id.
# Filter the sample_table of the sqlite file
sample_df <- sample_table %>%
# Collect the samples that we identified in the previous set of steps
dplyr::filter(sample_accession %in% sample_accessions) %>%
# Turn into a data.frame
as.data.frame()
Here we’ll do some cleaning. These cleaning steps will depend on the metadata itself and on what information you are interested in.
cleaned_sample_df <- sample_df %>%
# Here we get rid of any columns that consist only of NAs
dplyr::select(-which(apply(is.na(.), 2, all)))
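As an aside, another way to drop columns that are entirely NA is dplyr’s select_if() scoped verb, which keeps only the columns where a predicate is TRUE. Assuming the dplyr version installed here, this should give the same result as the step above.
sample_df %>%
  # Keep only the columns where not every value is NA
  dplyr::select_if(~ !all(is.na(.)))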
We’re interested in the sample accession and the sample attributes only, so we will use the dplyr::select() function to pick the columns we want to retain.
cleaned_sample_df <- cleaned_sample_df %>%
dplyr::select(sample_accession,
sample_attribute)
The metadata we’re interested in is in a column called sample_attribute, but unfortunately it is in a format that is not usable yet. Let’s take a look at that column.
cleaned_sample_df$sample_attribute
[1] "source_name: cerebellum || strain: C57BL/6 || tissue: cerebellum || age: post natal day 60 || genotype: wild type"
[2] "source_name: cerebellum || strain: C57BL/6 || tissue: cerebellum || age: post natal day 60 || genotype: wild type"
[3] "source_name: cerebellum || strain: C57BL/6 || tissue: cerebellum || age: post natal day 60 || genotype: wild type"
[4] "source_name: cerebellum || strain: C57BL/6 || tissue: cerebellum || age: post natal day 60 || genotype: Gsa conditional knockout mediated by Olig1Cre"
[5] "source_name: cerebellum || strain: C57BL/6 || tissue: cerebellum || age: post natal day 60 || genotype: Gsa conditional knockout mediated by Olig1Cre"
[6] "source_name: cerebellum || strain: C57BL/6 || tissue: cerebellum || age: post natal day 60 || genotype: Gsa conditional knockout mediated by Olig1Cre"
[7] "source_name: cerebellum || strain: C57BL/6 || tissue: cerebellum || age: post natal day 60 || genotype: Gsa conditional knockout mediated by Olig1Cre"
[8] "source_name: cerebellum || strain: C57BL/6 || tissue: cerebellum || age: post natal day 60 || genotype: Gsa conditional knockout mediated by Olig1Cre"
[9] "source_name: cerebellum || strain: C57BL/6 || tissue: cerebellum || age: post natal day 60 || genotype: Gsa conditional knockout mediated by Olig1Cre"
[10] "source_name: cerebellum || strain: C57BL/6 || tissue: cerebellum || age: post natal day 60 || genotype: Gsa conditional knockout mediated by Olig1Cre"
[11] "source_name: cerebellum || strain: C57BL/6 || tissue: cerebellum || age: post natal day 60 || genotype: Gsa conditional knockout mediated by Olig1Cre"
[12] "source_name: cerebellum || strain: C57BL/6 || tissue: cerebellum || age: post natal day 60 || genotype: Gsa conditional knockout mediated by hGFAPCre"
[13] "source_name: cerebellum || strain: C57BL/6 || tissue: cerebellum || age: post natal day 60 || genotype: Gsa conditional knockout mediated by hGFAPCre"
[14] "source_name: cerebellum || strain: C57BL/6 || tissue: cerebellum || age: post natal day 60 || genotype: Gsa conditional knockout mediated by hGFAPCre"
[15] "source_name: cerebellum || strain: C57BL/6 || tissue: cerebellum || age: post natal day 60 || genotype: Gsa conditional knockout mediated by hGFAPCre"
[16] "source_name: cerebellum || strain: C57BL/6 || tissue: cerebellum || age: post natal day 60 || genotype: Gsa conditional knockout mediated by hGFAPCre"
[17] "source_name: cerebellum || strain: C57BL/6 || tissue: cerebellum || age: post natal day 60 || genotype: Gsa conditional knockout mediated by hGFAPCre"
[18] "source_name: cerebellum || strain: C57BL/6 || tissue: cerebellum || age: post natal day 60 || genotype: Gsa conditional knockout mediated by hGFAPCre"
[19] "source_name: cerebellum || strain: C57BL/6 || tissue: cerebellum || age: post natal day 60 || genotype: Gsa conditional knockout mediated by hGFAPCre"
Notice how this single column contains 5 columns’ worth of information: source_name, strain, tissue, age, and genotype. We’d prefer 5 separate columns!
Luckily there’s another package in the Tidyverse, tidyr, that has functionality for splitting up the column, and the function is called separate().
Each column’s worth of information is divided up by ||, so we can use that to split sample_attribute up into individual columns. | is a special character in regular expressions, or regex, which means we need to place \\ in front of each | to escape the character.
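To see what the escaped pattern does, we can try splitting a single sample_attribute value with stringr::str_split(). This is just a quick demonstration; below we’ll use tidyr::separate() to do this for every row at once.
# Split the first sample's attribute string on the " || " separator
stringr::str_split(cleaned_sample_df$sample_attribute[1], " \\|\\| ")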
First we need to create a character vector that contains the new column names.
# We use the attribute names from above. This character vector will be specific
# to the experiment you are working with.
new_columns <- c("source_name",
"strain",
"tissue",
"age",
"genotype")
cleaned_sample_df <- cleaned_sample_df %>%
# We first tell separate() which column we'd like to separate into multiple
# columns
tidyr::separate(col = sample_attribute,
# Here we need to tell the function what columns to split
# things *into* - we'll use the vector we created above
into = new_columns,
# What characters can be used to mark the separation between
# items that should become multiple columns
# We need to use \\ in front of each | because | is a
# special character
sep = " \\|\\| ")
Let’s take a look at our new columns.
cleaned_sample_df %>%
dplyr::select(new_columns)
This is certainly better, but it’s not quite what we want yet. We’ll want to remove the words before (and including) the : in each column. We can use dplyr::mutate() and yet another Tidyverse package, stringr.
Let’s first look at how we can use stringr::str_extract() to find the part of each value that we want to remove. Here the relevant information is everything after the :, so what we want to strip out is everything before and including it. We’ll use source_name as an example. In the pattern below, .* matches any characters, so ".*: " as a whole matches everything in source_name up to and including ": ".
stringr::str_extract(cleaned_sample_df$source_name, ".*: ")
[1] "source_name: " "source_name: " "source_name: " "source_name: " "source_name: " "source_name: " "source_name: " "source_name: " "source_name: "
[10] "source_name: " "source_name: " "source_name: " "source_name: " "source_name: " "source_name: " "source_name: " "source_name: " "source_name: "
[19] "source_name: "
Using ".*: " returns everything we want to remove from this column. We can use this same pattern with stringr::str_remove() to remove everything up to and including the ": " separator:
stringr::str_remove(cleaned_sample_df$source_name, ".*: ")
[1] "cerebellum" "cerebellum" "cerebellum" "cerebellum" "cerebellum" "cerebellum" "cerebellum" "cerebellum" "cerebellum" "cerebellum" "cerebellum" "cerebellum"
[13] "cerebellum" "cerebellum" "cerebellum" "cerebellum" "cerebellum" "cerebellum" "cerebellum"
This is exactly what we want to keep in the source_name column!
Now we’re ready to use stringr::str_remove() with dplyr::mutate() to get some cleaned metadata! We’ll actually use dplyr::mutate_at(), which will apply the function to each column we specify.
cleaned_sample_df <- cleaned_sample_df %>%
# We can use the same character vector as before to specify which columns
# we want to apply stringr::str_remove() at
dplyr::mutate_at(new_columns,
# This tells mutate_at to apply stringr::str_remove where
# the column is the first argument to stringr::str_remove -
# the string we are removing a pattern from
~ stringr::str_remove(., ".*: "))
Let’s take a look at our mutated columns.
head(cleaned_sample_df)
Run accessions in SRA essentially correspond to libraries. refine.bio, and often other resources dealing with processed RNA-seq data, use run accessions because that is what usually corresponds to FASTQ files. Multiple runs can map to the same sample accession, but a run will only map to a single sample accession. To use this metadata with refine.bio expression values, we’ll want to include the run accession for each sample. We have that information in sra_df already.
cleaned_sample_df <- sra_df %>%
# Select only the relevant accessions, run and sample, from sra_df to
# join to our cleaned metadata
dplyr::select(run_accession,
sample_accession) %>%
# This effectively "tacks on" the run accessions as the first column of
# our sample metadata
dplyr::inner_join(cleaned_sample_df,
by = "sample_accession")
cleaned_sample_df %>%
readr::write_tsv(output_metadata_file)
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4.6 rstudioapi_0.11 knitr_1.28 magrittr_1.5 hms_0.5.3 tidyselect_0.2.5 bit_1.1-14 R6_2.4.1 rlang_0.4.5
[10] stringr_1.4.0 blob_1.2.0 dplyr_0.8.3 tools_3.6.1 xfun_0.13 DBI_1.1.0 dbplyr_1.4.2 htmltools_0.4.0 ellipsis_0.3.0
[19] yaml_2.2.1 bit64_0.9-7 assertthat_0.2.1 digest_0.6.25 tibble_3.0.1 lifecycle_0.2.0 crayon_1.3.4 readr_1.3.1 purrr_0.3.4
[28] tidyr_1.0.0 base64enc_0.1-3 vctrs_0.2.4 evaluate_0.14 memoise_1.1.0 glue_1.4.0 RSQLite_2.2.0 rmarkdown_2.1 stringi_1.4.6
[37] compiler_3.6.1 pillar_1.4.3 jsonlite_1.6.1 pkgconfig_2.0.3