The purpose of this notebook is to provide an example of mapping gene IDs for RNA-seq data obtained from refine.bio using AnnotationDbi
packages (Pagès et al. 2020).
For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.
.Rmd
fileTo run this example yourself, download the .Rmd
for this analysis by clicking this link.
Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd
file to where you would like this example and its files to be stored.
You can open this .Rmd
file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd
files.)
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
dir.create("data")
}
# Define the file path to the plots directory
<- "plots"
plots_dir
# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
dir.create(plots_dir)
}
# Define the file path to the results directory
<- "results"
results_dir
# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
dir.create(results_dir)
}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
For general information about downloading data for these examples, see our ‘Getting Started’ section.
Go to this dataset’s page on refine.bio.
Click the “Download Now” button on the right side of this screen.
Fill out the pop up window with your email and our Terms and Conditions:
If you want to use this dataset with other examples you may want to download it non-quantile normalized. See the RNA-seq header section about quantile normalization.
It may take a few minutes for the dataset to process. You will get an email when it is ready.
For this example analysis, we will use this RNA-seq zebrafish lateral line hair cell dataset.
This dataset has 24 RNA-seq zebrafish lateral line hair cell samples. To identify changes in gene expression in the mantle and inner support cells after hair cell death, cells were isolated from regenerating and control, non-regenerating transgenic zebrafish using fluorescence-activated cell sorting (FACS).
data/
folderrefine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip
. Double clicking should unzip this for you and create a folder of the same name.
For more details on the contents of this folder see these docs on refine.bio.
The <experiment_accession_id>
folder has the data and metadata TSV files you will need for this example analysis. Experiment accession IDs usually look something like GSE1235
or SRP12345
.
Copy and paste the SRP040561
folder into your newly created data/
folder.
Your new analysis folder should contain:
.Rmd
you downloadedSRP040561
folder which contains:
plots
(currently empty)results
(currently empty)Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):
In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.
First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.
# Define the file path to the data directory
# Replace with the path of the folder the files will be in
<- file.path("data", "SRP040561")
data_dir
# Declare the file path to the gene expression matrix file
# inside directory saved as `data_dir`
# Replace with the path to your dataset file
<- file.path(data_dir, "SRP040561.tsv")
data_file
# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
# Replace with the path to your metadata file
<- file.path(data_dir, "metadata_SRP040561.tsv") metadata_file
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
## [1] TRUE
# Check if the metadata file is at the file path stored in `metadata_file`
file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE
to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.
If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/
directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/
and results/
directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.
refine.bio data comes with gene level data identified by Ensembl IDs. Although this example notebook uses Ensembl IDs from Zebrafish, (Danio rerio), to obtain Entrez IDs this script can be easily converted for use with different species or annotation types e.g. protein IDs, gene ontology, accession numbers.
For different species, wherever the abbreviation org.Dr.eg.db
or Dr
is written, it must be replaced with the respective species abbreviation e.g. for Homo sapiens org.Hs.eg.db
or Hs
would be used. In the case of our microarray gene identifier annotation example notebook, a Mouse (Mus musculus) dataset is used, meaning org.Mm.eg.db
or Mm
would also need to be used there. A full list of the annotation R packages from Bioconductor is at this link.
refine.bio uses Ensembl IDs as the primary gene identifier in its data sets. While this is a consistent and useful identifier, a string of apparently random letters and numbers is not the most user-friendly or informative for interpretation. Luckily, we can use the Ensembl IDs that we have to obtain various different annotations at the gene/transcript level. Let’s get ready to use the Ensembl IDs from our zebrafish dataset to obtain the associated Entrez IDs.
See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
In this analysis, we will be using the org.Dr.eg.db
R package (Carlson 2019), which is part of the Bioconductor AnnotationDbi
framework (Pagès et al. 2020). Bioconductor compiles annotations from various sources, and these packages provide convenient methods to access and translate among those annotations. Other species can be used.
# Install the Zebrafish package
if (!("org.Dr.eg.db" %in% installed.packages())) {
# Install this package if it isn't installed yet
::install("org.Dr.eg.db", update = FALSE)
BiocManager }
Attach the packages we need for this analysis. Note that attaching org.Mm.eg.db
will automatically also attach AnnotationDbi
.
# Attach the library
library(org.Dr.eg.db)
# We will need this so we can use the pipe: %>%
library(magrittr)
Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the both TSV files and add them as data frames to your environment.
We stored our file paths as objects named metadata_file
and data_file
in this previous step.
# Read in metadata TSV file
<- readr::read_tsv(metadata_file) metadata
##
## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_logical(),
## refinebio_accession_code = col_character(),
## experiment_accession = col_character(),
## refinebio_organism = col_character(),
## refinebio_platform = col_character(),
## refinebio_source_database = col_character(),
## refinebio_title = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
# Read in data TSV file
<- readr::read_tsv(data_file) %>%
expression_df # Tuck away the Gene ID column as row names
::column_to_rownames("Gene") tibble
##
## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_double(),
## Gene = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
Let’s ensure that the metadata and data are in the same sample order.
# Make the data in the order of the metadata
<- expression_df %>%
expression_df ::select(metadata$refinebio_accession_code)
dplyr
# Check if this is in the same order
all.equal(colnames(expression_df), metadata$refinebio_accession_code)
## [1] TRUE
# Bring back the "Gene" column in preparation for mapping
<- expression_df %>%
expression_df ::rownames_to_column("Gene") tibble
The mapIds()
function has a multiVals
argument which denotes what to do when there are multiple mapped values for a single gene identifier. The default behavior is to return just the first mapped value. It is good to keep in mind that various downstream analyses may benefit from varied strategies at this step. Use ?mapIds
to see more options or strategies.
In the next chunk, we will run the mapIds()
function and supply the multiVals
argument with the "list"
option in order to get a large list with all the mapped values found for each gene identifier.
# Map Ensembl IDs to their associated Entrez IDs
<- mapIds(
mapped_list # Replace with annotation package for your organism
org.Dr.eg.db, keys = expression_df$Gene,
keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data
column = "ENTREZID", # The type of gene identifiers you would like to map to
multiVals = "list"
)
## 'select()' returned 1:many mapping between keys and columns
Now, let’s take a look at our mapped object to see how the mapping went.
# Let's use the `head()` function for a preview of our mapped list
head(mapped_list)
## $ENSDARG00000000001
## [1] "368418"
##
## $ENSDARG00000000002
## [1] "368419"
##
## $ENSDARG00000000018
## [1] "64604"
##
## $ENSDARG00000000019
## [1] "368425"
##
## $ENSDARG00000000068
## [1] "327272"
##
## $ENSDARG00000000069
## [1] "58093"
It looks like we have Entrez IDs that were successfully mapped to the Ensembl IDs we provided. However, the data is now in a list
object, making it a little more difficult to explore. We are going to turn our list object into a data frame object in the next chunk.
# Let's make our list a bit more manageable by turning it into a data frame
<- mapped_list %>%
mapped_df ::enframe(name = "Ensembl", value = "Entrez") %>%
tibble# enframe() makes a `list` column; we will simplify it with unnest()
# This will result in one row of our data frame per list item
::unnest(cols = Entrez) tidyr
Now let’s take a peek at our data frame.
head(mapped_df)
We can see that our data frame has a new column Entrez
. Let’s get a summary of the values returned in the Entrez
column of our mapped data frame.
# Use the `summary()` function to show the distribution of Entrez values
# We need to use `as.factor()` here to get the count of unique values
# `maxsum = 10` limits the summary to 10 distinct values
summary(as.factor(mapped_df$Entrez), maxsum = 10)
## 100126027 100150038 100331412 794549 554097 794406 100536331
## 28 28 28 28 11 7 6
## 100148591 (Other) NA's
## 5 23089 9026
There are 9026 NA
s in our data frame, which means that 9026 out of the 31882 Ensembl IDs did not map to Entrez IDs. This means if you are depending on Entrez
IDs for your downstream analyses, you may not be able to use the 9026 unmapped genes.
Now let’s check to see how many genes we have that were mapped to multiple IDs.
<- mapped_df %>%
multi_mapped # Let's count the number of times each Ensembl ID appears in `Ensembl` column
::count(Ensembl, name = "entrez_id_count") %>%
dplyr# Arrange by the genes with the highest number of Entrez IDs mapped
::arrange(desc(entrez_id_count))
dplyr
# Let's look at the first 6 rows of our `multi_mapped` object
head(multi_mapped)
Looks like we have one case where one Ensembl ID mapped to 15 Entrez IDs! We have a total of 235 out of 31882 Ensembl IDs with multiple mappings to Entrez IDs. One option is to filter out the multi-mapped IDs and keep only the 1:1 mappings for our downstream analyses by supplying the "filter"
option to the multiVals
argument of mapIds()
. If you want to see an example of a multiVals = "filter"
strategy see the microarray gene identifier annotation example notebook.
If we are not sure which Entrez ID is most relevant to our downstream analyses, we could store all of the mapped information for future use. In the next chunk, we show how we can collapse all the Entrez IDs into one column where we store them all for each unique Ensembl ID.
<- mapped_df %>%
collapsed_mapped_df # Group by Ensembl IDs
::group_by(Ensembl) %>%
dplyr# Collapse the Entrez IDs `mapped_df` into one column named `all_entrez_ids`
::summarize(all_entrez_ids = paste(Entrez, collapse = ";")) dplyr
Let’s take a look at our new collapsed all_entrez_ids
column in the collapsed_mapped_df
object.
%>%
collapsed_mapped_df # Filter `collapsed_mapped_df` to include only the rows where
# `all_entrez_ids` values include the ";" character --
# these are the rows with multiple mapped values
::filter(stringr::str_detect(all_entrez_ids, ";")) %>%
dplyr# We only need a preview here
head()
You may have a list of Entrez IDs you are interested in, in which case, the above format may work best for you. In a different study, you may want the oldest Entrez ID (which is probably first), in which case, you can create a column that stores just the first mapped Entrez ID that comes up for each Ensembl ID. We will do this in the next section by re-running the mapIds()
function with multiVals = "first"
.
If we don’t have a particular preference of which Entrez ID is returned, we can have mapIds()
use its default of returning the first Entrez ID listed.
Let’s re-run mapIds()
, this time using multiVals = "first"
.
<- data.frame(
final_mapped_df "first_mapped_entrez_id" = mapIds(
# Replace with annotation package for your organism
org.Dr.eg.db, keys = expression_df$Gene,
keytype = "ENSEMBL", # Replace with the gene identifiers used in your data
column = "ENTREZID", # The type of gene identifiers you would like to map to
multiVals = "first" # Keep only the first mapped value for each Ensembl ID
)%>%
) # Make an `Ensembl` column to store the rownames
::rownames_to_column("Ensembl") %>%
tibble# Add the multiple mappings data from `collapsed_mapped_df` using Ensembl IDs
::inner_join(collapsed_mapped_df, by = "Ensembl") %>%
dplyr# Now let's add on the rest of the expression data
::inner_join(expression_df, by = c("Ensembl" = "Gene")) dplyr
## 'select()' returned 1:many mapping between keys and columns
Our final_mapped_df
object now has a column named first_mapped_entrez_id
that contains only the first mapped Entrez ID, in addition to the all_entrez_ids
column that contains all mapped Entrez IDs per Ensembl ID.
Let’s look at the multi-mapped data again
%>%
final_mapped_df # Filter `final_mapped_df` to rows with multiple mapped values
::filter(stringr::str_detect(all_entrez_ids, ";")) %>%
dplyrhead()
Now let’s write our mapped results and data to file!
# Write mapped and annotated data frame to output file
::write_tsv(final_mapped_df, file.path(
readr
results_dir,"SRP040561_Entrez_IDs.tsv" # Replace with a relevant output file name
))
At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
# Print session info
::session_info() sessioninfo
## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.5 (2021-03-31)
## os Ubuntu 20.04.3 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2022-03-01
##
## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
## AnnotationDbi * 1.52.0 2020-10-27 [1] Bioconductor
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.3)
## backports 1.2.1 2020-12-09 [1] RSPM (R 4.0.3)
## Biobase * 2.50.0 2020-10-27 [1] Bioconductor
## BiocGenerics * 0.36.1 2021-04-16 [1] Bioconductor
## bit 4.0.4 2020-08-04 [1] RSPM (R 4.0.3)
## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.0.3)
## blob 1.2.1 2020-01-20 [1] RSPM (R 4.0.3)
## bslib 0.2.5 2021-05-12 [1] RSPM (R 4.0.4)
## cachem 1.0.5 2021-05-15 [1] RSPM (R 4.0.4)
## cli 2.5.0 2021-04-26 [1] RSPM (R 4.0.4)
## crayon 1.4.1 2021-02-08 [1] RSPM (R 4.0.3)
## DBI 1.1.1 2021-01-15 [1] RSPM (R 4.0.3)
## digest 0.6.27 2020-10-24 [1] RSPM (R 4.0.3)
## dplyr 1.0.6 2021-05-05 [1] RSPM (R 4.0.4)
## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.0.4)
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.3)
## fansi 0.4.2 2021-01-15 [1] RSPM (R 4.0.3)
## fastmap 1.1.0 2021-01-25 [1] RSPM (R 4.0.3)
## generics 0.1.0 2020-10-31 [1] RSPM (R 4.0.3)
## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.3)
## hms 1.0.0 2021-01-13 [1] RSPM (R 4.0.3)
## htmltools 0.5.1.1 2021-01-22 [1] RSPM (R 4.0.3)
## IRanges * 2.24.1 2020-12-12 [1] Bioconductor
## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.0.4)
## jsonlite 1.7.2 2020-12-09 [1] RSPM (R 4.0.3)
## knitr 1.33 2021-04-24 [1] RSPM (R 4.0.4)
## lifecycle 1.0.0 2021-02-15 [1] RSPM (R 4.0.3)
## magrittr * 2.0.1 2020-11-17 [1] RSPM (R 4.0.3)
## memoise 2.0.0 2021-01-26 [1] RSPM (R 4.0.3)
## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
## org.Dr.eg.db * 3.12.0 2022-03-01 [1] Bioconductor
## pillar 1.6.1 2021-05-16 [1] RSPM (R 4.0.4)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.3)
## ps 1.6.0 2021-02-28 [1] RSPM (R 4.0.3)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.3)
## R.cache 0.15.0 2021-04-30 [1] RSPM (R 4.0.4)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.3)
## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.3)
## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.3)
## R6 2.5.0 2020-10-28 [1] RSPM (R 4.0.3)
## Rcpp 1.0.6 2021-01-15 [1] RSPM (R 4.0.3)
## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.4)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.3)
## rlang 0.4.11 2021-04-30 [1] RSPM (R 4.0.4)
## rmarkdown 2.8 2021-05-07 [1] RSPM (R 4.0.4)
## RSQLite 2.2.7 2021-04-22 [1] RSPM (R 4.0.4)
## rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.0.3)
## S4Vectors * 0.28.1 2020-12-09 [1] Bioconductor
## sass 0.4.0 2021-05-12 [1] RSPM (R 4.0.4)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.3)
## stringi 1.6.1 2021-05-10 [1] RSPM (R 4.0.4)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.3)
## styler 1.4.1 2021-03-30 [1] RSPM (R 4.0.4)
## tibble 3.1.2 2021-05-16 [1] RSPM (R 4.0.4)
## tidyr 1.1.3 2021-03-03 [1] RSPM (R 4.0.4)
## tidyselect 1.1.1 2021-04-30 [1] RSPM (R 4.0.4)
## utf8 1.2.1 2021-03-12 [1] RSPM (R 4.0.3)
## vctrs 0.3.8 2021-04-29 [1] RSPM (R 4.0.4)
## withr 2.4.2 2021-04-18 [1] RSPM (R 4.0.4)
## xfun 0.23 2021-05-15 [1] RSPM (R 4.0.4)
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.3)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library