1 Purpose of this analysis

The purpose of this notebook is to provide an example of mapping gene IDs for RNA-seq data obtained from refine.bio using AnnotationDbi packages (Pagès et al. 2020).

⬇️ Jump to the analysis code ⬇️

2 How to run this example

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

2.1 Obtain the `.Rmd` file

To run this example yourself, download the .Rmd for this analysis by clicking this link.

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
  dir.create("data")
}

# Define the file path to the plots directory
plots_dir <- "plots"

# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
  dir.create(plots_dir)
}

# Define the file path to the results directory
results_dir <- "results"

# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
  dir.create(results_dir)
}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

2.3 Obtain the dataset from refine.bio

For general information about downloading data for these examples, see our ‘Getting Started’ section.

Go to this dataset’s page on refine.bio.

Click the “Download Now” button on the right side of this screen.

Fill out the pop up window with your email and our Terms and Conditions:

If you want to use this dataset with other examples you may want to download it non-quantile normalized. See the RNA-seq header section about quantile normalization.

It may take a few minutes for the dataset to process. You will get an email when it is ready.

2.4 About the dataset we are using for this example

For this example analysis, we will use this RNA-seq zebrafish lateral line hair cell dataset.

This dataset has 24 RNA-seq zebrafish lateral line hair cell samples. To identify changes in gene expression in the mantle and inner support cells after hair cell death, cells were isolated from regenerating and control, non-regenerating transgenic zebrafish using fluorescence-activated cell sorting (FACS).

2.5 Place the dataset in your new `data/` folder

refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.

For more details on the contents of this folder see these docs on refine.bio.

The <experiment_accession_id> folder has the data and metadata TSV files you will need for this example analysis. Experiment accession IDs usually look something like GSE1235 or SRP12345.

Copy and paste the SRP040561 folder into your newly created data/ folder.

2.6 Check out our file structure!

Your new analysis folder should contain:

The example analysis .Rmd you downloaded
A folder called “data” which contains:
- The SRP040561 folder which contains:
  - The gene expression
  - The metadata TSV
A folder for plots (currently empty)
A folder for results (currently empty)

Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.

# Define the file path to the data directory
# Replace with the path of the folder the files will be in
data_dir <- file.path("data", "SRP040561")

# Declare the file path to the gene expression matrix file
# inside directory saved as `data_dir`
# Replace with the path to your dataset file
data_file <- file.path(data_dir, "SRP040561.tsv")

# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
# Replace with the path to your metadata file
metadata_file <- file.path(data_dir, "metadata_SRP040561.tsv")

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)

## [1] TRUE

# Check if the metadata file is at the file path stored in `metadata_file`
file.exists(metadata_file)

## [1] TRUE

If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.

3 Using a different refine.bio dataset with this analysis?

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.

refine.bio data comes with gene level data identified by Ensembl IDs. Although this example notebook uses Ensembl IDs from Zebrafish, (Danio rerio), to obtain Entrez IDs this script can be easily converted for use with different species or annotation types e.g. protein IDs, gene ontology, accession numbers.

For different species, wherever the abbreviation org.Dr.eg.db or Dr is written, it must be replaced with the respective species abbreviation e.g. for Homo sapiens org.Hs.eg.db or Hs would be used. In the case of our microarray gene identifier annotation example notebook, a Mouse (Mus musculus) dataset is used, meaning org.Mm.eg.db or Mm would also need to be used there. A full list of the annotation R packages from Bioconductor is at this link.

4 Obtaining Annotation for Ensembl IDs - RNA-seq

refine.bio uses Ensembl IDs as the primary gene identifier in its data sets. While this is a consistent and useful identifier, a string of apparently random letters and numbers is not the most user-friendly or informative for interpretation. Luckily, we can use the Ensembl IDs that we have to obtain various different annotations at the gene/transcript level. Let’s get ready to use the Ensembl IDs from our zebrafish dataset to obtain the associated Entrez IDs.

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

In this analysis, we will be using the org.Dr.eg.db R package (Carlson 2019), which is part of the Bioconductor AnnotationDbi framework (Pagès et al. 2020). Bioconductor compiles annotations from various sources, and these packages provide convenient methods to access and translate among those annotations. Other species can be used.

# Install the Zebrafish package
if (!("org.Dr.eg.db" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("org.Dr.eg.db", update = FALSE)
}

Attach the packages we need for this analysis. Note that attaching org.Mm.eg.db will automatically also attach AnnotationDbi.

# Attach the library
library(org.Dr.eg.db)

# We will need this so we can use the pipe: %>%
library(magrittr)

4.2 Import and set up data

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the both TSV files and add them as data frames to your environment.

We stored our file paths as objects named metadata_file and data_file in this previous step.

# Read in metadata TSV file
metadata <- readr::read_tsv(metadata_file)

## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   .default = col_logical(),
##   refinebio_accession_code = col_character(),
##   experiment_accession = col_character(),
##   refinebio_organism = col_character(),
##   refinebio_platform = col_character(),
##   refinebio_source_database = col_character(),
##   refinebio_title = col_character()
## )
## ℹ Use `spec()` for the full column specifications.

# Read in data TSV file
expression_df <- readr::read_tsv(data_file) %>%
  # Tuck away the Gene ID column as row names
  tibble::column_to_rownames("Gene")

## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   Gene = col_character()
## )
## ℹ Use `spec()` for the full column specifications.

Let’s ensure that the metadata and data are in the same sample order.

# Make the data in the order of the metadata
expression_df <- expression_df %>%
  dplyr::select(metadata$refinebio_accession_code)

# Check if this is in the same order
all.equal(colnames(expression_df), metadata$refinebio_accession_code)

## [1] TRUE

# Bring back the "Gene" column in preparation for mapping
expression_df <- expression_df %>%
  tibble::rownames_to_column("Gene")

4.3 Map Ensembl IDs to Entrez IDs

The mapIds() function has a multiVals argument which denotes what to do when there are multiple mapped values for a single gene identifier. The default behavior is to return just the first mapped value. It is good to keep in mind that various downstream analyses may benefit from varied strategies at this step. Use ?mapIds to see more options or strategies.

In the next chunk, we will run the mapIds() function and supply the multiVals argument with the "list" option in order to get a large list with all the mapped values found for each gene identifier.

# Map Ensembl IDs to their associated Entrez IDs
mapped_list <- mapIds(
  org.Dr.eg.db, # Replace with annotation package for your organism
  keys = expression_df$Gene,
  keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data
  column = "ENTREZID", # The type of gene identifiers you would like to map to
  multiVals = "list"
)

## 'select()' returned 1:many mapping between keys and columns

4.4 Explore gene ID conversion

Now, let’s take a look at our mapped object to see how the mapping went.

# Let's use the `head()` function for a preview of our mapped list
head(mapped_list)

## $ENSDARG00000000001
## [1] "368418"
## 
## $ENSDARG00000000002
## [1] "368419"
## 
## $ENSDARG00000000018
## [1] "64604"
## 
## $ENSDARG00000000019
## [1] "368425"
## 
## $ENSDARG00000000068
## [1] "327272"
## 
## $ENSDARG00000000069
## [1] "58093"

It looks like we have Entrez IDs that were successfully mapped to the Ensembl IDs we provided. However, the data is now in a list object, making it a little more difficult to explore. We are going to turn our list object into a data frame object in the next chunk.

# Let's make our list a bit more manageable by turning it into a data frame
mapped_df <- mapped_list %>%
  tibble::enframe(name = "Ensembl", value = "Entrez") %>%
  # enframe() makes a `list` column; we will simplify it with unnest()
  # This will result in one row of our data frame per list item
  tidyr::unnest(cols = Entrez)

Now let’s take a peek at our data frame.

head(mapped_df)

We can see that our data frame has a new column Entrez. Let’s get a summary of the values returned in the Entrez column of our mapped data frame.

# Use the `summary()` function to show the distribution of Entrez values
# We need to use `as.factor()` here to get the count of unique values
# `maxsum = 10` limits the summary to 10 distinct values
summary(as.factor(mapped_df$Entrez), maxsum = 10)

## 100126027 100150038 100331412    794549    554097    794406 100536331 
##        28        28        28        28        11         7         6 
## 100148591   (Other)      NA's 
##         5     23089      9026

There are 9026 NAs in our data frame, which means that 9026 out of the 31882 Ensembl IDs did not map to Entrez IDs. This means if you are depending on Entrez IDs for your downstream analyses, you may not be able to use the 9026 unmapped genes.

Now let’s check to see how many genes we have that were mapped to multiple IDs.

multi_mapped <- mapped_df %>%
  # Let's count the number of times each Ensembl ID appears in `Ensembl` column
  dplyr::count(Ensembl, name = "entrez_id_count") %>%
  # Arrange by the genes with the highest number of Entrez IDs mapped
  dplyr::arrange(desc(entrez_id_count))

# Let's look at the first 6 rows of our `multi_mapped` object
head(multi_mapped)

Looks like we have one case where one Ensembl ID mapped to 15 Entrez IDs! We have a total of 235 out of 31882 Ensembl IDs with multiple mappings to Entrez IDs. One option is to filter out the multi-mapped IDs and keep only the 1:1 mappings for our downstream analyses by supplying the "filter" option to the multiVals argument of mapIds(). If you want to see an example of a multiVals = "filter" strategy see the microarray gene identifier annotation example notebook.

4.4.1 Store all mapped values

If we are not sure which Entrez ID is most relevant to our downstream analyses, we could store all of the mapped information for future use. In the next chunk, we show how we can collapse all the Entrez IDs into one column where we store them all for each unique Ensembl ID.

collapsed_mapped_df <- mapped_df %>%
  # Group by Ensembl IDs
  dplyr::group_by(Ensembl) %>%
  # Collapse the Entrez IDs `mapped_df` into one column named `all_entrez_ids`
  dplyr::summarize(all_entrez_ids = paste(Entrez, collapse = ";"))

Let’s take a look at our new collapsed all_entrez_ids column in the collapsed_mapped_df object.

collapsed_mapped_df %>%
  # Filter `collapsed_mapped_df` to include only the rows where
  # `all_entrez_ids` values include the ";" character --
  # these are the rows with multiple mapped values
  dplyr::filter(stringr::str_detect(all_entrez_ids, ";")) %>%
  # We only need a preview here
  head()

You may have a list of Entrez IDs you are interested in, in which case, the above format may work best for you. In a different study, you may want the oldest Entrez ID (which is probably first), in which case, you can create a column that stores just the first mapped Entrez ID that comes up for each Ensembl ID. We will do this in the next section by re-running the mapIds() function with multiVals = "first".

4.4.2 Map Ensembl IDs to Entrez – keeping only the first mapped value

If we don’t have a particular preference of which Entrez ID is returned, we can have mapIds() use its default of returning the first Entrez ID listed.

Let’s re-run mapIds(), this time using multiVals = "first".

final_mapped_df <- data.frame(
  "first_mapped_entrez_id" = mapIds(
    org.Dr.eg.db, # Replace with annotation package for your organism
    keys = expression_df$Gene,
    keytype = "ENSEMBL", # Replace with the gene identifiers used in your data
    column = "ENTREZID", # The type of gene identifiers you would like to map to
    multiVals = "first" # Keep only the first mapped value for each Ensembl ID
  )
) %>%
  # Make an `Ensembl` column to store the rownames
  tibble::rownames_to_column("Ensembl") %>%
  # Add the multiple mappings data from `collapsed_mapped_df` using Ensembl IDs
  dplyr::inner_join(collapsed_mapped_df, by = "Ensembl") %>%
  # Now let's add on the rest of the expression data
  dplyr::inner_join(expression_df, by = c("Ensembl" = "Gene"))

## 'select()' returned 1:many mapping between keys and columns

Our final_mapped_df object now has a column named first_mapped_entrez_id that contains only the first mapped Entrez ID, in addition to the all_entrez_ids column that contains all mapped Entrez IDs per Ensembl ID.

Let’s look at the multi-mapped data again

final_mapped_df %>%
  # Filter `final_mapped_df` to rows with multiple mapped values
  dplyr::filter(stringr::str_detect(all_entrez_ids, ";")) %>%
  head()

Now let’s write our mapped results and data to file!

4.5 Write mapped results to file

# Write mapped and annotated data frame to output file
readr::write_tsv(final_mapped_df, file.path(
  results_dir,
  "SRP040561_Entrez_IDs.tsv" # Replace with a relevant output file name
))

5 Resources for further learning

Marc Carlson has prepared a nice Introduction to Bioconductor Annotation Packages (Carlson 2020)
See our microarray gene ID conversion notebook as another applicable example, since the steps for this workflow do not change with technology.

6 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

# Print session info
sessioninfo::session_info()

## ─ Session info ─────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.5 (2021-03-31)
##  os       Ubuntu 20.04.3 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Etc/UTC                     
##  date     2022-03-01                  
## 
## ─ Packages ─────────────────────────────────────────────────────────
##  package       * version date       lib source        
##  AnnotationDbi * 1.52.0  2020-10-27 [1] Bioconductor  
##  assertthat      0.2.1   2019-03-21 [1] RSPM (R 4.0.3)
##  backports       1.2.1   2020-12-09 [1] RSPM (R 4.0.3)
##  Biobase       * 2.50.0  2020-10-27 [1] Bioconductor  
##  BiocGenerics  * 0.36.1  2021-04-16 [1] Bioconductor  
##  bit             4.0.4   2020-08-04 [1] RSPM (R 4.0.3)
##  bit64           4.0.5   2020-08-30 [1] RSPM (R 4.0.3)
##  blob            1.2.1   2020-01-20 [1] RSPM (R 4.0.3)
##  bslib           0.2.5   2021-05-12 [1] RSPM (R 4.0.4)
##  cachem          1.0.5   2021-05-15 [1] RSPM (R 4.0.4)
##  cli             2.5.0   2021-04-26 [1] RSPM (R 4.0.4)
##  crayon          1.4.1   2021-02-08 [1] RSPM (R 4.0.3)
##  DBI             1.1.1   2021-01-15 [1] RSPM (R 4.0.3)
##  digest          0.6.27  2020-10-24 [1] RSPM (R 4.0.3)
##  dplyr           1.0.6   2021-05-05 [1] RSPM (R 4.0.4)
##  ellipsis        0.3.2   2021-04-29 [1] RSPM (R 4.0.4)
##  evaluate        0.14    2019-05-28 [1] RSPM (R 4.0.3)
##  fansi           0.4.2   2021-01-15 [1] RSPM (R 4.0.3)
##  fastmap         1.1.0   2021-01-25 [1] RSPM (R 4.0.3)
##  generics        0.1.0   2020-10-31 [1] RSPM (R 4.0.3)
##  getopt          1.20.3  2019-03-22 [1] RSPM (R 4.0.0)
##  glue            1.4.2   2020-08-27 [1] RSPM (R 4.0.3)
##  hms             1.0.0   2021-01-13 [1] RSPM (R 4.0.3)
##  htmltools       0.5.1.1 2021-01-22 [1] RSPM (R 4.0.3)
##  IRanges       * 2.24.1  2020-12-12 [1] Bioconductor  
##  jquerylib       0.1.4   2021-04-26 [1] RSPM (R 4.0.4)
##  jsonlite        1.7.2   2020-12-09 [1] RSPM (R 4.0.3)
##  knitr           1.33    2021-04-24 [1] RSPM (R 4.0.4)
##  lifecycle       1.0.0   2021-02-15 [1] RSPM (R 4.0.3)
##  magrittr      * 2.0.1   2020-11-17 [1] RSPM (R 4.0.3)
##  memoise         2.0.0   2021-01-26 [1] RSPM (R 4.0.3)
##  optparse      * 1.6.6   2020-04-16 [1] RSPM (R 4.0.0)
##  org.Dr.eg.db  * 3.12.0  2022-03-01 [1] Bioconductor  
##  pillar          1.6.1   2021-05-16 [1] RSPM (R 4.0.4)
##  pkgconfig       2.0.3   2019-09-22 [1] RSPM (R 4.0.3)
##  ps              1.6.0   2021-02-28 [1] RSPM (R 4.0.3)
##  purrr           0.3.4   2020-04-17 [1] RSPM (R 4.0.3)
##  R.cache         0.15.0  2021-04-30 [1] RSPM (R 4.0.4)
##  R.methodsS3     1.8.1   2020-08-26 [1] RSPM (R 4.0.3)
##  R.oo            1.24.0  2020-08-26 [1] RSPM (R 4.0.3)
##  R.utils         2.10.1  2020-08-26 [1] RSPM (R 4.0.3)
##  R6              2.5.0   2020-10-28 [1] RSPM (R 4.0.3)
##  Rcpp            1.0.6   2021-01-15 [1] RSPM (R 4.0.3)
##  readr           1.4.0   2020-10-05 [1] RSPM (R 4.0.4)
##  rematch2        2.1.2   2020-05-01 [1] RSPM (R 4.0.3)
##  rlang           0.4.11  2021-04-30 [1] RSPM (R 4.0.4)
##  rmarkdown       2.8     2021-05-07 [1] RSPM (R 4.0.4)
##  RSQLite         2.2.7   2021-04-22 [1] RSPM (R 4.0.4)
##  rstudioapi      0.13    2020-11-12 [1] RSPM (R 4.0.3)
##  S4Vectors     * 0.28.1  2020-12-09 [1] Bioconductor  
##  sass            0.4.0   2021-05-12 [1] RSPM (R 4.0.4)
##  sessioninfo     1.1.1   2018-11-05 [1] RSPM (R 4.0.3)
##  stringi         1.6.1   2021-05-10 [1] RSPM (R 4.0.4)
##  stringr         1.4.0   2019-02-10 [1] RSPM (R 4.0.3)
##  styler          1.4.1   2021-03-30 [1] RSPM (R 4.0.4)
##  tibble          3.1.2   2021-05-16 [1] RSPM (R 4.0.4)
##  tidyr           1.1.3   2021-03-03 [1] RSPM (R 4.0.4)
##  tidyselect      1.1.1   2021-04-30 [1] RSPM (R 4.0.4)
##  utf8            1.2.1   2021-03-12 [1] RSPM (R 4.0.3)
##  vctrs           0.3.8   2021-04-29 [1] RSPM (R 4.0.4)
##  withr           2.4.2   2021-04-18 [1] RSPM (R 4.0.4)
##  xfun            0.23    2021-05-15 [1] RSPM (R 4.0.4)
##  yaml            2.2.1   2020-02-01 [1] RSPM (R 4.0.3)
## 
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library

References

Carlson M., 2019 Genome wide annotation for zebrafish. https://bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html

Carlson M., 2020 AnnotationDbi: Introduction to bioconductor annotation packages. https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf

Pagès H., M. Carlson, S. Falcon, and N. Li, 2020 AnnotationDbi: Manipulation of SQLite-based annotations in Bioconductor. https://bioconductor.org/packages/release/bioc/html/AnnotationDbi.html

Obtaining Annotation for Ensembl IDs - RNA-seq

CCDL for ALSF

October 2020

1 Purpose of this analysis

2 How to run this example

2.1 Obtain the `.Rmd` file

2.2 Set up your analysis folders

2.3 Obtain the dataset from refine.bio

2.4 About the dataset we are using for this example

2.5 Place the dataset in your new `data/` folder

2.6 Check out our file structure!

3 Using a different refine.bio dataset with this analysis?

4 Obtaining Annotation for Ensembl IDs - RNA-seq

4.1 Install libraries

4.2 Import and set up data

4.3 Map Ensembl IDs to Entrez IDs

4.4 Explore gene ID conversion

4.4.1 Store all mapped values

4.4.2 Map Ensembl IDs to Entrez – keeping only the first mapped value

4.5 Write mapped results to file

5 Resources for further learning

6 Session info

References

Obtaining Annotation for Ensembl IDs - RNA-seq

CCDL for ALSF

October 2020

1 Purpose of this analysis

2 How to run this example

2.1 Obtain the .Rmd file

2.2 Set up your analysis folders

2.3 Obtain the dataset from refine.bio

2.4 About the dataset we are using for this example

2.5 Place the dataset in your new data/ folder

2.6 Check out our file structure!

3 Using a different refine.bio dataset with this analysis?

4 Obtaining Annotation for Ensembl IDs - RNA-seq

4.1 Install libraries

4.2 Import and set up data

4.3 Map Ensembl IDs to Entrez IDs

4.4 Explore gene ID conversion

4.4.1 Store all mapped values

4.4.2 Map Ensembl IDs to Entrez – keeping only the first mapped value

4.5 Write mapped results to file

5 Resources for further learning

6 Session info

References

2.1 Obtain the `.Rmd` file

2.5 Place the dataset in your new `data/` folder