1 Purpose of this analysis

The purpose of this notebook is to provide an example of mapping gene IDs for microarray data obtained from refine.bio using AnnotationDbi packages (Pagès et al. 2020).

⬇️ Jump to the analysis code ⬇️

2 How to run this example

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

2.1 Obtain the .Rmd file

To run this example yourself, download the .Rmd for this analysis by clicking this link.

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
  dir.create("data")
}

# Define the file path to the plots directory
plots_dir <- "plots"

# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
  dir.create(plots_dir)
}

# Define the file path to the results directory
results_dir <- "results"

# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
  dir.create(results_dir)
}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

2.3 Obtain the dataset from refine.bio

For general information about downloading data for these examples, see our ‘Getting Started’ section.

Go to this dataset’s page on refine.bio.

Click the “Download Now” button on the right side of this screen.

Fill out the pop up window with your email and our Terms and Conditions:

It may take a few minutes for the dataset to process. You will get an email when it is ready.

2.4 About the dataset we are using for this example

For this example analysis, we will use this mouse glioma stem cells dataset.

This dataset has 15 microarrays measuring gene expression in a transgenic mouse model of glioma. The authors compared cells from side populations and non-side populations in both tumor samples and normal neural stem cells.

2.5 Place the dataset in your new data/ folder

refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.

For more details on the contents of this folder see these docs on refine.bio.

The <experiment_accession_id> folder has the data and metadata TSV files you will need for this example analysis. Experiment accession IDs usually look something like GSE1235 or SRP12345.

Copy and paste the GSE13490 folder into your newly created data/ folder.

2.6 Check out our file structure!

Your new analysis folder should contain:

  • The example analysis .Rmd you downloaded
  • A folder called “data” which contains:
    • The GSE13490 folder which contains:
      • The gene expression
      • The metadata TSV
  • A folder for plots (currently empty)
  • A folder for results (currently empty)

Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.

# Define the file path to the data directory
# Replace with the path of the folder the files will be in
data_dir <- file.path("data", "GSE13490")

# Declare the file path to the gene expression matrix file
# inside directory saved as `data_dir`
# Replace with the path to your dataset file
data_file <- file.path(data_dir, "GSE13490.tsv")

# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
# Replace with the path to your metadata file
metadata_file <- file.path(data_dir, "metadata_GSE13490.tsv")

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
## [1] TRUE
# Check if the metadata file is at the file path stored in `metadata_file`
file.exists(metadata_file)
## [1] TRUE

If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.

3 Using a different refine.bio dataset with this analysis?

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.

refine.bio data comes with gene level data identified by Ensembl IDs. Although this example notebook uses Ensembl IDs from Mouse, (Mus musculus), to obtain gene symbols this notebook can be easily converted for use with different species or annotation types e.g. protein IDs, gene ontology, accession numbers.

For different species, wherever the abbreviation org.Mm.eg.db or Mm is written, it must be replaced with the respective species abbreviation e.g. for Homo sapiens org.Hs.eg.db or Hs would be used. In the case of our RNA-seq gene identifier annotation example notebook, a Zebrafish (Danio rerio) dataset is used, meaning org.Dr.eg.db or Dr would also need to be used there. A full list of the annotation R packages from Bioconductor is at this link.


 

4 Obtaining Annotation for Ensembl IDs - Microarray

refine.bio uses Ensembl IDs as the primary gene identifier in its data sets. While this is a consistent and useful identifier, a string of apparently random letters and numbers is not the most user-friendly or informative for interpretation. Luckily, we can use the Ensembl IDs that we have to obtain various different annotations at the gene/transcript level. Let’s get ready to use the Ensembl IDs from our mouse dataset to obtain the associated gene symbols.

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

In this analysis, we will be using the org.Mm.eg.db R package (Carlson 2019), which is part of the Bioconductor AnnotationDbi framework (Pagès et al. 2020). Bioconductor compiles annotations from various sources, and these packages provide convenient methods to access and translate among those annotations. Other species can be used.

# Install the mouse annotation package
if (!("org.Mm.eg.db" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("org.Mm.eg.db", update = FALSE)
}

Attach the packages we need for this analysis. Note that attaching org.Mm.eg.db will automatically also attach AnnotationDbi.

# Attach the library
library(org.Mm.eg.db)

# We will need this so we can use the pipe: %>%
library(magrittr)

4.2 Import and set up data

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the both TSV files and add them as data frames to your environment.

We stored our file paths as objects named metadata_file and data_file in this previous step.

# Read in metadata TSV file
metadata <- readr::read_tsv(metadata_file)
## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   refinebio_age = col_logical(),
##   refinebio_cell_line = col_logical(),
##   refinebio_compound = col_logical(),
##   refinebio_disease = col_logical(),
##   refinebio_disease_stage = col_logical(),
##   refinebio_genetic_information = col_logical(),
##   refinebio_processed = col_logical(),
##   refinebio_race = col_logical(),
##   refinebio_sex = col_logical(),
##   refinebio_source_archive_url = col_logical(),
##   refinebio_specimen_part = col_logical(),
##   refinebio_subject = col_logical(),
##   refinebio_time = col_logical(),
##   refinebio_treatment = col_logical(),
##   channel_count = col_double(),
##   data_row_count = col_double(),
##   taxid_ch1 = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
# Read in data TSV file
expression_df <- readr::read_tsv(data_file) %>%
  # Tuck away the Gene ID column as row names
  tibble::column_to_rownames("Gene")
## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   Gene = col_character(),
##   GSM340064 = col_double(),
##   GSM340065 = col_double(),
##   GSM340066 = col_double(),
##   GSM340067 = col_double(),
##   GSM340068 = col_double(),
##   GSM340069 = col_double(),
##   GSM340070 = col_double(),
##   GSM340071 = col_double(),
##   GSM340072 = col_double(),
##   GSM340073 = col_double(),
##   GSM340074 = col_double(),
##   GSM340075 = col_double(),
##   GSM340076 = col_double(),
##   GSM340077 = col_double(),
##   GSM340078 = col_double()
## )

Let’s ensure that the metadata and data are in the same sample order.

# Make the data in the order of the metadata
expression_df <- expression_df %>%
  dplyr::select(metadata$geo_accession)

# Check if this is in the same order
all.equal(colnames(expression_df), metadata$geo_accession)
## [1] TRUE
# Bring back the "Gene" column in preparation for mapping
expression_df <- expression_df %>%
  tibble::rownames_to_column("Gene")

4.3 Map Ensembl IDs to gene symbols

The main work of translating among annotations will be done with the the AnnotationDbi function mapIds(). The mapIds() function has a multiVals argument which denotes what to do when there are multiple mapped values for a single gene identifier. The default behavior is to return just the first mapped value. It is good to keep in mind that various downstream analyses may benefit from varied strategies at this step. Use ?mapIds to see more options or strategies.

In the next chunk, we will run the mapIds() function and supply the multiVals argument with the "list" option in order to get a large list with all the mapped values found for each gene identifier.

# Map ensembl IDs to their associated gene symbols
mapped_list <- mapIds(
  org.Mm.eg.db, # Replace with annotation package for your organism
  keys = expression_df$Gene,
  keytype = "ENSEMBL", # Replace with the gene identifiers used in your data
  column = "SYMBOL", # The type of gene identifiers you would like to map to
  multiVals = "list"
)
## 'select()' returned 1:many mapping between keys and columns

4.4 Explore gene ID conversion

Now, let’s take a look at our mapped object to see how the mapping went.

# Let's use the `head()` function for a preview of our mapped list
head(mapped_list)
## $ENSMUSG00000000001
## [1] "Gnai3"
## 
## $ENSMUSG00000000003
## [1] "Pbsn"
## 
## $ENSMUSG00000000028
## [1] "Cdc45"
## 
## $ENSMUSG00000000031
## [1] "H19"
## 
## $ENSMUSG00000000037
## [1] "Scml2"
## 
## $ENSMUSG00000000049
## [1] "Apoh"

It looks like we have gene symbols that were successfully mapped to the Ensembl IDs we provided. However, the data is now in a list object, making it a little more difficult to explore. We are going to turn our list object into a data frame object in the next chunk.

# Let's make our list a bit more manageable by turning it into a data frame
mapped_df <- mapped_list %>%
  tibble::enframe(name = "Ensembl", value = "Symbol") %>%
  # enframe() makes a `list` column; we will simplify it with unnest()
  # This will result in a data frame with one row per list item
  tidyr::unnest(cols = Symbol)

Now let’s take a peek at our data frame.

head(mapped_df)

We can see that our data frame has a new column Symbol. Let’s get a summary of the gene symbols in the Symbol column of our mapped data frame.

# Use the `summary()` function to show the distribution of Symbol values
# We need to use `as.factor()` here to get the counts of unique values
# `maxsum = 10` limits the summary to 10 distinct values
summary(as.factor(mapped_df$Symbol), maxsum = 10)
##       Cyp2c39          Pms2 0610005C13Rik 0610009B22Rik 0610009L18Rik 
##             2             2             1             1             1 
## 0610010F05Rik 0610012G03Rik 0610030E20Rik       (Other)          NA's 
##             1             1             1         17021           942

There are 942 NAs in the Symbol column, which means that 942 out of the 17918 Ensembl IDs did not map to gene symbols. 942 out of 17918 is not too bad a rate, in our opinion, but note that different gene identifier types will have different mapping rates and that is to be expected. Regardless, it is always good to be aware of how many genes you are potentially “losing” if you rely on this new gene identifier you’ve mapped to for downstream analyses.

However, if you have almost all NAs it is possible that the function was executed incorrectly or you may want to consider using a different gene identifier, if possible.

Now let’s check to see if we have any genes that were mapped to multiple symbols.

multi_mapped <- mapped_df %>%
  # Let's count the number of times each Ensembl ID appears in `Ensembl` column
  dplyr::count(Ensembl, name = "gene_symbol_count") %>%
  # Arrange by the genes with the highest number of symbols mapped
  dplyr::arrange(desc(gene_symbol_count)) %>%
  # Filter to include only the rows with multi mappings
  dplyr::filter(gene_symbol_count > 1)

# Let's look at the first 6 rows of our `multi_mapped` object
head(multi_mapped)

Looks like we have some cases where 3 gene symbols mapped to a single Ensembl ID. We have a total of 130 out of 17984 Ensembl IDs with multiple mappings to gene symbols. If we are not too worried about the 130 IDs with multiple mappings, we can filter them out for the purpose of having 1:1 mappings for our downstream analysis.

4.5 Map Ensembl IDs to gene symbols – filtering out multi mappings

The next code chunk we will rerun the mapIds() function, this time supplying the "filter" option to the multiVals argument. This will remove all instances of multiple mappings and return a list of only the gene identifiers and symbols that had 1:1 mapping. Use ?mapIds to see more options or strategies.

# Map Ensembl IDs to their associated gene symbols
filtered_mapped_df <- data.frame(
  "Symbol" = mapIds(
    org.Mm.eg.db, # Replace with annotation package for your organism
    keys = expression_df$Gene,
    keytype = "ENSEMBL", # Replace with the gene identifiers used in your data
    column = "SYMBOL", # The type of gene identifiers you would like to map to
    multiVals = "filter" # This will drop any genes that have multiple matches
  )
) %>%
  # Make an `Ensembl` column to store the rownames
  tibble::rownames_to_column("Ensembl") %>%
  # Join the remaining data from `expression_df` using the Ensembl IDs
  dplyr::inner_join(expression_df, by = c("Ensembl" = "Gene"))
## 'select()' returned 1:many mapping between keys and columns

Now, let’s write our filtered and mapped results to file!

4.6 Write mapped results to file

# Write mapped and annotated data frame to output file
readr::write_tsv(filtered_mapped_df, file.path(
  results_dir,
  "GSE13490_Gene_Symbols.tsv" # Replace with a relevant output file name
))

5 Resources for further learning

6 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

# Print session info
sessioninfo::session_info()
## ─ Session info ─────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.5 (2021-03-31)
##  os       Ubuntu 20.04.3 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Etc/UTC                     
##  date     2022-03-01                  
## 
## ─ Packages ─────────────────────────────────────────────────────────
##  package       * version date       lib source        
##  AnnotationDbi * 1.52.0  2020-10-27 [1] Bioconductor  
##  assertthat      0.2.1   2019-03-21 [1] RSPM (R 4.0.3)
##  backports       1.2.1   2020-12-09 [1] RSPM (R 4.0.3)
##  Biobase       * 2.50.0  2020-10-27 [1] Bioconductor  
##  BiocGenerics  * 0.36.1  2021-04-16 [1] Bioconductor  
##  bit             4.0.4   2020-08-04 [1] RSPM (R 4.0.3)
##  bit64           4.0.5   2020-08-30 [1] RSPM (R 4.0.3)
##  blob            1.2.1   2020-01-20 [1] RSPM (R 4.0.3)
##  bslib           0.2.5   2021-05-12 [1] RSPM (R 4.0.4)
##  cachem          1.0.5   2021-05-15 [1] RSPM (R 4.0.4)
##  cli             2.5.0   2021-04-26 [1] RSPM (R 4.0.4)
##  crayon          1.4.1   2021-02-08 [1] RSPM (R 4.0.3)
##  DBI             1.1.1   2021-01-15 [1] RSPM (R 4.0.3)
##  digest          0.6.27  2020-10-24 [1] RSPM (R 4.0.3)
##  dplyr           1.0.6   2021-05-05 [1] RSPM (R 4.0.4)
##  ellipsis        0.3.2   2021-04-29 [1] RSPM (R 4.0.4)
##  evaluate        0.14    2019-05-28 [1] RSPM (R 4.0.3)
##  fansi           0.4.2   2021-01-15 [1] RSPM (R 4.0.3)
##  fastmap         1.1.0   2021-01-25 [1] RSPM (R 4.0.3)
##  generics        0.1.0   2020-10-31 [1] RSPM (R 4.0.3)
##  getopt          1.20.3  2019-03-22 [1] RSPM (R 4.0.0)
##  glue            1.4.2   2020-08-27 [1] RSPM (R 4.0.3)
##  hms             1.0.0   2021-01-13 [1] RSPM (R 4.0.3)
##  htmltools       0.5.1.1 2021-01-22 [1] RSPM (R 4.0.3)
##  IRanges       * 2.24.1  2020-12-12 [1] Bioconductor  
##  jquerylib       0.1.4   2021-04-26 [1] RSPM (R 4.0.4)
##  jsonlite        1.7.2   2020-12-09 [1] RSPM (R 4.0.3)
##  knitr           1.33    2021-04-24 [1] RSPM (R 4.0.4)
##  lifecycle       1.0.0   2021-02-15 [1] RSPM (R 4.0.3)
##  magrittr      * 2.0.1   2020-11-17 [1] RSPM (R 4.0.3)
##  memoise         2.0.0   2021-01-26 [1] RSPM (R 4.0.3)
##  optparse      * 1.6.6   2020-04-16 [1] RSPM (R 4.0.0)
##  org.Mm.eg.db  * 3.12.0  2022-03-01 [1] Bioconductor  
##  pillar          1.6.1   2021-05-16 [1] RSPM (R 4.0.4)
##  pkgconfig       2.0.3   2019-09-22 [1] RSPM (R 4.0.3)
##  ps              1.6.0   2021-02-28 [1] RSPM (R 4.0.3)
##  purrr           0.3.4   2020-04-17 [1] RSPM (R 4.0.3)
##  R.cache         0.15.0  2021-04-30 [1] RSPM (R 4.0.4)
##  R.methodsS3     1.8.1   2020-08-26 [1] RSPM (R 4.0.3)
##  R.oo            1.24.0  2020-08-26 [1] RSPM (R 4.0.3)
##  R.utils         2.10.1  2020-08-26 [1] RSPM (R 4.0.3)
##  R6              2.5.0   2020-10-28 [1] RSPM (R 4.0.3)
##  Rcpp            1.0.6   2021-01-15 [1] RSPM (R 4.0.3)
##  readr           1.4.0   2020-10-05 [1] RSPM (R 4.0.4)
##  rematch2        2.1.2   2020-05-01 [1] RSPM (R 4.0.3)
##  rlang           0.4.11  2021-04-30 [1] RSPM (R 4.0.4)
##  rmarkdown       2.8     2021-05-07 [1] RSPM (R 4.0.4)
##  RSQLite         2.2.7   2021-04-22 [1] RSPM (R 4.0.4)
##  rstudioapi      0.13    2020-11-12 [1] RSPM (R 4.0.3)
##  S4Vectors     * 0.28.1  2020-12-09 [1] Bioconductor  
##  sass            0.4.0   2021-05-12 [1] RSPM (R 4.0.4)
##  sessioninfo     1.1.1   2018-11-05 [1] RSPM (R 4.0.3)
##  stringi         1.6.1   2021-05-10 [1] RSPM (R 4.0.4)
##  stringr         1.4.0   2019-02-10 [1] RSPM (R 4.0.3)
##  styler          1.4.1   2021-03-30 [1] RSPM (R 4.0.4)
##  tibble          3.1.2   2021-05-16 [1] RSPM (R 4.0.4)
##  tidyr           1.1.3   2021-03-03 [1] RSPM (R 4.0.4)
##  tidyselect      1.1.1   2021-04-30 [1] RSPM (R 4.0.4)
##  utf8            1.2.1   2021-03-12 [1] RSPM (R 4.0.3)
##  vctrs           0.3.8   2021-04-29 [1] RSPM (R 4.0.4)
##  withr           2.4.2   2021-04-18 [1] RSPM (R 4.0.4)
##  xfun            0.23    2021-05-15 [1] RSPM (R 4.0.4)
##  yaml            2.2.1   2020-02-01 [1] RSPM (R 4.0.3)
## 
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library

References

Carlson M., 2019 Genome wide annotation for mouse. https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html
Carlson M., 2020 AnnotationDbi: Introduction to bioconductor annotation packages. https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf
Pagès H., M. Carlson, S. Falcon, and N. Li, 2020 AnnotationDbi: Manipulation of SQLite-based annotations in Bioconductor. https://bioconductor.org/packages/release/bioc/html/AnnotationDbi.html