The purpose of this notebook is to provide an example of mapping gene IDs for microarray data obtained from refine.bio using AnnotationDbi
packages (Pagès et al. 2020).
For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.
.Rmd
fileTo run this example yourself, download the .Rmd
for this analysis by clicking this link.
Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd
file to where you would like this example and its files to be stored.
You can open this .Rmd
file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd
files.)
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
dir.create("data")
}
# Define the file path to the plots directory
<- "plots"
plots_dir
# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
dir.create(plots_dir)
}
# Define the file path to the results directory
<- "results"
results_dir
# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
dir.create(results_dir)
}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
For general information about downloading data for these examples, see our ‘Getting Started’ section.
Go to this dataset’s page on refine.bio.
Click the “Download Now” button on the right side of this screen.
Fill out the pop up window with your email and our Terms and Conditions:
It may take a few minutes for the dataset to process. You will get an email when it is ready.
For this example analysis, we will use this mouse glioma stem cells dataset.
This dataset has 15 microarrays measuring gene expression in a transgenic mouse model of glioma. The authors compared cells from side populations and non-side populations in both tumor samples and normal neural stem cells.
data/
folderrefine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip
. Double clicking should unzip this for you and create a folder of the same name.
For more details on the contents of this folder see these docs on refine.bio.
The <experiment_accession_id>
folder has the data and metadata TSV files you will need for this example analysis. Experiment accession IDs usually look something like GSE1235
or SRP12345
.
Copy and paste the GSE13490
folder into your newly created data/
folder.
Your new analysis folder should contain:
.Rmd
you downloadedGSE13490
folder which contains:
plots
(currently empty)results
(currently empty)Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):
In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.
First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.
# Define the file path to the data directory
# Replace with the path of the folder the files will be in
<- file.path("data", "GSE13490")
data_dir
# Declare the file path to the gene expression matrix file
# inside directory saved as `data_dir`
# Replace with the path to your dataset file
<- file.path(data_dir, "GSE13490.tsv")
data_file
# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
# Replace with the path to your metadata file
<- file.path(data_dir, "metadata_GSE13490.tsv") metadata_file
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
## [1] TRUE
# Check if the metadata file is at the file path stored in `metadata_file`
file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE
to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.
If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/
directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/
and results/
directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.
refine.bio data comes with gene level data identified by Ensembl IDs. Although this example notebook uses Ensembl IDs from Mouse, (Mus musculus), to obtain gene symbols this notebook can be easily converted for use with different species or annotation types e.g. protein IDs, gene ontology, accession numbers.
For different species, wherever the abbreviation org.Mm.eg.db
or Mm
is written, it must be replaced with the respective species abbreviation e.g. for Homo sapiens org.Hs.eg.db
or Hs
would be used. In the case of our RNA-seq gene identifier annotation example notebook, a Zebrafish (Danio rerio) dataset is used, meaning org.Dr.eg.db
or Dr
would also need to be used there. A full list of the annotation R packages from Bioconductor is at this link.
refine.bio uses Ensembl IDs as the primary gene identifier in its data sets. While this is a consistent and useful identifier, a string of apparently random letters and numbers is not the most user-friendly or informative for interpretation. Luckily, we can use the Ensembl IDs that we have to obtain various different annotations at the gene/transcript level. Let’s get ready to use the Ensembl IDs from our mouse dataset to obtain the associated gene symbols.
See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
In this analysis, we will be using the org.Mm.eg.db
R package (Carlson 2019), which is part of the Bioconductor AnnotationDbi
framework (Pagès et al. 2020). Bioconductor compiles annotations from various sources, and these packages provide convenient methods to access and translate among those annotations. Other species can be used.
# Install the mouse annotation package
if (!("org.Mm.eg.db" %in% installed.packages())) {
# Install this package if it isn't installed yet
::install("org.Mm.eg.db", update = FALSE)
BiocManager }
Attach the packages we need for this analysis. Note that attaching org.Mm.eg.db
will automatically also attach AnnotationDbi
.
# Attach the library
library(org.Mm.eg.db)
# We will need this so we can use the pipe: %>%
library(magrittr)
Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the both TSV files and add them as data frames to your environment.
We stored our file paths as objects named metadata_file
and data_file
in this previous step.
# Read in metadata TSV file
<- readr::read_tsv(metadata_file) metadata
##
## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_character(),
## refinebio_age = col_logical(),
## refinebio_cell_line = col_logical(),
## refinebio_compound = col_logical(),
## refinebio_disease = col_logical(),
## refinebio_disease_stage = col_logical(),
## refinebio_genetic_information = col_logical(),
## refinebio_processed = col_logical(),
## refinebio_race = col_logical(),
## refinebio_sex = col_logical(),
## refinebio_source_archive_url = col_logical(),
## refinebio_specimen_part = col_logical(),
## refinebio_subject = col_logical(),
## refinebio_time = col_logical(),
## refinebio_treatment = col_logical(),
## channel_count = col_double(),
## data_row_count = col_double(),
## taxid_ch1 = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
# Read in data TSV file
<- readr::read_tsv(data_file) %>%
expression_df # Tuck away the Gene ID column as row names
::column_to_rownames("Gene") tibble
##
## ── Column specification ──────────────────────────────────────────────
## cols(
## Gene = col_character(),
## GSM340064 = col_double(),
## GSM340065 = col_double(),
## GSM340066 = col_double(),
## GSM340067 = col_double(),
## GSM340068 = col_double(),
## GSM340069 = col_double(),
## GSM340070 = col_double(),
## GSM340071 = col_double(),
## GSM340072 = col_double(),
## GSM340073 = col_double(),
## GSM340074 = col_double(),
## GSM340075 = col_double(),
## GSM340076 = col_double(),
## GSM340077 = col_double(),
## GSM340078 = col_double()
## )
Let’s ensure that the metadata and data are in the same sample order.
# Make the data in the order of the metadata
<- expression_df %>%
expression_df ::select(metadata$geo_accession)
dplyr
# Check if this is in the same order
all.equal(colnames(expression_df), metadata$geo_accession)
## [1] TRUE
# Bring back the "Gene" column in preparation for mapping
<- expression_df %>%
expression_df ::rownames_to_column("Gene") tibble
The main work of translating among annotations will be done with the the AnnotationDbi
function mapIds()
. The mapIds()
function has a multiVals
argument which denotes what to do when there are multiple mapped values for a single gene identifier. The default behavior is to return just the first mapped value. It is good to keep in mind that various downstream analyses may benefit from varied strategies at this step. Use ?mapIds
to see more options or strategies.
In the next chunk, we will run the mapIds()
function and supply the multiVals
argument with the "list"
option in order to get a large list with all the mapped values found for each gene identifier.
# Map ensembl IDs to their associated gene symbols
<- mapIds(
mapped_list # Replace with annotation package for your organism
org.Mm.eg.db, keys = expression_df$Gene,
keytype = "ENSEMBL", # Replace with the gene identifiers used in your data
column = "SYMBOL", # The type of gene identifiers you would like to map to
multiVals = "list"
)
## 'select()' returned 1:many mapping between keys and columns
Now, let’s take a look at our mapped object to see how the mapping went.
# Let's use the `head()` function for a preview of our mapped list
head(mapped_list)
## $ENSMUSG00000000001
## [1] "Gnai3"
##
## $ENSMUSG00000000003
## [1] "Pbsn"
##
## $ENSMUSG00000000028
## [1] "Cdc45"
##
## $ENSMUSG00000000031
## [1] "H19"
##
## $ENSMUSG00000000037
## [1] "Scml2"
##
## $ENSMUSG00000000049
## [1] "Apoh"
It looks like we have gene symbols that were successfully mapped to the Ensembl IDs we provided. However, the data is now in a list
object, making it a little more difficult to explore. We are going to turn our list object into a data frame object in the next chunk.
# Let's make our list a bit more manageable by turning it into a data frame
<- mapped_list %>%
mapped_df ::enframe(name = "Ensembl", value = "Symbol") %>%
tibble# enframe() makes a `list` column; we will simplify it with unnest()
# This will result in a data frame with one row per list item
::unnest(cols = Symbol) tidyr
Now let’s take a peek at our data frame.
head(mapped_df)
We can see that our data frame has a new column Symbol
. Let’s get a summary of the gene symbols in the Symbol
column of our mapped data frame.
# Use the `summary()` function to show the distribution of Symbol values
# We need to use `as.factor()` here to get the counts of unique values
# `maxsum = 10` limits the summary to 10 distinct values
summary(as.factor(mapped_df$Symbol), maxsum = 10)
## Cyp2c39 Pms2 0610005C13Rik 0610009B22Rik 0610009L18Rik
## 2 2 1 1 1
## 0610010F05Rik 0610012G03Rik 0610030E20Rik (Other) NA's
## 1 1 1 17021 942
There are 942 NA
s in the Symbol
column, which means that 942 out of the 17918 Ensembl IDs did not map to gene symbols. 942 out of 17918 is not too bad a rate, in our opinion, but note that different gene identifier types will have different mapping rates and that is to be expected. Regardless, it is always good to be aware of how many genes you are potentially “losing” if you rely on this new gene identifier you’ve mapped to for downstream analyses.
However, if you have almost all NA
s it is possible that the function was executed incorrectly or you may want to consider using a different gene identifier, if possible.
Now let’s check to see if we have any genes that were mapped to multiple symbols.
<- mapped_df %>%
multi_mapped # Let's count the number of times each Ensembl ID appears in `Ensembl` column
::count(Ensembl, name = "gene_symbol_count") %>%
dplyr# Arrange by the genes with the highest number of symbols mapped
::arrange(desc(gene_symbol_count)) %>%
dplyr# Filter to include only the rows with multi mappings
::filter(gene_symbol_count > 1)
dplyr
# Let's look at the first 6 rows of our `multi_mapped` object
head(multi_mapped)
Looks like we have some cases where 3 gene symbols mapped to a single Ensembl ID. We have a total of 130 out of 17984 Ensembl IDs with multiple mappings to gene symbols. If we are not too worried about the 130 IDs with multiple mappings, we can filter them out for the purpose of having 1:1 mappings for our downstream analysis.
The next code chunk we will rerun the mapIds()
function, this time supplying the "filter"
option to the multiVals
argument. This will remove all instances of multiple mappings and return a list of only the gene identifiers and symbols that had 1:1 mapping. Use ?mapIds
to see more options or strategies.
# Map Ensembl IDs to their associated gene symbols
<- data.frame(
filtered_mapped_df "Symbol" = mapIds(
# Replace with annotation package for your organism
org.Mm.eg.db, keys = expression_df$Gene,
keytype = "ENSEMBL", # Replace with the gene identifiers used in your data
column = "SYMBOL", # The type of gene identifiers you would like to map to
multiVals = "filter" # This will drop any genes that have multiple matches
)%>%
) # Make an `Ensembl` column to store the rownames
::rownames_to_column("Ensembl") %>%
tibble# Join the remaining data from `expression_df` using the Ensembl IDs
::inner_join(expression_df, by = c("Ensembl" = "Gene")) dplyr
## 'select()' returned 1:many mapping between keys and columns
Now, let’s write our filtered and mapped results to file!
# Write mapped and annotated data frame to output file
::write_tsv(filtered_mapped_df, file.path(
readr
results_dir,"GSE13490_Gene_Symbols.tsv" # Replace with a relevant output file name
))
At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
# Print session info
::session_info() sessioninfo
## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.5 (2021-03-31)
## os Ubuntu 20.04.3 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2022-03-01
##
## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
## AnnotationDbi * 1.52.0 2020-10-27 [1] Bioconductor
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.3)
## backports 1.2.1 2020-12-09 [1] RSPM (R 4.0.3)
## Biobase * 2.50.0 2020-10-27 [1] Bioconductor
## BiocGenerics * 0.36.1 2021-04-16 [1] Bioconductor
## bit 4.0.4 2020-08-04 [1] RSPM (R 4.0.3)
## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.0.3)
## blob 1.2.1 2020-01-20 [1] RSPM (R 4.0.3)
## bslib 0.2.5 2021-05-12 [1] RSPM (R 4.0.4)
## cachem 1.0.5 2021-05-15 [1] RSPM (R 4.0.4)
## cli 2.5.0 2021-04-26 [1] RSPM (R 4.0.4)
## crayon 1.4.1 2021-02-08 [1] RSPM (R 4.0.3)
## DBI 1.1.1 2021-01-15 [1] RSPM (R 4.0.3)
## digest 0.6.27 2020-10-24 [1] RSPM (R 4.0.3)
## dplyr 1.0.6 2021-05-05 [1] RSPM (R 4.0.4)
## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.0.4)
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.3)
## fansi 0.4.2 2021-01-15 [1] RSPM (R 4.0.3)
## fastmap 1.1.0 2021-01-25 [1] RSPM (R 4.0.3)
## generics 0.1.0 2020-10-31 [1] RSPM (R 4.0.3)
## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.3)
## hms 1.0.0 2021-01-13 [1] RSPM (R 4.0.3)
## htmltools 0.5.1.1 2021-01-22 [1] RSPM (R 4.0.3)
## IRanges * 2.24.1 2020-12-12 [1] Bioconductor
## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.0.4)
## jsonlite 1.7.2 2020-12-09 [1] RSPM (R 4.0.3)
## knitr 1.33 2021-04-24 [1] RSPM (R 4.0.4)
## lifecycle 1.0.0 2021-02-15 [1] RSPM (R 4.0.3)
## magrittr * 2.0.1 2020-11-17 [1] RSPM (R 4.0.3)
## memoise 2.0.0 2021-01-26 [1] RSPM (R 4.0.3)
## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
## org.Mm.eg.db * 3.12.0 2022-03-01 [1] Bioconductor
## pillar 1.6.1 2021-05-16 [1] RSPM (R 4.0.4)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.3)
## ps 1.6.0 2021-02-28 [1] RSPM (R 4.0.3)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.3)
## R.cache 0.15.0 2021-04-30 [1] RSPM (R 4.0.4)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.3)
## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.3)
## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.3)
## R6 2.5.0 2020-10-28 [1] RSPM (R 4.0.3)
## Rcpp 1.0.6 2021-01-15 [1] RSPM (R 4.0.3)
## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.4)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.3)
## rlang 0.4.11 2021-04-30 [1] RSPM (R 4.0.4)
## rmarkdown 2.8 2021-05-07 [1] RSPM (R 4.0.4)
## RSQLite 2.2.7 2021-04-22 [1] RSPM (R 4.0.4)
## rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.0.3)
## S4Vectors * 0.28.1 2020-12-09 [1] Bioconductor
## sass 0.4.0 2021-05-12 [1] RSPM (R 4.0.4)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.3)
## stringi 1.6.1 2021-05-10 [1] RSPM (R 4.0.4)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.3)
## styler 1.4.1 2021-03-30 [1] RSPM (R 4.0.4)
## tibble 3.1.2 2021-05-16 [1] RSPM (R 4.0.4)
## tidyr 1.1.3 2021-03-03 [1] RSPM (R 4.0.4)
## tidyselect 1.1.1 2021-04-30 [1] RSPM (R 4.0.4)
## utf8 1.2.1 2021-03-12 [1] RSPM (R 4.0.3)
## vctrs 0.3.8 2021-04-29 [1] RSPM (R 4.0.4)
## withr 2.4.2 2021-04-18 [1] RSPM (R 4.0.4)
## xfun 0.23 2021-05-15 [1] RSPM (R 4.0.4)
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.3)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library