This notebook demonstrates how you can use data from the HUGO Gene Nomenclature Committee (HGNC) database to perform ortholog mapping for RNA-seq data obtained from refine.bio. In this notebook, we will use mouse data from refine.bio and annotate it with human Ensembl IDs from HGNC.
For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.
.Rmd
fileTo run this example yourself, download the .Rmd
for this analysis by clicking this link.
Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd
file to where you would like this example and its files to be stored.
You can open this .Rmd
file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd
files.)
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
dir.create("data")
}
# Define the file path to the plots directory
<- "plots"
plots_dir
# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
dir.create(plots_dir)
}
# Define the file path to the results directory
<- "results"
results_dir
# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
dir.create(results_dir)
}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
For general information about downloading data for these examples, see our ‘Getting Started’ section.
Go to this dataset’s page on refine.bio.
Click the “Download Now” button on the right side of this screen.
Fill out the pop up window with your email and our Terms and Conditions:
We are going to use non-quantile normalized data for this analysis. To get this data, you will need to check the box that says “Skip quantile normalization for RNA-seq samples.” Note that this option will only be available for RNA-seq datasets.
It may take a few minutes for the dataset to process. You will get an email when it is ready.
For this example analysis, we will use this acute myeloid leukemia sample dataset.
The data that we downloaded from refine.bio for this analysis has 19 samples (obtained from 19 acute myeloid leukemia (AML) mouse models), containing RNA-sequencing results for types of AML under controlled treatment conditions. More specifically, IDH2-mutant AML mouse models were treated with either vehicle or AG-221 (the first small molecule in-vivo inhibitor of IDH2 to enter clinical trials). The TET2-mutant AML mouse models were treated with vehicle or 5-Azacytidine (Decitabine, hypomethylating agent).
data/
folderrefine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip
. Double clicking should unzip this for you and create a folder of the same name.
For more details on the contents of this folder see these docs on refine.bio.
The <experiment_accession_id>
folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235
or SRP12345
.
Copy and paste the SRP070849
folder into your newly created data/
folder.
Your new analysis folder should contain:
.Rmd
you downloadedSRP070849
folder which contains:
plots
(currently empty)results
(currently empty)Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using):
In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.
First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.
# Define the file path to the data directory
# Replace with the path of the folder the files will be in
<- file.path("data", "SRP070849")
data_dir
# Declare the file path to the gene expression matrix file
# inside directory saved as `data_dir`
# Replace with the path to your dataset file
<- file.path(data_dir, "SRP070849.tsv")
data_file
# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
# Replace with the path to your metadata file
<- file.path(data_dir, "metadata_SRP070849.tsv") metadata_file
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
## [1] TRUE
# Check if the metadata file is at the file path stored in `metadata_file`
file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE
to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.
If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/
directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/
and results/
directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.
See our Getting Started page with instructions for package installation for a list of the software you will need, as well as more tips and resources.
Attach a package we need for this analysis.
# We will need this so we can use the pipe: %>%
library(magrittr)
The HUGO Gene Nomenclature Committee (HGNC) assigns a unique and ideally meaningful name and symbol to every human gene. The HGNC database currently contains over 39,000 public records containing approved human gene nomenclature and associated gene information (Gray et al. 2015).
The HGNC Comparison of Orthology Predictions (HCOP) is a search tool that combines orthology predictions for a specified human gene, or set of human genes from a variety of sources, including Ensembl Compara, HGNC, and NCBI Gene Orthology (Wright et al. 2005). In general, an orthology prediction where most of the databases concur would be considered the reliable, and we will use this to prioritize mapping in cases where there is more than one possible ortholog for a gene. HCOP was originally designed to show orthology predictions between human and mouse, but has been expanded to include data from 18 genomes, including mouse, which we will use in this notebook (HGNC Team 2020).
We can download the human mouse file we need for this example using download.file()
command. For this notebook, we want to download the file named human_mouse_hcop_fifteen_column.txt.gz
.
First we’ll declare a sensible file path for this.
# Declare what we want the downloaded file to be called and its location
<- file.path(
mouse_hgnc_file
data_dir,# The name the file will have locally
"human_mouse_hcop_fifteen_column.txt.gz"
)
Using the file path we just declared, we can use the destfile
argument to download the file we need to this directory and use this file name.
We are downloading this orthology predictions file from the HGNC database. If you are looking for a different species, see the directory page of the HGNC Comparison of Orthology Predictions (HCOP) files and find the file name of the species you are looking for.
download.file(
paste0(
"http://ftp.ebi.ac.uk/pub/databases/genenames/hcop/",
# Replace with the file name for the species conversion you want
"human_mouse_hcop_fifteen_column.txt.gz"
),# The file will be saved to the name and location we defined earlier
destfile = mouse_hgnc_file
)
If you are using a different dataset, in the last chunk you can replace mouse
in human_mouse_hcop_fifteen_column.txt.gz
with the name of the species you have data for (if you see it listed in the directory).
Ortholog species files with the ‘6 Column’ output returns the raw assertions, Ensembl gene IDs and Entrez Gene IDs for human and one other species, while the ‘15 Column’ output includes additional information such as the chromosomal location, accession numbers and the databases that support the assertions.
Now let’s double check that the mouse ortholog file is in the right place.
# Check if the organism orthology file file is in the `data` directory
file.exists(mouse_hgnc_file)
## [1] TRUE
Now we can read in the orthology file that we downloaded.
# Read in the data from HGNC
<- readr::read_tsv(mouse_hgnc_file) mouse
##
## ── Column specification ──────────────────────────────────────────────
## cols(
## human_entrez_gene = col_character(),
## human_ensembl_gene = col_character(),
## hgnc_id = col_character(),
## human_name = col_character(),
## human_symbol = col_character(),
## human_chr = col_character(),
## human_assert_ids = col_character(),
## mouse_entrez_gene = col_character(),
## mouse_ensembl_gene = col_character(),
## mgi_id = col_character(),
## mouse_name = col_character(),
## mouse_symbol = col_character(),
## mouse_chr = col_character(),
## mouse_assert_ids = col_character(),
## support = col_character()
## )
Let’s take a look at what mouse
looks like.
mouse
We are going to manipulate some of the column names to make things easier when calling them downstream.
<- mouse %>%
mouse set_names(names(.) %>%
# Removing extra word in some of the column names
gsub("_gene$", "", .))
Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the data TSV file and add it as an object to your environment.
We stored our data file path as an object named data_file
in this previous step.
# Read in data TSV file
<- readr::read_tsv(data_file) %>%
mouse_genes # We only want the gene IDs so let's keep only the `Gene` column
::select("Gene") dplyr
##
## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_double(),
## Gene = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
refine.bio data uses Ensembl gene identifiers, which will be in the first column.
# Let's take a look at the first 6 rows of `mouse_genes`
head(mouse_genes)
Ensembl gene identifiers have different species-specific prefixes. In mouse, Ensembl gene identifiers begin with ENSMUSG
(in human, ENSG
, etc.).
Now let’s do the mapping!
We’re interested in the human_ensembl
, mouse_ensembl
, and support
columns specifically. The support
column contains a list of associated databases that support each assertion. This column may assist with addressing some of the multi-mappings that we will talk about later.
<- mouse %>%
human_mouse_key # We'll want to subset mouse to only the columns we're interested in
::select(mouse_ensembl, human_ensembl, support)
dplyr
# Since we ignored the additional columns in `mouse`, let's check to see if we
# have any duplicates in our `human_mouse_key`
any(duplicated(human_mouse_key))
## [1] TRUE
We do have duplicates! We don’t want to handle duplicate data, so let’s remove those duplicates before moving forward.
<- human_mouse_key %>%
human_mouse_key # Use the `distinct()` function to remove duplicates resulting from
# dropping the additional columns in the `mouse` data frame
::distinct() dplyr
Now let’s join the mapped data from human_mouse_key
with the gene data in mouse_genes
.
# First, we need to convert our vector of mouse genes into a data frame
<- mouse_genes %>%
human_mouse_mapped_df # Now we can join the mapped data
::left_join(human_mouse_key, by = c("Gene" = "mouse_ensembl")) dplyr
Here’s what the new data frame looks like:
head(human_mouse_mapped_df, n = 10)
Looks like we have mapped IDs!
So now we have all the mouse genes mapped to human, but there might be places where there are multiple mouse genes that are orthologous to the same human gene, or vice versa.
Let’s get a summary of the Ensembl IDs returned in the human_ensembl
column of our mapped data frame.
# We can use this `count()` function to get a tally of how many
# `mouse_ensembl` IDs there are per `human_ensembl` IDs
%>%
human_mouse_mapped_df # Count the number of rows per human gene
::count(human_ensembl) %>%
dplyr# Sort by highest `n` which will be the human Ensembl ID with the most mapped
# mouse Ensembl IDs
::arrange(desc(n)) dplyr
There are certainly a good number of places where we mapped multiple mouse Ensembl IDs to the same human symbol! We’ll look at this in a bit.
We can also see that there 19,126 mouse Ensembl IDs that did not map to a human Ensembl ID. These are the rows with a value of NA. This seems like a lot, but most of these are likely to be non-protein-coding genes that do not map easily across species.
If a mouse Ensembl gene ID maps to multiple human Ensembl IDs, the associated values will get duplicated. Let’s look at the ENSMUSG00000000290
example below.
%>%
human_mouse_mapped_df ::filter(Gene == "ENSMUSG00000000290") dplyr
On the other hand, if you were to look at the original data associated to the mouse Ensembl IDs, when a human Ensembl ID maps to multiple mouse Ensembl IDs, the values will not get duplicated, but you will have multiple rows associated with that human Ensembl ID. Let’s look at the ENSG00000001617
example below.
%>%
human_mouse_mapped_df ::filter(human_ensembl == "ENSG00000001617") dplyr
Remember that if a mouse Ensembl gene ID maps to multiple human symbols, the values get duplicated. We can collapse the multi-mapped values into a list for each Ensembl ID as to not have duplicate values in our data frame.
In the next chunk, we show how we can collapse all the human Ensembl IDs into one column where we store them all for each unique mouse Ensembl ID.
<- human_mouse_mapped_df %>%
collapsed_human_ensembl_df # Group by mouse Ensembl IDs
::group_by(Gene) %>%
dplyr# Collapse the mapped values in `human_mouse_mapped_df` to a
# `all_human_ensembl` column, removing any duplicated human symbols
# note that we will lose the `support` column in this summarizing step
::summarize(
dplyr# combine unique symbols with semicolons between them
all_human_ensembl = paste(
sort(unique(human_ensembl)),
collapse = ";"
)
)
head(collapsed_human_ensembl_df)
Now let’s write our results to file.
::write_tsv(
readr
collapsed_human_ensembl_df,file.path(
results_dir,# Replace with a relevant output file name
"SRP070849_mouse_ensembl_to_collapsed_human_gene_symbol.tsv"
) )
Since multiple mouse Ensembl gene IDs map to the same human Ensembl gene ID, we may want to identify which one of these mappings represents the “true” ortholog, i.e. which mouse gene is most similar to the human gene we are interested in. This is not at all straightforward! (see this paper for just one example) (Stamboulian et al. 2020). Gene duplications along the mouse lineage may result in complicated relationships among genes, especially with regard to divisions of function.
Simply combining values across mouse transcripts using an average may result in the loss of a lot of data and will likely not be representative of the mouse biology. One thing we might do to make the problem somewhat simpler is to reduce the number of multi-mapped genes by requiring a certain level of support for each mapping from across the various databases included in HCOP
. This will not fully solve the problem (and may not even be desirable in some cases), but we present it here as an example of an approach one might take.
Therefore, we will use the support
column to decide which mappings to retain. Let’s take a look at support
.
head(human_mouse_mapped_df$support)
## [1] "OrthoDB,OrthoMCL"
## [2] "OrthoDB,OrthoMCL"
## [3] "Inparanoid,HomoloGene,PhylomeDB,NCBI,Ensembl,HGNC,OMA,Treefam,OrthoDB,Panther,OrthoMCL"
## [4] NA
## [5] "Inparanoid,HomoloGene,PhylomeDB,EggNOG,Ensembl,NCBI,HGNC,OMA,Treefam,Panther,OrthoDB,OrthoMCL"
## [6] "NCBI"
Looks like we have a variety of databases for multiple mappings, but we do have some instances of only one database reported in support of the mapping. As we noted earlier, an orthology prediction where more than one of the databases concur would be considered reliable. Therefore, where we have multi-mapped mouse Ensembl gene IDs, we will take the mappings with more than one database to support the assertion.
Before we do, let’s take a look how many multi-mapped genes there are in the data frame.
%>%
human_mouse_mapped_df # Count the number of rows in the data frame for each Ensembl ID
::count(human_ensembl) %>%
dplyr# Filter out the symbols without multimapped genes
::filter(n > 1) dplyr
Looks like we have 6,971 human gene Ensembl IDs with multiple mappings.
Now let’s filter out the less reliable mappings.
<- human_mouse_mapped_df %>%
filtered_mouse_ensembl_df # Count the number of databases in the support column
# by using the number of commas that separate the databases
::mutate(n_databases = stringr::str_count(support, ",") + 1) %>%
dplyr# Now filter to the rows where more than one database supports the mapping
::filter(n_databases > 1)
dplyr
head(filtered_mouse_ensembl_df)
Let’s count how many multi-mapped genes we have now.
%>%
filtered_mouse_ensembl_df # Group by human Ensembl IDs
::group_by(human_ensembl) %>%
dplyr# Count the number of rows in the data frame for each Ensembl ID
::count() %>%
dplyr# Filter out the symbols without multimapped genes
::filter(n > 1) dplyr
Now we only have 2,702 multi-mapped genes, compared to the 6,608 that we began with. Although we haven’t filtered down to zero multi-mapped genes, we have hopefully removed some of the less reliable mappings.
Now let’s write our filtered_mouse_ensembl_df
object, with the reliable mouse Ensembl IDs for each unique human gene Ensembl ID, to file.
::write_tsv(
readr
filtered_mouse_ensembl_df,file.path(
results_dir,# Replace with a relevant output file name
"SRP070849_filtered_mouse_ensembl_to_human_gene_symbol.tsv"
) )
At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
# Print session info
::session_info() sessioninfo
## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.5 (2021-03-31)
## os Ubuntu 20.04.3 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2022-03-01
##
## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.3)
## backports 1.2.1 2020-12-09 [1] RSPM (R 4.0.3)
## bslib 0.2.5 2021-05-12 [1] RSPM (R 4.0.4)
## cli 2.5.0 2021-04-26 [1] RSPM (R 4.0.4)
## crayon 1.4.1 2021-02-08 [1] RSPM (R 4.0.3)
## DBI 1.1.1 2021-01-15 [1] RSPM (R 4.0.3)
## digest 0.6.27 2020-10-24 [1] RSPM (R 4.0.3)
## dplyr 1.0.6 2021-05-05 [1] RSPM (R 4.0.4)
## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.0.4)
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.3)
## fansi 0.4.2 2021-01-15 [1] RSPM (R 4.0.3)
## generics 0.1.0 2020-10-31 [1] RSPM (R 4.0.3)
## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.3)
## hms 1.0.0 2021-01-13 [1] RSPM (R 4.0.3)
## htmltools 0.5.1.1 2021-01-22 [1] RSPM (R 4.0.3)
## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.0.4)
## jsonlite 1.7.2 2020-12-09 [1] RSPM (R 4.0.3)
## knitr 1.33 2021-04-24 [1] RSPM (R 4.0.4)
## lifecycle 1.0.0 2021-02-15 [1] RSPM (R 4.0.3)
## magrittr * 2.0.1 2020-11-17 [1] RSPM (R 4.0.3)
## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
## pillar 1.6.1 2021-05-16 [1] RSPM (R 4.0.4)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.3)
## ps 1.6.0 2021-02-28 [1] RSPM (R 4.0.3)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.3)
## R.cache 0.15.0 2021-04-30 [1] RSPM (R 4.0.4)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.3)
## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.3)
## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.3)
## R6 2.5.0 2020-10-28 [1] RSPM (R 4.0.3)
## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.4)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.3)
## rlang 0.4.11 2021-04-30 [1] RSPM (R 4.0.4)
## rmarkdown 2.8 2021-05-07 [1] RSPM (R 4.0.4)
## rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.0.3)
## sass 0.4.0 2021-05-12 [1] RSPM (R 4.0.4)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.3)
## stringi 1.6.1 2021-05-10 [1] RSPM (R 4.0.4)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.3)
## styler 1.4.1 2021-03-30 [1] RSPM (R 4.0.4)
## tibble 3.1.2 2021-05-16 [1] RSPM (R 4.0.4)
## tidyselect 1.1.1 2021-04-30 [1] RSPM (R 4.0.4)
## utf8 1.2.1 2021-03-12 [1] RSPM (R 4.0.3)
## vctrs 0.3.8 2021-04-29 [1] RSPM (R 4.0.4)
## withr 2.4.2 2021-04-18 [1] RSPM (R 4.0.4)
## xfun 0.23 2021-05-15 [1] RSPM (R 4.0.4)
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.3)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library