1 Purpose of this analysis

This notebook demonstrates how you can use data from the HUGO Gene Nomenclature Committee (HGNC) database to perform ortholog mapping for RNA-seq data obtained from refine.bio. In this notebook, we will use mouse data from refine.bio and annotate it with human Ensembl IDs from HGNC.

⬇️ Jump to the analysis code ⬇️

2 How to run this example

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

2.1 Obtain the .Rmd file

To run this example yourself, download the .Rmd for this analysis by clicking this link.

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
  dir.create("data")
}

# Define the file path to the plots directory
plots_dir <- "plots"

# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
  dir.create(plots_dir)
}

# Define the file path to the results directory
results_dir <- "results"

# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
  dir.create(results_dir)
}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

2.3 Obtain the dataset from refine.bio

For general information about downloading data for these examples, see our ‘Getting Started’ section.

Go to this dataset’s page on refine.bio.

Click the “Download Now” button on the right side of this screen.

Fill out the pop up window with your email and our Terms and Conditions:

We are going to use non-quantile normalized data for this analysis. To get this data, you will need to check the box that says “Skip quantile normalization for RNA-seq samples.” Note that this option will only be available for RNA-seq datasets.

It may take a few minutes for the dataset to process. You will get an email when it is ready.

2.4 About the dataset we are using for this example

For this example analysis, we will use this acute myeloid leukemia sample dataset.

The data that we downloaded from refine.bio for this analysis has 19 samples (obtained from 19 acute myeloid leukemia (AML) mouse models), containing RNA-sequencing results for types of AML under controlled treatment conditions. More specifically, IDH2-mutant AML mouse models were treated with either vehicle or AG-221 (the first small molecule in-vivo inhibitor of IDH2 to enter clinical trials). The TET2-mutant AML mouse models were treated with vehicle or 5-Azacytidine (Decitabine, hypomethylating agent).

2.5 Place the dataset in your new data/ folder

refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.

For more details on the contents of this folder see these docs on refine.bio.

The <experiment_accession_id> folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235 or SRP12345.

Copy and paste the SRP070849 folder into your newly created data/ folder.

2.6 Check out our file structure!

Your new analysis folder should contain:

  • The example analysis .Rmd you downloaded
  • A folder called “data” which contains:
    • The SRP070849 folder which contains:
      • The gene expression
      • The metadata TSV
  • A folder for plots (currently empty)
  • A folder for results (currently empty)

Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using):

In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.

# Define the file path to the data directory
# Replace with the path of the folder the files will be in
data_dir <- file.path("data", "SRP070849")

# Declare the file path to the gene expression matrix file
# inside directory saved as `data_dir`
# Replace with the path to your dataset file
data_file <- file.path(data_dir, "SRP070849.tsv")

# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
# Replace with the path to your metadata file
metadata_file <- file.path(data_dir, "metadata_SRP070849.tsv")

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
## [1] TRUE
# Check if the metadata file is at the file path stored in `metadata_file`
file.exists(metadata_file)
## [1] TRUE

If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.

3 Using a different refine.bio dataset with this analysis?

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.


 

4 Ortholog Mapping - RNA-seq

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the software you will need, as well as more tips and resources.

Attach a package we need for this analysis.

# We will need this so we can use the pipe: %>%
library(magrittr)

4.2 Import data from HGNC

The HUGO Gene Nomenclature Committee (HGNC) assigns a unique and ideally meaningful name and symbol to every human gene. The HGNC database currently contains over 39,000 public records containing approved human gene nomenclature and associated gene information (Gray et al. 2015).

The HGNC Comparison of Orthology Predictions (HCOP) is a search tool that combines orthology predictions for a specified human gene, or set of human genes from a variety of sources, including Ensembl Compara, HGNC, and NCBI Gene Orthology (Wright et al. 2005). In general, an orthology prediction where most of the databases concur would be considered the reliable, and we will use this to prioritize mapping in cases where there is more than one possible ortholog for a gene. HCOP was originally designed to show orthology predictions between human and mouse, but has been expanded to include data from 18 genomes, including mouse, which we will use in this notebook (HGNC Team 2020).

We can download the human mouse file we need for this example using download.file() command. For this notebook, we want to download the file named human_mouse_hcop_fifteen_column.txt.gz.

First we’ll declare a sensible file path for this.

# Declare what we want the downloaded file to be called and its location
mouse_hgnc_file <- file.path(
  data_dir,
  # The name the file will have locally
  "human_mouse_hcop_fifteen_column.txt.gz"
)

Using the file path we just declared, we can use the destfile argument to download the file we need to this directory and use this file name.

We are downloading this orthology predictions file from the HGNC database. If you are looking for a different species, see the directory page of the HGNC Comparison of Orthology Predictions (HCOP) files and find the file name of the species you are looking for.

download.file(
  paste0(
    "http://ftp.ebi.ac.uk/pub/databases/genenames/hcop/",
    # Replace with the file name for the species conversion you want
    "human_mouse_hcop_fifteen_column.txt.gz"
  ),
  # The file will be saved to the name and location we defined earlier
  destfile = mouse_hgnc_file
)

If you are using a different dataset, in the last chunk you can replace mouse in human_mouse_hcop_fifteen_column.txt.gz with the name of the species you have data for (if you see it listed in the directory).

Ortholog species files with the ‘6 Column’ output returns the raw assertions, Ensembl gene IDs and Entrez Gene IDs for human and one other species, while the ‘15 Column’ output includes additional information such as the chromosomal location, accession numbers and the databases that support the assertions.

Now let’s double check that the mouse ortholog file is in the right place.

# Check if the organism orthology file file is in the `data` directory
file.exists(mouse_hgnc_file)
## [1] TRUE

Now we can read in the orthology file that we downloaded.

# Read in the data from HGNC
mouse <- readr::read_tsv(mouse_hgnc_file)
## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   human_entrez_gene = col_character(),
##   human_ensembl_gene = col_character(),
##   hgnc_id = col_character(),
##   human_name = col_character(),
##   human_symbol = col_character(),
##   human_chr = col_character(),
##   human_assert_ids = col_character(),
##   mouse_entrez_gene = col_character(),
##   mouse_ensembl_gene = col_character(),
##   mgi_id = col_character(),
##   mouse_name = col_character(),
##   mouse_symbol = col_character(),
##   mouse_chr = col_character(),
##   mouse_assert_ids = col_character(),
##   support = col_character()
## )

Let’s take a look at what mouse looks like.

mouse

We are going to manipulate some of the column names to make things easier when calling them downstream.

mouse <- mouse %>%
  set_names(names(.) %>%
    # Removing extra word in some of the column names
    gsub("_gene$", "", .))

4.3 Import and set up data

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the data TSV file and add it as an object to your environment.

We stored our data file path as an object named data_file in this previous step.

# Read in data TSV file
mouse_genes <- readr::read_tsv(data_file) %>%
  # We only want the gene IDs so let's keep only the `Gene` column
  dplyr::select("Gene")
## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   Gene = col_character()
## )
## ℹ Use `spec()` for the full column specifications.

4.4 Mapping mouse Ensembl gene IDs to human Ensembl gene IDs

refine.bio data uses Ensembl gene identifiers, which will be in the first column.

# Let's take a look at the first 6 rows of `mouse_genes`
head(mouse_genes)

Ensembl gene identifiers have different species-specific prefixes. In mouse, Ensembl gene identifiers begin with ENSMUSG (in human, ENSG, etc.).

Now let’s do the mapping!

We’re interested in the human_ensembl, mouse_ensembl, and support columns specifically. The support column contains a list of associated databases that support each assertion. This column may assist with addressing some of the multi-mappings that we will talk about later.

human_mouse_key <- mouse %>%
  # We'll want to subset mouse to only the columns we're interested in
  dplyr::select(mouse_ensembl, human_ensembl, support)

# Since we ignored the additional columns in `mouse`, let's check to see if we
# have any duplicates in our `human_mouse_key`
any(duplicated(human_mouse_key))
## [1] TRUE

We do have duplicates! We don’t want to handle duplicate data, so let’s remove those duplicates before moving forward.

human_mouse_key <- human_mouse_key %>%
  # Use the `distinct()` function to remove duplicates resulting from
  # dropping the additional columns in the `mouse` data frame
  dplyr::distinct()

Now let’s join the mapped data from human_mouse_key with the gene data in mouse_genes.

# First, we need to convert our vector of mouse genes into a data frame
human_mouse_mapped_df <- mouse_genes %>%
  # Now we can join the mapped data
  dplyr::left_join(human_mouse_key, by = c("Gene" = "mouse_ensembl"))

Here’s what the new data frame looks like:

head(human_mouse_mapped_df, n = 10)

Looks like we have mapped IDs!

So now we have all the mouse genes mapped to human, but there might be places where there are multiple mouse genes that are orthologous to the same human gene, or vice versa.

Let’s get a summary of the Ensembl IDs returned in the human_ensembl column of our mapped data frame.

# We can use this `count()` function to get a tally of how many
# `mouse_ensembl` IDs there are per `human_ensembl` IDs
human_mouse_mapped_df %>%
  # Count the number of rows per human gene
  dplyr::count(human_ensembl) %>%
  # Sort by highest `n` which will be the human Ensembl ID with the most mapped
  # mouse Ensembl IDs
  dplyr::arrange(desc(n))

There are certainly a good number of places where we mapped multiple mouse Ensembl IDs to the same human symbol! We’ll look at this in a bit.

We can also see that there 19,126 mouse Ensembl IDs that did not map to a human Ensembl ID. These are the rows with a value of NA. This seems like a lot, but most of these are likely to be non-protein-coding genes that do not map easily across species.

4.5 Take a look at some multi-mappings

If a mouse Ensembl gene ID maps to multiple human Ensembl IDs, the associated values will get duplicated. Let’s look at the ENSMUSG00000000290 example below.

human_mouse_mapped_df %>%
  dplyr::filter(Gene == "ENSMUSG00000000290")

On the other hand, if you were to look at the original data associated to the mouse Ensembl IDs, when a human Ensembl ID maps to multiple mouse Ensembl IDs, the values will not get duplicated, but you will have multiple rows associated with that human Ensembl ID. Let’s look at the ENSG00000001617 example below.

human_mouse_mapped_df %>%
  dplyr::filter(human_ensembl == "ENSG00000001617")

4.6 Collapse mouse genes mapping to multiple human genes

Remember that if a mouse Ensembl gene ID maps to multiple human symbols, the values get duplicated. We can collapse the multi-mapped values into a list for each Ensembl ID as to not have duplicate values in our data frame.

In the next chunk, we show how we can collapse all the human Ensembl IDs into one column where we store them all for each unique mouse Ensembl ID.

collapsed_human_ensembl_df <- human_mouse_mapped_df %>%
  # Group by mouse Ensembl IDs
  dplyr::group_by(Gene) %>%
  # Collapse the mapped values in `human_mouse_mapped_df` to a
  # `all_human_ensembl` column, removing any duplicated human symbols
  # note that we will lose the `support` column in this summarizing step
  dplyr::summarize(
    # combine unique symbols with semicolons between them
    all_human_ensembl = paste(
      sort(unique(human_ensembl)),
      collapse = ";"
    )
  )

head(collapsed_human_ensembl_df)

4.6.1 Write results to file

Now let’s write our results to file.

readr::write_tsv(
  collapsed_human_ensembl_df,
  file.path(
    results_dir,
    # Replace with a relevant output file name
    "SRP070849_mouse_ensembl_to_collapsed_human_gene_symbol.tsv"
  )
)

4.7 Collapse human genes mapping to multiple mouse genes

Since multiple mouse Ensembl gene IDs map to the same human Ensembl gene ID, we may want to identify which one of these mappings represents the “true” ortholog, i.e. which mouse gene is most similar to the human gene we are interested in. This is not at all straightforward! (see this paper for just one example) (Stamboulian et al. 2020). Gene duplications along the mouse lineage may result in complicated relationships among genes, especially with regard to divisions of function.

Simply combining values across mouse transcripts using an average may result in the loss of a lot of data and will likely not be representative of the mouse biology. One thing we might do to make the problem somewhat simpler is to reduce the number of multi-mapped genes by requiring a certain level of support for each mapping from across the various databases included in HCOP. This will not fully solve the problem (and may not even be desirable in some cases), but we present it here as an example of an approach one might take.

Therefore, we will use the support column to decide which mappings to retain. Let’s take a look at support.

head(human_mouse_mapped_df$support)
## [1] "OrthoDB,OrthoMCL"                                                                             
## [2] "OrthoDB,OrthoMCL"                                                                             
## [3] "Inparanoid,HomoloGene,PhylomeDB,NCBI,Ensembl,HGNC,OMA,Treefam,OrthoDB,Panther,OrthoMCL"       
## [4] NA                                                                                             
## [5] "Inparanoid,HomoloGene,PhylomeDB,EggNOG,Ensembl,NCBI,HGNC,OMA,Treefam,Panther,OrthoDB,OrthoMCL"
## [6] "NCBI"

Looks like we have a variety of databases for multiple mappings, but we do have some instances of only one database reported in support of the mapping. As we noted earlier, an orthology prediction where more than one of the databases concur would be considered reliable. Therefore, where we have multi-mapped mouse Ensembl gene IDs, we will take the mappings with more than one database to support the assertion.

Before we do, let’s take a look how many multi-mapped genes there are in the data frame.

human_mouse_mapped_df %>%
  # Count the number of rows in the data frame for each Ensembl ID
  dplyr::count(human_ensembl) %>%
  # Filter out the symbols without multimapped genes
  dplyr::filter(n > 1)

Looks like we have 6,971 human gene Ensembl IDs with multiple mappings.

Now let’s filter out the less reliable mappings.

filtered_mouse_ensembl_df <- human_mouse_mapped_df %>%
  # Count the number of databases in the support column
  # by using the number of commas that separate the databases
  dplyr::mutate(n_databases = stringr::str_count(support, ",") + 1) %>%
  # Now filter to the rows where more than one database supports the mapping
  dplyr::filter(n_databases > 1)

head(filtered_mouse_ensembl_df)

Let’s count how many multi-mapped genes we have now.

filtered_mouse_ensembl_df %>%
  # Group by human Ensembl IDs
  dplyr::group_by(human_ensembl) %>%
  # Count the number of rows in the data frame for each Ensembl ID
  dplyr::count() %>%
  # Filter out the symbols without multimapped genes
  dplyr::filter(n > 1)

Now we only have 2,702 multi-mapped genes, compared to the 6,608 that we began with. Although we haven’t filtered down to zero multi-mapped genes, we have hopefully removed some of the less reliable mappings.

4.7.1 Write results to file

Now let’s write our filtered_mouse_ensembl_df object, with the reliable mouse Ensembl IDs for each unique human gene Ensembl ID, to file.

readr::write_tsv(
  filtered_mouse_ensembl_df,
  file.path(
    results_dir,
    # Replace with a relevant output file name
    "SRP070849_filtered_mouse_ensembl_to_human_gene_symbol.tsv"
  )
)

5 Resources for further learning

6 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

# Print session info
sessioninfo::session_info()
## ─ Session info ─────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.5 (2021-03-31)
##  os       Ubuntu 20.04.3 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Etc/UTC                     
##  date     2022-03-01                  
## 
## ─ Packages ─────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  assertthat    0.2.1   2019-03-21 [1] RSPM (R 4.0.3)
##  backports     1.2.1   2020-12-09 [1] RSPM (R 4.0.3)
##  bslib         0.2.5   2021-05-12 [1] RSPM (R 4.0.4)
##  cli           2.5.0   2021-04-26 [1] RSPM (R 4.0.4)
##  crayon        1.4.1   2021-02-08 [1] RSPM (R 4.0.3)
##  DBI           1.1.1   2021-01-15 [1] RSPM (R 4.0.3)
##  digest        0.6.27  2020-10-24 [1] RSPM (R 4.0.3)
##  dplyr         1.0.6   2021-05-05 [1] RSPM (R 4.0.4)
##  ellipsis      0.3.2   2021-04-29 [1] RSPM (R 4.0.4)
##  evaluate      0.14    2019-05-28 [1] RSPM (R 4.0.3)
##  fansi         0.4.2   2021-01-15 [1] RSPM (R 4.0.3)
##  generics      0.1.0   2020-10-31 [1] RSPM (R 4.0.3)
##  getopt        1.20.3  2019-03-22 [1] RSPM (R 4.0.0)
##  glue          1.4.2   2020-08-27 [1] RSPM (R 4.0.3)
##  hms           1.0.0   2021-01-13 [1] RSPM (R 4.0.3)
##  htmltools     0.5.1.1 2021-01-22 [1] RSPM (R 4.0.3)
##  jquerylib     0.1.4   2021-04-26 [1] RSPM (R 4.0.4)
##  jsonlite      1.7.2   2020-12-09 [1] RSPM (R 4.0.3)
##  knitr         1.33    2021-04-24 [1] RSPM (R 4.0.4)
##  lifecycle     1.0.0   2021-02-15 [1] RSPM (R 4.0.3)
##  magrittr    * 2.0.1   2020-11-17 [1] RSPM (R 4.0.3)
##  optparse    * 1.6.6   2020-04-16 [1] RSPM (R 4.0.0)
##  pillar        1.6.1   2021-05-16 [1] RSPM (R 4.0.4)
##  pkgconfig     2.0.3   2019-09-22 [1] RSPM (R 4.0.3)
##  ps            1.6.0   2021-02-28 [1] RSPM (R 4.0.3)
##  purrr         0.3.4   2020-04-17 [1] RSPM (R 4.0.3)
##  R.cache       0.15.0  2021-04-30 [1] RSPM (R 4.0.4)
##  R.methodsS3   1.8.1   2020-08-26 [1] RSPM (R 4.0.3)
##  R.oo          1.24.0  2020-08-26 [1] RSPM (R 4.0.3)
##  R.utils       2.10.1  2020-08-26 [1] RSPM (R 4.0.3)
##  R6            2.5.0   2020-10-28 [1] RSPM (R 4.0.3)
##  readr         1.4.0   2020-10-05 [1] RSPM (R 4.0.4)
##  rematch2      2.1.2   2020-05-01 [1] RSPM (R 4.0.3)
##  rlang         0.4.11  2021-04-30 [1] RSPM (R 4.0.4)
##  rmarkdown     2.8     2021-05-07 [1] RSPM (R 4.0.4)
##  rstudioapi    0.13    2020-11-12 [1] RSPM (R 4.0.3)
##  sass          0.4.0   2021-05-12 [1] RSPM (R 4.0.4)
##  sessioninfo   1.1.1   2018-11-05 [1] RSPM (R 4.0.3)
##  stringi       1.6.1   2021-05-10 [1] RSPM (R 4.0.4)
##  stringr       1.4.0   2019-02-10 [1] RSPM (R 4.0.3)
##  styler        1.4.1   2021-03-30 [1] RSPM (R 4.0.4)
##  tibble        3.1.2   2021-05-16 [1] RSPM (R 4.0.4)
##  tidyselect    1.1.1   2021-04-30 [1] RSPM (R 4.0.4)
##  utf8          1.2.1   2021-03-12 [1] RSPM (R 4.0.3)
##  vctrs         0.3.8   2021-04-29 [1] RSPM (R 4.0.4)
##  withr         2.4.2   2021-04-18 [1] RSPM (R 4.0.4)
##  xfun          0.23    2021-05-15 [1] RSPM (R 4.0.4)
##  yaml          2.2.1   2020-02-01 [1] RSPM (R 4.0.3)
## 
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library

References

Gray K. A., B. Yates, R. L. Seal, M. W. Wright, and E. A. Bruford, 2015 Genenames.org: The HGNC resources in 2015. Nucleic Acids Research 43. https://doi.org/10.1038/nature11327
HGNC Team, 2020 HCOP help. https://www.genenames.org/help/hcop/
Stamboulian M., R. F. Guerrero, M. W. Hahn, and P. Radivojac, 2020 The ortholog conjecture revisited: The value of orthologs and paralogs in function prediction. Bioinformatics 36: i219–i226. https://doi.org/10.1093/bioinformatics/btaa468
Wright M. W., T. A. Eyre, M. J. Lush, S. Povey, and E. A. Bruford, 2005 HCOP: The HGNC comparison of orthology predictions search tool. Mammalian Genome 16: 827–8. https://doi.org/10.1007/s00335-005-0103-2