1 Purpose of this analysis

This notebook demonstrates how you can use data from the HUGO Gene Nomenclature Committee (HGNC) database to perform ortholog mapping for microarray data obtained from refine.bio. In this notebook, we will use zebrafish data from refine.bio and annotate it with human gene symbols from HGNC.

⬇️ Jump to the analysis code ⬇️

2 How to run this example

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

2.1 Obtain the .Rmd file

To run this example yourself, download the .Rmd for this analysis by clicking this link. Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored. You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
  dir.create("data")
}

# Define the file path to the plots directory
plots_dir <- "plots"

# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
  dir.create(plots_dir)
}

# Define the file path to the results directory
results_dir <- "results"

# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
  dir.create(results_dir)
}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

2.3 Obtain the dataset from refine.bio

For general information about downloading data for these examples, see our ‘Getting Started’ section.

Go to this dataset’s page on refine.bio.

Click the “Download Now” button on the right side of this screen.

Fill out the pop up window with your email and our Terms and Conditions:

It may take a few minutes for the dataset to process. You will get an email when it is ready.

2.4 About the dataset we are using for this example

For this example analysis, we will use this CREB overexpression zebrafish dataset. Tregnago et al. (2016) used microarrays to measure gene expression of ten zebrafish samples, five overexpressing human CREB, as well as five control samples.

2.5 Place the dataset in your new data/ folder

refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.

For more details on the contents of this folder see these docs on refine.bio.

The <experiment_accession_id> folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235 or SRP12345.

Copy and paste the GSE71270 folder into your newly created data/ folder.

2.6 Check out our file structure!

Your new analysis folder should contain:

  • The example analysis .Rmd you downloaded
  • A folder called “data” which contains:
    • The GSE71270 folder which contains:
      • The gene expression
      • The metadata TSV
  • A folder for plots (currently empty)
  • A folder for results (currently empty)

Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using):

In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.

# Define the file path to the data directory
# Replace with the path of the folder the files will be in
data_dir <- file.path("data", "GSE71270")

# Declare the file path to the gene expression matrix file
# inside directory saved as `data_dir`
# Replace with the path to your dataset file
data_file <- file.path(data_dir, "GSE71270.tsv")

# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
# Replace with the path to your metadata file
metadata_file <- file.path(data_dir, "metadata_GSE71270.tsv")

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
## [1] TRUE
# Check if the metadata file is at the file path stored in `metadata_file`
file.exists(metadata_file)
## [1] TRUE

If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.

3 Using a different refine.bio dataset with this analysis?

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.


 

4 Ortholog Mapping - Microarray

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the software you will need, as well as more tips and resources.

Attach a package we need for this analysis.

# We will need this so we can use the pipe: %>%
library(magrittr)

4.2 Import data from HGNC

The HUGO Gene Nomenclature Committee (HGNC) assigns a unique and ideally meaningful name and symbol to every human gene. The HGNC database currently contains over 39,000 public records containing approved human gene nomenclature and associated gene information (Gray et al. 2015).

The HGNC Comparison of Orthology Predictions (HCOP) is a search tool that combines orthology predictions for a specified human gene, or set of human genes from a variety of sources, including Ensembl Compara, HGNC, and NCBI Gene Orthology (Wright et al. 2005). In general, an orthology prediction where most of the databases concur would be considered the reliable, and we will use this to prioritize mapping in cases where there is more than one possible ortholog for a gene. HCOP was originally designed to show orthology predictions between human and mouse, but has been expanded to include data from 18 genomes, including zebrafish, which we will use in this notebook (HGNC Team 2020).

We can download the human to zebrafish translation file we need for this example using the download.file() command. For this notebook, we want to download the file named human_zebrafish_hcop_fifteen_column.txt.gz.

First we’ll declare a sensible file path for this.

# Declare what we want the downloaded file to be called and its location
zebrafish_hgnc_file <- file.path(
  data_dir,
  # The name the file will have locally
  "human_zebrafish_hcop_fifteen_column.txt.gz"
)

Using the file path we just declared, we can use the destfile argument to download the file we need to this directory and use this file name.

We are downloading this orthology predictions file from the HGNC database. If you are looking for a different species, see the directory page of the HGNC Comparison of Orthology Predictions (HCOP) files and find the file name of the species you are looking for.

download.file(
  paste0(
    "http://ftp.ebi.ac.uk/pub/databases/genenames/hcop/",
    # Replace with the file name for the species conversion you want
    "human_zebrafish_hcop_fifteen_column.txt.gz"
  ),
  # The file will be saved to the name and location we defined earlier
  destfile = zebrafish_hgnc_file
)

If you are using a different dataset, in the last chunk you can replace zebrafish in human_zebrafish_hcop_fifteen_column.txt.gz with the name of the species you have data for (if you see it listed in the directory). Don’t forget to change the destination file as well to reflect what you download!

Ortholog species files with the ‘6 Column’ output returns the raw assertions, Ensembl gene IDs and Entrez Gene IDs for human and one other species, while the ‘15 Column’ output includes additional information such as the chromosomal location, accession numbers and the databases that support the assertions.

Now let’s double check that the zebrafish ortholog file is in the right place.

# Check if the organism orthology file file is in the `data` directory
file.exists(zebrafish_hgnc_file)
## [1] TRUE

Now we can read in the orthology file that we downloaded.

# Read in the data from HGNC
zebrafish <- readr::read_tsv(zebrafish_hgnc_file)
## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   human_entrez_gene = col_character(),
##   human_ensembl_gene = col_character(),
##   hgnc_id = col_character(),
##   human_name = col_character(),
##   human_symbol = col_character(),
##   human_chr = col_character(),
##   human_assert_ids = col_character(),
##   zebrafish_entrez_gene = col_character(),
##   zebrafish_ensembl_gene = col_character(),
##   zfin_id = col_character(),
##   zebrafish_name = col_character(),
##   zebrafish_symbol = col_character(),
##   zebrafish_chr = col_character(),
##   zebrafish_assert_ids = col_character(),
##   support = col_character()
## )

Let’s take a look at what zebrafish looks like.

zebrafish

We are going to manipulate some of the column names to make things easier when calling them downstream.

zebrafish <- zebrafish %>%
  set_names(names(.) %>%
    # Removing extra word in some of the column names
    gsub("_gene$", "", .))

4.3 Import and set up data

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the data TSV file and add it as an object to your environment.

We stored our file path for the dataset in an object named data_file in this previous step.

# Read in data TSV file
zebrafish_genes <- readr::read_tsv(data_file) %>%
  # We only want the gene IDs so let's keep only the `Gene` column
  dplyr::select("Gene")
## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   Gene = col_character(),
##   GSM1831675 = col_double(),
##   GSM1831676 = col_double(),
##   GSM1831677 = col_double(),
##   GSM1831678 = col_double(),
##   GSM1831679 = col_double(),
##   GSM1831680 = col_double(),
##   GSM1831681 = col_double(),
##   GSM1831682 = col_double(),
##   GSM1831683 = col_double(),
##   GSM1831684 = col_double()
## )

4.4 Mapping human gene symbols to zebrafish Ensembl gene IDs

refine.bio data uses Ensembl gene identifiers, which will be in the first column.

# Let's take a look at the first 6 rows of `zebrafish_genes`
head(zebrafish_genes)

Ensembl gene identifiers have different species-specific prefixes. In zebrafish, Ensembl gene identifiers begin with ENSDARG (in human, ENSG, etc.).

Now let’s do the mapping!

We’re interested in the human_symbol, zebrafish_ensembl, and support columns specifically. The support column contains a list of associated databases that support each assertion. This column may assist with addressing some of the multi-mappings that we will talk about later.

human_zebrafish_key <- zebrafish %>%
  # Reduce the zebrafish table to only the columns we're interested in
  dplyr::select(zebrafish_ensembl, human_symbol, support)

# Since we ignored the additional columns in `zebrafish`, let's check to see if
# we have any duplicates in our `human_zebrafish_key`
any(duplicated(human_zebrafish_key))
## [1] TRUE

We do have duplicates! Let’s remove those duplicates before moving forward, as they provide no extra information at this point.

human_zebrafish_key <- human_zebrafish_key %>%
  # Use the `distinct()` function to remove duplicates resulting from
  # dropping the additional columns in the `zebrafish` data frame
  dplyr::distinct()

Now let’s join the mapped data from human_zebrafish_key with the gene data in zebrafish_genes. We are using a “left join” here so that we get at least one row per zebrafish gene, even if there is no matching human symbol in the mapping table.

human_zebrafish_mapped_df <- zebrafish_genes %>%
  # Now we can join the mapped data
  dplyr::left_join(human_zebrafish_key, by = c("Gene" = "zebrafish_ensembl"))

Here’s what the new data frame looks like:

head(human_zebrafish_mapped_df, n = 10)

Looks like we have mapped symbols!

So now we have all the zebrafish genes mapped to human, but there might be places where there are multiple zebrafish genes that are orthologous to the same human gene, or vice versa.

Let’s get a summary of the human gene symbols returned in our mapped data frame, human_zebrafish_mapped_df.

# We can use the `count()` function to get a tally of how many
# `zebrafish_ensembl` IDs there are per `human_symbol`
human_zebrafish_mapped_df %>%
  # Remove the support column
  dplyr::select(Gene, human_symbol) %>%
  # Remove any remaining duplicates
  dplyr::distinct() %>%
  # Count the number of rows per human gene
  dplyr::count(human_symbol) %>%
  # Sort by highest `n` which will be the human gene symbol with the most
  # mapped zebrafish Ensembl IDs
  dplyr::arrange(desc(n))

There are certainly a good number of places where we mapped multiple zebrafish Ensembl IDs to the same human symbol! We’ll look at this in a bit.

We can also see that there 738 zebrafish Ensembl IDs that did not map to a human symbol. These are the ones with a value of NA. This is okay because we do not expect everything to map neatly across species.

4.5 Take a look at some multi-mappings

If a zebrafish Ensembl gene ID maps to multiple human symbols, the associated Ensembl ID values will get duplicated in our output data. Let’s look at the ENSDARG00000069142 example below.

human_zebrafish_mapped_df %>%
  dplyr::filter(Gene == "ENSDARG00000069142")

On the other hand, if you were to look at the original data associated to the zebrafish Ensembl IDs, when a human gene symbol maps to multiple zebrafish Ensembl IDs, the Ensembl IDs will not get duplicated, but you will have multiple rows associated with that human symbol. Let’s look at the MATR3 example below.

human_zebrafish_mapped_df %>%
  dplyr::filter(human_symbol == "MATR3")

We can see that we have multiple zebrafish Ensembl IDs that mapped to the same gene. (Notice that we also still have some duplicate zebrafish Ensembl ID/human symbol pairs here because the support column was different in the original data set! This is why we removed that column before counting above.)

4.6 Collapse zebrafish genes mapping to multiple human genes

Remember that if a zebrafish Ensembl gene ID maps to multiple human symbols, the values get duplicated. We can collapse the multi-mapped values into a list for each Ensembl ID as to not have duplicate values in our data frame.

In the next chunk, we show how we can collapse all the human gene symbols into one column where we store them all for each unique zebrafish Ensembl ID.

collapsed_human_symbol_df <- human_zebrafish_mapped_df %>%
  # Group by zebrafish Ensembl IDs
  dplyr::group_by(Gene) %>%
  # Collapse the mapped values in `human_zebrafish_mapped_df` to a
  # `all_human_symbols` column, removing any duplicated human symbols
  # note that we will lose the `support` column in this summarizing step
  dplyr::summarize(
    # combine unique symbols with semicolons between them
    all_human_symbols = paste(
      sort(unique(human_symbol)),
      collapse = ";"
    )
  )

head(collapsed_human_symbol_df)

4.6.1 Write results to file

Now let’s write our list of human gene symbols for each unique zebrafish Ensembl ID results to file.

readr::write_tsv(
  collapsed_human_symbol_df,
  file.path(
    results_dir,
    # Replace with a relevant output file name
    "GSE71270_zebrafish_ensembl_to_collapsed_human_gene_symbol.tsv"
  )
)

4.7 Collapse human genes mapping to multiple zebrafish genes

Since multiple zebrafish Ensembl gene IDs map to the same human symbol, we may want to identify which one of these mappings represents the “true” ortholog, i.e. which zebrafish gene is most similar to the human gene we are interested in. This is not at all straightforward! (see this paper for just one example) (Stamboulian et al. 2020). Gene duplications along the zebrafish lineage may result in complicated relationships among genes, especially with regard to divisions of function.

Simply combining expression values across zebrafish transcripts that correspond to the same human gene using an average or other summary statistic may result in the loss of a lot of data and will likely not be representative of the zebrafish biology. One thing we might do to make the problem somewhat simpler is to reduce the number of multi-mapped genes by requiring a certain level of support for each mapping from across the various databases included in HCOP. This will not fully solve the problem (and may not even be desirable in some cases), but we present it here as an example of an approach one might take.

To do this, we will use the support column to decide which mappings to retain. Let’s take a look at support.

head(human_zebrafish_mapped_df$support)
## [1] "Inparanoid,PhylomeDB,HomoloGene,Ensembl,NCBI,OMA,ZFIN,OrthoMCL,Panther,OrthoDB"     
## [2] "Inparanoid,HomoloGene,EggNOG,Ensembl,Treefam,NCBI,OMA,ZFIN,OrthoMCL,Panther,OrthoDB"
## [3] "HomoloGene,Ensembl,Treefam,ZFIN,Panther,OrthoDB"                                    
## [4] "OrthoDB"                                                                            
## [5] "Inparanoid,HomoloGene,EggNOG,Ensembl,NCBI,Treefam,ZFIN,Panther"                     
## [6] "OrthoMCL"

Looks like we have a variety of databases for multiple mappings, but we do have some instances of only one database reported in support of the mapping. As we noted earlier, an orthology prediction where more than one of the databases concur would be considered more reliable. Therefore, where we have multi-mapped zebrafish Ensembl gene IDs, we will retain the mappings with more than one database to support the assertion.

Before we do, let’s take a look how many multi-mapped genes there are in the data frame.

human_zebrafish_mapped_df %>%
  # Remove the `support` column
  dplyr::select(Gene, human_symbol) %>%
  # Remove any remaining duplicates
  dplyr::distinct() %>%
  # Count the number of rows in the dataframe for each symbol
  dplyr::count(human_symbol) %>%
  # Filter out the symbols without multimapped genes
  dplyr::filter(n > 1)

Looks like we have 4,169 human gene symbols with multiple mappings.

Now let’s filter out the less reliable mappings.

filtered_zebrafish_ensembl_df <- human_zebrafish_mapped_df %>%
  # Count the number of databases in the support column
  # by using the number of commas that separate the databases
  dplyr::mutate(n_databases = stringr::str_count(support, ",") + 1) %>%
  # Now filter to the rows where more than one database supports the mapping
  dplyr::filter(n_databases > 1)

head(filtered_zebrafish_ensembl_df)

Let’s count how many multi-mapped genes we have now.

filtered_zebrafish_ensembl_df %>%
  # Remove the support column
  dplyr::select(Gene, human_symbol) %>%
  # Remove any remaining duplicates
  dplyr::distinct() %>%
  # Count the number of rows in the dataframe for each symbol
  dplyr::count(human_symbol) %>%
  # Filter to the symbols with multimapped genes
  dplyr::filter(n > 1)

Now we only have 1,695 multi-mapped genes, compared to the 4,169 that we began with. Although we haven’t filtered down to zero multi-mapped genes, we have hopefully removed some of the less reliable mappings.

4.7.1 Write results to file

Now let’s write our filtered_zebrafish_ensembl_df object, with the reliable zebrafish Ensembl IDs for each unique human gene symbol, to file.

readr::write_tsv(
  filtered_zebrafish_ensembl_df,
  file.path(
    results_dir,
    # Replace with a relevant output file name
    "GSE71270_filtered_zebrafish_ensembl_to_human_gene_symbol.tsv"
  )
)

5 Resources for further learning

6 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

# Print session info
sessioninfo::session_info()
## ─ Session info ─────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.5 (2021-03-31)
##  os       Ubuntu 20.04.3 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Etc/UTC                     
##  date     2022-03-01                  
## 
## ─ Packages ─────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  assertthat    0.2.1   2019-03-21 [1] RSPM (R 4.0.3)
##  backports     1.2.1   2020-12-09 [1] RSPM (R 4.0.3)
##  bslib         0.2.5   2021-05-12 [1] RSPM (R 4.0.4)
##  cli           2.5.0   2021-04-26 [1] RSPM (R 4.0.4)
##  crayon        1.4.1   2021-02-08 [1] RSPM (R 4.0.3)
##  DBI           1.1.1   2021-01-15 [1] RSPM (R 4.0.3)
##  digest        0.6.27  2020-10-24 [1] RSPM (R 4.0.3)
##  dplyr         1.0.6   2021-05-05 [1] RSPM (R 4.0.4)
##  ellipsis      0.3.2   2021-04-29 [1] RSPM (R 4.0.4)
##  evaluate      0.14    2019-05-28 [1] RSPM (R 4.0.3)
##  fansi         0.4.2   2021-01-15 [1] RSPM (R 4.0.3)
##  generics      0.1.0   2020-10-31 [1] RSPM (R 4.0.3)
##  getopt        1.20.3  2019-03-22 [1] RSPM (R 4.0.0)
##  glue          1.4.2   2020-08-27 [1] RSPM (R 4.0.3)
##  hms           1.0.0   2021-01-13 [1] RSPM (R 4.0.3)
##  htmltools     0.5.1.1 2021-01-22 [1] RSPM (R 4.0.3)
##  jquerylib     0.1.4   2021-04-26 [1] RSPM (R 4.0.4)
##  jsonlite      1.7.2   2020-12-09 [1] RSPM (R 4.0.3)
##  knitr         1.33    2021-04-24 [1] RSPM (R 4.0.4)
##  lifecycle     1.0.0   2021-02-15 [1] RSPM (R 4.0.3)
##  magrittr    * 2.0.1   2020-11-17 [1] RSPM (R 4.0.3)
##  optparse    * 1.6.6   2020-04-16 [1] RSPM (R 4.0.0)
##  pillar        1.6.1   2021-05-16 [1] RSPM (R 4.0.4)
##  pkgconfig     2.0.3   2019-09-22 [1] RSPM (R 4.0.3)
##  ps            1.6.0   2021-02-28 [1] RSPM (R 4.0.3)
##  purrr         0.3.4   2020-04-17 [1] RSPM (R 4.0.3)
##  R.cache       0.15.0  2021-04-30 [1] RSPM (R 4.0.4)
##  R.methodsS3   1.8.1   2020-08-26 [1] RSPM (R 4.0.3)
##  R.oo          1.24.0  2020-08-26 [1] RSPM (R 4.0.3)
##  R.utils       2.10.1  2020-08-26 [1] RSPM (R 4.0.3)
##  R6            2.5.0   2020-10-28 [1] RSPM (R 4.0.3)
##  readr         1.4.0   2020-10-05 [1] RSPM (R 4.0.4)
##  rematch2      2.1.2   2020-05-01 [1] RSPM (R 4.0.3)
##  rlang         0.4.11  2021-04-30 [1] RSPM (R 4.0.4)
##  rmarkdown     2.8     2021-05-07 [1] RSPM (R 4.0.4)
##  rstudioapi    0.13    2020-11-12 [1] RSPM (R 4.0.3)
##  sass          0.4.0   2021-05-12 [1] RSPM (R 4.0.4)
##  sessioninfo   1.1.1   2018-11-05 [1] RSPM (R 4.0.3)
##  stringi       1.6.1   2021-05-10 [1] RSPM (R 4.0.4)
##  stringr       1.4.0   2019-02-10 [1] RSPM (R 4.0.3)
##  styler        1.4.1   2021-03-30 [1] RSPM (R 4.0.4)
##  tibble        3.1.2   2021-05-16 [1] RSPM (R 4.0.4)
##  tidyselect    1.1.1   2021-04-30 [1] RSPM (R 4.0.4)
##  utf8          1.2.1   2021-03-12 [1] RSPM (R 4.0.3)
##  vctrs         0.3.8   2021-04-29 [1] RSPM (R 4.0.4)
##  withr         2.4.2   2021-04-18 [1] RSPM (R 4.0.4)
##  xfun          0.23    2021-05-15 [1] RSPM (R 4.0.4)
##  yaml          2.2.1   2020-02-01 [1] RSPM (R 4.0.3)
## 
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library

References

Gray K. A., B. Yates, R. L. Seal, M. W. Wright, and E. A. Bruford, 2015 Genenames.org: The HGNC resources in 2015. Nucleic Acids Research 43. https://doi.org/10.1038/nature11327
HGNC Team, 2020 HCOP help. https://www.genenames.org/help/hcop/
Stamboulian M., R. F. Guerrero, M. W. Hahn, and P. Radivojac, 2020 The ortholog conjecture revisited: The value of orthologs and paralogs in function prediction. Bioinformatics 36: i219–i226. https://doi.org/10.1093/bioinformatics/btaa468
Tregnago C., E. Manara, M. Zampini, V. Bisio, C. Borga, et al., 2016 CREB engages C/EBPδ to initiate leukemogenesis. Leukemia 30: 1887–1896. https://doi.org/10.1038/leu.2016.98
Wright M. W., T. A. Eyre, M. J. Lush, S. Povey, and E. A. Bruford, 2005 HCOP: The HGNC comparison of orthology predictions search tool. Mammalian Genome 16: 827–8. https://doi.org/10.1007/s00335-005-0103-2