This notebook demonstrates how you can use data from the HUGO Gene Nomenclature Committee (HGNC) database to perform ortholog mapping for microarray data obtained from refine.bio. In this notebook, we will use zebrafish data from refine.bio and annotate it with human gene symbols from HGNC.
For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.
.Rmd
fileTo run this example yourself, download the .Rmd
for this analysis by clicking this link. Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd
file to where you would like this example and its files to be stored. You can open this .Rmd
file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd
files.)
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
dir.create("data")
}
# Define the file path to the plots directory
<- "plots"
plots_dir
# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
dir.create(plots_dir)
}
# Define the file path to the results directory
<- "results"
results_dir
# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
dir.create(results_dir)
}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
For general information about downloading data for these examples, see our ‘Getting Started’ section.
Go to this dataset’s page on refine.bio.
Click the “Download Now” button on the right side of this screen.
Fill out the pop up window with your email and our Terms and Conditions:
It may take a few minutes for the dataset to process. You will get an email when it is ready.
For this example analysis, we will use this CREB overexpression zebrafish dataset. Tregnago et al. (2016) used microarrays to measure gene expression of ten zebrafish samples, five overexpressing human CREB, as well as five control samples.
data/
folderrefine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip
. Double clicking should unzip this for you and create a folder of the same name.
For more details on the contents of this folder see these docs on refine.bio.
The <experiment_accession_id>
folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235
or SRP12345
.
Copy and paste the GSE71270
folder into your newly created data/
folder.
Your new analysis folder should contain:
.Rmd
you downloadedGSE71270
folder which contains:
plots
(currently empty)results
(currently empty)Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using):
In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.
First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.
# Define the file path to the data directory
# Replace with the path of the folder the files will be in
<- file.path("data", "GSE71270")
data_dir
# Declare the file path to the gene expression matrix file
# inside directory saved as `data_dir`
# Replace with the path to your dataset file
<- file.path(data_dir, "GSE71270.tsv")
data_file
# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
# Replace with the path to your metadata file
<- file.path(data_dir, "metadata_GSE71270.tsv") metadata_file
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
## [1] TRUE
# Check if the metadata file is at the file path stored in `metadata_file`
file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE
to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.
If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/
directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/
and results/
directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.
See our Getting Started page with instructions for package installation for a list of the software you will need, as well as more tips and resources.
Attach a package we need for this analysis.
# We will need this so we can use the pipe: %>%
library(magrittr)
The HUGO Gene Nomenclature Committee (HGNC) assigns a unique and ideally meaningful name and symbol to every human gene. The HGNC database currently contains over 39,000 public records containing approved human gene nomenclature and associated gene information (Gray et al. 2015).
The HGNC Comparison of Orthology Predictions (HCOP) is a search tool that combines orthology predictions for a specified human gene, or set of human genes from a variety of sources, including Ensembl Compara, HGNC, and NCBI Gene Orthology (Wright et al. 2005). In general, an orthology prediction where most of the databases concur would be considered the reliable, and we will use this to prioritize mapping in cases where there is more than one possible ortholog for a gene. HCOP was originally designed to show orthology predictions between human and mouse, but has been expanded to include data from 18 genomes, including zebrafish, which we will use in this notebook (HGNC Team 2020).
We can download the human to zebrafish translation file we need for this example using the download.file()
command. For this notebook, we want to download the file named human_zebrafish_hcop_fifteen_column.txt.gz
.
First we’ll declare a sensible file path for this.
# Declare what we want the downloaded file to be called and its location
<- file.path(
zebrafish_hgnc_file
data_dir,# The name the file will have locally
"human_zebrafish_hcop_fifteen_column.txt.gz"
)
Using the file path we just declared, we can use the destfile
argument to download the file we need to this directory and use this file name.
We are downloading this orthology predictions file from the HGNC database. If you are looking for a different species, see the directory page of the HGNC Comparison of Orthology Predictions (HCOP) files and find the file name of the species you are looking for.
download.file(
paste0(
"http://ftp.ebi.ac.uk/pub/databases/genenames/hcop/",
# Replace with the file name for the species conversion you want
"human_zebrafish_hcop_fifteen_column.txt.gz"
),# The file will be saved to the name and location we defined earlier
destfile = zebrafish_hgnc_file
)
If you are using a different dataset, in the last chunk you can replace zebrafish
in human_zebrafish_hcop_fifteen_column.txt.gz
with the name of the species you have data for (if you see it listed in the directory). Don’t forget to change the destination file as well to reflect what you download!
Ortholog species files with the ‘6 Column’ output returns the raw assertions, Ensembl gene IDs and Entrez Gene IDs for human and one other species, while the ‘15 Column’ output includes additional information such as the chromosomal location, accession numbers and the databases that support the assertions.
Now let’s double check that the zebrafish ortholog file is in the right place.
# Check if the organism orthology file file is in the `data` directory
file.exists(zebrafish_hgnc_file)
## [1] TRUE
Now we can read in the orthology file that we downloaded.
# Read in the data from HGNC
<- readr::read_tsv(zebrafish_hgnc_file) zebrafish
##
## ── Column specification ──────────────────────────────────────────────
## cols(
## human_entrez_gene = col_character(),
## human_ensembl_gene = col_character(),
## hgnc_id = col_character(),
## human_name = col_character(),
## human_symbol = col_character(),
## human_chr = col_character(),
## human_assert_ids = col_character(),
## zebrafish_entrez_gene = col_character(),
## zebrafish_ensembl_gene = col_character(),
## zfin_id = col_character(),
## zebrafish_name = col_character(),
## zebrafish_symbol = col_character(),
## zebrafish_chr = col_character(),
## zebrafish_assert_ids = col_character(),
## support = col_character()
## )
Let’s take a look at what zebrafish
looks like.
zebrafish
We are going to manipulate some of the column names to make things easier when calling them downstream.
<- zebrafish %>%
zebrafish set_names(names(.) %>%
# Removing extra word in some of the column names
gsub("_gene$", "", .))
Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the data TSV file and add it as an object to your environment.
We stored our file path for the dataset in an object named data_file
in this previous step.
# Read in data TSV file
<- readr::read_tsv(data_file) %>%
zebrafish_genes # We only want the gene IDs so let's keep only the `Gene` column
::select("Gene") dplyr
##
## ── Column specification ──────────────────────────────────────────────
## cols(
## Gene = col_character(),
## GSM1831675 = col_double(),
## GSM1831676 = col_double(),
## GSM1831677 = col_double(),
## GSM1831678 = col_double(),
## GSM1831679 = col_double(),
## GSM1831680 = col_double(),
## GSM1831681 = col_double(),
## GSM1831682 = col_double(),
## GSM1831683 = col_double(),
## GSM1831684 = col_double()
## )
refine.bio data uses Ensembl gene identifiers, which will be in the first column.
# Let's take a look at the first 6 rows of `zebrafish_genes`
head(zebrafish_genes)
Ensembl gene identifiers have different species-specific prefixes. In zebrafish, Ensembl gene identifiers begin with ENSDARG
(in human, ENSG
, etc.).
Now let’s do the mapping!
We’re interested in the human_symbol
, zebrafish_ensembl
, and support
columns specifically. The support
column contains a list of associated databases that support each assertion. This column may assist with addressing some of the multi-mappings that we will talk about later.
<- zebrafish %>%
human_zebrafish_key # Reduce the zebrafish table to only the columns we're interested in
::select(zebrafish_ensembl, human_symbol, support)
dplyr
# Since we ignored the additional columns in `zebrafish`, let's check to see if
# we have any duplicates in our `human_zebrafish_key`
any(duplicated(human_zebrafish_key))
## [1] TRUE
We do have duplicates! Let’s remove those duplicates before moving forward, as they provide no extra information at this point.
<- human_zebrafish_key %>%
human_zebrafish_key # Use the `distinct()` function to remove duplicates resulting from
# dropping the additional columns in the `zebrafish` data frame
::distinct() dplyr
Now let’s join the mapped data from human_zebrafish_key
with the gene data in zebrafish_genes
. We are using a “left join” here so that we get at least one row per zebrafish gene, even if there is no matching human symbol in the mapping table.
<- zebrafish_genes %>%
human_zebrafish_mapped_df # Now we can join the mapped data
::left_join(human_zebrafish_key, by = c("Gene" = "zebrafish_ensembl")) dplyr
Here’s what the new data frame looks like:
head(human_zebrafish_mapped_df, n = 10)
Looks like we have mapped symbols!
So now we have all the zebrafish genes mapped to human, but there might be places where there are multiple zebrafish genes that are orthologous to the same human gene, or vice versa.
Let’s get a summary of the human gene symbols returned in our mapped data frame, human_zebrafish_mapped_df
.
# We can use the `count()` function to get a tally of how many
# `zebrafish_ensembl` IDs there are per `human_symbol`
%>%
human_zebrafish_mapped_df # Remove the support column
::select(Gene, human_symbol) %>%
dplyr# Remove any remaining duplicates
::distinct() %>%
dplyr# Count the number of rows per human gene
::count(human_symbol) %>%
dplyr# Sort by highest `n` which will be the human gene symbol with the most
# mapped zebrafish Ensembl IDs
::arrange(desc(n)) dplyr
There are certainly a good number of places where we mapped multiple zebrafish Ensembl IDs to the same human symbol! We’ll look at this in a bit.
We can also see that there 738 zebrafish Ensembl IDs that did not map to a human symbol. These are the ones with a value of NA. This is okay because we do not expect everything to map neatly across species.
If a zebrafish Ensembl gene ID maps to multiple human symbols, the associated Ensembl ID values will get duplicated in our output data. Let’s look at the ENSDARG00000069142
example below.
%>%
human_zebrafish_mapped_df ::filter(Gene == "ENSDARG00000069142") dplyr
On the other hand, if you were to look at the original data associated to the zebrafish Ensembl IDs, when a human gene symbol maps to multiple zebrafish Ensembl IDs, the Ensembl IDs will not get duplicated, but you will have multiple rows associated with that human symbol. Let’s look at the MATR3
example below.
%>%
human_zebrafish_mapped_df ::filter(human_symbol == "MATR3") dplyr
We can see that we have multiple zebrafish Ensembl IDs that mapped to the same gene. (Notice that we also still have some duplicate zebrafish Ensembl ID/human symbol pairs here because the support
column was different in the original data set! This is why we removed that column before counting above.)
Remember that if a zebrafish Ensembl gene ID maps to multiple human symbols, the values get duplicated. We can collapse the multi-mapped values into a list for each Ensembl ID as to not have duplicate values in our data frame.
In the next chunk, we show how we can collapse all the human gene symbols into one column where we store them all for each unique zebrafish Ensembl ID.
<- human_zebrafish_mapped_df %>%
collapsed_human_symbol_df # Group by zebrafish Ensembl IDs
::group_by(Gene) %>%
dplyr# Collapse the mapped values in `human_zebrafish_mapped_df` to a
# `all_human_symbols` column, removing any duplicated human symbols
# note that we will lose the `support` column in this summarizing step
::summarize(
dplyr# combine unique symbols with semicolons between them
all_human_symbols = paste(
sort(unique(human_symbol)),
collapse = ";"
)
)
head(collapsed_human_symbol_df)
Now let’s write our list of human gene symbols for each unique zebrafish Ensembl ID results to file.
::write_tsv(
readr
collapsed_human_symbol_df,file.path(
results_dir,# Replace with a relevant output file name
"GSE71270_zebrafish_ensembl_to_collapsed_human_gene_symbol.tsv"
) )
Since multiple zebrafish Ensembl gene IDs map to the same human symbol, we may want to identify which one of these mappings represents the “true” ortholog, i.e. which zebrafish gene is most similar to the human gene we are interested in. This is not at all straightforward! (see this paper for just one example) (Stamboulian et al. 2020). Gene duplications along the zebrafish lineage may result in complicated relationships among genes, especially with regard to divisions of function.
Simply combining expression values across zebrafish transcripts that correspond to the same human gene using an average or other summary statistic may result in the loss of a lot of data and will likely not be representative of the zebrafish biology. One thing we might do to make the problem somewhat simpler is to reduce the number of multi-mapped genes by requiring a certain level of support for each mapping from across the various databases included in HCOP
. This will not fully solve the problem (and may not even be desirable in some cases), but we present it here as an example of an approach one might take.
To do this, we will use the support
column to decide which mappings to retain. Let’s take a look at support
.
head(human_zebrafish_mapped_df$support)
## [1] "Inparanoid,PhylomeDB,HomoloGene,Ensembl,NCBI,OMA,ZFIN,OrthoMCL,Panther,OrthoDB"
## [2] "Inparanoid,HomoloGene,EggNOG,Ensembl,Treefam,NCBI,OMA,ZFIN,OrthoMCL,Panther,OrthoDB"
## [3] "HomoloGene,Ensembl,Treefam,ZFIN,Panther,OrthoDB"
## [4] "OrthoDB"
## [5] "Inparanoid,HomoloGene,EggNOG,Ensembl,NCBI,Treefam,ZFIN,Panther"
## [6] "OrthoMCL"
Looks like we have a variety of databases for multiple mappings, but we do have some instances of only one database reported in support of the mapping. As we noted earlier, an orthology prediction where more than one of the databases concur would be considered more reliable. Therefore, where we have multi-mapped zebrafish Ensembl gene IDs, we will retain the mappings with more than one database to support the assertion.
Before we do, let’s take a look how many multi-mapped genes there are in the data frame.
%>%
human_zebrafish_mapped_df # Remove the `support` column
::select(Gene, human_symbol) %>%
dplyr# Remove any remaining duplicates
::distinct() %>%
dplyr# Count the number of rows in the dataframe for each symbol
::count(human_symbol) %>%
dplyr# Filter out the symbols without multimapped genes
::filter(n > 1) dplyr
Looks like we have 4,169 human gene symbols with multiple mappings.
Now let’s filter out the less reliable mappings.
<- human_zebrafish_mapped_df %>%
filtered_zebrafish_ensembl_df # Count the number of databases in the support column
# by using the number of commas that separate the databases
::mutate(n_databases = stringr::str_count(support, ",") + 1) %>%
dplyr# Now filter to the rows where more than one database supports the mapping
::filter(n_databases > 1)
dplyr
head(filtered_zebrafish_ensembl_df)
Let’s count how many multi-mapped genes we have now.
%>%
filtered_zebrafish_ensembl_df # Remove the support column
::select(Gene, human_symbol) %>%
dplyr# Remove any remaining duplicates
::distinct() %>%
dplyr# Count the number of rows in the dataframe for each symbol
::count(human_symbol) %>%
dplyr# Filter to the symbols with multimapped genes
::filter(n > 1) dplyr
Now we only have 1,695 multi-mapped genes, compared to the 4,169 that we began with. Although we haven’t filtered down to zero multi-mapped genes, we have hopefully removed some of the less reliable mappings.
Now let’s write our filtered_zebrafish_ensembl_df
object, with the reliable zebrafish Ensembl IDs for each unique human gene symbol, to file.
::write_tsv(
readr
filtered_zebrafish_ensembl_df,file.path(
results_dir,# Replace with a relevant output file name
"GSE71270_filtered_zebrafish_ensembl_to_human_gene_symbol.tsv"
) )
At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
# Print session info
::session_info() sessioninfo
## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.5 (2021-03-31)
## os Ubuntu 20.04.3 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2022-03-01
##
## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.3)
## backports 1.2.1 2020-12-09 [1] RSPM (R 4.0.3)
## bslib 0.2.5 2021-05-12 [1] RSPM (R 4.0.4)
## cli 2.5.0 2021-04-26 [1] RSPM (R 4.0.4)
## crayon 1.4.1 2021-02-08 [1] RSPM (R 4.0.3)
## DBI 1.1.1 2021-01-15 [1] RSPM (R 4.0.3)
## digest 0.6.27 2020-10-24 [1] RSPM (R 4.0.3)
## dplyr 1.0.6 2021-05-05 [1] RSPM (R 4.0.4)
## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.0.4)
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.3)
## fansi 0.4.2 2021-01-15 [1] RSPM (R 4.0.3)
## generics 0.1.0 2020-10-31 [1] RSPM (R 4.0.3)
## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.3)
## hms 1.0.0 2021-01-13 [1] RSPM (R 4.0.3)
## htmltools 0.5.1.1 2021-01-22 [1] RSPM (R 4.0.3)
## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.0.4)
## jsonlite 1.7.2 2020-12-09 [1] RSPM (R 4.0.3)
## knitr 1.33 2021-04-24 [1] RSPM (R 4.0.4)
## lifecycle 1.0.0 2021-02-15 [1] RSPM (R 4.0.3)
## magrittr * 2.0.1 2020-11-17 [1] RSPM (R 4.0.3)
## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
## pillar 1.6.1 2021-05-16 [1] RSPM (R 4.0.4)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.3)
## ps 1.6.0 2021-02-28 [1] RSPM (R 4.0.3)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.3)
## R.cache 0.15.0 2021-04-30 [1] RSPM (R 4.0.4)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.3)
## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.3)
## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.3)
## R6 2.5.0 2020-10-28 [1] RSPM (R 4.0.3)
## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.4)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.3)
## rlang 0.4.11 2021-04-30 [1] RSPM (R 4.0.4)
## rmarkdown 2.8 2021-05-07 [1] RSPM (R 4.0.4)
## rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.0.3)
## sass 0.4.0 2021-05-12 [1] RSPM (R 4.0.4)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.3)
## stringi 1.6.1 2021-05-10 [1] RSPM (R 4.0.4)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.3)
## styler 1.4.1 2021-03-30 [1] RSPM (R 4.0.4)
## tibble 3.1.2 2021-05-16 [1] RSPM (R 4.0.4)
## tidyselect 1.1.1 2021-04-30 [1] RSPM (R 4.0.4)
## utf8 1.2.1 2021-03-12 [1] RSPM (R 4.0.3)
## vctrs 0.3.8 2021-04-29 [1] RSPM (R 4.0.4)
## withr 2.4.2 2021-04-18 [1] RSPM (R 4.0.4)
## xfun 0.23 2021-05-15 [1] RSPM (R 4.0.4)
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.3)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library