Purpose: This notebook is an example annotation workflow for data obtained from refine.bio using Bioconductor annotation packages.

Ensembl IDs can be used to obtain a variety of annotations at the gene/transcript level. Although this example script uses Ensembl IDs from zebrafish (Danio rerio) to obtain gene symbols, it can easily be adapted for other species or annotation types, e.g., protein IDs, gene ontology, or accession numbers.

For a different species, replace org.Dr.eg.db or Dr, wherever it appears, with the corresponding species abbreviation; e.g., for Homo sapiens, org.Hs.eg.db or Hs would be used. A full list of annotation R packages is available on the Bioconductor website.
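As a rough sketch of the species swap described above, the snippet below repeats the install, attach, and mapping steps for human data, using the same mapIds() call that appears later in this notebook. It assumes BiocManager is already installed. The single key, ENSG00000141510 (the Ensembl gene ID for TP53), and the variable names are illustrative only and are not part of the refine.bio dataset.

```r
# Install the human annotation package if it is not already available
# (assumes BiocManager itself is installed)
if (!requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  BiocManager::install("org.Hs.eg.db", update = FALSE)
}

# Attach the human annotation package in place of org.Dr.eg.db
library(org.Hs.eg.db)

# Illustrative key only: ENSG00000141510 is the Ensembl gene ID for TP53
human_ids <- c("ENSG00000141510")

# Same kind of mapIds() call as for zebrafish, with org.Hs.eg.db swapped in
human_symbols <- mapIds(org.Hs.eg.db, keys = human_ids,
                        column = "SYMBOL", keytype = "ENSEMBL")
human_symbols
```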

For different types of annotation, annotation objects other than org.Dr.egSYMBOL can be used for mapping. For a full list of the annotation objects in a package, attach it and run ls("package:org.Dr.eg.db") (note the single colon); columns(org.Dr.eg.db) lists the values accepted by the column argument of mapIds(). In this example, wherever SYMBOL is written, replace it with the name of the desired annotation, such as ACCNUM for accession numbers.

1) Install libraries with species-specific annotation package

# Install the Zebrafish package
if (!("org.Dr.eg.db" %in% installed.packages())) {
  # Install org.Dr.eg.db
  BiocManager::install("org.Dr.eg.db", update = FALSE)
}

Attach the org.Dr.eg.db library:

# Attach the annotation library 
library(org.Dr.eg.db)
Loading required package: AnnotationDbi
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap,
    parApply, parCapply, parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colMeans, colnames, colSums,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect,
    is.unsorted, lapply, lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
    pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames, rowSums, sapply, setdiff,
    sort, table, tapply, union, unique, unsplit, which, which.max, which.min

Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with 'browseVignettes()'. To cite
    Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: IRanges
Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

    expand.grid
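
With the package attached, the annotation types mentioned in the introduction can be listed directly. The following is a minimal sketch using the AnnotationDbi accessors columns() and keytypes(); the variable names are illustrative only.

```r
# Attach the annotation package (this also attaches AnnotationDbi)
library(org.Dr.eg.db)

# Annotation types that can be requested via the `column` argument of mapIds()
available_columns <- columns(org.Dr.eg.db)
available_columns

# Identifier types that can be supplied via the `keytype` argument
available_keytypes <- keytypes(org.Dr.eg.db)
available_keytypes

# Older-style annotation objects (e.g. org.Dr.egSYMBOL) exported by the package
head(ls("package:org.Dr.eg.db"))
```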

Make an output directory if it doesn’t exist.

# Make a results directory if it isn't created yet
if (!dir.exists("results")) {
  dir.create("results")
}

2) Import refine.bio data that has Ensembl identifiers

# Read in the data; Ensembl IDs are in the first column, which is named `Gene`
# (readr returns a tibble, so the IDs stay in a column rather than row names)
df <- readr::read_tsv(file.path("data", "SRP056136.tsv"))
Parsed with column specification:
cols(
  Gene = col_character(),
  SRR1914395 = col_double(),
  SRR1914394 = col_double(),
  SRR1914393 = col_double(),
  SRR1914392 = col_double()
)

3) Map Ensembl IDs to gene symbols

Note: This code chunk will return a message indicating that there are 1:many mappings between our keys and columns. This is expected, since Ensembl gene IDs and gene symbols do not have a 1:1 relationship. By default, mapIds() keeps only the first match for each key.

# Map Ensembl IDs to their associated gene symbols
annot.df <- data.frame("Symbols" = mapIds(org.Dr.eg.db, keys = df$Gene, 
                                         column = "SYMBOL", keytype = "ENSEMBL"),
                       df)
'select()' returned 1:many mapping between keys and columns
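
To see how common the 1:many case is, one option is to ask mapIds() for every match rather than only the first. The sketch below draws a small set of keys from the annotation package itself (rather than from df) so that it runs on its own; the variable names and the choice of 100 keys are arbitrary.

```r
# Attach the annotation package (this also attaches AnnotationDbi)
library(org.Dr.eg.db)

# Take a small, arbitrary set of Ensembl gene IDs known to the package
ensembl_ids <- head(keys(org.Dr.eg.db, keytype = "ENSEMBL"), 100)

# multiVals = "list" returns all matching symbols for each key
symbol_list <- mapIds(org.Dr.eg.db, keys = ensembl_ids,
                      column = "SYMBOL", keytype = "ENSEMBL",
                      multiVals = "list")

# Number of symbols returned per Ensembl ID
n_symbols <- lengths(symbol_list)

# How many IDs map to more than one symbol
sum(n_symbols > 1)

# Show the symbols for any such IDs
symbol_list[n_symbols > 1]
```

If ambiguous IDs should not silently take the first symbol, multiVals = "asNA" returns NA for any key with multiple matches instead.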

4) Write to file

# Write annotated data frame to output file
readr::write_tsv(annot.df, file.path("results", "SRP056136wGeneSymbols.tsv"))

Print session info:

# Print session info 
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin17.6.0 (64-bit)
Running under: macOS  10.14.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] org.Dr.eg.db_3.6.0   AnnotationDbi_1.42.1 IRanges_2.14.12      S4Vectors_0.18.3    
[5] Biobase_2.40.0       BiocGenerics_0.26.0 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.19    rstudioapi_0.10 knitr_1.22      hms_0.4.2       bit_1.1-14      R6_2.4.0       
 [7] rlang_0.3.4     blob_1.1.1      tools_3.5.1     xfun_0.6        DBI_1.0.0       htmltools_0.3.6
[13] yaml_2.2.0      bit64_0.9-7     digest_0.6.18   tibble_2.1.1    crayon_1.3.4    readr_1.3.1    
[19] base64enc_0.1-3 memoise_1.1.0   evaluate_0.13   RSQLite_2.1.1   rmarkdown_1.12  pillar_1.3.1   
[25] compiler_3.5.1  jsonlite_1.6    pkgconfig_2.0.2