tximeta
This notebook will demonstrate how to:
tximeta
SummarizedExperiment
objectIn this notebook, we’ll import the transcript expression
quantification output from salmon quant
using the tximeta
package. tximeta
is in part a wrapper around another
package, tximport
,
which imports transcript expression data and summarizes it to the gene
level. Working at the gene rather than transcript level has a number of
potential advantages for interpretability, efficiency, and reduction of
false positives (Soneson et
al. 2016). tximeta
eases some of the burden of
import by automatically identifying the correct set of annotation data
to append to many data sets (Love
et al. 2020).
For more information about tximeta
, see this
excellent vignette from Love et al.
# Load the tximeta package
library(tximeta)
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
# Load the SummarizedExperiment package
library(SummarizedExperiment)
Loading required package: MatrixGenerics
Loading required package: matrixStats
Attaching package: 'MatrixGenerics'
The following objects are masked from 'package:matrixStats':
colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
colWeightedMeans, colWeightedMedians, colWeightedSds,
colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
rowWeightedSds, rowWeightedVars
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':
findMatches
The following objects are masked from 'package:base':
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':
rowMedians
The following objects are masked from 'package:matrixStats':
anyMissing, rowMedians
# directory where the data are located
data_dir <- file.path("data", "gastric-cancer")
# directory where the quant files are located, each sample is its own
# directory
quant_dir <- file.path(data_dir, "salmon_quant")
# create a directory to hold the tximeta results if it doesn't exist yet
txi_dir <- file.path(data_dir, "txi")
fs::dir_create(txi_dir)
We’ll need the quant.sf
files for all the samples in an
experiment which we have stored in quant_dir
.
# the quant files themselves
sf_files <- list.files(
quant_dir,
recursive = TRUE,
full.names = TRUE,
pattern = "quant.sf"
)
# sample metadata file
meta_file <- file.path(data_dir, "gastric-cancer_metadata.tsv")
Output
# Name the output gastric-cancer_tximeta.rds and use the directory created
# above as the rest of the path
txi_out_file <- file.path(txi_dir, "gastric-cancer_tximeta.rds")
All output files from salmon quant
we’ll use with
tximeta
are named quant.sf
. Unfortunately,
this means that the file names themselves do not have any information
about the sample they come from!
# Let's look at the full path for the quant.sf files
sf_files
[1] "data/gastric-cancer/salmon_quant/SRR585570/quant.sf"
[2] "data/gastric-cancer/salmon_quant/SRR585571/quant.sf"
[3] "data/gastric-cancer/salmon_quant/SRR585572/quant.sf"
[4] "data/gastric-cancer/salmon_quant/SRR585573/quant.sf"
[5] "data/gastric-cancer/salmon_quant/SRR585574/quant.sf"
[6] "data/gastric-cancer/salmon_quant/SRR585575/quant.sf"
[7] "data/gastric-cancer/salmon_quant/SRR585576/quant.sf"
[8] "data/gastric-cancer/salmon_quant/SRR585577/quant.sf"
data/gastric-cancer/salmon_quant/SRR585570/quant.sf
data/gastric-cancer/salmon_quant/SRR585571/quant.sf
data/gastric-cancer/salmon_quant/SRR585572/quant.sf
data/gastric-cancer/salmon_quant/SRR585573/quant.sf
data/gastric-cancer/salmon_quant/SRR585574/quant.sf
data/gastric-cancer/salmon_quant/SRR585575/quant.sf
data/gastric-cancer/salmon_quant/SRR585576/quant.sf
data/gastric-cancer/salmon_quant/SRR585577/quant.sf
Let’s extract the sample names from the file
paths using the stringr
package.
Notice how the file path is separated by /
. If we were
to split up this character string by /
, the second to last
item is the sample names (because we used them as directory names for
the salmon
output). This is exactly what
stringr::word()
allows us to do: split up the file paths by
/
and extract the sample names.
sample_names <- stringr::word(sf_files, -2, sep = "/")
sample_names
[1] "SRR585570" "SRR585571" "SRR585572" "SRR585573" "SRR585574" "SRR585575"
[7] "SRR585576" "SRR585577"
SRR585570
SRR585571
SRR585572
SRR585573
SRR585574
SRR585575
SRR585576
SRR585577
tximeta
needs a data frame with at least these two
columns: - a files
column with the file paths to the
quant.sf files - a names
column with the sample names
coldata <- data.frame(
files = sf_files,
names = sample_names
)
We have more information about these samples stored in the metadata
file that we will also want stored in coldata
. Let’s read
in the sample metadata from the TSV file.
# Read in the sample metadata TSV file and have a look
sample_meta_df <- readr::read_tsv(meta_file)
Rows: 8 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): tissue, srr_accession, title
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sample_meta_df
We’ll want this information to be added to the coldata
,
which we can do by using a join function to match up the rows between
the two data frames and combine them.
coldata <- coldata |>
dplyr::inner_join(sample_meta_df, by = c("names" = "srr_accession"))
coldata
tximeta
Using the coldata
data frame that we set up, we can now
run the tximeta()
to import our expression data while
automatically finding and associating the transcript annotations that
were used when we performed the quantification.
The first time you run tximeta()
you may get a message
about storing downloaded transcriptome data in a cache directory so that
it can retrieve the data more quickly the next time. We recommend you
use the cache, and accept the default location.
txi_data <- tximeta(coldata)
importing quantifications
reading in files with read_tsv
1 2 3 4 5 6 7 8
found matching transcriptome:
[ Ensembl - Homo sapiens - release 95 ]
useHub=TRUE: checking for EnsDb via 'AnnotationHub'
found matching EnsDb via 'AnnotationHub'
downloading 1 resources
retrieving 1 resource
loading from cache
require("ensembldb")
generating transcript ranges
*tximeta currently works easily for most human and mouse datasets, but requires a few more steps for other species.
We’ll summarize to the gene level using the
summarizeToGene()
function.
# Summarize to the gene level
gene_summarized <- summarizeToGene(txi_data)
loading existing EnsDb created: 2024-08-08 16:10:27
obtaining transcript-to-gene mapping from database
generating gene ranges
gene ranges assigned by total range of isoforms, see `assignRanges`
summarizing abundance
summarizing counts
summarizing length
We can use the class
function to see what type of object
gene_summarized
is.
# Check what type of object `gene_summarized` is
class(gene_summarized)
[1] "RangedSummarizedExperiment"
attr(,"package")
[1] "SummarizedExperiment"
RangedSummarizedExperiment
This tells us that gene_summarized
is an object called a
SummarizedExperiment
which can be handled by functions from the package of the same name. We
more specifically have a RangedSummarizedExperiment
which
is a more specific type of SummarizedExperiment
.
SummarizedExperiment
objects have this general
structure:
This figure is from this handy vignette about SummarizedExperiment
objects.
As shown in the diagram, we can use some of the functions provided by
the SummarizedExperiment
package to extract data from our
gene_summarized
object. For example, calling
rowData()
on our object shows all the gene information that
tximeta
set up!
# rowData() shows us our gene annotation
rowData(gene_summarized)
DataFrame with 37788 rows and 9 columns
gene_id gene_name gene_biotype
<character> <character> <character>
ENSG00000000003 ENSG00000000003 TSPAN6 protein_coding
ENSG00000000005 ENSG00000000005 TNMD protein_coding
ENSG00000000419 ENSG00000000419 DPM1 protein_coding
ENSG00000000457 ENSG00000000457 SCYL3 protein_coding
ENSG00000000460 ENSG00000000460 C1orf112 protein_coding
... ... ... ...
ENSG00000285978 ENSG00000285978 AC113348.2 protein_coding
ENSG00000285982 ENSG00000285982 AC012213.5 protein_coding
ENSG00000285986 ENSG00000285986 BX248415.1 unprocessed_pseudogene
ENSG00000285990 ENSG00000285990 AL589743.7 transcribed_unproces..
seq_coord_system description gene_id_version
<character> <character> <character>
ENSG00000000003 chromosome tetraspanin 6 [Sourc.. ENSG00000000003.14
ENSG00000000005 chromosome tenomodulin [Source:.. ENSG00000000005.5
ENSG00000000419 chromosome dolichyl-phosphate m.. ENSG00000000419.12
ENSG00000000457 chromosome SCY1 like pseudokina.. ENSG00000000457.13
ENSG00000000460 chromosome chromosome 1 open re.. ENSG00000000460.16
... ... ... ...
ENSG00000285978 chromosome novel transcript ENSG00000285978.1
ENSG00000285982 chromosome novel protein ENSG00000285982.1
ENSG00000285986 chromosome complement factor H .. ENSG00000285986.1
ENSG00000285990 chromosome neurobeachin (NBEA) .. ENSG00000285990.1
symbol entrezid
<character> <list>
ENSG00000000003 TSPAN6 7105
ENSG00000000005 TNMD 64102
ENSG00000000419 DPM1 8813
ENSG00000000457 SCYL3 57147
ENSG00000000460 C1orf112 55732
... ... ...
ENSG00000285978 AC113348.2 NA
ENSG00000285982 AC012213.5 NA
ENSG00000285986 BX248415.1 NA
ENSG00000285990 AL589743.7 NA
tx_ids
<CharacterList>
ENSG00000000003 ENST00000373020,ENST00000494424,ENST00000496771,...
ENSG00000000005 ENST00000373031,ENST00000485971
ENSG00000000419 ENST00000371582,ENST00000371584,ENST00000371588,...
ENSG00000000457 ENST00000367770,ENST00000367771,ENST00000367772,...
ENSG00000000460 ENST00000286031,ENST00000359326,ENST00000413811,...
... ...
ENSG00000285978 ENST00000638723
ENSG00000285982 ENST00000649416
ENSG00000285986 ENST00000649395
ENSG00000285990 ENST00000649331
[ reached getOption("max.print") -- omitted 1 row ]
The assay
slot in SummarizedExperiment
s
holds data from the experiment. In this case, it will include our
gene-level expression information stored as a gene x sample matrix.
Multiple assays
can be stored in an
SummarizedExperiment
and we can use the
assayNames()
function to see what assays are included in
gene_summarized
.
assayNames(gene_summarized)
[1] "counts" "abundance" "length"
counts
abundance
length
If we want to extract an assay
’s data, we can use
assay()
function and specify the name of the assay we want
to extract.
counts_mat <- assay(gene_summarized, "counts")
We can use the class
function to see what type of object
the assay()
function returns.
# Check what type of object `counts_mat` is
class(counts_mat)
[1] "matrix" "array"
matrix
array
Alternatively, we could extract the TPM data – called
abundance
from gene_summarized
.
# Let's look at the first few rows of the gene-level TPM
head(assay(gene_summarized, "abundance"))
SRR585570 SRR585571 SRR585572 SRR585573 SRR585574 SRR585575
ENSG00000000003 25.014898 18.896831 12.288181 26.911045 22.088410 17.168736
ENSG00000000005 0.121627 0.000000 0.000000 0.000000 0.000000 0.000000
ENSG00000000419 26.667591 20.771196 103.246348 69.495297 66.335181 77.471536
ENSG00000000457 5.653614 2.921236 6.511128 5.107480 5.106009 3.845323
ENSG00000000460 1.757371 2.933740 1.354462 2.195826 6.341131 16.792151
ENSG00000000938 1.693652 2.807437 0.078625 0.000000 0.028502 0.000000
SRR585576 SRR585577
ENSG00000000003 17.974009 29.513266
ENSG00000000005 0.000000 0.000000
ENSG00000000419 44.036543 35.660794
ENSG00000000457 4.452530 3.346821
ENSG00000000460 5.089307 7.927036
ENSG00000000938 0.000000 0.000000
We could use readr::write_tsv
to save
counts_mat
only but gene_summarized
has a lot
of information stored here beyond the counts, so we may want to save all
of this to a RDS object.
# Write `gene_summarized` to RDS object
readr::write_rds(gene_summarized, file = txi_out_file)
We’ll import this with the DESeq2
package in the next
notebook.
Record session info for reproducibility & provenance purposes.
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] ensembldb_2.28.0 AnnotationFilter_1.28.0
[3] GenomicFeatures_1.56.0 AnnotationDbi_1.66.0
[5] SummarizedExperiment_1.34.0 Biobase_2.64.0
[7] GenomicRanges_1.56.0 GenomeInfoDb_1.40.0
[9] IRanges_2.38.0 S4Vectors_0.42.0
[11] BiocGenerics_0.50.0 MatrixGenerics_1.16.0
[13] matrixStats_1.3.0 tximeta_1.22.0
[15] optparse_1.7.5
loaded via a namespace (and not attached):
[1] DBI_1.2.2 bitops_1.0-7 httr2_1.0.1
[4] biomaRt_2.60.0 rlang_1.1.3 magrittr_2.0.3
[7] compiler_4.4.1 RSQLite_2.3.6 png_0.1-8
[10] vctrs_0.6.5 txdbmaker_1.0.0 stringr_1.5.1
[13] ProtGenerics_1.36.0 pkgconfig_2.0.3 crayon_1.5.2
[16] fastmap_1.1.1 dbplyr_2.5.0 XVector_0.44.0
[19] utf8_1.2.4 Rsamtools_2.20.0 rmarkdown_2.26
[22] tzdb_0.4.0 UCSC.utils_1.0.0 purrr_1.0.2
[25] bit_4.0.5 xfun_0.43 zlibbioc_1.50.0
[28] cachem_1.0.8 jsonlite_1.8.8 progress_1.2.3
[31] blob_1.2.4 DelayedArray_0.30.0 BiocParallel_1.38.0
[34] parallel_4.4.1 prettyunits_1.2.0 R6_2.5.1
[37] bslib_0.7.0 stringi_1.8.3 rtracklayer_1.64.0
[40] jquerylib_0.1.4 knitr_1.46 readr_2.1.5
[43] Matrix_1.7-0 tidyselect_1.2.1 abind_1.4-5
[46] yaml_2.3.8 codetools_0.2-20 curl_5.2.1
[49] lattice_0.22-6 tibble_3.2.1 withr_3.0.0
[52] KEGGREST_1.44.0 evaluate_0.23 getopt_1.20.4
[55] BiocFileCache_2.12.0 xml2_1.3.6 Biostrings_2.72.0
[58] pillar_1.9.0 BiocManager_1.30.22 filelock_1.0.3
[61] generics_0.1.3 vroom_1.6.5 RCurl_1.98-1.14
[64] BiocVersion_3.19.1 hms_1.1.3 glue_1.7.0
[67] lazyeval_0.2.2 tools_4.4.1 AnnotationHub_3.12.0
[70] BiocIO_1.14.0 GenomicAlignments_1.40.0 fs_1.6.4
[73] XML_3.99-0.16.1 grid_4.4.1 GenomeInfoDbData_1.2.12
[76] restfulr_0.0.15 cli_3.6.2 rappdirs_0.3.3
[79] fansi_1.0.6 S4Arrays_1.4.0 dplyr_1.1.4
[82] sass_0.4.9 digest_0.6.35 SparseArray_1.4.0
[85] tximport_1.32.0 rjson_0.2.21 memoise_2.0.1
[88] htmltools_0.5.8.1 lifecycle_1.0.4 httr_1.4.7
[91] mime_0.12 bit64_4.0.5