tximeta
This notebook will demonstrate how to:
tximeta
SummarizedExperiment
objectIn this notebook, we’ll import the transcript expression
quantification output from salmon quant
using the tximeta
package. tximeta
is in part a wrapper around another
package, tximport
,
which imports transcript expression data and summarizes it to the gene
level. Working at the gene rather than transcript level has a number of
potential advantages for interpretability, efficiency, and reduction of
false positives (Soneson et
al. 2016). tximeta
eases some of the burden of
import by automatically identifying the correct set of annotation data
to append to many data sets (Love
et al. 2020).
For more information about tximeta
, see this
excellent vignette from Love et al.
# Load the tximeta package
library(tximeta)
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
# Load the SummarizedExperiment package
library(SummarizedExperiment)
Loading required package: MatrixGenerics
Loading required package: matrixStats
Attaching package: 'MatrixGenerics'
The following objects are masked from 'package:matrixStats':
colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
colWeightedMeans, colWeightedMedians, colWeightedSds,
colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
rowWeightedSds, rowWeightedVars
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':
findMatches
The following objects are masked from 'package:base':
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':
rowMedians
The following objects are masked from 'package:matrixStats':
anyMissing, rowMedians
# directory where the data are located
data_dir <- file.path("data", "gastric-cancer")
# directory where the quant files are located, each sample is its own
# directory
quant_dir <- file.path(data_dir, "salmon_quant")
# create a directory to hold the tximeta results if it doesn't exist yet
txi_dir <- file.path(data_dir, "txi")
fs::dir_create(txi_dir)
We’ll need the quant.sf
files for all the samples in an
experiment which we have stored in quant_dir
.
# the quant files themselves
sf_files <- list.files(
quant_dir,
recursive = TRUE,
full.names = TRUE,
pattern = "quant.sf"
)
# sample metadata file
meta_file <- file.path(data_dir, "gastric-cancer_metadata.tsv")
Output
# Name the output gastric-cancer_tximeta.rds and use the directory created
# above as the rest of the path
txi_out_file <- file.path(txi_dir, "gastric-cancer_tximeta.rds")
All output files from salmon quant
we’ll use with
tximeta
are named quant.sf
. Unfortunately,
this means that the file names themselves do not have any information
about the sample they come from!
# Let's look at the full path for the quant.sf files
sf_files
[1] "data/gastric-cancer/salmon_quant/SRR585570/quant.sf"
[2] "data/gastric-cancer/salmon_quant/SRR585571/quant.sf"
[3] "data/gastric-cancer/salmon_quant/SRR585572/quant.sf"
[4] "data/gastric-cancer/salmon_quant/SRR585573/quant.sf"
[5] "data/gastric-cancer/salmon_quant/SRR585574/quant.sf"
[6] "data/gastric-cancer/salmon_quant/SRR585575/quant.sf"
[7] "data/gastric-cancer/salmon_quant/SRR585576/quant.sf"
[8] "data/gastric-cancer/salmon_quant/SRR585577/quant.sf"
data/gastric-cancer/salmon_quant/SRR585570/quant.sf
data/gastric-cancer/salmon_quant/SRR585571/quant.sf
data/gastric-cancer/salmon_quant/SRR585572/quant.sf
data/gastric-cancer/salmon_quant/SRR585573/quant.sf
data/gastric-cancer/salmon_quant/SRR585574/quant.sf
data/gastric-cancer/salmon_quant/SRR585575/quant.sf
data/gastric-cancer/salmon_quant/SRR585576/quant.sf
data/gastric-cancer/salmon_quant/SRR585577/quant.sf
Let’s extract the sample names from the file
paths using the stringr
package.
Notice how the file path is separated by /
. If we were
to split up this character string by /
, the second to last
item is the sample names (because we used them as directory names for
the salmon
output). This is exactly what
stringr::word()
allows us to do: split up the file paths by
/
and extract the sample names.
sample_names <- stringr::word(sf_files, -2, sep = "/")
sample_names
[1] "SRR585570" "SRR585571" "SRR585572" "SRR585573" "SRR585574" "SRR585575"
[7] "SRR585576" "SRR585577"
SRR585570
SRR585571
SRR585572
SRR585573
SRR585574
SRR585575
SRR585576
SRR585577
tximeta
needs a data frame with at least these two
columns: - a files
column with the file paths to the
quant.sf files - a names
column with the sample names
coldata <- data.frame(
files = sf_files,
names = sample_names
)
We have more information about these samples stored in the metadata
file that we will also want stored in coldata
. Let’s read
in the sample metadata from the TSV file.
# Read in the sample metadata TSV file and have a look
sample_meta_df <- readr::read_tsv(meta_file)
Rows: 8 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): tissue, srr_accession, title
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sample_meta_df
We’ll want this information to be added to the coldata
,
which we can do by using a join function to match up the rows between
the two data frames and combine them.
coldata <- coldata |>
dplyr::inner_join(sample_meta_df, by = c("names" = "srr_accession"))
coldata
tximeta
Using the coldata
data frame that we set up, we can now
run the tximeta()
to import our expression data while
automatically finding and associating the transcript annotations that
were used when we performed the quantification.
The first time you run tximeta()
you may get a message
about storing downloaded transcriptome data in a cache directory so that
it can retrieve the data more quickly the next time. We recommend you
use the cache, and accept the default location.
txi_data <- tximeta(coldata)
importing quantifications
reading in files with read_tsv
1 2 3 4 5 6 7 8
found matching transcriptome:
[ Ensembl - Homo sapiens - release 95 ]
useHub=TRUE: checking for EnsDb via 'AnnotationHub'
found matching EnsDb via 'AnnotationHub'
downloading 1 resources
retrieving 1 resource
loading from cache
require("ensembldb")
generating transcript ranges
*tximeta currently works easily for most human and mouse datasets, but requires a few more steps for other species.
We’ll summarize to the gene level using the
summarizeToGene()
function.
# Summarize to the gene level
gene_summarized <- summarizeToGene(txi_data)
loading existing EnsDb created: 2024-08-08 16:10:27
obtaining transcript-to-gene mapping from database
generating gene ranges
gene ranges assigned by total range of isoforms, see `assignRanges`
summarizing abundance
summarizing counts
summarizing length
We can use the class
function to see what type of object
gene_summarized
is.
# Check what type of object `gene_summarized` is
class(gene_summarized)
[1] "RangedSummarizedExperiment"
attr(,"package")
[1] "SummarizedExperiment"
RangedSummarizedExperiment
This tells us that gene_summarized
is an object called a
SummarizedExperiment
which can be handled by functions from the package of the same name. We
more specifically have a RangedSummarizedExperiment
which
is a more specific type of SummarizedExperiment
.
SummarizedExperiment
objects have this general
structure: