Objectives

This notebook will demonstrate how to:

  • Import RNA-seq expression quantification output using tximeta
  • Summarize transcript-level expression to the gene level
  • Interrogate and extract data from a SummarizedExperiment object

In this notebook, we’ll import the transcript expression quantification output from salmon quant using the tximeta package. tximeta is in part a wrapper around another package, tximport, which imports transcript expression data and summarizes it to the gene level. Working at the gene rather than transcript level has a number of potential advantages for interpretability, efficiency, and reduction of false positives (Soneson et al. 2016). tximeta eases some of the burden of import by automatically identifying the correct set of annotation data to append to many data sets (Love et al. 2020).

For more information about tximeta, see this excellent vignette from Love et al.

Libraries and functions

# Load the tximeta package
library(tximeta)
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
# Load the SummarizedExperiment package
library(SummarizedExperiment)
Loading required package: MatrixGenerics
Loading required package: matrixStats

Attaching package: 'MatrixGenerics'
The following objects are masked from 'package:matrixStats':

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds,
    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
    rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
    rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    rowWeightedSds, rowWeightedVars
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics

Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
    tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors

Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':

    findMatches
The following objects are masked from 'package:base':

    expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':

    rowMedians
The following objects are masked from 'package:matrixStats':

    anyMissing, rowMedians

Directories and files

# directory where the data are located
data_dir <- file.path("data", "gastric-cancer")

# directory where the quant files are located, each sample is its own
# directory
quant_dir <- file.path(data_dir, "salmon_quant")

# create a directory to hold the tximeta results if it doesn't exist yet
txi_dir <- file.path(data_dir, "txi")
fs::dir_create(txi_dir)

We’ll need the quant.sf files for all the samples in an experiment which we have stored in quant_dir.

# the quant files themselves
sf_files <- list.files(
  quant_dir, 
  recursive = TRUE, 
  full.names = TRUE,
  pattern = "quant.sf"
)
# sample metadata file
meta_file <- file.path(data_dir, "gastric-cancer_metadata.tsv")

Output

# Name the output gastric-cancer_tximeta.rds and use the directory created
# above as the rest of the path
txi_out_file <- file.path(txi_dir, "gastric-cancer_tximeta.rds")

File names

All output files from salmon quant we’ll use with tximeta are named quant.sf. Unfortunately, this means that the file names themselves do not have any information about the sample they come from!

# Let's look at the full path for the quant.sf files
sf_files
[1] "data/gastric-cancer/salmon_quant/SRR585570/quant.sf"
[2] "data/gastric-cancer/salmon_quant/SRR585571/quant.sf"
[3] "data/gastric-cancer/salmon_quant/SRR585572/quant.sf"
[4] "data/gastric-cancer/salmon_quant/SRR585573/quant.sf"
[5] "data/gastric-cancer/salmon_quant/SRR585574/quant.sf"
[6] "data/gastric-cancer/salmon_quant/SRR585575/quant.sf"
[7] "data/gastric-cancer/salmon_quant/SRR585576/quant.sf"
[8] "data/gastric-cancer/salmon_quant/SRR585577/quant.sf"
data/gastric-cancer/salmon_quant/SRR585570/quant.sf
data/gastric-cancer/salmon_quant/SRR585571/quant.sf
data/gastric-cancer/salmon_quant/SRR585572/quant.sf
data/gastric-cancer/salmon_quant/SRR585573/quant.sf
data/gastric-cancer/salmon_quant/SRR585574/quant.sf
data/gastric-cancer/salmon_quant/SRR585575/quant.sf
data/gastric-cancer/salmon_quant/SRR585576/quant.sf
data/gastric-cancer/salmon_quant/SRR585577/quant.sf

Let’s extract the sample names from the file paths using the stringr package.

Notice how the file path is separated by /. If we were to split up this character string by /, the second to last item is the sample names (because we used them as directory names for the salmon output). This is exactly what stringr::word() allows us to do: split up the file paths by / and extract the sample names.

sample_names <- stringr::word(sf_files, -2, sep = "/")
sample_names
[1] "SRR585570" "SRR585571" "SRR585572" "SRR585573" "SRR585574" "SRR585575"
[7] "SRR585576" "SRR585577"
SRR585570
SRR585571
SRR585572
SRR585573
SRR585574
SRR585575
SRR585576
SRR585577

Set up metadata

tximeta needs a data frame with at least these two columns: - a files column with the file paths to the quant.sf files - a names column with the sample names

coldata <- data.frame(
  files = sf_files,
  names = sample_names
)

We have more information about these samples stored in the metadata file that we will also want stored in coldata. Let’s read in the sample metadata from the TSV file.

# Read in the sample metadata TSV file and have a look
sample_meta_df <- readr::read_tsv(meta_file)
Rows: 8 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): tissue, srr_accession, title

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sample_meta_df

We’ll want this information to be added to the coldata, which we can do by using a join function to match up the rows between the two data frames and combine them.

coldata <- coldata |>
  dplyr::inner_join(sample_meta_df, by = c("names" = "srr_accession"))

coldata

Import expression data with tximeta

Using the coldata data frame that we set up, we can now run the tximeta() to import our expression data while automatically finding and associating the transcript annotations that were used when we performed the quantification.

The first time you run tximeta() you may get a message about storing downloaded transcriptome data in a cache directory so that it can retrieve the data more quickly the next time. We recommend you use the cache, and accept the default location.

txi_data <- tximeta(coldata)
importing quantifications
reading in files with read_tsv
1 2 3 4 5 6 7 8 
found matching transcriptome:
[ Ensembl - Homo sapiens - release 95 ]
useHub=TRUE: checking for EnsDb via 'AnnotationHub'
found matching EnsDb via 'AnnotationHub'
downloading 1 resources
retrieving 1 resource
loading from cache
require("ensembldb")
generating transcript ranges

*tximeta currently works easily for most human and mouse datasets, but requires a few more steps for other species.

Summarize to gene

We’ll summarize to the gene level using the summarizeToGene() function.

# Summarize to the gene level
gene_summarized <- summarizeToGene(txi_data)
loading existing EnsDb created: 2024-08-08 16:10:27
obtaining transcript-to-gene mapping from database
generating gene ranges
gene ranges assigned by total range of isoforms, see `assignRanges`
summarizing abundance
summarizing counts
summarizing length

We can use the class function to see what type of object gene_summarized is.

# Check what type of object `gene_summarized` is
class(gene_summarized)
[1] "RangedSummarizedExperiment"
attr(,"package")
[1] "SummarizedExperiment"
RangedSummarizedExperiment

This tells us that gene_summarized is an object called a SummarizedExperiment which can be handled by functions from the package of the same name. We more specifically have a RangedSummarizedExperiment which is a more specific type of SummarizedExperiment.

SummarizedExperiment objects have this general structure: