This notebook will demonstrate how to:
DESeq2
ComplexHeatmap
packageIn this notebook, we cluster RNA-seq data from the Open Pediatric Brain Tumor Atlas (OpenPBTA) project and create a heatmap. OpenPBTA is a collaborative project organized by the Data Lab and the Center for Data-Driven Discovery in Biomedicine (D3b) at the Children’s Hospital of Philadelphia conducted openly on GitHub.
You can read more about the project here.
We’ve downloaded some of the publicly available expression data from the project and selected a subset with the most common disease types for analysis here.
We’ll use a package called ComplexHeatmap
to make our heatmap. This package allows us to annotate our heatmap with
sample information, and will also perform clustering as part of
generating the heatmap. It is highly flexible and opinionated - the data
structures we pass ComplexHeatmap
functions often need to
be just right. See the ComplexHeatmap
Complete Reference for more information.
# We will manipulate RNASeq data with DESeq2 at the start
library(DESeq2)
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
tapply, union, unique, unsplit, which.max, which.min
Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':
findMatches
The following objects are masked from 'package:base':
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats
Attaching package: 'MatrixGenerics'
The following objects are masked from 'package:matrixStats':
colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
colWeightedMeans, colWeightedMedians, colWeightedSds,
colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
rowWeightedSds, rowWeightedVars
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':
rowMedians
The following objects are masked from 'package:matrixStats':
anyMissing, rowMedians
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
# Then we'll be doing a bit of data wrangling with the Tidyverse
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ lubridate::%within%() masks IRanges::%within%()
✖ dplyr::collapse() masks IRanges::collapse()
✖ dplyr::combine() masks Biobase::combine(), BiocGenerics::combine()
✖ dplyr::count() masks matrixStats::count()
✖ dplyr::desc() masks IRanges::desc()
✖ tidyr::expand() masks S4Vectors::expand()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::first() masks S4Vectors::first()
✖ dplyr::lag() masks stats::lag()
✖ ggplot2::Position() masks BiocGenerics::Position(), base::Position()
✖ purrr::reduce() masks GenomicRanges::reduce(), IRanges::reduce()
✖ dplyr::rename() masks S4Vectors::rename()
✖ lubridate::second() masks S4Vectors::second()
✖ lubridate::second<-() masks S4Vectors::second<-()
✖ dplyr::slice() masks IRanges::slice()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# ComplexHeatmap is the package we'll use for making a heatmap
# It will do the hierarchical clustering for us as well
library(ComplexHeatmap)
Loading required package: grid
========================================
ComplexHeatmap version 2.20.0
Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
Github page: https://github.com/jokergoo/ComplexHeatmap
Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
If you use it in published research, please cite either one:
- Gu, Z. Complex Heatmap Visualization. iMeta 2022.
- Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional
genomic data. Bioinformatics 2016.
The new InteractiveComplexHeatmap package can directly export static
complex heatmaps into an interactive Shiny app with zero effort. Have a try!
This message can be suppressed by:
suppressPackageStartupMessages(library(ComplexHeatmap))
========================================
We have stored the data we’ll use in this notebook in
data/open-pbta
.
data_dir <- file.path("data", "open-pbta")
# We'll store the heatmap in plots/open-pbta - create directory if it doesn't exist yet
plots_dir <- file.path("plots", "open-pbta")
fs::dir_create(plots_dir)
# The metadata describing the samples
histologies_file <- file.path(data_dir, "pbta-histologies-subset.tsv")
# The RNA-seq counts table
rnaseq_file <- file.path(data_dir, "pbta-rsem-expected_count-subset.rds")
heatmap_file <- file.path(plots_dir,
"common_histologies_high_variance_heatmap.png")
Let’s read in the metadata file file and take a look at the data.
histologies_df <- read_tsv(histologies_file)
Rows: 607 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (31): Kids_First_Biospecimen_ID, CNS_region, sample_id, aliquot_id, Kids...
dbl (2): OS_days, age_last_update_days
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Use the chunk below to explore the metadata data frame.
View(histologies_df)
We’ll use the disease labels in a column called
short_histology
when we label the heatmap. Let’s count how
many samples are assigned each short_histology
label using
the Tidyverse.
histology_count_df <- histologies_df |>
# Count how many samples are in each short_histology and name the column
# with that number n
count(short_histology) |>
# Sort from largest number of samples to smallest number of samples in a
# histology
arrange(desc(n))
histology_count_df
Read in the expression count matrix (stored as a data frame).
# Read in and examine the RNA-seq data
rnaseq_exp <- read_rds(rnaseq_file)
The count data we have consists mostly of integers, but because of
the estimation procedure that RSEM uses, some counts are fractional.
DESeq2
, which we will be using below expects all integers,
so we will round all of the count data. This is easiest if we first
convert from a data frame to a matrix.
rnaseq_mat <- rnaseq_exp |>
# move gene_id to the rownames
tibble::column_to_rownames("gene_id") |>
# convert to a matrix and round
as.matrix() |>
round()
Raw counts are not usually suitable for the algorithms we use for
clustering and heatmap display, so we will use the vst()
function from the DESeq2
package to transform our data.
Since we are starting from a matrix, not a
SummarizedExperiment
as we did previously, we will need to
provide the sample information ourselves. Just to be sure nothing is out
of order, we will check that the identifiers for the sample information
stored in histologies_df
matches the columns of our
matrix.
all.equal(histologies_df$Kids_First_Biospecimen_ID,
colnames(rnaseq_mat))
[1] TRUE
Now we can make our matrix into a DESeq2
dataset, adding
on the sample information from histologies_df
. Unlike when
we were performing differential expression analysis, we won’t provide an
experimental design at this stage.
ddset <- DESeqDataSetFromMatrix(rnaseq_mat,
colData = histologies_df,
design = ~ 1) # don't store an experimental design
converting counts to integer mode
We will again remove low count genes, as they are not likely to be informative.
genes_to_keep <- rowSums(counts(ddset)) >= 10
ddset <- ddset[genes_to_keep, ]
Now we can apply the variance stabilizing transformation, saving the results in a new object.
# apply variance stabilizing transformation
vst_data <- vst(ddset, blind = TRUE)
This object stores information about the transformation that was
applied, but for now, we will only need the matrix of transformed data,
which we can extract with assay()
.
# extract transformed data
expr_mat <- assay(vst_data)
What are the dimensions of this transformed RNA-seq data matrix?
dim(expr_mat)
[1] 48663 607
Almost 50k genes would be hard to visualize on a single heatmap! If we are making a heatmap to get an idea of the structure in our data, it can be helpful to subset to high variance genes. This is because the genes that don’t vary are not likely to contribute much to the overall patterns we are interested in.
First, we’ll calculate the variance for each gene using
rowVars()
from the matrixStats
package and
then take the genes in the top 10%.
# Calculate variance from the expression data
gene_variance <- matrixStats::rowVars(expr_mat)
# Find the value that we'll use as a threshold to filter the top 10%
variance_threshold <- quantile(gene_variance, 0.9)
# Row indices of high variance genes
high_variance_index <- which(gene_variance > variance_threshold)
# What does a row index look like?
head(high_variance_index)
ENSG00000000971.15_CFH ENSG00000001617.11_SEMA3F
7 15
ENSG00000002586.18_CD99 ENSG00000002586.18_PAR_Y_CD99
24 25
ENSG00000002587.9_HS3ST1 ENSG00000002745.12_WNT16
26 28
# Get a matrix that is subset to just the high variance genes
high_var_mat <- expr_mat[high_variance_index, ]
First, we’ll set up the metadata that we want to use to label samples
in the heatmap. In ComplexHeatmap
terminology, this is
called annotation, or HeatmapAnnotation
, specifically.
sample_annotation_df <- histologies_df |>
# Select only the columns that we'll use
select(Kids_First_Biospecimen_ID,
short_histology,
composition,
tumor_descriptor)
# Let's examine these columns
sample_annotation_df
ComplexHeatmap
is going to want the data frame we
provide to have the sample identifiers as row names, so let’s set that
up.
sample_annotation_df <- sample_annotation_df |>
tibble::column_to_rownames("Kids_First_Biospecimen_ID")
To specify the colors in our annotation bar, we need to create a list
of named vectors. The names of the list need to exactly match
the column names in sample_annotation_df
and the names in
each vector need to exactly match the values in those
columns.
# The Okabe Ito palette is recommended for those with color vision deficiencies
histology_colors <- palette.colors(palette = "Okabe-Ito")[2:5]
# `palette.colors()` returns a named vector, which can cause trouble
histology_colors <- unname(histology_colors)
# annotation color list for ComplexHeatMap
sample_annotation_colors <- list(
short_histology = c(
"LGAT" = histology_colors[[1]],
"Ependymoma" = histology_colors[[2]],
"HGAT" = histology_colors[[3]],
"Medulloblastoma" = histology_colors[[4]]
),
composition = c(
"Solid Tissue" = "#A0A0A0", # light grey
"Derived Cell Line" = "#000000" # black
),
tumor_descriptor = c(
"Initial CNS Tumor" = "#3333FF",
"Progressive" = "#FFFF99",
"Recurrence" = "#CCCCFF",
"Second Malignancy" = "#000033",
"Unavailable" = "#FFFFFF" # white for missing data
)
)
We need to create a special type of object with
HeatmapAnnotation
using the annotation data frame and the
list of color vectors. We will also make the annotation labels a bit
nicer to look at than the raw columns names.
column_annotation <- HeatmapAnnotation(
df = sample_annotation_df,
col = sample_annotation_colors,
annotation_label = c("Histology", "Composition", "Tumor descriptor")
)
We will z-score the expression values for display. This is sometimes
called a standardized score. Some heatmap plotting packages will do this
for you, but ComplexHeatmap
does not. It’s calculated for
each value in a row (gene) by subtracting the gene’s mean and dividing
by the gene’s standard deviation. This will result in every row having a
mean of 0 and a standard deviation of 1.
zscores_mat <-
(high_var_mat - rowMeans(high_var_mat)) / matrixStats::rowSds(high_var_mat)
Since all genes have the same z-score variance (by definition!), if you’re filtering to high variance values, it’s important to do that prior to standardizing.
Okay, now we’re ready to make a heatmap complete with annotation bars.
Heatmap(zscores_mat,
# The distance metric used for clustering the rows
# This is different from the default (Euclidean)
clustering_distance_rows = "pearson",
# Linkage method for row clustering
# This is different from the default (complete)
clustering_method_rows = "average",
# Distance metric for columns
clustering_distance_columns = "pearson",
# Linkage for columns
clustering_method_columns = "average",
show_row_names = FALSE,
show_column_names = FALSE,
# Add annotation bars to the top of the heatmap
top_annotation = column_annotation,
# This will be used as the label for the color bar
# of the cells of the heatmap itself
name = "z-score")
The automatically generated colors map from the minus and plus 99^th of
the absolute values in the matrix. There are outliers in the matrix
whose patterns might be hidden by this color mapping. You can manually
set the color to `col` argument.
Use `suppressMessages()` to turn off this message.
`use_raster` is automatically set to TRUE for a matrix with more than
2000 rows. You can control `use_raster` argument by explicitly setting
TRUE/FALSE to it.
Set `ht_opt$message = FALSE` to turn off this message.