Objectives

This notebook will demonstrate how to:

  • Load tabular expression data and prepare it for use with DESeq2
  • Create an annotated heatmap of gene expression variation using the ComplexHeatmap package
  • Customize a PCA plot and compare the output with hierarchical clustering of the same data

In this notebook, we cluster RNA-seq data from the Open Pediatric Brain Tumor Atlas (OpenPBTA) project and create a heatmap. OpenPBTA is a collaborative project organized by the Data Lab and the Center for Data-Driven Discovery in Biomedicine (D3b) at the Children’s Hospital of Philadelphia conducted openly on GitHub.

You can read more about the project here.

We’ve downloaded some of the publicly available expression data from the project and selected a subset with the most common disease types for analysis here.

We’ll use a package called ComplexHeatmap to make our heatmap. This package allows us to annotate our heatmap with sample information, and will also perform clustering as part of generating the heatmap. It is highly flexible and opinionated - the data structures we pass ComplexHeatmap functions often need to be just right. See the ComplexHeatmap Complete Reference for more information.

Set up

Libraries

# We will manipulate RNASeq data with DESeq2 at the start
library(DESeq2)
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics

Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
    tapply, union, unique, unsplit, which.max, which.min

Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':

    findMatches
The following objects are masked from 'package:base':

    expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats

Attaching package: 'MatrixGenerics'
The following objects are masked from 'package:matrixStats':

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds,
    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
    rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
    rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    rowWeightedSds, rowWeightedVars
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':

    rowMedians
The following objects are masked from 'package:matrixStats':

    anyMissing, rowMedians
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
# Then we'll be doing a bit of data wrangling with the Tidyverse
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ lubridate::%within%() masks IRanges::%within%()
✖ dplyr::collapse()     masks IRanges::collapse()
✖ dplyr::combine()      masks Biobase::combine(), BiocGenerics::combine()
✖ dplyr::count()        masks matrixStats::count()
✖ dplyr::desc()         masks IRanges::desc()
✖ tidyr::expand()       masks S4Vectors::expand()
✖ dplyr::filter()       masks stats::filter()
✖ dplyr::first()        masks S4Vectors::first()
✖ dplyr::lag()          masks stats::lag()
✖ ggplot2::Position()   masks BiocGenerics::Position(), base::Position()
✖ purrr::reduce()       masks GenomicRanges::reduce(), IRanges::reduce()
✖ dplyr::rename()       masks S4Vectors::rename()
✖ lubridate::second()   masks S4Vectors::second()
✖ lubridate::second<-() masks S4Vectors::second<-()
✖ dplyr::slice()        masks IRanges::slice()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# ComplexHeatmap is the package we'll use for making a heatmap
# It will do the hierarchical clustering for us as well
library(ComplexHeatmap)
Loading required package: grid
========================================
ComplexHeatmap version 2.20.0
Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
Github page: https://github.com/jokergoo/ComplexHeatmap
Documentation: http://jokergoo.github.io/ComplexHeatmap-reference

If you use it in published research, please cite either one:
- Gu, Z. Complex Heatmap Visualization. iMeta 2022.
- Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
    genomic data. Bioinformatics 2016.


The new InteractiveComplexHeatmap package can directly export static 
complex heatmaps into an interactive Shiny app with zero effort. Have a try!

This message can be suppressed by:
  suppressPackageStartupMessages(library(ComplexHeatmap))
========================================

Directories and files

We have stored the data we’ll use in this notebook in data/open-pbta.

data_dir <- file.path("data", "open-pbta")

# We'll store the heatmap in plots/open-pbta - create directory if it doesn't exist yet
plots_dir <- file.path("plots", "open-pbta")
fs::dir_create(plots_dir)

Input files

# The metadata describing the samples
histologies_file <- file.path(data_dir, "pbta-histologies-subset.tsv")

# The RNA-seq counts table
rnaseq_file <- file.path(data_dir, "pbta-rsem-expected_count-subset.rds")

Output files

heatmap_file <- file.path(plots_dir,
                          "common_histologies_high_variance_heatmap.png")

Read in and prepare data

Metadata

Let’s read in the metadata file file and take a look at the data.

histologies_df <- read_tsv(histologies_file)
Rows: 607 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (31): Kids_First_Biospecimen_ID, CNS_region, sample_id, aliquot_id, Kids...
dbl  (2): OS_days, age_last_update_days

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Use the chunk below to explore the metadata data frame.

View(histologies_df)

We’ll use the disease labels in a column called short_histology when we label the heatmap. Let’s count how many samples are assigned each short_histology label using the Tidyverse.

histology_count_df <- histologies_df |>
  # Count how many samples are in each short_histology and name the column
  # with that number n
  count(short_histology) |>
  # Sort from largest number of samples to smallest number of samples in a
  # histology
  arrange(desc(n))

histology_count_df

RNA-seq data

Read in the expression count matrix (stored as a data frame).

# Read in and examine the RNA-seq data
rnaseq_exp <- read_rds(rnaseq_file)

Convert and round

The count data we have consists mostly of integers, but because of the estimation procedure that RSEM uses, some counts are fractional. DESeq2, which we will be using below expects all integers, so we will round all of the count data. This is easiest if we first convert from a data frame to a matrix.

rnaseq_mat <- rnaseq_exp |>
  # move gene_id to the rownames
  tibble::column_to_rownames("gene_id") |>
  # convert to a matrix and round
  as.matrix() |>
  round()

Variance Stabilizing Transformation

Raw counts are not usually suitable for the algorithms we use for clustering and heatmap display, so we will use the vst() function from the DESeq2 package to transform our data.

Since we are starting from a matrix, not a SummarizedExperiment as we did previously, we will need to provide the sample information ourselves. Just to be sure nothing is out of order, we will check that the identifiers for the sample information stored in histologies_df matches the columns of our matrix.

all.equal(histologies_df$Kids_First_Biospecimen_ID,
          colnames(rnaseq_mat))
[1] TRUE

Now we can make our matrix into a DESeq2 dataset, adding on the sample information from histologies_df. Unlike when we were performing differential expression analysis, we won’t provide an experimental design at this stage.

ddset <- DESeqDataSetFromMatrix(rnaseq_mat,
                                colData = histologies_df,
                                design = ~ 1) # don't store an experimental design
converting counts to integer mode

We will again remove low count genes, as they are not likely to be informative.

genes_to_keep <- rowSums(counts(ddset)) >= 10
ddset <- ddset[genes_to_keep, ]

Now we can apply the variance stabilizing transformation, saving the results in a new object.

# apply variance stabilizing transformation
vst_data <- vst(ddset, blind = TRUE)

This object stores information about the transformation that was applied, but for now, we will only need the matrix of transformed data, which we can extract with assay().

# extract transformed data
expr_mat <- assay(vst_data)

What are the dimensions of this transformed RNA-seq data matrix?

dim(expr_mat)
[1] 48663   607

Almost 50k genes would be hard to visualize on a single heatmap! If we are making a heatmap to get an idea of the structure in our data, it can be helpful to subset to high variance genes. This is because the genes that don’t vary are not likely to contribute much to the overall patterns we are interested in.

First, we’ll calculate the variance for each gene using rowVars() from the matrixStats package and then take the genes in the top 10%.

# Calculate variance from the expression data
gene_variance <- matrixStats::rowVars(expr_mat)

# Find the value that we'll use as a threshold to filter the top 10%
variance_threshold <- quantile(gene_variance, 0.9)

# Row indices of high variance genes
high_variance_index <- which(gene_variance > variance_threshold)

# What does a row index look like?
head(high_variance_index)
       ENSG00000000971.15_CFH     ENSG00000001617.11_SEMA3F 
                            7                            15 
      ENSG00000002586.18_CD99 ENSG00000002586.18_PAR_Y_CD99 
                           24                            25 
     ENSG00000002587.9_HS3ST1      ENSG00000002745.12_WNT16 
                           26                            28 
# Get a matrix that is subset to just the high variance genes
high_var_mat <- expr_mat[high_variance_index, ]

Heatmap

Annotation

First, we’ll set up the metadata that we want to use to label samples in the heatmap. In ComplexHeatmap terminology, this is called annotation, or HeatmapAnnotation, specifically.

sample_annotation_df <- histologies_df |>
  # Select only the columns that we'll use
  select(Kids_First_Biospecimen_ID,
         short_histology,
         composition,
         tumor_descriptor)

# Let's examine these columns
sample_annotation_df

ComplexHeatmap is going to want the data frame we provide to have the sample identifiers as row names, so let’s set that up.

sample_annotation_df <- sample_annotation_df |>
  tibble::column_to_rownames("Kids_First_Biospecimen_ID")

To specify the colors in our annotation bar, we need to create a list of named vectors. The names of the list need to exactly match the column names in sample_annotation_df and the names in each vector need to exactly match the values in those columns.

# The Okabe Ito palette is recommended for those with color vision deficiencies
histology_colors <- palette.colors(palette = "Okabe-Ito")[2:5]
# `palette.colors()` returns a named vector, which can cause trouble
histology_colors <- unname(histology_colors)

# annotation color list for ComplexHeatMap
sample_annotation_colors <- list(
  short_histology = c(
    "LGAT" = histology_colors[[1]],
    "Ependymoma" = histology_colors[[2]],
    "HGAT" = histology_colors[[3]],
    "Medulloblastoma" = histology_colors[[4]]
  ),
  composition = c(
    "Solid Tissue" = "#A0A0A0",  # light grey
    "Derived Cell Line" = "#000000"  # black
  ),
  tumor_descriptor = c(
    "Initial CNS Tumor" = "#3333FF",
    "Progressive" = "#FFFF99",
    "Recurrence" = "#CCCCFF",
    "Second Malignancy" = "#000033",
    "Unavailable" = "#FFFFFF"  # white for missing data
  )
)

We need to create a special type of object with HeatmapAnnotation using the annotation data frame and the list of color vectors. We will also make the annotation labels a bit nicer to look at than the raw columns names.

column_annotation <- HeatmapAnnotation(
  df = sample_annotation_df,
  col = sample_annotation_colors,
  annotation_label = c("Histology", "Composition", "Tumor descriptor")
)

Values for display

We will z-score the expression values for display. This is sometimes called a standardized score. Some heatmap plotting packages will do this for you, but ComplexHeatmap does not. It’s calculated for each value in a row (gene) by subtracting the gene’s mean and dividing by the gene’s standard deviation. This will result in every row having a mean of 0 and a standard deviation of 1.

zscores_mat <-
  (high_var_mat - rowMeans(high_var_mat)) / matrixStats::rowSds(high_var_mat)

Since all genes have the same z-score variance (by definition!), if you’re filtering to high variance values, it’s important to do that prior to standardizing.

Heatmap itself!

Okay, now we’re ready to make a heatmap complete with annotation bars.

Heatmap(zscores_mat,
        # The distance metric used for clustering the rows
        # This is different from the default (Euclidean)
        clustering_distance_rows = "pearson",
        # Linkage method for row clustering
        # This is different from the default (complete)
        clustering_method_rows = "average",
        # Distance metric for columns
        clustering_distance_columns = "pearson",
        # Linkage for columns
        clustering_method_columns = "average",
        show_row_names = FALSE,
        show_column_names = FALSE,
        # Add annotation bars to the top of the heatmap
        top_annotation = column_annotation,
        # This will be used as the label for the color bar
        # of the cells of the heatmap itself
        name = "z-score")
The automatically generated colors map from the minus and plus 99^th of
the absolute values in the matrix. There are outliers in the matrix
whose patterns might be hidden by this color mapping. You can manually
set the color to `col` argument.

Use `suppressMessages()` to turn off this message.
`use_raster` is automatically set to TRUE for a matrix with more than
2000 rows. You can control `use_raster` argument by explicitly setting
TRUE/FALSE to it.

Set `ht_opt$message = FALSE` to turn off this message.