Objectives

This notebook will demonstrate how to:

  • Load tabular expression data and prepare it for use with DESeq2
  • Create an annotated heatmap of gene expression variation using the ComplexHeatmap package
  • Customize a PCA plot and compare the output with hierarchical clustering of the same data

In this notebook, we cluster RNA-seq data from the Open Pediatric Brain Tumor Atlas (OpenPBTA) project and create a heatmap. OpenPBTA is a collaborative project organized by the Data Lab and the Center for Data-Driven Discovery in Biomedicine (D3b) at the Children’s Hospital of Philadelphia conducted openly on GitHub.

You can read more about the project here.

We’ve downloaded some of the publicly available expression data from the project and selected a subset with the most common disease types for analysis here.

We’ll use a package called ComplexHeatmap to make our heatmap. This package allows us to annotate our heatmap with sample information, and will also perform clustering as part of generating the heatmap. It is highly flexible and opinionated - the data structures we pass ComplexHeatmap functions often need to be just right. See the ComplexHeatmap Complete Reference for more information.

Set up

Libraries

# We will manipulate RNASeq data with DESeq2 at the start
library(DESeq2)
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics

Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
    tapply, union, unique, unsplit, which.max, which.min

Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':

    findMatches
The following objects are masked from 'package:base':

    expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats

Attaching package: 'MatrixGenerics'
The following objects are masked from 'package:matrixStats':

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds,
    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
    rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
    rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    rowWeightedSds, rowWeightedVars
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':

    rowMedians
The following objects are masked from 'package:matrixStats':

    anyMissing, rowMedians
Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
# Then we'll be doing a bit of data wrangling with the Tidyverse
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
âś” dplyr     1.1.4     âś” readr     2.1.5
âś” forcats   1.0.0     âś” stringr   1.5.1
âś” ggplot2   3.5.1     âś” tibble    3.2.1
âś” lubridate 1.9.3     âś” tidyr     1.3.1
âś” purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
âś– lubridate::%within%() masks IRanges::%within%()
âś– dplyr::collapse()     masks IRanges::collapse()
âś– dplyr::combine()      masks Biobase::combine(), BiocGenerics::combine()
âś– dplyr::count()        masks matrixStats::count()
âś– dplyr::desc()         masks IRanges::desc()
âś– tidyr::expand()       masks S4Vectors::expand()
âś– dplyr::filter()       masks stats::filter()
âś– dplyr::first()        masks S4Vectors::first()
âś– dplyr::lag()          masks stats::lag()
âś– ggplot2::Position()   masks BiocGenerics::Position(), base::Position()
âś– purrr::reduce()       masks GenomicRanges::reduce(), IRanges::reduce()
âś– dplyr::rename()       masks S4Vectors::rename()
âś– lubridate::second()   masks S4Vectors::second()
âś– lubridate::second<-() masks S4Vectors::second<-()
âś– dplyr::slice()        masks IRanges::slice()
â„ą Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# ComplexHeatmap is the package we'll use for making a heatmap
# It will do the hierarchical clustering for us as well
library(ComplexHeatmap)
Loading required package: grid
========================================
ComplexHeatmap version 2.20.0
Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
Github page: https://github.com/jokergoo/ComplexHeatmap
Documentation: http://jokergoo.github.io/ComplexHeatmap-reference

If you use it in published research, please cite either one:
- Gu, Z. Complex Heatmap Visualization. iMeta 2022.
- Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
    genomic data. Bioinformatics 2016.


The new InteractiveComplexHeatmap package can directly export static 
complex heatmaps into an interactive Shiny app with zero effort. Have a try!

This message can be suppressed by:
  suppressPackageStartupMessages(library(ComplexHeatmap))
========================================

Directories and files

We have stored the data we’ll use in this notebook in data/open-pbta.

data_dir <- file.path("data", "open-pbta")

# We'll store the heatmap in plots/open-pbta - create directory if it doesn't exist yet
plots_dir <- file.path("plots", "open-pbta")
fs::dir_create(plots_dir)

Input files

# The metadata describing the samples
histologies_file <- file.path(data_dir, "pbta-histologies-subset.tsv")

# The RNA-seq counts table
rnaseq_file <- file.path(data_dir, "pbta-rsem-expected_count-subset.rds")

Output files

heatmap_file <- file.path(plots_dir,
                          "common_histologies_high_variance_heatmap.png")

Read in and prepare data

Metadata

Let’s read in the metadata file file and take a look at the data.

histologies_df <- read_tsv(histologies_file)
Rows: 607 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (31): Kids_First_Biospecimen_ID, CNS_region, sample_id, aliquot_id, Kids...
dbl  (2): OS_days, age_last_update_days

â„ą Use `spec()` to retrieve the full column specification for this data.
â„ą Specify the column types or set `show_col_types = FALSE` to quiet this message.

Use the chunk below to explore the metadata data frame.

View(histologies_df)

We’ll use the disease labels in a column called short_histology when we label the heatmap. Let’s count how many samples are assigned each short_histology label using the Tidyverse.

histology_count_df <- histologies_df |>
  # Count how many samples are in each short_histology and name the column
  # with that number n
  count(short_histology) |>
  # Sort from largest number of samples to smallest number of samples in a
  # histology
  arrange(desc(n))

histology_count_df

RNA-seq data

Read in the expression count matrix (stored as a data frame).

# Read in and examine the RNA-seq data
rnaseq_exp <- read_rds(rnaseq_file)

Convert and round

The count data we have consists mostly of integers, but because of the estimation procedure that RSEM uses, some counts are fractional. DESeq2, which we will be using below expects all integers, so we will round all of the count data. This is easiest if we first convert from a data frame to a matrix.

rnaseq_mat <- rnaseq_exp |>
  # move gene_id to the rownames
  tibble::column_to_rownames("gene_id") |>
  # convert to a matrix and round
  as.matrix() |>
  round()

Variance Stabilizing Transformation

Raw counts are not usually suitable for the algorithms we use for clustering and heatmap display, so we will use the vst() function from the DESeq2 package to transform our data.

Since we are starting from a matrix, not a SummarizedExperiment as we did previously, we will need to provide the sample information ourselves. Just to be sure nothing is out of order, we will check that the identifiers for the sample information stored in histologies_df matches the columns of our matrix.

all.equal(histologies_df$Kids_First_Biospecimen_ID,
          colnames(rnaseq_mat))
[1] TRUE

Now we can make our matrix into a DESeq2 dataset, adding on the sample information from histologies_df. Unlike when we were performing differential expression analysis, we won’t provide an experimental design at this stage.

ddset <- DESeqDataSetFromMatrix(rnaseq_mat,
                                colData = histologies_df,
                                design = ~ 1) # don't store an experimental design
converting counts to integer mode

We will again remove low count genes, as they are not likely to be informative.

genes_to_keep <- rowSums(counts(ddset)) >= 10
ddset <- ddset[genes_to_keep, ]

Now we can apply the variance stabilizing transformation, saving the results in a new object.

# apply variance stabilizing transformation
vst_data <- vst(ddset, blind = TRUE)

This object stores information about the transformation that was applied, but for now, we will only need the matrix of transformed data, which we can extract with assay().

# extract transformed data
expr_mat <- assay(vst_data)

What are the dimensions of this transformed RNA-seq data matrix?

dim(expr_mat)
[1] 48663   607

Almost 50k genes would be hard to visualize on a single heatmap! If we are making a heatmap to get an idea of the structure in our data, it can be helpful to subset to high variance genes. This is because the genes that don’t vary are not likely to contribute much to the overall patterns we are interested in.

First, we’ll calculate the variance for each gene using rowVars() from the matrixStats package and then take the genes in the top 10%.

# Calculate variance from the expression data
gene_variance <- matrixStats::rowVars(expr_mat)

# Find the value that we'll use as a threshold to filter the top 10%
variance_threshold <- quantile(gene_variance, 0.9)

# Row indices of high variance genes
high_variance_index <- which(gene_variance > variance_threshold)

# What does a row index look like?
head(high_variance_index)
       ENSG00000000971.15_CFH     ENSG00000001617.11_SEMA3F 
                            7                            15 
      ENSG00000002586.18_CD99 ENSG00000002586.18_PAR_Y_CD99 
                           24                            25 
     ENSG00000002587.9_HS3ST1      ENSG00000002745.12_WNT16 
                           26                            28 
# Get a matrix that is subset to just the high variance genes
high_var_mat <- expr_mat[high_variance_index, ]

Heatmap

Annotation

First, we’ll set up the metadata that we want to use to label samples in the heatmap. In ComplexHeatmap terminology, this is called annotation, or HeatmapAnnotation, specifically.

sample_annotation_df <- histologies_df |>
  # Select only the columns that we'll use
  select(Kids_First_Biospecimen_ID,
         short_histology,
         composition,
         tumor_descriptor)

# Let's examine these columns
sample_annotation_df

ComplexHeatmap is going to want the data frame we provide to have the sample identifiers as row names, so let’s set that up.

sample_annotation_df <- sample_annotation_df |>
  tibble::column_to_rownames("Kids_First_Biospecimen_ID")

To specify the colors in our annotation bar, we need to create a list of named vectors. The names of the list need to exactly match the column names in sample_annotation_df and the names in each vector need to exactly match the values in those columns.

# The Okabe Ito palette is recommended for those with color vision deficiencies
histology_colors <- palette.colors(palette = "Okabe-Ito")[2:5]
# `palette.colors()` returns a named vector, which can cause trouble
histology_colors <- unname(histology_colors)

# annotation color list for ComplexHeatMap
sample_annotation_colors <- list(
  short_histology = c(
    "LGAT" = histology_colors[[1]],
    "Ependymoma" = histology_colors[[2]],
    "HGAT" = histology_colors[[3]],
    "Medulloblastoma" = histology_colors[[4]]
  ),
  composition = c(
    "Solid Tissue" = "#A0A0A0",  # light grey
    "Derived Cell Line" = "#000000"  # black
  ),
  tumor_descriptor = c(
    "Initial CNS Tumor" = "#3333FF",
    "Progressive" = "#FFFF99",
    "Recurrence" = "#CCCCFF",
    "Second Malignancy" = "#000033",
    "Unavailable" = "#FFFFFF"  # white for missing data
  )
)

We need to create a special type of object with HeatmapAnnotation using the annotation data frame and the list of color vectors. We will also make the annotation labels a bit nicer to look at than the raw columns names.

column_annotation <- HeatmapAnnotation(
  df = sample_annotation_df,
  col = sample_annotation_colors,
  annotation_label = c("Histology", "Composition", "Tumor descriptor")
)

Values for display

We will z-score the expression values for display. This is sometimes called a standardized score. Some heatmap plotting packages will do this for you, but ComplexHeatmap does not. It’s calculated for each value in a row (gene) by subtracting the gene’s mean and dividing by the gene’s standard deviation. This will result in every row having a mean of 0 and a standard deviation of 1.

zscores_mat <-
  (high_var_mat - rowMeans(high_var_mat)) / matrixStats::rowSds(high_var_mat)

Since all genes have the same z-score variance (by definition!), if you’re filtering to high variance values, it’s important to do that prior to standardizing.

Heatmap itself!

Okay, now we’re ready to make a heatmap complete with annotation bars.

Heatmap(zscores_mat,
        # The distance metric used for clustering the rows
        # This is different from the default (Euclidean)
        clustering_distance_rows = "pearson",
        # Linkage method for row clustering
        # This is different from the default (complete)
        clustering_method_rows = "average",
        # Distance metric for columns
        clustering_distance_columns = "pearson",
        # Linkage for columns
        clustering_method_columns = "average",
        show_row_names = FALSE,
        show_column_names = FALSE,
        # Add annotation bars to the top of the heatmap
        top_annotation = column_annotation,
        # This will be used as the label for the color bar
        # of the cells of the heatmap itself
        name = "z-score")
The automatically generated colors map from the minus and plus 99^th of
the absolute values in the matrix. There are outliers in the matrix
whose patterns might be hidden by this color mapping. You can manually
set the color to `col` argument.

Use `suppressMessages()` to turn off this message.
`use_raster` is automatically set to TRUE for a matrix with more than
2000 rows. You can control `use_raster` argument by explicitly setting
TRUE/FALSE to it.

Set `ht_opt$message = FALSE` to turn off this message.

ComplexHeatmap gives a few warnings here, but nothing to be concerned about. One is complaining that the color scale may be truncated relative to our data. The other warns that our very large heatmap itself is being drawn at a lower resolution to speed things up.

We have to create the heatmap twice – once for display in the notebook and once to save to a PNG file. For the second drawing we will deal with the raster warning and produce a higher resolution output.

# Open PNG plot device
png(filename = heatmap_file,
    width = 11,
    height = 7,
    units = "in",
    res = 300)
# Heatmap, again!
Heatmap(zscores_mat,
        clustering_distance_rows = "pearson",
        clustering_method_rows = "average",
        clustering_distance_columns = "pearson",
        clustering_method_columns = "average",
        show_row_names = FALSE,
        show_column_names = FALSE,
        top_annotation = column_annotation,
        name = "z-score",
        use_raster = FALSE) # higher resolution for output (be careful with PDF output!)
The automatically generated colors map from the minus and plus 99^th of
the absolute values in the matrix. There are outliers in the matrix
whose patterns might be hidden by this color mapping. You can manually
set the color to `col` argument.

Use `suppressMessages()` to turn off this message.
# Shut down current graphics device
dev.off()
png 
  2 

Notes on clustering

You probably noticed that there were some arguments that we defined that were related to the clustering of samples and genes when we built the heatmaps. The clustering determines the arrangement of genes and samples in the heatmap, making it so genes or samples with similar expression patterns are grouped. If we didn’t do that, we would have a heatmap that looked basically like random static!

The kind of clustering we performed was hierarchical clustering. This is an agglomerative method of clustering - each sample starts in its own cluster and then at each step of the algorithm the two most similar clusters are joined until there’s only a single cluster left.

The arguments we chose for clustering_distance_* and clustering_method_* told ComplexHeatmap to use 1 minus the correlation between samples as the distance measure, and to average samples when they were merged into a group. This method is also known as UPGMA (unweighted pair group method with arithmetic mean).

The result of this clustering process is a tree (dendrogram), which is shown at the top for samples and to the left for genes. It is important to note that the linear arrangement of samples in the tree is somewhat arbitrary, so we have to be careful not to overinterpret it. We can rotate around any branch of the tree in our visualization with no change to the tree topology itself. In fact, the samples at opposite ends of the heatmap could be right next to one another with the right set of rotations!

PCA as an alternative to clustering

You probably noticed that the hierarchical clustering we performed didn’t perfectly separate out the different cancer subtypes we were looking at.

What would have happened if we had used PCA to visualize the relationships among samples? Would that have made it easier to separate disease types? Let’s try it!

We will use the VST transformed data here too, and the plotPCA() function from DESeq2.

# Use plotPCA, but return the data for custom plotting
pca_df <- plotPCA(vst_data,
                  ntop = 5000, # use the top 5000 genes by variance
                  intgroup = "short_histology",
                  returnData = TRUE)
using ntop=5000 top features by variance
ggplot(pca_df, aes(PC1, PC2, color = short_histology)) +
  geom_point() +
  theme_bw() +
  scale_color_manual(
    values = c("LGAT" = histology_colors[[1]],
               "Ependymoma" = histology_colors[[2]],
               "HGAT" = histology_colors[[3]],
               "Medulloblastoma" = histology_colors[[4]])
  ) +
  labs(color = "Histology")

We can see that neither method perfectly separates the different disease types. With PCA, if we didn’t color the points, it might even be difficult to identify distinct clusters in the data at all by eye. (We could look at more than two PCs, which might help out, but we would need to use a different function for our calculation, since plotPCA() only returns the first two.)

For nice comparison of the relative advantages of these two methods (with a little mention of some further directions), we recommend blog post by Soneson.

Session Info

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] ComplexHeatmap_2.20.0       lubridate_1.9.3            
 [3] forcats_1.0.0               stringr_1.5.1              
 [5] dplyr_1.1.4                 purrr_1.0.2                
 [7] readr_2.1.5                 tidyr_1.3.1                
 [9] tibble_3.2.1                ggplot2_3.5.1              
[11] tidyverse_2.0.0             DESeq2_1.44.0              
[13] SummarizedExperiment_1.34.0 Biobase_2.64.0             
[15] MatrixGenerics_1.16.0       matrixStats_1.3.0          
[17] GenomicRanges_1.56.0        GenomeInfoDb_1.40.0        
[19] IRanges_2.38.0              S4Vectors_0.42.0           
[21] BiocGenerics_0.50.0         optparse_1.7.5             

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1        farver_2.1.1            fastmap_1.1.1          
 [4] digest_0.6.35           timechange_0.3.0        lifecycle_1.0.4        
 [7] cluster_2.1.6           Cairo_1.6-2             magrittr_2.0.3         
[10] compiler_4.4.1          rlang_1.1.3             sass_0.4.9             
[13] tools_4.4.1             utf8_1.2.4              yaml_2.3.8             
[16] knitr_1.46              labeling_0.4.3          S4Arrays_1.4.0         
[19] bit_4.0.5               DelayedArray_0.30.0     RColorBrewer_1.1-3     
[22] abind_1.4-5             BiocParallel_1.38.0     withr_3.0.0            
[25] fansi_1.0.6             colorspace_2.1-0        scales_1.3.0           
[28] iterators_1.0.14        cli_3.6.2               rmarkdown_2.26         
[31] crayon_1.5.2            generics_0.1.3          httr_1.4.7             
[34] tzdb_0.4.0              rjson_0.2.21            getopt_1.20.4          
[37] cachem_1.0.8            zlibbioc_1.50.0         parallel_4.4.1         
[40] XVector_0.44.0          vctrs_0.6.5             Matrix_1.7-0           
[43] jsonlite_1.8.8          hms_1.1.3               GetoptLong_1.0.5       
[46] bit64_4.0.5             clue_0.3-65             magick_2.8.3           
[49] locfit_1.5-9.9          foreach_1.5.2           jquerylib_0.1.4        
[52] glue_1.7.0              codetools_0.2-20        shape_1.4.6.1          
[55] stringi_1.8.3           gtable_0.3.5            UCSC.utils_1.0.0       
[58] munsell_0.5.1           pillar_1.9.0            htmltools_0.5.8.1      
[61] GenomeInfoDbData_1.2.12 circlize_0.4.16         R6_2.5.1               
[64] doParallel_1.0.17       vroom_1.6.5             evaluate_0.23          
[67] lattice_0.22-6          highr_0.10              png_0.1-8              
[70] bslib_0.7.0             Rcpp_1.0.12             SparseArray_1.4.0      
[73] xfun_0.43               fs_1.6.4                pkgconfig_2.0.3        
[76] GlobalOptions_0.1.2    
---
title: "OpenPBTA: Create a heatmap"
author: CCDL for ALSF
date: 2021
output:
  html_notebook:
    toc: true
    toc_float: true
---


## Objectives

This notebook will demonstrate how to:

- Load tabular expression data and prepare it for use with `DESeq2`
- Create an annotated heatmap of gene expression variation using the `ComplexHeatmap` package
- Customize a PCA plot and compare the output with hierarchical clustering of the same data

---

In this notebook, we cluster RNA-seq data from the Open Pediatric Brain Tumor Atlas (OpenPBTA) project and create a heatmap.
OpenPBTA is a collaborative project organized by the Data Lab and the Center for Data-Driven Discovery in Biomedicine (D3b) at the Children's Hospital of Philadelphia conducted openly on GitHub.

You can read more about the project [here](https://github.com/alexslemonade/openpbta-analysis/#openpbta-analysis).

We've downloaded some of the publicly available expression data from the project and selected a subset with the most common disease types for analysis here.

We'll use a package called [`ComplexHeatmap`](https://www.bioconductor.org/packages/release/bioc/html/ComplexHeatmap.html) to make our heatmap.
This package allows us to annotate our heatmap with sample information, and will also perform clustering as part of generating the heatmap.
It is highly flexible and opinionated - the data structures we pass `ComplexHeatmap` functions often need to be _just right_.
See the [`ComplexHeatmap` Complete Reference](https://jokergoo.github.io/ComplexHeatmap-reference/book/) for more information.

## Set up

### Libraries

```{r load_libraries}
# We will manipulate RNASeq data with DESeq2 at the start
library(DESeq2)

# Then we'll be doing a bit of data wrangling with the Tidyverse
library(tidyverse)

# ComplexHeatmap is the package we'll use for making a heatmap
# It will do the hierarchical clustering for us as well
library(ComplexHeatmap)
```

### Directories and files

We have stored the data we'll use in this notebook in `data/open-pbta`.

```{r directories, live = TRUE}
data_dir <- file.path("data", "open-pbta")

# We'll store the heatmap in plots/open-pbta - create directory if it doesn't exist yet
plots_dir <- file.path("plots", "open-pbta")
fs::dir_create(plots_dir)
```

#### Input files

```{r input_files}
# The metadata describing the samples
histologies_file <- file.path(data_dir, "pbta-histologies-subset.tsv")

# The RNA-seq counts table
rnaseq_file <- file.path(data_dir, "pbta-rsem-expected_count-subset.rds")
```

#### Output files

```{r png_output, live = TRUE}
heatmap_file <- file.path(plots_dir,
                          "common_histologies_high_variance_heatmap.png")
```

## Read in and prepare data

### Metadata

Let's read in the metadata file file and take a look at the data.

```{r read_rds, live = TRUE}
histologies_df <- read_tsv(histologies_file)
```

Use the chunk below to explore the metadata data frame.

```{r explore_histologies_df, live = TRUE, eval = FALSE}
View(histologies_df)
```

We'll use the disease labels in a column called `short_histology` when we label the heatmap.
Let's count how many samples are assigned each `short_histology` label using the Tidyverse.

```{r count_histologies}
histology_count_df <- histologies_df |>
  # Count how many samples are in each short_histology and name the column
  # with that number n
  count(short_histology) |>
  # Sort from largest number of samples to smallest number of samples in a
  # histology
  arrange(desc(n))

histology_count_df
```

### RNA-seq data

Read in the expression count matrix (stored as a data frame).

```{r read_in_rnaseq, live = TRUE}
# Read in and examine the RNA-seq data
rnaseq_exp <- read_rds(rnaseq_file)
```

### Convert and round

The count data we have consists mostly of integers, but because of the estimation procedure that RSEM uses, some counts are fractional.
`DESeq2`, which we will be using below expects all integers, so we will round all of the count data.
This is easiest if we first convert from a data frame to a matrix.

```{r convert_round, live = TRUE}
rnaseq_mat <- rnaseq_exp |>
  # move gene_id to the rownames
  tibble::column_to_rownames("gene_id") |>
  # convert to a matrix and round
  as.matrix() |>
  round()
```

### Variance Stabilizing Transformation

Raw counts are not usually suitable for the algorithms we use for clustering and heatmap display, so we will use the `vst()` function from the `DESeq2` package to transform our data.

Since we are starting from a matrix, not a `SummarizedExperiment` as we did previously, we will need to provide the sample information ourselves.
Just to be sure nothing is out of order, we will check that the identifiers for the sample information stored in `histologies_df` matches the columns of our matrix.

```{r check-order}
all.equal(histologies_df$Kids_First_Biospecimen_ID,
          colnames(rnaseq_mat))
```

Now we can make our matrix into a `DESeq2` dataset, adding on the sample information from `histologies_df`.
Unlike when we were performing differential expression analysis, we won't provide an experimental design at this stage.

```{r make-DESEq}
ddset <- DESeqDataSetFromMatrix(rnaseq_mat,
                                colData = histologies_df,
                                design = ~ 1) # don't store an experimental design
```

We will again remove low count genes, as they are not likely to be informative.

```{r trim}
genes_to_keep <- rowSums(counts(ddset)) >= 10
ddset <- ddset[genes_to_keep, ]
```

Now we can apply the variance stabilizing transformation, saving the results in a new object.

```{r vst, live=TRUE}
# apply variance stabilizing transformation
vst_data <- vst(ddset, blind = TRUE)
```

This object stores information about the transformation that was applied, but for now, we will only need the matrix of transformed data, which we can extract with `assay()`.

```{r extract_expr, live = TRUE}
# extract transformed data
expr_mat <- assay(vst_data)
```

What are the dimensions of this transformed RNA-seq data matrix?

```{r rnaseq_dim, live = TRUE}
dim(expr_mat)
```

Almost 50k genes would be hard to visualize on a single heatmap!
If we are making a heatmap to get an idea of the structure in our data, it can be helpful to subset to high variance genes.
This is because the genes that *don't* vary are not likely to contribute much to the overall patterns we are interested in.

First, we'll calculate the variance for each gene using `rowVars()` from the `matrixStats` package and then take the genes in the top 10%.

```{r high_var_genes}
# Calculate variance from the expression data
gene_variance <- matrixStats::rowVars(expr_mat)

# Find the value that we'll use as a threshold to filter the top 10%
variance_threshold <- quantile(gene_variance, 0.9)

# Row indices of high variance genes
high_variance_index <- which(gene_variance > variance_threshold)

# What does a row index look like?
head(high_variance_index)
```

```{r high_var_mat, live = TRUE}
# Get a matrix that is subset to just the high variance genes
high_var_mat <- expr_mat[high_variance_index, ]
```

## Heatmap

### Annotation

First, we'll set up the metadata that we want to use to label samples in the heatmap.
In `ComplexHeatmap` terminology, this is called annotation, or `HeatmapAnnotation`, specifically.

```{r sample_annotation_df}
sample_annotation_df <- histologies_df |>
  # Select only the columns that we'll use
  select(Kids_First_Biospecimen_ID,
         short_histology,
         composition,
         tumor_descriptor)

# Let's examine these columns
sample_annotation_df
```
`ComplexHeatmap` is going to want the data frame we provide to have the sample identifiers as row names, so let's set that up.

```{r annotation_as_df, live = TRUE}
sample_annotation_df <- sample_annotation_df |>
  tibble::column_to_rownames("Kids_First_Biospecimen_ID")
```

To specify the colors in our annotation bar, we need to create a list of named vectors.
The names of the list need to *exactly* match the column names in `sample_annotation_df` and the names in each vector need to *exactly* match the values in those columns.

```{r color_palettes}
# The Okabe Ito palette is recommended for those with color vision deficiencies
histology_colors <- palette.colors(palette = "Okabe-Ito")[2:5]
# `palette.colors()` returns a named vector, which can cause trouble
histology_colors <- unname(histology_colors)

# annotation color list for ComplexHeatMap
sample_annotation_colors <- list(
  short_histology = c(
    "LGAT" = histology_colors[[1]],
    "Ependymoma" = histology_colors[[2]],
    "HGAT" = histology_colors[[3]],
    "Medulloblastoma" = histology_colors[[4]]
  ),
  composition = c(
    "Solid Tissue" = "#A0A0A0",  # light grey
    "Derived Cell Line" = "#000000"  # black
  ),
  tumor_descriptor = c(
    "Initial CNS Tumor" = "#3333FF",
    "Progressive" = "#FFFF99",
    "Recurrence" = "#CCCCFF",
    "Second Malignancy" = "#000033",
    "Unavailable" = "#FFFFFF"  # white for missing data
  )
)
```

We need to create a special type of object with `HeatmapAnnotation` using the annotation data frame and the list of color vectors.
We will also make the annotation labels a bit nicer to look at than the raw columns names.

```{r create_annotation, live = TRUE}
column_annotation <- HeatmapAnnotation(
  df = sample_annotation_df,
  col = sample_annotation_colors,
  annotation_label = c("Histology", "Composition", "Tumor descriptor")
)
```

### Values for display

We will z-score the expression values for display.
This is sometimes called a standardized score.
Some heatmap plotting packages will do this for you, but `ComplexHeatmap` does not.
It's calculated for each value in a row (gene) by subtracting the gene's mean and dividing by the gene's standard deviation.
This will result in every row having a mean of 0 and a standard deviation of 1.

```{r zscore}
zscores_mat <-
  (high_var_mat - rowMeans(high_var_mat)) / matrixStats::rowSds(high_var_mat)
```

Since all genes have the same z-score variance (by definition!), if you're filtering to high variance values, it's important to do that _prior_ to standardizing.

### Heatmap itself!

Okay, now we're ready to make a heatmap complete with annotation bars.

```{r make_heatmap}
Heatmap(zscores_mat,
        # The distance metric used for clustering the rows
        # This is different from the default (Euclidean)
        clustering_distance_rows = "pearson",
        # Linkage method for row clustering
        # This is different from the default (complete)
        clustering_method_rows = "average",
        # Distance metric for columns
        clustering_distance_columns = "pearson",
        # Linkage for columns
        clustering_method_columns = "average",
        show_row_names = FALSE,
        show_column_names = FALSE,
        # Add annotation bars to the top of the heatmap
        top_annotation = column_annotation,
        # This will be used as the label for the color bar
        # of the cells of the heatmap itself
        name = "z-score")
```

`ComplexHeatmap` gives a few warnings here, but nothing to be concerned about.
One is complaining that the color scale may be truncated relative to our data.
The other warns that our very large heatmap itself is being drawn at a lower resolution to speed things up.

We have to create the heatmap _twice_ -- once for display in the notebook and once to save to a PNG file.
For the second drawing we will deal with the raster warning and produce a higher resolution output.

```{r save_heatmap_png, live = TRUE}
# Open PNG plot device
png(filename = heatmap_file,
    width = 11,
    height = 7,
    units = "in",
    res = 300)
# Heatmap, again!
Heatmap(zscores_mat,
        clustering_distance_rows = "pearson",
        clustering_method_rows = "average",
        clustering_distance_columns = "pearson",
        clustering_method_columns = "average",
        show_row_names = FALSE,
        show_column_names = FALSE,
        top_annotation = column_annotation,
        name = "z-score",
        use_raster = FALSE) # higher resolution for output (be careful with PDF output!)
# Shut down current graphics device
dev.off()
```

### Notes on clustering

You probably noticed that there were some arguments that we defined that were related to the clustering of samples and genes when we built the heatmaps.
The clustering determines the arrangement of genes and samples in the heatmap, making it so genes or samples with similar expression patterns are grouped.
If we didn't do that, we would have a heatmap that looked basically like random static!

The kind of clustering we performed was [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering).
This is an _agglomerative_ method of clustering - each sample starts in its own cluster and then at each step of the algorithm the two most similar clusters are joined until there's only a single cluster left.

The arguments we chose for `clustering_distance_*` and `clustering_method_*` told `ComplexHeatmap` to use 1 minus the correlation between samples as the distance measure, and to average samples when they were merged into a group.
This method is also known as [UPGMA (unweighted pair group method with arithmetic mean)](https://en.wikipedia.org/wiki/UPGMA).

The result of this clustering process is a tree (dendrogram), which is shown at the top for samples and to the left for genes.
It is important to note that the linear arrangement of samples in the tree is somewhat arbitrary, so we have to be careful not to overinterpret it.
We can rotate around any branch of the tree in our visualization with no change to the tree topology itself.
In fact, the samples at opposite ends of the heatmap could be right next to one another with the right set of rotations!


### PCA as an alternative to clustering

You probably noticed that the hierarchical clustering we performed didn't perfectly separate out the different cancer subtypes we were looking at.

What would have happened if we had used PCA to visualize the relationships among samples?
Would that have made it easier to separate disease types?
Let's try it!

We will use the VST transformed data here too, and the `plotPCA()` function from `DESeq2`.

```{r expr-pca }
# Use plotPCA, but return the data for custom plotting
pca_df <- plotPCA(vst_data,
                  ntop = 5000, # use the top 5000 genes by variance
                  intgroup = "short_histology",
                  returnData = TRUE)


ggplot(pca_df, aes(PC1, PC2, color = short_histology)) +
  geom_point() +
  theme_bw() +
  scale_color_manual(
    values = c("LGAT" = histology_colors[[1]],
               "Ependymoma" = histology_colors[[2]],
               "HGAT" = histology_colors[[3]],
               "Medulloblastoma" = histology_colors[[4]])
  ) +
  labs(color = "Histology")
```

We can see that neither method perfectly separates the different disease types.
With PCA, if we didn't color the points, it might even be difficult to identify distinct clusters in the data at all by eye.
(We could look at more than two PCs, which might help out, but we would need to use a different function for our calculation, since `plotPCA()` only returns the first two.)

For nice comparison of the relative advantages of these two methods (with a little mention of some further directions), we recommend [blog post by Soneson](https://www.rna-seqblog.com/a-comparison-between-pca-and-hierarchical-clustering/).


## Session Info

```{r session_info}
sessionInfo()
```
