1 Purpose of the analysis

This notebook illustrates one way that you can use harmonized RNA-seq data from refine.bio in downstream analyses, specifically in plotting clustered heatmaps.

⬇️ Jump to the analysis code ⬇️

2 How to run this example

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

2.1 Obtain the .Rmd file

To run this example yourself, download the .Rmd for this analysis by clicking this link.

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
  dir.create("data")
}

# Define the file path to the plots directory
plots_dir <- "plots"

# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
  dir.create(plots_dir)
}

# Define the file path to the results directory
results_dir <- "results"

# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
  dir.create(results_dir)
}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

2.3 Obtain the dataset from refine.bio

For general information about downloading data for these examples, see our ‘Getting Started’ section.

Go to this dataset’s page on refine.bio.

Click the “Download Now” button on the right side of this screen.

Fill out the pop up window with your email and our Terms and Conditions:

We are going to use non-quantile normalized data for this analysis. To get this data, you will need to check the box that says “Skip quantile normalization for RNA-seq samples.” Note that this option will only be available for RNA-seq datasets.

It may take a few minutes for the dataset to process. You will get an email when it is ready.

2.4 About the dataset we are using for this example

For this example analysis, we will use this acute myeloid leukemia sample dataset.

The data that we downloaded from refine.bio for this analysis has 19 samples (obtained from 19 mice with acute myeloid leukemia (AML)), containing RNA-sequencing results for types of AML under controlled treatment conditions.

2.5 Place the dataset in your new data/ folder

refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.

For more details on the contents of this folder see these docs on refine.bio.

The <experiment_accession_id> folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235 or SRP12345.

Copy and paste the SRP070849 folder into your newly created data/ folder.

2.6 Check out our file structure!

Your new analysis folder should contain:

  • The example analysis .Rmd you downloaded
  • A folder called “data” which contains:
    • The SRP070849 folder which contains:
      • The gene expression
      • The metadata TSV
  • A folder for plots (currently empty)
  • A folder for results (currently empty)

Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.

# Define the file path to the data directory
# Replace with the path of the folder the files will be in
data_dir <- file.path("data", "SRP070849")

# Declare the file path to the gene expression matrix file
# inside directory saved as `data_dir`
# Replace with the path to your dataset file
data_file <- file.path(data_dir, "SRP070849.tsv")

# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
# Replace with the path to your metadata file
metadata_file <- file.path(data_dir, "metadata_SRP070849.tsv")

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
## [1] TRUE
# Check if the metadata file is at the file path stored in `metadata_file`
file.exists(metadata_file)
## [1] TRUE

If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.

3 Using a different refine.bio dataset with this analysis?

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.


 

4 Clustering Heatmap - RNA-seq

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

In this analysis, we will be using the R package DESeq2 (Love et al. 2014) for normalization and the R package pheatmap (Slowikowski 2017) for clustering and creating a heatmap.

if (!("pheatmap" %in% installed.packages())) {
  # Install pheatmap
  install.packages("pheatmap", update = FALSE)
}

if (!("DESeq2" %in% installed.packages())) {
  # Install DESeq2
  BiocManager::install("DESeq2", update = FALSE)
}

Attach the pheatmap and DESeq2 libraries:

# Attach the `pheatmap` library
library(pheatmap)

# Attach the `DESeq2` library
library(DESeq2)

# We will need this so we can use the pipe: %>%
library(magrittr)

# Set the seed so our results are reproducible:
set.seed(12345)

4.2 Import and set up data

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read in both TSV files and add them as data frames to your environment.

We stored our file paths as objects named metadata_file and data_file in this previous step.

# Read in metadata TSV file
metadata <- readr::read_tsv(metadata_file)
## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   .default = col_logical(),
##   refinebio_accession_code = col_character(),
##   experiment_accession = col_character(),
##   refinebio_organism = col_character(),
##   refinebio_platform = col_character(),
##   refinebio_source_database = col_character(),
##   refinebio_specimen_part = col_character(),
##   refinebio_subject = col_character(),
##   refinebio_title = col_character(),
##   refinebio_treatment = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
# Read in data TSV file
expression_df <- readr::read_tsv(data_file) %>%
  # Here we are going to store the gene IDs as row names so that
  # we can have only numeric values to perform calculations on later
  tibble::column_to_rownames("Gene")
## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   Gene = col_character()
## )
## ℹ Use `spec()` for the full column specifications.

Let’s take a look at the metadata object that we read into the R environment.

head(metadata)

Now let’s ensure that the metadata and data are in the same sample order.

# Make the data in the order of the metadata
expression_df <- expression_df %>%
  dplyr::select(metadata$refinebio_accession_code)

# Check if this is in the same order
all.equal(colnames(expression_df), metadata$refinebio_accession_code)
## [1] TRUE

Now we are going to use a combination of functions from the DESeq2 and pheatmap packages to look at how are samples and genes are clustering.

4.3 Define a minimum counts cutoff

We want to filter out the genes that have not been expressed or that have low expression counts since these genes are likely to add noise rather than useful signal to our analysis. We are going to do some pre-filtering to keep only genes with 10 or more reads total. Note that rows represent gene data and the columns represent sample data in our dataset.

# Define a minimum counts cutoff and filter the data to include
# only rows (genes) that have total counts above the cutoff
filtered_expression_df <- expression_df %>%
  dplyr::filter(rowSums(.) >= 10)

We also need our counts to be rounded before we can use them with the DESeqDataSetFromMatrix() function.

# The `DESeqDataSetFromMatrix()` function needs the values to be integers
filtered_expression_df <- round(filtered_expression_df)

4.4 Create a DESeqDataset

We will be using the DESeq2 package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet object. We turn the data frame (or matrix) into a DESeqDataSet object. ) and specify which variable labels our experimental groups using the design argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design argument because we are not performing a differential expression analysis.

# Create a `DESeqDataSet` object
dds <- DESeqDataSetFromMatrix(
  countData = filtered_expression_df, # the counts values for all samples
  colData = metadata, # annotation data for the samples
  design = ~1 # Here we are not specifying a model
  # Replace with an appropriate design variable for your analysis
)
## converting counts to integer mode

4.5 Perform DESeq2 normalization and transformation

We are going to use the rlog() function from the DESeq2 package to normalize and transform the data. For more information about these transformation methods, see here.

# Normalize the data in the `DESeqDataSet` object
# using the `rlog()` function from the `DESEq2` R package
dds_norm <- rlog(dds)

4.6 Choose genes of interest

Although you may want to create a heatmap including all of the genes in the dataset, this can produce a very large image that is hard to interpret. Alternatively, the heatmap could be created using only genes of interest. For this example, we will sort genes by variance and select genes in the upper quartile, but there are many alternative criterion by which you may want to sort your genes, e.g. fold change, t-statistic, membership in a particular gene ontology, so on.

# Calculate the variance for each gene
variances <- apply(assay(dds_norm), 1, var)

# Determine the upper quartile variance cutoff value
upper_var <- quantile(variances, 0.75)

# Filter the data choosing only genes whose variances are in the upper quartile
df_by_var <- data.frame(assay(dds_norm)) %>%
  dplyr::filter(variances > upper_var)

4.7 Create a heatmap

To further customize the heatmap, see a vignette for a guide at this link (Slowikowski 2017).

# Create and store the heatmap object
heatmap <- pheatmap(
  df_by_var,
  cluster_rows = TRUE, # Cluster the rows of the heatmap (genes in this case)
  cluster_cols = TRUE, # Cluster the columns of the heatmap (samples),
  show_rownames = FALSE, # There are too many genes to clearly show the labels
  main = "Non-Annotated Heatmap",
  colorRampPalette(c(
    "deepskyblue",
    "black",
    "yellow"
  ))(25
  ),
  scale = "row" # Scale values in the direction of genes (rows)
)

We’ve created a heatmap but although our genes and samples are clustered, there is not much information that we can gather here because we did not provide the pheatmap() function with annotation labels for our samples.

First let’s save our clustered heatmap.

4.7.1 Save heatmap as a PNG

You can easily switch this to save to a JPEG or tiff by changing the function and file name within the function to the respective file suffix.

# Open a PNG file
png(file.path(
  plots_dir,
  "SRP070849_heatmap_non_annotated.png" # Replace with a relevant file name
))

# Print your heatmap
heatmap

# Close the PNG file:
dev.off()
## png 
##   2

Now, let’s add some annotation bars to our heatmap.

4.8 Prepare metadata for annotation

From the accompanying paper, we know that the mice with IDH2 mutant AML were treated with vehicle or AG-221 (the first small molecule in-vivo inhibitor of IDH2 to enter clinical trials) and the mice with TET2 mutant AML were treated with vehicle or 5-Azacytidine (Decitabine, hypomethylating agent). (Shih et al. 2017) We are going to manipulate the metadata and add variables with the information for each sample, from the experimental design briefly described above, that we would like to use to annotate the heatmap.

# Let's prepare the annotation for the uncollapsed `DESeqData` set object
# which will be used to annotate the heatmap
annotation_df <- metadata %>%
  # Create a variable to store the cancer type information
  dplyr::mutate(
    mutation = dplyr::case_when(
      startsWith(refinebio_title, "TET2") ~ "TET2",
      startsWith(refinebio_title, "IDH2") ~ "IDH2",
      startsWith(refinebio_title, "WT") ~ "WT",
      # If none of the above criteria are satisfied,
      # we mark the `mutation` variable as "unknown"
      TRUE ~ "unknown"
    )
  ) %>%
  # select only the columns we need for annotation
  dplyr::select(
    refinebio_accession_code,
    mutation,
    refinebio_treatment
  ) %>%
  # The `pheatmap()` function requires that the row names of our annotation
  # data frame match the column names of our `DESeaDataSet` object
  tibble::column_to_rownames("refinebio_accession_code")

4.8.1 Create annotated heatmap

You can create an annotated heatmap by providing our annotation object to the annotation_col argument of the pheatmap() function.

# Create and store the annotated heatmap object
heatmap_annotated <-
  pheatmap(
    df_by_var,
    cluster_rows = TRUE,
    cluster_cols = TRUE,
    show_rownames = FALSE,
    annotation_col = annotation_df,
    main = "Annotated Heatmap",
    colorRampPalette(c(
      "deepskyblue",
      "black",
      "yellow"
    ))(25
    ),
    scale = "row" # Scale values in the direction of genes (rows)
  )

Now that we have annotation bars on our heatmap, we have a better idea of the sample variable groups that appear to cluster together.

Let’s save our annotated heatmap.

4.8.2 Save annotated heatmap as a PNG

You can switch this to save to a JPEG or TIFF by changing the function and file name within the function to the respective file suffix.

# Open a PNG file
png(file.path(
  plots_dir,
  "SRP070849_heatmap_annotated.png" # Replace with a relevant file name
))

# Print your heatmap
heatmap_annotated

# Close the PNG file:
dev.off()
## png 
##   2

5 Further learning resources about this analysis

6 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

# Print session info
sessioninfo::session_info()
## ─ Session info ─────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.5 (2021-03-31)
##  os       Ubuntu 20.04.3 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Etc/UTC                     
##  date     2022-03-01                  
## 
## ─ Packages ─────────────────────────────────────────────────────────
##  package              * version  date       lib source        
##  annotate               1.68.0   2020-10-27 [1] Bioconductor  
##  AnnotationDbi          1.52.0   2020-10-27 [1] Bioconductor  
##  assertthat             0.2.1    2019-03-21 [1] RSPM (R 4.0.3)
##  backports              1.2.1    2020-12-09 [1] RSPM (R 4.0.3)
##  Biobase              * 2.50.0   2020-10-27 [1] Bioconductor  
##  BiocGenerics         * 0.36.1   2021-04-16 [1] Bioconductor  
##  BiocParallel           1.24.1   2020-11-06 [1] Bioconductor  
##  bit                    4.0.4    2020-08-04 [1] RSPM (R 4.0.3)
##  bit64                  4.0.5    2020-08-30 [1] RSPM (R 4.0.3)
##  bitops                 1.0-7    2021-04-24 [1] RSPM (R 4.0.4)
##  blob                   1.2.1    2020-01-20 [1] RSPM (R 4.0.3)
##  bslib                  0.2.5    2021-05-12 [1] RSPM (R 4.0.4)
##  cachem                 1.0.5    2021-05-15 [1] RSPM (R 4.0.4)
##  cli                    2.5.0    2021-04-26 [1] RSPM (R 4.0.4)
##  colorspace             2.0-1    2021-05-04 [1] RSPM (R 4.0.4)
##  crayon                 1.4.1    2021-02-08 [1] RSPM (R 4.0.3)
##  DBI                    1.1.1    2021-01-15 [1] RSPM (R 4.0.3)
##  DelayedArray           0.16.3   2021-03-24 [1] Bioconductor  
##  DESeq2               * 1.30.1   2021-02-19 [1] Bioconductor  
##  digest                 0.6.27   2020-10-24 [1] RSPM (R 4.0.3)
##  dplyr                  1.0.6    2021-05-05 [1] RSPM (R 4.0.4)
##  ellipsis               0.3.2    2021-04-29 [1] RSPM (R 4.0.4)
##  evaluate               0.14     2019-05-28 [1] RSPM (R 4.0.3)
##  fansi                  0.4.2    2021-01-15 [1] RSPM (R 4.0.3)
##  farver                 2.1.0    2021-02-28 [1] RSPM (R 4.0.3)
##  fastmap                1.1.0    2021-01-25 [1] RSPM (R 4.0.3)
##  genefilter             1.72.1   2021-01-21 [1] Bioconductor  
##  geneplotter            1.68.0   2020-10-27 [1] Bioconductor  
##  generics               0.1.0    2020-10-31 [1] RSPM (R 4.0.3)
##  GenomeInfoDb         * 1.26.7   2021-04-08 [1] Bioconductor  
##  GenomeInfoDbData       1.2.4    2022-03-01 [1] Bioconductor  
##  GenomicRanges        * 1.42.0   2020-10-27 [1] Bioconductor  
##  getopt                 1.20.3   2019-03-22 [1] RSPM (R 4.0.0)
##  ggplot2                3.3.3    2020-12-30 [1] RSPM (R 4.0.3)
##  glue                   1.4.2    2020-08-27 [1] RSPM (R 4.0.3)
##  gtable                 0.3.0    2019-03-25 [1] RSPM (R 4.0.3)
##  highr                  0.9      2021-04-16 [1] RSPM (R 4.0.4)
##  hms                    1.0.0    2021-01-13 [1] RSPM (R 4.0.3)
##  htmltools              0.5.1.1  2021-01-22 [1] RSPM (R 4.0.3)
##  httr                   1.4.2    2020-07-20 [1] RSPM (R 4.0.3)
##  IRanges              * 2.24.1   2020-12-12 [1] Bioconductor  
##  jquerylib              0.1.4    2021-04-26 [1] RSPM (R 4.0.4)
##  jsonlite               1.7.2    2020-12-09 [1] RSPM (R 4.0.3)
##  knitr                  1.33     2021-04-24 [1] RSPM (R 4.0.4)
##  lattice                0.20-41  2020-04-02 [2] CRAN (R 4.0.5)
##  lifecycle              1.0.0    2021-02-15 [1] RSPM (R 4.0.3)
##  locfit                 1.5-9.4  2020-03-25 [1] RSPM (R 4.0.3)
##  magrittr             * 2.0.1    2020-11-17 [1] RSPM (R 4.0.3)
##  Matrix                 1.3-2    2021-01-06 [2] CRAN (R 4.0.5)
##  MatrixGenerics       * 1.2.1    2021-01-30 [1] Bioconductor  
##  matrixStats          * 0.58.0   2021-01-29 [1] RSPM (R 4.0.3)
##  memoise                2.0.0    2021-01-26 [1] RSPM (R 4.0.3)
##  munsell                0.5.0    2018-06-12 [1] RSPM (R 4.0.3)
##  optparse             * 1.6.6    2020-04-16 [1] RSPM (R 4.0.0)
##  pheatmap             * 1.0.12   2019-01-04 [1] RSPM (R 4.0.3)
##  pillar                 1.6.1    2021-05-16 [1] RSPM (R 4.0.4)
##  pkgconfig              2.0.3    2019-09-22 [1] RSPM (R 4.0.3)
##  ps                     1.6.0    2021-02-28 [1] RSPM (R 4.0.3)
##  purrr                  0.3.4    2020-04-17 [1] RSPM (R 4.0.3)
##  R.cache                0.15.0   2021-04-30 [1] RSPM (R 4.0.4)
##  R.methodsS3            1.8.1    2020-08-26 [1] RSPM (R 4.0.3)
##  R.oo                   1.24.0   2020-08-26 [1] RSPM (R 4.0.3)
##  R.utils                2.10.1   2020-08-26 [1] RSPM (R 4.0.3)
##  R6                     2.5.0    2020-10-28 [1] RSPM (R 4.0.3)
##  RColorBrewer           1.1-2    2014-12-07 [1] RSPM (R 4.0.3)
##  Rcpp                   1.0.6    2021-01-15 [1] RSPM (R 4.0.3)
##  RCurl                  1.98-1.3 2021-03-16 [1] RSPM (R 4.0.4)
##  readr                  1.4.0    2020-10-05 [1] RSPM (R 4.0.4)
##  rematch2               2.1.2    2020-05-01 [1] RSPM (R 4.0.3)
##  rlang                  0.4.11   2021-04-30 [1] RSPM (R 4.0.4)
##  rmarkdown              2.8      2021-05-07 [1] RSPM (R 4.0.4)
##  RSQLite                2.2.7    2021-04-22 [1] RSPM (R 4.0.4)
##  rstudioapi             0.13     2020-11-12 [1] RSPM (R 4.0.3)
##  S4Vectors            * 0.28.1   2020-12-09 [1] Bioconductor  
##  sass                   0.4.0    2021-05-12 [1] RSPM (R 4.0.4)
##  scales                 1.1.1    2020-05-11 [1] RSPM (R 4.0.3)
##  sessioninfo            1.1.1    2018-11-05 [1] RSPM (R 4.0.3)
##  stringi                1.6.1    2021-05-10 [1] RSPM (R 4.0.4)
##  stringr                1.4.0    2019-02-10 [1] RSPM (R 4.0.3)
##  styler                 1.4.1    2021-03-30 [1] RSPM (R 4.0.4)
##  SummarizedExperiment * 1.20.0   2020-10-27 [1] Bioconductor  
##  survival               3.2-10   2021-03-16 [2] CRAN (R 4.0.5)
##  tibble                 3.1.2    2021-05-16 [1] RSPM (R 4.0.4)
##  tidyselect             1.1.1    2021-04-30 [1] RSPM (R 4.0.4)
##  utf8                   1.2.1    2021-03-12 [1] RSPM (R 4.0.3)
##  vctrs                  0.3.8    2021-04-29 [1] RSPM (R 4.0.4)
##  withr                  2.4.2    2021-04-18 [1] RSPM (R 4.0.4)
##  xfun                   0.23     2021-05-15 [1] RSPM (R 4.0.4)
##  XML                    3.99-0.6 2021-03-16 [1] RSPM (R 4.0.4)
##  xtable                 1.8-4    2019-04-21 [1] RSPM (R 4.0.3)
##  XVector                0.30.0   2020-10-27 [1] Bioconductor  
##  yaml                   2.2.1    2020-02-01 [1] RSPM (R 4.0.3)
##  zlibbioc               1.36.0   2020-10-27 [1] Bioconductor  
## 
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library

References

Gu Z., R. Eils, and M. Schlesner, 2016 Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics. https://doi.org/10.1093/bioinformatics/btw313
Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8
Shih A. H., C. Meydan, K. Shank, F. E. Garrett-Bakelman, P. S. Ward, et al., 2017 Combination targeted therapy to disrupt aberrant oncogenic signaling and reverse epigenetic dysfunction in IDH2- and TET2-mutant acute myeloid leukemia. Cancer Discovery 7. https://doi.org/10.1158/2159-8290.CD-16-1049
Slowikowski K., 2017 Make heatmaps in R with pheatmap. https://slowkow.com/notes/pheatmap-tutorial/