This notebook illustrates one way that you can use microarray data from refine.bio in downstream analyses, specifically in plotting clustered and annotated heatmaps.
For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.
.Rmd
fileTo run this example yourself, download the .Rmd
for this analysis by clicking this link.
Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd
file to where you would like this example and its files to be stored.
You can open this .Rmd
file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd
files.)
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
dir.create("data")
}
# Define the path to the directory where plots will be saved
<- "plots"
plots_dir
# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
dir.create(plots_dir)
}
# Define the path to the results directory
<- "results"
results_dir
# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
dir.create(results_dir)
}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
For general information about downloading data for these examples, see our ‘Getting Started’ section.
Go to this dataset’s page on refine.bio.
Click the “Download Now” button on the right side of this screen.
Fill out the pop up window with your email and our Terms and Conditions:
It may take a few minutes for the dataset to process. You will get an email when it is ready.
For this example analysis, we will use this melanoma cell line dataset.
The data that we downloaded from refine.bio for this analysis has 21 microarray samples. The samples were obtained from three PLX4032-sensitive parental and three PLX4032-resistant sub-lines of melanoma cell lines that were either treated or not treated with the RAF-selective inhibitor, PLX4032.
data/
folderrefine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip
. Double clicking should unzip this for you and create a folder of the same name.
For more details on the contents of this folder see these docs on refine.bio.
The <experiment_accession_id>
folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235
or SRP12345
.
Copy and paste the GSE24862
folder into your newly created data/
folder.
Your new analysis folder should contain:
.Rmd
you downloadedGSE24862
folder which contains:
plots
(currently empty)results
(currently empty)Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):
In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.
First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.
# Define the file path to the data directory
# Replace with the path of the folder the files will be in
<- file.path("data", "GSE24862")
data_dir
# Declare the file path to the gene expression matrix file
# inside directory saved as `data_dir`
# Replace with the path to your dataset file
<- file.path(data_dir, "GSE24862.tsv")
data_file
# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
# Replace with the path to your metadata file
<- file.path(data_dir, "metadata_GSE24862.tsv") metadata_file
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
## [1] TRUE
# Check if the metadata file is at the file path stored in `metadata_file`
file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE
to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.
If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/
directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/
and results/
directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.
See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
In this analysis, we will be using the R package pheatmap
for clustering and creating a heatmap (Slowikowski 2017).
if (!("pheatmap" %in% installed.packages())) {
# Install pheatmap
install.packages("pheatmap", update = FALSE)
}
Attach the pheatmap
library:
# Attach the `pheatmap` library
library(pheatmap)
# We will need this so we can use the pipe: %>%
library(magrittr)
Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read in both TSV files and add them as data frames to your environment.
We stored our file paths as objects named metadata_file
and data_file
in this previous step.
# Read in metadata TSV file
<- readr::read_tsv(metadata_file) metadata
##
## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_character(),
## refinebio_age = col_logical(),
## refinebio_compound = col_logical(),
## refinebio_disease = col_logical(),
## refinebio_disease_stage = col_logical(),
## refinebio_genetic_information = col_logical(),
## refinebio_processed = col_logical(),
## refinebio_race = col_logical(),
## refinebio_sex = col_logical(),
## refinebio_source_archive_url = col_logical(),
## refinebio_subject = col_logical(),
## refinebio_time = col_logical(),
## channel_count = col_double(),
## `contact_zip/postal_code` = col_double(),
## data_row_count = col_double(),
## taxid_ch1 = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
# Read in data TSV file
<- readr::read_tsv(data_file) %>%
df # Here we are going to store the gene IDs as row names so that
# we have only numeric values to perform calculations on later
::column_to_rownames("Gene") tibble
##
## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_double(),
## Gene = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
Let’s take a look at the metadata object that we read into the R environment.
head(metadata)
Now let’s ensure that the metadata and data are in the same sample order.
# Make the data in the order of the metadata
<- df %>% dplyr::select(metadata$refinebio_accession_code)
df
# Check if this is in the same order
all.equal(colnames(df), metadata$refinebio_accession_code)
## [1] TRUE
Now we are going to use a combination of functions from base R and the pheatmap
package to look at how our samples and genes are clustering.
Although you may want to create a heatmap including all of the genes in the dataset, this can produce a very large image that is hard to interpret. Alternatively, the heatmap could be created using only genes of interest. For this example, we will sort genes by variance and select genes in the upper quartile, but there are many alternative criteria by which you may want to sort your genes, e.g. fold change, t-statistic, membership in a particular gene ontology, and so on.
# Calculate the variance for each gene
<- apply(df, 1, var)
variances
# Determine the upper quartile variance cutoff value
<- quantile(variances, 0.75)
upper_var
# Filter the data choosing only genes whose variances are in the upper quartile
<- data.frame(df) %>%
df_by_var ::filter(variances > upper_var) dplyr
To further customize the heatmap, see a vignette for a guide at this link (Slowikowski 2017).
# Create and store the heatmap object
<- pheatmap(
heatmap
df_by_var,cluster_rows = TRUE, # Cluster the rows of the heatmap (genes in this case)
cluster_cols = TRUE, # Cluster the columns of the heatmap (samples)
show_rownames = FALSE, # There are too many genes to clearly show the labels
main = "Non-Annotated Heatmap",
colorRampPalette(c(
"deepskyblue",
"black",
"yellow"
25),
))(scale = "row" # Scale values in the direction of genes (rows)
)
We’ve created a heatmap but although our genes and samples are clustered, there is not much information that we can gather here because we did not provide the pheatmap()
function with annotation labels for our samples.
First let’s save our clustered heatmap.
You can easily switch this to save to a JPEG or TIFF by changing the function and file name within the function to the respective file suffix.
# Open a PNG file
png(file.path(
plots_dir,"GSE24862_heatmap_non_annotated.png" # Replace with a relevant file name
))
# Print your heatmap
heatmap
# Close the PNG file:
dev.off()
## png
## 2
Now, let’s add some annotation bars to our heatmap.
From the accompanying paper, we know that three PLX4032-sensitive parental cell lines (M229, M238 and M249) and three derived PLX4032-resistant (r) sub-lines (M229_r5, M238_r1, and M249_r4) were treated or not treated with the RAF-selective inhibitor, PLX4032 (Nazarian et al. 2010). We are going to annotate our heatmap with the variables that hold the refinebio_cell_line
and refinebio_treatment
data. We are also going to create a new column variable from our existing metadata called cell_line_type
, that will distinguish whether the refinebio_cell_line
is parental or resistant – since this is also a key aspect of the experimental design. Note that this step is very specific to our metadata, you may find that you also need to tailor the metadata for your own needs.
# Let's prepare an annotation data frame for plotting
<- metadata %>%
annotation_df # We want to select the variables that we want for annotating the heatmap
::select(
dplyr
refinebio_accession_code,
refinebio_cell_line,
refinebio_treatment%>%
) # Let's create a variable that specifically distinguishes whether
# the cell line is parental or resistant.
# This is a key aspect of the experimental design
::mutate(
dplyrcell_line_type =
::case_when(
dplyr::str_detect(refinebio_cell_line, "_r") ~ "resistant",
stringrTRUE ~ "parental"
)%>%
) # The `pheatmap()` function requires that the row names of our
# annotation object matches the column names of our dataset object
::column_to_rownames("refinebio_accession_code") tibble
You can create an annotated heatmap by providing our annotation object to the annotation_col
argument of the pheatmap()
function.
# Create and store the annotated heatmap object
<-
heatmap_annotated pheatmap(
df_by_var,cluster_rows = TRUE,
cluster_cols = TRUE,
show_rownames = FALSE,
annotation_col = annotation_df,
main = "Annotated Heatmap",
colorRampPalette(c(
"deepskyblue",
"black",
"yellow"
25),
))(scale = "row" # Scale values with respect to genes (rows)
)
Now that we have annotation bars on our heatmap, we have a better idea of the cell line and treatment groups that appear to cluster together. More specifically, we can see that the samples seem to cluster by their cell lines of origin, but not necessarily as much by whether or not they received the PLX4302
treatment.
Let’s save our annotated heatmap.
You can easily switch this to save to a JPEG or TIFF by changing the function and file name within the function to the respective file suffix.
# Open a PNG file
png(file.path(
plots_dir,"GSE24862_heatmap_annotated.png" # Replace with a relevant plot file name
))
# Print your heatmap
heatmap_annotated
# Close the PNG file:
dev.off()
## png
## 2
pheatmap
package allow, see the ComplexHeatmap Complete Reference Manual (Gu et al. 2016)At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
# Print session info
::session_info() sessioninfo
## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.5 (2021-03-31)
## os Ubuntu 20.04.3 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2022-03-01
##
## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.3)
## backports 1.2.1 2020-12-09 [1] RSPM (R 4.0.3)
## bslib 0.2.5 2021-05-12 [1] RSPM (R 4.0.4)
## cli 2.5.0 2021-04-26 [1] RSPM (R 4.0.4)
## colorspace 2.0-1 2021-05-04 [1] RSPM (R 4.0.4)
## crayon 1.4.1 2021-02-08 [1] RSPM (R 4.0.3)
## DBI 1.1.1 2021-01-15 [1] RSPM (R 4.0.3)
## digest 0.6.27 2020-10-24 [1] RSPM (R 4.0.3)
## dplyr 1.0.6 2021-05-05 [1] RSPM (R 4.0.4)
## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.0.4)
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.3)
## fansi 0.4.2 2021-01-15 [1] RSPM (R 4.0.3)
## farver 2.1.0 2021-02-28 [1] RSPM (R 4.0.3)
## generics 0.1.0 2020-10-31 [1] RSPM (R 4.0.3)
## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.3)
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.3)
## highr 0.9 2021-04-16 [1] RSPM (R 4.0.4)
## hms 1.0.0 2021-01-13 [1] RSPM (R 4.0.3)
## htmltools 0.5.1.1 2021-01-22 [1] RSPM (R 4.0.3)
## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.0.4)
## jsonlite 1.7.2 2020-12-09 [1] RSPM (R 4.0.3)
## knitr 1.33 2021-04-24 [1] RSPM (R 4.0.4)
## lifecycle 1.0.0 2021-02-15 [1] RSPM (R 4.0.3)
## magrittr * 2.0.1 2020-11-17 [1] RSPM (R 4.0.3)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.3)
## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
## pheatmap * 1.0.12 2019-01-04 [1] RSPM (R 4.0.3)
## pillar 1.6.1 2021-05-16 [1] RSPM (R 4.0.4)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.3)
## ps 1.6.0 2021-02-28 [1] RSPM (R 4.0.3)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.3)
## R.cache 0.15.0 2021-04-30 [1] RSPM (R 4.0.4)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.3)
## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.3)
## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.3)
## R6 2.5.0 2020-10-28 [1] RSPM (R 4.0.3)
## RColorBrewer 1.1-2 2014-12-07 [1] RSPM (R 4.0.3)
## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.4)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.3)
## rlang 0.4.11 2021-04-30 [1] RSPM (R 4.0.4)
## rmarkdown 2.8 2021-05-07 [1] RSPM (R 4.0.4)
## rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.0.3)
## sass 0.4.0 2021-05-12 [1] RSPM (R 4.0.4)
## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.3)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.3)
## stringi 1.6.1 2021-05-10 [1] RSPM (R 4.0.4)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.3)
## styler 1.4.1 2021-03-30 [1] RSPM (R 4.0.4)
## tibble 3.1.2 2021-05-16 [1] RSPM (R 4.0.4)
## tidyselect 1.1.1 2021-04-30 [1] RSPM (R 4.0.4)
## utf8 1.2.1 2021-03-12 [1] RSPM (R 4.0.3)
## vctrs 0.3.8 2021-04-29 [1] RSPM (R 4.0.4)
## withr 2.4.2 2021-04-18 [1] RSPM (R 4.0.4)
## xfun 0.23 2021-05-15 [1] RSPM (R 4.0.4)
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.3)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library