This notebook illustrates one way that you can use microarray data from refine.bio to perform Uniform Manifold Approximation (UMAP) and plot the coordinates using ggplot2
.
For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.
.Rmd
fileTo run this example yourself, download the .Rmd
for this analysis by clicking this link.
Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd
file to where you would like this example and its files to be stored.
You can open this .Rmd
file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd
files.)
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
dir.create("data")
}
# Define the file path to the plots directory
<- "plots"
plots_dir
# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
dir.create(plots_dir)
}
# Define the file path to the results directory
<- "results"
results_dir
# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
dir.create(results_dir)
}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
For general information about downloading data for these examples, see our ‘Getting Started’ section.
Go to this dataset’s page on refine.bio.
Click the “Download Now” button on the right side of this screen.
Fill out the pop up window with your email and our Terms and Conditions:
It may take a few minutes for the dataset to process. You will get an email when it is ready.
For this example analysis, we will use this medulloblastoma dataset.
The data that we downloaded from refine.bio for this analysis has 285 microarray samples obtained from patients with medulloblastoma. The purpose of the experiment is to identify the predominant regions with somatic copy number aberrations in each medulloblastoma subgroup. We will use the subgroup
and histology
variables from our metadata to annotate our plot later. These variables hold the information on the one of the defined subgroups of medulloblastoma to which it sample belongs to and the molecular histology of each sample, respectively.
data/
folderrefine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip
. Double clicking should unzip this for you and create a folder of the same name.
For more details on the contents of this folder see these docs on refine.bio.
The <experiment_accession_id>
folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235
or SRP12345
.
Copy and paste the GSE37382
folder into your newly created data/
folder.
Your new analysis folder should contain:
.Rmd
you downloadedGSE37382
folder which contains:
plots
(currently empty)results
(currently empty)Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):
In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.
First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.
# Define the file path to the data directory
# Replace with the path of the folder the files will be in
<- file.path("data", "GSE37382")
data_dir
# Declare the file path to the gene expression matrix file
# inside directory saved as `data_dir`
# Replace with the path to your dataset file
<- file.path(data_dir, "GSE37382.tsv")
data_file
# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
# Replace with the path to your metadata file
<- file.path(data_dir, "metadata_GSE37382.tsv") metadata_file
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
## [1] TRUE
# Check if the metadata file is at the file path stored in `metadata_file`
file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE
to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.
If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/
directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/
and results/
directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.
Note that UMAP does require more dimensions than PCA, though, so if you have a smaller dataset (less than 20) and you obtain errors, you may not be able to run UMAP for your dataset.
See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
In this analysis, we will be using the R package umap
(Konopka 2020) for the production of UMAP dimension reduction values and the R package ggplot2
(Prabhakaran 2016) for plotting the UMAP values.
if (!("umap" %in% installed.packages())) {
# Install umap package
::install("umap", update = FALSE)
BiocManager }
Attach the packages we need for this analysis:
# Attach the `umap` library
library(umap)
# Attach the `ggplot2` library
library(ggplot2)
# We will need this so we can use the pipe: %>%
library(magrittr)
The UMAP algorithm utilizes random sampling so we are going to set the seed to make our results reproducible.
# Set the seed so our results are reproducible:
set.seed(12345)
Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the both TSV files and add them as data frames to your environment.
We stored our file paths as objects named metadata_file
and data_file
in this previous step.
# Read in metadata TSV file
<- readr::read_tsv(metadata_file) metadata
##
## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_character(),
## refinebio_age = col_double(),
## refinebio_cell_line = col_logical(),
## refinebio_compound = col_logical(),
## refinebio_disease = col_logical(),
## refinebio_disease_stage = col_logical(),
## refinebio_genetic_information = col_logical(),
## refinebio_processed = col_logical(),
## refinebio_race = col_logical(),
## refinebio_source_archive_url = col_logical(),
## refinebio_specimen_part = col_logical(),
## refinebio_subject = col_logical(),
## refinebio_time = col_logical(),
## refinebio_treatment = col_logical(),
## channel_count = col_double(),
## `contact_zip/postal_code` = col_double(),
## data_row_count = col_double(),
## taxid_ch1 = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
# Read in data TSV file
<- readr::read_tsv(data_file) %>%
df # Tuck away the gene ID column as row names, leaving only numeric values
::column_to_rownames("Gene") tibble
##
## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_double(),
## Gene = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
Let’s ensure that the metadata and data are in the same sample order.
# Make sure the columns (samples) are in the same order as the metadata
<- df %>%
df ::select(metadata$geo_accession)
dplyr
# Check if this is in the same order
all.equal(colnames(df), metadata$geo_accession)
## [1] TRUE
Now we are going to use a combination of functions from the umap
and the ggplot2
packages to perform and visualize the results of the Uniform Manifold Approximation (UMAP) dimension reduction technique on our medulloblastoma samples.
In this code chunk, we are going to perform Uniform Manifold Approximation (UMAP) on our data and create a data frame using the UMAP scores and the variables from our metadata that we are going to use to annotate our plot later. The umap()
function calculates scores for each row of a matrix, but our data is arranged with each sample in a column, so we will need to transpose the data frame first.
# Perform UMAP using the `umap::umap()` function
<- umap::umap(
umap_data t(df) # transpose our data frame to obtain PC scores for samples, not genes
)
# Make into data frame for plotting with `ggplot2`
# The UMAP values we need for plotting are stored in the `layout` element
<- data.frame(umap_data$layout) %>%
umap_df # Turn sample IDs stored as row names into a column
::rownames_to_column("refinebio_accession_code") %>%
tibble# Add on the variables that we want from the metadata into this data frame;
# match by sample IDs
::inner_join(
dplyr::select(metadata, refinebio_accession_code, histology, subgroup),
dplyrby = "refinebio_accession_code"
)
Let’s take a look at the data frame we created in the chunk above.
umap_df
Here we can see that UMAP took the data from thousands of genes, and reduced it to just two variables, X1
and X2
. Now, let’s plot those UMAP scores.
# Make a scatterplot using `ggplot2` functionality
<- ggplot(
umap_plot
umap_df,aes(
x = X1,
y = X2
)+
) geom_point() + # Plot individual points to make a scatterplot
theme_classic() # Format as a classic-looking plot with no gridlines
# Print out the plot here
umap_plot
It’s hard to interpret our UMAP results without some metadata labels on our plot.
Let’s label the data points based on their genotype subgroup since this is central to the subgroup specific based hypothesis in the original paper (Northcott et al. 2012).
# Make a scatterplot with ggplot2
<- ggplot(
umap_plot
umap_df,aes(
x = X1,
y = X2,
color = subgroup # label points with different colors for each `subgroup`
)+
) geom_point() +
theme_classic()
# Print out the plot here
umap_plot
It looks like SHH clusters pretty distinctly, with Group 3 and Group 4 being more similar and grouping together (with some division).
We can add another label to our plot to potentially gain insight on the clustering behavior of our data. Let’s also label the data points based on the histological subtype that each sample belongs to.
# Make a scatterplot with ggplot2
<- ggplot(
umap_plot
umap_df,aes(
x = X1,
y = X2,
color = subgroup, # Draw points with different colors for each `subgroup`
shape = histology # Use a different shape for each `histology` group
)+
) geom_point() +
theme_classic()
# Print out plot here
umap_plot
Our histological subtype groups don’t appear to be clustering in a discernible pattern. We could test out other variables as annotation labels to get a further understanding of the cluster behavior of each subgroup.
In summary, a good rule of thumb to remember is: if the results of an analysis can be completely changed by changing its parameters, you should be very cautious when it comes to the conclusions you draw from it as well as having good rationale for the parameters you choose (adapted from Childhood Cancer Data Lab (2020) training materials).
You can easily switch this to save to a JPEG or TIFF by changing the file name within the ggsave()
function to the respective file suffix.
# Save plot using `ggsave()` function
ggsave(
file.path(
plots_dir,"GSE37382_umap_plot.png" # Replace with a good file name for your plot
),plot = umap_plot # The plot object that we want saved to file
)
## Saving 7 x 5 in image
At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
# Print session info
::session_info() sessioninfo
## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.5 (2021-03-31)
## os Ubuntu 20.04.3 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2022-03-01
##
## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
## askpass 1.1 2019-01-13 [1] RSPM (R 4.0.3)
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.3)
## backports 1.2.1 2020-12-09 [1] RSPM (R 4.0.3)
## bslib 0.2.5 2021-05-12 [1] RSPM (R 4.0.4)
## cli 2.5.0 2021-04-26 [1] RSPM (R 4.0.4)
## colorspace 2.0-1 2021-05-04 [1] RSPM (R 4.0.4)
## crayon 1.4.1 2021-02-08 [1] RSPM (R 4.0.3)
## DBI 1.1.1 2021-01-15 [1] RSPM (R 4.0.3)
## digest 0.6.27 2020-10-24 [1] RSPM (R 4.0.3)
## dplyr 1.0.6 2021-05-05 [1] RSPM (R 4.0.4)
## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.0.4)
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.3)
## fansi 0.4.2 2021-01-15 [1] RSPM (R 4.0.3)
## farver 2.1.0 2021-02-28 [1] RSPM (R 4.0.3)
## generics 0.1.0 2020-10-31 [1] RSPM (R 4.0.3)
## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
## ggplot2 * 3.3.3 2020-12-30 [1] RSPM (R 4.0.3)
## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.3)
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.3)
## highr 0.9 2021-04-16 [1] RSPM (R 4.0.4)
## hms 1.0.0 2021-01-13 [1] RSPM (R 4.0.3)
## htmltools 0.5.1.1 2021-01-22 [1] RSPM (R 4.0.3)
## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.0.4)
## jsonlite 1.7.2 2020-12-09 [1] RSPM (R 4.0.3)
## knitr 1.33 2021-04-24 [1] RSPM (R 4.0.4)
## labeling 0.4.2 2020-10-20 [1] RSPM (R 4.0.3)
## lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.5)
## lifecycle 1.0.0 2021-02-15 [1] RSPM (R 4.0.3)
## magrittr * 2.0.1 2020-11-17 [1] RSPM (R 4.0.3)
## Matrix 1.3-2 2021-01-06 [2] CRAN (R 4.0.5)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.3)
## openssl 1.4.4 2021-04-30 [1] RSPM (R 4.0.4)
## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
## pillar 1.6.1 2021-05-16 [1] RSPM (R 4.0.4)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.3)
## png 0.1-7 2013-12-03 [1] RSPM (R 4.0.3)
## ps 1.6.0 2021-02-28 [1] RSPM (R 4.0.3)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.3)
## R.cache 0.15.0 2021-04-30 [1] RSPM (R 4.0.4)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.3)
## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.3)
## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.3)
## R6 2.5.0 2020-10-28 [1] RSPM (R 4.0.3)
## Rcpp 1.0.6 2021-01-15 [1] RSPM (R 4.0.3)
## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.4)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.3)
## reticulate 1.20 2021-05-03 [1] RSPM (R 4.0.4)
## rlang 0.4.11 2021-04-30 [1] RSPM (R 4.0.4)
## rmarkdown 2.8 2021-05-07 [1] RSPM (R 4.0.4)
## RSpectra 0.16-0 2019-12-01 [1] RSPM (R 4.0.3)
## rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.0.3)
## sass 0.4.0 2021-05-12 [1] RSPM (R 4.0.4)
## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.3)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.3)
## stringi 1.6.1 2021-05-10 [1] RSPM (R 4.0.4)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.3)
## styler 1.4.1 2021-03-30 [1] RSPM (R 4.0.4)
## tibble 3.1.2 2021-05-16 [1] RSPM (R 4.0.4)
## tidyselect 1.1.1 2021-04-30 [1] RSPM (R 4.0.4)
## umap * 0.2.7.0 2020-11-04 [1] RSPM (R 4.0.3)
## utf8 1.2.1 2021-03-12 [1] RSPM (R 4.0.3)
## vctrs 0.3.8 2021-04-29 [1] RSPM (R 4.0.4)
## withr 2.4.2 2021-04-18 [1] RSPM (R 4.0.4)
## xfun 0.23 2021-05-15 [1] RSPM (R 4.0.4)
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.3)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library