# Purpose of this analysis This notebook illustrates one way that you can use RNA-seq data from refine.bio to perform Uniform Manifold Approximation and Projection (UMAP) and plot the scores using `ggplot2`. ⬇️ [**Jump to the analysis code**](#analysis) ⬇️ # How to run this example For general information about our tutorials and the basic software packages you will need, please see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-this-tutorial-is-structured). We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. ## Obtain the `.Rmd` file of this analysis To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/dimension-reduction_rnaseq_02_umap.Rmd). Clicking this link will most likely send this to your downloads folder on your computer. Move this `.Rmd` file to where you would like this example and its files to be stored. You can open this `.Rmd` file in RStudio and follow the rest of these steps from there. (See our [section about getting started with R notebooks](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) if you are unfamiliar with `.Rmd` files.) ## Set up your analysis folders Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders! If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. ```{r} # Create the data folder if it doesn't exist if (!dir.exists("data")) { dir.create("data") } # Define the file path to the plots directory plots_dir <- "plots" # Create the plots folder if it doesn't exist if (!dir.exists(plots_dir)) { dir.create(plots_dir) } # Define the file path to the results directory results_dir <- "results" # Create the results folder if it doesn't exist if (!dir.exists(results_dir)) { dir.create(results_dir) } ``` In the same place you put this `.Rmd` file, you should now have three new empty folders called `data`, `plots`, and `results`! ## Obtain the dataset from refine.bio For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data). Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP133573). Click the "Download Now" button on the right side of this screen. Fill out the pop up window with your email and our Terms and Conditions: We are going to use non-quantile normalized data for this analysis. To get this data, you will need to check the box that says "Skip quantile normalization for RNA-seq samples". Note that this option will only be available for RNA-seq datasets. It may take a few minutes for the dataset to process. You will get an email when it is ready. ## About the dataset we are using for this example For this example analysis, we will use this [prostate cancer dataset](https://www.refine.bio/experiments/SRP133573). The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer. Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens. ## Place the dataset in your new `data/` folder refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`. Double clicking should unzip this for you and create a folder of the same name. For more details on the contents of this folder see [these docs on refine.bio](http://docs.refine.bio/en/latest/main_text.html#rna-seq-sample-compendium-download-folder). The `` folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like `GSE1235` or `SRP12345`. Copy and paste the `SRP133573` folder into your newly created `data/` folder. ## Check out our file structure! Your new analysis folder should contain: - The example analysis `.Rmd` you downloaded - A folder called "data" which contains: - The `SRP133573` folder which contains: - The gene expression - The metadata TSV - A folder for `plots` (currently empty) - A folder for `results` (currently empty) Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using): In order for our example here to run without a hitch, we need these files to be in these locations so we've constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place. First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started. ```{r} # Define the file path to the data directory # Replace with the path of the folder the files will be in data_dir <- file.path("data", "SRP133573") # Declare the file path to the gene expression matrix file # inside directory saved as `data_dir` # Replace with the path to your dataset file data_file <- file.path(data_dir, "SRP133573.tsv") # Declare the file path to the metadata file # inside the directory saved as `data_dir` # Replace with the path to your metadata file metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") ``` Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above. ```{r} # Check if the gene expression matrix file is at the path stored in `data_file` file.exists(data_file) # Check if the metadata file is at the file path stored in `metadata_file` file.exists(metadata_file) ``` If the chunk above printed out `FALSE` to either of those tests, you won't be able to run this analysis _as is_ until those files are in the appropriate place. If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). # Using a different refine.bio dataset with this analysis? If you'd like to adapt an example analysis to use a different dataset from [refine.bio](https://www.refine.bio/), we recommend placing the files in the `data/` directory you created and changing the filenames and paths in the notebook to match these files (we've put comments to signify where you would need to change the code). We suggest saving plots to the `plots/` and `results/` directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences. ***   # UMAP Visualization - RNA-seq ## Install libraries See our Getting Started page with [instructions for package installation](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#what-you-need-to-install) for a list of the other software you will need, as well as more tips and resources. In this analysis, we will be using the R package [`DESeq2`](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) [@Love2014] for normalization, the R package [`umap`](https://www.rdocumentation.org/packages/umapr/versions/0.0.0.9001/topics/umap) [@Konopka2020] for the production of UMAP dimension reduction values and the R package , and the R package [`ggplot2`](http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html) [@Prabhakaran2016] for plotting the UMAP values. ```{r} if (!("DESeq2" %in% installed.packages())) { # Install DESeq2 BiocManager::install("DESeq2", update = FALSE) } if (!("umap" %in% installed.packages())) { # Install umap package BiocManager::install("umap", update = FALSE) } ``` Attach the packages we need for this analysis: ```{r message=FALSE} # Attach the `DESeq2` library library(DESeq2) # Attach the `umap` library library(umap) # Attach the `ggplot2` library for plotting library(ggplot2) # We will need this so we can use the pipe: %>% library(magrittr) # Set the seed so our results are reproducible: set.seed(12345) ``` ## Import and set up data Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the both TSV files and add them as data frames to your environment. We stored our file paths as objects named `metadata_file` and `data_file` in [this previous step](#check-out-our-file-structure). ```{r} # Read in metadata TSV file metadata <- readr::read_tsv(metadata_file) # Read in data TSV file expression_df <- readr::read_tsv(data_file) %>% # Tuck away the gene ID column as row names, leaving only numeric values tibble::column_to_rownames("Gene") ``` Let's ensure that the metadata and data are in the same sample order. ```{r} # Make the data in the order of the metadata expression_df <- expression_df %>% dplyr::select(metadata$refinebio_accession_code) # Check if this is in the same order all.equal(colnames(expression_df), metadata$refinebio_accession_code) ``` Now we are going to use a combination of functions from the `DESeq2`, `umap`, and `ggplot2` packages to perform and visualize the results of the Uniform Manifold Approximation and Projection (UMAP) dimension reduction technique on our pre-ADT and post-ADT samples. ## Prepare metadata for `DESEq2` We need to make sure all of the metadata column variables, that we would like to use to annotate our plot, are converted into factors. ```{r} # convert the columns we will be using for annotation into factors metadata <- metadata %>% dplyr::select( # select only the columns that we will need for plotting refinebio_accession_code, refinebio_treatment, refinebio_disease ) %>% dplyr::mutate( # Now let's convert the annotation variables into factors refinebio_treatment = factor( refinebio_treatment, # specify the possible levels in the order we want them to appear levels = c("pre-adt", "post-adt") ), refinebio_disease = as.factor(refinebio_disease) ) ``` ## Define a minimum counts cutoff We want to filter out the genes that have not been expressed or that have low expression counts since these genes are likely to add noise rather than useful signal to our analysis. We are going to do some pre-filtering to keep only genes with 10 or more reads total. Note that rows represent gene data and the columns represent sample data in our dataset. ```{r} # Define a minimum counts cutoff and filter the data to include # only rows (genes) that have total counts above the cutoff filtered_expression_df <- expression_df %>% dplyr::filter(rowSums(.) >= 10) ``` We also need our counts to be rounded before we can use them with the `DESeqDataSetFromMatrix()` function. ```{r} # The `DESeqDataSetFromMatrix()` function needs the values to be converted to integers filtered_expression_df <- round(filtered_expression_df) ``` ## Create a DESeqDataset We will be using the `DESeq2` package for [normalizing and transforming our data](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods), which requires us to format our data into a `DESeqDataSet` object. We turn the data frame (or matrix) into a [`DESeqDataSet` object](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#02_About_DESeq2). ) and specify which variable labels our experimental groups using the [`design` argument](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#multi-factor-designs) [@Love2014]. In this chunk of code, we will not provide a specific model to the `design` argument because we are not performing a differential expression analysis. ```{r} # Create a `DESeqDataSet` object dds <- DESeqDataSetFromMatrix( countData = filtered_expression_df, # the counts values for all samples in our dataset colData = metadata, # annotation data for the samples in the counts data frame design = ~1 # Here we are not specifying a model # Replace with an appropriate design variable for your analysis ) ``` ## Perform DESeq2 normalization and transformation We are going to use the `vst()` function from the `DESeq2` package to normalize and transform the data. For more information about these transformation methods, [see here](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods). ```{r} # Normalize and transform the data in the `DESeqDataSet` object # using the `vst()` function from the `DESeq2` R package dds_norm <- vst(dds) ``` ## Perform UMAP Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique proposed by @McInnes2018 ([See associated paper](https://arxiv.org/abs/1802.03426)). While PCA assumes that the variation we care about has a particular distribution (normal, broadly speaking), UMAP allows more complicated distributions that it learns from the data. The advantage of this feature is that UMAP can do a better job separating clusters, especially when some of those clusters may be more similar to each other than others [@CCDL2020]. In this code chunk, we are going to extract the normalized counts data from the `DESeqDataSet` object and perform UMAP on the normalized data using `umap()` from the `umap` package. We are using the default parameters when we run the `umap::umap()` function. Here's some [guidance about choosing parameters](https://cran.r-project.org/web/packages/umap/vignettes/umap.html#tuning-umap) when executing `umap::umap()` [@umap-vignette]. You can also run the following in the RStudio console to get more information on the function and its default parameters: `?umap::umap` or `?umap::umap.defaults`. ```{r} # First we are going to retrieve the normalized data # from the `DESeqDataSet` object using the `assay()` function normalized_counts <- assay(dds_norm) %>% t() # We need to transpose this data so each row is a sample # Now perform UMAP on the normalized data umap_results <- umap::umap(normalized_counts) ``` ## Prepare data frame for plotting Now that we have the results from UMAP, we need to extract the counts data from the `umap_results` object and merge the variables from the metadata that we will use for annotating our plot. ```{r} # Make into data frame for plotting with `ggplot2` # The UMAP values we need for plotting are stored in the `layout` element umap_plot_df <- data.frame(umap_results$layout) %>% # Turn sample IDs stored as row names into a column tibble::rownames_to_column("refinebio_accession_code") %>% # Add the metadata into this data frame; match by sample IDs dplyr::inner_join(metadata, by = "refinebio_accession_code") ``` Let's take a look at the data frame we created in the chunk above. ```{r} umap_plot_df ``` Here we can see that UMAP took the data from thousands of genes, and reduced it to just two variables, `X1` and `X2`. ## Create UMAP plot Now we can use the `ggplot()` function to plot our normalized UMAP scores. ```{r} # Plot using `ggplot()` function ggplot( umap_plot_df, aes( x = X1, y = X2 ) ) + geom_point() # Plot individual points to make a scatterplot ``` Let's try adding a variable to our plot for annotation. In this code chunk, the variable `refinebio_treatment` is given to the `ggplot()` function so we can label by androgen deprivation therapy (ADT) status. ```{r} # Plot using `ggplot()` function ggplot( umap_plot_df, aes( x = X1, y = X2, color = refinebio_treatment # label points with different colors for each `subgroup` ) ) + geom_point() # This tells R that we want a scatterplot ``` In the next code chunk, we are going to add another variable to our plot for annotation. We'll plot using both `refinebio_treatment` and `refinebio_disease` variables for labels since they are central to the androgen deprivation therapy (ADT) based hypothesis in the [original paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6210624/) [@Sharma2018]. ```{r} # Plot using `ggplot()` function and save to an object final_annotated_umap_plot <- ggplot( umap_plot_df, aes( x = X1, y = X2, # plot points with different colors for each `refinebio_treatment` group color = refinebio_treatment, # plot points with different shapes for each `refinebio_disease` group shape = refinebio_disease ) ) + geom_point() # make a scatterplot # Display the plot that we saved above final_annotated_umap_plot ``` Although it does appear that majority of the pre-ADT and post-ADT appear to cluster together, there are still questions remaining as we look at outliers. ### Interpretation of UMAP plot and results 1. Note that the coordinates of UMAP output for any given cell can change dramatically depending on parameters, and even run to run with the same parameters (Also why setting the seed is important). This means that you should not rely too heavily on the exact values of UMAP's output. - One particular limitation of UMAP is that while observed clusters have some meaning, the distance *between* clusters usually does not (nor does cluster density). The fact that two clusters are near each other should NOT be interpreted to mean that they are more related to each other than to more distant clusters. (There is some disagreement about whether UMAP distances have more meaning, but it is also probably safer to assume they don't.) 2. Playing with the parameters so you can fine-tune them is a good way to give you more information about a particular analysis as well as the data itself. Feel free to try playing with the parameters on your own in the code chunks above! In summary, a good rule of thumb to remember is: if the results of an analysis can be completely changed by changing its parameters, you should be very cautious when it comes to the conclusions you draw from it as well as having good rationale for the parameters you choose (_adapted from @CCDL2020 training materials_). ## Save annotated UMAP plot as a PNG You can easily switch this to save to a JPEG or TIFF by changing the file name within the `ggsave()` function to the respective file suffix. ```{r} # Save plot using `ggsave()` function ggsave( file.path( plots_dir, "SRP133573_umap_plot.png" # Replace with a good file name for your plot ), plot = final_annotated_umap_plot ) ``` # Resources for further learning - [More on UMAP and its parameters](https://cran.r-project.org/web/packages/umap/vignettes/umap.html) [@umap-vignette] - [Guidelines on choosing dimension reduction methods](https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1006907&type=printable) [@Nguyen2019] - [A nice explanation and comparison of many different dimensionality reduction techniques that you may encounter](https://rpubs.com/Saskia/520216) [@Freytag2019] # Print session info At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this. ```{r} # Print session info sessioninfo::session_info() ``` # References