1 Purpose of this analysis

This notebook illustrates one way that you can use microarray data from refine.bio to perform Principal Component Analysis (PCA) and plot the scores using ggplot2.

⬇️ Jump to the analysis code ⬇️

2 How to run this example

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

2.1 Obtain the .Rmd file

To run this example yourself, download the .Rmd for this analysis by clicking this link.

Clicking this link will most likely download the file to the Downloads folder on your computer. Move this .Rmd file to wherever you would like this example and its files to be stored.

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have included some code that will automatically create a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
  dir.create("data")
}

# Define the file path to the plots directory
plots_dir <- "plots"

# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
  dir.create(plots_dir)
}

# Define the file path to the results directory
results_dir <- "results"

# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
  dir.create(results_dir)
}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

2.3 Obtain the dataset from refine.bio

For general information about downloading data for these examples, see our ‘Getting Started’ section.

Go to this dataset’s page on refine.bio.

Click the “Download Now” button on the right side of this screen.

Fill out the pop-up window with your email and agree to our Terms and Conditions:

It may take a few minutes for the dataset to process. You will get an email when it is ready.

2.4 About the dataset we are using for this example

For this example analysis, we will use this medulloblastoma dataset.

The data that we downloaded from refine.bio for this analysis has 285 microarray samples obtained from patients with medulloblastoma. The purpose of the original experiment was to identify the predominant regions with somatic copy number aberrations in each medulloblastoma subgroup. We will use the subgroup and histology variables from our metadata to annotate our plot later. These variables hold the defined medulloblastoma subgroup to which each sample belongs and the histological subtype of each sample, respectively.

2.5 Place the dataset in your new data/ folder

When your dataset is ready, the email from refine.bio will include a download button. Follow the prompt to download a zip file that has a name made up of a series of letters and numbers and ends in .zip. Double-clicking it should unzip it for you and create a folder of the same name.

For more details on the contents of this folder see these docs on refine.bio.

The <experiment_accession_id> folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235 or SRP12345.

Copy and paste the GSE37382 folder into your newly created data/ folder.

2.6 Check out our file structure!

Your new analysis folder should contain:

  • The example analysis .Rmd you downloaded
  • A folder called “data” which contains:
    • The GSE37382 folder which contains:
      • The gene expression TSV
      • The metadata TSV
  • A folder for plots (currently empty)
  • A folder for results (currently empty)

Your example analysis folder should now look something like this (except with the experiment accession ID and analysis notebook name you are using):

In order for our example here to run without a hitch, we need these files to be in these locations, so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double-check that your files are in the right place.

First we will declare the file paths to our data and metadata files, which should be in our data directory. This is handy because, if we want to switch the dataset we are using for this analysis (see the next section for more on this), we will only have to change the file paths here to get started.

# Define the file path to the data directory
# Replace with the path of the folder the files will be in
data_dir <- file.path("data", "GSE37382")

# Declare the file path to the gene expression matrix file
# inside directory saved as `data_dir`
# Replace with the path to your dataset file
data_file <- file.path(data_dir, "GSE37382.tsv")

# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
# Replace with the path to your metadata file
metadata_file <- file.path(data_dir, "metadata_GSE37382.tsv")

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
## [1] TRUE
# Check if the metadata file is at the file path stored in `metadata_file`
file.exists(metadata_file)
## [1] TRUE

If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
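
If you would prefer the notebook to stop right away with an informative error instead of printing FALSE, a minimal optional guard (our addition, not part of the original notebook) is:

# Optional guard (our addition): halt immediately if either file is missing
stopifnot(
  file.exists(data_file),
  file.exists(metadata_file)
)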

If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.

3 Using a different refine.bio dataset with this analysis?

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.
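
For example, the path setup from the previous section might then look something like the sketch below, which uses a made-up accession ID (GSE12345) that you would replace with the accession ID of the dataset you actually downloaded:

# A sketch with a hypothetical accession ID -- replace "GSE12345" with the
# accession ID of the refine.bio dataset you downloaded
data_dir <- file.path("data", "GSE12345")
data_file <- file.path(data_dir, "GSE12345.tsv")
metadata_file <- file.path(data_dir, "metadata_GSE12345.tsv")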


 

4 PCA Visualization - Microarray

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

Attach the packages we need for this analysis:

# Attach the library
library(ggplot2)

# We will need this so we can use the pipe: %>%
library(magrittr)

4.2 Import and set up data

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read in both TSV files and add them as data frames to your environment.

We stored our file paths as objects named metadata_file and data_file in this previous step.

# Read in metadata TSV file
metadata <- readr::read_tsv(metadata_file)
## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   refinebio_age = col_double(),
##   refinebio_cell_line = col_logical(),
##   refinebio_compound = col_logical(),
##   refinebio_disease = col_logical(),
##   refinebio_disease_stage = col_logical(),
##   refinebio_genetic_information = col_logical(),
##   refinebio_processed = col_logical(),
##   refinebio_race = col_logical(),
##   refinebio_source_archive_url = col_logical(),
##   refinebio_specimen_part = col_logical(),
##   refinebio_subject = col_logical(),
##   refinebio_time = col_logical(),
##   refinebio_treatment = col_logical(),
##   channel_count = col_double(),
##   `contact_zip/postal_code` = col_double(),
##   data_row_count = col_double(),
##   taxid_ch1 = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
# Read in data TSV file
df <- readr::read_tsv(data_file) %>%
  # Tuck away the gene ID column as row names, leaving only numeric values
  tibble::column_to_rownames("Gene")
## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   Gene = col_character()
## )
## ℹ Use `spec()` for the full column specifications.

Let’s ensure that the metadata and data are in the same sample order.

# Make sure the columns (samples) are in the same order as the metadata
df <- df %>%
  dplyr::select(metadata$geo_accession)

# Check if this is in the same order
all.equal(colnames(df), metadata$geo_accession)
## [1] TRUE
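
Before moving on, it can also be useful to get a quick sense of the annotation variables we will use for plotting. This optional peek (our addition, not part of the original analysis) tallies how many samples fall into each subgroup and histology combination:

# Optional: count samples per `subgroup` and `histology` combination
metadata %>%
  dplyr::count(subgroup, histology)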

Now we are going to use a combination of functions from base R and the ggplot2 package to perform Principal Component Analysis (PCA), a dimension reduction technique, on our medulloblastoma samples and visualize the results.

4.3 Perform Principal Component Analysis

In this code chunk, we are going to perform PCA on our data and then create a data frame combining the PCA scores with the variables from our metadata that we will use to annotate our plot later. We use the base R prcomp() function to perform PCA here. The prcomp() function calculates principal component scores for each row of a matrix, but our data is arranged with each sample in a column, so we need to transpose the data frame first. In most cases, we will want to use the scale = TRUE argument so that all of the expression measurements have the same variance; this prevents the PCA results from being dominated by a few highly variable genes.

# Perform Principal Component Analysis (PCA) using the `prcomp()` function
pca <- prcomp(
  t(df), # transpose our data frame to obtain PC scores for samples, not genes
  scale = TRUE # we want the data scaled to have unit variance for each gene
)

Let’s preview the PCA results. Each row will be a sample, as in the transposed data matrix we used as input, and each column is one of the new principal component (PC) values. We are using indexing to print out only the first 5 PC columns: [, 1:5].

# We can access the results from our `pca` object using `$x`
head(pca$x[, 1:5])
##                   PC1        PC2        PC3       PC4       PC5
## GSM917111 -18.1468442 -63.659225   1.136901 -5.242875  10.29114
## GSM917250 -51.2350460  45.762776  -2.183100  7.948304 -21.26658
## GSM917281 -42.4774592   2.216351  19.941918 -5.664602 -49.00085
## GSM917062  -7.6116070 -25.887243 -65.257099 -6.226487  32.73786
## GSM917288 -54.9540801  51.804918  42.332093 28.506307  54.17483
## GSM917230  -0.0325771 -11.070407  -6.555240 28.661922 -20.85879

In total, we have 285 principal components, because we provided data for 285 samples (we will always have as many PCs as the smaller dimension of the input data matrix).
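
If you would like to confirm this yourself, a quick sanity check (our addition; output not shown) is to look at the dimensions of the score matrix, which should have one row per sample and one column per principal component.

# Quick sanity check: one row per sample, one column per principal component
dim(pca$x)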

4.4 Explore Variance in PCA Results

Before visualizing and interpreting the results, it can be useful to understand the proportion of variance explained by each principal component. The principal components are automatically ordered by the variance they explain, meaning PC1 will always be the principal component that explains the most variance in your dataset. If PC1 explained 96% of the variance in your data and very clearly showed a difference between sample batches, you would be very concerned about your dataset! On the other hand, if a separation between batches were only apparent in a principal component that explained a low proportion of variance, while the first few PCs explained most of the variance and appeared to correspond to something like tissue type or treatment, you would be less concerned (Childhood Cancer Data Lab 2020).

The summary() function reports the proportion of variance explained by each principal component.

# Save the summary of the PCA results using the `summary()` function
pca_summary <- summary(pca)

The importance element of the summary object contains the proportion of variance explained by each principal component, along with other statistics. With pca_summary$importance, we can use indexing to look at only the first n PCs.

# Now access the importance information for the first 5 PCs
pca_summary$importance[, 1:5]
##                             PC1      PC2      PC3      PC4      PC5
## Standard deviation     45.50502 33.87890 30.50038 27.51175 23.99828
## Proportion of Variance  0.09560  0.05299  0.04295  0.03494  0.02659
## Cumulative Proportion   0.09560  0.14858  0.19153  0.22647  0.25306
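
If you prefer to see these proportions visually, a minimal scree-style plot (our addition, not part of the original notebook) can be made from pca_summary with ggplot2:

# Optional scree-style plot (our addition): proportion of variance explained
# by the first 10 principal components
variance_df <- data.frame(
  PC = factor(paste0("PC", 1:10), levels = paste0("PC", 1:10)),
  proportion = pca_summary$importance["Proportion of Variance", 1:10]
)

ggplot(variance_df, aes(x = PC, y = proportion)) +
  geom_col() + # draw one bar per principal component
  theme_classic()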

Now that we’ve seen the proportion of variance for the first set of PCs, let’s prepare and plot the PC scores for the first two principal components, the components that explain the largest proportion of the expression variance in our dataset. (Note, though, that in this case they explain less than 15% of the total variance!)

4.5 Prepare a final data frame with PCA results for plotting

In the next chunk, we are going to extract the first two principal components from our pca object to prepare a data frame for plotting.

# Make the first two PCs into a data frame for plotting with `ggplot2`
pca_df <- data.frame(pca$x[, 1:2]) %>%
  # Turn samples IDs stored as row names into a column
  tibble::rownames_to_column("refinebio_accession_code") %>%
  # Bring only the variables that we want from the metadata into this data frame
  # here we are going to join by `refinebio_accession_code` values
  dplyr::inner_join(
    dplyr::select(metadata, refinebio_accession_code, histology, subgroup),
    by = "refinebio_accession_code"
  )

4.6 Plot PCA Results

Now let’s plot the PC scores for the first two principal components.

Let’s also label the data points based on their molecular subgroup, since medulloblastoma has been found to comprise subgroups that each have molecularly distinct profiles (Northcott et al. 2012).

# Make a scatterplot using `ggplot2` functionality
pca_plot <- ggplot(
  pca_df,
  aes(
    x = PC1,
    y = PC2,
    color = subgroup # label points with different colors for each `subgroup`
  )
) +
  geom_point() + # Plot individual points to make a scatterplot
  theme_classic() # Format as a classic-looking plot with no gridlines

# Print out the plot here
pca_plot

It looks like the Group 4 and SHH samples each cluster together somewhat, but Group 3 seems less distinct, as some of its samples cluster with Group 4 as well. Most of the differences that we see between the groups are along the first axis of variation, PC1.
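
As a small optional tweak (our addition, not part of the original notebook), we could make the axes more informative by pulling the proportion of variance explained from pca_summary and adding it to the axis titles:

# Pull the percent variance explained for PC1 and PC2 from the PCA summary
pct_var <- round(100 * pca_summary$importance["Proportion of Variance", 1:2], 1)

# Add the percentages to the axis titles of the existing plot
pca_plot +
  labs(
    x = paste0("PC1 (", pct_var[1], "% of variance)"),
    y = paste0("PC2 (", pct_var[2], "% of variance)")
  )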

We can add another label to our plot to get more information about our dataset. Let’s also label the data points based on the histological subtype that each sample belongs to.

# Make a scatterplot with ggplot2
pca_plot <- ggplot(
  pca_df,
  aes(
    x = PC1,
    y = PC2,
    color = subgroup, # Draw points with different colors for each `subgroup`
    shape = histology # Use a different shape for each `histology` group
  )
) +
  geom_point() +
  theme_classic()

# Print out the plot here
pca_plot

Adding the histological subtype label made our plot more informative, but the diffuse spread of the Group 3 samples doesn’t appear to be explained by histology subtype. We could test out other variables as annotation labels to better understand the clustering behavior of each subgroup, or plot other PC values to see if they reveal additional structure in the data.
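
As one example of the latter, here is a minimal sketch (our addition, not part of the original notebook) that plots the third and fourth principal components with the same annotations, reusing the pattern from the chunks above:

# Build a plotting data frame for PC3 and PC4, joined to the same metadata
pca34_df <- data.frame(pca$x[, 3:4]) %>%
  # Turn sample IDs stored as row names into a column
  tibble::rownames_to_column("refinebio_accession_code") %>%
  # Join the annotation variables by `refinebio_accession_code`
  dplyr::inner_join(
    dplyr::select(metadata, refinebio_accession_code, histology, subgroup),
    by = "refinebio_accession_code"
  )

# Scatterplot of PC3 vs PC4 with the same color and shape annotations
ggplot(pca34_df, aes(x = PC3, y = PC4, color = subgroup, shape = histology)) +
  geom_point() +
  theme_classic()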

4.7 Save annotated PCA plot as a PNG

Now that we have an annotated PCA plot, let’s save it!

You can easily switch this to save a JPEG or TIFF instead by changing the file extension of the file name you supply to ggsave().

# Save plot using `ggsave()` function
ggsave(
  file.path(
    plots_dir,
    "GSE37382_pca_scatterplot.png" # Replace with a good file name for your plot
  ),
  plot = pca_plot # The plot object that we want saved to file
)
## Saving 7 x 5 in image
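
By default, ggsave() saves a 7 x 5 inch image (hence the message above). If you want different dimensions, you can pass the width and height arguments explicitly, for example (our addition, not part of the original notebook):

# Same plot saved with explicit dimensions instead of the 7 x 5 inch default
ggsave(
  file.path(plots_dir, "GSE37382_pca_scatterplot.png"),
  plot = pca_plot,
  width = 8, # width in inches
  height = 6 # height in inches
)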

6 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

# Print session info
sessioninfo::session_info()
## ─ Session info ─────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.5 (2021-03-31)
##  os       Ubuntu 20.04.3 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Etc/UTC                     
##  date     2022-03-01                  
## 
## ─ Packages ─────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  assertthat    0.2.1   2019-03-21 [1] RSPM (R 4.0.3)
##  backports     1.2.1   2020-12-09 [1] RSPM (R 4.0.3)
##  bslib         0.2.5   2021-05-12 [1] RSPM (R 4.0.4)
##  cli           2.5.0   2021-04-26 [1] RSPM (R 4.0.4)
##  colorspace    2.0-1   2021-05-04 [1] RSPM (R 4.0.4)
##  crayon        1.4.1   2021-02-08 [1] RSPM (R 4.0.3)
##  DBI           1.1.1   2021-01-15 [1] RSPM (R 4.0.3)
##  digest        0.6.27  2020-10-24 [1] RSPM (R 4.0.3)
##  dplyr         1.0.6   2021-05-05 [1] RSPM (R 4.0.4)
##  ellipsis      0.3.2   2021-04-29 [1] RSPM (R 4.0.4)
##  evaluate      0.14    2019-05-28 [1] RSPM (R 4.0.3)
##  fansi         0.4.2   2021-01-15 [1] RSPM (R 4.0.3)
##  farver        2.1.0   2021-02-28 [1] RSPM (R 4.0.3)
##  generics      0.1.0   2020-10-31 [1] RSPM (R 4.0.3)
##  getopt        1.20.3  2019-03-22 [1] RSPM (R 4.0.0)
##  ggplot2     * 3.3.3   2020-12-30 [1] RSPM (R 4.0.3)
##  glue          1.4.2   2020-08-27 [1] RSPM (R 4.0.3)
##  gtable        0.3.0   2019-03-25 [1] RSPM (R 4.0.3)
##  highr         0.9     2021-04-16 [1] RSPM (R 4.0.4)
##  hms           1.0.0   2021-01-13 [1] RSPM (R 4.0.3)
##  htmltools     0.5.1.1 2021-01-22 [1] RSPM (R 4.0.3)
##  jquerylib     0.1.4   2021-04-26 [1] RSPM (R 4.0.4)
##  jsonlite      1.7.2   2020-12-09 [1] RSPM (R 4.0.3)
##  knitr         1.33    2021-04-24 [1] RSPM (R 4.0.4)
##  labeling      0.4.2   2020-10-20 [1] RSPM (R 4.0.3)
##  lifecycle     1.0.0   2021-02-15 [1] RSPM (R 4.0.3)
##  magrittr    * 2.0.1   2020-11-17 [1] RSPM (R 4.0.3)
##  munsell       0.5.0   2018-06-12 [1] RSPM (R 4.0.3)
##  optparse    * 1.6.6   2020-04-16 [1] RSPM (R 4.0.0)
##  pillar        1.6.1   2021-05-16 [1] RSPM (R 4.0.4)
##  pkgconfig     2.0.3   2019-09-22 [1] RSPM (R 4.0.3)
##  ps            1.6.0   2021-02-28 [1] RSPM (R 4.0.3)
##  purrr         0.3.4   2020-04-17 [1] RSPM (R 4.0.3)
##  R.cache       0.15.0  2021-04-30 [1] RSPM (R 4.0.4)
##  R.methodsS3   1.8.1   2020-08-26 [1] RSPM (R 4.0.3)
##  R.oo          1.24.0  2020-08-26 [1] RSPM (R 4.0.3)
##  R.utils       2.10.1  2020-08-26 [1] RSPM (R 4.0.3)
##  R6            2.5.0   2020-10-28 [1] RSPM (R 4.0.3)
##  readr         1.4.0   2020-10-05 [1] RSPM (R 4.0.4)
##  rematch2      2.1.2   2020-05-01 [1] RSPM (R 4.0.3)
##  rlang         0.4.11  2021-04-30 [1] RSPM (R 4.0.4)
##  rmarkdown     2.8     2021-05-07 [1] RSPM (R 4.0.4)
##  rstudioapi    0.13    2020-11-12 [1] RSPM (R 4.0.3)
##  sass          0.4.0   2021-05-12 [1] RSPM (R 4.0.4)
##  scales        1.1.1   2020-05-11 [1] RSPM (R 4.0.3)
##  sessioninfo   1.1.1   2018-11-05 [1] RSPM (R 4.0.3)
##  stringi       1.6.1   2021-05-10 [1] RSPM (R 4.0.4)
##  stringr       1.4.0   2019-02-10 [1] RSPM (R 4.0.3)
##  styler        1.4.1   2021-03-30 [1] RSPM (R 4.0.4)
##  tibble        3.1.2   2021-05-16 [1] RSPM (R 4.0.4)
##  tidyselect    1.1.1   2021-04-30 [1] RSPM (R 4.0.4)
##  utf8          1.2.1   2021-03-12 [1] RSPM (R 4.0.3)
##  vctrs         0.3.8   2021-04-29 [1] RSPM (R 4.0.4)
##  withr         2.4.2   2021-04-18 [1] RSPM (R 4.0.4)
##  xfun          0.23    2021-05-15 [1] RSPM (R 4.0.4)
##  yaml          2.2.1   2020-02-01 [1] RSPM (R 4.0.3)
## 
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library

References

Brems M., 2017 A one-stop shop for principal component analysis. https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c
Childhood Cancer Data Lab, 2020 OpenPBTA: Cluster validation. https://github.com/AlexsLemonade/training-modules/blob/3dbc6f3f53c680ec6aa2f513851c1cd4635cc31c/machine-learning/02-openpbta_consensus_clustering.Rmd#L310
Nguyen L. H., and S. Holmes, 2019 Ten quick tips for effective dimensionality reduction. PLOS Computational Biology 15. https://doi.org/10.1371/journal.pcbi.1006907
Northcott P., D. Shih, J. Peacock, L. Garzia, S. Morrissy, et al., 2012 Subgroup specific structural variation across 1,000 medulloblastoma genomes. Nature 488. https://doi.org/10.1038/nature11327
Powell V., and L. Lehe, Principal component analysis explained visually. https://setosa.io/ev/principal-component-analysis/
Prabhakaran S., 2016 The complete ggplot2 tutorial. http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html