1 Purpose of this analysis

This notebook takes data and metadata from refine.bio and identifies differentially expressed genes with more than 2 groups.

⬇️ Jump to the analysis code ⬇️

2 How to run this example

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

2.1 Obtain the .Rmd file

To run this example yourself, download the .Rmd for this analysis by clicking this link.

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
  dir.create("data")
}

# Define the file path to the plots directory
plots_dir <- "plots"

# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
  dir.create(plots_dir)
}

# Define the file path to the results directory
results_dir <- "results"

# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
  dir.create(results_dir)
}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

2.3 Obtain the dataset from refine.bio

For general information about downloading data for these examples, see our ‘Getting Started’ section.

Go to this dataset’s page on refine.bio.

Click the “Download Now” button on the right side of this screen.

Fill out the pop up window with your email and our Terms and Conditions:

It may take a few minutes for the dataset to process. You will get an email when it is ready.

2.4 About the dataset we are using for this example

For this example analysis, we will use this medulloblastoma samples. Robinson et al. (2012) measured microarray gene expression of 71 medulloblastoma tumor samples. In this analysis, we will test differential expression across the medulloblastoma subtypes.

2.5 Place the dataset in your new data/ folder

refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.

For more details on the contents of this folder see these docs on refine.bio.

The <experiment_accession_id> folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235 or SRP12345.

Copy and paste the GSE37418 folder into your newly created data/ folder.

2.6 Check out our file structure!

Your new analysis folder should contain:

  • The example analysis .Rmd you downloaded
  • A folder called “data” which contains:
    • The GSE37418 folder which contains:
      • The gene expression
      • The metadata TSV
  • A folder for plots (currently empty)
  • A folder for results (currently empty)

Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.

# Define the file path to the data directory
# Replace with the path of the folder the files will be in
data_dir <- file.path("data", "GSE37418")

# Declare the file path to the gene expression matrix file
# inside directory saved as `data_dir`
# Replace with the path to your dataset file
data_file <- file.path(data_dir, "GSE37418.tsv")

# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
# Replace with the path to your metadata file
metadata_file <- file.path(data_dir, "metadata_GSE37418.tsv")

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
## [1] TRUE
# Check if the metadata file is at the file path stored in `metadata_file`
file.exists(metadata_file)
## [1] TRUE

If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.

3 Using a different refine.bio dataset with this analysis?

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.


 

4 Differential Expression - Microarray

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

In this analysis, we will be using limma for differential expression (Ritchie et al. 2015).

if (!("limma" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("limma", update = FALSE)
}

Attach the packages we need for this analysis.

# Attach the library
library(limma)

# We will need this so we can use the pipe: %>%
library(magrittr)

# We'll use this for plotting
library(ggplot2)

The jitter plot we make later on with geom_jitter() involves some randomness. As is good practice when our analysis involves randomness, we will set the seed.

set.seed(12345)

4.2 Import and set up data

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the both TSV files and add them as data frames to your environment.

We stored our file paths as objects named metadata_file and data_file in this previous step.

# Read in metadata TSV file
metadata <- readr::read_tsv(metadata_file)
## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   refinebio_age = col_logical(),
##   refinebio_cell_line = col_logical(),
##   refinebio_compound = col_logical(),
##   refinebio_disease = col_logical(),
##   refinebio_disease_stage = col_logical(),
##   refinebio_genetic_information = col_logical(),
##   refinebio_processed = col_logical(),
##   refinebio_race = col_logical(),
##   refinebio_source_archive_url = col_logical(),
##   refinebio_specimen_part = col_logical(),
##   refinebio_subject = col_logical(),
##   refinebio_time = col_logical(),
##   refinebio_treatment = col_logical(),
##   channel_count = col_double(),
##   contact_phone = col_double(),
##   `contact_zip/postal_code` = col_double(),
##   data_row_count = col_double(),
##   taxid_ch1 = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
# Read in data TSV file
expression_df <- readr::read_tsv(data_file) %>%
  # Tuck away the gene ID column as row names
  tibble::column_to_rownames("Gene")
## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   Gene = col_character()
## )
## ℹ Use `spec()` for the full column specifications.

4.3 Removing groups that are too small

We will be using the subgroup variable labels in our metadata to test differentially expression across. Let’s take a look at how many samples of each subgroup we have.

metadata %>% dplyr::count(subgroup)

Looks like there is one sample that has been labeled by the authors as an outlier (SHH OUTLIIER), as well as one group, U, that only has two samples. We will probably want to remove the U samples and this outlier since their inclusion might throw off our differential expression analysis results.

Let’s start out by removing the outlier and the U group, we can do this all at once by removing groups smaller than 3.

filtered_metadata <- metadata %>%
  dplyr::group_by(subgroup) %>%
  dplyr::filter(dplyr::n() > 3) %>%
  dplyr::ungroup()

Let’s take a look at the subgroup summary again.

metadata %>% dplyr::count(subgroup)

Note that the U and the SHH OUTLIER subgroups are gone and only the four groups we are interested in are left.

But we still need to filter these samples out from the expression data that’s stored in expression_df.

# Make the data in the order of the metadata
expression_df <- expression_df %>%
  dplyr::select(filtered_metadata$geo_accession)

# Check if this is in the same order
all.equal(colnames(expression_df), filtered_metadata$geo_accession)
## [1] TRUE

4.4 Create the design matrix

limma needs a numeric design matrix to signify which samples are of which subtype of medulloblastoma. Now we will create a model matrix based on our subgroup variable. We are using a + 0 in the model which sets the intercept to 0 so the subgroup effects capture expression for that group, rather than difference from the first group. If you have a control group, you might want to leave off the + 0 so the model includes an intercept representing the control group expression level, with the remaining coefficients the changes relative to that expression level.

# Create the design matrix
des_mat <- model.matrix(~ subgroup + 0, data = filtered_metadata)

Let’s take a look at the design matrix we created.

# Print out part of the design matrix
head(des_mat)
##   subgroupG3 subgroupG4 subgroupSHH subgroupWNT
## 1          0          1           0           0
## 2          0          1           0           0
## 3          0          0           1           0
## 4          1          0           0           0
## 5          0          1           0           0
## 6          0          0           1           0

The design matrix column names are a bit messy, so we will neaten them up by dropping the subgroup designation they all have.

# Make the column names less messy
colnames(des_mat) <- stringr::str_remove(colnames(des_mat), "subgroup")

4.5 Perform differential expression

Now we are ready to actually start fitting our differential expression model to the data. To accommodate our design that has more than 2 groups this time, we will need to do this in a couple steps.

We will use the lmFit() function from the limma package to test each gene for differential expression between the two groups using a linear model. After fitting our data to the linear model, in this example we apply empirical Bayes smoothing using the eBayes() function.

Here’s a nifty article and example about what the empirical Bayes smoothing is for (Robinson).

# Apply linear model to data
fit <- lmFit(expression_df, design = des_mat)

# Apply empirical Bayes to smooth standard errors
fit <- eBayes(fit)

Now that we have our basic model fitting, we will want to investigate the contrasts among all our groups. Depending on your scientific questions, you will need to customize the next steps. Consulting the limma users guide for how to set up your model based on your hypothesis is a good idea.

In this contrasts matrix, we are comparing each subtype to all the other subtypes.
We’re dividing by three in this expression so that each group is compared to the average of the other three groups (makeContrasts() doesn’t allow you to use functions like mean(); it wants a formula).

contrast_matrix <- makeContrasts(
  "G3vsOther" = G3 - (G4 + SHH + WNT) / 3,
  "G4vsOther" = G4 - (G3 + SHH + WNT) / 3,
  "SHHvsOther" = SHH - (G3 + G4 + WNT) / 3,
  "WNTvsOther" = WNT - (G3 + G4 + SHH) / 3,
  levels = des_mat
)

Side note: If you did have a control group you wanted to compare each group to, you could make each contrast to that control group, so the formulae would look like G3 = G3 - Control for each one. We highly recommend consulting the limma users guide for figuring out what your makeContrasts() and model.matrix() setups should look like (Ritchie et al. 2015).

Now that we have the contrasts matrix set up, we can use it to re-fit the model with contrasts.fit() and re-smooth it with eBayes().

# Fit the model according to the contrasts matrix
contrasts_fit <- contrasts.fit(fit, contrast_matrix)

# Re-smooth the Bayes
contrasts_fit <- eBayes(contrasts_fit)

Now let’s create the results table based on the contrasts fitted model.

This step will also apply the Benjamini-Hochberg multiple testing correction. The topTable() function default is to use Benjamini-Hochberg but this can be changed to a different method using the adjust.method argument (see the ?topTable help page for more about the options).

# Apply multiple testing correction and obtain stats
stats_df <- topTable(contrasts_fit, number = nrow(expression_df)) %>%
  tibble::rownames_to_column("Gene")

Let’s take a peek at our results table.

head(stats_df)

For each gene, each group’s fold change in expression, compared to the average of the other groups is reported.

By default, results are ordered from largest F value to the smallest, which means your most differentially expressed genes across all groups should be toward the top.

See the help page by using ?topTable for more information and options for this table.

4.6 Check results by plotting one gene

To test if these results make sense, we can make a plot of one of top genes. Let’s try extracting the data for ENSG00000128683 and set up its own data frame for plotting purposes. Based on the results in stats_df, we should expect this gene to be much higher in the WNT samples.

First we will need to set up the data for this gene and the subgroup labels into a data frame for plotting.

top_gene_df <- expression_df %>%
  # Extract this gene from `expression_df`
  dplyr::filter(rownames(.) == "ENSG00000128683") %>%
  # Transpose so the gene is a column
  t() %>%
  # Transpose made this a matrix, let's make it back into a data.frame like before
  data.frame() %>%
  # Store the sample ids as their own column instead of being row names
  tibble::rownames_to_column("refinebio_accession_code") %>%
  # Join on the selected columns from metadata
  dplyr::inner_join(dplyr::select(
    metadata,
    refinebio_accession_code,
    subgroup
  ))
## Joining, by = "refinebio_accession_code"

Let’s take a sneak peek at our top_gene_df.

head(top_gene_df)

Now let’s plot the data for ENSG00000128683 using our top_gene_df. We should expect this gene to be expressed at much higher levels in the WNT group samples.

ggplot(top_gene_df, aes(x = subgroup, y = ENSG00000128683, color = subgroup)) +
  geom_jitter(width = 0.2, height = 0) + # We'll make this a jitter plot
  theme_classic() # This makes some aesthetic changes

Yes! These results make sense. The WNT samples have much higher expression of ENSG00000128683 than the other samples.

4.7 Write results to file

The results in stats_df will be saved to our results/ directory.

readr::write_tsv(stats_df, file.path(
  results_dir,
  "GSE37418_limma_results.tsv" # Replace with a relevant output name
))

4.8 Make volcano plots

We’ll use ggplot2 to make a set of volcano plots. But first, we need to set up our data for plotting. We will need the p values from the individual contrasts as well as the log fold changes.

We can obtain the contrast p values from the contrasts_fit object and make it a longer format that the ggplot() function will want for plotting.

# Let's extract the contrast p values for each and transform them with -log10()
contrast_p_vals_df <- -log10(contrasts_fit$p.value) %>%
  # Make this into a data frame
  as.data.frame() %>%
  # Store genes as their own column
  tibble::rownames_to_column("Gene") %>%
  # Make this into long format
  tidyr::pivot_longer(dplyr::contains("vsOther"),
    names_to = "contrast",
    values_to = "neg_log10_p_val"
  )

Now let’s extract the log fold changes from stats_df.

# Let's extract the fold changes from `stats_df`
log_fc_df <- stats_df %>%
  # We only want to keep the `Gene` column as well
  dplyr::select("Gene", dplyr::contains("vsOther")) %>%
  # Make this a longer format
  tidyr::pivot_longer(dplyr::contains("vsOther"),
    names_to = "contrast",
    values_to = "logFoldChange"
  )

We can perform an inner_join() of both these datasets using both their Gene and contrast columns.

plot_df <- log_fc_df %>%
  dplyr::inner_join(contrast_p_vals_df,
    by = c("Gene", "contrast"),
    # This argument will add the given suffixes to the column names
    # from the respective data frames, helping us keep track of which columns
    # hold which types of values
    suffix = c("_log_fc", "_p_val")
  )

Let’s print out a preview of plot_df.

# Print out what this looks like
head(plot_df)

Let’s declare what we consider to be significant levels for fold change and for -log10 p-values. By saving this as its own variable, we only need to change these cutoffs in one place if we want to adjust later.

# Convert p value cutoff to negative log 10 scale
p_val_cutoff <- -log10(0.05)

# Absolute value cutoff for fold changes
abs_fc_cutoff <- 5

Now we can use these cutoffs to make a new variable that declares which genes we consider significant. We will use some logic with dplyr::case_when() to do this.

plot_df <- plot_df %>%
  dplyr::mutate(
    signif_label = dplyr::case_when(
      abs(logFoldChange) > abs_fc_cutoff & neg_log10_p_val > p_val_cutoff
      ~ "p-val and FC",
      abs(logFoldChange) > abs_fc_cutoff ~ "FC",
      neg_log10_p_val > p_val_cutoff ~ "p-val",
      TRUE ~ "NS"
    )
  )

Now we’re ready to plot the volcanoes!

volcanoes_plot <- ggplot(
  plot_df,
  aes(
    x = logFoldChange, # Fold change as x value
    y = neg_log10_p_val, # -log10(p value) for the contrasts
    color = signif_label # Color code by significance cutoffs variable we made
  )
) +
  # Make a scatter plot with points that are 30% opaque using `alpha`
  geom_point(alpha = 0.3) +
  # Draw our `p_val_cutoff` for line here
  geom_hline(yintercept = p_val_cutoff, linetype = "dashed") +
  # Using our `abs_fc_cutoff` for our lines here
  geom_vline(xintercept = c(-abs_fc_cutoff, abs_fc_cutoff), linetype = "dashed") +
  # The default colors aren't great, we'll specify our own here
  scale_colour_manual(values = c("#67a9cf", "darkgray", "gray", "#a1d76a")) +
  # Let's be more specific about what this p value is in our y axis label
  ylab("Contrast -log10(p value)") +
  # This makes separate plots for each contrast!
  facet_wrap(~contrast) +
  # Just for making it prettier!
  theme_classic()

# Print out the plot!
volcanoes_plot

Here the green points might be of interest. We recommend ColorBrewer for finding different color sets if you don’t like the ones we used.

Let’s save these volcanoes to a PNG file.

ggsave(
  plot = volcanoes_plot,
  file.path(plots_dir, "GSE37418_results_volcano_plots.png")
)
## Saving 7 x 5 in image

6 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

# Print session info
sessioninfo::session_info()
## ─ Session info ─────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.5 (2021-03-31)
##  os       Ubuntu 20.04.3 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Etc/UTC                     
##  date     2022-03-01                  
## 
## ─ Packages ─────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  assertthat    0.2.1   2019-03-21 [1] RSPM (R 4.0.3)
##  backports     1.2.1   2020-12-09 [1] RSPM (R 4.0.3)
##  bslib         0.2.5   2021-05-12 [1] RSPM (R 4.0.4)
##  cli           2.5.0   2021-04-26 [1] RSPM (R 4.0.4)
##  colorspace    2.0-1   2021-05-04 [1] RSPM (R 4.0.4)
##  crayon        1.4.1   2021-02-08 [1] RSPM (R 4.0.3)
##  DBI           1.1.1   2021-01-15 [1] RSPM (R 4.0.3)
##  digest        0.6.27  2020-10-24 [1] RSPM (R 4.0.3)
##  dplyr         1.0.6   2021-05-05 [1] RSPM (R 4.0.4)
##  ellipsis      0.3.2   2021-04-29 [1] RSPM (R 4.0.4)
##  evaluate      0.14    2019-05-28 [1] RSPM (R 4.0.3)
##  fansi         0.4.2   2021-01-15 [1] RSPM (R 4.0.3)
##  farver        2.1.0   2021-02-28 [1] RSPM (R 4.0.3)
##  generics      0.1.0   2020-10-31 [1] RSPM (R 4.0.3)
##  getopt        1.20.3  2019-03-22 [1] RSPM (R 4.0.0)
##  ggplot2     * 3.3.3   2020-12-30 [1] RSPM (R 4.0.3)
##  glue          1.4.2   2020-08-27 [1] RSPM (R 4.0.3)
##  gtable        0.3.0   2019-03-25 [1] RSPM (R 4.0.3)
##  highr         0.9     2021-04-16 [1] RSPM (R 4.0.4)
##  hms           1.0.0   2021-01-13 [1] RSPM (R 4.0.3)
##  htmltools     0.5.1.1 2021-01-22 [1] RSPM (R 4.0.3)
##  jquerylib     0.1.4   2021-04-26 [1] RSPM (R 4.0.4)
##  jsonlite      1.7.2   2020-12-09 [1] RSPM (R 4.0.3)
##  knitr         1.33    2021-04-24 [1] RSPM (R 4.0.4)
##  labeling      0.4.2   2020-10-20 [1] RSPM (R 4.0.3)
##  lifecycle     1.0.0   2021-02-15 [1] RSPM (R 4.0.3)
##  limma       * 3.46.0  2020-10-27 [1] Bioconductor  
##  magrittr    * 2.0.1   2020-11-17 [1] RSPM (R 4.0.3)
##  munsell       0.5.0   2018-06-12 [1] RSPM (R 4.0.3)
##  optparse    * 1.6.6   2020-04-16 [1] RSPM (R 4.0.0)
##  pillar        1.6.1   2021-05-16 [1] RSPM (R 4.0.4)
##  pkgconfig     2.0.3   2019-09-22 [1] RSPM (R 4.0.3)
##  ps            1.6.0   2021-02-28 [1] RSPM (R 4.0.3)
##  purrr         0.3.4   2020-04-17 [1] RSPM (R 4.0.3)
##  R.cache       0.15.0  2021-04-30 [1] RSPM (R 4.0.4)
##  R.methodsS3   1.8.1   2020-08-26 [1] RSPM (R 4.0.3)
##  R.oo          1.24.0  2020-08-26 [1] RSPM (R 4.0.3)
##  R.utils       2.10.1  2020-08-26 [1] RSPM (R 4.0.3)
##  R6            2.5.0   2020-10-28 [1] RSPM (R 4.0.3)
##  readr         1.4.0   2020-10-05 [1] RSPM (R 4.0.4)
##  rematch2      2.1.2   2020-05-01 [1] RSPM (R 4.0.3)
##  rlang         0.4.11  2021-04-30 [1] RSPM (R 4.0.4)
##  rmarkdown     2.8     2021-05-07 [1] RSPM (R 4.0.4)
##  rstudioapi    0.13    2020-11-12 [1] RSPM (R 4.0.3)
##  sass          0.4.0   2021-05-12 [1] RSPM (R 4.0.4)
##  scales        1.1.1   2020-05-11 [1] RSPM (R 4.0.3)
##  sessioninfo   1.1.1   2018-11-05 [1] RSPM (R 4.0.3)
##  stringi       1.6.1   2021-05-10 [1] RSPM (R 4.0.4)
##  stringr       1.4.0   2019-02-10 [1] RSPM (R 4.0.3)
##  styler        1.4.1   2021-03-30 [1] RSPM (R 4.0.4)
##  tibble        3.1.2   2021-05-16 [1] RSPM (R 4.0.4)
##  tidyr         1.1.3   2021-03-03 [1] RSPM (R 4.0.4)
##  tidyselect    1.1.1   2021-04-30 [1] RSPM (R 4.0.4)
##  utf8          1.2.1   2021-03-12 [1] RSPM (R 4.0.3)
##  vctrs         0.3.8   2021-04-29 [1] RSPM (R 4.0.4)
##  withr         2.4.2   2021-04-18 [1] RSPM (R 4.0.4)
##  xfun          0.23    2021-05-15 [1] RSPM (R 4.0.4)
##  yaml          2.2.1   2020-02-01 [1] RSPM (R 4.0.3)
## 
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library
Gonzalez I., 2014 Statistical analysis of RNA-Seq data. http://www.nathalievialaneix.eu/doc/pdf/tutorial-rnaseq.pdf
Klaus B., and S. Reisenauer, 2018 An end to end workflow for differential gene expression using Affymetrix microarrays. https://www.bioconductor.org/packages/devel/workflows/vignettes/maEndToEnd/inst/doc/MA-Workflow.html
Ritchie M. E., B. Phipson, D. Wu, Y. Hu, C. W. Law, et al., 2015 limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43: e47. https://doi.org/10.1093/nar/gkv007
Robinson G., M. Parker, T. A. Kranenburg, C. Lu, X. Chen, et al., 2012 Novel mutations target distinct subgroups of medulloblastoma. Nature 488: 43–48. https://doi.org/10.1038/nature11213
Robinson D., Understanding empirical Bayes estimation (using baseball statistics). http://varianceexplained.org/r/empirical_bayes_baseball/