This notebook takes data and metadata from refine.bio and identifies differentially expressed genes with more than 2 groups.
For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.
.Rmd
fileTo run this example yourself, download the .Rmd
for this analysis by clicking this link.
Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd
file to where you would like this example and its files to be stored.
You can open this .Rmd
file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd
files.)
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
dir.create("data")
}
# Define the file path to the plots directory
<- "plots"
plots_dir
# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
dir.create(plots_dir)
}
# Define the file path to the results directory
<- "results"
results_dir
# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
dir.create(results_dir)
}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
For general information about downloading data for these examples, see our ‘Getting Started’ section.
Go to this dataset’s page on refine.bio.
Click the “Download Now” button on the right side of this screen.
Fill out the pop up window with your email and our Terms and Conditions:
It may take a few minutes for the dataset to process. You will get an email when it is ready.
For this example analysis, we will use this medulloblastoma samples. Robinson et al. (2012) measured microarray gene expression of 71 medulloblastoma tumor samples. In this analysis, we will test differential expression across the medulloblastoma subtypes.
data/
folderrefine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip
. Double clicking should unzip this for you and create a folder of the same name.
For more details on the contents of this folder see these docs on refine.bio.
The <experiment_accession_id>
folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235
or SRP12345
.
Copy and paste the GSE37418
folder into your newly created data/
folder.
Your new analysis folder should contain:
.Rmd
you downloadedGSE37418
folder which contains:
plots
(currently empty)results
(currently empty)Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):
In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.
First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.
# Define the file path to the data directory
# Replace with the path of the folder the files will be in
<- file.path("data", "GSE37418")
data_dir
# Declare the file path to the gene expression matrix file
# inside directory saved as `data_dir`
# Replace with the path to your dataset file
<- file.path(data_dir, "GSE37418.tsv")
data_file
# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
# Replace with the path to your metadata file
<- file.path(data_dir, "metadata_GSE37418.tsv") metadata_file
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
## [1] TRUE
# Check if the metadata file is at the file path stored in `metadata_file`
file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE
to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.
If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/
directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/
and results/
directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.
See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
In this analysis, we will be using limma
for differential expression (Ritchie et al. 2015).
if (!("limma" %in% installed.packages())) {
# Install this package if it isn't installed yet
::install("limma", update = FALSE)
BiocManager }
Attach the packages we need for this analysis.
# Attach the library
library(limma)
# We will need this so we can use the pipe: %>%
library(magrittr)
# We'll use this for plotting
library(ggplot2)
The jitter plot we make later on with geom_jitter()
involves some randomness. As is good practice when our analysis involves randomness, we will set the seed.
set.seed(12345)
Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the both TSV files and add them as data frames to your environment.
We stored our file paths as objects named metadata_file
and data_file
in this previous step.
# Read in metadata TSV file
<- readr::read_tsv(metadata_file) metadata
##
## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_character(),
## refinebio_age = col_logical(),
## refinebio_cell_line = col_logical(),
## refinebio_compound = col_logical(),
## refinebio_disease = col_logical(),
## refinebio_disease_stage = col_logical(),
## refinebio_genetic_information = col_logical(),
## refinebio_processed = col_logical(),
## refinebio_race = col_logical(),
## refinebio_source_archive_url = col_logical(),
## refinebio_specimen_part = col_logical(),
## refinebio_subject = col_logical(),
## refinebio_time = col_logical(),
## refinebio_treatment = col_logical(),
## channel_count = col_double(),
## contact_phone = col_double(),
## `contact_zip/postal_code` = col_double(),
## data_row_count = col_double(),
## taxid_ch1 = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
# Read in data TSV file
<- readr::read_tsv(data_file) %>%
expression_df # Tuck away the gene ID column as row names
::column_to_rownames("Gene") tibble
##
## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_double(),
## Gene = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
We will be using the subgroup
variable labels in our metadata to test differentially expression across. Let’s take a look at how many samples of each subgroup we have.
%>% dplyr::count(subgroup) metadata
Looks like there is one sample that has been labeled by the authors as an outlier (SHH OUTLIIER
), as well as one group, U
, that only has two samples. We will probably want to remove the U
samples and this outlier since their inclusion might throw off our differential expression analysis results.
Let’s start out by removing the outlier and the U
group, we can do this all at once by removing groups smaller than 3.
<- metadata %>%
filtered_metadata ::group_by(subgroup) %>%
dplyr::filter(dplyr::n() > 3) %>%
dplyr::ungroup() dplyr
Let’s take a look at the subgroup summary again.
%>% dplyr::count(subgroup) metadata
Note that the U
and the SHH OUTLIER
subgroups are gone and only the four groups we are interested in are left.
But we still need to filter these samples out from the expression data that’s stored in expression_df
.
# Make the data in the order of the metadata
<- expression_df %>%
expression_df ::select(filtered_metadata$geo_accession)
dplyr
# Check if this is in the same order
all.equal(colnames(expression_df), filtered_metadata$geo_accession)
## [1] TRUE
limma
needs a numeric design matrix to signify which samples are of which subtype of medulloblastoma. Now we will create a model matrix based on our subgroup
variable. We are using a + 0
in the model which sets the intercept to 0 so the subgroup effects capture expression for that group, rather than difference from the first group. If you have a control group, you might want to leave off the + 0
so the model includes an intercept representing the control group expression level, with the remaining coefficients the changes relative to that expression level.
# Create the design matrix
<- model.matrix(~ subgroup + 0, data = filtered_metadata) des_mat
Let’s take a look at the design matrix we created.
# Print out part of the design matrix
head(des_mat)
## subgroupG3 subgroupG4 subgroupSHH subgroupWNT
## 1 0 1 0 0
## 2 0 1 0 0
## 3 0 0 1 0
## 4 1 0 0 0
## 5 0 1 0 0
## 6 0 0 1 0
The design matrix column names are a bit messy, so we will neaten them up by dropping the subgroup
designation they all have.
# Make the column names less messy
colnames(des_mat) <- stringr::str_remove(colnames(des_mat), "subgroup")
Now we are ready to actually start fitting our differential expression model to the data. To accommodate our design that has more than 2 groups this time, we will need to do this in a couple steps.
We will use the lmFit()
function from the limma
package to test each gene for differential expression between the two groups using a linear model. After fitting our data to the linear model, in this example we apply empirical Bayes smoothing using the eBayes()
function.
Here’s a nifty article and example about what the empirical Bayes smoothing is for (Robinson).
# Apply linear model to data
<- lmFit(expression_df, design = des_mat)
fit
# Apply empirical Bayes to smooth standard errors
<- eBayes(fit) fit
Now that we have our basic model fitting, we will want to investigate the contrasts among all our groups. Depending on your scientific questions, you will need to customize the next steps. Consulting the limma users guide for how to set up your model based on your hypothesis is a good idea.
In this contrasts matrix, we are comparing each subtype to all the other subtypes.
We’re dividing by three in this expression so that each group is compared to the average of the other three groups (makeContrasts()
doesn’t allow you to use functions like mean()
; it wants a formula).
<- makeContrasts(
contrast_matrix "G3vsOther" = G3 - (G4 + SHH + WNT) / 3,
"G4vsOther" = G4 - (G3 + SHH + WNT) / 3,
"SHHvsOther" = SHH - (G3 + G4 + WNT) / 3,
"WNTvsOther" = WNT - (G3 + G4 + SHH) / 3,
levels = des_mat
)
Side note: If you did have a control group you wanted to compare each group to, you could make each contrast to that control group, so the formulae would look like G3 = G3 - Control
for each one. We highly recommend consulting the limma users guide for figuring out what your makeContrasts()
and model.matrix()
setups should look like (Ritchie et al. 2015).
Now that we have the contrasts matrix set up, we can use it to re-fit the model with contrasts.fit()
and re-smooth it with eBayes()
.
# Fit the model according to the contrasts matrix
<- contrasts.fit(fit, contrast_matrix)
contrasts_fit
# Re-smooth the Bayes
<- eBayes(contrasts_fit) contrasts_fit
Now let’s create the results table based on the contrasts fitted model.
This step will also apply the Benjamini-Hochberg multiple testing correction. The topTable()
function default is to use Benjamini-Hochberg but this can be changed to a different method using the adjust.method
argument (see the ?topTable
help page for more about the options).
# Apply multiple testing correction and obtain stats
<- topTable(contrasts_fit, number = nrow(expression_df)) %>%
stats_df ::rownames_to_column("Gene") tibble
Let’s take a peek at our results table.
head(stats_df)
For each gene, each group’s fold change in expression, compared to the average of the other groups is reported.
By default, results are ordered from largest F
value to the smallest, which means your most differentially expressed genes across all groups should be toward the top.
See the help page by using ?topTable
for more information and options for this table.
To test if these results make sense, we can make a plot of one of top genes. Let’s try extracting the data for ENSG00000128683
and set up its own data frame for plotting purposes. Based on the results in stats_df
, we should expect this gene to be much higher in the WNT
samples.
First we will need to set up the data for this gene and the subgroup labels into a data frame for plotting.
<- expression_df %>%
top_gene_df # Extract this gene from `expression_df`
::filter(rownames(.) == "ENSG00000128683") %>%
dplyr# Transpose so the gene is a column
t() %>%
# Transpose made this a matrix, let's make it back into a data.frame like before
data.frame() %>%
# Store the sample ids as their own column instead of being row names
::rownames_to_column("refinebio_accession_code") %>%
tibble# Join on the selected columns from metadata
::inner_join(dplyr::select(
dplyr
metadata,
refinebio_accession_code,
subgroup ))
## Joining, by = "refinebio_accession_code"
Let’s take a sneak peek at our top_gene_df
.
head(top_gene_df)
Now let’s plot the data for ENSG00000128683
using our top_gene_df
. We should expect this gene to be expressed at much higher levels in the WNT
group samples.
ggplot(top_gene_df, aes(x = subgroup, y = ENSG00000128683, color = subgroup)) +
geom_jitter(width = 0.2, height = 0) + # We'll make this a jitter plot
theme_classic() # This makes some aesthetic changes
Yes! These results make sense. The WNT samples have much higher expression of ENSG00000128683 than the other samples.
The results in stats_df
will be saved to our results/
directory.
::write_tsv(stats_df, file.path(
readr
results_dir,"GSE37418_limma_results.tsv" # Replace with a relevant output name
))
We’ll use ggplot2
to make a set of volcano plots. But first, we need to set up our data for plotting. We will need the p values from the individual contrasts as well as the log fold changes.
We can obtain the contrast p values from the contrasts_fit
object and make it a longer format that the ggplot()
function will want for plotting.
# Let's extract the contrast p values for each and transform them with -log10()
<- -log10(contrasts_fit$p.value) %>%
contrast_p_vals_df # Make this into a data frame
as.data.frame() %>%
# Store genes as their own column
::rownames_to_column("Gene") %>%
tibble# Make this into long format
::pivot_longer(dplyr::contains("vsOther"),
tidyrnames_to = "contrast",
values_to = "neg_log10_p_val"
)
Now let’s extract the log fold changes from stats_df
.
# Let's extract the fold changes from `stats_df`
<- stats_df %>%
log_fc_df # We only want to keep the `Gene` column as well
::select("Gene", dplyr::contains("vsOther")) %>%
dplyr# Make this a longer format
::pivot_longer(dplyr::contains("vsOther"),
tidyrnames_to = "contrast",
values_to = "logFoldChange"
)
We can perform an inner_join()
of both these datasets using both their Gene
and contrast
columns.
<- log_fc_df %>%
plot_df ::inner_join(contrast_p_vals_df,
dplyrby = c("Gene", "contrast"),
# This argument will add the given suffixes to the column names
# from the respective data frames, helping us keep track of which columns
# hold which types of values
suffix = c("_log_fc", "_p_val")
)
Let’s print out a preview of plot_df
.
# Print out what this looks like
head(plot_df)
Let’s declare what we consider to be significant levels for fold change and for -log10 p-values. By saving this as its own variable, we only need to change these cutoffs in one place if we want to adjust later.
# Convert p value cutoff to negative log 10 scale
<- -log10(0.05)
p_val_cutoff
# Absolute value cutoff for fold changes
<- 5 abs_fc_cutoff
Now we can use these cutoffs to make a new variable that declares which genes we consider significant. We will use some logic with dplyr::case_when()
to do this.
<- plot_df %>%
plot_df ::mutate(
dplyrsignif_label = dplyr::case_when(
abs(logFoldChange) > abs_fc_cutoff & neg_log10_p_val > p_val_cutoff
~ "p-val and FC",
abs(logFoldChange) > abs_fc_cutoff ~ "FC",
> p_val_cutoff ~ "p-val",
neg_log10_p_val TRUE ~ "NS"
) )
Now we’re ready to plot the volcanoes!
<- ggplot(
volcanoes_plot
plot_df,aes(
x = logFoldChange, # Fold change as x value
y = neg_log10_p_val, # -log10(p value) for the contrasts
color = signif_label # Color code by significance cutoffs variable we made
)+
) # Make a scatter plot with points that are 30% opaque using `alpha`
geom_point(alpha = 0.3) +
# Draw our `p_val_cutoff` for line here
geom_hline(yintercept = p_val_cutoff, linetype = "dashed") +
# Using our `abs_fc_cutoff` for our lines here
geom_vline(xintercept = c(-abs_fc_cutoff, abs_fc_cutoff), linetype = "dashed") +
# The default colors aren't great, we'll specify our own here
scale_colour_manual(values = c("#67a9cf", "darkgray", "gray", "#a1d76a")) +
# Let's be more specific about what this p value is in our y axis label
ylab("Contrast -log10(p value)") +
# This makes separate plots for each contrast!
facet_wrap(~contrast) +
# Just for making it prettier!
theme_classic()
# Print out the plot!
volcanoes_plot
Here the green points might be of interest. We recommend ColorBrewer for finding different color sets if you don’t like the ones we used.
Let’s save these volcanoes to a PNG file.
ggsave(
plot = volcanoes_plot,
file.path(plots_dir, "GSE37418_results_volcano_plots.png")
)
## Saving 7 x 5 in image
ggplot2
cheatsheet has a summary of `ggplot2 options that might give you some inspiration for tweaking the volcano plot.At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
# Print session info
::session_info() sessioninfo
## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.5 (2021-03-31)
## os Ubuntu 20.04.3 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2022-03-01
##
## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.3)
## backports 1.2.1 2020-12-09 [1] RSPM (R 4.0.3)
## bslib 0.2.5 2021-05-12 [1] RSPM (R 4.0.4)
## cli 2.5.0 2021-04-26 [1] RSPM (R 4.0.4)
## colorspace 2.0-1 2021-05-04 [1] RSPM (R 4.0.4)
## crayon 1.4.1 2021-02-08 [1] RSPM (R 4.0.3)
## DBI 1.1.1 2021-01-15 [1] RSPM (R 4.0.3)
## digest 0.6.27 2020-10-24 [1] RSPM (R 4.0.3)
## dplyr 1.0.6 2021-05-05 [1] RSPM (R 4.0.4)
## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.0.4)
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.3)
## fansi 0.4.2 2021-01-15 [1] RSPM (R 4.0.3)
## farver 2.1.0 2021-02-28 [1] RSPM (R 4.0.3)
## generics 0.1.0 2020-10-31 [1] RSPM (R 4.0.3)
## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
## ggplot2 * 3.3.3 2020-12-30 [1] RSPM (R 4.0.3)
## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.3)
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.3)
## highr 0.9 2021-04-16 [1] RSPM (R 4.0.4)
## hms 1.0.0 2021-01-13 [1] RSPM (R 4.0.3)
## htmltools 0.5.1.1 2021-01-22 [1] RSPM (R 4.0.3)
## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.0.4)
## jsonlite 1.7.2 2020-12-09 [1] RSPM (R 4.0.3)
## knitr 1.33 2021-04-24 [1] RSPM (R 4.0.4)
## labeling 0.4.2 2020-10-20 [1] RSPM (R 4.0.3)
## lifecycle 1.0.0 2021-02-15 [1] RSPM (R 4.0.3)
## limma * 3.46.0 2020-10-27 [1] Bioconductor
## magrittr * 2.0.1 2020-11-17 [1] RSPM (R 4.0.3)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.3)
## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
## pillar 1.6.1 2021-05-16 [1] RSPM (R 4.0.4)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.3)
## ps 1.6.0 2021-02-28 [1] RSPM (R 4.0.3)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.3)
## R.cache 0.15.0 2021-04-30 [1] RSPM (R 4.0.4)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.3)
## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.3)
## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.3)
## R6 2.5.0 2020-10-28 [1] RSPM (R 4.0.3)
## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.4)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.3)
## rlang 0.4.11 2021-04-30 [1] RSPM (R 4.0.4)
## rmarkdown 2.8 2021-05-07 [1] RSPM (R 4.0.4)
## rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.0.3)
## sass 0.4.0 2021-05-12 [1] RSPM (R 4.0.4)
## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.3)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.3)
## stringi 1.6.1 2021-05-10 [1] RSPM (R 4.0.4)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.3)
## styler 1.4.1 2021-03-30 [1] RSPM (R 4.0.4)
## tibble 3.1.2 2021-05-16 [1] RSPM (R 4.0.4)
## tidyr 1.1.3 2021-03-03 [1] RSPM (R 4.0.4)
## tidyselect 1.1.1 2021-04-30 [1] RSPM (R 4.0.4)
## utf8 1.2.1 2021-03-12 [1] RSPM (R 4.0.3)
## vctrs 0.3.8 2021-04-29 [1] RSPM (R 4.0.4)
## withr 2.4.2 2021-04-18 [1] RSPM (R 4.0.4)
## xfun 0.23 2021-05-15 [1] RSPM (R 4.0.4)
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.3)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library