This notebook takes data and metadata from refine.bio and identifies differentially expressed genes.
For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.
.Rmd
fileTo run this example yourself, download the .Rmd
for this analysis by clicking this link.
Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd
file to where you would like this example and its files to be stored.
You can open this .Rmd
file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd
files.)
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
dir.create("data")
}
# Define the file path to the plots directory
<- "plots"
plots_dir
# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
dir.create(plots_dir)
}
# Define the file path to the results directory
<- "results"
results_dir
# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
dir.create(results_dir)
}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
For general information about downloading data for these examples, see our ‘Getting Started’ section.
Go to this dataset’s page on refine.bio.
Click the “Download Now” button on the right side of this screen.
Fill out the pop up window with your email and our Terms and Conditions:
It may take a few minutes for the dataset to process. You will get an email when it is ready.
For this example analysis, we will use this zebrafish gene expression dataset. Tregnago et al. (2016) used microarrays to measure gene expression of ten zebrafish samples, five overexpressing human CREB, as well as five control samples. In this analysis, we will test differential expression between the control and CREB-overexpressing groups.
data/
folderrefine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip
. Double clicking should unzip this for you and create a folder of the same name.
For more details on the contents of this folder see these docs on refine.bio.
The <experiment_accession_id>
folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235
or SRP12345
.
Copy and paste the GSE71270
folder into your newly created data/
folder.
Your new analysis folder should contain:
.Rmd
you downloadedGSE71270
folder which contains:
plots
(currently empty)results
(currently empty)Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):
In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.
First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.
# Define the file path to the data directory
# Replace with the path of the folder the files will be in
<- file.path("data", "GSE71270")
data_dir
# Declare the file path to the gene expression matrix file
# inside directory saved as `data_dir`
# Replace with the path to your dataset file
<- file.path(data_dir, "GSE71270.tsv")
data_file
# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
# Replace with the path to your metadata file
<- file.path(data_dir, "metadata_GSE71270.tsv") metadata_file
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
## [1] TRUE
# Check if the metadata file is at the file path stored in `metadata_file`
file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE
to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.
If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/
directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/
and results/
directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.
See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
In this analysis, we will be using limma
for differential expression (Ritchie et al. 2015). We will also use EnhancedVolcano
for plotting (Blighe et al. 2020).
if (!("limma" %in% installed.packages())) {
# Install this package if it isn't installed yet
::install("limma", update = FALSE)
BiocManager
}if (!("EnhancedVolcano" %in% installed.packages())) {
# Install this package if it isn't installed yet
::install("EnhancedVolcano", update = FALSE)
BiocManager }
Attach the packages we need for this analysis.
# Attach the library
library(limma)
# We will need this so we can use the pipe: %>%
library(magrittr)
# We'll use this for plotting
library(ggplot2)
The jitter plot we make later on with geom_jitter()
involves some randomness. As is good practice when our analysis involves randomness, we will set the seed.
set.seed(12345)
Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the both TSV files and add them as data frames to your environment.
We stored our file paths as objects named metadata_file
and data_file
in this previous step.
# Read in metadata TSV file
<- readr::read_tsv(metadata_file) metadata
##
## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_character(),
## refinebio_age = col_logical(),
## refinebio_cell_line = col_logical(),
## refinebio_compound = col_logical(),
## refinebio_disease = col_logical(),
## refinebio_disease_stage = col_logical(),
## refinebio_genetic_information = col_logical(),
## refinebio_processed = col_logical(),
## refinebio_race = col_logical(),
## refinebio_sex = col_logical(),
## refinebio_source_archive_url = col_logical(),
## refinebio_subject = col_logical(),
## refinebio_time = col_logical(),
## refinebio_treatment = col_logical(),
## channel_count = col_double(),
## `contact_zip/postal_code` = col_double(),
## data_row_count = col_double(),
## taxid_ch1 = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
# Read in data TSV file
<- readr::read_tsv(data_file) %>%
expression_df # Tuck away the Gene ID column as row names
::column_to_rownames("Gene") tibble
##
## ── Column specification ──────────────────────────────────────────────
## cols(
## Gene = col_character(),
## GSM1831675 = col_double(),
## GSM1831676 = col_double(),
## GSM1831677 = col_double(),
## GSM1831678 = col_double(),
## GSM1831679 = col_double(),
## GSM1831680 = col_double(),
## GSM1831681 = col_double(),
## GSM1831682 = col_double(),
## GSM1831683 = col_double(),
## GSM1831684 = col_double()
## )
Let’s ensure that the metadata and data are in the same sample order.
# Make the data in the order of the metadata
<- expression_df %>%
expression_df ::select(metadata$geo_accession)
dplyr
# Check if this is in the same order
all.equal(colnames(df), metadata$geo_accession)
## [1] "target is NULL, current is character"
limma
needs a numeric design matrix to signify which are CREB and control samples. Here we are using the treatments described in the metadata table in the genotype/variation
column to create a design matrix where the “control” samples are assigned 0
and the “overexpressing the human CREB” samples are assigned 1
. Note that the metadata columns that signify the treatment groups might be different across datasets, and will almost certainly have different contents.
While the genotype/variation
column contains the group information we will be using for differential expression, the /
it contains in its column name makes it more annoying to access.
Accessing variable that have names with special characters like /
, or spaces, require extra work-arounds to ignore R’s normal interpretations of these characters. Here we will rename it as just genotype
to make our lives later much easier.
We will also recode the contents of the column, as overexpressing the human CREB"
is a bit of an unruly name. To do this, we will use the fct_recode()
function from the forcats
package, simplifying "overexpressing the human CREB"
to just CREB
. We will also use fct_relevel()
to make sure our control
samples appear first in the factor levels.
# These renaming steps will not be the same (or might not be needed at all)
# with a different dataset
<- metadata %>%
metadata # rename the column
::rename("genotype" = `genotype/variation`) %>%
dplyr# change the names and order of the genotypes (making the column a factor)
::mutate(
dplyrgenotype = genotype %>%
# rename the "overexpressing..." genotype to "CREB"
::fct_recode(CREB = "overexpressing the human CREB") %>%
forcats# make "control" the first level of the factor
::fct_relevel("control")
forcats )
Now we will create a model matrix based on our newly renamed genotype
variable.
# Create the design matrix from the genotype information
<- model.matrix(~genotype, data = metadata)
des_mat
# Look at the design matrix
head(des_mat)
## (Intercept) genotypeCREB
## 1 1 1
## 2 1 0
## 3 1 0
## 4 1 1
## 5 1 0
## 6 1 0
When we look at this design matrix, we see that there is now a genotypeCREB
column that defines the group for each sample: 0 for control samples and 1 for the CREB samples. (The model will also fit an intercept for all samples, so we can see that here as well.)
We will use the lmFit()
function from the limma
package to test each gene for differential expression between the two groups using a linear model. After fitting our data to the linear model, in this example we apply empirical Bayes smoothing with the eBayes()
function.
Here’s a nifty article and example about what the empirical Bayes smoothing is for (Robinson).
# Apply linear model to data
<- lmFit(expression_df, design = des_mat)
fit
# Apply empirical Bayes to smooth standard errors
<- eBayes(fit) fit
Because we are testing many different genes at once, we also want to perform some multiple test corrections, which we will do with the Benjamini-Hochberg method while making a table of results with topTable()
. The topTable()
function default is to use Benjamini-Hochberg but this can be changed to a different method using the adjust.method
argument (see the ?topTable
help page for more about the options).
# Apply multiple testing correction and obtain stats
<- topTable(fit, number = nrow(expression_df)) %>%
stats_df ::rownames_to_column("Gene") tibble
## Removing intercept from test coefficients
Let’s take a peek at our results table.
head(stats_df)
By default, results are ordered by largest B
(the log odds value) to the smallest, which means your most differentially expressed genes should be toward the top.
See the help page by using ?topTable
for more information and options for this table.
To test if these results make sense, we can make a plot of one of top genes. Let’s try extracting the data for ENSDARG00000104315
and set up its own data frame for plotting purposes.
<- expression_df %>%
top_gene_df # Extract this gene from `expression_df`
::filter(rownames(.) == "ENSDARG00000104315") %>%
dplyr# Transpose so the gene is a column
t() %>%
# Transpose made this a matrix, let's make it back into a data frame
data.frame() %>%
# Store the sample ids as their own column instead of as row names
::rownames_to_column("refinebio_accession_code") %>%
tibble# Join on the selected columns from metadata
::inner_join(dplyr::select(
dplyr
metadata,
refinebio_accession_code,
genotype ))
## Joining, by = "refinebio_accession_code"
Let’s take a sneak peek at what our top_gene_df
looks like.
top_gene_df
Now let’s plot the data for ENSDARG00000104315
using our top_gene_df
.
ggplot(top_gene_df, aes(x = genotype, y = ENSDARG00000104315, color = genotype)) +
geom_jitter(width = 0.2, height = 0) + # We'll make this a jitter plot
theme_classic() # This makes some aesthetic changes
These results make sense. The overexpressing CREB group samples have much higher expression values for ENSDARG00000104315 than the control samples do.
The results in stats_df
will be saved to our results/
directory.
::write_tsv(stats_df, file.path(
readr
results_dir,"GSE71270_limma_results.tsv" # Replace with a relevant output name
))
We’ll use the EnhancedVolcano
package’s main function to plot our data (Zhu et al. 2018).
::EnhancedVolcano(stats_df,
EnhancedVolcanolab = stats_df$Gene, # This has to be a vector with our labels we want for our genes
x = "logFC", # This is the column name in `stats_df` that contains what we want on the x axis
y = "adj.P.Val" # This is the column name in `stats_df` that contains what we want on the y axis
)
## Registered S3 methods overwritten by 'ggalt':
## method from
## grid.draw.absoluteGrob ggplot2
## grobHeight.absoluteGrob ggplot2
## grobWidth.absoluteGrob ggplot2
## grobX.absoluteGrob ggplot2
## grobY.absoluteGrob ggplot2
In this plot, green points represent genes that meet the log2 fold change, by default the cutoff is absolute value of 1.
But there are no genes that meet the p value cutoff, which by default is 1e-05
. We used the adjusted p values for our plot above, so you may want to adjust this with the pCutoff
argument (Take a look at all the options for tailoring this plot using ?EnhancedVolcano
).
Let’s make the same plot again, but adjust the pCutoff
since we are using multiple-testing corrected p values, and this time we will assign the plot to our environment as volcano_plot
.
<- EnhancedVolcano::EnhancedVolcano(stats_df,
volcano_plot lab = stats_df$Gene,
x = "logFC",
y = "adj.P.Val",
pCutoff = 0.01 # Because we are using adjusted p values, we can loosen this a bit
)
# Print out our plot
volcano_plot
Let’s save this plot to a PNG file.
ggsave(
plot = volcano_plot,
file.path(plots_dir, "GSE71270_volcano_plot.png")
# Replace with a plot name relevant to your data )
## Saving 7 x 5 in image
EnhancedVolcano
vignette has more examples on how to tailor your volcano plot (Blighe et al. 2020).At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
# Print session info
::session_info() sessioninfo
## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.5 (2021-03-31)
## os Ubuntu 20.04.3 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2022-03-01
##
## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
## ash 1.0-15 2015-09-01 [1] RSPM (R 4.0.3)
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.3)
## backports 1.2.1 2020-12-09 [1] RSPM (R 4.0.3)
## beeswarm 0.3.1 2021-03-07 [1] RSPM (R 4.0.3)
## bslib 0.2.5 2021-05-12 [1] RSPM (R 4.0.4)
## cli 2.5.0 2021-04-26 [1] RSPM (R 4.0.4)
## colorspace 2.0-1 2021-05-04 [1] RSPM (R 4.0.4)
## crayon 1.4.1 2021-02-08 [1] RSPM (R 4.0.3)
## DBI 1.1.1 2021-01-15 [1] RSPM (R 4.0.3)
## digest 0.6.27 2020-10-24 [1] RSPM (R 4.0.3)
## dplyr 1.0.6 2021-05-05 [1] RSPM (R 4.0.4)
## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.0.4)
## EnhancedVolcano 1.8.0 2020-10-27 [1] Bioconductor
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.3)
## extrafont 0.17 2014-12-08 [1] RSPM (R 4.0.3)
## extrafontdb 1.0 2012-06-11 [1] RSPM (R 4.0.3)
## fansi 0.4.2 2021-01-15 [1] RSPM (R 4.0.3)
## farver 2.1.0 2021-02-28 [1] RSPM (R 4.0.3)
## forcats 0.5.1 2021-01-27 [1] RSPM (R 4.0.3)
## generics 0.1.0 2020-10-31 [1] RSPM (R 4.0.3)
## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
## ggalt 0.4.0 2017-02-15 [1] RSPM (R 4.0.0)
## ggbeeswarm 0.6.0 2017-08-07 [1] RSPM (R 4.0.0)
## ggplot2 * 3.3.3 2020-12-30 [1] RSPM (R 4.0.3)
## ggrastr 0.2.3 2021-03-01 [1] RSPM (R 4.0.5)
## ggrepel 0.9.1 2021-01-15 [1] RSPM (R 4.0.3)
## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.3)
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.3)
## highr 0.9 2021-04-16 [1] RSPM (R 4.0.4)
## hms 1.0.0 2021-01-13 [1] RSPM (R 4.0.3)
## htmltools 0.5.1.1 2021-01-22 [1] RSPM (R 4.0.3)
## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.0.4)
## jsonlite 1.7.2 2020-12-09 [1] RSPM (R 4.0.3)
## KernSmooth 2.23-18 2020-10-29 [2] CRAN (R 4.0.5)
## knitr 1.33 2021-04-24 [1] RSPM (R 4.0.4)
## labeling 0.4.2 2020-10-20 [1] RSPM (R 4.0.3)
## lifecycle 1.0.0 2021-02-15 [1] RSPM (R 4.0.3)
## limma * 3.46.0 2020-10-27 [1] Bioconductor
## magrittr * 2.0.1 2020-11-17 [1] RSPM (R 4.0.3)
## maps 3.3.0 2018-04-03 [1] RSPM (R 4.0.3)
## MASS 7.3-53.1 2021-02-12 [2] CRAN (R 4.0.5)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.3)
## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
## pillar 1.6.1 2021-05-16 [1] RSPM (R 4.0.4)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.3)
## proj4 1.0-10.1 2021-01-26 [1] RSPM (R 4.0.3)
## ps 1.6.0 2021-02-28 [1] RSPM (R 4.0.3)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.3)
## R.cache 0.15.0 2021-04-30 [1] RSPM (R 4.0.4)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.3)
## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.3)
## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.3)
## R6 2.5.0 2020-10-28 [1] RSPM (R 4.0.3)
## RColorBrewer 1.1-2 2014-12-07 [1] RSPM (R 4.0.3)
## Rcpp 1.0.6 2021-01-15 [1] RSPM (R 4.0.3)
## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.4)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.3)
## rlang 0.4.11 2021-04-30 [1] RSPM (R 4.0.4)
## rmarkdown 2.8 2021-05-07 [1] RSPM (R 4.0.4)
## rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.0.3)
## Rttf2pt1 1.3.8 2020-01-10 [1] RSPM (R 4.0.3)
## sass 0.4.0 2021-05-12 [1] RSPM (R 4.0.4)
## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.3)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.3)
## stringi 1.6.1 2021-05-10 [1] RSPM (R 4.0.4)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.3)
## styler 1.4.1 2021-03-30 [1] RSPM (R 4.0.4)
## tibble 3.1.2 2021-05-16 [1] RSPM (R 4.0.4)
## tidyselect 1.1.1 2021-04-30 [1] RSPM (R 4.0.4)
## utf8 1.2.1 2021-03-12 [1] RSPM (R 4.0.3)
## vctrs 0.3.8 2021-04-29 [1] RSPM (R 4.0.4)
## vipor 0.4.5 2017-03-22 [1] RSPM (R 4.0.0)
## withr 2.4.2 2021-04-18 [1] RSPM (R 4.0.4)
## xfun 0.23 2021-05-15 [1] RSPM (R 4.0.4)
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.3)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library