1 Purpose of this analysis

This notebook takes data and metadata from refine.bio and identifies differentially expressed genes.

⬇️ Jump to the analysis code ⬇️

2 How to run this example

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

2.1 Obtain the .Rmd file

To run this example yourself, download the .Rmd for this analysis by clicking this link.

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
  dir.create("data")
}

# Define the file path to the plots directory
plots_dir <- "plots"

# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
  dir.create(plots_dir)
}

# Define the file path to the results directory
results_dir <- "results"

# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
  dir.create(results_dir)
}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

2.3 Obtain the dataset from refine.bio

For general information about downloading data for these examples, see our ‘Getting Started’ section.

Go to this dataset’s page on refine.bio.

Click the “Download Now” button on the right side of this screen.

Fill out the pop up window with your email and our Terms and Conditions:

It may take a few minutes for the dataset to process. You will get an email when it is ready.

2.4 About the dataset we are using for this example

For this example analysis, we will use this zebrafish gene expression dataset. Tregnago et al. (2016) used microarrays to measure gene expression of ten zebrafish samples, five overexpressing human CREB, as well as five control samples. In this analysis, we will test differential expression between the control and CREB-overexpressing groups.

2.5 Place the dataset in your new data/ folder

refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.

For more details on the contents of this folder see these docs on refine.bio.

The <experiment_accession_id> folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235 or SRP12345.

Copy and paste the GSE71270 folder into your newly created data/ folder.

2.6 Check out our file structure!

Your new analysis folder should contain:

  • The example analysis .Rmd you downloaded
  • A folder called “data” which contains:
    • The GSE71270 folder which contains:
      • The gene expression
      • The metadata TSV
  • A folder for plots (currently empty)
  • A folder for results (currently empty)

Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.

# Define the file path to the data directory
# Replace with the path of the folder the files will be in
data_dir <- file.path("data", "GSE71270")

# Declare the file path to the gene expression matrix file
# inside directory saved as `data_dir`
# Replace with the path to your dataset file
data_file <- file.path(data_dir, "GSE71270.tsv")

# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
# Replace with the path to your metadata file
metadata_file <- file.path(data_dir, "metadata_GSE71270.tsv")

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
## [1] TRUE
# Check if the metadata file is at the file path stored in `metadata_file`
file.exists(metadata_file)
## [1] TRUE

If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.

3 Using a different refine.bio dataset with this analysis?

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.


 

4 Differential Expression - Microarray

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

In this analysis, we will be using limma for differential expression (Ritchie et al. 2015). We will also use EnhancedVolcano for plotting (Blighe et al. 2020).

if (!("limma" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("limma", update = FALSE)
}
if (!("EnhancedVolcano" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("EnhancedVolcano", update = FALSE)
}

Attach the packages we need for this analysis.

# Attach the library
library(limma)

# We will need this so we can use the pipe: %>%
library(magrittr)

# We'll use this for plotting
library(ggplot2)

The jitter plot we make later on with geom_jitter() involves some randomness. As is good practice when our analysis involves randomness, we will set the seed.

set.seed(12345)

4.2 Import and set up data

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the both TSV files and add them as data frames to your environment.

We stored our file paths as objects named metadata_file and data_file in this previous step.

# Read in metadata TSV file
metadata <- readr::read_tsv(metadata_file)
## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   refinebio_age = col_logical(),
##   refinebio_cell_line = col_logical(),
##   refinebio_compound = col_logical(),
##   refinebio_disease = col_logical(),
##   refinebio_disease_stage = col_logical(),
##   refinebio_genetic_information = col_logical(),
##   refinebio_processed = col_logical(),
##   refinebio_race = col_logical(),
##   refinebio_sex = col_logical(),
##   refinebio_source_archive_url = col_logical(),
##   refinebio_subject = col_logical(),
##   refinebio_time = col_logical(),
##   refinebio_treatment = col_logical(),
##   channel_count = col_double(),
##   `contact_zip/postal_code` = col_double(),
##   data_row_count = col_double(),
##   taxid_ch1 = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
# Read in data TSV file
expression_df <- readr::read_tsv(data_file) %>%
  # Tuck away the Gene ID column as row names
  tibble::column_to_rownames("Gene")
## 
## ── Column specification ──────────────────────────────────────────────
## cols(
##   Gene = col_character(),
##   GSM1831675 = col_double(),
##   GSM1831676 = col_double(),
##   GSM1831677 = col_double(),
##   GSM1831678 = col_double(),
##   GSM1831679 = col_double(),
##   GSM1831680 = col_double(),
##   GSM1831681 = col_double(),
##   GSM1831682 = col_double(),
##   GSM1831683 = col_double(),
##   GSM1831684 = col_double()
## )

Let’s ensure that the metadata and data are in the same sample order.

# Make the data in the order of the metadata
expression_df <- expression_df %>%
  dplyr::select(metadata$geo_accession)

# Check if this is in the same order
all.equal(colnames(df), metadata$geo_accession)
## [1] "target is NULL, current is character"

4.3 Set up design matrix

limma needs a numeric design matrix to signify which are CREB and control samples. Here we are using the treatments described in the metadata table in the genotype/variation column to create a design matrix where the “control” samples are assigned 0 and the “overexpressing the human CREB” samples are assigned 1. Note that the metadata columns that signify the treatment groups might be different across datasets, and will almost certainly have different contents.

While the genotype/variation column contains the group information we will be using for differential expression, the / it contains in its column name makes it more annoying to access.
Accessing variable that have names with special characters like /, or spaces, require extra work-arounds to ignore R’s normal interpretations of these characters. Here we will rename it as just genotype to make our lives later much easier.

We will also recode the contents of the column, as overexpressing the human CREB" is a bit of an unruly name. To do this, we will use the fct_recode() function from the forcats package, simplifying "overexpressing the human CREB" to just CREB. We will also use fct_relevel() to make sure our control samples appear first in the factor levels.

# These renaming steps will not be the same (or might not be needed at all)
# with a different dataset
metadata <- metadata %>%
  # rename the column
  dplyr::rename("genotype" = `genotype/variation`) %>%
  # change the names and order of the genotypes (making the column a factor)
  dplyr::mutate(
    genotype = genotype %>%
      # rename the "overexpressing..." genotype to "CREB"
      forcats::fct_recode(CREB = "overexpressing the human CREB") %>%
      # make "control" the first level of the factor
      forcats::fct_relevel("control")
  )

Now we will create a model matrix based on our newly renamed genotype variable.

# Create the design matrix from the genotype information
des_mat <- model.matrix(~genotype, data = metadata)

# Look at the design matrix
head(des_mat)
##   (Intercept) genotypeCREB
## 1           1            1
## 2           1            0
## 3           1            0
## 4           1            1
## 5           1            0
## 6           1            0

When we look at this design matrix, we see that there is now a genotypeCREB column that defines the group for each sample: 0 for control samples and 1 for the CREB samples. (The model will also fit an intercept for all samples, so we can see that here as well.)

4.4 Perform differential expression

We will use the lmFit() function from the limma package to test each gene for differential expression between the two groups using a linear model. After fitting our data to the linear model, in this example we apply empirical Bayes smoothing with the eBayes() function.

Here’s a nifty article and example about what the empirical Bayes smoothing is for (Robinson).

# Apply linear model to data
fit <- lmFit(expression_df, design = des_mat)

# Apply empirical Bayes to smooth standard errors
fit <- eBayes(fit)

Because we are testing many different genes at once, we also want to perform some multiple test corrections, which we will do with the Benjamini-Hochberg method while making a table of results with topTable(). The topTable() function default is to use Benjamini-Hochberg but this can be changed to a different method using the adjust.method argument (see the ?topTable help page for more about the options).

# Apply multiple testing correction and obtain stats
stats_df <- topTable(fit, number = nrow(expression_df)) %>%
  tibble::rownames_to_column("Gene")
## Removing intercept from test coefficients

Let’s take a peek at our results table.

head(stats_df)

By default, results are ordered by largest B (the log odds value) to the smallest, which means your most differentially expressed genes should be toward the top.

See the help page by using ?topTable for more information and options for this table.

4.5 Check results by plotting one gene

To test if these results make sense, we can make a plot of one of top genes. Let’s try extracting the data for ENSDARG00000104315 and set up its own data frame for plotting purposes.

top_gene_df <- expression_df %>%
  # Extract this gene from `expression_df`
  dplyr::filter(rownames(.) == "ENSDARG00000104315") %>%
  # Transpose so the gene is a column
  t() %>%
  # Transpose made this a matrix, let's make it back into a data frame
  data.frame() %>%
  # Store the sample ids as their own column instead of as row names
  tibble::rownames_to_column("refinebio_accession_code") %>%
  # Join on the selected columns from metadata
  dplyr::inner_join(dplyr::select(
    metadata,
    refinebio_accession_code,
    genotype
  ))
## Joining, by = "refinebio_accession_code"

Let’s take a sneak peek at what our top_gene_df looks like.

top_gene_df

Now let’s plot the data for ENSDARG00000104315 using our top_gene_df.

ggplot(top_gene_df, aes(x = genotype, y = ENSDARG00000104315, color = genotype)) +
  geom_jitter(width = 0.2, height = 0) + # We'll make this a jitter plot
  theme_classic() # This makes some aesthetic changes

These results make sense. The overexpressing CREB group samples have much higher expression values for ENSDARG00000104315 than the control samples do.

4.6 Write results to file

The results in stats_df will be saved to our results/ directory.

readr::write_tsv(stats_df, file.path(
  results_dir,
  "GSE71270_limma_results.tsv" # Replace with a relevant output name
))

4.7 Make a volcano plot

We’ll use the EnhancedVolcano package’s main function to plot our data (Zhu et al. 2018).

EnhancedVolcano::EnhancedVolcano(stats_df,
  lab = stats_df$Gene, # This has to be a vector with our labels we want for our genes
  x = "logFC", # This is the column name in `stats_df` that contains what we want on the x axis
  y = "adj.P.Val" # This is the column name in `stats_df` that contains what we want on the y axis
)
## Registered S3 methods overwritten by 'ggalt':
##   method                  from   
##   grid.draw.absoluteGrob  ggplot2
##   grobHeight.absoluteGrob ggplot2
##   grobWidth.absoluteGrob  ggplot2
##   grobX.absoluteGrob      ggplot2
##   grobY.absoluteGrob      ggplot2

In this plot, green points represent genes that meet the log2 fold change, by default the cutoff is absolute value of 1.
But there are no genes that meet the p value cutoff, which by default is 1e-05. We used the adjusted p values for our plot above, so you may want to adjust this with the pCutoff argument (Take a look at all the options for tailoring this plot using ?EnhancedVolcano).

Let’s make the same plot again, but adjust the pCutoff since we are using multiple-testing corrected p values, and this time we will assign the plot to our environment as volcano_plot.

volcano_plot <- EnhancedVolcano::EnhancedVolcano(stats_df,
  lab = stats_df$Gene,
  x = "logFC",
  y = "adj.P.Val",
  pCutoff = 0.01 # Because we are using adjusted p values, we can loosen this a bit
)

# Print out our plot
volcano_plot

Let’s save this plot to a PNG file.

ggsave(
  plot = volcano_plot,
  file.path(plots_dir, "GSE71270_volcano_plot.png")
) # Replace with a plot name relevant to your data
## Saving 7 x 5 in image

6 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

# Print session info
sessioninfo::session_info()
## ─ Session info ─────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.5 (2021-03-31)
##  os       Ubuntu 20.04.3 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Etc/UTC                     
##  date     2022-03-01                  
## 
## ─ Packages ─────────────────────────────────────────────────────────
##  package         * version  date       lib source        
##  ash               1.0-15   2015-09-01 [1] RSPM (R 4.0.3)
##  assertthat        0.2.1    2019-03-21 [1] RSPM (R 4.0.3)
##  backports         1.2.1    2020-12-09 [1] RSPM (R 4.0.3)
##  beeswarm          0.3.1    2021-03-07 [1] RSPM (R 4.0.3)
##  bslib             0.2.5    2021-05-12 [1] RSPM (R 4.0.4)
##  cli               2.5.0    2021-04-26 [1] RSPM (R 4.0.4)
##  colorspace        2.0-1    2021-05-04 [1] RSPM (R 4.0.4)
##  crayon            1.4.1    2021-02-08 [1] RSPM (R 4.0.3)
##  DBI               1.1.1    2021-01-15 [1] RSPM (R 4.0.3)
##  digest            0.6.27   2020-10-24 [1] RSPM (R 4.0.3)
##  dplyr             1.0.6    2021-05-05 [1] RSPM (R 4.0.4)
##  ellipsis          0.3.2    2021-04-29 [1] RSPM (R 4.0.4)
##  EnhancedVolcano   1.8.0    2020-10-27 [1] Bioconductor  
##  evaluate          0.14     2019-05-28 [1] RSPM (R 4.0.3)
##  extrafont         0.17     2014-12-08 [1] RSPM (R 4.0.3)
##  extrafontdb       1.0      2012-06-11 [1] RSPM (R 4.0.3)
##  fansi             0.4.2    2021-01-15 [1] RSPM (R 4.0.3)
##  farver            2.1.0    2021-02-28 [1] RSPM (R 4.0.3)
##  forcats           0.5.1    2021-01-27 [1] RSPM (R 4.0.3)
##  generics          0.1.0    2020-10-31 [1] RSPM (R 4.0.3)
##  getopt            1.20.3   2019-03-22 [1] RSPM (R 4.0.0)
##  ggalt             0.4.0    2017-02-15 [1] RSPM (R 4.0.0)
##  ggbeeswarm        0.6.0    2017-08-07 [1] RSPM (R 4.0.0)
##  ggplot2         * 3.3.3    2020-12-30 [1] RSPM (R 4.0.3)
##  ggrastr           0.2.3    2021-03-01 [1] RSPM (R 4.0.5)
##  ggrepel           0.9.1    2021-01-15 [1] RSPM (R 4.0.3)
##  glue              1.4.2    2020-08-27 [1] RSPM (R 4.0.3)
##  gtable            0.3.0    2019-03-25 [1] RSPM (R 4.0.3)
##  highr             0.9      2021-04-16 [1] RSPM (R 4.0.4)
##  hms               1.0.0    2021-01-13 [1] RSPM (R 4.0.3)
##  htmltools         0.5.1.1  2021-01-22 [1] RSPM (R 4.0.3)
##  jquerylib         0.1.4    2021-04-26 [1] RSPM (R 4.0.4)
##  jsonlite          1.7.2    2020-12-09 [1] RSPM (R 4.0.3)
##  KernSmooth        2.23-18  2020-10-29 [2] CRAN (R 4.0.5)
##  knitr             1.33     2021-04-24 [1] RSPM (R 4.0.4)
##  labeling          0.4.2    2020-10-20 [1] RSPM (R 4.0.3)
##  lifecycle         1.0.0    2021-02-15 [1] RSPM (R 4.0.3)
##  limma           * 3.46.0   2020-10-27 [1] Bioconductor  
##  magrittr        * 2.0.1    2020-11-17 [1] RSPM (R 4.0.3)
##  maps              3.3.0    2018-04-03 [1] RSPM (R 4.0.3)
##  MASS              7.3-53.1 2021-02-12 [2] CRAN (R 4.0.5)
##  munsell           0.5.0    2018-06-12 [1] RSPM (R 4.0.3)
##  optparse        * 1.6.6    2020-04-16 [1] RSPM (R 4.0.0)
##  pillar            1.6.1    2021-05-16 [1] RSPM (R 4.0.4)
##  pkgconfig         2.0.3    2019-09-22 [1] RSPM (R 4.0.3)
##  proj4             1.0-10.1 2021-01-26 [1] RSPM (R 4.0.3)
##  ps                1.6.0    2021-02-28 [1] RSPM (R 4.0.3)
##  purrr             0.3.4    2020-04-17 [1] RSPM (R 4.0.3)
##  R.cache           0.15.0   2021-04-30 [1] RSPM (R 4.0.4)
##  R.methodsS3       1.8.1    2020-08-26 [1] RSPM (R 4.0.3)
##  R.oo              1.24.0   2020-08-26 [1] RSPM (R 4.0.3)
##  R.utils           2.10.1   2020-08-26 [1] RSPM (R 4.0.3)
##  R6                2.5.0    2020-10-28 [1] RSPM (R 4.0.3)
##  RColorBrewer      1.1-2    2014-12-07 [1] RSPM (R 4.0.3)
##  Rcpp              1.0.6    2021-01-15 [1] RSPM (R 4.0.3)
##  readr             1.4.0    2020-10-05 [1] RSPM (R 4.0.4)
##  rematch2          2.1.2    2020-05-01 [1] RSPM (R 4.0.3)
##  rlang             0.4.11   2021-04-30 [1] RSPM (R 4.0.4)
##  rmarkdown         2.8      2021-05-07 [1] RSPM (R 4.0.4)
##  rstudioapi        0.13     2020-11-12 [1] RSPM (R 4.0.3)
##  Rttf2pt1          1.3.8    2020-01-10 [1] RSPM (R 4.0.3)
##  sass              0.4.0    2021-05-12 [1] RSPM (R 4.0.4)
##  scales            1.1.1    2020-05-11 [1] RSPM (R 4.0.3)
##  sessioninfo       1.1.1    2018-11-05 [1] RSPM (R 4.0.3)
##  stringi           1.6.1    2021-05-10 [1] RSPM (R 4.0.4)
##  stringr           1.4.0    2019-02-10 [1] RSPM (R 4.0.3)
##  styler            1.4.1    2021-03-30 [1] RSPM (R 4.0.4)
##  tibble            3.1.2    2021-05-16 [1] RSPM (R 4.0.4)
##  tidyselect        1.1.1    2021-04-30 [1] RSPM (R 4.0.4)
##  utf8              1.2.1    2021-03-12 [1] RSPM (R 4.0.3)
##  vctrs             0.3.8    2021-04-29 [1] RSPM (R 4.0.4)
##  vipor             0.4.5    2017-03-22 [1] RSPM (R 4.0.0)
##  withr             2.4.2    2021-04-18 [1] RSPM (R 4.0.4)
##  xfun              0.23     2021-05-15 [1] RSPM (R 4.0.4)
##  yaml              2.2.1    2020-02-01 [1] RSPM (R 4.0.3)
## 
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library

References

Blighe K., S. Rana, and M. Lewis, 2020 EnhancedVolcano: Publication-ready volcano plots with enhanced colouring and labeling. https://github.com/kevinblighe/EnhancedVolcano
Gonzalez I., 2014 Statistical analysis of RNA-Seq data. http://www.nathalievialaneix.eu/doc/pdf/tutorial-rnaseq.pdf
Klaus B., and S. Reisenauer, 2018 An end to end workflow for differential gene expression using Affymetrix microarrays. https://www.bioconductor.org/packages/devel/workflows/vignettes/maEndToEnd/inst/doc/MA-Workflow.html
Ritchie M. E., B. Phipson, D. Wu, Y. Hu, C. W. Law, et al., 2015 limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43: e47. https://doi.org/10.1093/nar/gkv007
Robinson D., Understanding empirical Bayes estimation (using baseball statistics). http://varianceexplained.org/r/empirical_bayes_baseball/
Tregnago C., E. Manara, M. Zampini, V. Bisio, C. Borga, et al., 2016 CREB engages C/EBPδ to initiate leukemogenesis. Leukemia 30: 1887–1896. https://doi.org/10.1038/leu.2016.98
Zhu A., J. G. Ibrahim, and M. I. Love, 2018 Heavy-tailed prior distributions for sequence count data: Removing the noise and preserving large differences. Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895