Objectives

This notebook will demonstrate how to:

  • Load and use R packages
  • Read in and perform simple manipulations of data frames
  • Use ggplot2 to plot and visualize data
  • Customize plots using features of ggplot2

We’ll use a real gene expression dataset to get comfortable making visualizations using ggplot2. We’ve performed differential expression analyses on a pre-processed astrocytoma microarray dataset. We’ll start by making a volcano plot of differential gene expression results from this experiment. We performed three sets of contrasts:

  1. sex category contrasting: Male vs Female
  2. tissue category contrasting : Pilocytic astrocytoma tumor samples vs normal cerebellum samples
  3. An interaction of both sex and tissue.

More ggplot2 resources:

Set Up

We saved these results to a tab separated values (TSV) file called gene_results_GSE44971.tsv. It’s been saved to the data folder. File paths are relative to where this notebook file (.Rmd) is saved. So we can reference it later, let’s make a variable with our data directory name.

data_dir <- "data"

Let’s declare our output folder name as its own variable.

plots_dir <- "plots"

We can also create a directory if it doesn’t already exist.

There’s a couple ways that we can create that directory from within R. One way is to use the base R function dir.create(), which (as the name suggests) will create a directory. But this function assumes that the directory does not yet exist, and it will throw an error if you try to create a directory that already exists. To avoid this error, we can place the directory creation inside an if statement, so the code will only run if the directory does not yet exist:

# The if statement here tests whether the plot directory exists and
# only executes the expressions between the braces if it does not.
if (!dir.exists(plots_dir)) {
  dir.create(plots_dir)
}

In this notebook we will be using functions from the Tidyverse set of packages, so we need to load in those functions using library(). We could load the individual packages we need one at a time, but it is convenient for now to load them all with the tidyverse “package,” which groups many of them together as a shortcut. Keep a look out for where we tell you which individual package different functions come from.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Read in the differential expression analysis results file

Here we are using a tidyverse function read_tsv() from the readr package. Like we did in the previous notebook, we will store the resulting data frame as stats_df.

# read in the file `gene_results_GSE44971.tsv` from the data directory
stats_df <- read_tsv(file.path(
  data_dir,
  "gene_results_GSE44971.tsv"
))
Rows: 6804 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): ensembl_id, gene_symbol, contrast
dbl (5): log_fold_change, avg_expression, t_statistic, p_value, adj_p_value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We can take a look at a column individually by using a $. Note we are using head() so the whole thing doesn’t print out.

head(stats_df$contrast)
[1] "male_female" "male_female" "male_female" "male_female" "male_female"
[6] "male_female"
male_female
male_female
male_female
male_female
male_female
male_female

If we want to see a specific set of values, we can use brackets with the indices of the values we’d like returned.

stats_df$avg_expression[6:10]
[1] 19.084011  8.453933  5.116563  6.345609 25.473133

Let’s look at some basic statistics from the data set using summary()

# summary of stats_df
summary(stats_df)
  ensembl_id        gene_symbol          contrast         log_fold_change    
 Length:6804        Length:6804        Length:6804        Min.   :-180.8118  
 Class :character   Class :character   Class :character   1st Qu.:  -1.6703  
 Mode  :character   Mode  :character   Mode  :character   Median :   0.1500  
                                                          Mean   :   0.2608  
                                                          3rd Qu.:   2.1049  
                                                          Max.   : 129.3009  
 avg_expression     t_statistic           p_value         adj_p_value     
 Min.   :  5.003   Min.   :-32.84581   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:  6.304   1st Qu.: -1.16444   1st Qu.:0.01309   1st Qu.:0.05657  
 Median :  8.482   Median :  0.10619   Median :0.18919   Median :0.41354  
 Mean   : 13.847   Mean   : -0.00819   Mean   :0.31223   Mean   :0.44833  
 3rd Qu.: 14.022   3rd Qu.:  1.46589   3rd Qu.:0.57634   3rd Qu.:0.82067  
 Max.   :190.708   Max.   : 10.48302   Max.   :0.99979   Max.   :0.99988  

The statistics for contrast are not very informative, so let’s do that again with just the contrast column after converting it to a factor

# summary of `stats_df$contrast` as a factor
summary(as.factor(stats_df$contrast))
astrocytoma_normal        interaction        male_female 
              2268               2268               2268 

Set up the dataset

Before we make our plot, we want to calculate a set of new values for each row; transformations of the raw statistics in our table. To do this we will use a function from the dplyr package called mutate() to make a new column of -log10 p values.

# add a `neg_log10_p` column to the data frame
stats_df <- mutate(stats_df, # data frame we'd like to add a variable to
  neg_log10_p = -log10(p_value) # column name and values
)

Let’s filter to only male_female contrast data. First let’s try out a logical expression:

stats_df$contrast == "male_female"

Now we can try out the filter() function. Notice that we are not assigning the results to a variable, so this filtered dataset will not be saved to the environment.

# filter stats_df to "male_female" only
filter(stats_df, contrast == "male_female")

Now we can assign the results to a new data frame: male_female_df.

# filter and save to male_female_df
male_female_df <- filter(stats_df, contrast == "male_female")

Plotting this data

Let’s make a volcano plot with this data. First let’s take a look at only the tumor vs. normal comparison. Let’s save this as a separate data frame by assigning it a new name.

tumor_normal_df <- filter(stats_df, contrast == "astrocytoma_normal")

To make this plot we will be using functions from the ggplot2 package, the main plotting package of the tidyverse. We use the first function, ggplot() to define the data that will be plotted. Remember, the name of this package is ggplot2, but the function we use is called ggplot() without the 2. ggplot() takes two main arguments:

  1. data, which is the data frame that contains the data we want to plot.
  2. mapping, which is a special list made with the aes() function to describe which values will be used for each aesthetic component of the plot, such as the x and y coordinates of each point. (If you find calling things like the x and y coordinates “aesthetics” confusing, don’t worry, you are not alone.) Specifically, the aes() function is used to specify that a given column (variable) in your data frame be mapped to a given aesthetic component of the plot.
ggplot(
  tumor_normal_df, # This first argument is the data frame with the data we want to plot
  aes(
    x = log_fold_change, # This is the column name of the values we want to use
    # for the x coordinates
    y = neg_log10_p
  ) # This is the column name of the data we want for the y-axis
)

You’ll notice this plot doesn’t have anything on it because we haven’t specified a plot type yet. To do that, we will add another ggplot layer with + which will specify exactly what we want to plot. A volcano plot is a special kind of scatter plot, so to make that we will want to plot individual points, which we can do with geom_point().

# This first part is the same as before
ggplot(
  tumor_normal_df,
  aes(
    x = log_fold_change,
    y = neg_log10_p
  )
) +
  # Now we are adding on a layer to specify what kind of plot we want
  geom_point()