This notebook will demonstrate how to:
ggplot2
to plot and visualize dataggplot2
We’ll use a real gene expression dataset to get comfortable making visualizations using ggplot2. We’ve performed differential expression analyses on a pre-processed astrocytoma microarray dataset. We’ll start by making a volcano plot of differential gene expression results from this experiment. We performed three sets of contrasts:
sex
category contrasting: Male
vs
Female
tissue
category contrasting :
Pilocytic astrocytoma tumor
samples vs
normal cerebellum
samplessex
and
tissue
.More ggplot2 resources:
We saved these results to a tab separated values (TSV) file called
gene_results_GSE44971.tsv
. It’s been saved to the
data
folder. File paths are relative to where this notebook
file (.Rmd) is saved. So we can reference it later, let’s make a
variable with our data directory name.
data_dir <- "data"
Let’s declare our output folder name as its own variable.
plots_dir <- "plots"
We can also create a directory if it doesn’t already exist.
There’s a couple ways that we can create that directory from within
R. One way is to use the base R function dir.create()
,
which (as the name suggests) will create a directory. But this function
assumes that the directory does not yet exist, and it will throw an
error if you try to create a directory that already exists. To avoid
this error, we can place the directory creation inside an
if
statement, so the code will only run if the directory
does not yet exist:
# The if statement here tests whether the plot directory exists and
# only executes the expressions between the braces if it does not.
if (!dir.exists(plots_dir)) {
dir.create(plots_dir)
}
In this notebook we will be using functions from the Tidyverse set of
packages, so we need to load in those functions using
library()
. We could load the individual packages we need
one at a time, but it is convenient for now to load them all with the
tidyverse
“package,” which groups many of them together as
a shortcut. Keep a look out for where we tell you which individual
package different functions come from.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Here we are using a tidyverse
function
read_tsv()
from the readr
package. Like we did
in the previous notebook, we will store the resulting data frame as
stats_df
.
# read in the file `gene_results_GSE44971.tsv` from the data directory
stats_df <- read_tsv(file.path(
data_dir,
"gene_results_GSE44971.tsv"
))
Rows: 6804 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): ensembl_id, gene_symbol, contrast
dbl (5): log_fold_change, avg_expression, t_statistic, p_value, adj_p_value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We can take a look at a column individually by using a
$
. Note we are using head()
so the whole thing
doesn’t print out.
head(stats_df$contrast)
[1] "male_female" "male_female" "male_female" "male_female" "male_female"
[6] "male_female"
male_female
male_female
male_female
male_female
male_female
male_female
If we want to see a specific set of values, we can use brackets with the indices of the values we’d like returned.
stats_df$avg_expression[6:10]
[1] 19.084011 8.453933 5.116563 6.345609 25.473133
Let’s look at some basic statistics from the data set using
summary()
# summary of stats_df
summary(stats_df)
ensembl_id gene_symbol contrast log_fold_change
Length:6804 Length:6804 Length:6804 Min. :-180.8118
Class :character Class :character Class :character 1st Qu.: -1.6703
Mode :character Mode :character Mode :character Median : 0.1500
Mean : 0.2608
3rd Qu.: 2.1049
Max. : 129.3009
avg_expression t_statistic p_value adj_p_value
Min. : 5.003 Min. :-32.84581 Min. :0.00000 Min. :0.00000
1st Qu.: 6.304 1st Qu.: -1.16444 1st Qu.:0.01309 1st Qu.:0.05657
Median : 8.482 Median : 0.10619 Median :0.18919 Median :0.41354
Mean : 13.847 Mean : -0.00819 Mean :0.31223 Mean :0.44833
3rd Qu.: 14.022 3rd Qu.: 1.46589 3rd Qu.:0.57634 3rd Qu.:0.82067
Max. :190.708 Max. : 10.48302 Max. :0.99979 Max. :0.99988
The statistics for contrast
are not very informative, so
let’s do that again with just the contrast
column after
converting it to a factor
# summary of `stats_df$contrast` as a factor
summary(as.factor(stats_df$contrast))
astrocytoma_normal interaction male_female
2268 2268 2268
Before we make our plot, we want to calculate a set of new values for
each row; transformations of the raw statistics in our table. To do this
we will use a function from the dplyr
package called
mutate()
to make a new column of -log10 p values.
# add a `neg_log10_p` column to the data frame
stats_df <- mutate(stats_df, # data frame we'd like to add a variable to
neg_log10_p = -log10(p_value) # column name and values
)
Let’s filter to only male_female
contrast data. First
let’s try out a logical expression:
stats_df$contrast == "male_female"
Now we can try out the filter()
function. Notice that we
are not assigning the results to a variable, so this filtered dataset
will not be saved to the environment.
# filter stats_df to "male_female" only
filter(stats_df, contrast == "male_female")
Now we can assign the results to a new data frame:
male_female_df
.
# filter and save to male_female_df
male_female_df <- filter(stats_df, contrast == "male_female")
Let’s make a volcano plot with this data. First let’s take a look at only the tumor vs. normal comparison. Let’s save this as a separate data frame by assigning it a new name.
tumor_normal_df <- filter(stats_df, contrast == "astrocytoma_normal")
To make this plot we will be using functions from the
ggplot2
package, the main plotting package of the
tidyverse. We use the first function, ggplot()
to define
the data that will be plotted. Remember, the name of this package is
ggplot2
, but the function we use is called
ggplot()
without the 2
. ggplot()
takes two main arguments:
data
, which is the data frame that contains the data we
want to plot.mapping
, which is a special list made with the
aes()
function to describe which values will be used for
each aesthetic component of the plot, such as the x and
y coordinates of each point. (If you find calling things like the x and
y coordinates “aesthetics” confusing, don’t worry, you are not alone.)
Specifically, the aes()
function is used to specify that a
given column (variable) in your data frame be mapped to a given
aesthetic component of the plot.ggplot(
tumor_normal_df, # This first argument is the data frame with the data we want to plot
aes(
x = log_fold_change, # This is the column name of the values we want to use
# for the x coordinates
y = neg_log10_p
) # This is the column name of the data we want for the y-axis
)
You’ll notice this plot doesn’t have anything on it because we
haven’t specified a plot type yet. To do that, we will add another
ggplot layer with +
which will specify exactly what we want
to plot. A volcano plot is a special kind of scatter plot, so to make
that we will want to plot individual points, which we can do with
geom_point()
.
# This first part is the same as before
ggplot(
tumor_normal_df,
aes(
x = log_fold_change,
y = neg_log10_p
)
) +
# Now we are adding on a layer to specify what kind of plot we want
geom_point()