This notebook will demonstrate how to:
ggplot2
to plot and visualize dataggplot2
We’ll use a real gene expression dataset to get comfortable making visualizations using ggplot2. We’ve performed differential expression analyses on a pre-processed astrocytoma microarray dataset. We’ll start by making a volcano plot of differential gene expression results from this experiment. We performed three sets of contrasts:
sex
category contrasting: Male
vs
Female
tissue
category contrasting :
Pilocytic astrocytoma tumor
samples vs
normal cerebellum
samplessex
and
tissue
.More ggplot2 resources:
We saved these results to a tab separated values (TSV) file called
gene_results_GSE44971.tsv
. It’s been saved to the
data
folder. File paths are relative to where this notebook
file (.Rmd) is saved. So we can reference it later, let’s make a
variable with our data directory name.
data_dir <- "data"
Let’s declare our output folder name as its own variable.
plots_dir <- "plots"
We can also create a directory if it doesn’t already exist.
There’s a couple ways that we can create that directory from within
R. One way is to use the base R function dir.create()
,
which (as the name suggests) will create a directory. But this function
assumes that the directory does not yet exist, and it will throw an
error if you try to create a directory that already exists. To avoid
this error, we can place the directory creation inside an
if
statement, so the code will only run if the directory
does not yet exist:
# The if statement here tests whether the plot directory exists and
# only executes the expressions between the braces if it does not.
if (!dir.exists(plots_dir)) {
dir.create(plots_dir)
}
In this notebook we will be using functions from the Tidyverse set of
packages, so we need to load in those functions using
library()
. We could load the individual packages we need
one at a time, but it is convenient for now to load them all with the
tidyverse
“package,” which groups many of them together as
a shortcut. Keep a look out for where we tell you which individual
package different functions come from.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Here we are using a tidyverse
function
read_tsv()
from the readr
package. Like we did
in the previous notebook, we will store the resulting data frame as
stats_df
.
# read in the file `gene_results_GSE44971.tsv` from the data directory
stats_df <- read_tsv(file.path(
data_dir,
"gene_results_GSE44971.tsv"
))
Rows: 6804 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): ensembl_id, gene_symbol, contrast
dbl (5): log_fold_change, avg_expression, t_statistic, p_value, adj_p_value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We can take a look at a column individually by using a
$
. Note we are using head()
so the whole thing
doesn’t print out.
head(stats_df$contrast)
[1] "male_female" "male_female" "male_female" "male_female" "male_female"
[6] "male_female"
male_female
male_female
male_female
male_female
male_female
male_female
If we want to see a specific set of values, we can use brackets with the indices of the values we’d like returned.
stats_df$avg_expression[6:10]
[1] 19.084011 8.453933 5.116563 6.345609 25.473133
Let’s look at some basic statistics from the data set using
summary()
# summary of stats_df
summary(stats_df)
ensembl_id gene_symbol contrast log_fold_change
Length:6804 Length:6804 Length:6804 Min. :-180.8118
Class :character Class :character Class :character 1st Qu.: -1.6703
Mode :character Mode :character Mode :character Median : 0.1500
Mean : 0.2608
3rd Qu.: 2.1049
Max. : 129.3009
avg_expression t_statistic p_value adj_p_value
Min. : 5.003 Min. :-32.84581 Min. :0.00000 Min. :0.00000
1st Qu.: 6.304 1st Qu.: -1.16444 1st Qu.:0.01309 1st Qu.:0.05657
Median : 8.482 Median : 0.10619 Median :0.18919 Median :0.41354
Mean : 13.847 Mean : -0.00819 Mean :0.31223 Mean :0.44833
3rd Qu.: 14.022 3rd Qu.: 1.46589 3rd Qu.:0.57634 3rd Qu.:0.82067
Max. :190.708 Max. : 10.48302 Max. :0.99979 Max. :0.99988
The statistics for contrast
are not very informative, so
let’s do that again with just the contrast
column after
converting it to a factor
# summary of `stats_df$contrast` as a factor
summary(as.factor(stats_df$contrast))
astrocytoma_normal interaction male_female
2268 2268 2268
Before we make our plot, we want to calculate a set of new values for
each row; transformations of the raw statistics in our table. To do this
we will use a function from the dplyr
package called
mutate()
to make a new column of -log10 p values.
# add a `neg_log10_p` column to the data frame
stats_df <- mutate(stats_df, # data frame we'd like to add a variable to
neg_log10_p = -log10(p_value) # column name and values
)
Let’s filter to only male_female
contrast data. First
let’s try out a logical expression:
stats_df$contrast == "male_female"
Now we can try out the filter()
function. Notice that we
are not assigning the results to a variable, so this filtered dataset
will not be saved to the environment.
# filter stats_df to "male_female" only
filter(stats_df, contrast == "male_female")
Now we can assign the results to a new data frame:
male_female_df
.
# filter and save to male_female_df
male_female_df <- filter(stats_df, contrast == "male_female")
Let’s make a volcano plot with this data. First let’s take a look at only the tumor vs. normal comparison. Let’s save this as a separate data frame by assigning it a new name.
tumor_normal_df <- filter(stats_df, contrast == "astrocytoma_normal")
To make this plot we will be using functions from the
ggplot2
package, the main plotting package of the
tidyverse. We use the first function, ggplot()
to define
the data that will be plotted. Remember, the name of this package is
ggplot2
, but the function we use is called
ggplot()
without the 2
. ggplot()
takes two main arguments:
data
, which is the data frame that contains the data we
want to plot.mapping
, which is a special list made with the
aes()
function to describe which values will be used for
each aesthetic component of the plot, such as the x and
y coordinates of each point. (If you find calling things like the x and
y coordinates “aesthetics” confusing, don’t worry, you are not alone.)
Specifically, the aes()
function is used to specify that a
given column (variable) in your data frame be mapped to a given
aesthetic component of the plot.ggplot(
tumor_normal_df, # This first argument is the data frame with the data we want to plot
aes(
x = log_fold_change, # This is the column name of the values we want to use
# for the x coordinates
y = neg_log10_p
) # This is the column name of the data we want for the y-axis
)
You’ll notice this plot doesn’t have anything on it because we
haven’t specified a plot type yet. To do that, we will add another
ggplot layer with +
which will specify exactly what we want
to plot. A volcano plot is a special kind of scatter plot, so to make
that we will want to plot individual points, which we can do with
geom_point()
.
# This first part is the same as before
ggplot(
tumor_normal_df,
aes(
x = log_fold_change,
y = neg_log10_p
)
) +
# Now we are adding on a layer to specify what kind of plot we want
geom_point()
Here’s a brief summary of ggplot2 structure.
Now that we have a base plot that shows our data, we can add layers
on to it and adjust it. We can adjust the color of points using the
color
aesthetic.
ggplot(
tumor_normal_df,
aes(
x = log_fold_change,
y = neg_log10_p,
color = avg_expression
) # We added this argument to color code the points!
) +
geom_point()
Because we have so many points overlapping one another, we will want
to adjust the transparency, which we can do with an alpha
argument.
ggplot(
tumor_normal_df,
aes(
x = log_fold_change,
y = neg_log10_p,
color = avg_expression
)
) +
geom_point(alpha = 0.2) # We are using the `alpha` argument to make our points transparent
Notice that we added the alpha within the geom_point()
function, not to the aes()
. We did this because we want all
of the points to have the same level of transparency, and it will not
vary depending on any variable in the data. We can also change the
background and appearance of the plot as a whole by adding a
theme
.
ggplot(
tumor_normal_df,
aes(
x = log_fold_change,
y = neg_log10_p,
color = avg_expression
)
) +
geom_point(alpha = 0.2) +
# Add on this set of appearance presets to make it pretty
theme_bw()
We are not limited to a single plotting layer. For example, if we
want to add a horizontal line to indicate a significance cutoff, we can
do that with geom_hline()
. For now, we will choose the
value of 5.5 (that is close to a Bonferroni correction) and add that to
the plot.
ggplot(
tumor_normal_df,
aes(
x = log_fold_change,
y = neg_log10_p,
color = avg_expression
)
) +
geom_point(alpha = 0.2) +
geom_hline(yintercept = 5.5, color = "darkgreen") # we can specify colors by names here
We can change the x and y labels using a few different strategies.
One approach is to use functions xlab()
and
ylab()
individually to set, respectively, the x-axis label
and the the y-axis label.
ggplot(
tumor_normal_df,
aes(
x = log_fold_change,
y = neg_log10_p,
color = avg_expression
)
) +
geom_point(alpha = 0.2) +
geom_hline(yintercept = 5.5, color = "darkgreen") +
theme_bw() +
# Add labels with separate functions:
xlab("log2 Fold Change Tumor/Normal") +
ylab("-log10 p value")
Alternatively, we can use the ggplot2
function
labs()
, which takes individual arguments for each label we
want want to set. We can also include the argument title
to
add an overall plot title.
ggplot(
tumor_normal_df,
aes(
x = log_fold_change,
y = neg_log10_p,
color = avg_expression
)
) +
geom_point(alpha = 0.2) +
geom_hline(yintercept = 5.5, color = "darkgreen") +
theme_bw() +
# Add x and y labels and overall plot title with arguments to labs():
labs(
x = "log2 Fold Change Tumor/Normal",
y = "-log10 p value",
title = "Astrocytoma Tumor vs Normal Cerebellum"
)
Something great about the labs()
function is you can
also use it to specify labels for your legends derived from
certain aesthetics. In this plot, our legend is derived from a color
aesthetic, so we can specify the keyword “color” to update the
legend title.
ggplot(
tumor_normal_df,
aes(
x = log_fold_change,
y = neg_log10_p,
color = avg_expression
)
) +
geom_point(alpha = 0.2) +
geom_hline(yintercept = 5.5, color = "darkgreen") +
theme_bw() +
# Add x and y labels and overall plot title (and more!) with arguments to labs():
labs(
x = "log2 Fold Change Tumor/Normal",
y = "-log10 p value",
title = "Astrocytoma Tumor vs Normal Cerebellum",
# Use the color keyword to label the color legend
color = "Average expression"
)
Use this chunk to make the same kind of plot as the previous chunk
but instead plot the male female contrast data, that is stored in
male_female_df
.
# Use this chunk to make the same kind of volcano plot, but with the male-female contrast data.
ggplot(
male_female_df,
aes(
x = log_fold_change,
y = neg_log10_p,
color = avg_expression
)
) +
geom_point(alpha = 0.2) +
geom_hline(yintercept = 5.5, color = "darkgreen") +
theme_bw() +
labs(
x = "log2 Fold Change Male/Female",
y = "-log10 p value",
color = "Average expression"
)
Turns out, we don’t have to plot each contrast separately, instead,
we can use the original data frame that contains all three contrasts’
data, stats_df
, and add a facet_wrap
to make
each contrast its own plot.
ggplot(
stats_df, # Switch to the bigger data frame with all three contrasts' data
aes(
x = log_fold_change,
y = neg_log10_p,
color = avg_expression
)
) +
geom_point(alpha = 0.2) +
geom_hline(yintercept = 5.5, color = "darkgreen") +
theme_bw() +
facet_wrap(vars(contrast)) +
labs(
# Now that this includes the other contrasts,
# we'll make the x-axis label more general
x = "log2 Fold Change",
y = "-log10 p value",
color = "Average expression"
) +
coord_cartesian(xlim = c(-25, 25)) # zoom in on the x-axis
We can store the plot as an object in the global environment by using
<-
operator. Here we will call this
volcano_plot
.
# We are saving this plot to a variable named `volcano_plot`
volcano_plot <- ggplot(
stats_df,
aes(
x = log_fold_change,
y = neg_log10_p,
color = avg_expression
)
) +
geom_point(alpha = 0.2) +
geom_hline(yintercept = 5.5, color = "darkgreen") +
theme_bw() +
facet_wrap(vars(contrast)) +
labs(
x = "log2 Fold Change",
y = "-log10 p value",
color = "Average expression"
) +
coord_cartesian(xlim = c(-25, 25))
When we are happy with our plot, we can save the plot using
ggsave
. It’s a good idea to also specify width
and height
arguments (units in inches) to ensure the saved
plot is always the same size every time you run this code. Here, we’ll
save a 6”x6” plot.
ggsave(
plot = volcano_plot,
filename = file.path(plots_dir, "volcano_plot.png"),
width = 6,
height = 6
)
# Print out the versions and packages we are using in this session
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
[5] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[9] ggplot2_3.5.1 tidyverse_2.0.0 optparse_1.7.5
loaded via a namespace (and not attached):
[1] sass_0.4.9 utf8_1.2.4 generics_0.1.3 stringi_1.8.3
[5] hms_1.1.3 digest_0.6.35 magrittr_2.0.3 evaluate_0.23
[9] grid_4.4.1 timechange_0.3.0 fastmap_1.1.1 jsonlite_1.8.8
[13] fansi_1.0.6 scales_1.3.0 textshaping_0.3.7 getopt_1.20.4
[17] jquerylib_0.1.4 cli_3.6.2 crayon_1.5.2 rlang_1.1.3
[21] bit64_4.0.5 munsell_0.5.1 withr_3.0.0 cachem_1.0.8
[25] yaml_2.3.8 parallel_4.4.1 tools_4.4.1 tzdb_0.4.0
[29] colorspace_2.1-0 vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4
[33] bit_4.0.5 vroom_1.6.5 ragg_1.3.0 pkgconfig_2.0.3
[37] pillar_1.9.0 bslib_0.7.0 gtable_0.3.5 glue_1.7.0
[41] systemfonts_1.0.6 highr_0.10 xfun_0.43 tidyselect_1.2.1
[45] knitr_1.46 farver_2.1.1 htmltools_0.5.8.1 labeling_0.4.3
[49] rmarkdown_2.26 compiler_4.4.1