Objectives
This notebook will demonstrate how to:
- Use functions from the tidyverse to read and write data frames
- Implement and use tidyverse functions to wrangle data (i.e. filter,
mutate, arrange, join)
- Use R pipes (
|>
) to combine multiple operations
- Use the
apply()
function to apply functions across rows
or columns of a matrix
We’ll use the same gene expression dataset we used in the previous notebook. It is a
pre-processed astrocytoma
microarray dataset that we performed a set of differential expression analyses
on.
More tidyverse resources:
Set Up
The tidyverse is a collection of packages that are handy for general
data wrangling, analysis, and visualization. Other packages that are
specifically handy for different biological analyses are found on Bioconductor. If we want to use
a package’s functions we first need to install them.
Our RStudio Server already has the tidyverse
group of
packages installed for you. But if you needed to install it or other
packages available on CRAN, you do it using the
install.packages()
function like this:
install.packages("tidyverse")
.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Referencing a library’s function with ::
Note that if we had not imported the tidyverse set of packages using
library()
like above, and we wanted to use a tidyverse
function like read_tsv()
, we would need to tell R what
package to find this function in. To do this, we would use
::
to tell R to load in this function from the
readr
package by using readr::read_tsv()
. You
will see this ::
method of referencing libraries within
packages throughout the course. We like to use it in part to remove any
ambiguity in which version of a function we are using; it is not too
uncommon for different packages to use the same name for very different
functions!
Managing directories
Before we can import the data we need, we should double check where R
is looking for files, aka the current working
directory. We can do this by using the getwd()
function, which will tell us what folder we are in.
# Let's check what directory we are in:
getwd()
[1] "/__w/training-modules/training-modules/intro-to-R-tidyverse"
/__w/training-modules/training-modules/intro-to-R-tidyverse
For Rmd files, the working directory is wherever the file is located,
but commands executed in the console may have a different working
directory.
We will want to make a directory for our output and we will call this
directory: results
. But before we create the directory, we
should check if it already exists. We will show two ways that we can do
this.
First, we can use the dir()
function to have R list the
files in our working directory.
# Let's check what files are here
dir()
[1] "00a-rstudio_guide.Rmd"
[2] "00b-debugging_resources.Rmd"
[3] "00c-good_scientific_coding_practices.Rmd"
[4] "01-intro_to_base_R-live.Rmd"
[5] "01-intro_to_base_R.nb.html"
[6] "01-intro_to_base_R.Rmd"
[7] "02-intro_to_ggplot2-live.Rmd"
[8] "02-intro_to_ggplot2.nb.html"
[9] "02-intro_to_ggplot2.Rmd"
[10] "03-intro_to_tidyverse-live.Rmd"
[11] "03-intro_to_tidyverse.nb.html"
[12] "03-intro_to_tidyverse.Rmd"
[13] "data"
[14] "diagrams"
[15] "exercise_01-intro_to_base_R.Rmd"
[16] "exercise_02-intro_to_R.Rmd"
[17] "exercise_03a-intro_to_tidyverse.Rmd"
[18] "exercise_03b-intro_to_tidyverse.Rmd"
[19] "plots"
[20] "README.md"
[21] "screenshots"
[22] "scripts"
00a-rstudio_guide.Rmd
00b-debugging_resources.Rmd
00c-good_scientific_coding_practices.Rmd
01-intro_to_base_R-live.Rmd
01-intro_to_base_R.nb.html
01-intro_to_base_R.Rmd
02-intro_to_ggplot2-live.Rmd
02-intro_to_ggplot2.nb.html
02-intro_to_ggplot2.Rmd
03-intro_to_tidyverse-live.Rmd
03-intro_to_tidyverse.nb.html
03-intro_to_tidyverse.Rmd
data
diagrams
exercise_01-intro_to_base_R.Rmd
exercise_02-intro_to_R.Rmd
exercise_03a-intro_to_tidyverse.Rmd
exercise_03b-intro_to_tidyverse.Rmd
plots
README.md
screenshots
scripts
This shows us there is no folder called “results” yet.
If we want to more pointedly look for “results” in our working
directory we can use the dir.exists()
function.
# Check if the results directory exists
dir.exists("results")
[1] FALSE
If the above says FALSE
that means we will need to
create a results
directory. We’ve previously seen that we
can make directories in R using the base R function
dir.create()
. But we’ve also seen that this function will
throw an error if you try to create a directory that already exists,
which can be frustrating if you are re-running code! A different option
is to use the fs
package, which provides functions for you to interact with your
computer’s file system with a more consistent behavior than the base R
functions. One function from this package is
fs::dir_create()
(note that it has an underscore,
not a period), and much like the base R dir.create()
, it
creates directories. It has some other helpful features too: - It will
simply do nothing if that directory already exists; no errors, and
nothing will get overwritten - It allows creating nested
directories by default, i.e. in one call make directories inside of
other directories
Let’s go ahead and use it to create our results
directory:
# Make a directory within the working directory called 'results'
fs::dir_create("results")
After creating the results directory above, let’s re-run
dir.exists()
to see if now it exists.
# Re-check if the results directory exists
dir.exists("results")
[1] TRUE
The dir.exists()
function will not work on files
themselves. In that case, there is an analogous function called
file.exists()
.
Try using the file.exists()
function to see if the file
gene_results_GSE44971.tsv
exists in the current directory.
Use the code chunk we set up for you below. Note that in our notebooks
(and sometimes elsewhere), wherever you see a
<FILL_IN_THE_BLANK>
like in the chunk below, that is
meant for you to replace (including the angle brackets) with the correct
phrase before you run the chunk (otherwise you will get an error).
# Replace the <PUT_FILE_NAME_HERE> with the name of the file you are looking for
# Remember to use quotes to make it a character string
file.exists(<PUT_FILE_NAME_HERE>)
It doesn’t seem that file exists in our current directory,
but that doesn’t mean it doesn’t exist it all. In fact, this file is
inside the relative path data/
, so let’s check
again if the whole relative path to that file exists.
# This time, use file.path() to form your argument to file.exists()
file.exists(<PUT_PATH_TO_FILE_HERE>)
With the right relative path, we can confirm this file exists.
Read a TSV file
Declare the name of the directory where we will read in the data.
data_dir <- "data"
Although base R has functions to read in data files, the functions in
the readr
package (part of the tidyverse) are faster and
more straightforward to use so we are going to use those here. Because
the file we are reading in is a TSV (tab separated values) file we will
be using the read_tsv
function. There are analogous
functions for CSV (comma separated values) files
(read_csv()
) and other files types.
Read in the differential expression analysis results file
stats_df <- readr::read_tsv(
file.path(data_dir,
"gene_results_GSE44971.tsv")
)
Rows: 6804 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): ensembl_id, gene_symbol, contrast
dbl (5): log_fold_change, avg_expression, t_statistic, p_value, adj_p_value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Following the template of the previous chunk, use this chunk to read
in the file GSE44971.tsv
that is in the data
folder and save it in the variable gene_df
.
# Use this chunk to read in data from the file `GSE44971.tsv`
gene_df <- readr::read_tsv(
file.path(data_dir,
"GSE44971.tsv")
)
Rows: 20056 Columns: 59
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): Gene
dbl (58): GSM1094814, GSM1094815, GSM1094816, GSM1094817, GSM1094818, GSM109...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Use this chunk to explore what gene_df
looks like.
# Explore `gene_df`
What information is contained in gene_df
?
R
pipes
One nifty feature that was added to R
in version 4.1 is
the pipe: |>
. Pipes are very handy things that allow you
to funnel the result of one expression to the next, making your code
more streamlined and fluently expressing the flow of data through a
series of operations.
Note: If you are using a version of R
prior to
4.1 (or looking at older code), pipe functionality was available through
the magrittr
package, which used a pipe that looked like
this: %>%
. That pipe was the inspiration for the native
R pipe we are using here. While there are some minor differences, you
can mostly treat them interchangeably as long as you load the
magrittr
package or dplyr
, which also loads
that version of the pipe.
For example, the output from this:
filter(stats_df, contrast == "male_female")
…is the same as the output from this:
stats_df |> filter(contrast == "male_female")
This can make your code cleaner and easier to follow a series of
related commands. Let’s look at an example with our stats of of how the
same functions look with or without pipes:
Example 1: without pipes:
stats_arranged <- arrange(stats_df, t_statistic)
stats_filtered <- filter(stats_arranged, avg_expression > 50)
stats_nopipe <- select(stats_filtered, contrast, log_fold_change, p_value)
UGH, we have to keep track of all of those different intermediate
data frames and type their names so many times here! We could maybe
streamline things by using the same variable name at each stage, but
even then there is a lot of extra typing, and it is easy to get confused
about what has been done where. It’s annoying and makes it harder for
people to read.
Example 2: Same result as 1 but with pipes!
# Example of the same modifications as above but with pipes!
stats_pipe <- stats_df |>
arrange(t_statistic) |>
filter(avg_expression > 50) |>
select(contrast, log_fold_change, p_value)
What the |>
(pipe) is doing here is feeding the
result of the expression on its left into the first argument of the next
function (to its right, or on the next line here). We can then skip that
first argument (the data in these cases), and move right on to the part
we care about at that step: what we are arranging, filtering, or
selecting in this case. The key insight that makes the pipe work here is
to recognize that each of these functions (arrange
,
filter
, and select
) are fundamental
dplyr
(a tidyverse package) functions which work as “data
in, data out.” In other words, these functions operate on data frames,
and return data frames; you give them a data frame, and they give you
back a data frame. Because these functions all follow a “data in, data
out” framework, we can chain them together with pipe and send data all
the way through the…pipeline!
Let’s double check that these versions with and without pipe yield
the same solution by using the base R function
all.equal()
.
all.equal(stats_nopipe, stats_pipe)
[1] TRUE
all.equal()
is letting us know that these two objects
are the same.
Now that hopefully you are convinced that the tidyverse can help you
make your code neater and easier to use and read, let’s go through some
of the popular tidyverse functions and so we can create pipelines like
this.
Common tidyverse functions
Let’s say we wanted to filter this gene expression dataset to
particular sample groups. In order to do this, we would use the function
filter()
as well as a logic statement (usually one that
refers to a column or columns in the data frame).
# Here let's filter stats_df to only keep the gene_symbol "SNCA"
stats_df |>
filter(gene_symbol == "SNCA")
We can use filter()
similarly for numeric
statements.
# Here let's filter the data to rows with average expression values above 50
stats_df |>
filter(avg_expression > 50)
We can apply multiple filters at once, which will require all of them
to be satisfied for every row in the results:
# filter to highly expressed genes with contrast "male_female"
stats_df |>
filter(contrast == "male_female",
avg_expression > 50)
When we are filtering, the %in%
operator can come in
handy if we have multiple items we would like to match. Let’s take a
look at what using %in%
does.
genes_of_interest <- c("SNCA", "CDKN1A")
# Are these genes present in the `gene_symbol` column in stats_df?
stats_df$gene_symbol %in% genes_of_interest
%in%
returns a logical vector that now we can use in
dplyr::filter
.
# filter to keep only genes of interest
stats_df |>
filter(gene_symbol %in% c("SNCA", "CDKN1A"))
Let’s return to our first filter()
and build on to it.
This time, let’s keep only some of the columns from the data frame using
the select()
function. Let’s also save this as a new data
frame called stats_filtered_df
.
# filter to highly expressed "male_female"
# and select gene_symbol, log_fold_change and t_statistic
stats_filtered_df <- stats_df |>
filter(contrast == "male_female",
avg_expression > 50) |>
select(log_fold_change, t_statistic)
Let’s say we wanted to arrange this dataset so that the genes are
arranged by the smallest p values to the largest. In order to do this,
we would use the function arrange()
as well as the column
we would like to sort by (in this case p_value
).
stats_df |>
arrange(p_value)
What if we want to sort from largest to smallest? Like if we want to
see the genes with the highest average expression? We can use the same
function, but instead use the desc()
function and now we
are using avg_expression
column.
# arrange descending by avg_expression
stats_df |>
arrange(desc(avg_expression))
What if we would like to create a new column of values? For that we
use mutate()
function.
stats_df |>
mutate(log10_p_value = -log10(p_value))
What if we want to obtain summary statistics for a column or columns?
The summarize
function allows us to calculate summary
statistics for a column. Here we will use summarize to calculate two
summary statistics of log-fold change across all genes: mean (function
mean()
) and standard deviation (function
sd()
).
stats_df |>
summarize(mean(log_fold_change),
sd(log_fold_change))
What if we’d like to obtain a summary statistics but have them for
various groups? Conveniently named, there’s a function called
group_by()
that seamlessly allows us to do this. Also note
that group_by()
allows us to group by multiple variables at
a time if you want to.
stats_summary_df <- stats_df |>
group_by(contrast) |>
summarize(mean(log_fold_change),
sd(log_fold_change))
Let’s look at a preview of what we made:
stats_summary_df
Here we have the mean log fold change expression per each contrast we
made.
A brief intro to the apply
family of functions
In base R, the apply
family of functions can be an
alternative methods for performing transformations across a data frame,
matrix or other object structures.
One of this family is (shockingly) the function apply()
,
which operates on matrices.
A matrix is similar to a data frame in that it is a rectangular table
of data, but it has an additional constraint: rather than each column
having a type, ALL data in a matrix has the same type.
The first argument to apply()
is the data object we want
to work on. The third argument is the function we will apply to each row
or column of the data object. The second argument in specifies whether
we are applying the function across rows or across columns (1 for rows,
2 for columns).
Remember that gene_df
is a gene x sample gene expression
data frame that has columns of two different types, character and
numeric. Converting it to a matrix will require us to make them all the
same type. We can coerce it into a matrix using
as.matrix()
, in which case R will pick a type that it can
convert everything to. What does it choose?
# Coerce `gene_df` into a matrix
gene_matrix <- as.matrix(gene_df)
# Explore the structure of the `gene_matrix` object
str(gene_matrix)
chr [1:20056, 1:59] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:59] "Gene" "GSM1094814" "GSM1094815" "GSM1094816" ...
While that worked, it is rare that we want numbers converted to text,
so we are going to select only the columns with numeric values before
converting it to a matrix. We can do this most easily by removing the
first column, which contains the gene names stored as character
values.
# Let's save a new matrix object names `gene_num_matrix` containing only
# the numeric values
gene_num_matrix <- as.matrix(gene_df[, -1])
# Explore the structure of the `gene_num_matrix` object
str(gene_num_matrix)
num [1:20056, 1:58] 9.5951 -0.0436 8.5246 1.6013 0.6189 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:58] "GSM1094814" "GSM1094815" "GSM1094816" "GSM1094817" ...
Why do we have a [, -1]
after gene_df
in
the above chunk?
Now that the matrix is all numbers, we can do things like calculate
the column or row statistics using apply()
.
# Calculate row means
gene_means <- apply(gene_num_matrix, 1, mean) # Notice we are using 1 here
# How long will `gene_means` be?
length(gene_means)
[1] 20056
Note that we can obtain the same results if we select just the
columns with numeric values from the gene_df
data frame.
This allows R to do the as.matrix() coercion automatically, and can be a
handy shortcut if you have a mostly numeric data frame.
# Calculate row means using the `gene_df` object after removing the character column
# apply() converts this to a matrix internally
gene_means_from_df <- apply(gene_df[, -1], 1, mean)
# Let's check that the two gene means objects are equal
all.equal(gene_means, gene_means_from_df)
[1] TRUE
Now let’s investigate the same set up, but use 2 to
apply
over the columns of our matrix.
# Calculate sample means
sample_means <- apply(gene_num_matrix, 2, mean) # Notice we use 2 here
# How long will `sample_means` be?
length(sample_means)
[1] 58
We can put the gene names back into the numeric matrix object by
assigning them as rownames.
# Assign the gene names from gene_df$Gene to the `gene_num_matrix` object using
# the `rownames()` function
rownames(gene_num_matrix) <- gene_df$Gene
# Explore the `gene_num_matrix` object
head(gene_num_matrix)
GSM1094814 GSM1094815 GSM1094816 GSM1094817 GSM1094818
ENSG00000000003 9.59510150 8.4785070 12.6802129 8.677614838 10.75552946
GSM1094819 GSM1094820 GSM1094821 GSM1094822 GSM1094823
ENSG00000000003 6.37470691 9.10028584 7.3546860 8.51847190 9.4216113
GSM1094824 GSM1094825 GSM1094826 GSM1094827 GSM1094828
ENSG00000000003 5.0239629 7.89737460 8.1126876 7.03444640 9.6984918
GSM1094829 GSM1094830 GSM1094831 GSM1094832 GSM1094833
ENSG00000000003 13.98689230 10.5868331 7.6836223 8.3862587 11.18932763
GSM1094834 GSM1094835 GSM1094836 GSM1094837 GSM1094838
ENSG00000000003 9.7562003 9.6984918 10.56891510 9.9391025 7.8738131
GSM1094839 GSM1094840 GSM1094841 GSM1094842 GSM1094843
ENSG00000000003 8.6311353 8.58077557 9.1579585 6.3317019 10.1939387
GSM1094844 GSM1094845 GSM1094846 GSM1094847 GSM1094848
ENSG00000000003 10.44364159 9.62435722 16.05075944 6.9334508 8.55180910
GSM1094849 GSM1094850 GSM1094851 GSM1094852 GSM1094853
ENSG00000000003 9.29497760 7.5027098 6.9593119 8.33588532 8.16826110
GSM1094854 GSM1094855 GSM1094856 GSM1094857 GSM1094858
ENSG00000000003 9.8020077 7.92580451 8.5122426 8.5300217 6.45774124
GSM1094859 GSM1094860 GSM1094861 GSM1094862 GSM1094863
ENSG00000000003 8.06834906 6.29704946 9.59510150 8.2198571 6.0207988
GSM1094864 GSM1094865 GSM1094866 GSM1094867 GSM1094868
ENSG00000000003 10.60783176 10.23536609 9.031964212 7.66629540 1.06494502
GSM1094869 GSM1094870 GSM1094871
ENSG00000000003 1.0408332 1.7262079 1.0292255
[ reached getOption("max.print") -- omitted 5 rows ]
Row names like this can be very convenient for keeping matrices
organized, but row names (and column names) can be lost or misordered if
you are not careful, especially during input and output, so treat them
with care.
Although the apply
functions may not be as easy to use
as the tidyverse functions, for some applications, apply
methods can be better suited. In this workshop, we will not delve too
deeply into the various other apply functions (tapply()
,
lapply()
, etc.) but you can read more information about
them here.
The dplyr::join functions
Let’s say we have a scenario where we have two data frames that we
would like to combine. Recall that stats_df
and
gene_df
are data frames that contain information about some
of the same genes. The dplyr::join
family of functions are useful for various scenarios of combining
data frames. For a visual explanation, the tidyexplain
project has some helpful
animations of joins.
For now, we will focus on inner_join()
, which will
combine data frames by only keeping information about matching rows that
are in both data frames. We need to use the by
argument to
designate what column(s) should be used as a key to match the data
frames. In this case we want to match the gene information between the
two, so we will specify that we want to compare values in the
ensembl_id
column from stats_df
to the
Gene
column from gene_df
.
stats_df |>
# Join based on their shared column
# Called ensembl_id in stats_df and called Gene in gene_df
inner_join(gene_df, by = c('ensembl_id' = 'Gene'))
Save data to files
Save to TSV files
Let’s write some of the data frames we created to a file. To do this,
we can use the readr
library of write_()
functions. The first argument of write_tsv()
is the data we
want to write, and the second argument is a character string that
describes the path to the new file we would like to create. Remember
that we created a results
directory to put our output in,
but if we want to save our data to a directory other than our working
directory, we need to specify this. This is what we will use the
file.path()
function for. Let’s look in a bit more detail
what file.path()
does, by examining the results of the
function in the examples below.
# Which of these file paths is what we want to use to save our data to the
# results directory we created at the beginning of this notebook?
file.path("docker-install", "stats_summary.tsv")
[1] "docker-install/stats_summary.tsv"
docker-install/stats_summary.tsv
file.path("results", "stats_summary.tsv")
[1] "results/stats_summary.tsv"
results/stats_summary.tsv
file.path("stats_summary.tsv", "results")
[1] "stats_summary.tsv/results"
stats_summary.tsv/results
Replace <NEW_FILE_PATH>
below with the
file.path()
statement from above that will successfully
save our file to the results
folder.
# Write our data frame to a TSV file
readr::write_tsv(stats_summary_df, <NEW_FILE_PATH>)
Check in your results
directory to see if your new file
has successfully saved.
Save to RDS files
For this example we have been working with data frames, which are
conveniently represented as TSV or CSV tables. However, in other
situations we may want to save more complicated or very large data
structures, RDS (R Data Serialized/Single) files may be a better option
for saving our data. RDS is R’s special file format for holding data
exactly as you have it in your R environment. RDS files can also be
compressed, meaning they will take up less space on your computer. Let’s
save our data to an RDS file in our results
folder. You
will need to replace the .tsv
with .RDS
, but
you can use what we determined as our file path for the last chunk as
your template.
# Write your object to an RDS file
readr::write_rds(stats_summary_df, <PUT_CORRECT_FILE_PATH_HERE>)
Read an RDS file
Since now you have learned the readr
functions:
read_tsv()
, write_tsv()
, and now,
write_rds()
, what do you suppose the function you will need
to read your RDS file is called? Use that function here to re-import
your data in the chunk we set up for you below.
# Read in your RDS file
reimport_df <- <PUT_FUNCTION_NAME>(file.path("results", "stats_summary.RDS"))
As is good practice, we will end this session by printing out our
session info.
Session Info
# Print out the versions and packages we are using in this session
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
[5] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[9] ggplot2_3.5.1 tidyverse_2.0.0 optparse_1.7.5
loaded via a namespace (and not attached):
[1] sass_0.4.9 utf8_1.2.4 generics_0.1.3 stringi_1.8.3
[5] hms_1.1.3 digest_0.6.35 magrittr_2.0.3 evaluate_0.23
[9] grid_4.4.1 timechange_0.3.0 fastmap_1.1.1 jsonlite_1.8.8
[13] fansi_1.0.6 scales_1.3.0 getopt_1.20.4 jquerylib_0.1.4
[17] cli_3.6.2 crayon_1.5.2 rlang_1.1.3 bit64_4.0.5
[21] munsell_0.5.1 withr_3.0.0 cachem_1.0.8 yaml_2.3.8
[25] parallel_4.4.1 tools_4.4.1 tzdb_0.4.0 colorspace_2.1-0
[29] vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4 bit_4.0.5
[33] fs_1.6.4 vroom_1.6.5 pkgconfig_2.0.3 pillar_1.9.0
[37] bslib_0.7.0 gtable_0.3.5 glue_1.7.0 xfun_0.43
[41] tidyselect_1.2.1 knitr_1.46 htmltools_0.5.8.1 rmarkdown_2.26
[45] compiler_4.4.1
---
title: "Introduction to tidyverse"
author: "CCDL for ALSF"
date: 2021
output:
  html_notebook:
    toc: true
    toc_float: true
editor_options:
  chunk_output_type: inline
---


## Objectives

This notebook will demonstrate how to:

- Use functions from the tidyverse to read and write data frames
- Implement and use tidyverse functions to wrangle data (i.e. filter, mutate, arrange, join)
- Use R pipes (`|>`) to combine multiple operations
- Use the `apply()` function to apply functions across rows or columns of a matrix

---

We'll use the same gene expression dataset we used in the [previous notebook](./02-intro_to_ggplot2.Rmd).
It is a pre-processed [astrocytoma microarray dataset](https://www.refine.bio/experiments/GSE44971/gene-expression-data-from-pilocytic-astrocytoma-tumour-samples-and-normal-cerebellum-controls) that we performed a set of [differential expression analyses on](./scripts/00-setup-intro-to-R.R).

**More tidyverse resources:**

- [R for Data Science](https://r4ds.hadley.nz/)
- [tidyverse documentation](https://tidyverse.org/)
  - [`dplyr` documentation](https://dplyr.tidyverse.org/)
  - [`readr` documentation](https://readr.tidyverse.org/)
- [Cheatsheet of tidyverse data transformation](https://github.com/rstudio/cheatsheets/raw/main/data-transformation.pdf)
- [Online tidyverse book chapter](https://privefl.github.io/advr38book/tidyverse.html)

## Set Up

The tidyverse is a collection of packages that are handy for general data wrangling, analysis, and visualization.
Other packages that are specifically handy for different biological analyses are found on [Bioconductor](https://www.bioconductor.org/).
If we want to use a package's functions we first need to install them.

Our RStudio Server already has the `tidyverse` group of packages installed for you.
But if you needed to install it or other packages available on CRAN, you do it using the `install.packages()` function like this: `install.packages("tidyverse")`.

```{r tidyverse}
library(tidyverse)
```

### Referencing a library's function with `::`

Note that if we had not imported the tidyverse set of packages using `library()` like above, and we wanted to use a tidyverse function like `read_tsv()`, we would need to tell R what package to find this function in.
To do this, we would use `::` to tell R to load in this function from the `readr` package by using `readr::read_tsv()`.
You will see this `::` method of referencing libraries within packages throughout the course.
We like to use it in part to remove any ambiguity in which version of a function we are using; it is not too uncommon for different packages to use the same name for very different functions!

## Managing directories

Before we can import the data we need, we should double check where R is looking for files, aka the current **working directory**.
We can do this by using the `getwd()` function, which will tell us what folder we are in.

```{r workingdir, live = TRUE}
# Let's check what directory we are in:
getwd()
```

For Rmd files, the working directory is wherever the file is located, but commands executed in the console may have a different working directory.

We will want to make a directory for our output and we will call this directory: `results`.
But before we create the directory, we should check if it already exists.
We will show two ways that we can do this.

First, we can use the `dir()` function to have R list the files in our working directory.

```{r}
# Let's check what files are here
dir()
```

This shows us there is no folder called "results" yet.

If we want to more pointedly look for "results" in our working directory we can use the `dir.exists()` function.

```{r check-dir, live = TRUE}
# Check if the results directory exists
dir.exists("results")
```

If the above says `FALSE` that means we will need to create a `results` directory.
We've previously seen that we can make directories in R using the base R function `dir.create()`.
But we've also seen that this function will throw an error if you try to create a directory that already exists, which can be frustrating if you are re-running code!
A different option is to use the [`fs`](https://fs.r-lib.org/) package, which provides functions for you to interact with your computer's file system with a more consistent behavior than the base R functions.
One function from this package is `fs::dir_create()` (note that it has an _underscore_, not a period), and much like the base R `dir.create()`, it creates directories. 
It has some other helpful features too:
- It will simply do nothing if that directory already exists; no errors, and nothing will get overwritten
- It allows creating _nested_ directories by default, i.e. in one call make directories inside of other directories

Let's go ahead and use it to create our `results` directory:

```{r create-dir, live = TRUE}
# Make a directory within the working directory called 'results'
fs::dir_create("results")
```

After creating the results directory above, let's re-run `dir.exists()` to see if now it exists.

```{r check-dir-again, live = TRUE}
# Re-check if the results directory exists
dir.exists("results")
```

The `dir.exists()` function will not work on files themselves.
In that case, there is an analogous function called `file.exists()`.

Try using the `file.exists()` function to see if the file `gene_results_GSE44971.tsv` exists in the current directory.
Use the code chunk we set up for you below.
Note that in our notebooks (and sometimes elsewhere), wherever you see a `<FILL_IN_THE_BLANK>` like in the chunk below, that is meant for you to replace (including the angle brackets) with the correct phrase before you run the chunk (otherwise you will get an error).

```{r file-check, eval=FALSE}
# Replace the <PUT_FILE_NAME_HERE> with the name of the file you are looking for
# Remember to use quotes to make it a character string
file.exists(<PUT_FILE_NAME_HERE>)
```

It doesn't seem that file exists in our _current directory_, but that doesn't mean it doesn't exist it all.
In fact, this file is inside the _relative path_ `data/`, so let's check again if the whole relative path to that file exists.


```{r file-check-path, eval=FALSE}
# This time, use file.path() to form your argument to file.exists()
file.exists(<PUT_PATH_TO_FILE_HERE>)
```

With the right relative path, we can confirm this file exists.

#### Read a TSV file

Declare the name of the directory where we will read in the data.

```{r}
data_dir <- "data"
```

Although base R has functions to read in data files, the functions in the `readr` package (part of the tidyverse) are faster and more straightforward to use so we are going to use those here.
Because the file we are reading in is a TSV (tab separated values) file we will be using the `read_tsv` function.
There are analogous functions for CSV (comma separated values) files (`read_csv()`) and other files types.

## Read in the differential expression analysis results file

```{r read-results}
stats_df <- readr::read_tsv(
  file.path(data_dir,
            "gene_results_GSE44971.tsv")
  )
```

Following the template of the previous chunk, use this chunk to read in the file `GSE44971.tsv` that is in the `data` folder and save it in the variable `gene_df`.

```{r read-expr, live = TRUE}
# Use this chunk to read in data from the file `GSE44971.tsv`
gene_df <- readr::read_tsv(
  file.path(data_dir,
            "GSE44971.tsv")
  )
```

Use this chunk to explore what `gene_df` looks like.

```{r explore}
# Explore `gene_df`

```

What information is contained in `gene_df`?

## `R` pipes

One nifty feature that was added to `R` in version 4.1 is the pipe: `|>`.
Pipes are very handy things that allow you to funnel the result of one expression to the next, making your code more streamlined and fluently expressing the flow of data through a series of operations.

_Note:_ If you are using a version of `R` prior to 4.1 (or looking at older code), pipe functionality was available through the `magrittr` package, which used a pipe that looked like this: `%>%`.
That pipe was the inspiration for the native R pipe we are using here. 
While there are some minor differences, you can mostly treat them interchangeably as long as you load the `magrittr` package or `dplyr`, which also loads that version of the pipe.

For example, the output from this:

```{r filter}
filter(stats_df, contrast == "male_female")
```

...is the same as the output from this:

```{r filter-pipe}
stats_df |> filter(contrast == "male_female")
```

This can make your code cleaner and easier to follow a series of related commands.
Let's look at an example with our stats of of how the same functions look with or without pipes:

*Example 1:* without pipes:

```{r steps-nopipe}
stats_arranged <- arrange(stats_df, t_statistic)
stats_filtered <- filter(stats_arranged, avg_expression > 50)
stats_nopipe <- select(stats_filtered, contrast, log_fold_change, p_value)
```

UGH, we have to keep track of all of those different intermediate data frames and type their names so many times here!
We could maybe streamline things by using the same variable name at each stage, but even then there is a lot of extra typing, and it is easy to get confused about what has been done where.
It's annoying and makes it harder for people to read.

*Example 2:* Same result as 1 but with pipes!

```{r steps-pipe, live = TRUE}
# Example of the same modifications as above but with pipes!
stats_pipe  <- stats_df |>
               arrange(t_statistic) |>
               filter(avg_expression > 50) |>
               select(contrast, log_fold_change, p_value)
```

What the `|>` (pipe) is doing here is feeding the result of the expression on its left into the first argument of the next function (to its right, or on the next line here).
We can then skip that first argument (the data in these cases), and move right on to the part we care about at that step: what we are arranging, filtering, or selecting in this case.
The key insight that makes the pipe work here is to recognize that each of these functions (`arrange`, `filter`, and `select`) are fundamental `dplyr` (a tidyverse package) functions which work as "data in, data out."
In other words, these functions operate on data frames, and return data frames; you give them a data frame, and they give you back a data frame.
Because these functions all follow a "data in, data out" framework, we can chain them together with pipe and send data all the way through the...pipeline!

Let's double check that these versions with and without pipe yield the same solution by using the base R function `all.equal()`.

```{r check-pipe}
all.equal(stats_nopipe, stats_pipe)
```

`all.equal()` is letting us know that these two objects are the same.

Now that hopefully you are convinced that the tidyverse can help you make your code neater and easier to use and read, let's go through some of the popular tidyverse functions and so we can create pipelines like this.


## Common tidyverse functions

Let's say we wanted to filter this gene expression dataset to particular sample groups.
In order to do this, we would use the function `filter()` as well as a logic statement (usually one that refers to a column or columns in the data frame).

```{r filter-gene}
# Here let's filter stats_df to only keep the gene_symbol "SNCA"
stats_df |>
  filter(gene_symbol == "SNCA")
```

We can use `filter()` similarly for numeric statements.

```{r filter-numeric, live = TRUE}
# Here let's filter the data to rows with average expression values above 50
stats_df |>
  filter(avg_expression > 50)
```

We can apply multiple filters at once, which will require all of them to be satisfied for every row in the results:

```{r filter-2, live = TRUE}
# filter to highly expressed genes with contrast "male_female"
stats_df |>
  filter(contrast == "male_female",
         avg_expression > 50)
```

When we are filtering, the `%in%` operator can come in handy if we have multiple items we would like to match.
Let's take a look at what using `%in%` does.

```{r in-example, eval = FALSE}
genes_of_interest <- c("SNCA", "CDKN1A")
# Are these genes present in the `gene_symbol` column in stats_df?
stats_df$gene_symbol %in% genes_of_interest
```

`%in%` returns a logical vector that now we can use in `dplyr::filter`.

```{r filter-in, live = TRUE}
# filter to keep only genes of interest
stats_df |>
  filter(gene_symbol %in% c("SNCA", "CDKN1A"))
```

Let's return to our first `filter()` and build on to it.
This time, let's keep only some of the columns from the data frame using the
`select()` function.
Let's also save this as a new data frame called `stats_filtered_df`.

```{r filter-select, live = TRUE}
# filter to highly expressed "male_female"
# and select gene_symbol, log_fold_change and t_statistic
stats_filtered_df <- stats_df |>
  filter(contrast == "male_female",
         avg_expression > 50) |>
  select(log_fold_change, t_statistic)
```

Let's say we wanted to arrange this dataset so that the genes are arranged by the smallest p values to the largest.
In order to do this, we would use the function `arrange()` as well as the column we would like to sort by (in this case `p_value`).

```{r arrange}
stats_df |>
  arrange(p_value)
```

What if we want to sort from largest to smallest?
Like if we want to see the genes with the highest average expression?
We can use the same function, but instead use the `desc()` function and now we are using `avg_expression` column.

```{r arrange-desc}
# arrange descending by avg_expression
stats_df |>
  arrange(desc(avg_expression))
```

What if we would like to create a new column of values?
For that we use `mutate()` function.

```{r mutate}
stats_df |>
  mutate(log10_p_value = -log10(p_value))
```

What if we want to obtain summary statistics for a column or columns?
The `summarize` function allows us to calculate summary statistics for a column.
Here we will use summarize to calculate two summary statistics of log-fold change across all genes: mean (function `mean()`) and standard deviation (function `sd()`).

```{r summarize}
stats_df |>
  summarize(mean(log_fold_change),
            sd(log_fold_change))
```

What if we'd like to obtain a summary statistics but have them for various groups?
Conveniently named, there's a function called `group_by()` that seamlessly allows us to do this.
Also note that `group_by()` allows us to group by multiple variables at a time if you want to.

```{r summarize-groups, live = TRUE}
stats_summary_df <- stats_df |>
      group_by(contrast) |>
      summarize(mean(log_fold_change),
                sd(log_fold_change))
```

Let's look at a preview of what we made:

```{r}
stats_summary_df
```

Here we have the mean log fold change expression per each contrast we made.

## A brief intro to the `apply` family of functions

In base R, the `apply` family of functions can be an alternative methods for performing transformations across a data frame, matrix or other object structures.

One of this family is (shockingly) the function `apply()`, which operates on matrices.

A matrix is similar to a data frame in that it is a rectangular table of data, but it has an additional constraint: rather than each column having a type, ALL data in a matrix has the same type.

The first argument to `apply()` is the data object we want to work on.
The third argument is the function we will apply to each row or column of the data object.
The second argument in specifies whether we are applying the function across rows or across columns (1 for rows, 2 for columns).

Remember that `gene_df` is a gene x sample gene expression data frame that has columns of two different types, character and numeric.
Converting it to a matrix will require us to make them all the same type.
We can coerce it into a matrix using `as.matrix()`, in which case R will pick a type that it can convert everything to.
What does it choose?

```{r matrix}
# Coerce `gene_df` into a matrix
gene_matrix <- as.matrix(gene_df)
```

```{r matrix-type, live = TRUE}
# Explore the structure of the `gene_matrix` object
str(gene_matrix)
```

While that worked, it is rare that we want numbers converted to text, so we are going to select only the columns with numeric values before converting it to a matrix.
We can do this most easily by removing the first column, which contains the gene names stored as character values.

```{r matrix-numeric, live = TRUE}
# Let's save a new matrix object names `gene_num_matrix` containing only
# the numeric values
gene_num_matrix <- as.matrix(gene_df[, -1])

# Explore the structure of the `gene_num_matrix` object
str(gene_num_matrix)
```

Why do we have a `[, -1]` after `gene_df` in the above chunk?

Now that the matrix is all numbers, we can do things like calculate the column or row statistics using `apply()`.

```{r rowmeans}
# Calculate row means
gene_means <- apply(gene_num_matrix, 1, mean) # Notice we are using 1 here

# How long will `gene_means` be?
length(gene_means)
```

Note that we can obtain the same results if we select just the columns with numeric values from the `gene_df` data frame.
This allows R to do the as.matrix() coercion automatically, and can be a handy shortcut if you have a *mostly* numeric data frame.

```{r rowmeans-dataframe}
# Calculate row means using the `gene_df` object after removing the character column
# apply() converts this to a matrix internally
gene_means_from_df <- apply(gene_df[, -1], 1, mean)

# Let's check that the two gene means objects are equal
all.equal(gene_means, gene_means_from_df)
```

Now let's investigate the same set up, but use 2 to `apply` over the columns of our matrix.

```{r colmeans}
# Calculate sample means
sample_means <- apply(gene_num_matrix, 2, mean) # Notice we use 2 here

# How long will `sample_means` be?
length(sample_means)
```

We can put the gene names back into the numeric matrix object by assigning them as rownames.

```{r matrix-rownames, live = TRUE}
# Assign the gene names from gene_df$Gene to the `gene_num_matrix` object using
# the `rownames()` function
rownames(gene_num_matrix) <- gene_df$Gene

# Explore the `gene_num_matrix` object
head(gene_num_matrix)
```

Row names like this can be very convenient for keeping matrices organized, but row names (and column names) can be lost or misordered if you are not careful, especially during input and output, so treat them with care.

Although the `apply` functions may not be as easy to use as the tidyverse functions, for some applications, `apply` methods can be better suited.
In this workshop, we will not delve too deeply into the various other apply functions (`tapply()`, `lapply()`, etc.) but you can read more information about them [here](https://www.guru99.com/r-apply-sapply-tapply.html).

## The dplyr::join functions

Let's say we have a scenario where we have two data frames that we would like to combine.
Recall that `stats_df` and `gene_df` are data frames that contain information about some of the same genes.
The [`dplyr::join` family of functions](https://dplyr.tidyverse.org/reference/mutate-joins.html) are useful for various scenarios of combining data frames.
For a visual explanation, the [`tidyexplain` project](https://github.com/gadenbuie/tidyexplain) has some [helpful animations of joins](https://github.com/gadenbuie/tidyexplain#mutating-joins).

For now, we will focus on `inner_join()`, which will combine data frames by only keeping information about matching rows that are in both data frames.
We need to use the `by` argument to designate what column(s) should be used as a key to match the data frames.
In this case we want to match the gene information between the two, so we will specify that we want to compare values in the `ensembl_id` column from `stats_df` to the `Gene` column from `gene_df`.

```{r inner-join}
stats_df |>
  # Join based on their shared column
  # Called ensembl_id in stats_df and called Gene in gene_df
  inner_join(gene_df, by = c('ensembl_id' = 'Gene'))
```

## Save data to files

#### Save to TSV files

Let's write some of the data frames we created to a file.
To do this, we can use the `readr` library of `write_()` functions.
The first argument of `write_tsv()` is the data we want to write, and the second argument is a character string that describes the path to the new file we would like to create.
Remember that we created a `results` directory to put our output in, but if we want to save our data to a directory other than our working directory, we need to specify this.
This is what we will use the `file.path()` function for.
Let's look in a bit more detail what `file.path()` does, by examining the results of the function in the examples below.

```{r file-path-quiz}
# Which of these file paths is what we want to use to save our data to the
# results directory we created at the beginning of this notebook?
file.path("docker-install", "stats_summary.tsv")
file.path("results", "stats_summary.tsv")
file.path("stats_summary.tsv", "results")
```

Replace `<NEW_FILE_PATH>` below with the `file.path()` statement from above that will successfully save our file to the `results` folder.

```{r eval=FALSE}
# Write our data frame to a TSV file
readr::write_tsv(stats_summary_df, <NEW_FILE_PATH>)
```

Check in your `results` directory to see if your new file has successfully saved.

#### Save to RDS files

For this example we have been working with data frames, which are conveniently represented as TSV or CSV tables.
However, in other situations we may want to save more complicated or very large data structures, RDS (R Data Serialized/Single) files may be a better option for saving our data.
RDS is R's special file format for holding data exactly as you have it in your R environment.
RDS files can also be compressed, meaning they will take up less space on your computer.
Let's save our data to an RDS file in our `results` folder.
You will need to replace the `.tsv` with `.RDS`, but you can use what we determined as our file path for the last chunk as your template.

```{r eval=FALSE}
# Write your object to an RDS file
readr::write_rds(stats_summary_df, <PUT_CORRECT_FILE_PATH_HERE>)
```

#### Read an RDS file

Since now you have learned the `readr` functions: `read_tsv()`, `write_tsv()`, and now, `write_rds()`, what do you suppose the function you will need to read your RDS file is called?
Use that function here to re-import your data in the chunk we set up for you below.

```{r eval=FALSE}
# Read in your RDS file
reimport_df <- <PUT_FUNCTION_NAME>(file.path("results", "stats_summary.RDS"))
```

As is good practice, we will end this session by printing out our session info.

### Session Info

```{r}
# Print out the versions and packages we are using in this session
sessionInfo()
```
