This notebook will demonstrate how to:
|>
) to combine multiple operationsapply()
function to apply functions across rows
or columns of a matrixWe’ll use the same gene expression dataset we used in the previous notebook. It is a pre-processed astrocytoma microarray dataset that we performed a set of differential expression analyses on.
More tidyverse resources:
The tidyverse is a collection of packages that are handy for general data wrangling, analysis, and visualization. Other packages that are specifically handy for different biological analyses are found on Bioconductor. If we want to use a package’s functions we first need to install them.
Our RStudio Server already has the tidyverse
group of
packages installed for you. But if you needed to install it or other
packages available on CRAN, you do it using the
install.packages()
function like this:
install.packages("tidyverse")
.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
::
Note that if we had not imported the tidyverse set of packages using
library()
like above, and we wanted to use a tidyverse
function like read_tsv()
, we would need to tell R what
package to find this function in. To do this, we would use
::
to tell R to load in this function from the
readr
package by using readr::read_tsv()
. You
will see this ::
method of referencing libraries within
packages throughout the course. We like to use it in part to remove any
ambiguity in which version of a function we are using; it is not too
uncommon for different packages to use the same name for very different
functions!
Before we can import the data we need, we should double check where R
is looking for files, aka the current working
directory. We can do this by using the getwd()
function, which will tell us what folder we are in.
# Let's check what directory we are in:
getwd()
[1] "/__w/training-modules/training-modules/intro-to-R-tidyverse"
/__w/training-modules/training-modules/intro-to-R-tidyverse
For Rmd files, the working directory is wherever the file is located, but commands executed in the console may have a different working directory.
We will want to make a directory for our output and we will call this
directory: results
. But before we create the directory, we
should check if it already exists. We will show two ways that we can do
this.
First, we can use the dir()
function to have R list the
files in our working directory.
# Let's check what files are here
dir()
[1] "00a-rstudio_guide.Rmd"
[2] "00b-debugging_resources.Rmd"
[3] "00c-good_scientific_coding_practices.Rmd"
[4] "01-intro_to_base_R-live.Rmd"
[5] "01-intro_to_base_R.nb.html"
[6] "01-intro_to_base_R.Rmd"
[7] "02-intro_to_ggplot2-live.Rmd"
[8] "02-intro_to_ggplot2.nb.html"
[9] "02-intro_to_ggplot2.Rmd"
[10] "03-intro_to_tidyverse-live.Rmd"
[11] "03-intro_to_tidyverse.nb.html"
[12] "03-intro_to_tidyverse.Rmd"
[13] "data"
[14] "diagrams"
[15] "exercise_01-intro_to_base_R.Rmd"
[16] "exercise_02-intro_to_R.Rmd"
[17] "exercise_03a-intro_to_tidyverse.Rmd"
[18] "exercise_03b-intro_to_tidyverse.Rmd"
[19] "plots"
[20] "README.md"
[21] "screenshots"
[22] "scripts"
00a-rstudio_guide.Rmd
00b-debugging_resources.Rmd
00c-good_scientific_coding_practices.Rmd
01-intro_to_base_R-live.Rmd
01-intro_to_base_R.nb.html
01-intro_to_base_R.Rmd
02-intro_to_ggplot2-live.Rmd
02-intro_to_ggplot2.nb.html
02-intro_to_ggplot2.Rmd
03-intro_to_tidyverse-live.Rmd
03-intro_to_tidyverse.nb.html
03-intro_to_tidyverse.Rmd
data
diagrams
exercise_01-intro_to_base_R.Rmd
exercise_02-intro_to_R.Rmd
exercise_03a-intro_to_tidyverse.Rmd
exercise_03b-intro_to_tidyverse.Rmd
plots
README.md
screenshots
scripts
This shows us there is no folder called “results” yet.
If we want to more pointedly look for “results” in our working
directory we can use the dir.exists()
function.
# Check if the results directory exists
dir.exists("results")
[1] FALSE
If the above says FALSE
that means we will need to
create a results
directory. We’ve previously seen that we
can make directories in R using the base R function
dir.create()
. But we’ve also seen that this function will
throw an error if you try to create a directory that already exists,
which can be frustrating if you are re-running code! A different option
is to use the fs
package, which provides functions for you to interact with your
computer’s file system with a more consistent behavior than the base R
functions. One function from this package is
fs::dir_create()
(note that it has an underscore,
not a period), and much like the base R dir.create()
, it
creates directories. It has some other helpful features too: - It will
simply do nothing if that directory already exists; no errors, and
nothing will get overwritten - It allows creating nested
directories by default, i.e. in one call make directories inside of
other directories
Let’s go ahead and use it to create our results
directory:
# Make a directory within the working directory called 'results'
fs::dir_create("results")
After creating the results directory above, let’s re-run
dir.exists()
to see if now it exists.
# Re-check if the results directory exists
dir.exists("results")
[1] TRUE
The dir.exists()
function will not work on files
themselves. In that case, there is an analogous function called
file.exists()
.
Try using the file.exists()
function to see if the file
gene_results_GSE44971.tsv
exists in the current directory.
Use the code chunk we set up for you below. Note that in our notebooks
(and sometimes elsewhere), wherever you see a
<FILL_IN_THE_BLANK>
like in the chunk below, that is
meant for you to replace (including the angle brackets) with the correct
phrase before you run the chunk (otherwise you will get an error).
# Replace the <PUT_FILE_NAME_HERE> with the name of the file you are looking for
# Remember to use quotes to make it a character string
file.exists(<PUT_FILE_NAME_HERE>)
It doesn’t seem that file exists in our current directory,
but that doesn’t mean it doesn’t exist it all. In fact, this file is
inside the relative path data/
, so let’s check
again if the whole relative path to that file exists.
# This time, use file.path() to form your argument to file.exists()
file.exists(<PUT_PATH_TO_FILE_HERE>)
With the right relative path, we can confirm this file exists.
Declare the name of the directory where we will read in the data.
data_dir <- "data"
Although base R has functions to read in data files, the functions in
the readr
package (part of the tidyverse) are faster and
more straightforward to use so we are going to use those here. Because
the file we are reading in is a TSV (tab separated values) file we will
be using the read_tsv
function. There are analogous
functions for CSV (comma separated values) files
(read_csv()
) and other files types.
stats_df <- readr::read_tsv(
file.path(data_dir,
"gene_results_GSE44971.tsv")
)
Rows: 6804 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): ensembl_id, gene_symbol, contrast
dbl (5): log_fold_change, avg_expression, t_statistic, p_value, adj_p_value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Following the template of the previous chunk, use this chunk to read
in the file GSE44971.tsv
that is in the data
folder and save it in the variable gene_df
.
# Use this chunk to read in data from the file `GSE44971.tsv`
gene_df <- readr::read_tsv(
file.path(data_dir,
"GSE44971.tsv")
)
Rows: 20056 Columns: 59
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): Gene
dbl (58): GSM1094814, GSM1094815, GSM1094816, GSM1094817, GSM1094818, GSM109...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Use this chunk to explore what gene_df
looks like.
# Explore `gene_df`
What information is contained in gene_df
?
R
pipesOne nifty feature that was added to R
in version 4.1 is
the pipe: |>
. Pipes are very handy things that allow you
to funnel the result of one expression to the next, making your code
more streamlined and fluently expressing the flow of data through a
series of operations.
Note: If you are using a version of R
prior to
4.1 (or looking at older code), pipe functionality was available through
the magrittr
package, which used a pipe that looked like
this: %>%
. That pipe was the inspiration for the native
R pipe we are using here. While there are some minor differences, you
can mostly treat them interchangeably as long as you load the
magrittr
package or dplyr
, which also loads
that version of the pipe.
For example, the output from this:
filter(stats_df, contrast == "male_female")
…is the same as the output from this:
stats_df |> filter(contrast == "male_female")
This can make your code cleaner and easier to follow a series of related commands. Let’s look at an example with our stats of of how the same functions look with or without pipes:
Example 1: without pipes:
stats_arranged <- arrange(stats_df, t_statistic)
stats_filtered <- filter(stats_arranged, avg_expression > 50)
stats_nopipe <- select(stats_filtered, contrast, log_fold_change, p_value)
UGH, we have to keep track of all of those different intermediate data frames and type their names so many times here! We could maybe streamline things by using the same variable name at each stage, but even then there is a lot of extra typing, and it is easy to get confused about what has been done where. It’s annoying and makes it harder for people to read.
Example 2: Same result as 1 but with pipes!
# Example of the same modifications as above but with pipes!
stats_pipe <- stats_df |>
arrange(t_statistic) |>
filter(avg_expression > 50) |>
select(contrast, log_fold_change, p_value)
What the |>
(pipe) is doing here is feeding the
result of the expression on its left into the first argument of the next
function (to its right, or on the next line here). We can then skip that
first argument (the data in these cases), and move right on to the part
we care about at that step: what we are arranging, filtering, or
selecting in this case. The key insight that makes the pipe work here is
to recognize that each of these functions (arrange
,
filter
, and select
) are fundamental
dplyr
(a tidyverse package) functions which work as “data
in, data out.” In other words, these functions operate on data frames,
and return data frames; you give them a data frame, and they give you
back a data frame. Because these functions all follow a “data in, data
out” framework, we can chain them together with pipe and send data all
the way through the…pipeline!
Let’s double check that these versions with and without pipe yield
the same solution by using the base R function
all.equal()
.
all.equal(stats_nopipe, stats_pipe)
[1] TRUE
all.equal()
is letting us know that these two objects
are the same.
Now that hopefully you are convinced that the tidyverse can help you make your code neater and easier to use and read, let’s go through some of the popular tidyverse functions and so we can create pipelines like this.
Let’s say we wanted to filter this gene expression dataset to
particular sample groups. In order to do this, we would use the function
filter()
as well as a logic statement (usually one that
refers to a column or columns in the data frame).
# Here let's filter stats_df to only keep the gene_symbol "SNCA"
stats_df |>
filter(gene_symbol == "SNCA")
We can use filter()
similarly for numeric
statements.
# Here let's filter the data to rows with average expression values above 50
stats_df |>
filter(avg_expression > 50)
We can apply multiple filters at once, which will require all of them to be satisfied for every row in the results:
# filter to highly expressed genes with contrast "male_female"
stats_df |>
filter(contrast == "male_female",
avg_expression > 50)
When we are filtering, the %in%
operator can come in
handy if we have multiple items we would like to match. Let’s take a
look at what using %in%
does.
genes_of_interest <- c("SNCA", "CDKN1A")
# Are these genes present in the `gene_symbol` column in stats_df?
stats_df$gene_symbol %in% genes_of_interest
%in%
returns a logical vector that now we can use in
dplyr::filter
.
# filter to keep only genes of interest
stats_df |>
filter(gene_symbol %in% c("SNCA", "CDKN1A"))
Let’s return to our first filter()
and build on to it.
This time, let’s keep only some of the columns from the data frame using
the select()
function. Let’s also save this as a new data
frame called stats_filtered_df
.
# filter to highly expressed "male_female"
# and select gene_symbol, log_fold_change and t_statistic
stats_filtered_df <- stats_df |>
filter(contrast == "male_female",
avg_expression > 50) |>
select(log_fold_change, t_statistic)
Let’s say we wanted to arrange this dataset so that the genes are
arranged by the smallest p values to the largest. In order to do this,
we would use the function arrange()
as well as the column
we would like to sort by (in this case p_value
).
stats_df |>
arrange(p_value)