This notebook will demonstrate how to:
- Use the pipe operator (|>) to combine multiple operations
- Use the apply() function to apply functions across rows or columns of a matrix
More tidyverse resources:
The tidyverse is a collection of packages that are handy for general data wrangling, analysis, and visualization. Other packages that are specifically handy for different biological analyses are found on Bioconductor. If we want to use a package’s functions we first need to install them.
Our RStudio Server already has the
tidyverse group of
packages installed for you. But if you needed to install it or other
packages available on CRAN, you do it using the
install.packages() function like this:
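A sketch of what that looks like (there is no need to run the install step on our server, where tidyverse is already installed); loading the installed package with library() is what produces the startup message shown below:

```r
# Install the tidyverse collection from CRAN (only needs to happen once)
install.packages("tidyverse")

# Load the packages for this session; this prints the startup message below
library(tidyverse)
```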
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Note that if we had not imported the tidyverse set of packages using library() like above, and we wanted to use a tidyverse function like read_tsv(), we would need to tell R what package to find this function in. To do this, we would use :: to tell R to load in this function from the readr package by writing readr::read_tsv(). You will see this :: method of referencing functions within packages throughout the course. We like to use it in part to remove any ambiguity in which version of a function we are using; it is not too uncommon for different packages to use the same name for very different functions.
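As a small illustration of exactly this situation (using the built-in mtcars dataset, which is not part of this course's files), filter is a name shared by two very different functions:

```r
# stats::filter() and dplyr::filter() share a name but do very different
# things; the :: prefix makes it unambiguous which one we mean
stats::filter(1:5, rep(1 / 3, 3))   # a moving-average filter from base R's stats
dplyr::filter(mtcars, mpg > 30)     # row filtering from the tidyverse
```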
Before we can import the data we need, we should double check where R is looking for files, aka the current working directory. We can do this by using the getwd() function, which will tell us what folder we are in.
# Let's check what directory we are in
getwd()
For Rmd files, the working directory is wherever the file is located, but commands executed in the console may have a different working directory.
We will want to make a directory for our output, and we will call this directory results. But before we create the directory, we should check if it already exists. We will show two ways that we can do this.

First, we can use the dir() function to have R list the files in our working directory.
# Let's check what files are here
dir()
 "00a-rstudio_guide.Rmd"  "00b-debugging_resources.Rmd"  "00c-good_scientific_coding_practices.Rmd"  "01-intro_to_base_R-live.Rmd"  "01-intro_to_base_R.nb.html"  "01-intro_to_base_R.Rmd"  "02-intro_to_ggplot2-live.Rmd"  "02-intro_to_ggplot2.nb.html"  "02-intro_to_ggplot2.Rmd"  "03-intro_to_tidyverse-live.Rmd"  "03-intro_to_tidyverse.nb.html"  "03-intro_to_tidyverse.Rmd"  "data"  "diagrams"  "exercise_01-intro_to_base_R.Rmd"  "exercise_02-intro_to_R.Rmd"  "exercise_03a-intro_to_tidyverse.Rmd"  "exercise_03b-intro_to_tidyverse.Rmd"  "plots"  "README.md"  "screenshots"  "scripts"
This shows us there is no folder called “results” yet.
If we want to more pointedly look for “results” in our working directory, we can use the dir.exists() function.
# Check if the results directory exists
dir.exists("results")
If the above says FALSE, that means we will need to create the results directory. We’ve previously seen that we can make directories in R using the base R function dir.create(). But we’ve also seen that this function will throw an error if you try to create a directory that already exists, which can be frustrating if you are re-running code! A different option is to use the fs package, which provides functions for you to interact with your computer’s file system with more consistent behavior than the base R functions. One function from this package is fs::dir_create() (note that it has an underscore, not a period), and much like the base R dir.create(), it creates directories. It has some other helpful features too:

- It will simply do nothing if that directory already exists; no errors, and nothing will get overwritten
- It allows creating nested directories by default, i.e. in one call you can make directories inside of other directories

Let’s go ahead and use it to create our results directory.
# Make a directory within the working directory called 'results'
fs::dir_create("results")
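As a sketch of the nested-directory behavior (the results/tables path here is just an example, not a folder this course uses):

```r
# One call creates 'results' and a 'tables' folder inside it
fs::dir_create("results/tables")

# Running it again is safe: no error, and nothing is overwritten
fs::dir_create("results/tables")
```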
After creating the results directory above, let’s re-run dir.exists() to see if it exists now.
# Re-check if the results directory exists
dir.exists("results")
The dir.exists() function will not work on files themselves. In that case, there is an analogous function called file.exists().
Try using the
file.exists() function to see if the file
gene_results_GSE44971.tsv exists in the current directory.
Use the code chunk we set up for you below. Note that in our notebooks
(and sometimes elsewhere), wherever you see a
<FILL_IN_THE_BLANK> like in the chunk below, that is
meant for you to replace (including the angle brackets) with the correct
phrase before you run the chunk (otherwise you will get an error).
# Replace the <PUT_FILE_NAME_HERE> with the name of the file you are looking for
# Remember to use quotes to make it a character string
file.exists(<PUT_FILE_NAME_HERE>)
It doesn’t seem that file exists in our current directory, but that doesn’t mean it doesn’t exist at all. In fact, this file is inside the data/ directory, so let’s check again using the whole relative path to that file.
# This time, use file.path() to form your argument to file.exists()
file.exists(<PUT_PATH_TO_FILE_HERE>)
With the right relative path, we can confirm this file exists.
Declare the name of the directory from which we will read in the data.
data_dir <- "data"
Although base R has functions to read in data files, the functions in the readr package (part of the tidyverse) are faster and more straightforward to use, so we are going to use those here. Because the file we are reading in is a TSV (tab separated values) file, we will be using the read_tsv() function. There are analogous functions for CSV (comma separated values) files (read_csv()) and other file types.
stats_df <- readr::read_tsv(
  file.path(data_dir, "gene_results_GSE44971.tsv")
)
Rows: 6804 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): ensembl_id, gene_symbol, contrast
dbl (5): log_fold_change, avg_expression, t_statistic, p_value, adj_p_value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Following the template of the previous chunk, use this chunk to read in the file GSE44971.tsv that is in the data folder and save it in the variable gene_df.
# Use this chunk to read in data from the file `GSE44971.tsv`
gene_df <- readr::read_tsv(
  file.path(data_dir, "GSE44971.tsv")
)
Rows: 20056 Columns: 59
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): Gene
dbl (58): GSM1094814, GSM1094815, GSM1094816, GSM1094817, GSM1094818, GSM109...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Use this chunk to explore what
gene_df looks like.
# Explore `gene_df`
What information is contained in gene_df?
One nifty feature that was added to R in version 4.1 is the pipe operator, |>. Pipes are very handy things that allow you to funnel the result of one expression to the next, making your code more streamlined and fluently expressing the flow of data through a series of operations.
Note: If you are using a version of R prior to 4.1 (or looking at older code), pipe functionality was available through the magrittr package, which used a pipe that looked like %>%. That pipe was the inspiration for the native R pipe we are using here. While there are some minor differences, you can mostly treat them interchangeably as long as you load the magrittr package or dplyr, which also loads that version of the pipe.
For example, the output from this:
filter(stats_df, contrast == "male_female")
…is the same as the output from this:
stats_df |> filter(contrast == "male_female")
This can make your code cleaner and a series of related commands easier to follow. Let’s look at an example with our stats_df data frame of how the same operations look with and without pipes:
Example 1: without pipes:
stats_arranged <- arrange(stats_df, t_statistic)
stats_filtered <- filter(stats_arranged, avg_expression > 50)
stats_nopipe <- select(stats_filtered, contrast, log_fold_change, p_value)
UGH, we have to keep track of all of those different intermediate data frames and type their names so many times here! We could maybe streamline things by using the same variable name at each stage, but even then there is a lot of extra typing, and it is easy to get confused about what has been done where. It’s annoying and makes it harder for people to read.
Example 2: Same result as 1 but with pipes!
# Example of the same modifications as above but with pipes!
stats_pipe <- stats_df |>
  arrange(t_statistic) |>
  filter(avg_expression > 50) |>
  select(contrast, log_fold_change, p_value)
What the |> (pipe) is doing here is feeding the result of the expression on its left into the first argument of the next function (to its right, or on the next line here). We can then skip that first argument (the data in these cases), and move right on to the part we care about at that step: what we are arranging, filtering, or selecting in this case. The key insight that makes the pipe work here is to recognize that each of these functions (arrange(), filter(), and select()) are fundamental dplyr (a tidyverse package) functions which work as “data in, data out.” In other words, these functions operate on data frames and return data frames; you give them a data frame, and they give you back a data frame. Because these functions all follow a “data in, data out” framework, we can chain them together with pipes and send data all the way through the…pipeline!
Let’s double check that these versions with and without pipes yield the same result by using the base R function all.equal(). If all.equal() returns TRUE, it is letting us know that these two objects are the same.
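The check itself is one line (assuming the stats_nopipe and stats_pipe chunks above have been run):

```r
# TRUE means the piped and un-piped data frames are the same
all.equal(stats_nopipe, stats_pipe)
```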
Now that hopefully you are convinced that the tidyverse can help you make your code neater and easier to use and read, let’s go through some of the popular tidyverse functions so we can create pipelines like this.
Let’s say we wanted to filter this gene expression dataset to
particular sample groups. In order to do this, we would use the function
filter() as well as a logical expression (usually one that refers to a column or columns in the data frame).
# Here let's filter stats_df to only keep the gene_symbol "SNCA"
stats_df |>
  filter(gene_symbol == "SNCA")
We can use filter() similarly for numeric columns:
# Here let's filter the data to rows with average expression values above 50
stats_df |>
  filter(avg_expression > 50)
We can apply multiple filters at once, which will require all of them to be satisfied for every row in the results:
# filter to highly expressed genes with contrast "male_female"
stats_df |>
  filter(contrast == "male_female", avg_expression > 50)
When we are filtering, the %in% operator can come in handy if we have multiple items we would like to match. Let’s take a look at what using %in% looks like:
genes_of_interest <- c("SNCA", "CDKN1A")

# Are these genes present in the `gene_symbol` column in stats_df?
stats_df$gene_symbol %in% genes_of_interest
%in% returns a logical vector that we can now use in filter():
# filter to keep only genes of interest
stats_df |>
  filter(gene_symbol %in% c("SNCA", "CDKN1A"))
Let’s return to our first filter() and build on to it. This time, let’s keep only some of the columns from the data frame using the select() function. Let’s also save this as a new data frame called stats_filtered_df.
# filter to highly expressed "male_female"
# and select gene_symbol, log_fold_change and t_statistic
stats_filtered_df <- stats_df |>
  filter(contrast == "male_female", avg_expression > 50) |>
  select(gene_symbol, log_fold_change, t_statistic)
Let’s say we wanted to arrange this dataset so that the genes are arranged from the smallest p values to the largest. In order to do this, we would use the function arrange() as well as the column we would like to sort by (in this case p_value):
stats_df |>
  arrange(p_value)
What if we want to sort from largest to smallest? Like if we want to see the genes with the highest average expression? We can use the same function, but wrap the column in the desc() function to sort in descending order:
# arrange descending by avg_expression
stats_df |>
  arrange(desc(avg_expression))
What if we would like to create a new column of values? For that we can use the mutate() function:
stats_df |>
  mutate(log10_p_value = -log10(p_value))
What if we want to obtain summary statistics for a column or columns? The summarize() function allows us to calculate summary statistics for a column. Here we will use summarize() to calculate two summary statistics of log-fold change across all genes: mean (function mean()) and standard deviation (function sd()):
stats_df |>
  summarize(mean(log_fold_change), sd(log_fold_change))
What if we’d like to obtain summary statistics but have them for various groups? Conveniently named, there’s a function called group_by() that seamlessly allows us to do this. Also note that group_by() allows us to group by multiple variables at a time if you want to.
stats_summary_df <- stats_df |>
  group_by(contrast) |>
  summarize(mean(log_fold_change), sd(log_fold_change))
Let’s look at a preview of what we made:
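One way to do that (assuming stats_summary_df from the chunk above) is simply to print the object, which shows one summary row per contrast:

```r
# Print the grouped summary data frame
stats_summary_df
```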
Here we have the mean and standard deviation of log fold change for each contrast we made.
The apply family of functions
In base R, the apply family of functions can be an alternative method for performing transformations across a data frame, matrix, or other object structures.
One member of this family is (shockingly) the function apply(), which operates on matrices.
A matrix is similar to a data frame in that it is a rectangular table of data, but it has an additional constraint: rather than each column having a type, ALL data in a matrix has the same type.
The first argument to apply() is the data object we want to work on. The second argument specifies whether we are applying the function across rows or across columns (1 for rows, 2 for columns). The third argument is the function we will apply to each row or column of the data object.
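A tiny self-contained sketch of that argument order (the matrix here is made up purely for illustration):

```r
# A 2 x 3 matrix: row 1 is c(1, 3, 5) and row 2 is c(2, 4, 6)
m <- matrix(1:6, nrow = 2)

apply(m, 1, sum)  # across rows: one sum per row -> 9 12
apply(m, 2, sum)  # across columns: one sum per column -> 3 7 11
```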
gene_df is a gene x sample gene expression
data frame that has columns of two different types, character and
numeric. Converting it to a matrix will require us to make them all the
same type. We can coerce it into a matrix using
as.matrix(), in which case R will pick a type that it can
convert everything to. What does it choose?
# Coerce `gene_df` into a matrix
gene_matrix <- as.matrix(gene_df)
# Explore the structure of the `gene_matrix` object
str(gene_matrix)
chr [1:20056, 1:59] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:59] "Gene" "GSM1094814" "GSM1094815" "GSM1094816" ...
While that worked, it is rare that we want numbers converted to text, so we are going to select only the columns with numeric values before converting it to a matrix. We can do this most easily by removing the first column, which contains the gene names stored as character values.
# Let's save a new matrix object named `gene_num_matrix` containing only
# the numeric values
gene_num_matrix <- as.matrix(gene_df[, -1])

# Explore the structure of the `gene_num_matrix` object
str(gene_num_matrix)
num [1:20056, 1:58] 9.5951 -0.0436 8.5246 1.6013 0.6189 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:58] "GSM1094814" "GSM1094815" "GSM1094816" "GSM1094817" ...
Why do we have a [, -1] after gene_df in the above chunk?
Now that the matrix is all numbers, we can do things like calculate the column or row statistics using apply().
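For example, to get the mean expression of every sample (assuming gene_num_matrix from the chunk above; per-sample column means are just one illustration):

```r
# Mean expression per sample: apply mean() over the columns (margin = 2)
sample_means <- apply(gene_num_matrix, 2, mean)

# One value per sample column
length(sample_means)
```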