Objectives

This notebook will demonstrate how to:

  • Use functions from the tidyverse to read and write data frames
  • Implement and use tidyverse functions to wrangle data (i.e. filter, mutate, arrange, join)
  • Use R pipes (|>) to combine multiple operations
  • Use the apply() function to apply functions across rows or columns of a matrix

We’ll use the same gene expression dataset we used in the previous notebook. It is a pre-processed astrocytoma microarray dataset that we performed a set of differential expression analyses on.

Set Up

The tidyverse is a collection of packages that are handy for general data wrangling, analysis, and visualization. Other packages that are specifically handy for different biological analyses are found on Bioconductor. If we want to use a package’s functions, we first need to install that package.

Our RStudio Server already has the tidyverse group of packages installed for you. But if you needed to install it or other packages available on CRAN, you do it using the install.packages() function like this: install.packages("tidyverse").

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Referencing a library’s function with ::

Note that if we had not imported the tidyverse set of packages using library() like above, and we wanted to use a tidyverse function like read_tsv(), we would need to tell R what package to find this function in. To do this, we would use :: to tell R to call this function from the readr package by writing readr::read_tsv(). You will see this :: method of referencing functions within packages throughout the course. We like to use it in part to remove any ambiguity in which version of a function we are using; it is not too uncommon for different packages to use the same name for very different functions!
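The masked functions listed in the startup message above are a concrete case where :: helps. Here is a small sketch (using the built-in mtcars data frame) of calling two very different functions that share the name filter():

```r
# dplyr's filter() subsets rows of a data frame:
dplyr::filter(mtcars, cyl == 6)

# stats' filter() is a completely different function that applies
# linear filters (e.g. moving averages) to numeric series:
stats::filter(1:10, rep(1/3, 3))
```

With the explicit prefix, it does not matter which packages are attached or in what order; each call is unambiguous.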

Managing directories

Before we can import the data we need, we should double check where R is looking for files, aka the current working directory. We can do this by using the getwd() function, which will tell us what folder we are in.

# Let's check what directory we are in:
getwd()
[1] "/__w/training-modules/training-modules/intro-to-R-tidyverse"

For Rmd files, the working directory is wherever the file is located, but commands executed in the console may have a different working directory.

We will want to make a directory for our output, which we will call results. But before we create the directory, we should check whether it already exists. We will show two ways to do this.

First, we can use the dir() function to have R list the files in our working directory.

# Let's check what files are here
dir()
 [1] "00a-rstudio_guide.Rmd"                   
 [2] "00b-debugging_resources.Rmd"             
 [3] "00c-good_scientific_coding_practices.Rmd"
 [4] "01-intro_to_base_R-live.Rmd"             
 [5] "01-intro_to_base_R.nb.html"              
 [6] "01-intro_to_base_R.Rmd"                  
 [7] "02-intro_to_ggplot2-live.Rmd"            
 [8] "02-intro_to_ggplot2.nb.html"             
 [9] "02-intro_to_ggplot2.Rmd"                 
[10] "03-intro_to_tidyverse-live.Rmd"          
[11] "03-intro_to_tidyverse.nb.html"           
[12] "03-intro_to_tidyverse.Rmd"               
[13] "data"                                    
[14] "diagrams"                                
[15] "exercise_01-intro_to_base_R.Rmd"         
[16] "exercise_02-intro_to_R.Rmd"              
[17] "exercise_03a-intro_to_tidyverse.Rmd"     
[18] "exercise_03b-intro_to_tidyverse.Rmd"     
[19] "plots"                                   
[20] "README.md"                               
[21] "screenshots"                             
[22] "scripts"                                 

This shows us there is no folder called “results” yet.

If we want to more pointedly look for “results” in our working directory we can use the dir.exists() function.

# Check if the results directory exists
dir.exists("results")
[1] FALSE

If the above says FALSE, that means we will need to create a results directory. We’ve previously seen that we can make directories in R using the base R function dir.create(). But we’ve also seen that this function will throw an error if you try to create a directory that already exists, which can be frustrating if you are re-running code! A different option is to use the fs package, which provides functions for interacting with your computer’s file system with more consistent behavior than the base R functions. One function from this package is fs::dir_create() (note that it has an underscore, not a period), and much like the base R dir.create(), it creates directories. It has some other helpful features too:

  • It will simply do nothing if the directory already exists; no errors, and nothing will get overwritten
  • It creates nested directories by default, i.e. in one call it can make directories inside of other directories
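As a sketch of the nested-directory behavior (the path below is a hypothetical example, not one we need for this notebook):

```r
# fs::dir_create() builds every level of a nested path in one call,
# and silently does nothing for any level that already exists:
fs::dir_create("results/tables")

# fs::dir_exists() is the fs counterpart to base R's dir.exists()
fs::dir_exists("results/tables")
```

Running the same fs::dir_create() call twice is harmless, which makes it safe in scripts that get re-run.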

Let’s go ahead and use it to create our results directory:

# Make a directory within the working directory called 'results'
fs::dir_create("results")

After creating the results directory above, let’s re-run dir.exists() to see if now it exists.

# Re-check if the results directory exists
dir.exists("results")
[1] TRUE

The dir.exists() function will not work on files themselves. In that case, there is an analogous function called file.exists().

Try using the file.exists() function to see if the file gene_results_GSE44971.tsv exists in the current directory. Use the code chunk we set up for you below. Note that in our notebooks (and sometimes elsewhere), wherever you see a <FILL_IN_THE_BLANK> like in the chunk below, that is meant for you to replace (including the angle brackets) with the correct phrase before you run the chunk (otherwise you will get an error).

# Replace the <PUT_FILE_NAME_HERE> with the name of the file you are looking for
# Remember to use quotes to make it a character string
file.exists(<PUT_FILE_NAME_HERE>)

It doesn’t seem that file exists in our current directory, but that doesn’t mean it doesn’t exist at all. In fact, this file is inside the relative path data/, so let’s check again whether the whole relative path to that file exists.

# This time, use file.path() to form your argument to file.exists()
file.exists(<PUT_PATH_TO_FILE_HERE>)

With the right relative path, we can confirm this file exists.
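The file.path() function used above is worth remembering: it pastes path components together with the correct separator. A quick sketch with made-up names:

```r
# file.path() joins its arguments into a single path string
file.path("some_directory", "some_file.tsv")
# [1] "some_directory/some_file.tsv"
```

Building paths this way, rather than pasting strings with "/" by hand, keeps code portable across operating systems.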

Read a TSV file

Declare the name of the directory where we will read in the data.

data_dir <- "data"

Although base R has functions to read in data files, the functions in the readr package (part of the tidyverse) are faster and more straightforward to use, so we are going to use those here. Because the file we are reading in is a TSV (tab separated values) file, we will be using the read_tsv() function. There are analogous functions for CSV (comma separated values) files (read_csv()) and other file types.

Read in the differential expression analysis results file

stats_df <- readr::read_tsv(
  file.path(data_dir,
            "gene_results_GSE44971.tsv")
  )
Rows: 6804 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): ensembl_id, gene_symbol, contrast
dbl (5): log_fold_change, avg_expression, t_statistic, p_value, adj_p_value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Following the template of the previous chunk, use this chunk to read in the file GSE44971.tsv that is in the data folder and save it in the variable gene_df.

# Use this chunk to read in data from the file `GSE44971.tsv`
gene_df <- readr::read_tsv(
  file.path(data_dir,
            "GSE44971.tsv")
  )
Rows: 20056 Columns: 59
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (1): Gene
dbl (58): GSM1094814, GSM1094815, GSM1094816, GSM1094817, GSM1094818, GSM109...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Use this chunk to explore what gene_df looks like.

# Explore `gene_df`

What information is contained in gene_df?

R pipes

One nifty feature that was added to R in version 4.1 is the pipe: |>. The pipe is a very handy operator that funnels the result of one expression into the next, streamlining your code and fluently expressing the flow of data through a series of operations.

Note: If you are using a version of R prior to 4.1 (or looking at older code), pipe functionality was available through the magrittr package, which used a pipe that looked like this: %>%. That pipe was the inspiration for the native R pipe we are using here. While there are some minor differences, you can mostly treat them interchangeably as long as you load the magrittr package or dplyr, which also loads that version of the pipe.

For example, the output from this:

filter(stats_df, contrast == "male_female")

…is the same as the output from this:

stats_df |> filter(contrast == "male_female")

This can make your code cleaner and a series of related commands easier to follow. Let’s look at an example with our stats data frame of how the same functions look with and without pipes:

Example 1: without pipes:

stats_arranged <- arrange(stats_df, t_statistic)
stats_filtered <- filter(stats_arranged, avg_expression > 50)
stats_nopipe <- select(stats_filtered, contrast, log_fold_change, p_value)

UGH, we have to keep track of all of those different intermediate data frames and type their names so many times here! We could maybe streamline things by using the same variable name at each stage, but even then there is a lot of extra typing, and it is easy to get confused about what has been done where. It’s annoying and makes it harder for people to read.

Example 2: Same result as 1 but with pipes!

# Example of the same modifications as above but with pipes!
stats_pipe  <- stats_df |>
               arrange(t_statistic) |>
               filter(avg_expression > 50) |>
               select(contrast, log_fold_change, p_value)

What the |> (pipe) is doing here is feeding the result of the expression on its left into the first argument of the next function (to its right, or on the next line here). We can then skip that first argument (the data in these cases), and move right on to the part we care about at that step: what we are arranging, filtering, or selecting in this case. The key insight that makes the pipe work here is to recognize that each of these functions (arrange, filter, and select) is a fundamental dplyr (a tidyverse package) function that works as “data in, data out.” In other words, these functions operate on data frames, and return data frames; you give them a data frame, and they give you back a data frame. Because these functions all follow a “data in, data out” framework, we can chain them together with the pipe and send data all the way through the…pipeline!
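The pipe is not limited to tidyverse functions: x |> f(y) is equivalent to f(x, y) for any function f. A quick sketch with base R’s head() and the built-in mtcars data frame:

```r
# These two expressions are identical: the pipe supplies mtcars
# as the first argument to head()
head(mtcars, 3)
mtcars |> head(3)
```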

Let’s double check that these versions with and without pipe yield the same solution by using the base R function all.equal().

all.equal(stats_nopipe, stats_pipe)
[1] TRUE

all.equal() is letting us know that these two objects are the same.

Now that hopefully you are convinced that the tidyverse can help you make your code neater and easier to use and read, let’s go through some of the popular tidyverse functions so we can create pipelines like this.

Common tidyverse functions

Let’s say we wanted to filter this gene expression dataset to particular rows of interest. In order to do this, we would use the function filter() along with a logical statement (usually one that refers to a column or columns in the data frame).

# Here let's filter stats_df to only keep the gene_symbol "SNCA"
stats_df |>
  filter(gene_symbol == "SNCA")

We can use filter() similarly for numeric statements.

# Here let's filter the data to rows with average expression values above 50
stats_df |>
  filter(avg_expression > 50)

We can apply multiple filters at once, which will require all of them to be satisfied for every row in the results:

# filter to highly expressed genes with contrast "male_female"
stats_df |>
  filter(contrast == "male_female",
         avg_expression > 50)

When we are filtering, the %in% operator can come in handy if we have multiple items we would like to match. Let’s take a look at what using %in% does.

genes_of_interest <- c("SNCA", "CDKN1A")
# Are these genes present in the `gene_symbol` column in stats_df?
stats_df$gene_symbol %in% genes_of_interest

%in% returns a logical vector that we can now use in dplyr::filter().

# filter to keep only genes of interest
stats_df |>
  filter(gene_symbol %in% c("SNCA", "CDKN1A"))

Let’s return to our first filter() and build on to it. This time, let’s keep only some of the columns from the data frame using the select() function. Let’s also save this as a new data frame called stats_filtered_df.

# filter to highly expressed "male_female"
# and select log_fold_change and t_statistic
stats_filtered_df <- stats_df |>
  filter(contrast == "male_female",
         avg_expression > 50) |>
  select(log_fold_change, t_statistic)

Let’s say we wanted to arrange this dataset so that the genes are arranged by the smallest p values to the largest. In order to do this, we would use the function arrange() as well as the column we would like to sort by (in this case p_value).

stats_df |>
  arrange(p_value)

What if we want to sort from largest to smallest? Like if we want to see the genes with the highest average expression? We can use the same function, but wrap the column in the desc() function; this time we are sorting by the avg_expression column.

# arrange descending by avg_expression
stats_df |>
  arrange(desc(avg_expression))

What if we would like to create a new column of values? For that we use the mutate() function.

stats_df |>
  mutate(log10_p_value = -log10(p_value))

What if we want to obtain summary statistics for a column or columns? The summarize() function allows us to calculate exactly that. Here we will use summarize() to calculate two summary statistics of the log fold change across all genes: the mean (function mean()) and the standard deviation (function sd()).

stats_df |>
  summarize(mean(log_fold_change),
            sd(log_fold_change))

What if we’d like to obtain summary statistics broken down by group? Conveniently, there’s a function called group_by() that seamlessly allows us to do this. Also note that group_by() allows grouping by multiple variables at a time if you want to.

stats_summary_df <- stats_df |>
      group_by(contrast) |>
      summarize(mean(log_fold_change),
                sd(log_fold_change))

Let’s look at a preview of what we made:

stats_summary_df

Here we have the mean and standard deviation of the log fold change for each contrast we tested.
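To see grouping by multiple variables at once, here is a sketch with a small made-up data frame (the column names and values are hypothetical):

```r
library(dplyr)

# A toy data frame with two grouping variables
toy_df <- data.frame(
  treatment = c("drug", "drug", "control", "control"),
  sex       = c("M", "F", "M", "F"),
  value     = c(10, 12, 5, 7)
)

# group_by() takes several columns at once; summarize() then returns
# one row per combination of groups
toy_df |>
  group_by(treatment, sex) |>
  summarize(mean_value = mean(value), .groups = "drop")
```

The .groups = "drop" argument tells summarize() to return an ungrouped result, which avoids surprises in later pipeline steps.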

A brief intro to the apply family of functions

In base R, the apply family of functions can be an alternative method for performing transformations across a data frame, matrix, or other object structure.

One member of this family is (shockingly) the function apply(), which operates on matrices.

A matrix is similar to a data frame in that it is a rectangular table of data, but it has an additional constraint: rather than each column having a type, ALL data in a matrix has the same type.

The first argument to apply() is the data object we want to work on. The second argument specifies whether we are applying the function across rows or across columns (1 for rows, 2 for columns). The third argument is the function we will apply to each row or column of the data object.

Remember that gene_df is a gene x sample gene expression data frame that has columns of two different types, character and numeric. Converting it to a matrix will require us to make them all the same type. We can coerce it into a matrix using as.matrix(), in which case R will pick a type that it can convert everything to. What does it choose?

# Coerce `gene_df` into a matrix
gene_matrix <- as.matrix(gene_df)
# Explore the structure of the `gene_matrix` object
str(gene_matrix)
 chr [1:20056, 1:59] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:59] "Gene" "GSM1094814" "GSM1094815" "GSM1094816" ...

While that worked, it is rare that we want numbers converted to text, so we are going to select only the columns with numeric values before converting it to a matrix. We can do this most easily by removing the first column, which contains the gene names stored as character values.

# Let's save a new matrix object named `gene_num_matrix` containing only
# the numeric values
gene_num_matrix <- as.matrix(gene_df[, -1])

# Explore the structure of the `gene_num_matrix` object
str(gene_num_matrix)
 num [1:20056, 1:58] 9.5951 -0.0436 8.5246 1.6013 0.6189 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:58] "GSM1094814" "GSM1094815" "GSM1094816" "GSM1094817" ...

Why do we have a [, -1] after gene_df in the above chunk?
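As a reminder of how this kind of indexing behaves, a negative index drops rather than selects. A sketch with a toy data frame:

```r
# A toy data frame shaped like gene_df: one character column, then numbers
toy_df <- data.frame(gene = c("A", "B"), s1 = c(1, 2), s2 = c(3, 4))

# [ , -1] means "all rows, every column except the first"
toy_df[, -1]
```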

Now that the matrix is all numbers, we can do things like calculate the column or row statistics using apply().
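For example, here is a sketch with a small stand-in matrix (made-up values in the same gene x sample shape as gene_num_matrix):

```r
# A 2-gene x 3-sample numeric matrix with made-up expression values
expr <- matrix(c(9.6, 8.5, 1.6, 0.6, 7.2, 3.3), nrow = 2)

# MARGIN = 1 applies mean() to each row (a per-gene mean);
# MARGIN = 2 applies it to each column (a per-sample mean)
apply(expr, 1, mean)
apply(expr, 2, mean)
```

Each call returns a vector with one value per row or column, respectively.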