This notebook will demonstrate how to:
R is a statistical computing language that is open source, meaning the underlying code for the language is freely available to anyone. You do not need a special license or set of permissions to use and develop code in R.
R is an interpreted language that comes with built-in functionality known as “base R”. But there is also rich additional functionality provided by external packages, or libraries of code that assist in accomplishing certain tasks and can be freely downloaded and loaded for use.
In the next notebook and subsequent modules, we will be using a suite of packages collectively known as The Tidyverse. The tidyverse is geared towards intuitive data science applications that follow a shared data philosophy. But there are still many core features of base R which are important to be aware of, and we will be using concepts from both base R and the tidyverse in our analyses, as well as task-specific packages for analyses such as gene expression.
RStudio is a graphical environment (“integrated development environment” or IDE) for writing and developing R code. RStudio is NOT a separate programming language - it is an interface we use to facilitate R programming. In other words, you can program in R without RStudio, but you can’t use the RStudio environment without R.
For more information about RStudio than you ever wanted to know, see this RStudio IDE Cheatsheet (pdf).
The RStudio environment has four main panes, each of which may have a number of tabs that display different information or functionality. (Their specific locations can be changed under Tools -> Global Options -> Pane Layout.)
The Editor pane is where you can write R scripts and other documents. Each tab here is its own document. This is your text editor, which will allow you to save your R code for future use. Note that code you write or change here will not run automatically; nothing happens until you run it.
The Console pane is where you can interactively run R code.
The Environment pane primarily displays the variables, sometimes known as objects, that are defined during a given R session, and what data or values they might hold.
The final pane, Files, Plots, Help, …, has several pretty important tabs:
The most basic use of R is as a regular calculator:
Operation | Symbol |
---|---|
Add | + |
Subtract | - |
Multiply | * |
Divide | / |
Exponentiate | ^ or ** |
For example, we can do some simple multiplication like this. When you execute code within the notebook, the results appear beneath the code. Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter on a Mac, or Ctrl+Shift+Enter on a PC.
5 * 6
[1] 30
Use the console to calculate other expressions. Standard order of operations applies (mostly), and you can use parentheses () as you might expect (but not brackets [] or braces {}, which have special meanings). Note, however, that you must always specify multiplication with *; implicit multiplication such as 10(3 + 4) or 10x will not work and will generate an error, or worse.
10 * (3 + 4)^2
[1] 490
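The “(mostly)” above deserves a quick illustration. As a small sketch (the variable names are only for illustration), exponentiation is evaluated before a leading minus sign, and chained exponents are evaluated from right to left, so parentheses are the safest way to say exactly what you mean:

```r
# Exponentiation happens before the leading minus sign is applied,
# so this is -(2^2), not (-2)^2
neg_square <- -2^2
neg_square

# Parentheses make the intent explicit
paren_square <- (-2)^2
paren_square

# Chained exponents are evaluated right to left: 2^(3^2) = 2^9
chained <- 2^3^2
chained
```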
To define a variable, we use the assignment operator, which looks like an arrow: <-. For example, x <- 7 takes the value on the right-hand side of the operator and assigns it to the variable name on the left-hand side.
# Define a variable x to equal 7, and print out the value of x
x <- 7
# We can have R repeat back to us what `x` is by just using `x`
x
[1] 7
Some features of variables, considering the example x <- 7: Every variable has a name, a value, and a type. This variable’s name is x, its value is 7, and its type is numeric (7 is a number!). Re-defining a variable will overwrite the value.
x <- 5.5
x
[1] 5.5
We can modify an existing variable by reassigning it to its same name. Here we’ll add 2 to x and reassign the result back to x.
x <- x + 2
x
[1] 7.5
As best you can, it is a good idea to make your variable names informative (e.g. x doesn’t mean anything, but sandwich_price is meaningful… if we’re talking about the cost of sandwiches, that is..).
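To see why informative names help, here is a minimal sketch with made-up values. The names sandwich_price, sandwich_count, and total_cost are only examples, but notice that the last line can almost be read aloud:

```r
# Price of a single sandwich (a made-up value)
sandwich_price <- 7.5

# How many sandwiches we are buying (also made up)
sandwich_count <- 3

# The calculation explains itself when the names are informative
total_cost <- sandwich_price * sandwich_count
total_cost
```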
We can use pre-built computation methods called “functions” for other operations. Functions have the following format, where the argument is the information we are providing to the function for it to run. An example of this was the atan() function used above.
function_name(argument)
To learn about functions, we’ll examine one called log() first.
To know what a function does and how to use it, use the question mark, which will reveal documentation in the help pane:
?log
The documentation tells us that log() is derived from {base}, meaning it is a function that is part of base R. It provides a brief description of what the function does and shows several examples of how to use it.
In particular, the documentation tells us about what argument(s) to provide: x, the value to take the log of, and base, the base of the logarithm, which defaults to e.
Functions also return values for us to use. In the case of log(), the returned value is the log’d value the function computed.
log(73)
[1] 4.290459
Here we can specify an argument of base to calculate log base 3.
log(81, base = 3)
[1] 4
If we don’t specify the argument names, it assumes they are in the order that log defines them. See ?log to see more about its arguments.
log(8, 2)
[1] 3
We can switch the order if we specify the argument names.
log(base = 10, x = 4342)
[1] 3.63769
We can also provide variables as arguments in the same way as the raw values.
meaning <- 42
log(meaning)
[1] 3.73767
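The rules about argument order and argument names apply to other functions too. As a brief sketch, the three log() calls below are equivalent ways of writing the same thing, and round() (another base R function, with arguments x and digits) follows the same pattern. The variable names are just for illustration.

```r
# Arguments matched by position: x first, then base
by_position <- log(8, 2)

# Arguments matched by name, in the defined order
by_name <- log(x = 8, base = 2)

# Arguments matched by name, in a different order
swapped <- log(base = 2, x = 8)

# All three calls compute log base 2 of 8
by_position
by_name
swapped

# round() works the same way: round 3.14159 to 2 decimal places
rounded <- round(3.14159, digits = 2)
rounded
```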
Variable types in R can sometimes be coerced (converted) from one type to another.
# Define a variable with a number
x <- 15
The function class() will tell us the variable’s type.
class(x)
[1] "numeric"
Let’s coerce it to a character.
x <- as.character(x)
class(x)
[1] "character"
See it now has quotes around it? It’s now a character and will behave as such.
x
[1] "15"
Use this chunk to try performing calculations with x now that it is a character. What happens?
# Try to perform calculations on `x`
But we can’t coerce everything:
# Let's create a character variable
x <- "look at my character variable"
Let’s try making this a numeric variable:
x <- as.numeric(x)
Warning: NAs introduced by coercion
Print out x.
x
[1] NA
R is telling us it doesn’t know how to convert this to a numeric variable, so it has returned NA instead.
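Coercion does work when the text actually looks like a number. The sketch below uses hypothetical variable names; suppressWarnings() is only there to keep the expected coercion warning from cluttering the output, and is.na() checks whether a value is NA.

```r
# A number stored as text
number_as_text <- "15"

# This text can be coerced, because it looks like a number
converted <- as.numeric(number_as_text)
converted + 1

# This text cannot be coerced, so the result is NA
not_a_number <- suppressWarnings(as.numeric("fifteen"))

# is.na() tells us whether a value is missing
is.na(not_a_number)
```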
For reference, here’s a summary of some of the most important variable types.
Variable Type | Definition | Examples | Coercion |
---|---|---|---|
numeric | Any number value | 5, 7.5, -1 | as.numeric() |
integer | Any whole number value (no decimals) | 5, -100 | as.integer() |
character | Any collection of characters defined within quotation marks. Also known as a “string”. | "a" (a single letter), "stringofletters" (a whole bunch of characters put together as one), "string of letters and spaces", "5", 'single quotes are also good' | as.character() |
logical | A value of TRUE, FALSE, or NA | TRUE, FALSE, NA (not defined) | as.logical() |
factor | A special type of variable that denotes specific categories of a categorical variable | (stay tuned..) | as.factor() |
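As a quick sketch of the types in this table, we can ask class() about one value of each kind. One detail worth knowing: a whole number typed on its own, like 5, is stored as numeric, and adding an L suffix (5L) is how we ask for an integer. The variable names are only for illustration.

```r
# One example value of each type
type_numeric <- class(7.5)
type_integer <- class(5L) # the L suffix makes this an integer
type_character <- class("5")
type_logical <- class(TRUE)
type_factor <- class(factor("a"))

type_numeric
type_integer
type_character
type_logical
type_factor

# Coercing to integer drops the decimal part
truncated <- as.integer(7.9)
truncated

# Text that spells out a logical value can be coerced to logical
text_to_logical <- as.logical("TRUE")
text_to_logical
```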
You will have noticed that all your computations tend to pop up with a [1] preceding them in R’s output. This is because, in fact, all (ok mostly all) variables are by default vectors, and our answers are the first (in these cases only) value in the vector. As vectors get longer, new index indicators will appear at the start of new lines.
# This is actually a vector that has one item in it.
x <- 7
# The length() function tells us how long a vector is:
length(x)
[1] 1
We can define vectors with the function c(), which stands for “combine”. This function takes a comma-separated set of values to place in the vector, and returns the vector itself:
my_numeric_vector <- c(1, 1, 2, 3, 5, 8, 13, 21)
my_numeric_vector
[1] 1 1 2 3 5 8 13 21
We can build on vectors in place by redefining them:
# add the next two Fibonacci numbers to the series.
my_numeric_vector <- c(my_numeric_vector, 34, 55)
my_numeric_vector
[1] 1 1 2 3 5 8 13 21 34 55
We can pull out specific items from a vector using a process called indexing, which uses brackets [] to specify the position of an item.
# Grab the fourth value from my_numeric_vector
# This gives us a vector of length 1
my_numeric_vector[4]
[1] 3
Colons are also a nice way to quickly make ordered numeric vectors. Use a colon to specify an inclusive range of indices. This will return a vector of the 2nd, 3rd, 4th, and 5th values from my_numeric_vector.
my_numeric_vector[2:5]
[1] 1 2 3 5
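Indexing is not limited to a single position or a range. As a small sketch, the chunk below re-creates the first Fibonacci values under the hypothetical name fib so it runs on its own, then indexes with a vector of positions and with a negative position, which drops that item rather than selecting it.

```r
# Re-create the first eight Fibonacci values
fib <- c(1, 1, 2, 3, 5, 8, 13, 21)

# Index with a vector of positions: the first and the last items
first_and_last <- fib[c(1, length(fib))]
first_and_last

# A negative index drops that position instead of selecting it
without_first <- fib[-1]
without_first

# The colon also works on its own to build a vector
six_to_ten <- 6:10
six_to_ten
```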
One major benefit of vectors is the concept of vectorization, where R by default performs operations on the entire vector at once. For example, we can get the log of all numbers 1-20 with a single, simple call, and more!
values_1_to_20 <- 1:20
# calculate the log of values_1_to_20
log(values_1_to_20)
[1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
[8] 2.0794415 2.1972246 2.3025851 2.3978953 2.4849066 2.5649494 2.6390573
[15] 2.7080502 2.7725887 2.8332133 2.8903718 2.9444390 2.9957323
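Vectorization applies to ordinary arithmetic as well as to functions. In this sketch (the values and names are made up for illustration), an operation with a single number is applied to every item, and an operation between two vectors of the same length is applied item by item:

```r
# Multiply every item by 2
doubled <- c(1, 2, 3) * 2
doubled

# Add two vectors of the same length, item by item
sums <- c(1, 2, 3) + c(10, 20, 30)
sums

# Convert made-up heights from centimeters to meters in one step
heights_cm <- c(150, 160, 170)
heights_m <- heights_cm / 100
heights_m
```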
Finally, we can apply logical expressions to vectors, just as we can do for single values. The output here is a logical vector telling us whether the expression is TRUE or FALSE for each value in values_1_to_20.
# Which values are <= 3?
values_1_to_20 <= 3
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
To ask if values are equal to another value, we need to use a double equal sign == (no spaces)! A single equals sign will end up (usually) re-assigning variables, so remember to use double equals for comparisons.
# Which values are equal to 3?
values_1_to_20 == 3
[1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
We can also use an exclamation mark ! to negate expressions, for example:
# Place expressions inside !() to negate them
# Which values are _not less than_ 3?
!(values_1_to_20 < 3)
[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[13] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# You can use the shortcut != to ask "not equals" too
# Which values are not equal to 3?
values_1_to_20 != 3
[1] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[13] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
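Logical vectors become especially handy once we put them to work. As a sketch going slightly beyond what we have covered so far, a logical vector can be placed inside brackets to keep only the TRUE positions, sum() counts how many values are TRUE, and & combines two conditions. The vector is re-created here under the hypothetical name values so the chunk runs on its own.

```r
# Re-create the values 1 through 20
values <- 1:20

# A logical vector: TRUE where the value is at most 3
is_small <- values <= 3

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
n_small <- sum(is_small)
n_small

# Use the logical vector inside brackets to keep only those values
small_values <- values[is_small]
small_values

# Combine two conditions with & (both must be TRUE)
middle_values <- values[values > 5 & values <= 8]
middle_values
```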
There are several key functions which can be used on vectors containing numeric values, some of which are below.
- mean(): The average value in the vector
- min(): The minimum value in the vector
- max(): The maximum value in the vector
- sum(): The sum of all values in the vector
We can try out these functions on the vector values_1_to_20 we’ve created.
mean(values_1_to_20)
[1] 10.5
# Try out some of the other functions we've listed above
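As a sketch of how these functions behave on a different vector, here they are applied to the Fibonacci values from earlier, re-created under the hypothetical name fib so the chunk runs on its own. The last lines show what happens when a vector contains a missing value (NA), and how the na.rm argument asks the function to skip it.

```r
# Re-create the Fibonacci values
fib <- c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55)

fib_min <- min(fib)
fib_max <- max(fib)
fib_sum <- sum(fib)
fib_mean <- mean(fib)

fib_min
fib_max
fib_sum
fib_mean

# A vector with a missing value
with_missing <- c(1, NA, 3)

# By default, the missing value makes the result NA
mean(with_missing)

# na.rm = TRUE removes missing values before calculating
mean_without_na <- mean(with_missing, na.rm = TRUE)
mean_without_na
```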
We have learned functions such as c, length, sum, etc. Imagine defining a variable called c: this will work, but it will lead to a lot of unintended bugs, so it’s best to avoid this.
%in% logical operator
%in% is useful for determining whether one or more given items are in a vector.
# is `7` in our vector?
7 %in% values_1_to_20
[1] TRUE
# is `50` in our vector?
50 %in% values_1_to_20
[1] FALSE
We can test a vector of values being within another vector of values.
question_values <- c(1:3, 7, 50)
# Are these values in our vector?
question_values %in% values_1_to_20
[1] TRUE TRUE TRUE TRUE FALSE
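Because %in% returns a logical vector, it can also be used to pick out the matching (or non-matching) items. This is a small sketch with hypothetical names (candidates, allowed) that re-creates the vectors so it runs on its own:

```r
# The values we are asking about, and the values we are checking against
candidates <- c(1:3, 7, 50)
allowed <- 1:20

# Keep only the candidates that are found in `allowed`
found <- candidates[candidates %in% allowed]
found

# Negate the expression to keep the candidates that are NOT found
not_found <- candidates[!(candidates %in% allowed)]
not_found
```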
Data frames are one of the most useful tools for data analysis in R. They are tables which consist of rows and columns, much like a spreadsheet. Each column is a variable which behaves as a vector, and each row is an observation. We will begin our exploration with a dataset of measurements from three penguin species, which we can find in the palmerpenguins package. We’ll talk more about packages soon! To use this dataset, we will load it from the palmerpenguins package using :: (more on this later) and assign it to a variable named penguins in our current environment.
penguins <- palmerpenguins::penguins
Artwork by @allison_horst
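The :: notation follows the pattern package::name, and it works for functions as well as datasets. As a sketch that needs nothing beyond what ships with R, the tools package (installed with R, but not loaded by default) has a function file_ext() that returns a file’s extension. The file name here is just an example string; no file needs to exist.

```r
# Use a function from the tools package without loading the whole package
extension <- tools::file_ext("gene_results_GSE44971.tsv")
extension

# Alternatively, load the package first, then use the function name alone
library(tools)
extension_again <- file_ext("gene_results_GSE44971.tsv")
extension_again
```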
The first step to using any data is to look at it!!! RStudio contains a special function View() which allows you to literally view a variable. You can also click on the object in the environment pane to see its overall properties, or click the table icon on the object’s row to automatically view the variable.
Some useful functions for exploring our data frame include:
- head() to see the first 6 rows of a data frame. Additional arguments supplied can change the number of rows.
- tail() to see the last 6 rows of a data frame. Additional arguments supplied can change the number of rows.
- names() to see the column names of the data frame.
- nrow() to see how many rows are in the data frame.
- ncol() to see how many columns are in the data frame.
We can additionally explore overall properties of the data frame with two different functions: summary() and str().
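To get a feel for the exploration functions in this section, here is a sketch that uses a tiny data frame built by hand with data.frame(). The name toy_df and all of its values are made up for illustration; they are not taken from the penguins dataset.

```r
# Build a small data frame by hand (made-up values)
toy_df <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap", "Adelie"),
  body_mass_g = c(3750, 5000, 3700, 3800)
)

# Show only the first 2 rows instead of the default 6
first_rows <- head(toy_df, 2)
first_rows

# Column names
toy_names <- names(toy_df)
toy_names

# Number of rows and columns
n_rows <- nrow(toy_df)
n_cols <- ncol(toy_df)
n_rows
n_cols
```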
summary() provides summary statistics for each column:
summary(penguins)
species island bill_length_mm bill_depth_mm
Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
Mean :43.92 Mean :17.15
3rd Qu.:48.50 3rd Qu.:18.70
Max. :59.60 Max. :21.50
NA's :2 NA's :2
flipper_length_mm body_mass_g sex year
Min. :172.0 Min. :2700 female:165 Min. :2007
1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
Median :197.0 Median :4050 NA's : 11 Median :2008
Mean :200.9 Mean :4202 Mean :2008
3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
Max. :231.0 Max. :6300 Max. :2009
NA's :2 NA's :2
str() provides a short view of the structure and contents of the data frame.
str(penguins)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
You’ll notice that the column species is a factor: This is a special type of character variable that represents distinct categories known as “levels”. We have learned here that there are three levels in the species column: Adelie, Chinstrap, and Gentoo. We might want to explore individual columns of the data frame more in-depth. We can examine individual columns using the dollar sign $ to select one by name:
# Extract bill_length_mm as a vector
penguins$bill_length_mm
[1] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42.0 37.8 37.8 41.1 38.6 34.6
[16] 36.6 38.7 42.5 34.4 46.0 37.8 37.7 35.9 38.2 38.8 35.3 40.6 40.5 37.9 40.5
[31] 39.5 37.2 39.5 40.9 36.4 39.2 38.8 42.2 37.6 39.8 36.5 40.8 36.0 44.1 37.0
[46] 39.6 41.1 37.5 36.0 42.3 39.6 40.1 35.0 42.0 34.5 41.4 39.0 40.6 36.5 37.6
[61] 35.7 41.3 37.6 41.1 36.4 41.6 35.5 41.1 35.9 41.8 33.5 39.7 39.6 45.8 35.5
[76] 42.8 40.9 37.2 36.2 42.1 34.6 42.9 36.7 35.1 37.3 41.3 36.3 36.9 38.3 38.9
[91] 35.7 41.1 34.0 39.6 36.2 40.8 38.1 40.3 33.1 43.2
[ reached getOption("max.print") -- omitted 244 entries ]
# indexing operators can be used on these vectors too
penguins$bill_length_mm[1:10]
[1] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42.0
We can perform our regular vector operations on columns directly.
# calculate the mean of the bill_length_mm column
mean(penguins$bill_length_mm,
na.rm = TRUE) # remove missing values before calculating the mean
[1] 43.92193
We can also calculate the full summary statistics for a single column directly.
# show a summary of the bill_length_mm column
summary(penguins$bill_length_mm)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
32.10 39.23 44.45 43.92 48.50 59.60 2
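The same $ pattern works on any data frame. As a minimal sketch with made-up values (toy_df and its columns are hypothetical), we can extract a column, index into it, and summarize it while skipping a missing value:

```r
# A small data frame with one missing measurement (made-up values)
toy_df <- data.frame(
  id = c("a", "b", "c"),
  mass_g = c(3750, NA, 3250)
)

# Extract one column as a vector
masses <- toy_df$mass_g
masses

# Index into the extracted column
first_two <- toy_df$mass_g[1:2]
first_two

# Calculate the mean, removing the missing value first
mean_mass <- mean(toy_df$mass_g, na.rm = TRUE)
mean_mass
```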
Extract species as a vector and subset it to see a preview.
# get the first 10 values of the species column
penguins$species[1:10]
[1] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
Levels: Adelie Chinstrap Gentoo
And view its levels with the levels() function.
levels(penguins$species)
[1] "Adelie" "Chinstrap" "Gentoo"
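Factors can be created from any character vector with factor(). In this sketch the sizes vector is made up for illustration. By default, the levels are the distinct values in sorted order, and table() (a base R function) counts how many times each level appears.

```r
# Create a factor from a character vector (made-up values)
sizes <- factor(c("small", "large", "small", "medium"))
sizes

# The levels are the distinct categories, sorted alphabetically by default
size_levels <- levels(sizes)
size_levels

# Count how many times each level appears
size_counts <- table(sizes)
size_counts

# Coerce the factor back to a plain character vector
sizes_as_text <- as.character(sizes)
sizes_as_text
```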
In many situations, we will be reading in tabular data from a file and using it as a data frame. To practice, we will read in a file we will be using in the next notebook as well, gene_results_GSE44971.tsv, in the data folder. File paths are relative to the location where this notebook file (.Rmd) is saved.
Here we will use a function, read_tsv(), from the readr package. Before we are able to use the function, we have to load the package using library().
library(readr)
file.path() creates a properly formatted file path by adding a path separator (/ on Mac and Linux operating systems, the latter of which is the operating system that our RStudio Server runs on) between separate folders or directories. Because file path separators can differ between your computer and the computer of someone who wants to use your code, we use file.path() instead of typing out "data/gene_results_GSE44971.tsv".
Each argument to file.path() is a directory or file name. You’ll notice each argument is in quotes. We specify data first because the file, gene_results_GSE44971.tsv, is in the data folder.
file.path("data", "gene_results_GSE44971.tsv")
[1] "data/gene_results_GSE44971.tsv"
As you can see above, the result of running file.path() is that it creates a string with an accurately-formatted path for your file system. This string can be used moving forward when you need to refer to the path to your file. Let’s go ahead and store this file path as a variable in our environment.
gene_file_path <- file.path("data", "gene_results_GSE44971.tsv")
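file.path() accepts as many pieces as we need, so nested folders work the same way. In this sketch the folder and file names (results, example.tsv) are hypothetical and do not need to exist, since we are only building and inspecting the string. The base R functions basename() and dirname() go the other direction and pull a path apart.

```r
# Build a path through two nested folders (hypothetical names)
nested_path <- file.path("data", "results", "example.tsv")
nested_path

# basename() returns just the file name at the end of the path
file_name <- basename(nested_path)
file_name

# dirname() returns everything before the file name
folder <- dirname(nested_path)
folder
```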
Now we are ready to use read_tsv() to read the file into R. The resulting data frame will be stored in a variable named stats_df. Note the <- (assignment operator!) is responsible for saving this to our global environment.
# read in the file `gene_results_GSE44971.tsv` from the data directory
stats_df <- read_tsv(gene_file_path)
Rows: 6804 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): ensembl_id, gene_symbol, contrast
dbl (5): log_fold_change, avg_expression, t_statistic, p_value, adj_p_value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Take a look at your environment panel to see what stats_df looks like. We can also print out a preview of the stats_df data frame here.
# display stats_df
stats_df
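If you would like to practice reading a file without the course data, here is a self-contained sketch. Note that it swaps in base R’s read.delim() rather than readr’s read_tsv(), so that it runs without any extra packages; the two return slightly different kinds of data frames. The file contents and the names temp_path and toy_stats_df are made up for illustration.

```r
# Create a path to a temporary file (hypothetical file name)
temp_path <- file.path(tempdir(), "toy_results.tsv")

# Write a tiny tab-separated file: a header line and two made-up rows
writeLines(
  c("gene_symbol\tp_value",
    "GENE1\t0.01",
    "GENE2\t0.5"),
  temp_path
)

# Read the tab-separated file into a data frame using base R
toy_stats_df <- read.delim(temp_path)
toy_stats_df

# Check the result with the exploration functions from earlier
nrow(toy_stats_df)
names(toy_stats_df)
```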
At the end of every notebook, you will see us print out sessionInfo. This aids in the reproducibility of your code by showing exactly what packages and versions were being used the last time the notebook was run.
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] readr_2.1.5 optparse_1.7.5
loaded via a namespace (and not attached):
[1] crayon_1.5.2 vctrs_0.6.5 cli_3.6.2
[4] knitr_1.46 rlang_1.1.3 xfun_0.43
[7] stringi_1.8.3 jsonlite_1.8.8 bit_4.0.5
[10] glue_1.7.0 htmltools_0.5.8.1 sass_0.4.9
[13] hms_1.1.3 fansi_1.0.6 rmarkdown_2.26
[16] evaluate_0.23 jquerylib_0.1.4 tibble_3.2.1
[19] tzdb_0.4.0 fastmap_1.1.1 yaml_2.3.8
[22] lifecycle_1.0.4 palmerpenguins_0.1.1 stringr_1.5.1
[25] compiler_4.4.1 getopt_1.20.4 pkgconfig_2.0.3
[28] digest_0.6.35 R6_2.5.1 tidyselect_1.2.1
[31] utf8_1.2.4 parallel_4.4.1 vroom_1.6.5
[34] pillar_1.9.0 magrittr_2.0.3 bslib_0.7.0
[37] bit64_4.0.5 tools_4.4.1 cachem_1.0.8
Comments
Arguably the most important aspect of your coding is comments: Small pieces of explanatory text you leave in your code to explain what the code is doing and/or leave notes to yourself or others. Comments are invaluable for communicating your code to others, but they are most important for Future You. Future You comes into existence about one second after you write code, and has no idea what on earth Past You was thinking.
Comments in R code are indicated with pound signs (aka hashtags, octothorpes). R will ignore any text in a line after the pound sign, so you can put whatever text you like there.
Help out Future You by adding lots of comments! Future You next week thinks Today You is an idiot, and the only way you can convince Future You that Today You is reasonably competent is by adding comments in your code explaining why Today You is actually not so bad.
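As a small sketch of what this looks like in practice (the values and names are made up), a comment can sit on its own line above the code it explains or at the end of a line of code. Either way, R ignores everything after the pound sign.

```r
# Body mass as recorded, in grams (a made-up value)
body_mass_g <- 3750

# Convert to kilograms so the units match our other measurements
body_mass_kg <- body_mass_g / 1000 # 1 kg = 1000 g
body_mass_kg
```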