In this tutorial, we will explore some basic data manipulation and visualization in R. Occassionally, datasets start out perfectly formatted with no missing data, outliers, etc. More often, you’ll need to do some kind of filtering, subsetting, and/or merging of data files. Plotting your data is helpful not only for publication and presentation, but also for identifying potential issues in your data.
I highly recommend writing all of your code in an R script and commenting it liberally. Even for relatively simple operations, if you need to repeat them or slightly tweak them, it’s much easier to simply open up your script and re-run it or edit it than to re-type everything into R. Even more importantly, it ensures that there is a record of exactly what you did. If you haven’t already dealt with trying to remember how you handled data when you come back to it to after a couple weeks or even just a few days, you’ll probably be surprised by just how quickly you can forget what you were doing and why.
How to open a script is detailed in the intro to R tutorial.
Today we’ll explore some data on thermal physiology taken from this paper: Body size impacts critical thermal maximum measurements in lizards.
Before we can work with the data, we need to download it. Go to the
Github repository for this workshop: https://github.com/wyoibc/r4grads
and click on the green Code
button, then on the
Download ZIP
button. (You can alternately clone the repo if
you’re familiar with git). You can also download the individual files
from here
if you know how to do that.
Start by setting our working directory. If you set things up exactly as I did, then this path will work for you. If not, then you’ll need to edit this path:
setwd("~/r4grads-master/lizard_data/")
Again, make sure this points to the correct path for you, not my path.
The data are all in this directory, as is a pdf of the paper that these data are from. “scelop_body_exp1.csv” is the actual data file for the paper, which is very nicely formatted. For illustrative purposes, I have split this into two files (“lizard_data.csv” and “physio_data.csv”), reduced the dataset, and introduced intentional errors that we will track down and fix.
We’ll start by reading in those two files. One file contains
information about which species each lizard belongs to, it’s size
(snout-vent length or SVL), the elevation each lizard was at, and a
unique ID number for each lizard – we’ll read this in as
sample_data
. The other file contains lizard IDs and a
physiological measurement of the maximum temperature at which a lizard
can maintain normal physiological function called the critical thermal
maximum or CTmax.
One of the main purposes of this paper was to investigate how CTmax changes with body size (SVL) in four species of spiny lizards of the genus Sceloporus, and we’ll explore these relationships aas we plot and manipulate the data.
sample_data <- read.csv("lizard_data.csv")
physio_data <- read.csv("physio_data.csv")
Let’s take a quick look at the top few rows of each dataset.
head(sample_data) # Check the top or head of a dataset
## Lizard.ID. Species Elevation..m. svl
## 1 8_o S. occidentalis 904 65
## 2 9_o S. occidentalis 904 74
## 3 10_o S. occidentalis 904 75
## 4 11_o S. occidentalis 904 66
## 5 12_o S. occidentalis 904 68
## 6 13_o S. occidentalis 904 69
head(physio_data) # if you're not familiar, anything after '#' is a comment and not interpreted by R, it's there just as your own notes on what you're doing. Use them.
## Lizard.ID. Ctmax
## 1 8_o 40.54
## 2 9_o 42.68
## 3 10_o 40.50
## 4 11_o 43.72
## 5 12_o 42.73
## 6 13_o 42.28
We can also look at the bottom few rows.
tail(sample_data)
## Lizard.ID. Species Elevation..m. svl
## 176 91_j S. jarrovii 1646 50
## 177 7_v S. virgatus 1646 55
## 178 8_v S. virgatus 1646 50
## 179 27_v S. virgatus 1646 50
## 180 84_v S. virgatus 1646 45
## 181 94_v S. virgatus 1646 58
See how many rows and columns are in each dataframe:
dim(sample_data) # get the dimensions of 'body' dataset
## [1] 181 4
dim(physio_data) # get the dimensions of the 'iso' dataset
## [1] 182 2
Let’s take a look at what species are included and how many
samples we have of each species.
unique(sample_data$Species)
## [1] "S. occidentalis" "S. Occidentalis" "S. occidentlis" "S. graciosus"
## [5] "graciosus" "S. jarrovii" "S. virgatus"
summary(as.factor(sample_data$Species))
## graciosus S. graciosus S. jarrovii S. occidentalis S. Occidentalis
## 2 56 25 89 2
## S. occidentlis S. virgatus
## 2 5
What is summary telling us? Do you notice any potential problems with the data from running the previous two commands?
Looking at either of those last outputs, you should notice that we have some misspellings. In some cases, “S. occidentalis” has a capital “O” as “S. Occidentalis” and because R is case sensitive (as are most other coding languages), these are interpreted as different species. We also have “S. graciosus” abbreviated down to just “graciosus” in two cases, and a misspelling of “S. occidentalis” as “S. occidentlis” (missing an “a” towards the end). We will want to correct these errors before we move forward with any further data processing.
There are a few ways to do this. One is by using an indexing approach to identify all of the elements of the objects that contain the values we want to replace, and replacing them with the values we want.
Let’s build this out:
We’ll start by identifying which elements of the “Species” column of
sample_data
contains "S. Occidentalis"
sample_data$Species=="S. Occidentalis"
You should see a long list of TRUE/FALSE values corresponding to
whether each element is (TRUE) or is not (FALSE) “S. Occidentalis”. We
can then use this to select out only the TRUE elements of
sample_data$Species
:
sample_data$Species[sample_data$Species=="S. Occidentalis"]
And finally, using that indexing to identify the incorrect entries, we can replace them with “S. occidentalis”:
sample_data$Species[sample_data$Species=="S. Occidentalis"] <- "S. occidentalis" # replace all instances of S. Occidentalis with S. occidentalis
But there are also faster ways to do this. The function
gsub()
performs pattern matching and replacement, similar
to using conrtol or command F in a Word document. One of the most
essential skills in R is learning how to use new functions. If you
already know what function you want to use, you can ?
before a function to get the built-in help documentation. Try it
out:
?gsub()
?
. You can use ??
to
search in all installed packages, even those not currently loaded.As is the case for many functions, gsub()
has several
options that are set to defaults that we won’t worry about, we only
really care about the first few options here most of the time.
From looking at this help menu, how would we would replace
occurrences of “S. occidentlis” with “S. occidentalis” in
sample_data$Species
?
sample_data$Species <- gsub("S. occidentlis", "S. occidentalis", sample_data$Species)
What if we want to replace “graciosus” with “S. graciosus”? Try it out.
I’m guessing that you used something like:
sample_data$Species <- gsub("graciosus", "S. graciosus", sample_data$Species)
Take another look at the data and let’s see if we’ve cleaned up the species names
unique(sample_data$Species)
## [1] "S. occidentalis" "S. S. graciosus" "S. graciosus" "S. jarrovii"
## [5] "S. virgatus"
summary(as.factor(sample_data$Species))
## S. graciosus S. jarrovii S. occidentalis S. S. graciosus S. virgatus
## 2 25 93 56 5
Do you notice any problems?
What you should notice is that we replaced ALL instances of “graciosus” with “S. graciosus”, so what was previously “S. graciosus” is now “S. S. graciosus”. What we should have done was the following:
sample_data$Species <- gsub("^graciosus$", "S. graciosus", sample_data$Species)
In the above, the ^
indicates the start of a string and
the $
indicates the end of string of a string, indicating
that only want to replace “graciosus” when the “g” is the start of a
string and the “s” is the end. We could go back to the start, read the
data back in from scratch and run the above line, but let’s fix “S. S.
graciosus” in the existing object now.
Try changing “S. S. graciosus” back into “S. graciosus”
There are a few options for doing this:
sample_data$Species <- gsub("^S. S. graciosus$", "S. graciosus", sample_data$Species)
# OR
sample_data$Species <- gsub("S. S.", "S. ", sample_data$Species) # you would need to check that there isn't supposed to be an S. S. anywhere else in your data before using this one
# OR
sample_data$Species[sample_data$Species=="S. S. graciosus"] <- "S. graciosus"
If we take a look at the data again, we should see that these errors have been corrected:
unique(sample_data$Species)
## [1] "S. occidentalis" "S. graciosus" "S. jarrovii" "S. virgatus"
summary(as.factor(sample_data$Species))
## S. graciosus S. jarrovii S. occidentalis S. virgatus
## 58 25 93 5
We should also know how many species we sampled, and we can check how many are in this dataset:
length(unique(sample_data$Species))
## [1] 4
Things look pretty good now. Note that misspellings like these are relatively easy to catch, but incorrect numerical values can be much harder. Those errors will typically require plotting of the data to identify obviously incorrect values, which we’ll cover later.
Before we continue on, we’d like to have all of our data in a single object. This is simpler to keep track of and also allows us to apply filters and manipulations to the entire dataset at once, rather than needing to modify each object individually.
When merging, datasets may not include the same exact samples or samples may be in different orders, so we can’t just stick the columns all together.
Take another look at the dimensions of our two data objects like we did at the start.
You’ll notice that they have different numbers of rows, indicating that at least one sample is in one set but not the other.
We can check for Lizard.ID
elements that are in the
sample_data
but not the physio_data
:
which(!sample_data$Lizard.ID %in% physio_data$Lizard.ID)
## [1] 7 11 17 114 150 154 178 180
the %in%
operator checks for occurrences of the
preceding object in the following object, and returns a vector of
TRUE/FALSE. The !
at the beginning reverses TRUE/FALSE, so
that TRUE instead corresponds to elements of
sample_data$Lizard.ID
that are NOT in
physio_data$Lizard.ID
, and the which()
gives
us the numeric indexes of the elements of the TRUE/FALSE vector that are
true. The result is that the numbers this spits out are the indices of
sample_data$Lizard.ID
that are NOT in
physio_data$Lizard.ID
.
We can use this as an index to get the actual values of
sample_data$Lizard.ID
that are not shared by
physio_data$Lizard.ID
:
sample_data$Lizard.ID[which(!sample_data$Lizard.ID %in% physio_data$Lizard.ID)]
## [1] "20_o" "25_o" "37_o" "186_o" "434_o" "20_j" "8_v" "84_v"
and we can run the same check in reverse order to see values of
physio_data$Lizard.ID
not in
sample_data$Lizard.ID
:
physio_data$Lizard.ID[which(!physio_data$Lizard.ID %in% sample_data$Lizard.ID)]
## [1] "23_o" "65_g" "181_o" "242_o" "176_o" "70_j" "4_v" "9_v" "36_v"
We can see that we have several samples that are present in one of
the datasets, but not the other. We could use some fancy R indexing and
the match()
function to subset both of the datasets down to
only these shared samples in the same row order, but this is somewhat
tedious to do.
Instead, there is an R function we can use that will do all of that
for us: merge()
. By default, it will merge the datasets
based on the column that they share, here Lizard.ID
. Try it
out:
all_data <- merge(sample_data, physio_data)
head(all_data)
## Lizard.ID. Species Elevation..m. svl Ctmax
## 1 10_o S. occidentalis 904 75 40.50
## 2 11_o S. occidentalis 904 66 43.72
## 3 12_j S. jarrovii 1646 53 42.08
## 4 12_o S. occidentalis 904 68 42.73
## 5 120_o S. occidentalis 0 71 42.18
## 6 126_o S. occidentalis 0 66 43.24
dim(all_data)
## [1] 173 5
by.x
and by.y
arguments.Knowing how R objects are structured and how to extract specific
elements from objects using brackets and $
is useful, but
there are functions that will simplify most common data manipulations,
and we’ll explore these shortly.
Before we move on, let’s write our merged all_data
object to a csv file:
write.csv(all_data, "lizard_data_merged.csv")
Then we can easily read this cleaned and merged data into R or
another program anytime we want without having to repeat these steps.
.csv
or “comma-separated values” is a very common file
format for data. It is easily computer-readable because it contains no
formatting, only values, with columns separated by commas.
Also note that by writing a new file from R, we can read in the raw data, edit/filter it as we like, and then write the output to a new file with no risk of accidentally overwriting or editing the raw data. If all of your R commands are saved in a script, then you will end up with your untouched raw data, the manipulated data, and a full record of the manipulations.
Read that .csv file back into R just to demonstrate that we have successfully written out the data.
all_data <- read.csv("lizard_data_merged.csv")
One of the best things about R is its ability to make pretty much any type of plot you’d like. This is useful for exploring your data and also for making high-quality figures that you can put into your publications. I have made nearly all of the figures in my published papers in R.
You can do a lot of plotting with R’s base functions, e.g., using the
plot()
function, but you get much better control and
flexibility using the ggplot2
package. If you get familiar
with the general style of ggplot
, you’ll pretty quickly
start to notice that many figures in research papers are made with this
package.
ggplot2
is part of the Tidyverse set of packages, so we
can install and then load it using:
install.packages("tidyverse")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Before we actually get started, there’s an important issue to point
out in this loading message. The conflicts show that there are functions
in dplyr
and the base R stats
packages that
share the same function names. Whenever this happens, the function from
the most recently loaded package will mask the other function. If you
load dplyr
last and then run filter()
, what
you’ll get is the function from dplyr
. Alternately, if you
load stats
last, you’ll get the filter()
function from that package.
This is important to keep track of. Especially if you write a script
and then edit it to load a package at the the top of a script. You can
always call a function from a specific package by using the notation
package::function()
. The double colons tell R to explicitly
use a function from the stated package.
You can also specifically install and/or load ggplot2 by itself using
the following if you uncomment it. I’ve commented it here because if you
try to install ggplot2
on its own right now, it will
restart R and we’ll need to re-run everything we’ve done so far.
# install.packages("ggplot2")
# library(ggplot2)
ggplot2
makes some very nice figures, but it has a
unique syntax that can take a little while to learn. We’ll make a quick
plot, then explain what’s going on here:
ggplot(data = all_data) +
geom_point(mapping = aes(x = svl, y = Ctmax))
All calls to ggplot()
are composed of at least two
pieces. The first simply specifies the object that contains the data.
All of the data for the plot should be in a single object, with
different variables in different columns and each row specifying a
single observation of a data point (this is part of the general “Tidy”
data philosophy).
The basic ggplot()
call without adding in a
geom
function will just render a blank plot, since you
haven’t told it what kind of plot to make with the data - this is unlike
the base R plot()
function, which will try to guess what
type of plot you want based on the nature of the data.
ggplot(data = all_data) # this makes an empty plot
In the full call:
ggplot(data = all_data) +
geom_point(mapping = aes(x = svl, y = Ctmax))
The geom_point()
function tells ggplot
to
plot out the data as points. Within this function, the
mapping
argument specifies how the data are mapped to the
visualization. The mapping is specified using the aes()
(aesthetic) function. Here we specify only which variable is x and which
is y, but there are other things we can specify as well.
geom
functions can be combined into a single
plot, and the mapping
argument can be specified
independently in each, or can be specified globally within the
ggplot()
function call.We’ve so far plotted out two variables, but we can add information
about additional variables by passing additional arguments to
aes()
. For example, we can change the shape the points by
species:
ggplot(data = all_data) +
geom_point(mapping = aes(x = svl, y = Ctmax, shape = Species))
We can alternately color the points by species:
ggplot(data = all_data) +
geom_point(mapping = aes(x = svl, y = Ctmax, color = Species))
We can also scale the size of the points by some variable using the
size
argument in aes()
. Try modifying the
above plot so that the points are sized by
Elevation..m.
.
ggplot(data = all_data) +
geom_point(mapping = aes(
x = svl,
y = Ctmax,
color = Species,
size = Elevation..m.))
This is a very busy plot at this point, and for me, all points are too large to be interpretable, so let’s add a function that controls the range of point sizes:
ggplot(data = all_data) +
geom_point(mapping = aes(
x = svl,
y = Ctmax,
color = Species,
size = Elevation..m.))+
scale_size(range = c(0.1, 2))
This is a little better, but still is clearly not the best way to display these data. Just because you can plot things in a certain way, doesn’t mean you should plot them that way.
If we want to change the color (or size, shape, etc.) of all points,
rather than according to a variable, we can do this by pulling that
aesthetic outside of the aes()
function and setting it
manually:
ggplot(data = all_data) +
geom_point(mapping = aes(x = svl, y = Ctmax), color = "blue")