In this tutorial, we will explore some basic data manipulation and visualization in R. Occassionally, datasets start out perfectly formatted with no missing data, outliers, etc. More often, you’ll need to do some kind of filtering, subsetting, and/or merging of data files. Plotting your data is helpful not only for publication and presentation, but also for identifying potential issues in your data.

I highly recommend writing all of your code in an R script and commenting it liberally. Even for relatively simple operations, if you need to repeat them or slightly tweak them, it’s much easier to simply open up your script and re-run it or edit it than to re-type everything into R. Even more importantly, it ensures that there is a record of exactly what you did. If you haven’t already dealt with trying to remember how you handled data when you come back to it to after a couple weeks or even just a few days, you’ll probably be surprised by just how quickly you can forget what you were doing and why.

How to open a script is detailed in the intro to R tutorial.

When there are large spaces in this tutorial, this indicates places where you should try to figure out how to code something. If you keep scrolling down, you’ll find the solution, but you’ll get a lot more out of it if you make a solid attempt to solve the problems yourself first.

Today we’ll explore some data on thermal physiology taken from this paper: Body size impacts critical thermal maximum measurements in lizards.

Reading in the data

Before we can work with the data, we need to download it. Go to the Github repository for this workshop: https://github.com/wyoibc/r4grads and click on the green Code button, then on the Download ZIP button. (You can alternately clone the repo if you’re familiar with git). You can also download the individual files from here if you know how to do that.

download zip

Start by setting our working directory. If you set things up exactly as I did, then this path will work for you. If not, then you’ll need to edit this path:

setwd("~/r4grads-master/lizard_data/")

Again, make sure this points to the correct path for you, not my path.

The data are all in this directory, as is a pdf of the paper that these data are from. “scelop_body_exp1.csv” is the actual data file for the paper, which is very nicely formatted. For illustrative purposes, I have split this into two files (“lizard_data.csv” and “physio_data.csv”), reduced the dataset, and introduced intentional errors that we will track down and fix.

We’ll start by reading in those two files. One file contains information about which species each lizard belongs to, it’s size (snout-vent length or SVL), the elevation each lizard was at, and a unique ID number for each lizard – we’ll read this in as sample_data. The other file contains lizard IDs and a physiological measurement of the maximum temperature at which a lizard can maintain normal physiological function called the critical thermal maximum or CTmax.

One of the main purposes of this paper was to investigate how CTmax changes with body size (SVL) in four species of spiny lizards of the genus Sceloporus, and we’ll explore these relationships aas we plot and manipulate the data.

sample_data <- read.csv("lizard_data.csv")
physio_data <- read.csv("physio_data.csv")

If you get an error here, your working directory is probably not set correctly.

Let’s take a quick look at the top few rows of each dataset.

head(sample_data) # Check the top or head of a dataset

##   Lizard.ID.         Species Elevation..m. svl
## 1        8_o S. occidentalis           904  65
## 2        9_o S. occidentalis           904  74
## 3       10_o S. occidentalis           904  75
## 4       11_o S. occidentalis           904  66
## 5       12_o S. occidentalis           904  68
## 6       13_o S. occidentalis           904  69

head(physio_data) # if you're not familiar, anything after '#' is a comment and not interpreted by R, it's there just as your own notes on what you're doing. Use them.

##   Lizard.ID. Ctmax
## 1        8_o 40.54
## 2        9_o 42.68
## 3       10_o 40.50
## 4       11_o 43.72
## 5       12_o 42.73
## 6       13_o 42.28

We can also look at the bottom few rows.

tail(sample_data)

##     Lizard.ID.     Species Elevation..m. svl
## 176       91_j S. jarrovii          1646  50
## 177        7_v S. virgatus          1646  55
## 178        8_v S. virgatus          1646  50
## 179       27_v S. virgatus          1646  50
## 180       84_v S. virgatus          1646  45
## 181       94_v S. virgatus          1646  58

See how many rows and columns are in each dataframe:

dim(sample_data) # get the dimensions of 'body' dataset

## [1] 181   4

dim(physio_data) # get the dimensions of the 'iso' dataset

## [1] 182   2

Let’s take a look at what species are included and how many samples we have of each species.

unique(sample_data$Species)

## [1] "S. occidentalis" "S. Occidentalis" "S. occidentlis"  "S. graciosus"   
## [5] "graciosus"       "S. jarrovii"     "S. virgatus"

summary(as.factor(sample_data$Species))

##       graciosus    S. graciosus     S. jarrovii S. occidentalis S. Occidentalis 
##               2              56              25              89               2 
##  S. occidentlis     S. virgatus 
##               2               5

What is summary telling us? Do you notice any potential problems with the data from running the previous two commands?

Fixing spelling

Looking at either of those last outputs, you should notice that we have some misspellings. In some cases, “S. occidentalis” has a capital “O” as “S. Occidentalis” and because R is case sensitive (as are most other coding languages), these are interpreted as different species. We also have “S. graciosus” abbreviated down to just “graciosus” in two cases, and a misspelling of “S. occidentalis” as “S. occidentlis” (missing an “a” towards the end). We will want to correct these errors before we move forward with any further data processing.

There are a few ways to do this. One is by using an indexing approach to identify all of the elements of the objects that contain the values we want to replace, and replacing them with the values we want.

Let’s build this out:

We’ll start by identifying which elements of the “Species” column of sample_data contains "S. Occidentalis"

sample_data$Species=="S. Occidentalis"

You should see a long list of TRUE/FALSE values corresponding to whether each element is (TRUE) or is not (FALSE) “S. Occidentalis”. We can then use this to select out only the TRUE elements of sample_data$Species:

sample_data$Species[sample_data$Species=="S. Occidentalis"]

And finally, using that indexing to identify the incorrect entries, we can replace them with “S. occidentalis”:

sample_data$Species[sample_data$Species=="S. Occidentalis"] <- "S. occidentalis" # replace all instances of S. Occidentalis with S. occidentalis

But there are also faster ways to do this. The function gsub() performs pattern matching and replacement, similar to using conrtol or command F in a Word document. One of the most essential skills in R is learning how to use new functions. If you already know what function you want to use, you can ? before a function to get the built-in help documentation. Try it out:

?gsub()

Note that if you are trying to get help for a function that is not in base R, but contained in a specific package, you’ll need to have that package loaded to use ?. You can use ?? to search in all installed packages, even those not currently loaded.

As is the case for many functions, gsub() has several options that are set to defaults that we won’t worry about, we only really care about the first few options here most of the time.

From looking at this help menu, how would we would replace occurrences of “S. occidentlis” with “S. occidentalis” in sample_data$Species?

sample_data$Species <- gsub("S. occidentlis", "S. occidentalis", sample_data$Species)

What if we want to replace “graciosus” with “S. graciosus”? Try it out.

I’m guessing that you used something like:

sample_data$Species <- gsub("graciosus", "S. graciosus", sample_data$Species)

Take another look at the data and let’s see if we’ve cleaned up the species names

unique(sample_data$Species)

## [1] "S. occidentalis" "S. S. graciosus" "S. graciosus"    "S. jarrovii"    
## [5] "S. virgatus"

summary(as.factor(sample_data$Species))

##    S. graciosus     S. jarrovii S. occidentalis S. S. graciosus     S. virgatus 
##               2              25              93              56               5

Do you notice any problems?

What you should notice is that we replaced ALL instances of “graciosus” with “S. graciosus”, so what was previously “S. graciosus” is now “S. S. graciosus”. What we should have done was the following:

sample_data$Species <- gsub("^graciosus$", "S. graciosus", sample_data$Species)

In the above, the ^ indicates the start of a string and the $ indicates the end of string of a string, indicating that only want to replace “graciosus” when the “g” is the start of a string and the “s” is the end. We could go back to the start, read the data back in from scratch and run the above line, but let’s fix “S. S. graciosus” in the existing object now.

Try changing “S. S. graciosus” back into “S. graciosus”

There are a few options for doing this:

sample_data$Species <- gsub("^S. S. graciosus$", "S. graciosus", sample_data$Species)
# OR
sample_data$Species <- gsub("S. S.", "S. ", sample_data$Species) # you would need to check that there isn't supposed to be an S. S. anywhere else in your data before using this one
# OR
sample_data$Species[sample_data$Species=="S. S. graciosus"] <- "S. graciosus"

If we take a look at the data again, we should see that these errors have been corrected:

unique(sample_data$Species)

## [1] "S. occidentalis" "S. graciosus"    "S. jarrovii"     "S. virgatus"

summary(as.factor(sample_data$Species))

##    S. graciosus     S. jarrovii S. occidentalis     S. virgatus 
##              58              25              93               5

We should also know how many species we sampled, and we can check how many are in this dataset:

length(unique(sample_data$Species))

## [1] 4

Things look pretty good now. Note that misspellings like these are relatively easy to catch, but incorrect numerical values can be much harder. Those errors will typically require plotting of the data to identify obviously incorrect values, which we’ll cover later.

Merging the data

Before we continue on, we’d like to have all of our data in a single object. This is simpler to keep track of and also allows us to apply filters and manipulations to the entire dataset at once, rather than needing to modify each object individually.

When merging, datasets may not include the same exact samples or samples may be in different orders, so we can’t just stick the columns all together.

Take another look at the dimensions of our two data objects like we did at the start.

You’ll notice that they have different numbers of rows, indicating that at least one sample is in one set but not the other.

We can check for Lizard.ID elements that are in the sample_data but not the physio_data:

which(!sample_data$Lizard.ID %in% physio_data$Lizard.ID)

## [1]   7  11  17 114 150 154 178 180

the %in% operator checks for occurrences of the preceding object in the following object, and returns a vector of TRUE/FALSE. The ! at the beginning reverses TRUE/FALSE, so that TRUE instead corresponds to elements of sample_data$Lizard.ID that are NOT in physio_data$Lizard.ID, and the which() gives us the numeric indexes of the elements of the TRUE/FALSE vector that are true. The result is that the numbers this spits out are the indices of sample_data$Lizard.ID that are NOT in physio_data$Lizard.ID.

We can use this as an index to get the actual values of sample_data$Lizard.ID that are not shared by physio_data$Lizard.ID:

sample_data$Lizard.ID[which(!sample_data$Lizard.ID %in% physio_data$Lizard.ID)]

## [1] "20_o"  "25_o"  "37_o"  "186_o" "434_o" "20_j"  "8_v"   "84_v"

and we can run the same check in reverse order to see values of physio_data$Lizard.ID not in sample_data$Lizard.ID:

physio_data$Lizard.ID[which(!physio_data$Lizard.ID %in% sample_data$Lizard.ID)]

## [1] "23_o"  "65_g"  "181_o" "242_o" "176_o" "70_j"  "4_v"   "9_v"   "36_v"

We can see that we have several samples that are present in one of the datasets, but not the other. We could use some fancy R indexing and the match() function to subset both of the datasets down to only these shared samples in the same row order, but this is somewhat tedious to do.

Instead, there is an R function we can use that will do all of that for us: merge(). By default, it will merge the datasets based on the column that they share, here Lizard.ID. Try it out:

all_data <- merge(sample_data, physio_data)
head(all_data)

##   Lizard.ID.         Species Elevation..m. svl Ctmax
## 1       10_o S. occidentalis           904  75 40.50
## 2       11_o S. occidentalis           904  66 43.72
## 3       12_j     S. jarrovii          1646  53 42.08
## 4       12_o S. occidentalis           904  68 42.73
## 5      120_o S. occidentalis             0  71 42.18
## 6      126_o S. occidentalis             0  66 43.24

dim(all_data)

## [1] 173   5

If you don’t have identically named columns in your 2 datasets that you are merging, you can specify which columns to merge by using the by.x and by.y arguments.

Knowing how R objects are structured and how to extract specific elements from objects using brackets and $ is useful, but there are functions that will simplify most common data manipulations, and we’ll explore these shortly.

Before we move on, let’s write our merged all_data object to a csv file:

write.csv(all_data, "lizard_data_merged.csv")

Then we can easily read this cleaned and merged data into R or another program anytime we want without having to repeat these steps. .csv or “comma-separated values” is a very common file format for data. It is easily computer-readable because it contains no formatting, only values, with columns separated by commas.

Also note that by writing a new file from R, we can read in the raw data, edit/filter it as we like, and then write the output to a new file with no risk of accidentally overwriting or editing the raw data. If all of your R commands are saved in a script, then you will end up with your untouched raw data, the manipulated data, and a full record of the manipulations.

Read that .csv file back into R just to demonstrate that we have successfully written out the data.

all_data <- read.csv("lizard_data_merged.csv")

Plotting with ggplot2

One of the best things about R is its ability to make pretty much any type of plot you’d like. This is useful for exploring your data and also for making high-quality figures that you can put into your publications. I have made nearly all of the figures in my published papers in R.

You can do a lot of plotting with R’s base functions, e.g., using the plot() function, but you get much better control and flexibility using the ggplot2 package. If you get familiar with the general style of ggplot, you’ll pretty quickly start to notice that many figures in research papers are made with this package.

ggplot2 is part of the Tidyverse set of packages, so we can install and then load it using:

install.packages("tidyverse")

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Before we actually get started, there’s an important issue to point out in this loading message. The conflicts show that there are functions in dplyr and the base R stats packages that share the same function names. Whenever this happens, the function from the most recently loaded package will mask the other function. If you load dplyr last and then run filter(), what you’ll get is the function from dplyr. Alternately, if you load stats last, you’ll get the filter() function from that package.

This is important to keep track of. Especially if you write a script and then edit it to load a package at the the top of a script. You can always call a function from a specific package by using the notation package::function(). The double colons tell R to explicitly use a function from the stated package.

You can also specifically install and/or load ggplot2 by itself using the following if you uncomment it. I’ve commented it here because if you try to install ggplot2 on its own right now, it will restart R and we’ll need to re-run everything we’ve done so far.

# install.packages("ggplot2")
# library(ggplot2)

ggplot syntax

ggplot2 makes some very nice figures, but it has a unique syntax that can take a little while to learn. We’ll make a quick plot, then explain what’s going on here:

ggplot(data = all_data) + 
  geom_point(mapping = aes(x = svl, y = Ctmax))

All calls to ggplot() are composed of at least two pieces. The first simply specifies the object that contains the data. All of the data for the plot should be in a single object, with different variables in different columns and each row specifying a single observation of a data point (this is part of the general “Tidy” data philosophy).

The basic ggplot() call without adding in a geom function will just render a blank plot, since you haven’t told it what kind of plot to make with the data - this is unlike the base R plot() function, which will try to guess what type of plot you want based on the nature of the data.

ggplot(data = all_data) # this makes an empty plot

In the full call:

ggplot(data = all_data) + 
  geom_point(mapping = aes(x = svl, y = Ctmax))

The geom_point() function tells ggplot to plot out the data as points. Within this function, the mapping argument specifies how the data are mapped to the visualization. The mapping is specified using the aes() (aesthetic) function. Here we specify only which variable is x and which is y, but there are other things we can specify as well.

Multiple geom functions can be combined into a single plot, and the mapping argument can be specified independently in each, or can be specified globally within the ggplot() function call.

Aesthetic mappings

We’ve so far plotted out two variables, but we can add information about additional variables by passing additional arguments to aes(). For example, we can change the shape the points by species:

ggplot(data = all_data) + 
  geom_point(mapping = aes(x = svl, y = Ctmax, shape = Species))

We can alternately color the points by species:

ggplot(data = all_data) + 
  geom_point(mapping = aes(x = svl, y = Ctmax, color = Species))

We can also scale the size of the points by some variable using the size argument in aes(). Try modifying the above plot so that the points are sized by Elevation..m..

ggplot(data = all_data) + 
  geom_point(mapping = aes(
    x = svl, 
    y = Ctmax, 
    color = Species,
    size = Elevation..m.))

This is a very busy plot at this point, and for me, all points are too large to be interpretable, so let’s add a function that controls the range of point sizes:

ggplot(data = all_data) + 
  geom_point(mapping = aes(
    x = svl, 
    y = Ctmax, 
    color = Species,
    size = Elevation..m.))+
  scale_size(range = c(0.1, 2))

This is a little better, but still is clearly not the best way to display these data. Just because you can plot things in a certain way, doesn’t mean you should plot them that way.

If we want to change the color (or size, shape, etc.) of all points, rather than according to a variable, we can do this by pulling that aesthetic outside of the aes() function and setting it manually:

ggplot(data = all_data) + 
  geom_point(mapping = aes(x = svl, y = Ctmax), color = "blue")

Home

R data manipulation and plotting

Sean Harrington & Erin Bentley

2025-03-05