Today we will continue with data manipulation and plotting that we started last session.

Set up:

If you don’t have the data from last time handy, or have forgotten where you put it, use the instructions from the the week 2 session to re-download it. Then set your working directory to where the data are:

Remember that this file path will almost definitely be different for you!

setwd("~/r4grads/lizard_data/")

We’ll need our all_data object again. If you have your "lizard_data_merged.csv" file that we wrote out last time, you can read in:

all_data <- read.csv("lizard_data_merged.csv")

If you’ve lost track of this file, we can create it again:

If you have the “lizard_data_merged.csv” file and were able to run the line above, you do not need to run these lines!

# read in each of the data files
sample_data <- read.csv("lizard_data.csv")
physio_data <- read.csv("physio_data.csv")
# Make all of the substitutions that we made last time
sample_data$Species <- gsub("S. Occidentalis", "S. occidentalis", sample_data$Species)
sample_data$Species <- gsub("S. occidentlis", "S. occidentalis", sample_data$Species)
sample_data$Species <- gsub("^graciosus$", "S. graciosus", sample_data$Species)
# Merge the data
all_data <- merge(sample_data, physio_data)

We also need to load up the tidyverse set of packages:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Filtering and subsetting data

There are many times when you will want to analyze certain rows or columns of a dataset, or select only certain values from a dataset. R makes this very easy to do, particularly with functions from the Tidyverse set of R packages, which all share some general principles in their architecture. I won’t go into the details of the Tidyverse philosophy, as it’s very well documented and you can read about it at the link.

For example, we might want to analyze and plot of just a single species of lizard at a time. Let’s filter our dataset down to just S. occidentalis.

We can filter data using the filter() function from the dplyr package:

occ_data <- filter(all_data, Species == "S. occidentalis")
occ_data

Filter works by taking a dataframe as input, then filtering that down to only the rows where some condition is met in a given column. In the above command, we have filtered to only rows of the all_data object where the Species column is equal to (denoted by == in R because = is used for other things) "S. occidentalis".

You can use equal ==, not equal !=, greater or less than > or <, and other conditions in filter. You can also link together conditions using & to specify that both conditions must be met or | to specify that either condition must be met.

How do you think that we would filter out dataset to only large S. occidentalis with SVL lengths greater than 65 mm?

big_occ <- filter(all_data, Species == "S. occidentalis" & svl > 65)
big_occ

What if we want to get a dataset that has all of the rows for S. jarrovii and S. occidentalis?

jar_occ <- filter(all_data, Species == "S. jarrovii" | Species == "S. occidentalis")
jar_occ

Here, even though we want S. jarrovii AND S. occidentalis, we need to tell filter() to keep rows where Species == "S. jarrovii" OR (|) Species == "S. occidentalis".

Note that you can achieve all of the same ends without Tidyverse functions using R’s basic indexing, e.g., you could alternately get only S. occidentalis rows by running:

occ_data <- all_data[all_data$Species == "S. occidentalis",]
head(occ_data, n=4)

##   Lizard.ID.         Species Elevation..m. svl Ctmax
## 1       10_o S. occidentalis           904  75 40.50
## 2       11_o S. occidentalis           904  66 43.72
## 4       12_o S. occidentalis           904  68 42.73
## 5      120_o S. occidentalis             0  71 42.18

If we want to select only certain columns from a dataframe, we can use the select() function from dplyr, e.g.,:

some_occ_cols <- select(occ_data, Lizard.ID., svl, Ctmax)
head(some_occ_cols)

##   Lizard.ID. svl Ctmax
## 1       10_o  75 40.50
## 2       11_o  66 43.72
## 4       12_o  68 42.73
## 5      120_o  71 42.18
## 6      126_o  66 43.24
## 7      127_o  62 42.17

From the occ_data object, make a plot for this species that has svl on the x axis with Ctmax on the y axis, and points sized by Elevation..m.. Change the color of the points to a color of your preference.

ggplot(data = occ_data) + 
  geom_point(mapping = aes(
    x = svl, 
    y = Ctmax, 
    size = Elevation..m.),
    color = "hotpink") +
  scale_size(range = c(0.1, 2))

Let’s plot this out as a smoothed line to see how that looks

ggplot(data = occ_data) + 
  geom_smooth(mapping = aes(x = svl, y = Ctmax), color = "hotpink")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Faceted plotting

This is cool, but what if we want individual plots for all species? We could manually code each species or make a loop to plot each species, but ggplot makes it really easy to create multiple plots at once in a single, larger plot:

ggplot(data = all_data) + 
  geom_smooth(mapping = aes(x = svl, y = Ctmax), color = "hotpink") +
  facet_wrap(~ Species)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Let’s ignore the warnings for now and just look at the plots.

What’s prpblems do you see here? How would you go about fixing it? Remember that we can look at the help for a function to see available options: ?facet_wrap()

By default, when making a faceted plot, ggplot fixes the x and y axes to the same bounds across all plots. Sometimes this is desired, other times it is not. For these data, it probably is best to keep the same scales when presenting the data so that the relationships between the sizes of different lizard species are clear, but for exploration, it’s useful to allow different x and y scales so we can better see what’s going on.

So let’s test out changing the x and y axes across the plots.

If you look at the scales argument in the documentation for facet_wrap(), this looks like what we want.

If that hadn’t been helpful, we can always google something like “ggplot facets different axes”. I get this as one of the top results, and it also indicates that we need to set the scales argument to "free".

ggplot(data = all_data) + 
  geom_smooth(mapping = aes(x = svl, y = Ctmax), color = "hotpink") +
  facet_wrap(~ Species, scales = "free")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : span too small.  fewer data values than degrees of freedom.

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 49.96

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 5.04

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 0

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 9.2416

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : span too small.  fewer
## data values than degrees of freedom.

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 49.96

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 5.04

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 0

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 9.2416

## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

This looks much better, but we still have a lot of warnings.

Warnings vs. errors: An error in R means that something has gone wrong such that the function could not be successfully executed. Warnings are potentially less severe. It means that some potentially undesirable behavior has occurred, but that the function is still able to run and return output. However, a warning could very well mean that your function has returned bad results, and you should always investigate what caused a warning and if the output is acceptable or not.

We are getting warnings because because at least one of our species has very few samples, which causes issues for the smoothing algorithm that draws the line and computes confidence intervals.

To be able to tell which species has only a few points, it might be useful to add the points into these plots along with the lines. How would you simultaneously plot both?

ggplot(data = all_data) + 
  geom_point(mapping = aes(x = svl, y = Ctmax), color = "black") +
  geom_smooth(mapping = aes(x = svl, y = Ctmax), color = "hotpink") +
  facet_wrap(~ Species, scales = "free")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The order of geoms matters for whether points or lines will be plotted on top of the other.

We have some duplicated code because we’ve specified the aesthetic mapping twice. We can move this up into the main ggplot function, which will globally set these mappings. You can always override them for specific geoms in the geom function call:

ggplot(data = all_data, mapping = aes(x = svl, y = Ctmax)) + 
  geom_point(color = "black") +
  geom_smooth(color = "hotpink") +
  facet_wrap(~ Species, scales = "free")

It looks like the species S. virgatus has the fewest observations, which is probably what causes those warnings and the line with no confidence interval around it.

Use filter() to remove S. virgatus from the all_data object and create a no_virg object. If you’re not sure how to start this, look at how we filtered previously and the list of equality operators that are available to us.

no_virg <- filter(all_data, Species != "S. virgatus") # Note the != for "not equal to"
unique(no_virg$Species)

## [1] "S. occidentalis" "S. jarrovii"     "S. graciosus"

Looks good. Let’s plot this just like above using this filtered object:

ggplot(data = no_virg, mapping = aes(x = svl, y = Ctmax)) + 
  geom_point(color = "black") +
  geom_smooth(color = "hotpink") +
  facet_wrap(~ Species, scales = "free")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

No warnings and better plots!

We can alternately filter the data within the ggplot() function if we don’t want to create a new object beforehand:

ggplot(data = filter(all_data, Species != "S. virgatus"), mapping = aes(x = svl, y = Ctmax)) + 
  geom_point(color = "black") +
  geom_smooth(color = "hotpink") +
  facet_wrap(~ Species, scales = "free")

Plots of discrete variables

We’ve made a bunch of scatter and line plots, let’s explore a few ways to plot out discrete variables.

We can make just a scatterplot, but it’s really not very informative.

ggplot(data = all_data, mapping = aes(x = Species, y = Ctmax)) + 
  geom_point()

We can make a boxplot instead:

ggplot(data = all_data, mapping = aes(x = Species, y = Ctmax)) + 
  geom_boxplot()

This tells us a lot more information about how Ctmax varies across these speceis. We can also do things like swap the x and y coordinates:

ggplot(data = all_data, mapping = aes(x = Species, y = Ctmax)) + 
  geom_boxplot() +
  coord_flip()

Or fill the boxes with colors by species:

ggplot(data = all_data, mapping = aes(x = Species, y = Ctmax, fill = Species)) + 
  geom_boxplot() +
  theme(legend.position="none")

What does theme(legend.position="none") in the above code do? What happens if we remove it?

There are lots of other things we could do with boxplots, but we’ll move on. My favorite way of visualizing distributions for discrete variables is using violin plots:

ggplot(data = all_data, mapping = aes(x = Species, y = Ctmax)) + 
  geom_violin(trim=FALSE)

This is fine, but not great. Change this plot so that the violins are filled with different colors for each species. Also get rid of any legend that shows up.

ggplot(data = all_data, mapping = aes(x = Species, y = svl, fill = Species)) + 
  geom_violin(trim=FALSE) +
  theme(legend.position="none")

If we want to, we can add boxplots within the violins. This is similar to adding both geom_smooth and geom_point to the plots above. Try making a violin plot with boxplots inside.

Make the violins filled by different colors for each species, but keep the boxplots unfilled. As a hint, the argument width controls the width of boxplots.

ggplot(data = all_data, mapping = aes(x = Species, y = Ctmax)) + 
  geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width=0.1)+
  theme(legend.position="none")

This looks pretty fancy to me. What would happen if we swapped the order of the geom_violin() and geom_boxplot() lines? Try it out.

What you should see is that the order of geoms matters. Sequential geoms are plotted on top of the previous ones, and you need to consider this when making plots. If needed, there are ways to make colors semi-transparent.

A couple final things:

We can change the axis labels and add a plot title if we want:

ggplot(data = all_data, mapping = aes(x = Species, y = Ctmax)) + 
  geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width=0.1)+
  theme(legend.position="none") +
  labs(title="Lizard Thermal Maxima", x = "Species", y = "CTmax")

And/or we can change the theme of the background:

ggplot(data = all_data, mapping = aes(x = Species, y = Ctmax)) + 
  geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width=0.1)+
  labs(x = "Species", y = "CTmax") +
  theme_minimal() +
  theme(legend.position="none")

Note that we had to move theme(legend.position="none") to after theme_minimal() because whichever comes last will override some of the parameters set by the one that is run first.

ggplot(data = all_data, mapping = aes(x = Species, y = Ctmax)) + 
  geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width=0.1)+
  labs(x = NULL, y = "CTmax") + # Get rid of the x label, sometimes the categories are self explanatory
  theme_classic() +
  theme(legend.position="none")

You can customize each of these plots almost endlessly, and there are many other types of plots that you can create, as well. If you want to get an idea for some of the plotting diversity available, check out the the R graph gallery.

Saving plots

R allows us to easily export plots using a couple simple lines of code. We glossed over this last time, so we’ll cover it here again.

RStudio also has options to export and save plots from the “Plots” window - DON’T SAVE PLOTS THAT WAY! These will be lower resolution and sometimes get stretched in weird ways.

The standard way to save a plot in R is to open a plotting device, run your plotting code which will write to the open file connection, then close the plotting device. Here is plotting to a pdf:

pdf(file = "violin.pdf", width = 7, height = 4) # create and open the pdf plotting device

# Make the plot
ggplot(data = all_data, mapping = aes(x = Species, y = Ctmax)) + 
  geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width=0.1)+
  labs(x = NULL, y = "CTmax") + # Get rid of the x label, sometimes the categories are self explanatory
  theme_classic() +
  theme(legend.position="none")

dev.off() # close the pdf plotting device

## quartz_off_screen 
##                 2

If you forget to run dev.off(), all subsequent plots will continue to go to the plotting device, and you won’t be able to view the pdf because it won’t be “finished”.

ggplot also allows you to save a plot in an object, so you don’t need to put the entire plotting call inside the pdf() and dev.off() lines. Here we’ll do this and then also save .png and .tif files:

# Make the plot
to_plot <- ggplot(data = all_data, mapping = aes(x = Species, y = Ctmax)) + 
  geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width=0.1)+
  labs(x = NULL, y = "CTmax") + # Get rid of the x label, sometimes the categories are self explanatory
  theme_classic() +
  theme(legend.position="none")


# Make a pdf
pdf(file = "violin2.pdf", width = 7, height = 4)
to_plot
dev.off()

## quartz_off_screen 
##                 2

# make a png
png(file = "violin3.png", width=600, height=350) # width and height are in pixels for png
to_plot
dev.off()

## quartz_off_screen 
##                 2

# make a .tiff
tiff(file="violin4.tiff", width = 7, height = 4, units="in", res=100)
to_plot
dev.off()

## quartz_off_screen 
##                 2

Note that you need to add the file extensions (e.g., .pdf, .png) to your files if you want them.

I like to use pdf format because it’s vector based, and so has effectively infinite resolution. I typically write pdf figures out from R and then directly use these as the figures I submit to journals with my manuscripts.

Home

R data manipulation and plotting pt. 2

Sean Harrington & Erin Bentley

2025-03-12

Set up:

Filtering and subsetting data

Faceted plotting

Plots of discrete variables

Saving plots