Home



Today we will continue with data manipulation and plotting that we started last session.


Set up:

If you don’t have the data from last time handy, or have forgotten where you put it, use the instructions from the the week 2 session to re-download it. Then set your working directory to where the data are:

setwd("~/r4grads/Fish_data/Modified/")

We’ll need our all_data object again. If you have your "fish_data_merged.csv" file that we wrote out last time, you can read in:

all_data <- read.csv("fish_data_merged.csv")



If you’ve lost track of this file, we can create it again:

# read in each of the data files
body <- read.csv("Fish_body_size.csv")
iso <- read.csv("Fish_isotopes.csv")
# Make all of the subsitutions that we made last time
body$Species <- gsub("Steelhesd", "Steelhead", body$Species)
body$Species <- gsub("COHO", "Coho", body$Species)
body$Species <- gsub("^Dolly$", "Dolly varden", body$Species)
# Merge the data
all_data <- merge(body, iso)


We also need to load up the tidyverse set of packages:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


Filtering and subsetting data


There are many times when you will want to analyze certain rows or columns of a dataset, or select only certain values from a dataset. R makes this very easy to do, particularly with functions from the Tidyverse set of R packages, which all share some general principles in their architecture. I won’t go into the details of the Tidyverse philosphy, as it’s very well documented and you can read all about it at the link.

For example, we might want to analyze and plot of just a single species of fish at a time. Let’s filter our dataset down to just Coho.


We can filter data using the filter() function from the dplyr package:

coho_data <- filter(all_data, Species == "Coho")
coho_data

Filter works by taking a dataframe as input, then filtering that down to only the rows where some condition is met in a given column. In the above command, we have filtered to only rows of the all_data object where the Species column is equal to (denoted by == in R because = is used for other things) "Coho".

You can use equal ==, not equal !=, greater or less than > or <, and other conditions in filter. You can also link together conditions using & to specify that both conditions must be met or | to specify that either condition must be met.


How do you think that we would filter out dataset to only large Coho with fork lengths greater than 10 cm?



























big_coho <- filter(all_data, Species == "Coho" & Fork.length..cm. > 10)
big_coho


What if we want to get a dataset that has all of the rows for Coho and Steelhead?



























coho_steel <- filter(all_data, Species == "Coho" | Species == "Steelhead")
coho_steel

Here, even though we want Coho AND Steelhead, we need to tell filter() to keep rows where Species == "Coho" OR (|) Species == "Steelhead".



Note that you can achieve all of the same ends without Tidyverse functions using R’s basic indexing, e.g., you could alternately get only Coho rows by running:

coho_data <- all_data[all_data$Species == "Coho",]
head(coho_data, n=4)
##   Fish.code Species Site     Date Weight..g. Fork.length..cm. del13C Std..error
## 1       C01    Coho RK17 11/10/92       13.2             10.2 -23.08       0.00
## 2       C02    Coho RK17 11/10/92        5.8              7.9 -22.82       0.11
## 3       C03    Coho RT02 11/18/92        8.6              8.9 -22.44       0.11
## 4       C04    Coho RT02 11/18/92       11.8              9.8 -21.69       0.09
##   del15N Std..error.1
## 1  13.56         0.04
## 2  13.04         0.16
## 3  14.08         0.13
## 4  14.01         0.11


If we want to select only certain columns from a dataframe, we can use the select() function from dplyr, e.g.,:

some_coho_rows <- select(coho_data, Fish.code, Weight..g., Fork.length..cm.)
head(some_coho_rows)
##   Fish.code Weight..g. Fork.length..cm.
## 1       C01       13.2             10.2
## 2       C02        5.8              7.9
## 3       C03        8.6              8.9
## 4       C04       11.8              9.8
## 5       C05        5.0              7.7
## 6       C06        5.3              8.1



From the coho_data object, make a plot for this species that has Fork.length..cm. on the x axis with del15N on the y axis, and points sized by Weight..g.. Change the color of the points to a color of your preference.

































ggplot(data = coho_data) + 
  geom_point(mapping = aes(
    x = Fork.length..cm., 
    y = del15N, 
    size = Weight..g.),
    color = "hotpink") +
  scale_size(range = c(0.1, 2))








Let’s plot this out as a smoothed line to see how that looks

ggplot(data = coho_data) + 
  geom_smooth(mapping = aes(x = Fork.length..cm., y = del15N), color = "hotpink")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'







Faceted plotting

This is cool, but what if we want individual plots for all species? We could manually code each species or make a loop to plot each species, but ggplot makes it really easy to create multiple plots at once in a single, larger plot:

ggplot(data = all_data) + 
  geom_smooth(mapping = aes(x = Fork.length..cm., y = del15N), color = "hotpink") +
  facet_wrap(~ Species)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'


Let’s ignore the warnings for now and just look at the plots.

What’s the problem here? How would you go about fixing it? Remember that we can look at the help for a function to see available options: ?facet_wrap()

































By default, when making a faceted plot, ggplot fixes the x and y axes to the same bounds across all plots. Sometimes this is desired, other times it is not. Here we want to allow both x and y axes to change freely across plot.

If you look at the scales argument in the documentation for facet_wrap(), this looks like what we want.

If that hadn’t been helpful, we can always google something like “ggplot facets different axes”. I get this as the top result, and it also indicates that we need to set the scales argument to "free".


ggplot(data = all_data) + 
  geom_smooth(mapping = aes(x = Fork.length..cm., y = del15N), color = "hotpink") +
  facet_wrap(~ Species, scales = "free")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 3.2
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 0.1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 0
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 3.2
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius 0.1
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 0

This looks much better, but we still have a lot of warnings.

  • Warnings vs. errors: An error in R means that something has gone wrong such that the function could not be successfully executed. Warnings are potentially less severe. It means that some potetentially undesirable behavior has occurred, but that the function is still able to run and return output. However, a warning could very well mean that your function has returned bad results, and you should always investigate what caused a warning and if the output is acceptable or not.

We are getting warnings because because at least one of our species has very few samples, which causes issues for the smoothing algorithm that draws the line and computes confidence intervals.


To be able to tell which species has only a few points, it might be useful to add the points into these plots along with the lines. How would you simultaneously plot both?

































ggplot(data = all_data) + 
  geom_point(mapping = aes(x = Fork.length..cm., y = del15N), color = "black") +
  geom_smooth(mapping = aes(x = Fork.length..cm., y = del15N), color = "hotpink") +
  facet_wrap(~ Species, scales = "free")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'



  • The order of geoms matters for wether points or lines will be plotted on top of the other.

We have some duplicated code because we’ve specified the asesthetic mapping twice. We can move this up into the main ggplot function, which will globally set these mappings. You can always override them for specific geoms in the geom function call:

ggplot(data = all_data, mapping = aes(x = Fork.length..cm., y = del15N)) + 
  geom_point(color = "black") +
  geom_smooth(color = "hotpink") +
  facet_wrap(~ Species, scales = "free")




It looks like the species “Pink” has the fewest observations, which is probably what causes that erratic line.

Use filter() to remove “Pink” from the all_data object and create a no_pink object. If you’re not sure how to start this, look at how we filtered previously and the list of equality operators that are available to us.

































no_pink <- filter(all_data, Species != "Pink") # Note the != for "not equal to"
unique(no_pink$Species)
## [1] "Coho"                    "Chum"                   
## [3] "Dolly varden"            "coastrange sculpin"     
## [5] "Steelhead"               "Three spine stickleback"

Looks good. Let’s plot this just like above using this filtered object:


ggplot(data = no_pink, mapping = aes(x = Fork.length..cm., y = del15N)) + 
  geom_point(color = "black") +
  geom_smooth(color = "hotpink") +
  facet_wrap(~ Species, scales = "free")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'


No warnings and better plots!


We can alternately filter the data within the ggplot() function if we don’t want to create a new object beforehand:

ggplot(data = filter(all_data, Species != "Pink"), mapping = aes(x = Fork.length..cm., y = del15N)) + 
  geom_point(color = "black") +
  geom_smooth(color = "hotpink") +
  facet_wrap(~ Species, scales = "free")






Plots of discrete variables

We’ve made a bunch of scatter and line plots, let’s explore a few ways to plot out discrete variables.


We can make just a scatterplot, but it’s really not very informative.

ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) + 
  geom_point()



We can make a boxplot instead:

ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) + 
  geom_boxplot()

This tells us a lot more information. We can also do things like swap the x and y coordinates:

ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) + 
  geom_boxplot() +
  coord_flip()

Or fill the boxes with colors by species:

ggplot(data = all_data, mapping = aes(x = Species, y = del15N, fill = Species)) + 
  geom_boxplot() +
  theme(legend.position="none")

  • What does theme(legend.position="none") in the above code do? What happens if we remove it?





There are lots of other things we could do with boxplots, but we’ll move on. My favorite way of visualizing distributions for discrete variables is using violin plots:

ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) + 
  geom_violin(trim=FALSE)

This is fine, but not great. Change this plot so that the violins are filled with different colors for each species. Also get rid of any legend that shows up.

































ggplot(data = all_data, mapping = aes(x = Species, y = del15N, fill = Species)) + 
  geom_violin(trim=FALSE) +
  theme(legend.position="none")


If we want to, we can add boxplots within the violins. This is similar to adding both geom_smooth and geom_point to the plots above. Try making a violin plot with boxplots inside.

Make the violins filled by different colors for each species, but keep the boxplots unfilled. As a hint, the argument width controls the width of boxplots.

































ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) + 
  geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width=0.1)+
  theme(legend.position="none")

This looks pretty fancy to me. What would happen if we swapped the order of the geom_violin() and geom_boxplot() lines? Try it out.





What you should see is that the order of geoms matters. Sequential geoms are plotted on top of the precious ones, and you need to consider this when making plots. Note that if you have multiple layers, there are ways to use semi-transparent colors.



A couple final things:

We can change the axis labels and add a plot title if we want:

ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) + 
  geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width=0.1)+
  theme(legend.position="none") +
  labs(title="Fish nitrogen isotopes", x = "Species", y = "delta 15 N")

And/or we can change the theme of the background:

ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) + 
  geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width=0.1)+
  labs(x = "Species", y = "delta 15 N") +
  theme_minimal() +
  theme(legend.position="none")

  • Note that we had to move theme(legend.position="none") to after theme_minimal() because whichever comes last will override some of the parameters set by the one that is run first.

or

ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) + 
  geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width=0.1)+
  labs(x = NULL, y = "delta 15 N") + # Get rid of the x label, sometimes the categories are self explanatory
  theme_classic() +
  theme(legend.position="none")





You customize each of these plots almost endlesslessly, and there are many other types of plots that you can create, as well. If you want to get an idea for some of the plotting diversity available, check out the the R graph gallery.



Saving plots

R allows us to easily export plots using a couple simple lines of code. We glossed over this last time, so we’ll cover it here again.

RStudio also has options to export and save plots from the “Plots” window - DON’T SAVE PLOTS THAT WAY! These will be lower resolution and sometimes get stretched in weird ways.

The standard way to save a plot in R is to open a plotting device, run your plotting code which will write to the open file connection, then close the plotting device. Here is plotting to a pdf:

pdf(file = "violin.pdf", width = 7, height = 4) # create and open the pdf plotting device

# Make the plot
ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) + 
  geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width=0.1)+
  labs(x = NULL, y = "delta 15 N") + # Get rid of the x label, sometimes the categories are self explanatory
  theme_classic() +
  theme(legend.position="none")

dev.off() # close the pdf plotting device
## quartz_off_screen 
##                 2

If you forget to run dev.off(), all subsequent plots will continue to go to the plotting device, and you won’t be able to view the pdf because it won’t be “finished”.


ggplot also allows you to save a plot in an object, so you don’t need to put the entire plotting call inside the pdf() and dev.off() lines. Here we’ll do this and then also save .png and .tif files:

# Make the plot
to_plot <- ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) + 
  geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width=0.1)+
  labs(x = NULL, y = "delta 15 N") +
  theme_classic() +
  theme(legend.position="none")


# Make a pdf
pdf(file = "violin2.pdf", width = 7, height = 4)
to_plot
dev.off() 
## quartz_off_screen 
##                 2
# make a png
png(file = "violin3.png", width=600, height=350) # width and height are in pixels for png
to_plot
dev.off()
## quartz_off_screen 
##                 2
# make a .tiff
tiff(file="violin4.tiff", width = 7, height = 4, units="in", res=100)
to_plot
dev.off()
## quartz_off_screen 
##                 2
  • Note that you need to add the file extensions (e.g., .pdf, .png) to your files if you want them.

I like to use pdf format because it’s vector based, and so has effectively infinite resolution. I typically write pdf figures out from R and then directly use these as the figures I submit to journals with my manuscripts.





Home