Today we will continue with data manipulation and plotting that we started last session.
If you don’t have the data from last time handy, or have forgotten where you put it, use the instructions from the the week 2 session to re-download it. Then set your working directory to where the data are:
setwd("~/r4grads/Fish_data/Modified/")
We’ll need our all_data
object again. If you have your
"fish_data_merged.csv"
file that we wrote out last time,
you can read in:
all_data <- read.csv("fish_data_merged.csv")
If you’ve lost track of this file, we can create it again:
# read in each of the data files
body <- read.csv("Fish_body_size.csv")
iso <- read.csv("Fish_isotopes.csv")
# Make all of the subsitutions that we made last time
body$Species <- gsub("Steelhesd", "Steelhead", body$Species)
body$Species <- gsub("COHO", "Coho", body$Species)
body$Species <- gsub("^Dolly$", "Dolly varden", body$Species)
# Merge the data
all_data <- merge(body, iso)
We also need to load up the tidyverse set of packages:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
There are many times when you will want to analyze certain rows or columns of a dataset, or select only certain values from a dataset. R makes this very easy to do, particularly with functions from the Tidyverse set of R packages, which all share some general principles in their architecture. I won’t go into the details of the Tidyverse philosphy, as it’s very well documented and you can read all about it at the link.
For example, we might want to analyze and plot of just a single species of fish at a time. Let’s filter our dataset down to just Coho.
We can filter data using the filter()
function from the
dplyr
package:
coho_data <- filter(all_data, Species == "Coho")
coho_data
Filter works by taking a dataframe as input, then filtering that down
to only the rows where some condition is met in a given column. In the
above command, we have filtered to only rows of the
all_data
object where the Species
column is
equal to (denoted by ==
in R because =
is used
for other things) "Coho"
.
You can use equal ==
, not equal !=
, greater
or less than >
or <
, and other
conditions in filter. You can also link together conditions using
&
to specify that both conditions must be met or
|
to specify that either condition must be met.
How do you think that we would filter out dataset to only large Coho with fork lengths greater than 10 cm?
big_coho <- filter(all_data, Species == "Coho" & Fork.length..cm. > 10)
big_coho
What if we want to get a dataset that has all of the rows for Coho and Steelhead?
coho_steel <- filter(all_data, Species == "Coho" | Species == "Steelhead")
coho_steel
Here, even though we want Coho AND Steelhead, we need to tell
filter()
to keep rows where Species == "Coho"
OR (|
) Species == "Steelhead"
.
Note that you can achieve all of the same ends without Tidyverse functions using R’s basic indexing, e.g., you could alternately get only Coho rows by running:
coho_data <- all_data[all_data$Species == "Coho",]
head(coho_data, n=4)
## Fish.code Species Site Date Weight..g. Fork.length..cm. del13C Std..error
## 1 C01 Coho RK17 11/10/92 13.2 10.2 -23.08 0.00
## 2 C02 Coho RK17 11/10/92 5.8 7.9 -22.82 0.11
## 3 C03 Coho RT02 11/18/92 8.6 8.9 -22.44 0.11
## 4 C04 Coho RT02 11/18/92 11.8 9.8 -21.69 0.09
## del15N Std..error.1
## 1 13.56 0.04
## 2 13.04 0.16
## 3 14.08 0.13
## 4 14.01 0.11
If we want to select only certain columns from a dataframe, we can
use the select()
function from dplyr
,
e.g.,:
some_coho_rows <- select(coho_data, Fish.code, Weight..g., Fork.length..cm.)
head(some_coho_rows)
## Fish.code Weight..g. Fork.length..cm.
## 1 C01 13.2 10.2
## 2 C02 5.8 7.9
## 3 C03 8.6 8.9
## 4 C04 11.8 9.8
## 5 C05 5.0 7.7
## 6 C06 5.3 8.1
From the coho_data
object, make a plot for this species
that has Fork.length..cm.
on the x axis with
del15N
on the y axis, and points sized by
Weight..g.
. Change the color of the points to a color of
your preference.
ggplot(data = coho_data) +
geom_point(mapping = aes(
x = Fork.length..cm.,
y = del15N,
size = Weight..g.),
color = "hotpink") +
scale_size(range = c(0.1, 2))
Let’s plot this out as a smoothed line to see how that looks
ggplot(data = coho_data) +
geom_smooth(mapping = aes(x = Fork.length..cm., y = del15N), color = "hotpink")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
This is cool, but what if we want individual plots for all species?
We could manually code each species or make a loop to plot each species,
but ggplot
makes it really easy to create multiple plots at
once in a single, larger plot:
ggplot(data = all_data) +
geom_smooth(mapping = aes(x = Fork.length..cm., y = del15N), color = "hotpink") +
facet_wrap(~ Species)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Let’s ignore the warnings for now and just look at the plots.
What’s the problem here? How would you go about fixing it? Remember
that we can look at the help for a function to see available options:
?facet_wrap()
By default, when making a faceted plot, ggplot
fixes the
x and y axes to the same bounds across all plots. Sometimes this is
desired, other times it is not. Here we want to allow both x and y axes
to change freely across plot.
If you look at the scales
argument in the documentation
for facet_wrap()
, this looks like what we want.
If that hadn’t been helpful, we can always google something like
“ggplot facets different axes”. I get this
as the top result, and it also indicates that we need to set the scales
argument to "free"
.
ggplot(data = all_data) +
geom_smooth(mapping = aes(x = Fork.length..cm., y = del15N), color = "hotpink") +
facet_wrap(~ Species, scales = "free")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 3.2
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 0.1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 0
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 3.2
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius 0.1
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 0
This looks much better, but we still have a lot of warnings.
We are getting warnings because because at least one of our species has very few samples, which causes issues for the smoothing algorithm that draws the line and computes confidence intervals.
To be able to tell which species has only a few points, it might be useful to add the points into these plots along with the lines. How would you simultaneously plot both?
ggplot(data = all_data) +
geom_point(mapping = aes(x = Fork.length..cm., y = del15N), color = "black") +
geom_smooth(mapping = aes(x = Fork.length..cm., y = del15N), color = "hotpink") +
facet_wrap(~ Species, scales = "free")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
We have some duplicated code because we’ve specified the asesthetic
mapping twice. We can move this up into the main ggplot
function, which will globally set these mappings. You can always
override them for specific geoms in the geom function call:
ggplot(data = all_data, mapping = aes(x = Fork.length..cm., y = del15N)) +
geom_point(color = "black") +
geom_smooth(color = "hotpink") +
facet_wrap(~ Species, scales = "free")
It looks like the species “Pink” has the fewest observations, which is probably what causes that erratic line.
Use filter()
to remove “Pink” from the
all_data
object and create a no_pink
object.
If you’re not sure how to start this, look at how we filtered previously
and the list of equality operators that are available to us.
no_pink <- filter(all_data, Species != "Pink") # Note the != for "not equal to"
unique(no_pink$Species)
## [1] "Coho" "Chum"
## [3] "Dolly varden" "coastrange sculpin"
## [5] "Steelhead" "Three spine stickleback"
Looks good. Let’s plot this just like above using this filtered object:
ggplot(data = no_pink, mapping = aes(x = Fork.length..cm., y = del15N)) +
geom_point(color = "black") +
geom_smooth(color = "hotpink") +
facet_wrap(~ Species, scales = "free")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
No warnings and better plots!
We can alternately filter the data within the ggplot()
function if we don’t want to create a new object beforehand:
ggplot(data = filter(all_data, Species != "Pink"), mapping = aes(x = Fork.length..cm., y = del15N)) +
geom_point(color = "black") +
geom_smooth(color = "hotpink") +
facet_wrap(~ Species, scales = "free")
We’ve made a bunch of scatter and line plots, let’s explore a few ways to plot out discrete variables.
We can make just a scatterplot, but it’s really not very informative.
ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) +
geom_point()
We can make a boxplot instead:
ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) +
geom_boxplot()
This tells us a lot more information. We can also do things like swap the x and y coordinates:
ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) +
geom_boxplot() +
coord_flip()
Or fill the boxes with colors by species:
ggplot(data = all_data, mapping = aes(x = Species, y = del15N, fill = Species)) +
geom_boxplot() +
theme(legend.position="none")
theme(legend.position="none")
in the above
code do? What happens if we remove it?
There are lots of other things we could do with boxplots, but we’ll move on. My favorite way of visualizing distributions for discrete variables is using violin plots:
ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) +
geom_violin(trim=FALSE)
This is fine, but not great. Change this plot so that the violins are filled with different colors for each species. Also get rid of any legend that shows up.
ggplot(data = all_data, mapping = aes(x = Species, y = del15N, fill = Species)) +
geom_violin(trim=FALSE) +
theme(legend.position="none")
If we want to, we can add boxplots within the violins. This is
similar to adding both geom_smooth
and
geom_point
to the plots above. Try making a violin plot
with boxplots inside.
Make the violins filled by different colors for each species, but
keep the boxplots unfilled. As a hint, the argument width
controls the width of boxplots.
ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) +
geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
geom_boxplot(width=0.1)+
theme(legend.position="none")
This looks pretty fancy to me. What would happen if we swapped the
order of the geom_violin()
and geom_boxplot()
lines? Try it out.
What you should see is that the order of geoms matters. Sequential geoms are plotted on top of the precious ones, and you need to consider this when making plots. Note that if you have multiple layers, there are ways to use semi-transparent colors.
A couple final things:
We can change the axis labels and add a plot title if we want:
ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) +
geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
geom_boxplot(width=0.1)+
theme(legend.position="none") +
labs(title="Fish nitrogen isotopes", x = "Species", y = "delta 15 N")
And/or we can change the theme of the background:
ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) +
geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
geom_boxplot(width=0.1)+
labs(x = "Species", y = "delta 15 N") +
theme_minimal() +
theme(legend.position="none")
theme(legend.position="none")
to after theme_minimal()
because whichever comes last will
override some of the parameters set by the one that is run first.or
ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) +
geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
geom_boxplot(width=0.1)+
labs(x = NULL, y = "delta 15 N") + # Get rid of the x label, sometimes the categories are self explanatory
theme_classic() +
theme(legend.position="none")
You customize each of these plots almost endlesslessly, and there are many other types of plots that you can create, as well. If you want to get an idea for some of the plotting diversity available, check out the the R graph gallery.
R allows us to easily export plots using a couple simple lines of code. We glossed over this last time, so we’ll cover it here again.
RStudio also has options to export and save plots from the “Plots” window - DON’T SAVE PLOTS THAT WAY! These will be lower resolution and sometimes get stretched in weird ways.
The standard way to save a plot in R is to open a plotting device, run your plotting code which will write to the open file connection, then close the plotting device. Here is plotting to a pdf:
pdf(file = "violin.pdf", width = 7, height = 4) # create and open the pdf plotting device
# Make the plot
ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) +
geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
geom_boxplot(width=0.1)+
labs(x = NULL, y = "delta 15 N") + # Get rid of the x label, sometimes the categories are self explanatory
theme_classic() +
theme(legend.position="none")
dev.off() # close the pdf plotting device
## quartz_off_screen
## 2
If you forget to run dev.off()
, all subsequent plots
will continue to go to the plotting device, and you won’t be able to
view the pdf because it won’t be “finished”.
ggplot also allows you to save a plot in an object, so you don’t need
to put the entire plotting call inside the pdf()
and
dev.off()
lines. Here we’ll do this and then also save .png
and .tif files:
# Make the plot
to_plot <- ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) +
geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
geom_boxplot(width=0.1)+
labs(x = NULL, y = "delta 15 N") +
theme_classic() +
theme(legend.position="none")
# Make a pdf
pdf(file = "violin2.pdf", width = 7, height = 4)
to_plot
dev.off()
## quartz_off_screen
## 2
# make a png
png(file = "violin3.png", width=600, height=350) # width and height are in pixels for png
to_plot
dev.off()
## quartz_off_screen
## 2
# make a .tiff
tiff(file="violin4.tiff", width = 7, height = 4, units="in", res=100)
to_plot
dev.off()
## quartz_off_screen
## 2
I like to use pdf format because it’s vector based, and so has effectively infinite resolution. I typically write pdf figures out from R and then directly use these as the figures I submit to journals with my manuscripts.