In this tutorial, we will explore some basic data manipulation and visualization in R. Occasionally, datasets start out perfectly formatted with no missing data, outliers, etc. More often, you’ll need to do some kind of filtering, subsetting, and/or merging of data files. Plotting your data is helpful not only for publication and presentation, but also for identifying potential issues in your data.
I highly recommend writing all of your code in an R script and commenting it liberally. Even for relatively simple operations, if you need to repeat them or slightly tweak them, it’s much easier to open up your script and re-run or edit it than to re-type everything into R. Even more importantly, it ensures that there is a record of exactly what you did. If you haven’t already dealt with trying to remember how you handled data when you come back to it after a couple of weeks or even just a few days, you’ll probably be surprised by just how quickly you can forget what you were doing and why.
How to open a script is detailed in the intro to R tutorial.
Before we can work with the data, we need to download it. Go to the GitHub repository for this workshop: https://github.com/wyoibc/r4grads and click on the green Code button, then on the Download ZIP button. (You can alternatively clone the repo if you’re familiar with git.) You can also download the individual files from the repository if you know how to do that.
Start by setting our working directory. If you set things up exactly as I did, then this path will work for you. If not, then you’ll need to edit this path:
setwd("~/r4grads-master/Fish_data/Modified/")
Again, make sure this points to the correct path for you, not my path.
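If you are unsure whether the path is right, a quick check before calling setwd() can save some confusion. The sketch below assumes the same directory as above; data_dir is just a name made up for illustration, and you would edit it to match your own machine.

```r
# Hypothetical variable holding the path used above; edit it for your machine
data_dir <- "~/r4grads-master/Fish_data/Modified/"

if (dir.exists(data_dir)) {
  setwd(data_dir)
  print(getwd())                          # confirm where R is now looking
  print(list.files(pattern = "\\.csv$"))  # the fish csv files should be listed
} else {
  message("Directory not found, edit data_dir: ", data_dir)
}
```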
Then read in the data. Here, we have the data in two separate csv files for different data pertaining to some fish. One file is body size data with some information about sampling and the species, and the other file contains stable isotope data.
body <- read.csv("Fish_body_size.csv")
iso <- read.csv("Fish_isotopes.csv")
Let’s take a quick look at the top few rows of each dataset.
head(body) # Check the top or head of a dataset
## Fish.code Species Site Date Weight..g. Fork.length..cm.
## 1 C01 Coho RK17 11/10/92 13.2 10.2
## 2 C02 Coho RK17 11/10/92 5.8 7.9
## 3 C03 Coho RT02 11/18/92 8.6 8.9
## 4 C04 Coho RT02 11/18/92 11.8 9.8
## 5 C05 Coho RT02 11/18/92 5.0 7.7
## 6 C06 Coho RT02 11/18/92 5.3 8.1
head(iso) # if you're not familiar, anything after '#' is a comment and not interpreted by R, it's there just as your own notes on what you're doing. Use them.
## Fish.code del13C Std..error del15N Std..error.1
## 1 C01 -23.08 0.00 13.56 0.04
## 2 C02 -22.82 0.11 13.04 0.16
## 3 C03 -22.44 0.11 14.08 0.13
## 4 C04 -21.69 0.09 14.01 0.11
## 5 C05 -27.27 0.00 9.20 0.03
## 6 C06 -23.56 0.00 12.84 0.00
We can also look at the bottom few rows.
tail(body)
## Fish.code Species Site Date Weight..g.
## 408 ST29 Three spine stickleback RT02 4/19/93 1.0
## 409 ST30 Three spine stickleback BP02-R 4/19/93 0.3
## 410 ST31 Three spine stickleback BP04 10/23/93 1.4
## 411 ST32 Three spine stickleback BP04 10/23/93 0.8
## 412 ST33 Three spine stickleback RK09 10/22/93 0.9
## 413 ST34 Three spine stickleback RK09 10/25/93 1.8
## Fork.length..cm.
## 408 5.1
## 409 3.2
## 410 5.4
## 411 4.4
## 412 4.6
## 413 5.4
See how many rows and columns are in each dataframe:
dim(body) # get the dimensions of 'body' dataset
## [1] 413 6
dim(iso) # get the dimensions of the 'iso' dataset
## [1] 414 5
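Along with head(), tail(), and dim(), it can help to check the type of each column and count missing values. Here is a small sketch on a made-up data frame (toy is not one of the workshop files; its column names just imitate them):

```r
# Made-up toy data, not the workshop files
toy <- data.frame(
  Fish.code = c("A1", "A2", "A3"),
  Species = c("Coho", "Chum", "Coho"),
  Weight..g. = c(13.2, 5.8, NA)
)

str(toy)             # column names, types, and the first few values
nrow(toy)            # first number reported by dim()
ncol(toy)            # second number reported by dim()
colSums(is.na(toy))  # count of missing values in each column
```

The same calls should work on body and iso once they are read in.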
Let’s take a look at what species are included and how many samples we have of each species.
unique(body$Species)
## [1] "Coho" "COHO"
## [3] "Chum" "Dolly varden"
## [5] "Dolly" "Pink"
## [7] "coastrange sculpin" "Steelhead"
## [9] "Steelhesd" "Three spine stickleback"
summary(as.factor(body$Species))
## Chum coastrange sculpin Coho
## 10 58 183
## COHO Dolly Dolly varden
## 6 1 98
## Pink Steelhead Steelhesd
## 15 7 1
## Three spine stickleback
## 34
What is summary telling us? Do you notice any potential problems with the data from running the previous two commands?
Looking at either of those last outputs, you should notice that we have some misspellings. In some cases, “Coho” was written in all capital letters, and because R is case sensitive (as are most other coding languages), these are interpreted as different species. We also have “Dolly Varden” abbreviated down to just “Dolly” in one case, and a misspelling of “Steelhead” as “Steelhesd”. We will want to correct these before we move forward with any further data processing.
There are a few ways to do this. One is by using an indexing approach to identify all of the elements of the objects that contain the values we want to replace, and replacing them with the values we want.
Let’s build this out:
We’ll start by identifying which elements of the “Species” column of body contain "COHO":
body$Species=="COHO"
You should see a long list of TRUE/FALSE values corresponding to whether each element is (TRUE) or is not (FALSE) “COHO”. We can then use this to select out only the TRUE elements of body$Species:
body$Species[body$Species=="COHO"]
And finally, using that indexing to identify the incorrect entries, we can replace them with “Coho”:
body$Species[body$Species=="COHO"] <- "Coho" # replace all instances of COHO with Coho
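Before overwriting values like this, it can be reassuring to count how many elements will change. A small sketch on a made-up vector (sp and is_upper are illustrative names, not objects from the workshop data):

```r
sp <- c("Coho", "COHO", "Chum", "COHO")  # made-up values

is_upper <- sp == "COHO"  # logical vector: FALSE TRUE FALSE TRUE
sum(is_upper)             # TRUE counts as 1, so this is the number of matches: 2

sp[is_upper] <- "Coho"    # replace only the matching elements
sp                        # "Coho" "Coho" "Chum" "Coho"
```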
But there are also much faster ways to do this. The function gsub() performs pattern matching and replacement. One of the most essential skills in R is learning how to use new functions. If you already know what function you want to use, you can put ? before a function to get the built-in help documentation. Try it out:
?gsub()
? looks up help for a function by name in the packages you currently have loaded. You can use ?? to search in all installed packages, even those not currently loaded.
As is the case for many functions, gsub() has several options that are set to defaults that we won’t worry about; most of the time, we only really care about the first few options here.
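Besides the help page, base R has a couple of quick ways to see how a function is called. A short sketch, with the interactive help calls left commented out so the block runs quietly:

```r
args(gsub)                 # argument names and their defaults
names(formals(gsub))[1:3]  # the first three arguments: "pattern" "replacement" "x"

# help("gsub")             # long form of ?gsub
# help.search("replace")   # long form of ??replace
```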
From looking at this help menu, how would we replace occurrences of “Steelhesd” with “Steelhead” in body$Species?
body$Species <- gsub("Steelhesd", "Steelhead", body$Species)
What if we want to replace “Dolly” with “Dolly varden”? Try it out.
I’m guessing that you used something like:
body$Species <- gsub("Dolly", "Dolly varden", body$Species)
Take another look at the data and let’s see if we’ve cleaned up the species names:
unique(body$Species)
## [1] "Coho" "Chum"
## [3] "Dolly varden varden" "Dolly varden"
## [5] "Pink" "coastrange sculpin"
## [7] "Steelhead" "Three spine stickleback"
summary(as.factor(body$Species))
## Chum coastrange sculpin Coho
## 10 58 189
## Dolly varden Dolly varden varden Pink
## 1 98 15
## Steelhead Three spine stickleback
## 8 34
Do you notice any problems?
What you should notice is that we replaced ALL instances of “Dolly” with “Dolly varden”, so what was previously “Dolly varden” is now “Dolly varden varden”. What we should have done was the following:
body$Species <- gsub("^Dolly$", "Dolly varden", body$Species)
In the above, the ^ indicates the start of a string and the $ indicates the end of a string, so we only replace Dolly when the D is the start of the string and the y is the end. We could go back to the start, read the data back in from scratch and run the above line, but let’s fix “Dolly varden varden” in the existing object now.
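To see the effect of the anchors in isolation, here is a sketch on a made-up two-element vector (x is just an illustrative name):

```r
x <- c("Dolly", "Dolly varden")  # made-up values

unanchored <- gsub("Dolly", "Dolly varden", x)
unanchored  # "Dolly varden" "Dolly varden varden"

anchored <- gsub("^Dolly$", "Dolly varden", x)
anchored    # "Dolly varden" "Dolly varden"

grepl("^Dolly$", x)  # TRUE FALSE: only the exact string "Dolly" matches
```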
Try out turning “Dolly varden varden” back into “Dolly varden”.
There are a few options for doing this:
body$Species <- gsub("^Dolly varden varden$", "Dolly varden", body$Species)
# OR
body$Species <- gsub("varden varden", "varden", body$Species)
# OR
body$Species[body$Species=="Dolly varden varden"] <- "Dolly varden"
If we take a look at the data again, we should see that these errors have been corrected:
unique(body$Species)
## [1] "Coho" "Chum"
## [3] "Dolly varden" "Pink"
## [5] "coastrange sculpin" "Steelhead"
## [7] "Three spine stickleback"
summary(as.factor(body$Species))
## Chum coastrange sculpin Coho
## 10 58 189
## Dolly varden Pink Steelhead
## 99 15 8
## Three spine stickleback
## 34
We should also know how many species we sampled, and we can check how many are in this dataset:
length(unique(body$Species))
## [1] 7
Things look pretty good now. Note that misspellings like these are relatively easy to catch, but incorrect numerical values can be much harder. Those errors will typically require plotting of the data to identify obviously incorrect values, which we’ll cover later.
Before we continue on, we’d like to have all of our data in a single object. This is simpler to keep track of and also allows us to apply filters and manipulations to the entire dataset at once, rather than needing to modify each object individually.
When merging, datasets may not include the same exact samples or samples may be in different orders, so we can’t just stick the columns all together.
Take another look at the dimensions of our two data objects like we did at the start.
You’ll notice that they have different numbers of rows, indicating that at least one sample is in one set but not the other.
We can check for Fish.code elements that are in the body size data but not the isotope data:
which(!body$Fish.code %in% iso$Fish.code)
## [1] 132 277 364 406
The %in% operator checks for occurrences of the preceding object in the following object, and returns a vector of TRUE/FALSE. The ! at the beginning reverses TRUE/FALSE, so that TRUE instead corresponds to elements of body$Fish.code that are NOT in iso$Fish.code, and the which() gives us the numeric indices of the elements of the TRUE/FALSE vector that are true. The result is that the numbers this spits out are the indices of body$Fish.code that are NOT in iso$Fish.code.
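Each piece of that expression can be tried on its own. A sketch with two short made-up vectors of codes (a and b are illustrative, not the fish data); setdiff() is a base R shortcut added here for comparison, not something the steps above rely on:

```r
a <- c("C01", "C02", "C03", "C04")  # made-up codes
b <- c("C02", "C04", "C05")

a %in% b          # FALSE TRUE FALSE TRUE
!a %in% b         # TRUE FALSE TRUE FALSE
which(!a %in% b)  # 1 3
setdiff(a, b)     # "C01" "C03": values of a that are not in b
```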
We can use this as an index to get the actual values of body$Fish.code that are not shared by iso$Fish.code:
body$Fish.code[which(!body$Fish.code %in% iso$Fish.code)]
## [1] "C39" "D78" "S51" "ST27"
We can run the same check in reverse order to see values of iso$Fish.code not in body$Fish.code:
iso$Fish.code[which(!iso$Fish.code %in% body$Fish.code)]
## [1] "C10" "C180" "C46" "C67" "SH08"
We can see that we have a total of 9 samples that are present in one of the datasets, but not the other. We could use some fancy R indexing and the match() function to subset both of the datasets down to only these shared samples in the same row order, but this is somewhat tedious to do.
Instead, there is an R function we can use that will do all of that for us, merge():
all_data <- merge(body, iso)
head(all_data)
## Fish.code Species Site Date Weight..g. Fork.length..cm. del13C Std..error
## 1 C01 Coho RK17 11/10/92 13.2 10.2 -23.08 0.00
## 2 C02 Coho RK17 11/10/92 5.8 7.9 -22.82 0.11
## 3 C03 Coho RT02 11/18/92 8.6 8.9 -22.44 0.11
## 4 C04 Coho RT02 11/18/92 11.8 9.8 -21.69 0.09
## 5 C05 Coho RT02 11/18/92 5.0 7.7 -27.27 0.00
## 6 C06 Coho RT02 11/18/92 5.3 8.1 -23.56 0.00
## del15N Std..error.1
## 1 13.56 0.04
## 2 13.04 0.16
## 3 14.08 0.13
## 4 14.01 0.11
## 5 9.20 0.03
## 6 12.84 0.00
dim(all_data)
## [1] 409 10
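For comparison, here is a sketch of both routes on two made-up tables (size and isot are illustrative, not the workshop files). It first lines rows up by hand with %in% and match(), then uses merge() with the key column named explicitly through by; all = TRUE is shown only to illustrate how to keep unmatched samples, which is not what we did above.

```r
# Made-up toy tables sharing a Fish.code column
size <- data.frame(Fish.code = c("A1", "A2", "A3"), Weight = c(1.0, 2.5, 3.1))
isot <- data.frame(Fish.code = c("A3", "A1", "A4"), del15N = c(9.2, 13.5, 12.8))

# Manual route: keep shared codes, then use match() to line up the rows
shared <- size[size$Fish.code %in% isot$Fish.code, ]
shared$del15N <- isot$del15N[match(shared$Fish.code, isot$Fish.code)]
shared

# merge() route, naming the key column explicitly
inner <- merge(size, isot, by = "Fish.code")              # shared codes only (default)
outer <- merge(size, isot, by = "Fish.code", all = TRUE)  # keep all codes, gaps become NA
inner
outer
```

Naming by explicitly is optional here, since merge() defaults to the column names the two data frames have in common, but it makes the intent easier to read later.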
Knowing how R objects are structured and how to extract specific elements from objects using brackets and $ is useful, but there are functions that will simplify most common data manipulations, and we’ll explore these shortly.
Before we move on, let’s write our merged all_data object to a csv file:
write.csv(all_data, "fish_data_merged.csv")
Then we can easily read this cleaned and merged data into R or another program anytime we want without having to repeat these steps.
.csv or “comma-separated values” is a very common file format for data. It is easily computer-readable because it contains no formatting, only values, with columns separated by commas.
Also note that by writing a new file from R, we can read in the raw data, edit/filter it as we like, and then write the output to a new file with no risk of accidentally overwriting or editing the raw data. If all of your R commands are saved in a script, then you will end up with your untouched raw data, the manipulated data, and a full record of the manipulations.
Read that .csv file back into R just to demonstrate that we have successfully written out the data.
all_data <- read.csv("fish_data_merged.csv")
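One detail worth knowing about this round trip: by default write.csv() also writes the row names, which come back as an extra column named X when the file is read in again. A sketch on made-up data, written to a temporary file so nothing real is overwritten (toy and tmp are illustrative names):

```r
toy <- data.frame(Fish.code = c("A1", "A2"), Weight = c(1.0, 2.5))  # made-up data
tmp <- tempfile(fileext = ".csv")  # temporary file, removed when R exits

write.csv(toy, tmp)                # default: row names written as a first column
with_rn <- names(read.csv(tmp))
with_rn                            # "X" "Fish.code" "Weight"

write.csv(toy, tmp, row.names = FALSE)
without_rn <- names(read.csv(tmp))
without_rn                         # "Fish.code" "Weight"
```

The extra column is harmless for the plots that follow, but row.names = FALSE is an option if you would rather not carry it around.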
One of the best things about R is its ability to make pretty much any type of plot you’d like. This is useful for exploring your data and also for making high-quality figures that you can put into your publications. I have made nearly all of the figures in my published papers in R.
You can do a lot of plotting with R’s base functions, e.g., using the plot() function, but you get much better control and flexibility using the ggplot2 package. If you get familiar with the general style of ggplot, you’ll pretty quickly start to notice that many figures in research papers are made with this package.
ggplot2 is part of the Tidyverse set of packages, so we can install and then load it using:
install.packages("tidyverse")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Before we actually get started, there’s an important issue to point out in this loading message. The conflicts show that there are functions in dplyr and the base R stats package that share the same function names. Whenever this happens, the function from the most recently loaded package will mask the other function. If you load dplyr last and then run filter(), what you’ll get is the function from dplyr. Alternatively, if you load stats last, you’ll get the filter() function from that package.
This is important to keep track of, especially if you write a script and later edit it to load another package at the top. You can always call a function from a specific package by using the notation package::function(). The double colons tell R to explicitly use a function from the stated package.
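As a small illustration of the double-colon notation, the sketch below calls both versions of filter() explicitly. The data are made up, and the dplyr part only runs if dplyr is installed:

```r
# stats ships with R, so this runs without installing anything:
# a moving average over windows of three values
ma <- stats::filter(1:5, rep(1/3, 3))
print(ma)  # NA at both ends, where no full window is available

# dplyr::filter() picks rows of a data frame instead
if (requireNamespace("dplyr", quietly = TRUE)) {
  toy <- data.frame(x = 1:5)  # made-up data
  print(dplyr::filter(toy, x > 3))
}
```

Written this way, the result does not depend on which package was loaded last.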
You can also specifically install and/or load ggplot2 by itself using the following if you uncomment it. I’ve commented it here because if you try to install ggplot2 on its own right now, it will restart R and we’ll need to re-run everything we’ve done so far.
# install.packages("ggplot2")
# library(ggplot2)
ggplot makes some very nice figures, but it has a unique syntax that can take a little while to learn. We’ll make a quick plot, then explain what’s going on here:
ggplot(data = all_data) +
  geom_point(mapping = aes(x = Fork.length..cm., y = Weight..g.))
All calls to ggplot() are composed of at least two pieces. The first simply specifies the object that contains the data. All of the data for the plot should be in a single object, with different variables in different columns and each row specifying a single observation (this is part of the general “Tidy” data philosophy).
The basic ggplot() call without adding in a geom function will just render a blank plot, since you haven’t told it what kind of plot to make with the data. This is unlike the base R plot() function mentioned above, which will try to guess what type of plot you want based on the nature of the data.
ggplot(data = all_data) # this makes an empty plot
In the full call:
ggplot(data = all_data) +
  geom_point(mapping = aes(x = Fork.length..cm., y = Weight..g.))
The geom_point() function tells ggplot to plot out the data as points. Within this function, the mapping argument specifies how the data are mapped to the visualization. The mapping is specified using the aes() (aesthetic) function. Here we specify only which variable is x and which is y, but there are other things we can specify as well.
geom functions can be combined into a single plot, and the mapping argument can be specified independently in each, or can be specified globally within the ggplot() function call.
We’ve so far plotted out two variables, but we can add information about additional variables by passing additional arguments to aes(). For example, we can change the shape of the points by species:
ggplot(data = all_data) +
  geom_point(mapping = aes(x = Fork.length..cm., y = Weight..g., shape = Species))
## Warning: The shape palette can deal with a maximum of 6 discrete values because more
## than 6 becomes difficult to discriminate
## ℹ you have requested 7 values. Consider specifying shapes manually if you need
## that many have them.
## Warning: Removed 33 rows containing missing values (`geom_point()`).
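The first warning suggests specifying shapes manually. Here is a minimal sketch of that, using made-up data with seven groups rather than the fish data. scale_shape_manual() is a ggplot2 function; the shape codes 0 to 6 are an arbitrary choice for illustration.

```r
library(ggplot2)

# Made-up data with seven groups, one more than the default shape palette handles
toy <- data.frame(x = 1:7, y = (1:7)^2, grp = letters[1:7])

p <- ggplot(data = toy) +
  geom_point(mapping = aes(x = x, y = y, shape = grp)) +
  scale_shape_manual(values = c(0, 1, 2, 3, 4, 5, 6))  # one shape code per group
p
```

Even with enough shapes supplied, seven different symbols are hard to tell apart, which is the point the warning is making.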
Instead we can change the color by species:
ggplot(data = all_data) +
  geom_point(mapping = aes(x = Fork.length..cm., y = Weight..g., color = Species))
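As noted earlier, the mapping can be given once in ggplot() and shared by several geoms, or repeated inside each geom. A sketch of both forms on made-up data (toy, len, and wt are illustrative names); the two plots should look the same:

```r
library(ggplot2)

# Made-up lengths and weights
toy <- data.frame(len = c(5, 7, 9, 11, 13), wt = c(1.2, 3.1, 6.8, 12.5, 20.3))

# Mapping given once in ggplot() and inherited by both geoms
p_global <- ggplot(data = toy, mapping = aes(x = len, y = wt)) +
  geom_point() +
  geom_line()

# Same plot with the mapping repeated inside each geom
p_local <- ggplot(data = toy) +
  geom_point(mapping = aes(x = len, y = wt)) +
  geom_line(mapping = aes(x = len, y = wt))

p_global
p_local
```

The global form saves typing when every layer uses the same variables; the per-geom form is handy when layers need different ones.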
We can also scale the size of the points by some variable using the size argument in aes(). Try modifying the above plot so that the points are sized by del15N.
ggplot(data = all_data) +
  geom_point(mapping = aes(
    x = Fork.length..cm.,
    y = Weight..g.,
    color = Species,
    size = del15N))
This is a very busy plot at this point, and for me, all points are too large to be interpretable, so let’s add a function that controls the range of point sizes:
ggplot(data = all_data) +
  geom_point(mapping = aes(
    x = Fork.length..cm.,
    y = Weight..g.,
    color = Species,
    size = del15N)) +
  scale_size(range = c(0.1, 2))
This is a little better, but still is clearly not the best way to display these data. Just because you can plot things in a certain way, doesn’t mean you should plot them that way.
If we want to change the color (or size, shape, etc.) of all points, rather than according to a variable, we can do this by pulling that aesthetic outside of the aes() function and setting it manually:
ggplot(data = all_data) +
  geom_point(mapping = aes(x = Fork.length..cm., y = Weight..g.), color = "blue")
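A common slip is to leave the fixed color inside aes(). The sketch below, on made-up data, contrasts the two forms (p_set and p_map are illustrative names):

```r
library(ggplot2)

toy <- data.frame(x = 1:3, y = c(2, 4, 8))  # made-up data

# Setting: color is outside aes(), so every point is drawn in blue
p_set <- ggplot(data = toy) +
  geom_point(mapping = aes(x = x, y = y), color = "blue")

# Mapping: inside aes(), "blue" is treated as a variable with a single value,
# so ggplot picks its own default color and adds a legend labelled "blue"
p_map <- ggplot(data = toy) +
  geom_point(mapping = aes(x = x, y = y, color = "blue"))

p_set
p_map
```

If the points come out in an unexpected color with a one-entry legend, this is usually the cause.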