


In this tutorial, we will explore some basic data manipulation and visualization in R. Occasionally, datasets start out perfectly formatted with no missing data, outliers, etc. More often, you’ll need to do some kind of filtering, subsetting, and/or merging of data files. Plotting your data is helpful not only for publication and presentation, but also for identifying potential issues in your data.


I highly recommend writing all of your code in an R script and commenting it liberally. Even for relatively simple operations, if you need to repeat them or slightly tweak them, it’s much easier to open up your script and re-run or edit it than to re-type everything into R. Even more importantly, it ensures that there is a record of exactly what you did. If you haven’t already dealt with trying to remember how you handled data when you come back to it after a couple of weeks or even just a few days, you’ll probably be surprised by just how quickly you can forget what you were doing and why.

How to open a script is detailed in the intro to R tutorial.




Reading in the data


Before we can work with the data, we need to download it. Go to the GitHub repository for this workshop: https://github.com/wyoibc/r4grads and click on the green Code button, then on the Download ZIP button. (You can alternatively clone the repo if you’re familiar with git.) You can also download the individual files from there if you know how to do that.



Start by setting our working directory. If you set things up exactly as I did, then this path will work for you. If not, then you’ll need to edit this path:

setwd("~/r4grads-master/Fish_data/Modified/")

Again, make sure this points to the correct path for you, not my path.
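If you’re not sure whether the path is right, a quick check like the following can save some confusion. This is only a sketch: data_dir is a name I’m introducing here, and the path is the same example path as above, which you’ll likely need to change.

```r
# Example path from above; replace it with wherever you unzipped the repository
data_dir <- "~/r4grads-master/Fish_data/Modified/"

if (dir.exists(data_dir)) {
  setwd(data_dir)
  # List the csv files R can see in the new working directory
  print(list.files(pattern = "\\.csv$"))
} else {
  # Report where R currently is so you can work out the correct path
  message("Directory not found: ", data_dir,
          "\nCurrent working directory: ", getwd())
}
```

If the csv files we are about to read show up in the printed list, you’re in the right place.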


Then read in the data. Here, we have the data in two separate csv files for different data pertaining to some fish. One file is body size data with some information about sampling and the species, and the other file contains stable isotope data.

body <- read.csv("Fish_body_size.csv")
iso <- read.csv("Fish_isotopes.csv")


Let’s take a quick look at the top few rows of each dataset.

head(body) # Check the top or head of a dataset
##   Fish.code Species Site     Date Weight..g. Fork.length..cm.
## 1       C01    Coho RK17 11/10/92       13.2             10.2
## 2       C02    Coho RK17 11/10/92        5.8              7.9
## 3       C03    Coho RT02 11/18/92        8.6              8.9
## 4       C04    Coho RT02 11/18/92       11.8              9.8
## 5       C05    Coho RT02 11/18/92        5.0              7.7
## 6       C06    Coho RT02 11/18/92        5.3              8.1
head(iso) # anything after '#' is a comment that R does not interpret; comments are your own notes on what you're doing. Use them.
##   Fish.code del13C Std..error del15N Std..error.1
## 1       C01 -23.08       0.00  13.56         0.04
## 2       C02 -22.82       0.11  13.04         0.16
## 3       C03 -22.44       0.11  14.08         0.13
## 4       C04 -21.69       0.09  14.01         0.11
## 5       C05 -27.27       0.00   9.20         0.03
## 6       C06 -23.56       0.00  12.84         0.00


We can also look at the bottom few rows.

tail(body)
##     Fish.code                 Species   Site     Date Weight..g.
## 408      ST29 Three spine stickleback   RT02  4/19/93        1.0
## 409      ST30 Three spine stickleback BP02-R  4/19/93        0.3
## 410      ST31 Three spine stickleback   BP04 10/23/93        1.4
## 411      ST32 Three spine stickleback   BP04 10/23/93        0.8
## 412      ST33 Three spine stickleback   RK09 10/22/93        0.9
## 413      ST34 Three spine stickleback   RK09 10/25/93        1.8
##     Fork.length..cm.
## 408              5.1
## 409              3.2
## 410              5.4
## 411              4.4
## 412              4.6
## 413              5.4


See how many rows and columns are in each dataframe:

dim(body) # get the dimensions of 'body' dataset
## [1] 413   6
dim(iso) # get the dimensions of the 'iso' dataset
## [1] 414   5
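A few other base R functions give a similar quick overview. Here is a small sketch on a made-up data frame (toy and its values are invented for illustration, they are not the workshop data):

```r
# Toy stand-in for a data frame like 'body' (values are made up)
toy <- data.frame(
  Fish.code = c("X01", "X02", "X03"),
  Species = c("Coho", "Coho", "Chum"),
  Weight..g. = c(10.1, 6.3, 8.8)
)

str(toy)      # column names, column types, and the first few values
nrow(toy)     # number of rows only
ncol(toy)     # number of columns only
colnames(toy) # just the column names
```

str() is especially handy because it shows whether each column was read in as text or as numbers.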



Let’s take a look at what species are included and how many samples we have of each species.

unique(body$Species)
##  [1] "Coho"                    "COHO"                   
##  [3] "Chum"                    "Dolly varden"           
##  [5] "Dolly"                   "Pink"                   
##  [7] "coastrange sculpin"      "Steelhead"              
##  [9] "Steelhesd"               "Three spine stickleback"
summary(as.factor(body$Species))
##                    Chum      coastrange sculpin                    Coho 
##                      10                      58                     183 
##                    COHO                   Dolly            Dolly varden 
##                       6                       1                      98 
##                    Pink               Steelhead               Steelhesd 
##                      15                       7                       1 
## Three spine stickleback 
##                      34

What is summary telling us? Do you notice any potential problems with the data from running the previous two commands?

















Fixing spelling

Looking at either of those last outputs, you should notice that we have some misspellings. In some cases, “Coho” was written in all capital letters, and because R is case sensitive (as are most other coding languages), these are interpreted as different species. We also have “Dolly Varden” abbreviated down to just “Dolly” in one case, and a misspelling of “Steelhead” as “Steelhesd”. We will want to correct these before we move forward with any further data processing.
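With only a handful of species these problems are easy to spot by eye, but with many labels it can help to let R flag candidates. The following is a rough sketch on a made-up vector (species, by_lower, and case_variants are names I’m introducing here). It only catches labels that differ in capitalization, not true misspellings like “Steelhesd”.

```r
# Made-up labels for illustration
species <- c("Coho", "COHO", "Chum", "Coho", "Steelhead", "Steelhesd")

# table() counts each distinct label, much like summary(as.factor())
table(species)

# Group the labels by their lower-case form and keep the distinct spellings
by_lower <- lapply(split(species, tolower(species)), unique)

# Keep only the groups with more than one spelling
case_variants <- by_lower[lengths(by_lower) > 1]
case_variants
```

Here only the “coho” group is returned, because “Coho” and “COHO” collapse to the same lower-case label.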


There are a few ways to do this. One is by using an indexing approach to identify all of the elements of the objects that contain the values we want to replace, and replacing them with the values we want.


Let’s build this out:


We’ll start by identifying which elements of the “Species” column of body contain "COHO"

body$Species=="COHO"


You should see a long list of TRUE/FALSE values corresponding to whether each element is (TRUE) or is not (FALSE) “COHO”. We can then use this to select out only the TRUE elements of body$Species:

body$Species[body$Species=="COHO"]



And finally, using that indexing to identify the incorrect entries, we can replace them with “Coho”:

body$Species[body$Species=="COHO"] <- "Coho" # replace all instances of COHO with Coho
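If the three steps above went by quickly, here they are again on a tiny made-up vector where you can see every element (species and is_caps are illustrative names, not objects from the workshop data):

```r
# Made-up labels for illustration
species <- c("Coho", "COHO", "Chum", "COHO")

# Step 1: logical vector, TRUE wherever the label is exactly "COHO"
is_caps <- species == "COHO"
is_caps        # FALSE  TRUE FALSE  TRUE

# Counting the TRUEs tells us how many entries will change
sum(is_caps)   # 2

# Steps 2 and 3: select the TRUE elements and overwrite them
species[is_caps] <- "Coho"
species        # "Coho" "Coho" "Chum" "Coho"
```

Checking sum() before replacing is a cheap way to confirm that you are about to change as many entries as you expect.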




But there are also much faster ways to do this. The function gsub() performs pattern matching and replacement. One of the most essential skills in R is learning how to use new functions. If you already know what function you want to use, you can put ? before the function name to get the built-in help documentation. Try it out:

?gsub()



As is the case for many functions, gsub() has several options that are set to defaults that we won’t worry about. Most of the time, we only really care about the first few options.

From looking at this help menu, how would we replace occurrences of “Steelhesd” with “Steelhead” in body$Species?

























body$Species <- gsub("Steelhesd", "Steelhead", body$Species)

What if we want to replace “Dolly” with “Dolly varden”? Try it out.
















I’m guessing that you used something like:

body$Species <- gsub("Dolly", "Dolly varden", body$Species)

Take another look at the data and let’s see if we’ve cleaned up the species names:

unique(body$Species)
## [1] "Coho"                    "Chum"                   
## [3] "Dolly varden varden"     "Dolly varden"           
## [5] "Pink"                    "coastrange sculpin"     
## [7] "Steelhead"               "Three spine stickleback"
summary(as.factor(body$Species))
##                    Chum      coastrange sculpin                    Coho 
##                      10                      58                     189 
##            Dolly varden     Dolly varden varden                    Pink 
##                       1                      98                      15 
##               Steelhead Three spine stickleback 
##                       8                      34


Do you notice any problems?















What you should notice is that we replaced ALL instances of “Dolly” with “Dolly varden”, so what was previously “Dolly varden” is now “Dolly varden varden”. What we should have done was the following:

body$Species <- gsub("^Dolly$", "Dolly varden", body$Species)


In the above, the ^ indicates the start of a string and the $ indicates the end of a string, so we only replace “Dolly” when the D is the start of the string and the y is the end. We could go back to the start, read the data back in from scratch and run the above line, but let’s fix “Dolly varden varden” in the existing object now.
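To see the effect of the anchors in isolation, here is a small sketch using two made-up labels (the labels vector is invented for illustration):

```r
# Made-up labels: one abbreviated, one already complete
labels <- c("Pink", "Pink salmon")

# Without anchors, "Pink" is matched inside both strings
gsub("Pink", "Pink salmon", labels)
# "Pink salmon" "Pink salmon salmon"

# With ^ and $, only a string that is exactly "Pink" matches
gsub("^Pink$", "Pink salmon", labels)
# "Pink salmon" "Pink salmon"
```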

Try out turning “Dolly varden varden” back into “Dolly varden”.





















There are a few options for doing this:

body$Species <- gsub("^Dolly varden varden$", "Dolly varden", body$Species)
# OR
body$Species <- gsub("varden varden", "varden", body$Species)
# OR
body$Species[body$Species=="Dolly varden varden"] <- "Dolly varden"

If we take a look at the data again, we should see that these errors have been corrected:

unique(body$Species)
## [1] "Coho"                    "Chum"                   
## [3] "Dolly varden"            "Pink"                   
## [5] "coastrange sculpin"      "Steelhead"              
## [7] "Three spine stickleback"
summary(as.factor(body$Species))
##                    Chum      coastrange sculpin                    Coho 
##                      10                      58                     189 
##            Dolly varden                    Pink               Steelhead 
##                      99                      15                       8 
## Three spine stickleback 
##                      34


We should also know how many species we sampled, and we can check how many are in this dataset:

length(unique(body$Species))
## [1] 7


Things look pretty good now. Note that misspellings like these are relatively easy to catch, but incorrect numerical values can be much harder. Those errors will typically require plotting of the data to identify obviously incorrect values, which we’ll cover later.



Merging the data

Before we continue on, we’d like to have all of our data in a single object. This is simpler to keep track of and also allows us to apply filters and manipulations to the entire dataset at once, rather than needing to modify each object individually.

When merging, datasets may not include the same exact samples or samples may be in different orders, so we can’t just stick the columns all together.

Take another look at the dimensions of our two data objects like we did at the start.

You’ll notice that they have different numbers of rows, indicating that at least one sample is in one set but not the other.


We can check for Fish.code elements that are in the body size data but not the isotope data:

which(!body$Fish.code %in% iso$Fish.code)
## [1] 132 277 364 406

The %in% operator checks for occurrences of the preceding object in the following object, and returns a vector of TRUE/FALSE. The ! at the beginning reverses TRUE/FALSE, so that TRUE instead corresponds to elements of body$Fish.code that are NOT in iso$Fish.code, and which() gives us the numeric indices of the elements of the TRUE/FALSE vector that are TRUE. The result is that the numbers this spits out are the indices of body$Fish.code that are NOT in iso$Fish.code.
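Each piece of that expression is easier to follow on two short made-up vectors (a and b are illustrative, not workshop data):

```r
# Made-up sample codes for illustration
a <- c("C01", "C02", "C03", "C04")
b <- c("C02", "C04", "C05")

a %in% b          # FALSE  TRUE FALSE  TRUE : is each element of a found in b?
!a %in% b         #  TRUE FALSE  TRUE FALSE : flipped, TRUE means NOT in b
which(!a %in% b)  # 1 3 : positions in a of the elements missing from b
a[!a %in% b]      # "C01" "C03" : the missing values themselves

# setdiff() returns the same values in a single step
setdiff(a, b)     # "C01" "C03"
```

Note that %in% is evaluated before !, so !a %in% b means “not (a %in% b)”.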

We can use this as an index to get the actual values of body$Fish.code that are not shared by iso$Fish.code:

body$Fish.code[which(!body$Fish.code %in% iso$Fish.code)]
## [1] "C39"  "D78"  "S51"  "ST27"

and we can run the same check in reverse order to see values of iso$Fish.code not in body$Fish.code:

iso$Fish.code[which(!iso$Fish.code %in% body$Fish.code)]
## [1] "C10"  "C180" "C46"  "C67"  "SH08"


We can see that we have a total of 9 samples that are present in one of the datasets, but not the other. We could use some fancy R indexing and the match() function to subset both of the datasets down to only these shared samples in the same row order, but this is somewhat tedious to do.
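For the curious, the manual route might look roughly like this. It is a sketch on two made-up data frames (left, right, and every other name here are invented for illustration), not the code we will actually use:

```r
# Two made-up tables that share some, but not all, ids, in different orders
left  <- data.frame(id = c("A", "B", "C"), weight = c(1.0, 2.0, 3.0))
right <- data.frame(id = c("C", "B", "D"), del15N = c(9.2, 13.0, 14.1))

# ids present in both tables
shared <- intersect(left$id, right$id)

# match() gives the row position of each shared id in each table,
# so both subsets end up in the same row order
left_sub  <- left[match(shared, left$id), ]
right_sub <- right[match(shared, right$id), ]

# Confirm the rows line up before sticking the columns together
stopifnot(identical(left_sub$id, right_sub$id))
combined <- cbind(left_sub, right_sub[, -1, drop = FALSE])
combined
```

It works, but there are several places to slip up, such as forgetting to drop the duplicate id column or skipping the row-order check.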

Instead, there is an R function we can use that will do all of that for us: merge():

all_data <- merge(body, iso)
head(all_data)
##   Fish.code Species Site     Date Weight..g. Fork.length..cm. del13C Std..error
## 1       C01    Coho RK17 11/10/92       13.2             10.2 -23.08       0.00
## 2       C02    Coho RK17 11/10/92        5.8              7.9 -22.82       0.11
## 3       C03    Coho RT02 11/18/92        8.6              8.9 -22.44       0.11
## 4       C04    Coho RT02 11/18/92       11.8              9.8 -21.69       0.09
## 5       C05    Coho RT02 11/18/92        5.0              7.7 -27.27       0.00
## 6       C06    Coho RT02 11/18/92        5.3              8.1 -23.56       0.00
##   del15N Std..error.1
## 1  13.56         0.04
## 2  13.04         0.16
## 3  14.08         0.13
## 4  14.01         0.11
## 5   9.20         0.03
## 6  12.84         0.00
dim(all_data)
## [1] 409  10
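By default, merge() joins on the column names the two data frames have in common and keeps only the rows found in both. The sketch below, on two made-up data frames (left and right are invented for illustration), spells out the joining column with by and shows the all = TRUE option, which keeps unmatched rows and fills the gaps with NA:

```r
# Two made-up tables sharing an 'id' column
left  <- data.frame(id = c("A", "B", "C"), weight = c(1.0, 2.0, 3.0))
right <- data.frame(id = c("C", "B", "D"), del15N = c(9.2, 13.0, 14.1))

# Default behaviour: only ids present in both tables (B and C)
inner <- merge(left, right, by = "id")
inner

# all = TRUE keeps every id from either table; missing values become NA
outer <- merge(left, right, by = "id", all = TRUE)
outer
```

Naming the by column explicitly is a good habit, because it protects you if the two tables happen to share another column name that you did not intend to join on.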

Knowing how R objects are structured and how to extract specific elements from objects using brackets and $ is useful, but there are functions that will simplify most common data manipulations, and we’ll explore these shortly.


Before we move on, let’s write our merged all_data object to a csv file:

write.csv(all_data, "fish_data_merged.csv", row.names = FALSE) # row.names = FALSE keeps R's row numbers from being written as an extra column

Then we can easily read this cleaned and merged data into R or another program anytime we want without having to repeat these steps. .csv or “comma-separated values” is a very common file format for data. It is easily computer-readable because it contains no formatting, only values, with columns separated by commas.

Also note that by writing a new file from R, we can read in the raw data, edit/filter it as we like, and then write the output to a new file with no risk of accidentally overwriting or editing the raw data. If all of your R commands are saved in a script, then you will end up with your untouched raw data, the manipulated data, and a full record of the manipulations.


Read that .csv file back into R just to demonstrate that we have successfully written out the data.

all_data <- read.csv("fish_data_merged.csv")
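One detail worth knowing about this round trip: unless told otherwise, write.csv() also writes the row names, and read.csv() then reads them back in as an extra column named X. The sketch below demonstrates this with a made-up data frame and temporary files (toy, path_default, and path_clean are illustrative names):

```r
# Made-up data for illustration
toy <- data.frame(id = c("A", "B"), value = c(1.5, 2.5))

# Temporary files so that nothing in your working directory is touched
path_default <- tempfile(fileext = ".csv")
path_clean   <- tempfile(fileext = ".csv")

write.csv(toy, path_default)                   # row names are written too
write.csv(toy, path_clean, row.names = FALSE)  # only the data columns

colnames(read.csv(path_default))  # "X" "id" "value"
colnames(read.csv(path_clean))    # "id" "value"
```

It’s worth checking colnames() or dim() after reading a file back in to make sure you got the columns you expected.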



Plotting with ggplot2


One of the best things about R is its ability to make pretty much any type of plot you’d like. This is useful for exploring your data and also for making high-quality figures that you can put into your publications. I have made nearly all of the figures in my published papers in R.

You can do a lot of plotting with R’s base functions, e.g., using the plot() function, but you get much better control and flexibility using the ggplot2 package. If you get familiar with the general style of ggplot, you’ll pretty quickly start to notice that many figures in research papers are made with this package.


ggplot2 is part of the Tidyverse set of packages, so we can install and then load it using:

install.packages("tidyverse")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


Before we actually get started, there’s an important issue to point out in this loading message. The conflicts show that there are functions in dplyr and the base R stats package that share the same function names. Whenever this happens, the function from the most recently loaded package will mask the other function. If you load dplyr last and then run filter(), what you’ll get is the function from dplyr. Alternatively, if you load stats last, you’ll get the filter() function from that package.

This is important to keep track of, especially if you write a script and later edit it to load another package at the top. You can always call a function from a specific package by using the notation package::function(). The double colons tell R to explicitly use a function from the stated package.
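As a small illustration of why this matters, the two filter() functions do completely different things. This sketch assumes dplyr is installed (it comes with the tidyverse installed above); x and df are made-up objects:

```r
# Made-up numbers for illustration
x <- c(1, 2, 3, 4, 5)

# stats::filter() applies a linear filter, here a 3-point moving average
stats::filter(x, rep(1/3, 3))
# NA 2 3 4 NA (the ends have no complete window)

# dplyr::filter() keeps the rows of a data frame that meet a condition
df <- data.frame(x = x)
dplyr::filter(df, x > 3)
# rows where x is 4 or 5
```

Because the package is named explicitly, each call behaves the same no matter which packages are loaded or in what order.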


You can also specifically install and/or load ggplot2 by itself using the following if you uncomment it. I’ve commented it here because if you try to install ggplot2 on its own right now, it will restart R and we’ll need to re-run everything we’ve done so far.

# install.packages("ggplot2")
# library(ggplot2)



ggplot syntax

ggplot makes some very nice figures, but it has a unique syntax that can take a little while to learn. We’ll make a quick plot, then explain what’s going on here:

ggplot(data = all_data) + 
  geom_point(mapping = aes(x = Fork.length..cm., y = Weight..g.))


All calls to ggplot() are composed of at least two pieces. The first simply specifies the object that contains the data. All of the data for the plot should be in a single object, with different variables in different columns and each row specifying a single observation (this is part of the general “Tidy” data philosophy).

The basic ggplot() call without adding in a geom function will just render a blank plot, since you haven’t told it what kind of plot to make with the data - this is unlike the base R plot() function we used above, which will try to guess what type of plot you want based on the nature of the data.

ggplot(data = all_data) # this makes an empty plot


In the full call:

ggplot(data = all_data) + 
  geom_point(mapping = aes(x = Fork.length..cm., y = Weight..g.))

The geom_point() function tells ggplot to plot out the data as points. Within this function, the mapping argument specifies how the data are mapped to the visualization. The mapping is specified using the aes() (aesthetic) function. Here we specify only which variable is x and which is y, but there are other things we can specify as well.

  • Multiple geom functions can be combined into a single plot, and the mapping argument can be specified independently in each, or can be specified globally within the ggplot() function call.
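As a sketch of that last point, here the mapping is given once in ggplot() and shared by two geoms. The toy data frame and its column names are made up for illustration:

```r
library(ggplot2)

# Made-up lengths and weights for illustration
toy <- data.frame(
  length_cm = c(5, 7, 8, 10, 12),
  weight_g = c(1.2, 3.5, 5.1, 10.4, 18.0)
)

# The mapping in ggplot() is inherited by every geom added afterwards
p <- ggplot(data = toy, mapping = aes(x = length_cm, y = weight_g)) +
  geom_point() +  # draws the observations as points
  geom_line()     # connects the same observations with a line
p
```

A global mapping saves repetition when several geoms use the same variables, while a mapping inside a single geom applies to that geom only.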


Aesthetic mappings

We’ve so far plotted out two variables, but we can add information about additional variables by passing additional arguments to aes(). For example, we can change the shape of the points by species:

ggplot(data = all_data) + 
  geom_point(mapping = aes(x = Fork.length..cm., y = Weight..g., shape = Species))
## Warning: The shape palette can deal with a maximum of 6 discrete values because more
## than 6 becomes difficult to discriminate
## ℹ you have requested 7 values. Consider specifying shapes manually if you need
##   that many have them.
## Warning: Removed 33 rows containing missing values (`geom_point()`).

  • Note the warning that we can only assign up to 6 shapes to variables, and see that three spine stickleback has been left out because of this.
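If you really do need more than 6 shapes, the warning’s suggestion of specifying shapes manually can be followed with scale_shape_manual(). This is a sketch on a made-up data frame with 7 groups (toy and its columns are invented for illustration); the values are R’s numeric point shape codes:

```r
library(ggplot2)

# Made-up data with 7 groups, one more than the default shape palette allows
toy <- data.frame(
  x = 1:7,
  y = c(2, 4, 3, 6, 5, 7, 8),
  group = paste0("sp", 1:7)
)

p <- ggplot(data = toy) +
  geom_point(mapping = aes(x = x, y = y, shape = group)) +
  # One shape code per group; codes 0 to 6 are open symbols
  scale_shape_manual(values = c(0, 1, 2, 3, 4, 5, 6))
p
```

Even so, the warning has a point: that many shapes are hard to tell apart.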


Instead we can change the color by species:

ggplot(data = all_data) + 
  geom_point(mapping = aes(x = Fork.length..cm., y = Weight..g., color = Species))


We can also scale the size of the points by some variable using the size argument in aes(). Try modifying the above plot so that the points are sized by del15N.

































ggplot(data = all_data) + 
  geom_point(mapping = aes(
    x = Fork.length..cm., 
    y = Weight..g., 
    color = Species,
    size = del15N))

This is a very busy plot at this point, and for me, all points are too large to be interpretable, so let’s add a function that controls the range of point sizes:

ggplot(data = all_data) + 
  geom_point(mapping = aes(
    x = Fork.length..cm., 
    y = Weight..g., 
    color = Species,
    size = del15N)) +
  scale_size(range = c(0.1, 2))

This is a little better, but it is still clearly not the best way to display these data. Just because you can plot things in a certain way doesn’t mean you should plot them that way.


If we want to change the color (or size, shape, etc.) of all points, rather than according to a variable, we can do this by pulling that aesthetic outside of the aes() function and setting it manually:

ggplot(data = all_data) + 
  geom_point(mapping = aes(x = Fork.length..cm., y = Weight..g.), color = "blue")
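A common slip is to leave the fixed color inside aes(). The sketch below contrasts the two on a made-up data frame (toy, p_set, and p_mapped are illustrative names):

```r
library(ggplot2)

# Made-up data for illustration
toy <- data.frame(x = 1:3, y = c(2, 4, 3))

# Outside aes(): color is set directly, so every point is drawn in blue
p_set <- ggplot(data = toy) +
  geom_point(mapping = aes(x = x, y = y), color = "blue")

# Inside aes(): "blue" is treated as a data value with a single level,
# so ggplot picks a color from its default palette and adds a legend
p_mapped <- ggplot(data = toy) +
  geom_point(mapping = aes(x = x, y = y, color = "blue"))

p_set
p_mapped
```

If your points come out in an unexpected color with a legend entry labelled “blue”, this is usually the reason.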





