We will explore some basic data manipulation, including reading, filtering, and transforming. This will all be taking place in R. I highly recommend writing all of your code in an R script and commenting it liberally. Even for relatively simple operations, if you need to repeat them or slightly tweak them, it’s much easier to simply open up your script and re-run it or edit it than to re-type everything into R. Even more importantly, it ensures that there is a record of exactly what you did. If you haven’t already dealt with trying to remember how you handled data when you come back it to after a while, you’ll probably be surprised by just how quickly you can forget what you were doing and why.

Reading in the data

Start by setting out working directory.

setwd("~/r4grads")

We will also create an object with the path to our data directory.

data_dir <- "~/r4grads/Fish_data/Modified/"

Then read in the data. Here, we have the data in two separate csv files for different data pertaining to some fish. One file is body size data with some information about sampling and the species, and the other file contains stable isotope data. Note that we’re using paste0() inside read.csv() to paste together the paths and file names.

body <- read.csv(paste0(data_dir, "/", "Fish_body_size.csv"))
iso <- read.csv(paste0(data_dir, "/","Fish_isotopes.csv"))

Let’s take a quick look at the top few rows of each dataset.

head(body)
##   Fish.code Species Site     Date Weight..g. Fork.length..cm.
## 1       C01    Coho RK17 11/10/92       13.2             10.2
## 2       C02    Coho RK17 11/10/92        5.8              7.9
## 3       C03    Coho RT02 11/18/92        8.6              8.9
## 4       C04    Coho RT02 11/18/92       11.8              9.8
## 5       C05    Coho RT02 11/18/92        5.0              7.7
## 6       C06    Coho RT02 11/18/92        5.3              8.1
head(iso)
##   Fish.code del13C Std..error del15N Std..error.1
## 1       C01 -23.08       0.00  13.56         0.04
## 2       C02 -22.82       0.11  13.04         0.16
## 3       C03 -22.44       0.11  14.08         0.13
## 4       C04 -21.69       0.09  14.01         0.11
## 5       C05 -27.27       0.00   9.20         0.03
## 6       C06 -23.56       0.00  12.84         0.00

We can also look at the bottom few rows.

tail(body)
##     Fish.code                 Species   Site     Date Weight..g.
## 408      ST29 Three spine stickleback   RT02  4/19/93        1.0
## 409      ST30 Three spine stickleback BP02-R  4/19/93        0.3
## 410      ST31 Three spine stickleback   BP04 10/23/93        1.4
## 411      ST32 Three spine stickleback   BP04 10/23/93        0.8
## 412      ST33 Three spine stickleback   RK09 10/22/93        0.9
## 413      ST34 Three spine stickleback   RK09 10/25/93        1.8
##     Fork.length..cm.
## 408              5.1
## 409              3.2
## 410              5.4
## 411              4.4
## 412              4.6
## 413              5.4

See how many rows and columns are in each dataframe:

dim(body)
## [1] 413   6
dim(iso)
## [1] 414   5

Let’s take a look at what species are included and how many samples we have of each species.

unique(body$Species)
##  [1] "Coho"                    "COHO"                   
##  [3] "Chum"                    "Dolly varden"           
##  [5] "Dolly"                   "Pink"                   
##  [7] "coastrange sculpin"      "Steelhead"              
##  [9] "Steelhesd"               "Three spine stickleback"
summary(as.factor(body$Species))
##                    Chum      coastrange sculpin                    Coho 
##                      10                      58                     183 
##                    COHO                   Dolly            Dolly varden 
##                       6                       1                      98 
##                    Pink               Steelhead               Steelhesd 
##                      15                       7                       1 
## Three spine stickleback 
##                      34

Fixing spelling

Looking at either of those last outputs, you should notice that we have some misspellings. In some cases, “Coho” was written in all capital letters, and because R is case sensitive (as are most other coding languages), these are interpreted as different species. We also have “Dolly Varden” abbreviated down to just “Dolly” in one case, and a misspelling of “Steelhead” as “Steelhesd”. We will want to correct these before we move forward with any further data processing.

There are a few ways to do this. One is by using an indexing approach to identify all of the elements of the objects that contain the values we want to replace, and replacing them with the values we want.

Let’s build this out:

We’ll start by identifying which elements of the “Species” column of body contains "COHO"

body$Species=="COHO"

You should see a long list of TRUE/FALSE values corresponding to whether each element is (TRUE) or is not (FALSE) “COHO”. We can then use this to select out only the TRUE elements of body$Species:

body$Species[body$Species=="COHO"]

And finally, using that indexing to identify the incorrect entries, we can replace them with “Coho”:

body$Species[body$Species=="COHO"] <- "Coho"

But there are also much faster ways to do this. The function gsub() will search for all occurrences of its first argument and replace them with its second argument in the object specified in the third argument:

body$Species <- gsub("^Dolly$", "Dolly varden", body$Species)
body$Species <- gsub("Steelhesd", "Steelhead", body$Species)

In the above, the ^ indicates the start of a string and the $ indicates the end of string of a string, indicating that only want to replace Dolly when the D is the start of a string and the y is the end – why is this necessary? What happens if we just replace Dolly?

If we take a look at the data again, we should see that these errors have been corrected:

unique(body$Species)
## [1] "Coho"                    "Chum"                   
## [3] "Dolly varden"            "Pink"                   
## [5] "coastrange sculpin"      "Steelhead"              
## [7] "Three spine stickleback"
summary(as.factor(body$Species))
##                    Chum      coastrange sculpin                    Coho 
##                      10                      58                     189 
##            Dolly varden                    Pink               Steelhead 
##                      99                      15                       8 
## Three spine stickleback 
##                      34

We should also know how many species we sampled, and we can check how many are in this dataset:

length(unique(body$Species))
## [1] 7

We also have a similar error in body$Site that we’ll fix real quick:

body$Site <- gsub("RT02-R", "RT02R", body$Site)
sort(unique(body$Site))
##  [1] "?"       "BA"      "BP01BP"  "BP01R"   "BP02"    "BP02-R"  "BP04"   
##  [8] "RK02"    "RK03"    "RK04"    "RK05"    "RK06"    "RK07"    "RK09"   
## [15] "RK11"    "RK13"    "RK14"    "RK16"    "RK17"    "RK32"    "RP02"   
## [22] "RP03BP"  "RP06"    "RP06BP"  "RP06R"   "RP08"    "RT02"    "RT02-BP"
## [29] "RT02BP"  "RT02R"   "RT03"    "RT12"    "RT16"    "RT19"    "RW04"

These look pretty good now. Note that misspellings like these are relatively easy to catch, but incorrect numerical values can be much harder. Those errors will typically require plotting of the data to identify obviously incorrect values, which we’ll cover later on.

Merging the data

Before we continue on, we’d like to have all of our data in a single object. This is simpler to keep track of and also allows us to apply filters and manipulations to the entire dataset at once, rather than needing to modify each object individually.

When merging, datasets may not include the same exact samples or samples may be in different orders, so we can’t just stick the columns all together. If we look at the dimensions of our two dataframes again, we’ll notice that they have different numbers of rows, indicating that at least one sample is in one set but not the other.

We can check for Fish.code elements that are in the body size data but not the isotope data:

which(!body$Fish.code %in% iso$Fish.code)
## [1] 132 277 364 406

the %in% operator checks for occurrences of the preceding object in the following object, and returns a vector of TRUE/FALSE. The ! at the beginning reverses TRUE/FALSE, so that TRUE instead corresponds to elements of body$Fish.code that are NOT in iso$Fish.code, and the which() gives us the numeric indices of the elements of the TRUE/FALSE vector that are true. The result is that the numbers this spits out are the indices of body$Fish.code that are NOT in iso$Fish.code.

We can use this as an index to get the actual values of body$Fish.code that are not shared by iso$Fish.code:

body$Fish.code[which(!body$Fish.code %in% iso$Fish.code)]
## [1] "C39"  "D78"  "S51"  "ST27"

and we can run the same check in reverse order to see values of iso$Fish.code not in body$Fish.code:

iso$Fish.code[which(!iso$Fish.code %in% body$Fish.code)]
## [1] "C10"  "C180" "C46"  "C67"  "SH08"

We can see that we have a total of 9 samples that are present in one of the datasets, but not the other. We can identify which fish are in both datasets:

in_both <- intersect(iso$Fish.code, body$Fish.code)
in_both

Then we can use that to subset both of the datasets down to only these sahred samples, both in the same row order:

iso_red <- iso[iso$Fish.code %in% in_both, ]
body_red <- body[body$Fish.code %in% in_both, ]
dim(iso_red)
## [1] 409   5
dim(body_red)
## [1] 409   6

Both of these objects now have the same number of rows, and we know that they are in the same row order because we indexed them both from the same object in the same way.

Note that when indexing a dataframe or other objects that have both rows and columns, we index using a [row,column] format. If we are selecting all columns, we can simply leave the portion after the comma blank, but we must still include the comma. The reverse is true for selecting all rows and certain columns.

We can now simply bind these columns together:

all_data1 <- cbind(body_red, iso_red)
head(all_data1)
##   Fish.code Species Site     Date Weight..g. Fork.length..cm. Fish.code del13C
## 1       C01    Coho RK17 11/10/92       13.2             10.2       C01 -23.08
## 2       C02    Coho RK17 11/10/92        5.8              7.9       C02 -22.82
## 3       C03    Coho RT02 11/18/92        8.6              8.9       C03 -22.44
## 4       C04    Coho RT02 11/18/92       11.8              9.8       C04 -21.69
## 5       C05    Coho RT02 11/18/92        5.0              7.7       C05 -27.27
## 6       C06    Coho RT02 11/18/92        5.3              8.1       C06 -23.56
##   Std..error del15N Std..error.1
## 1       0.00  13.56         0.04
## 2       0.11  13.04         0.16
## 3       0.11  14.08         0.13
## 4       0.09  14.01         0.11
## 5       0.00   9.20         0.03
## 6       0.00  12.84         0.00

This leaves us with two “Fish.code” columns, however, so we can can use the numeric index of the second of these to drop out the column that we don’t need:

all_data2 <- all_data1[,-7]
head(all_data2)
##   Fish.code Species Site     Date Weight..g. Fork.length..cm. del13C Std..error
## 1       C01    Coho RK17 11/10/92       13.2             10.2 -23.08       0.00
## 2       C02    Coho RK17 11/10/92        5.8              7.9 -22.82       0.11
## 3       C03    Coho RT02 11/18/92        8.6              8.9 -22.44       0.11
## 4       C04    Coho RT02 11/18/92       11.8              9.8 -21.69       0.09
## 5       C05    Coho RT02 11/18/92        5.0              7.7 -27.27       0.00
## 6       C06    Coho RT02 11/18/92        5.3              8.1 -23.56       0.00
##   del15N Std..error.1
## 1  13.56         0.04
## 2  13.04         0.16
## 3  14.08         0.13
## 4  14.01         0.11
## 5   9.20         0.03
## 6  12.84         0.00
dim(all_data2)
## [1] 409  10

This looks better. This is a fairly long and tedious way of combining dataframes, though. We can actually achieve all of this with a single function:

all_data <- merge(body, iso)
head(all_data)
##   Fish.code Species Site     Date Weight..g. Fork.length..cm. del13C Std..error
## 1       C01    Coho RK17 11/10/92       13.2             10.2 -23.08       0.00
## 2       C02    Coho RK17 11/10/92        5.8              7.9 -22.82       0.11
## 3       C03    Coho RT02 11/18/92        8.6              8.9 -22.44       0.11
## 4       C04    Coho RT02 11/18/92       11.8              9.8 -21.69       0.09
## 5       C05    Coho RT02 11/18/92        5.0              7.7 -27.27       0.00
## 6       C06    Coho RT02 11/18/92        5.3              8.1 -23.56       0.00
##   del15N Std..error.1
## 1  13.56         0.04
## 2  13.04         0.16
## 3  14.08         0.13
## 4  14.01         0.11
## 5   9.20         0.03
## 6  12.84         0.00
dim(all_data)
## [1] 409  10

This has the same dimensions as the way that did this above using indexing and cbind. Looks good. Before we continue on, let’s write this to a csv file. Then we can easily read this cleaned and merged data into R or another program anytime we want without having to repeat these steps:

write.csv(all_data, paste0(data_dir, "/", "fish_data_merged.csv"))

Further filtering and subsetting examples

Let’s read that csv file back into R, just to demonstrate it.

all_data <- read.csv(paste0(data_dir, "/", "fish_data_merged.csv"))

For some analyses, we might want to analyze a single species of fish at a time. We can do this similarly to how we used row indexing to merge datasets above.

coho_data <- all_data[all_data$Species == "Coho",]
head(coho_data, n=4)
##   X Fish.code Species Site     Date Weight..g. Fork.length..cm. del13C
## 1 1       C01    Coho RK17 11/10/92       13.2             10.2 -23.08
## 2 2       C02    Coho RK17 11/10/92        5.8              7.9 -22.82
## 3 3       C03    Coho RT02 11/18/92        8.6              8.9 -22.44
## 4 4       C04    Coho RT02 11/18/92       11.8              9.8 -21.69
##   Std..error del15N Std..error.1
## 1       0.00  13.56         0.04
## 2       0.11  13.04         0.16
## 3       0.11  14.08         0.13
## 4       0.09  14.01         0.11

We could simultaneously select out a single species and only the Fish.code and Fork.length..cm. columns.

coho_length <- all_data[all_data$Species == "Coho", c("Fish.code", "Fork.length..cm.")]
head(coho_length, n=4)
##   Fish.code Fork.length..cm.
## 1       C01             10.2
## 2       C02              7.9
## 3       C03              8.9
## 4       C04              9.8

We can select two or more species at once in a couple different ways:

cs_data <- all_data[all_data$Species %in% c("Coho", "Steelhead"),]
cs_data

Or using more complex pattern matching:

cs_data1 <- all_data[all_data$Species == "Coho" | all_data$Species == "Steelhead",]
cs_data1

where the pipe | indicates that we want to match the first condition or the second condition.

We can also use & to get only rows that satisfy two or more conditions. E.g., we can find the Coho that are smaller than 10 g:

coho_small <- all_data[all_data$Species == "Coho" & all_data$Weight..g. < 10,]
coho_small

Transforming data

R makes it very easy to transform or otherwise convert data. Operations can be applied across entire vectors or columns of dataframes easily. We can easily log-transform our body size data:

log_weight <- log(all_data$Weight..g.)
data_w_log <- cbind(all_data, log_weight)

Handling NAs (not applicable/empty data)

R treats NA data differently than other types of data. For this reason, we need special tools if we want to remove NA. To demonstrate this, let’s first introduce some NAs into our data. We’ll convert all values of “Fork.length..cm.” less than 5 to NA:

all_data$Fork.length..cm.[all_data$Fork.length..cm. < 5] <- NA
all_data$Fork.length..cm.

If we try to check if values are NA the same way that we’ve evaluated things so far, it won’t work:

all_data$Fork.length..cm.==NA

And in fact, if you’re using RStudio and put the above line in a script, it will flag that line and mousing over the exclamation mark will tell you to use is.na(). Let’s try that instead:

is.na(all_data$Fork.length..cm.)

This looks like what we’d expect, and we can use this to filter out NAs in this row. Note that many functions can include NA values, so completely removing NA data may not always be desirable. It will depend heavily on your specific dataset.

data_noNA <- all_data[!is.na(all_data$Fork.length..cm.),]

Notice the ! in the above line that lets us keep everything that is not NA. Now if we check if we have any NAs in the Fork.length..cm. of this new object, we should not get any:

which(is.na(data_noNA$Fork.length..cm.))
## integer(0)