Exploratory data visualization

So far, we’ve explored how we can manipulate our data in R, including reading in, subsetting, and merging datasets. These are all very important, but looking at data in the ways we have so far doesn’t give us a good intuition for what kind of patterns or distributions there are in the data. Do we have mostly large fish? Mostly small fish? Are the data roughly normally distributed? If you have far better mathematical intuition than I do, you might be able to guess at this from looking at a dataframe in the R console, but I sure can’t. Plotting the data gives us a much better idea of what’s going on.

Here, we’ll explore some of the basic plotting functionality in R (i.e., base R graphics). There are much more powerful ways to generate fancier, prettier, publication-quality plots, but base R functionality can provide quick and easy insight into your data.

We’ll start by reading back in the data that we combined previously.

setwd("~/r4grads")
data_dir <- "~/r4grads/Fish_data/Modified/"
# data_dir already ends in "/", so no extra separator is needed
fish_dat <- read.csv(paste0(data_dir, "fish_data_merged.csv"))

Relationships among continuous variables

When collecting data, we likely have some hypotheses that we are testing, or at least general ideas about what kinds of patterns we expect in the data. In many cases, we may hypothesize that two variables are correlated. We can visualize the association between two variables using a scatterplot.

Let’s quickly remind ourselves of the variables that are in our fish dataset:

colnames(fish_dat)
##  [1] "X"                "Fish.code"        "Species"          "Site"            
##  [5] "Date"             "Weight..g."       "Fork.length..cm." "del13C"          
##  [9] "Std..error"       "del15N"           "Std..error.1"

We would expect a relationship between weight and fork length, since longer fish are probably heavier fish. Let’s see what it looks like:

plot(fish_dat$Fork.length..cm., fish_dat$Weight..g.)

Cool, looks like there’s a relationship here. Try plotting a few other pairs of variables and see what kinds of relationships you can identify.
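If you want to look at several pairs of variables at once, one option is a scatterplot matrix. The sketch below uses base R's `pairs()` and `cor()` on a small simulated data frame called `toy_fish`. That data frame and its values are made up for illustration and only borrow the column names from `fish_dat`. To try this on the real data, swap in `fish_dat`.

```r
# Toy stand-in for fish_dat: simulated values, NOT the real fish data.
set.seed(1)
n <- 50
toy_fish <- data.frame(
  Fork.length..cm. = runif(n, min = 10, max = 60)
)
# Weight grows roughly with the cube of length, with multiplicative noise.
toy_fish$Weight..g. <- 0.01 * toy_fish$Fork.length..cm.^3 * exp(rnorm(n, sd = 0.2))
toy_fish$del13C <- rnorm(n, mean = -25, sd = 2)
toy_fish$del15N <- rnorm(n, mean = 10, sd = 1.5)

# Scatterplot matrix: one panel for every pair of the selected columns.
num_cols <- c("Fork.length..cm.", "Weight..g.", "del13C", "del15N")
pairs(toy_fish[, num_cols], pch = 16, cex = 0.5, col = "blue")

# Pairwise Pearson correlations; "pairwise.complete.obs" skips missing values.
cor_mat <- cor(toy_fish[, num_cols], use = "pairwise.complete.obs")
round(cor_mat, 2)
```

The correlation matrix puts a number on what each panel shows, but Pearson correlation only measures linear association, so it is still worth looking at the plots themselves.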

Once you’ve done that, let’s clean this plot up a bit, because it’s pretty hideous right now.

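# Filled points (pch = 16), no box around the plot, smaller blue markers,
# and readable axis labels
plot(fish_dat$Fork.length..cm., fish_dat$Weight..g.,
     pch = 16, frame.plot = FALSE, col = "blue", cex = 0.5,
     ylab = "Weight (g)", xlab = "Fork length (cm)")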

There are many other graphical parameters you can tweak as you like. Most of them are documented on the help page for par (run ?par in the console), and searching for things like “base R axis labels” will turn up plenty of worked examples.
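As a rough illustration of a few more of these parameters, the sketch below adds a title, sets the axis range, turns the tick labels horizontal, and colors points by species with a legend. The data frame `toy_fish`, the two species names, and the colors are all invented for this example. With the real data you would use `fish_dat` and one color per species actually present.

```r
# Toy stand-in for fish_dat: simulated values, NOT the real fish data.
set.seed(2)
n <- 60
toy_fish <- data.frame(
  Species = factor(sample(c("Species A", "Species B"), n, replace = TRUE),
                   levels = c("Species A", "Species B")),
  Fork.length..cm. = runif(n, min = 10, max = 60)
)
toy_fish$Weight..g. <- 0.01 * toy_fish$Fork.length..cm.^3 * exp(rnorm(n, sd = 0.2))

# One color per species; indexing by the factor uses its integer codes,
# so level 1 gets the first color and level 2 the second.
palette_cols <- c("blue", "darkorange")
species_cols <- palette_cols[toy_fish$Species]

plot(toy_fish$Fork.length..cm., toy_fish$Weight..g.,
     pch = 16, cex = 0.7, col = species_cols,
     frame.plot = FALSE,
     las = 1,                          # horizontal axis tick labels
     xlim = c(0, 70),                  # explicit x-axis range
     main = "Weight vs. fork length",  # plot title
     xlab = "Fork length (cm)", ylab = "Weight (g)")

# Legend entries follow the factor levels, matching the color order above.
legend("topleft", legend = levels(toy_fish$Species),
       col = palette_cols, pch = 16, bty = "n")
```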

This plot gives us a pretty good idea that fork length and weight are correlated, but the relationship looks distinctly nonlinear. This isn’t surprising, since larger fish don’t just get longer, but also wider and taller, so we’d expect mass to increase faster than linearly with length (roughly with the cube of length if a fish keeps the same shape as it grows). This is a good example of when we might want to transform our data. Let’s log-transform weight and then plot it against fork length.

# log() computes the natural logarithm by default
log_weight <- log(fish_dat$Weight..g.)
plot(fish_dat$Fork.length..cm., log_weight,
     pch = 16, frame.plot = FALSE, col = "blue", cex = 0.5,
     ylab = "Logged weight (g)", xlab = "Fork length (cm)")
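Logging only weight would straighten out an exponential relationship. If weight instead scales as a power of length, as the wider-and-taller argument suggests, logging both variables may give a straighter line, because W = a * L^b becomes log(W) = log(a) + b * log(L). The sketch below tries this on simulated data. The vectors `toy_length` and `toy_weight` are made up, with an exponent of 3 built in on purpose, so the fitted slope says nothing about the real fish.

```r
# Simulated lengths and weights: NOT the real fish data.
set.seed(3)
n <- 80
toy_length <- runif(n, min = 10, max = 60)
# Power-law weights (exponent 3) with multiplicative noise.
toy_weight <- 0.01 * toy_length^3 * exp(rnorm(n, sd = 0.15))

# Log-transform both variables.
toy_log_length <- log(toy_length)
toy_log_weight <- log(toy_weight)

plot(toy_log_length, toy_log_weight,
     pch = 16, frame.plot = FALSE, col = "blue", cex = 0.5,
     xlab = "Log fork length (cm)", ylab = "Log weight (g)")

# Fit a straight line on the log-log scale and draw it on the plot.
# The slope estimates the power-law exponent b.
fit <- lm(toy_log_weight ~ toy_log_length)
abline(fit, col = "red", lwd = 2)
coef(fit)
```

On the real data you could compare the two versions (log weight against length, and log weight against log length) and see which one looks closer to a straight line.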