Large Language Model chatbots like chatGPT and google Gemini are
increasingly avalible and sophisticated. They can do any number of
tasks, inlcuding help with coding projects and data analysis. Lets
explore a couple ways these tools can be helpful for adding feature to
our plots from last week.
First let’s reload the fish data. Make
sure you set your working directory with setwd() like we did before.
Also lets library tidyverse so we have all those functions as well.
# Make sure you set this to your personal working directory with the fish data!
setwd("~/Downloads/r4grads-master/Fish_data/Modified/")
Now lets reload our tidyverse functions with library() and load the fish data we were using last week.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
all_data <- read.csv("fish_data_merged.csv")
Here’s our violin plots from last week
ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) +
geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
geom_boxplot(width=0.1)+
labs(x = NULL, y = "delta 15 N") + # Get rid of the x label, sometimes the categories are self explanatory
theme_classic()
These look great, but what if we want to turn the x-axis
titles 45 degrees to make them a bit easier to read? If you’re not
immediately sure how to do this, you could google it, but AI chat-bots
might also be helpful.
Open up chatGPT and
login. For most of these chatbots, you’ll need a free account to access
them. You can make a new account, or link your Google, Microsoft or
Apple account to the service and log in that way. If you already have a
Gmail and google account you can use Google Gemini which should give
you the same results. The process is fairly easy, but if you don’t want
to deal with making an account feel free to just follow along for this
part.
Once we’re into the chatbot, lets ask it our question: “How do I
adjust x-axis labels 45 degrees on a ggplot in R?”
Cool! Looks like one way to do this is by
adding to theme(), a function we’ve used before in ggplot. chatGPT and
Gemini typically can identify that you’re askign it questions about
coding, and will give you code answers. We don’t need the entire script
in this case, just the last line:
theme(axis.text.x = element_text(angle = 45, hjust = 1))
and if we copy-paste that into our violin plot function, it should look something like:
ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) +
geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
geom_boxplot(width=0.1)+
labs(x = NULL, y = "delta 15 N") + # Get rid of the x label, sometimes the categories are self explanatory
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme_classic()
That’s weird, it didn’t seem to tilt our x-axis labels. Can
you see what I did wrong?
Maybe
you remember from last week the order matters! But if you didn’t notice
my mistake, or if you encounter other probelms in your coding
adventures, you can also ask these chat bots for help. In this instance,
I’m going to give it my entire block of code and simply ask, “Why
doesn’t this work?” It can be useful to use Shift+Enter here to put
multiple lines of text in the prompt box befor submitting your request.
Of course, we’re trying to
use theme twice, and the theme_classic is overwriting the x labels. Now
we could look at this and manually edit it, or we can just hit the “Copy
code” button and paste that into our R script, now that we’ve given the
chatbot specific about the code we’re trying to write.
ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) +
geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
geom_boxplot(width=0.1)+
labs(x = NULL, y = "delta 15 N") +
theme_classic() + # Apply theme_classic() first
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Then customize axis text
Much better! As you could probably already guess, these chatbots
can only get you so far, and you still need to think for yourself and do
some troubleshooting. They’re still very useful tools though, and can
help teach you about new ways to do things in R, or give you hints as to
why your code isn’t working properly. Lets take a look at another
example.
You now know some of the very basics of R and all the many
wonderful things you can do with it. When starting a project from
scratch however, it can be a bit daunting on where to begin, so lets
think about some solid practices for planning out your very own data
analysis.
When starting a new project I like to say out loud to
myself, or just write down in simple terms what exactly it is I’m trying
to accomplish. This may sound a little silly but can save you lots of
time in the end. Let’s look at an example of manipulating some RNAseq
data.
# Load my data and metadata into R
# Inspect my data to make sure it looks OK
# Filter for genotype and temperature
# Run DEseq2
# Plot a Volcano Plot of the results
Let’s start with the first step, loading my data into R. I already
put this data on my GitHub, so I’ll download it from there. Feel free to
grab my data as well and follow along if you wish.
# Load my data into R
meta_data <- read.csv("https://raw.githubusercontent.com/ohfunk/L1_bulk_22/main/L1_bulk_22_metadata.csv")
gene_counts <- read.csv("https://raw.githubusercontent.com/ohfunk/L1_bulk_22/main/L1_bulk_22_gene_counts_matrix.tsv", sep = "\t", row.names = "Gene")
# Inspect my data to make sure it looks OK
head(meta_data)
## file geno temp sample
## 1 counts_BP175_16C_1_S1 BP175 16C 1
## 2 counts_BP175_16C_2_S2 BP175 16C 2
## 3 counts_BP175_16C_3_S3 BP175 16C 3
## 4 counts_BP175_22C_1_S4 BP175 22C 1
## 5 counts_BP175_22C_2_S5 BP175 22C 2
## 6 counts_BP175_22C_3_S6 BP175 22C 3
# Filter for genotype and temperature
# Run DEseq2
# Plot a Volcano Plot of the results
This looks pretty good, now lets look at the gene_counts:
# Inspect my data to make sure it looks OK
head(gene_counts)
## counts_BP175_16C_1_S1 counts_BP175_16C_2_S2 counts_BP175_16C_3_S3
## aap-1 160 146 112
## aat-1 4 4 0
## aat-2 4 0 2
## aat-3 3 2 1
## aat-4 3 2 0
## aat-5 0 0 0
## counts_BP175_22C_1_S4 counts_BP175_22C_2_S5 counts_BP175_22C_3_S6
## aap-1 63 60 55
## aat-1 4 0 6
## aat-2 41 10 3
## aat-3 1 0 9
## aat-4 0 0 0
## aat-5 0 1 0
## counts_BP175_25C_1_S7 counts_BP175_25C_2_S8 counts_BP175_25C_3_S9
## aap-1 114 84 91
## aat-1 4 2 2
## aat-2 4 8 0
## aat-3 14 7 13
## aat-4 0 0 0
## aat-5 0 0 0
## counts_N2_16C_1_S10 counts_N2_16C_2_S11 counts_N2_16C_3_S12
## aap-1 320 351 357
## aat-1 8 5 0
## aat-2 2 0 0
## aat-3 15 11 4
## aat-4 2 0 0
## aat-5 0 0 0
## counts_N2_22C_1_S13 counts_N2_22C_2_S14 counts_N2_22C_3_S15
## aap-1 208 150 282
## aat-1 2 0 0
## aat-2 2 0 1
## aat-3 1 0 2
## aat-4 0 0 0
## aat-5 0 0 0
## counts_N2_25C_1_S16 counts_N2_25C_2_S17 counts_N2_25C_3_S18
## aap-1 158 173 168
## aat-1 6 0 2
## aat-2 4 6 6
## aat-3 16 12 2
## aat-4 0 0 0
## aat-5 0 0 0
## counts_WH171_16C_1_S19 counts_WH171_16C_2_S20 counts_WH171_16C_3_S21
## aap-1 259 203 200
## aat-1 0 4 1
## aat-2 0 0 2
## aat-3 6 4 5
## aat-4 0 1 2
## aat-5 0 0 0
## counts_WH171_22C_1_S22 counts_WH171_22C_2_S23 counts_WH171_22C_3_S24
## aap-1 180 190 209
## aat-1 0 0 0
## aat-2 16 14 13
## aat-3 4 0 2
## aat-4 0 0 0
## aat-5 0 0 0
## counts_WH171_25C_1_S25 counts_WH171_25C_2_S26 counts_WH171_25C_3_S27
## aap-1 150 149 188
## aat-1 0 2 2
## aat-2 0 6 2
## aat-3 37 34 19
## aat-4 0 0 0
## aat-5 0 0 0
# Filter for genotype and temperature
# Run DEseq2
# Plot a Volcano Plot of the results
Looks pretty good! However there’s a lot of different samples
here. Lets ask chatGPT for help filtering out only the samples with
genotype: N2 or BP175 and temperature: 16C
This time chatGPT suggests using the grep function. This is a
command line tool from Unix but we can use it in R as well. Lets try
grep on our gene_counts object.
# Filter for genotype and temperature
gene_counts_cleaned <- gene_counts[, grep("N2_16C|BP175_16C", names(gene_counts))]
colnames(gene_counts_cleaned)
## [1] "counts_BP175_16C_1_S1" "counts_BP175_16C_2_S2" "counts_BP175_16C_3_S3"
## [4] "counts_N2_16C_1_S10" "counts_N2_16C_2_S11" "counts_N2_16C_3_S12"
# Run DESeq2
# Plot a Volcano Plot of the results
Great, looks like we have just 6 samples now, 3 of N2 and 3 of WH171.
Lets do this for out meta_data too. We could try and use this grep
function again (feel free to try and make it work for rows in meta_data)
but we can also go back to the filter() function we learned before.
# Filter for genotype and temperature
meta_data_cleaned <- filter(meta_data, geno == 'N2' & temp == '16C' | geno =='BP175' & temp =='16C')
meta_data_cleaned
## file geno temp sample
## 1 counts_BP175_16C_1_S1 BP175 16C 1
## 2 counts_BP175_16C_2_S2 BP175 16C 2
## 3 counts_BP175_16C_3_S3 BP175 16C 3
## 4 counts_N2_16C_1_S10 N2 16C 1
## 5 counts_N2_16C_2_S11 N2 16C 2
## 6 counts_N2_16C_3_S12 N2 16C 3
# Run DESeq2
# Plot a Volcano Plot of the results
Now it’s time to run our differential expression. This is a BIG
oversimplification of some complex biostatistics. If you were doing this
on your own you would want to read up on the DESeq2
package and understand the nuances of what you were applying to the
data.
However, using chatbots can give you a great start on getting
there, so lets see if chatGPT can help us with running DESeq2. Our
prompt would probably be something like:
* In R, I have a
meta data file called meta_data_cleaned and a counts matrix called
gene_counts_cleaned. I want to run differential expression on this using
DESeq2
Your results may vary, but I got the below back
from chatGPT:
# Install and load required packages
install.packages("DESeq2")
install.packages("tidyverse")
library(DESeq2)
library(tidyverse)
# Load metadata
meta_data <- read.csv("path/to/meta_data.csv") # Replace "path/to/meta_data.csv" with your file path
# Load counts matrix
gene_counts_cleaned <- read.csv("path/to/gene_counts_cleaned.csv", row.names = 1) # Replace "path/to/gene_counts_cleaned.csv" with your file path
# Check dimensions and structure of your data
dim(meta_data)
dim(gene_counts_cleaned)
head(meta_data)
head(gene_counts_cleaned)
# Create DESeq2 object
dds <- DESeqDataSetFromMatrix(countData = gene_counts_cleaned,
colData = meta_data_cleaned,
design = ~ your_formula_here)
# Replace "your_formula_here" with the formula representing your experimental design.
# For example, if your design has a single factor "condition", the formula would be:
# design = ~ condition
# Run DESeq2 analysis
dds <- DESeq(dds)
# Get differential expression results
res <- results(dds)
# Extract significant results
sig_results <- res[which(res$padj < 0.05), ]
# Check top DE genes
topGenes <- head(sig_results[order(sig_results$padj), ], n = 10)
# Explore or visualize the results further as needed
A pretty good start! Clearly we don’t need to reload the
packages with library, and we already have meta_data and
gene_counts_cleaned as variables we’ve created, but you can see how
chatGPT is trying to give us it’s best guess as to what we’re doing, and
even uses head() and dim() to QC our data much like we planned before.
If I strip out the unnecessary pieces it should looks something like:
library(DESeq2)
# Create DESeq2 object
dds <- DESeqDataSetFromMatrix(countData = gene_counts_cleaned,
colData = meta_data_cleaned,
design = ~ your_formula_here)
# Replace "your_formula_here" with the formula representing your experimental design.
# For example, if your design has a single factor "condition", the formula would be:
# design = ~ condition
# Run DESeq2 analysis
dds <- DESeq(dds)
# Get differential expression results
res <- results(dds)
# Extract significant results
sig_results <- res[which(res$padj < 0.05), ]
# Check top DE genes
topGenes <- head(sig_results[order(sig_results$padj), ], n = 10)
If you read some of the commented code, you can see that it has
a dummy variable “your_formula_here” that we’ll need to replace. Because
we didn’t specify what the condition was we were looking at, we can
remind our selves of what the meta_data table looks like with
View(meta_data) and choose the appropriate column, in this case “geno”
and substitute it in for your_formula_here
suppressMessages(library(DESeq2))
# Create DESeq2 object
dds <- DESeqDataSetFromMatrix(countData = gene_counts_cleaned,
colData = meta_data_cleaned,
design = ~ geno)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
# Run DESeq2 analysis
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
# Get differential expression results
res <- results(dds)
# Extract significant results
sig_results <- res[which(res$padj < 0.05), ]
# Check top DE genes
topGenes <- head(sig_results[order(sig_results$padj), ], n = 10)
Now that DESeq is finished running, lets take a look at our
topGenes object that it created:
topGenes
## log2 fold change (MLE): geno N2 vs BP175
## Wald test p-value: geno N2 vs BP175
## DataFrame with 10 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## T28F4.4 76388.16 1.69646 0.0472487 35.9048 2.56953e-282 1.83156e-278
## stn-1 6426.76 1.70544 0.0506905 33.6442 3.78939e-248 1.35054e-244
## arx-1 9389.14 1.95859 0.0587145 33.3579 5.60367e-244 1.33143e-240
## twk-49 48952.82 1.32816 0.0418461 31.7393 4.46729e-221 7.96071e-218
## prp-40 3237.18 1.83227 0.0590478 31.0304 2.09880e-211 2.99204e-208
## trt-1 3598.79 2.03207 0.0671503 30.2616 3.67666e-201 4.36787e-198
## kqt-1 6545.16 1.93614 0.0649398 29.8143 2.54585e-195 2.59240e-192
## pigo-1 6608.04 1.46538 0.0504554 29.0431 1.87914e-185 1.67432e-182
## C03D6.1 29648.59 1.26099 0.0438481 28.7582 7.15561e-182 5.66724e-179
## F25H2.16 18758.88 1.45743 0.0518534 28.1067 8.12292e-174 5.79002e-171
The next thing we set out to do was make a volcano plot with
this data. This is a very common plot for differential transcriptomics,
and you could probably figure it out on your own with ggplot, but lets
see what chatGPT can do. The prompt might be:
* Make a
volcano plot from the output of DESeq2 in R with ggplot
Which might give you an answer similar to:
# Create a dataframe with log2 fold changes and adjusted p-values
volcano_data <- as.data.frame(results(dds)[, c("log2FoldChange", "padj")])
# Plot volcano plot
ggplot(volcano_data, aes(x = log2FoldChange, y = -log10(padj))) +
geom_point(size = 1) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "red") + # Add significance threshold
labs(x = "log2 Fold Change", y = "-log10(Adjusted p-value)", title = "Volcano Plot")
## Warning: Removed 39806 rows containing missing values or values outside the scale range
## (`geom_point()`).
This looks pretty good, but we can do better! Lets ask chatGPT to
add some text labels for out top hits and change some point colors.
library(ggrepel)
volcano_data <- as.data.frame(results(dds)[, c("log2FoldChange", "padj")])
volcano_data$gene <- rownames(volcano_data) # Add gene names as a new column
# Plot volcano plot with customized axes and colors
ggplot(volcano_data, aes(x = log2FoldChange, y = -log10(padj), color = ifelse(log2FoldChange > 0 & padj < 0.05, "red", ifelse(padj < 0.05, "blue", "gray")))) +
geom_point(size = 1) + # Plot points
geom_hline(yintercept = -log10(0.005), linetype = "dashed", color = "black") + # Add significance threshold
geom_text_repel(data = head(volcano_data[order((volcano_data$padj)),], 5), # Top 5 differentially expressed genes
aes(label = gene),
size = 3,
vjust = 0.1,
hjust = 1) + # Add gene labels
labs(x = "log2 Fold Change", y = "-log10(Adjusted p-value)", title = "Volcano Plot") +
theme_classic() +
scale_color_manual(values = c("red", "darkgray", "blue")) + # Set color palette
guides(color = FALSE) # Remove legend
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 39806 rows containing missing values or values outside the scale range
## (`geom_point()`).
To finish, I want to to try and design and write some very simple
projects yourself. You can use the fish data from before, you can use my
RNAseq data, or you can use the example datasets provided by dplyr.
dplyr is loaded when we load tidyverse, and comes with a couple of small
dataframes. One of these is called “starwars” lets take a look at this
dataframe:
starwars
## # A tibble: 87 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs D… 183 84 black light brown 24 male mascu…
## 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Another data set is called storms:
storms
## # A tibble: 19,537 × 13
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropical d… NA 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropical d… NA 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropical d… NA 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropical d… NA 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropical d… NA 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropical d… NA 25 1012
## 7 Amy 1975 6 28 12 33.3 -78 tropical d… NA 25 1011
## 8 Amy 1975 6 28 18 34 -77 tropical d… NA 30 1006
## 9 Amy 1975 6 29 0 34.4 -75.8 tropical s… NA 35 1004
## 10 Amy 1975 6 29 6 34 -74.8 tropical s… NA 40 1002
## # ℹ 19,527 more rows
## # ℹ 2 more variables: tropicalstorm_force_diameter <int>,
## # hurricane_force_diameter <int>
What kinds of questions can we ask with these data sets? The
possibilities are endless, but try and think of some easy questions on
your own, plan out how you might graph them, and then try and use the
skills you’ve learned with chatGPT to fill in the blanks to make a quick
graph.
I’ve chosen to look at the storms dataset, and
ask “are severe storms getting more frequent?” To do this, I’ll first
map out what I want to do in comment blocks:
# Look at my data set
head(storms)
## # A tibble: 6 × 13
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropical de… NA 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropical de… NA 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropical de… NA 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropical de… NA 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropical de… NA 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropical de… NA 25 1012
## # ℹ 2 more variables: tropicalstorm_force_diameter <int>,
## # hurricane_force_diameter <int>
# Subset this data for only storms of Category 4 or greater
big_storms <- filter(storms, category >= 4)
# Count how many big storms there were per year
storms_per_year <- dplyr::count(big_storms, year)
# Plot this data with year as the x axis and number of storms on the y axis
ggplot(storms_per_year, aes(x = year, y = n)) +
geom_line(color = "blue") +
labs(title = "Number of Storms (Category 4+) by Year",
x = "Year",
y = "Number of Storms") +
theme_classic()
# Add a trend line
ggplot(storms_per_year, aes(x = year, y = n)) +
geom_line(color = "blue") +
labs(title = "Number of Storms (Category 4+) by Year",
x = "Year",
y = "Number of Storms") +
theme_classic() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'