Home



Using chatGPT to help with coding:

chatGPT


Google Gemini

Large Language Model chatbots like chatGPT and Google Gemini are increasingly available and sophisticated. They can do any number of tasks, including help with coding projects and data analysis. Let’s explore a couple ways these tools can be helpful in data analysis, using the data from the last few weeks.


First let’s reload the lizard data. Make sure you set your working directory like we did before.

# Make sure this is set to the correct directory on your computer
setwd("~/r4grads/lizard_data/")

We’ll need our all_data object again. If you have your "lizard_data_merged.csv" file that we wrote out last time, you can read in:

all_data <- read.csv("lizard_data_merged.csv")



If you’ve lost track of this file, we can create it again:

# read in each of the data files
sample_data <- read.csv("lizard_data.csv")
physio_data <- read.csv("physio_data.csv")
# Make all of the substitutions that we made last time
sample_data$Species <- gsub("S. Occidentalis", "S. occidentalis", sample_data$Species)
sample_data$Species <- gsub("S. occidentlis", "S. occidentalis", sample_data$Species)
sample_data$Species <- gsub("^graciosus$", "S. graciosus", sample_data$Species)
# Merge the data
all_data <- merge(sample_data, physio_data)


We also need to load up the tidyverse set of packages:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Here’s our violin plots from last week

ggplot(data = all_data, mapping = aes(x = Species, y = Ctmax)) + 
  geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width=0.1)+
  labs(x = NULL, y = "CTmax") + # Get rid of the x label, sometimes the categories are self explanatory
  theme_classic() +
  theme(legend.position="none")



This shows us that we probably have some significantly different CTmax values among the different species, but to actually claim that in a paper, we need to run a statistical test like an ANOVA. I know that ANOVAs can be run in R, but I don’t remember how to run them. Instead of Googling that, let’s see if chatGPT can write some code for us.


Open up chatGPT and login. For most of these chatbots, you’ll need a free account to access them. You can make a new account, or link your Google, Microsoft or Apple account to the service and log in that way. If you already have a Gmail and google account you can use Google Gemini which should give you the same results. The process is fairly easy, but if you don’t want to deal with making an account feel free to just follow along for this part.

Once we’re into the chatbot, lets ask it our question: “How do I run an ANOVA in R?”

You should get something like this, with additional explanation that went farther down the page:



Cool! This gives us some dummy code with some groups made of randomly sampled numbers. We could go through and figure out how to apply this to our data, but I’m lazy. Instead, let’s give chatGPT more detail about our data and what we want. How would you structure a query to tell chatGPT about the data and that we want to run an ANOVA to determine if CTmax varies by species?





































I would ask that question like this:


I have an R object called "all_data" that looks like this:

  Lizard.ID.         Species Elevation..m. svl Ctmax
1       10_o S. occidentalis           904  75 40.50
2       11_o S. occidentalis           904  66 43.72
3       12_j     S. jarrovii          1646  53 42.08


I want to run an ANOVA to determine if "Ctmax" is statistically significantly different among "Species". There are 4 different species: "S. occidentalis" "S. jarrovii"     "S. graciosus"    "S. virgatus" 

How do I do this?


I get this:





Let’s try that out:

anova_result <- aov(Ctmax ~ Species, data = all_data)
summary(anova_result)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       3 1029.2   343.1   69.14 <2e-16 ***
## Residuals   169  838.6     5.0                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I’m going to skip the homogeneity of variances tests now, but you should not with your own data. Instead, I’m going to skip straight to the post-hoc test to see which groups are different:

TukeyHSD(anova_result)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Ctmax ~ Species, data = all_data)
## 
## $Species
##                                    diff        lwr         upr     p adj
## S. jarrovii-S. graciosus     -7.4417672 -8.8446220 -6.03891246 0.0000000
## S. occidentalis-S. graciosus -0.8296082 -1.8071751  0.14795878 0.1268432
## S. virgatus-S. graciosus     -3.4921839 -6.9144652 -0.06990263 0.0435667
## S. occidentalis-S. jarrovii   6.6121591  5.2811303  7.94318792 0.0000000
## S. virgatus-S. jarrovii       3.9495833  0.4100905  7.48907620 0.0220714
## S. virgatus-S. occidentalis  -2.6625758 -6.0560467  0.73089523 0.1790067

Pretty cool!


What if we want to add in some annotations on our violin plots to show the significance? How would you ask chatGPT for this?





































This is my query:

I have this violin plot code:

ggplot(data = all_data, mapping = aes(x = Species, y = Ctmax)) + 
  geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width=0.1)+
  labs(x = NULL, y = "CTmax") + # Get rid of the x label, sometimes the categories are self explanatory
  theme_classic() +
  theme(legend.position="none")

and ran ANOVA & Tukey tests:

anova_result <- aov(Ctmax ~ Species, data = all_data)
summary(anova_result)
TukeyHSD(anova_result)



In the violin plot, how can I put askerisks or letters above each of the 4 species to indicate which are significantly from each other?

The output is long, but the relevant code I get out is this:

install.packages("ggsignif")
# Tukey HSD test results
tukey_result <- TukeyHSD(anova_result)

# Extracting the significant comparisons
tukey_df <- as.data.frame(tukey_result$Species)
colnames(tukey_df)[colnames(tukey_df) == "p adj"] <- "p.adj"  # I added this manually
significant_pairs <- tukey_df[tukey_df$p.adj < 0.05, ]

# Plot
library(ggplot2)
library(ggsignif)

ggplot(data = all_data, mapping = aes(x = Species, y = Ctmax)) + 
  geom_violin(trim = FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width = 0.1) +
  labs(x = NULL, y = "CTmax") +
  theme_classic() +
  theme(legend.position = "none") +
  
  # Adding significance annotations
  geom_signif(comparisons = list(c("S. occidentalis", "S. jarrovii"),
                                 c("S. occidentalis", "S. graciosus"),
                                 c("S. jarrovii", "S. graciosus"),
                                 c("S. virgatus", "S. graciosus")), 
              map_signif_level = TRUE)



There’s an obvious problem here, the lines obove the violin plots are completely overlapping. I’ll tell chatgpt:

with ggsignif, I get completely overlapping bars above my violins

and I get out this modified code:

# Tukey HSD test results
tukey_result <- TukeyHSD(anova_result)

# Extracting the significant comparisons
tukey_df <- as.data.frame(tukey_result$Species)
colnames(tukey_df)[colnames(tukey_df) == "p adj"] <- "p.adj"  # I added this manually
significant_pairs <- tukey_df[tukey_df$p.adj < 0.05, ]

# Plot
library(ggplot2)
library(ggsignif)

ggplot(data = all_data, mapping = aes(x = Species, y = Ctmax)) + 
  geom_violin(trim = FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width = 0.1) +
  labs(x = NULL, y = "CTmax") +
  theme_classic() +
  theme(legend.position = "none") +
  
  # Adding significance annotations
  geom_signif(comparisons = list(c("S. occidentalis", "S. jarrovii"),
                                 c("S. occidentalis", "S. graciosus"),
                                 c("S. jarrovii", "S. graciosus"),
                                 c("S. virgatus", "S. graciosus")), 
              map_signif_level = TRUE, 
              y_position = c(45, 46, 47, 48))  # Manually adjust y_position

We can fix those lines, and we also need to fix the ordering of the lines. If you look back at your Tukey test summary, you’ll see that we’re not showing the correct significance values for the comparisons - S. virgatus-S. jarrovii should be significant but doesn’t show on the plot and S. occidentalis-S. graciosus should not be significant. When coding with chatGPT you always need to be careful that you’re getting out what you expect. It is easy to get things that look right at first glance but that misrepresent your data.




I asked a few more queries, and the closest I was able to get was this:

# Tukey HSD test results
tukey_result <- TukeyHSD(anova_result)

# Extracting the significant comparisons
tukey_df <- as.data.frame(tukey_result$Species)
colnames(tukey_df)[colnames(tukey_df) == "p adj"] <- "p.adj"  # I added this manually
significant_pairs <- tukey_df[tukey_df$p.adj < 0.05, ]

# Plot
library(ggplot2)
library(ggsignif)

ggplot(data = all_data, mapping = aes(x = Species, y = Ctmax)) + 
  geom_violin(trim = FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width = 0.1) +
  labs(x = NULL, y = "CTmax") +
  theme_classic() +
  theme(legend.position = "none") +
  
  # Adding significance annotations
  geom_signif(comparisons = list(c("S. jarrovii", "S. graciosus"),
                                 c("S. virgatus", "S. graciosus"),
                                 c("S. occidentalis", "S. jarrovii"),
                                 c("S. virgatus", "S. jarrovii")), 
              map_signif_level = TRUE, 
              y_position = c(45 + 2, 46 + 3, 47 + 4.5, 48 + 5.5))  # Raised by 2 units


Superficially, this looks good, but it has the wrong significance symbols above the lines. This was also the case before, but we couldn’t tell that. After an hour or so of chatGPT giving me increasingly wrong answers, I looked at the code more carefully, realized that ggsignif was calculating incorrect (or at least different) significance values internally, and it was not using the values we calculated from the Tukey test at all. I told that to chatGPT and then finally got this, which manually supplies the annotations to go over the lines:

# Tukey HSD test results
tukey_result <- TukeyHSD(anova_result)
tukey_df <- as.data.frame(tukey_result$Species)
colnames(tukey_df)[colnames(tukey_df) == "p adj"] <- "p.adj"  # Rename for consistency

# Extracting significant comparisons
significant_pairs <- tukey_df[tukey_df$p.adj < 0.05, ]

# Calculate y_positions for the significance bars dynamically
y_positions <- seq(from = max(all_data$Ctmax) + 2, 
                   length.out = nrow(significant_pairs), 
                   by = 2)  # Adjust 'by' as needed to create space between bars

# Create a custom data frame for manual p-values
comparisons <- rownames(significant_pairs)
manual_pvalues <- significant_pairs$p.adj
significance_levels <- ifelse(manual_pvalues < 0.001, "***",
                              ifelse(manual_pvalues < 0.01, "**",
                                     ifelse(manual_pvalues < 0.05, "*", "NS")))

# Create a data frame to pass to geom_signif
signif_data <- data.frame(comparison = comparisons, 
                          p_value = manual_pvalues, 
                          y = y_positions, 
                          significance = significance_levels)

# Plot
library(ggplot2)
library(ggsignif)

ggplot(data = all_data, mapping = aes(x = Species, y = Ctmax)) + 
  geom_violin(trim = FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width = 0.1) +
  labs(x = NULL, y = "CTmax") +
  theme_classic() +
  theme(legend.position = "none") +
  
  # Adding the significance lines with the manually supplied p-values
  geom_signif(comparisons = list(c("S. jarrovii", "S. graciosus"),
                                 c("S. virgatus", "S. graciosus"),
                                 c("S. occidentalis", "S. jarrovii"),
                                 c("S. virgatus", "S. jarrovii")), 
              annotations = signif_data$significance, # Manually supply the significance labels
              y_position = signif_data$y,             # Manually supply y_positions
              map_signif_level = FALSE)  # Don't automatically calculate significance levels

This looks really good and is finally correct, but it took a good bit of me looking at the code, testing various failed suggestions from chatGPT, and ultimately identifying the problem myself and telling chatGPT what the issue was, which finally got the right result. chatGPT is a great resource, but even moderately complex coding can easily be too much for it, and being able to at least identify what is wrong will become essential.



Planning out your coding projects



You now know some of the very basics of R and all the many wonderful things you can do with it. When starting a project from scratch however, it can be a bit daunting on where to begin, so lets think about some solid practices for planning out your very own data analysis.


When starting a new project I like to say out loud to myself, or just write down in simple terms what exactly it is I’m trying to accomplish. This may sound a little silly but can save you lots of time in the end. Let’s look at an example of manipulating some RNAseq data.


# Load my data and metadata into R

# Inspect my data to make sure it looks OK

# Filter for genotype and temperature

# Run DEseq2

# Plot a Volcano Plot of the results

Let’s start with the first step, loading my data into R. I already put this data on my GitHub, so I’ll download it from there. Feel free to grab my data as well and follow along if you wish.


# Load my data into R
meta_data <- read.csv("https://raw.githubusercontent.com/ohfunk/L1_bulk_22/main/L1_bulk_22_metadata.csv")
gene_counts <- read.csv("https://raw.githubusercontent.com/ohfunk/L1_bulk_22/main/L1_bulk_22_gene_counts_matrix.tsv", sep = "\t", row.names = "Gene")

# Inspect my data to make sure it looks OK
head(meta_data)
##                    file  geno temp sample
## 1 counts_BP175_16C_1_S1 BP175  16C      1
## 2 counts_BP175_16C_2_S2 BP175  16C      2
## 3 counts_BP175_16C_3_S3 BP175  16C      3
## 4 counts_BP175_22C_1_S4 BP175  22C      1
## 5 counts_BP175_22C_2_S5 BP175  22C      2
## 6 counts_BP175_22C_3_S6 BP175  22C      3
# Filter for genotype and temperature

# Run DEseq2

# Plot a Volcano Plot of the results

This looks pretty good, now lets look at the gene_counts:


# Inspect my data to make sure it looks OK
head(gene_counts)
##       counts_BP175_16C_1_S1 counts_BP175_16C_2_S2 counts_BP175_16C_3_S3
## aap-1                   160                   146                   112
## aat-1                     4                     4                     0
## aat-2                     4                     0                     2
## aat-3                     3                     2                     1
## aat-4                     3                     2                     0
## aat-5                     0                     0                     0
##       counts_BP175_22C_1_S4 counts_BP175_22C_2_S5 counts_BP175_22C_3_S6
## aap-1                    63                    60                    55
## aat-1                     4                     0                     6
## aat-2                    41                    10                     3
## aat-3                     1                     0                     9
## aat-4                     0                     0                     0
## aat-5                     0                     1                     0
##       counts_BP175_25C_1_S7 counts_BP175_25C_2_S8 counts_BP175_25C_3_S9
## aap-1                   114                    84                    91
## aat-1                     4                     2                     2
## aat-2                     4                     8                     0
## aat-3                    14                     7                    13
## aat-4                     0                     0                     0
## aat-5                     0                     0                     0
##       counts_N2_16C_1_S10 counts_N2_16C_2_S11 counts_N2_16C_3_S12
## aap-1                 320                 351                 357
## aat-1                   8                   5                   0
## aat-2                   2                   0                   0
## aat-3                  15                  11                   4
## aat-4                   2                   0                   0
## aat-5                   0                   0                   0
##       counts_N2_22C_1_S13 counts_N2_22C_2_S14 counts_N2_22C_3_S15
## aap-1                 208                 150                 282
## aat-1                   2                   0                   0
## aat-2                   2                   0                   1
## aat-3                   1                   0                   2
## aat-4                   0                   0                   0
## aat-5                   0                   0                   0
##       counts_N2_25C_1_S16 counts_N2_25C_2_S17 counts_N2_25C_3_S18
## aap-1                 158                 173                 168
## aat-1                   6                   0                   2
## aat-2                   4                   6                   6
## aat-3                  16                  12                   2
## aat-4                   0                   0                   0
## aat-5                   0                   0                   0
##       counts_WH171_16C_1_S19 counts_WH171_16C_2_S20 counts_WH171_16C_3_S21
## aap-1                    259                    203                    200
## aat-1                      0                      4                      1
## aat-2                      0                      0                      2
## aat-3                      6                      4                      5
## aat-4                      0                      1                      2
## aat-5                      0                      0                      0
##       counts_WH171_22C_1_S22 counts_WH171_22C_2_S23 counts_WH171_22C_3_S24
## aap-1                    180                    190                    209
## aat-1                      0                      0                      0
## aat-2                     16                     14                     13
## aat-3                      4                      0                      2
## aat-4                      0                      0                      0
## aat-5                      0                      0                      0
##       counts_WH171_25C_1_S25 counts_WH171_25C_2_S26 counts_WH171_25C_3_S27
## aap-1                    150                    149                    188
## aat-1                      0                      2                      2
## aat-2                      0                      6                      2
## aat-3                     37                     34                     19
## aat-4                      0                      0                      0
## aat-5                      0                      0                      0
# Filter for genotype and temperature

# Run DEseq2

# Plot a Volcano Plot of the results


Looks pretty good! However there’s a lot of different samples here. Lets ask chatGPT for help filtering out only the samples with genotype: N2 or BP175 and temperature: 16C



This time chatGPT suggests using the grep function. This is a command line tool from Unix but we can use it in R as well. Lets try grep on our gene_counts object.


# Filter for genotype and temperature
gene_counts_cleaned <- gene_counts[, grep("N2_16C|BP175_16C", names(gene_counts))]
colnames(gene_counts_cleaned)
## [1] "counts_BP175_16C_1_S1" "counts_BP175_16C_2_S2" "counts_BP175_16C_3_S3"
## [4] "counts_N2_16C_1_S10"   "counts_N2_16C_2_S11"   "counts_N2_16C_3_S12"
# Run DESeq2


# Plot a Volcano Plot of the results

Great, looks like we have just 6 samples now, 3 of N2 and 3 of WH171. Lets do this for out meta_data too. We could try and use this grep function again (feel free to try and make it work for rows in meta_data) but we can also go back to the filter() function we learned before.


# Filter for genotype and temperature
meta_data_cleaned <- filter(meta_data, geno == 'N2' & temp == '16C' | geno =='BP175' & temp =='16C')
meta_data_cleaned
##                    file  geno temp sample
## 1 counts_BP175_16C_1_S1 BP175  16C      1
## 2 counts_BP175_16C_2_S2 BP175  16C      2
## 3 counts_BP175_16C_3_S3 BP175  16C      3
## 4   counts_N2_16C_1_S10    N2  16C      1
## 5   counts_N2_16C_2_S11    N2  16C      2
## 6   counts_N2_16C_3_S12    N2  16C      3
# Run DESeq2


# Plot a Volcano Plot of the results


Now it’s time to run our differential expression. This is a BIG oversimplification of some complex biostatistics. If you were doing this on your own you would want to read up on the DESeq2 package and understand the nuances of what you were applying to the data.


However, using chatbots can give you a great start on getting there, so lets see if chatGPT can help us with running DESeq2. Our prompt would probably be something like:



Your results may vary, but I got the below back from chatGPT:


# Install and load required packages
install.packages("DESeq2")
install.packages("tidyverse")
library(DESeq2)
library(tidyverse)

# Load metadata
meta_data <- read.csv("path/to/meta_data.csv")  # Replace "path/to/meta_data.csv" with your file path

# Load counts matrix
gene_counts_cleaned <- read.csv("path/to/gene_counts_cleaned.csv", row.names = 1)  # Replace "path/to/gene_counts_cleaned.csv" with your file path

# Check dimensions and structure of your data
dim(meta_data)
dim(gene_counts_cleaned)
head(meta_data)
head(gene_counts_cleaned)

# Create DESeq2 object
dds <- DESeqDataSetFromMatrix(countData = gene_counts_cleaned,
                              colData = meta_data_cleaned,
                              design = ~ your_formula_here)

# Replace "your_formula_here" with the formula representing your experimental design.
# For example, if your design has a single factor "condition", the formula would be:
# design = ~ condition

# Run DESeq2 analysis
dds <- DESeq(dds)

# Get differential expression results
res <- results(dds)

# Extract significant results
sig_results <- res[which(res$padj < 0.05), ]

# Check top DE genes
topGenes <- head(sig_results[order(sig_results$padj), ], n = 10)

# Explore or visualize the results further as needed


A pretty good start! Clearly we don’t need to reload the packages with library, and we already have meta_data and gene_counts_cleaned as variables we’ve created, but you can see how chatGPT is trying to give us it’s best guess as to what we’re doing, and even uses head() and dim() to QC our data much like we planned before. If I strip out the unnecessary pieces it should looks something like:


library(DESeq2)
# Create DESeq2 object
dds <- DESeqDataSetFromMatrix(countData = gene_counts_cleaned,
                              colData = meta_data_cleaned,
                              design = ~ your_formula_here)

# Replace "your_formula_here" with the formula representing your experimental design.
# For example, if your design has a single factor "condition", the formula would be:
# design = ~ condition

# Run DESeq2 analysis
dds <- DESeq(dds)

# Get differential expression results
res <- results(dds)

# Extract significant results
sig_results <- res[which(res$padj < 0.05), ]

# Check top DE genes
topGenes <- head(sig_results[order(sig_results$padj), ], n = 10)


If you read some of the commented code, you can see that it has a dummy variable “your_formula_here” that we’ll need to replace. Because we didn’t specify what the condition was we were looking at, we can remind our selves of what the meta_data table looks like with View(meta_data) and choose the appropriate column, in this case “geno” and substitute it in for your_formula_here

suppressMessages(library(DESeq2))
## Warning: package 'DESeq2' was built under R version 4.3.3
## Warning: package 'GenomeInfoDb' was built under R version 4.3.3
# Create DESeq2 object
dds <- DESeqDataSetFromMatrix(countData = gene_counts_cleaned,
                              colData = meta_data_cleaned,
                              design = ~ geno)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
# Run DESeq2 analysis
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
# Get differential expression results
res <- results(dds)

# Extract significant results
sig_results <- res[which(res$padj < 0.05), ]

# Check top DE genes
topGenes <- head(sig_results[order(sig_results$padj), ], n = 10)


Now that DESeq is finished running, lets take a look at our topGenes object that it created:


topGenes
## log2 fold change (MLE): geno N2 vs BP175 
## Wald test p-value: geno N2 vs BP175 
## DataFrame with 10 rows and 6 columns
##           baseMean log2FoldChange     lfcSE      stat       pvalue         padj
##          <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
## T28F4.4   76388.16        1.69646 0.0472487   35.9048 2.56953e-282 1.83156e-278
## stn-1      6426.76        1.70544 0.0506905   33.6442 3.78939e-248 1.35054e-244
## arx-1      9389.14        1.95859 0.0587145   33.3579 5.60367e-244 1.33143e-240
## twk-49    48952.82        1.32816 0.0418461   31.7393 4.46729e-221 7.96071e-218
## prp-40     3237.18        1.83227 0.0590478   31.0304 2.09880e-211 2.99204e-208
## trt-1      3598.79        2.03207 0.0671503   30.2616 3.67666e-201 4.36787e-198
## kqt-1      6545.16        1.93614 0.0649398   29.8143 2.54585e-195 2.59240e-192
## pigo-1     6608.04        1.46538 0.0504554   29.0431 1.87914e-185 1.67432e-182
## C03D6.1   29648.59        1.26099 0.0438481   28.7582 7.15561e-182 5.66724e-179
## F25H2.16  18758.88        1.45743 0.0518534   28.1067 8.12292e-174 5.79001e-171


The next thing we set out to do was make a volcano plot with this data. This is a very common plot for differential transcriptomics, and you could probably figure it out on your own with ggplot, but lets see what chatGPT can do. The prompt might be:



Which might give you an answer similar to:


# Create a dataframe with log2 fold changes and adjusted p-values
volcano_data <- as.data.frame(results(dds)[, c("log2FoldChange", "padj")])

# Plot volcano plot
ggplot(volcano_data, aes(x = log2FoldChange, y = -log10(padj))) +
  geom_point(size = 1) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "red") +  # Add significance threshold
  labs(x = "log2 Fold Change", y = "-log10(Adjusted p-value)", title = "Volcano Plot")
## Warning: Removed 39806 rows containing missing values or values outside the scale range
## (`geom_point()`).


This looks pretty good, but we can do better! Lets ask chatGPT to add some text labels for out top hits and change some point colors.


library(ggrepel)
volcano_data <- as.data.frame(results(dds)[, c("log2FoldChange", "padj")])
volcano_data$gene <- rownames(volcano_data)  # Add gene names as a new column



# Plot volcano plot with customized axes and colors
ggplot(volcano_data, aes(x = log2FoldChange, y = -log10(padj), color = ifelse(log2FoldChange > 0 & padj < 0.05, "red", ifelse(padj < 0.05, "blue", "gray")))) +
  geom_point(size = 1) +  # Plot points
  geom_hline(yintercept = -log10(0.005), linetype = "dashed", color = "black") +  # Add significance threshold
  geom_text_repel(data = head(volcano_data[order((volcano_data$padj)),], 5),  # Top 5 differentially expressed genes
                  aes(label = gene), 
                  size = 3, 
                  vjust = 0.1,
                  hjust = 1) +  # Add gene labels
  labs(x = "log2 Fold Change", y = "-log10(Adjusted p-value)", title = "Volcano Plot") +
  theme_classic() +
  scale_color_manual(values = c("red", "darkgray", "blue")) +  # Set color palette
  guides(color = FALSE)  # Remove legend
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 39806 rows containing missing values or values outside the scale range
## (`geom_point()`).




Data exploration on your own

To finish, I want to to try and design and write some very simple projects yourself. You can use the lizard data from before, you can use my RNAseq data, or you can use the example datasets provided by dplyr. dplyr is loaded when we load tidyverse, and comes with a couple of small dataframes. One of these is called “starwars” lets take a look at this dataframe:


starwars
## # A tibble: 87 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
##  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
##  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
##  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
##  9 Biggs D…    183    84 black      light      brown           24   male  mascu…
## 10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>


Another data set is called storms:


storms
## # A tibble: 19,537 × 13
##    name   year month   day  hour   lat  long status      category  wind pressure
##    <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>          <dbl> <int>    <int>
##  1 Amy    1975     6    27     0  27.5 -79   tropical d…       NA    25     1013
##  2 Amy    1975     6    27     6  28.5 -79   tropical d…       NA    25     1013
##  3 Amy    1975     6    27    12  29.5 -79   tropical d…       NA    25     1013
##  4 Amy    1975     6    27    18  30.5 -79   tropical d…       NA    25     1013
##  5 Amy    1975     6    28     0  31.5 -78.8 tropical d…       NA    25     1012
##  6 Amy    1975     6    28     6  32.4 -78.7 tropical d…       NA    25     1012
##  7 Amy    1975     6    28    12  33.3 -78   tropical d…       NA    25     1011
##  8 Amy    1975     6    28    18  34   -77   tropical d…       NA    30     1006
##  9 Amy    1975     6    29     0  34.4 -75.8 tropical s…       NA    35     1004
## 10 Amy    1975     6    29     6  34   -74.8 tropical s…       NA    40     1002
## # ℹ 19,527 more rows
## # ℹ 2 more variables: tropicalstorm_force_diameter <int>,
## #   hurricane_force_diameter <int>


What kinds of questions can we ask with these data sets? The possibilities are endless, but try to think of some easy questions on your own, plan out how you might graph them, and then try and use the skills you’ve learned with chatGPT to fill in the blanks to make a quick graph.

































I’ve chosen to look at the storms dataset and ask “are severe storms getting more frequent?” To do this, I’ll first map out what I want to do in comment blocks:


# Look at my data set
head(storms)
## # A tibble: 6 × 13
##   name   year month   day  hour   lat  long status       category  wind pressure
##   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>           <dbl> <int>    <int>
## 1 Amy    1975     6    27     0  27.5 -79   tropical de…       NA    25     1013
## 2 Amy    1975     6    27     6  28.5 -79   tropical de…       NA    25     1013
## 3 Amy    1975     6    27    12  29.5 -79   tropical de…       NA    25     1013
## 4 Amy    1975     6    27    18  30.5 -79   tropical de…       NA    25     1013
## 5 Amy    1975     6    28     0  31.5 -78.8 tropical de…       NA    25     1012
## 6 Amy    1975     6    28     6  32.4 -78.7 tropical de…       NA    25     1012
## # ℹ 2 more variables: tropicalstorm_force_diameter <int>,
## #   hurricane_force_diameter <int>
# Subset this data for only storms of Category 4 or greater
big_storms <- filter(storms, category >= 4)

# Count how many big storms there were per year
storms_per_year <- dplyr::count(big_storms, year)

# Plot this data with year as the x axis and number of storms on the y axis
ggplot(storms_per_year, aes(x = year, y = n)) +
  geom_line(color = "blue") +  
  labs(title = "Number of Storms (Category 4+) by Year",
       x = "Year",
       y = "Number of Storms") +
       theme_classic()

# Add a trend line
ggplot(storms_per_year, aes(x = year, y = n)) +
  geom_line(color = "blue") +  
  labs(title = "Number of Storms (Category 4+) by Year",
       x = "Year",
       y = "Number of Storms") +
       theme_classic() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'




Home