


Using ChatGPT to help with coding:

ChatGPT
Google Gemini

Large Language Model chatbots like ChatGPT and Google Gemini are increasingly available and sophisticated. They can help with any number of tasks, including coding projects and data analysis. Let's explore a couple of ways these tools can help us add features to the plots we made last week.
First let's reload the fish data. Make sure you set your working directory with setwd() like we did before. We'll also load the tidyverse again so we have all of those functions as well.

# Make sure you set this to your personal working directory with the fish data!
setwd("~/Downloads/r4grads-master/Fish_data/Modified/")

Now let's reload our tidyverse functions with library() and load the fish data we were using last week.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
all_data <- read.csv("fish_data_merged.csv")

Here are our violin plots from last week:

ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) + 
  geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width=0.1)+
  labs(x = NULL, y = "delta 15 N") + # Get rid of the x label, sometimes the categories are self explanatory
  theme_classic()



These look great, but what if we want to rotate the x-axis labels 45 degrees to make them a bit easier to read? If you're not immediately sure how to do this, you could Google it, but AI chatbots might also be helpful.


Open up ChatGPT and log in. For most of these chatbots, you'll need a free account to access them. You can make a new account, or sign in with your Google, Microsoft, or Apple account. If you already have a Google account, you can use Google Gemini instead, which should give you similar results. The process is fairly easy, but if you don't want to deal with making an account, feel free to just follow along for this part.

Once we're into the chatbot, let's ask it our question: "How do I adjust x-axis labels 45 degrees on a ggplot in R?"



Cool! It looks like one way to do this is by adding to theme(), a function we've used before in ggplot. ChatGPT and Gemini can typically tell when you're asking questions about code, and will answer with code. We don't need the entire script in this case, just the last line:

theme(axis.text.x = element_text(angle = 45, hjust = 1))

and if we copy-paste that into our violin plot code, it should look something like this:

ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) + 
  geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width=0.1)+
  labs(x = NULL, y = "delta 15 N") + # Get rid of the x label, sometimes the categories are self explanatory
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme_classic()



That’s weird, it didn’t seem to tilt our x-axis labels. Can you see what I did wrong?



































Maybe you remember from last week that the order matters! But if you didn't notice my mistake, or if you encounter other problems in your coding adventures, you can also ask these chatbots for help. In this instance, I'm going to give it my entire block of code and simply ask, "Why doesn't this work?" It can be useful to use Shift+Enter here to put multiple lines of text in the prompt box before submitting your request.



Of course: we're trying to use theme() twice, and theme_classic() is overwriting our x-axis label settings. We could look at this and fix it by hand, or we can just hit the "Copy code" button and paste the result into our R script, now that we've given the chatbot the specifics of the code we're trying to write.

ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) + 
  geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width=0.1)+
  labs(x = NULL, y = "delta 15 N") +
  theme_classic() +  # Apply theme_classic() first
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Then customize axis text


Much better! As you could probably already guess, these chatbots can only get you so far, and you still need to think for yourself and do some troubleshooting. They're still very useful tools, though, and can teach you new ways to do things in R or give you hints as to why your code isn't working properly. Let's take a look at another example.

Planning out your coding projects



You now know some of the very basics of R and the many wonderful things you can do with it. When starting a project from scratch, however, it can be daunting to know where to begin, so let's think about some solid practices for planning out your very own data analysis.
When starting a new project, I like to say out loud, or just write down in simple terms, exactly what it is I'm trying to accomplish. This may sound a little silly, but it can save you lots of time in the end. Let's look at an example of manipulating some RNAseq data.

# Load my data and metadata into R

# Inspect my data to make sure it looks OK

# Filter for genotype and temperature

# Run DEseq2

# Plot a Volcano Plot of the results

Let's start with the first step: loading my data into R. I've already put this data on my GitHub, so I'll download it from there. Feel free to grab the data and follow along if you wish.

# Load my data into R
meta_data <- read.csv("https://raw.githubusercontent.com/ohfunk/L1_bulk_22/main/L1_bulk_22_metadata.csv")
gene_counts <- read.csv("https://raw.githubusercontent.com/ohfunk/L1_bulk_22/main/L1_bulk_22_gene_counts_matrix.tsv", sep = "\t", row.names = "Gene")

# Inspect my data to make sure it looks OK
head(meta_data)
##                    file  geno temp sample
## 1 counts_BP175_16C_1_S1 BP175  16C      1
## 2 counts_BP175_16C_2_S2 BP175  16C      2
## 3 counts_BP175_16C_3_S3 BP175  16C      3
## 4 counts_BP175_22C_1_S4 BP175  22C      1
## 5 counts_BP175_22C_2_S5 BP175  22C      2
## 6 counts_BP175_22C_3_S6 BP175  22C      3
# Filter for genotype and temperature

# Run DEseq2

# Plot a Volcano Plot of the results
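
Besides head(), dim() and str() are quick, general-purpose checks worth running at this stage. Nothing below is specific to this dataset, it's just base R; treat it as an optional sketch:

dim(meta_data)   # number of rows and columns
str(meta_data)   # each column's name and type at a glance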

The metadata looks pretty good; now let's look at gene_counts:

# Inspect my data to make sure it looks OK
head(gene_counts)
##       counts_BP175_16C_1_S1 counts_BP175_16C_2_S2 counts_BP175_16C_3_S3
## aap-1                   160                   146                   112
## aat-1                     4                     4                     0
## aat-2                     4                     0                     2
## aat-3                     3                     2                     1
## aat-4                     3                     2                     0
## aat-5                     0                     0                     0
##       counts_BP175_22C_1_S4 counts_BP175_22C_2_S5 counts_BP175_22C_3_S6
## aap-1                    63                    60                    55
## aat-1                     4                     0                     6
## aat-2                    41                    10                     3
## aat-3                     1                     0                     9
## aat-4                     0                     0                     0
## aat-5                     0                     1                     0
##       counts_BP175_25C_1_S7 counts_BP175_25C_2_S8 counts_BP175_25C_3_S9
## aap-1                   114                    84                    91
## aat-1                     4                     2                     2
## aat-2                     4                     8                     0
## aat-3                    14                     7                    13
## aat-4                     0                     0                     0
## aat-5                     0                     0                     0
##       counts_N2_16C_1_S10 counts_N2_16C_2_S11 counts_N2_16C_3_S12
## aap-1                 320                 351                 357
## aat-1                   8                   5                   0
## aat-2                   2                   0                   0
## aat-3                  15                  11                   4
## aat-4                   2                   0                   0
## aat-5                   0                   0                   0
##       counts_N2_22C_1_S13 counts_N2_22C_2_S14 counts_N2_22C_3_S15
## aap-1                 208                 150                 282
## aat-1                   2                   0                   0
## aat-2                   2                   0                   1
## aat-3                   1                   0                   2
## aat-4                   0                   0                   0
## aat-5                   0                   0                   0
##       counts_N2_25C_1_S16 counts_N2_25C_2_S17 counts_N2_25C_3_S18
## aap-1                 158                 173                 168
## aat-1                   6                   0                   2
## aat-2                   4                   6                   6
## aat-3                  16                  12                   2
## aat-4                   0                   0                   0
## aat-5                   0                   0                   0
##       counts_WH171_16C_1_S19 counts_WH171_16C_2_S20 counts_WH171_16C_3_S21
## aap-1                    259                    203                    200
## aat-1                      0                      4                      1
## aat-2                      0                      0                      2
## aat-3                      6                      4                      5
## aat-4                      0                      1                      2
## aat-5                      0                      0                      0
##       counts_WH171_22C_1_S22 counts_WH171_22C_2_S23 counts_WH171_22C_3_S24
## aap-1                    180                    190                    209
## aat-1                      0                      0                      0
## aat-2                     16                     14                     13
## aat-3                      4                      0                      2
## aat-4                      0                      0                      0
## aat-5                      0                      0                      0
##       counts_WH171_25C_1_S25 counts_WH171_25C_2_S26 counts_WH171_25C_3_S27
## aap-1                    150                    149                    188
## aat-1                      0                      2                      2
## aat-2                      0                      6                      2
## aat-3                     37                     34                     19
## aat-4                      0                      0                      0
## aat-5                      0                      0                      0
# Filter for genotype and temperature

# Run DEseq2

# Plot a Volcano Plot of the results


Looks pretty good! However, there are a lot of different samples here. Let's ask ChatGPT for help filtering out only the samples with genotype N2 or BP175 and temperature 16C.

This time ChatGPT suggests using the grep() function. grep is originally a Unix command-line tool, but R has its own version that works on character vectors. Let's try grep() on our gene_counts object; if grep() is new to you, the quick toy example below shows what it returns.
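
Here's a minimal, made-up illustration of what grep() does: by default it returns the positions of the elements that match a pattern, and with value = TRUE it returns the matching strings themselves.

fruit <- c("apple", "banana", "grape", "pineapple")
grep("apple", fruit)                # positions of the matches: 1 4
grep("apple", fruit, value = TRUE)  # the matching strings: "apple" "pineapple"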

# Filter for genotype and temperature
gene_counts_cleaned <- gene_counts[, grep("N2_16C|BP175_16C", names(gene_counts))]
colnames(gene_counts_cleaned)
## [1] "counts_BP175_16C_1_S1" "counts_BP175_16C_2_S2" "counts_BP175_16C_3_S3"
## [4] "counts_N2_16C_1_S10"   "counts_N2_16C_2_S11"   "counts_N2_16C_3_S12"
# Run DESeq2


# Plot a Volcano Plot of the results

Great, it looks like we have just 6 samples now: 3 of N2 and 3 of BP175. Let's do this for our meta_data too. We could try to use grep() again (feel free to try to make it work for the rows in meta_data; a sketch appears below, after the filter() version), but we can also go back to the filter() function we learned before.

# Filter for genotype and temperature
meta_data_cleaned <- filter(meta_data, geno == 'N2' & temp == '16C' | geno =='BP175' & temp =='16C')
meta_data_cleaned
##                    file  geno temp sample
## 1 counts_BP175_16C_1_S1 BP175  16C      1
## 2 counts_BP175_16C_2_S2 BP175  16C      2
## 3 counts_BP175_16C_3_S3 BP175  16C      3
## 4   counts_N2_16C_1_S10    N2  16C      1
## 5   counts_N2_16C_2_S11    N2  16C      2
## 6   counts_N2_16C_3_S12    N2  16C      3
# Run DESeq2


# Plot a Volcano Plot of the results
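
As promised, here's a rough sketch of the grep() route for the metadata rows. It assumes, correctly, as the head(meta_data) output above shows, that the sample names live in the file column and follow the same naming pattern as the count matrix columns:

# Keep rows whose file name matches either genotype at 16C
meta_data_cleaned_alt <- meta_data[grep("N2_16C|BP175_16C", meta_data$file), ]

Either approach gets you the same six rows; filter() is usually easier to read when you're working with columns of a data frame.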


Now it's time to run our differential expression analysis. This is a BIG oversimplification of some complex biostatistics. If you were doing this on your own, you would want to read up on the DESeq2 package and understand the nuances of what you're applying to your data.
However, chatbots can give you a great start on getting there, so let's see if ChatGPT can help us with running DESeq2. Our prompt would probably be something like:
* In R, I have a meta data file called meta_data_cleaned and a counts matrix called gene_counts_cleaned. I want to run differential expression on this using DESeq2
Your results may vary, but I got the below back from ChatGPT:

# Install and load required packages
install.packages("DESeq2")
install.packages("tidyverse")
library(DESeq2)
library(tidyverse)

# Load metadata
meta_data <- read.csv("path/to/meta_data.csv")  # Replace "path/to/meta_data.csv" with your file path

# Load counts matrix
gene_counts_cleaned <- read.csv("path/to/gene_counts_cleaned.csv", row.names = 1)  # Replace "path/to/gene_counts_cleaned.csv" with your file path

# Check dimensions and structure of your data
dim(meta_data)
dim(gene_counts_cleaned)
head(meta_data)
head(gene_counts_cleaned)

# Create DESeq2 object
dds <- DESeqDataSetFromMatrix(countData = gene_counts_cleaned,
                              colData = meta_data_cleaned,
                              design = ~ your_formula_here)

# Replace "your_formula_here" with the formula representing your experimental design.
# For example, if your design has a single factor "condition", the formula would be:
# design = ~ condition

# Run DESeq2 analysis
dds <- DESeq(dds)

# Get differential expression results
res <- results(dds)

# Extract significant results
sig_results <- res[which(res$padj < 0.05), ]

# Check top DE genes
topGenes <- head(sig_results[order(sig_results$padj), ], n = 10)

# Explore or visualize the results further as needed


Otherwise, a pretty good start! Clearly we don't need to reload the packages with library(), and we already have meta_data and gene_counts_cleaned as variables we've created, but you can see how ChatGPT is trying to give us its best guess at what we're doing, and it even uses head() and dim() to QC our data, much like we planned before. If I strip out the unnecessary pieces, it should look something like:

library(DESeq2)
# Create DESeq2 object
dds <- DESeqDataSetFromMatrix(countData = gene_counts_cleaned,
                              colData = meta_data_cleaned,
                              design = ~ your_formula_here)

# Replace "your_formula_here" with the formula representing your experimental design.
# For example, if your design has a single factor "condition", the formula would be:
# design = ~ condition

# Run DESeq2 analysis
dds <- DESeq(dds)

# Get differential expression results
res <- results(dds)

# Extract significant results
sig_results <- res[which(res$padj < 0.05), ]

# Check top DE genes
topGenes <- head(sig_results[order(sig_results$padj), ], n = 10)


If you read some of the commented code, you can see that it has a placeholder, your_formula_here, that we'll need to replace. Because we didn't specify which condition we were comparing, we can remind ourselves what the meta_data table looks like with View(meta_data) and choose the appropriate column, in this case geno, and substitute it in for your_formula_here.

suppressMessages(library(DESeq2))

# Create DESeq2 object
dds <- DESeqDataSetFromMatrix(countData = gene_counts_cleaned,
                              colData = meta_data_cleaned,
                              design = ~ geno)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
# Run DESeq2 analysis
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
# Get differential expression results
res <- results(dds)

# Extract significant results
sig_results <- res[which(res$padj < 0.05), ]

# Check top DE genes
topGenes <- head(sig_results[order(sig_results$padj), ], n = 10)
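
That "converting to factors" warning is harmless; DESeq2 is just coercing the character columns in the metadata for us. If you ever want to control the comparison yourself, the reference level of the factor determines the direction of the reported fold changes. A hedged sketch, to be run before building the DESeq2 object:

# Make geno an explicit factor; the first level (BP175) becomes the reference,
# so results are reported as N2 vs BP175, matching the topGenes output below
meta_data_cleaned$geno <- factor(meta_data_cleaned$geno, levels = c("BP175", "N2"))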


Now that DESeq2 is finished running, let's take a look at the topGenes object we created:

topGenes
## log2 fold change (MLE): geno N2 vs BP175 
## Wald test p-value: geno N2 vs BP175 
## DataFrame with 10 rows and 6 columns
##           baseMean log2FoldChange     lfcSE      stat       pvalue         padj
##          <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
## T28F4.4   76388.16        1.69646 0.0472487   35.9048 2.56953e-282 1.83156e-278
## stn-1      6426.76        1.70544 0.0506905   33.6442 3.78939e-248 1.35054e-244
## arx-1      9389.14        1.95859 0.0587145   33.3579 5.60367e-244 1.33143e-240
## twk-49    48952.82        1.32816 0.0418461   31.7393 4.46729e-221 7.96071e-218
## prp-40     3237.18        1.83227 0.0590478   31.0304 2.09880e-211 2.99204e-208
## trt-1      3598.79        2.03207 0.0671503   30.2616 3.67666e-201 4.36787e-198
## kqt-1      6545.16        1.93614 0.0649398   29.8143 2.54585e-195 2.59240e-192
## pigo-1     6608.04        1.46538 0.0504554   29.0431 1.87914e-185 1.67432e-182
## C03D6.1   29648.59        1.26099 0.0438481   28.7582 7.15561e-182 5.66724e-179
## F25H2.16  18758.88        1.45743 0.0518534   28.1067 8.12292e-174 5.79002e-171
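
Before plotting, it can also be handy to get a quick overview of the full results table; DESeq2 provides a summary() method for results objects that reports how many genes are up- or down-regulated at the default adjusted p-value cutoff:

summary(res)  # overview: counts of up/down genes, outliers, and low-count genes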


The next thing we set out to do was make a volcano plot with this data. This is a very common plot for differential transcriptomics, and you could probably figure it out on your own with ggplot, but let's see what ChatGPT can do. The prompt might be:
* Make a volcano plot from the output of DESeq2 in R with ggplot
Which might give you an answer similar to:

# Create a dataframe with log2 fold changes and adjusted p-values
volcano_data <- as.data.frame(results(dds)[, c("log2FoldChange", "padj")])

# Plot volcano plot
ggplot(volcano_data, aes(x = log2FoldChange, y = -log10(padj))) +
  geom_point(size = 1) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "red") +  # Add significance threshold
  labs(x = "log2 Fold Change", y = "-log10(Adjusted p-value)", title = "Volcano Plot")
## Warning: Removed 39806 rows containing missing values or values outside the scale range
## (`geom_point()`).
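
That warning about removed rows isn't a problem: it's mostly genes where DESeq2 left padj (and sometimes log2FoldChange) as NA, for example genes with zero counts or genes removed by independent filtering, and ggplot can't place NAs on the axes. If the warning bothers you, one option (a small sketch) is to drop those rows before plotting:

# Keep only genes with usable fold changes and adjusted p-values
volcano_data <- subset(volcano_data, !is.na(padj) & !is.na(log2FoldChange))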


This looks pretty good, but we can do better! Let's ask ChatGPT to add text labels for our top hits and change some of the point colors.

library(ggrepel)
volcano_data <- as.data.frame(results(dds)[, c("log2FoldChange", "padj")])
volcano_data$gene <- rownames(volcano_data)  # Add gene names as a new column



# Plot volcano plot with customized axes and colors
ggplot(volcano_data, aes(x = log2FoldChange, y = -log10(padj), color = ifelse(log2FoldChange > 0 & padj < 0.05, "red", ifelse(padj < 0.05, "blue", "gray")))) +
  geom_point(size = 1) +  # Plot points
  geom_hline(yintercept = -log10(0.005), linetype = "dashed", color = "black") +  # Add significance threshold
  geom_text_repel(data = head(volcano_data[order((volcano_data$padj)),], 5),  # Top 5 differentially expressed genes
                  aes(label = gene), 
                  size = 3, 
                  vjust = 0.1,
                  hjust = 1) +  # Add gene labels
  labs(x = "log2 Fold Change", y = "-log10(Adjusted p-value)", title = "Volcano Plot") +
  theme_classic() +
  scale_color_manual(values = c("red", "darkgray", "blue")) +  # Set color palette
  guides(color = FALSE)  # Remove legend
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 39806 rows containing missing values or values outside the scale range
## (`geom_point()`).
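
The warnings are worth a quick read. The first one just means guides(color = FALSE) is deprecated; newer ggplot2 versions want "none" instead. It's also worth double-checking the colors themselves: the strings produced by ifelse() ("red", "blue", "gray") are treated as category labels and matched to scale_color_manual() alphabetically, so the colors actually drawn may not be the ones those labels suggest, another reminder that chatbot code needs a careful read. A minimal, self-contained demonstration of the newer guides() syntax (using the built-in mtcars data, just for illustration):

ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  guides(color = "none")  # "none" replaces the deprecated FALSE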




Data exploration on your own

To finish, I want you to try to design and write some very simple projects yourself. You can use the fish data from before, you can use my RNAseq data, or you can use the example datasets provided by dplyr. dplyr is loaded when we load the tidyverse and comes with a few small data frames. One of these is called starwars; let's take a look at it:

starwars
## # A tibble: 87 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
##  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
##  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
##  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
##  9 Biggs D…    183    84 black      light      brown           24   male  mascu…
## 10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>


Another data set is called storms:

storms
## # A tibble: 19,537 × 13
##    name   year month   day  hour   lat  long status      category  wind pressure
##    <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>          <dbl> <int>    <int>
##  1 Amy    1975     6    27     0  27.5 -79   tropical d…       NA    25     1013
##  2 Amy    1975     6    27     6  28.5 -79   tropical d…       NA    25     1013
##  3 Amy    1975     6    27    12  29.5 -79   tropical d…       NA    25     1013
##  4 Amy    1975     6    27    18  30.5 -79   tropical d…       NA    25     1013
##  5 Amy    1975     6    28     0  31.5 -78.8 tropical d…       NA    25     1012
##  6 Amy    1975     6    28     6  32.4 -78.7 tropical d…       NA    25     1012
##  7 Amy    1975     6    28    12  33.3 -78   tropical d…       NA    25     1011
##  8 Amy    1975     6    28    18  34   -77   tropical d…       NA    30     1006
##  9 Amy    1975     6    29     0  34.4 -75.8 tropical s…       NA    35     1004
## 10 Amy    1975     6    29     6  34   -74.8 tropical s…       NA    40     1002
## # ℹ 19,527 more rows
## # ℹ 2 more variables: tropicalstorm_force_diameter <int>,
## #   hurricane_force_diameter <int>


What kinds of questions can we ask with these datasets? The possibilities are endless, but try to think of some easy questions on your own, plan out how you might graph them, and then use the skills you've learned, with some help from ChatGPT, to fill in the blanks and make a quick graph.































I've chosen to look at the storms dataset and ask: are severe storms getting more frequent? To do this, I'll first map out what I want to do in comment blocks:

# Look at my data set
head(storms)
## # A tibble: 6 × 13
##   name   year month   day  hour   lat  long status       category  wind pressure
##   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>           <dbl> <int>    <int>
## 1 Amy    1975     6    27     0  27.5 -79   tropical de…       NA    25     1013
## 2 Amy    1975     6    27     6  28.5 -79   tropical de…       NA    25     1013
## 3 Amy    1975     6    27    12  29.5 -79   tropical de…       NA    25     1013
## 4 Amy    1975     6    27    18  30.5 -79   tropical de…       NA    25     1013
## 5 Amy    1975     6    28     0  31.5 -78.8 tropical de…       NA    25     1012
## 6 Amy    1975     6    28     6  32.4 -78.7 tropical de…       NA    25     1012
## # ℹ 2 more variables: tropicalstorm_force_diameter <int>,
## #   hurricane_force_diameter <int>
# Subset this data for only storms of Category 4 or greater
big_storms <- filter(storms, category >= 4)

# Count how many big storms there were per year
storms_per_year <- dplyr::count(big_storms, year)

# Plot this data with year as the x axis and number of storms on the y axis
ggplot(storms_per_year, aes(x = year, y = n)) +
  geom_line(color = "blue") +  
  labs(title = "Number of Storms (Category 4+) by Year",
       x = "Year",
       y = "Number of Storms") +
       theme_classic()

# Add a trend line
ggplot(storms_per_year, aes(x = year, y = n)) +
  geom_line(color = "blue") +  
  labs(title = "Number of Storms (Category 4+) by Year",
       x = "Year",
       y = "Number of Storms") +
       theme_classic() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
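
If you want a number to go with that trend line, a simple linear model on the yearly counts is one quick way to check the slope. A sketch using the storms_per_year object we just made (this ignores plenty of caveats about count data and how storm observation methods have changed over time):

fit <- lm(n ~ year, data = storms_per_year)
summary(fit)  # the "year" coefficient estimates the change in Category 4+ storm counts per year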



