Large Language Model chatbots like chatGPT and Google Gemini are increasingly available and sophisticated. They can do any number of tasks, including help with coding projects and data analysis. Let’s explore a couple ways these tools can be helpful in data analysis, using the data from the last few weeks.
First let’s reload the lizard data. Make sure you set your working directory like we did before.
# Make sure this is set to the correct directory on your computer
setwd("~/r4grads/lizard_data/")
We’ll need our all_data
object again. If you have your
"lizard_data_merged.csv"
file that we wrote out last time,
you can read in:
all_data <- read.csv("lizard_data_merged.csv")
If you’ve lost track of this file, we can create it again:
# read in each of the data files
sample_data <- read.csv("lizard_data.csv")
physio_data <- read.csv("physio_data.csv")
# Make all of the substitutions that we made last time
sample_data$Species <- gsub("S. Occidentalis", "S. occidentalis", sample_data$Species)
sample_data$Species <- gsub("S. occidentlis", "S. occidentalis", sample_data$Species)
sample_data$Species <- gsub("^graciosus$", "S. graciosus", sample_data$Species)
# Merge the data
all_data <- merge(sample_data, physio_data)
We also need to load up the tidyverse set of packages:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Here’s our violin plots from last week
ggplot(data = all_data, mapping = aes(x = Species, y = Ctmax)) +
geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
geom_boxplot(width=0.1)+
labs(x = NULL, y = "CTmax") + # Get rid of the x label, sometimes the categories are self explanatory
theme_classic() +
theme(legend.position="none")
This shows us that we probably have some significantly different CTmax values among the different species, but to actually claim that in a paper, we need to run a statistical test like an ANOVA. I know that ANOVAs can be run in R, but I don’t remember how to run them. Instead of Googling that, let’s see if chatGPT can write some code for us.
Open up chatGPT and login. For most of these chatbots, you’ll need a free account to access them. You can make a new account, or link your Google, Microsoft or Apple account to the service and log in that way. If you already have a Gmail and google account you can use Google Gemini which should give you the same results. The process is fairly easy, but if you don’t want to deal with making an account feel free to just follow along for this part.
Once we’re into the chatbot, lets ask it our question: “How do I run
an ANOVA in R?”
You should get something like this, with additional explanation that went farther down the page:
Cool! This gives us some dummy code with some groups made of randomly sampled numbers. We could go through and figure out how to apply this to our data, but I’m lazy. Instead, let’s give chatGPT more detail about our data and what we want. How would you structure a query to tell chatGPT about the data and that we want to run an ANOVA to determine if CTmax varies by species?
I would ask that question like this:
I have an R object called "all_data" that looks like this:
Lizard.ID. Species Elevation..m. svl Ctmax
1 10_o S. occidentalis 904 75 40.50
2 11_o S. occidentalis 904 66 43.72
3 12_j S. jarrovii 1646 53 42.08
I want to run an ANOVA to determine if "Ctmax" is statistically significantly different among "Species". There are 4 different species: "S. occidentalis" "S. jarrovii" "S. graciosus" "S. virgatus"
How do I do this?
I ran head(all_data)
to see what the top of the
object looks like
Note that I use the names of objects and column names explicitly
The information about the species is probably unnecessary, but it
doesn’t hurt to include. I got that by running
unique(all_data$Species)
When entering many lines into chatGPT, if can help to hit
shift + enter
to add in blank lines. If you hit just
enter
, you will ask your query.
I get this:
Let’s try that out:
anova_result <- aov(Ctmax ~ Species, data = all_data)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 3 1029.2 343.1 69.14 <2e-16 ***
## Residuals 169 838.6 5.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I’m going to skip the homogeneity of variances tests now, but you should not with your own data. Instead, I’m going to skip straight to the post-hoc test to see which groups are different:
TukeyHSD(anova_result)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Ctmax ~ Species, data = all_data)
##
## $Species
## diff lwr upr p adj
## S. jarrovii-S. graciosus -7.4417672 -8.8446220 -6.03891246 0.0000000
## S. occidentalis-S. graciosus -0.8296082 -1.8071751 0.14795878 0.1268432
## S. virgatus-S. graciosus -3.4921839 -6.9144652 -0.06990263 0.0435667
## S. occidentalis-S. jarrovii 6.6121591 5.2811303 7.94318792 0.0000000
## S. virgatus-S. jarrovii 3.9495833 0.4100905 7.48907620 0.0220714
## S. virgatus-S. occidentalis -2.6625758 -6.0560467 0.73089523 0.1790067
Pretty cool!
What if we want to add in some annotations on our violin plots to show the significance? How would you ask chatGPT for this?
This is my query:
I have this violin plot code:
ggplot(data = all_data, mapping = aes(x = Species, y = Ctmax)) +
geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
geom_boxplot(width=0.1)+
labs(x = NULL, y = "CTmax") + # Get rid of the x label, sometimes the categories are self explanatory
theme_classic() +
theme(legend.position="none")
and ran ANOVA & Tukey tests:
anova_result <- aov(Ctmax ~ Species, data = all_data)
summary(anova_result)
TukeyHSD(anova_result)
In the violin plot, how can I put askerisks or letters above each of the 4 species to indicate which are significantly from each other?
The output is long, but the relevant code I get out is this:
install.packages("ggsignif")
# Tukey HSD test results
tukey_result <- TukeyHSD(anova_result)
# Extracting the significant comparisons
tukey_df <- as.data.frame(tukey_result$Species)
colnames(tukey_df)[colnames(tukey_df) == "p adj"] <- "p.adj" # I added this manually
significant_pairs <- tukey_df[tukey_df$p.adj < 0.05, ]
# Plot
library(ggplot2)
library(ggsignif)
ggplot(data = all_data, mapping = aes(x = Species, y = Ctmax)) +
geom_violin(trim = FALSE, mapping = aes(fill = Species)) +
geom_boxplot(width = 0.1) +
labs(x = NULL, y = "CTmax") +
theme_classic() +
theme(legend.position = "none") +
# Adding significance annotations
geom_signif(comparisons = list(c("S. occidentalis", "S. jarrovii"),
c("S. occidentalis", "S. graciosus"),
c("S. jarrovii", "S. graciosus"),
c("S. virgatus", "S. graciosus")),
map_signif_level = TRUE)
There’s an obvious problem here, the lines obove the violin plots are completely overlapping. I’ll tell chatgpt:
with ggsignif, I get completely overlapping bars above my violins
and I get out this modified code:
# Tukey HSD test results
tukey_result <- TukeyHSD(anova_result)
# Extracting the significant comparisons
tukey_df <- as.data.frame(tukey_result$Species)
colnames(tukey_df)[colnames(tukey_df) == "p adj"] <- "p.adj" # I added this manually
significant_pairs <- tukey_df[tukey_df$p.adj < 0.05, ]
# Plot
library(ggplot2)
library(ggsignif)
ggplot(data = all_data, mapping = aes(x = Species, y = Ctmax)) +
geom_violin(trim = FALSE, mapping = aes(fill = Species)) +
geom_boxplot(width = 0.1) +
labs(x = NULL, y = "CTmax") +
theme_classic() +
theme(legend.position = "none") +
# Adding significance annotations
geom_signif(comparisons = list(c("S. occidentalis", "S. jarrovii"),
c("S. occidentalis", "S. graciosus"),
c("S. jarrovii", "S. graciosus"),
c("S. virgatus", "S. graciosus")),
map_signif_level = TRUE,
y_position = c(45, 46, 47, 48)) # Manually adjust y_position
We can fix those lines, and we also need to fix the ordering of the lines. If you look back at your Tukey test summary, you’ll see that we’re not showing the correct significance values for the comparisons - S. virgatus-S. jarrovii should be significant but doesn’t show on the plot and S. occidentalis-S. graciosus should not be significant. When coding with chatGPT you always need to be careful that you’re getting out what you expect. It is easy to get things that look right at first glance but that misrepresent your data.
I asked a few more queries, and the closest I was able to get was this:
# Tukey HSD test results
tukey_result <- TukeyHSD(anova_result)
# Extracting the significant comparisons
tukey_df <- as.data.frame(tukey_result$Species)
colnames(tukey_df)[colnames(tukey_df) == "p adj"] <- "p.adj" # I added this manually
significant_pairs <- tukey_df[tukey_df$p.adj < 0.05, ]
# Plot
library(ggplot2)
library(ggsignif)
ggplot(data = all_data, mapping = aes(x = Species, y = Ctmax)) +
geom_violin(trim = FALSE, mapping = aes(fill = Species)) +
geom_boxplot(width = 0.1) +
labs(x = NULL, y = "CTmax") +
theme_classic() +
theme(legend.position = "none") +
# Adding significance annotations
geom_signif(comparisons = list(c("S. jarrovii", "S. graciosus"),
c("S. virgatus", "S. graciosus"),
c("S. occidentalis", "S. jarrovii"),
c("S. virgatus", "S. jarrovii")),
map_signif_level = TRUE,
y_position = c(45 + 2, 46 + 3, 47 + 4.5, 48 + 5.5)) # Raised by 2 units
Superficially, this looks good, but it has the wrong significance symbols above the lines. This was also the case before, but we couldn’t tell that. After an hour or so of chatGPT giving me increasingly wrong answers, I looked at the code more carefully, realized that ggsignif was calculating incorrect (or at least different) significance values internally, and it was not using the values we calculated from the Tukey test at all. I told that to chatGPT and then finally got this, which manually supplies the annotations to go over the lines:
# Tukey HSD test results
tukey_result <- TukeyHSD(anova_result)
tukey_df <- as.data.frame(tukey_result$Species)
colnames(tukey_df)[colnames(tukey_df) == "p adj"] <- "p.adj" # Rename for consistency
# Extracting significant comparisons
significant_pairs <- tukey_df[tukey_df$p.adj < 0.05, ]
# Calculate y_positions for the significance bars dynamically
y_positions <- seq(from = max(all_data$Ctmax) + 2,
length.out = nrow(significant_pairs),
by = 2) # Adjust 'by' as needed to create space between bars
# Create a custom data frame for manual p-values
comparisons <- rownames(significant_pairs)
manual_pvalues <- significant_pairs$p.adj
significance_levels <- ifelse(manual_pvalues < 0.001, "***",
ifelse(manual_pvalues < 0.01, "**",
ifelse(manual_pvalues < 0.05, "*", "NS")))
# Create a data frame to pass to geom_signif
signif_data <- data.frame(comparison = comparisons,
p_value = manual_pvalues,
y = y_positions,
significance = significance_levels)
# Plot
library(ggplot2)
library(ggsignif)
ggplot(data = all_data, mapping = aes(x = Species, y = Ctmax)) +
geom_violin(trim = FALSE, mapping = aes(fill = Species)) +
geom_boxplot(width = 0.1) +
labs(x = NULL, y = "CTmax") +
theme_classic() +
theme(legend.position = "none") +
# Adding the significance lines with the manually supplied p-values
geom_signif(comparisons = list(c("S. jarrovii", "S. graciosus"),
c("S. virgatus", "S. graciosus"),
c("S. occidentalis", "S. jarrovii"),
c("S. virgatus", "S. jarrovii")),
annotations = signif_data$significance, # Manually supply the significance labels
y_position = signif_data$y, # Manually supply y_positions
map_signif_level = FALSE) # Don't automatically calculate significance levels
This looks really good and is finally correct, but it took a good bit of me looking at the code, testing various failed suggestions from chatGPT, and ultimately identifying the problem myself and telling chatGPT what the issue was, which finally got the right result. chatGPT is a great resource, but even moderately complex coding can easily be too much for it, and being able to at least identify what is wrong will become essential.
You now know some of the very basics of R and all the many wonderful things you can do with it. When starting a project from scratch however, it can be a bit daunting on where to begin, so lets think about some solid practices for planning out your very own data analysis.
When starting a new project I like to say out loud to myself, or just write down in simple terms what exactly it is I’m trying to accomplish. This may sound a little silly but can save you lots of time in the end. Let’s look at an example of manipulating some RNAseq data.
# Load my data and metadata into R
# Inspect my data to make sure it looks OK
# Filter for genotype and temperature
# Run DEseq2
# Plot a Volcano Plot of the results
Let’s start with the first step, loading my data into R. I already put this data on my GitHub, so I’ll download it from there. Feel free to grab my data as well and follow along if you wish.
# Load my data into R
meta_data <- read.csv("https://raw.githubusercontent.com/ohfunk/L1_bulk_22/main/L1_bulk_22_metadata.csv")
gene_counts <- read.csv("https://raw.githubusercontent.com/ohfunk/L1_bulk_22/main/L1_bulk_22_gene_counts_matrix.tsv", sep = "\t", row.names = "Gene")
# Inspect my data to make sure it looks OK
head(meta_data)
## file geno temp sample
## 1 counts_BP175_16C_1_S1 BP175 16C 1
## 2 counts_BP175_16C_2_S2 BP175 16C 2
## 3 counts_BP175_16C_3_S3 BP175 16C 3
## 4 counts_BP175_22C_1_S4 BP175 22C 1
## 5 counts_BP175_22C_2_S5 BP175 22C 2
## 6 counts_BP175_22C_3_S6 BP175 22C 3
# Filter for genotype and temperature
# Run DEseq2
# Plot a Volcano Plot of the results
This looks pretty good, now lets look at the gene_counts:
# Inspect my data to make sure it looks OK
head(gene_counts)
## counts_BP175_16C_1_S1 counts_BP175_16C_2_S2 counts_BP175_16C_3_S3
## aap-1 160 146 112
## aat-1 4 4 0
## aat-2 4 0 2
## aat-3 3 2 1
## aat-4 3 2 0
## aat-5 0 0 0
## counts_BP175_22C_1_S4 counts_BP175_22C_2_S5 counts_BP175_22C_3_S6
## aap-1 63 60 55
## aat-1 4 0 6
## aat-2 41 10 3
## aat-3 1 0 9
## aat-4 0 0 0
## aat-5 0 1 0
## counts_BP175_25C_1_S7 counts_BP175_25C_2_S8 counts_BP175_25C_3_S9
## aap-1 114 84 91
## aat-1 4 2 2
## aat-2 4 8 0
## aat-3 14 7 13
## aat-4 0 0 0
## aat-5 0 0 0
## counts_N2_16C_1_S10 counts_N2_16C_2_S11 counts_N2_16C_3_S12
## aap-1 320 351 357
## aat-1 8 5 0
## aat-2 2 0 0
## aat-3 15 11 4
## aat-4 2 0 0
## aat-5 0 0 0
## counts_N2_22C_1_S13 counts_N2_22C_2_S14 counts_N2_22C_3_S15
## aap-1 208 150 282
## aat-1 2 0 0
## aat-2 2 0 1
## aat-3 1 0 2
## aat-4 0 0 0
## aat-5 0 0 0
## counts_N2_25C_1_S16 counts_N2_25C_2_S17 counts_N2_25C_3_S18
## aap-1 158 173 168
## aat-1 6 0 2
## aat-2 4 6 6
## aat-3 16 12 2
## aat-4 0 0 0
## aat-5 0 0 0
## counts_WH171_16C_1_S19 counts_WH171_16C_2_S20 counts_WH171_16C_3_S21
## aap-1 259 203 200
## aat-1 0 4 1
## aat-2 0 0 2
## aat-3 6 4 5
## aat-4 0 1 2
## aat-5 0 0 0
## counts_WH171_22C_1_S22 counts_WH171_22C_2_S23 counts_WH171_22C_3_S24
## aap-1 180 190 209
## aat-1 0 0 0
## aat-2 16 14 13
## aat-3 4 0 2
## aat-4 0 0 0
## aat-5 0 0 0
## counts_WH171_25C_1_S25 counts_WH171_25C_2_S26 counts_WH171_25C_3_S27
## aap-1 150 149 188
## aat-1 0 2 2
## aat-2 0 6 2
## aat-3 37 34 19
## aat-4 0 0 0
## aat-5 0 0 0
# Filter for genotype and temperature
# Run DEseq2
# Plot a Volcano Plot of the results
Looks pretty good! However there’s a lot of different samples here. Lets ask chatGPT for help filtering out only the samples with genotype: N2 or BP175 and temperature: 16C
This time chatGPT suggests using the grep function. This is a command line tool from Unix but we can use it in R as well. Lets try grep on our gene_counts object.
# Filter for genotype and temperature
gene_counts_cleaned <- gene_counts[, grep("N2_16C|BP175_16C", names(gene_counts))]
colnames(gene_counts_cleaned)
## [1] "counts_BP175_16C_1_S1" "counts_BP175_16C_2_S2" "counts_BP175_16C_3_S3"
## [4] "counts_N2_16C_1_S10" "counts_N2_16C_2_S11" "counts_N2_16C_3_S12"
# Run DESeq2
# Plot a Volcano Plot of the results
Great, looks like we have just 6 samples now, 3 of N2 and 3 of WH171. Lets do this for out meta_data too. We could try and use this grep function again (feel free to try and make it work for rows in meta_data) but we can also go back to the filter() function we learned before.
# Filter for genotype and temperature
meta_data_cleaned <- filter(meta_data, geno == 'N2' & temp == '16C' | geno =='BP175' & temp =='16C')
meta_data_cleaned
## file geno temp sample
## 1 counts_BP175_16C_1_S1 BP175 16C 1
## 2 counts_BP175_16C_2_S2 BP175 16C 2
## 3 counts_BP175_16C_3_S3 BP175 16C 3
## 4 counts_N2_16C_1_S10 N2 16C 1
## 5 counts_N2_16C_2_S11 N2 16C 2
## 6 counts_N2_16C_3_S12 N2 16C 3
# Run DESeq2
# Plot a Volcano Plot of the results
Now it’s time to run our differential expression. This is a BIG oversimplification of some complex biostatistics. If you were doing this on your own you would want to read up on the DESeq2 package and understand the nuances of what you were applying to the data.
However, using chatbots can give you a great start on getting there, so lets see if chatGPT can help us with running DESeq2. Our prompt would probably be something like:
Your results may vary, but I got the below back from chatGPT:
# Install and load required packages
install.packages("DESeq2")
install.packages("tidyverse")
library(DESeq2)
library(tidyverse)
# Load metadata
meta_data <- read.csv("path/to/meta_data.csv") # Replace "path/to/meta_data.csv" with your file path
# Load counts matrix
gene_counts_cleaned <- read.csv("path/to/gene_counts_cleaned.csv", row.names = 1) # Replace "path/to/gene_counts_cleaned.csv" with your file path
# Check dimensions and structure of your data
dim(meta_data)
dim(gene_counts_cleaned)
head(meta_data)
head(gene_counts_cleaned)
# Create DESeq2 object
dds <- DESeqDataSetFromMatrix(countData = gene_counts_cleaned,
colData = meta_data_cleaned,
design = ~ your_formula_here)
# Replace "your_formula_here" with the formula representing your experimental design.
# For example, if your design has a single factor "condition", the formula would be:
# design = ~ condition
# Run DESeq2 analysis
dds <- DESeq(dds)
# Get differential expression results
res <- results(dds)
# Extract significant results
sig_results <- res[which(res$padj < 0.05), ]
# Check top DE genes
topGenes <- head(sig_results[order(sig_results$padj), ], n = 10)
# Explore or visualize the results further as needed
A pretty good start! Clearly we don’t need to reload the packages with library, and we already have meta_data and gene_counts_cleaned as variables we’ve created, but you can see how chatGPT is trying to give us it’s best guess as to what we’re doing, and even uses head() and dim() to QC our data much like we planned before. If I strip out the unnecessary pieces it should looks something like:
library(DESeq2)
# Create DESeq2 object
dds <- DESeqDataSetFromMatrix(countData = gene_counts_cleaned,
colData = meta_data_cleaned,
design = ~ your_formula_here)
# Replace "your_formula_here" with the formula representing your experimental design.
# For example, if your design has a single factor "condition", the formula would be:
# design = ~ condition
# Run DESeq2 analysis
dds <- DESeq(dds)
# Get differential expression results
res <- results(dds)
# Extract significant results
sig_results <- res[which(res$padj < 0.05), ]
# Check top DE genes
topGenes <- head(sig_results[order(sig_results$padj), ], n = 10)
If you read some of the commented code, you can see that it has a
dummy variable “your_formula_here” that we’ll need to replace. Because
we didn’t specify what the condition was we were looking at, we can
remind our selves of what the meta_data table looks like with
View(meta_data) and choose the appropriate column, in this case “geno”
and substitute it in for your_formula_here
suppressMessages(library(DESeq2))
## Warning: package 'DESeq2' was built under R version 4.3.3
## Warning: package 'GenomeInfoDb' was built under R version 4.3.3
# Create DESeq2 object
dds <- DESeqDataSetFromMatrix(countData = gene_counts_cleaned,
colData = meta_data_cleaned,
design = ~ geno)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
# Run DESeq2 analysis
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
# Get differential expression results
res <- results(dds)
# Extract significant results
sig_results <- res[which(res$padj < 0.05), ]
# Check top DE genes
topGenes <- head(sig_results[order(sig_results$padj), ], n = 10)
Now that DESeq is finished running, lets take a look at our topGenes object that it created:
topGenes
## log2 fold change (MLE): geno N2 vs BP175
## Wald test p-value: geno N2 vs BP175
## DataFrame with 10 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## T28F4.4 76388.16 1.69646 0.0472487 35.9048 2.56953e-282 1.83156e-278
## stn-1 6426.76 1.70544 0.0506905 33.6442 3.78939e-248 1.35054e-244
## arx-1 9389.14 1.95859 0.0587145 33.3579 5.60367e-244 1.33143e-240
## twk-49 48952.82 1.32816 0.0418461 31.7393 4.46729e-221 7.96071e-218
## prp-40 3237.18 1.83227 0.0590478 31.0304 2.09880e-211 2.99204e-208
## trt-1 3598.79 2.03207 0.0671503 30.2616 3.67666e-201 4.36787e-198
## kqt-1 6545.16 1.93614 0.0649398 29.8143 2.54585e-195 2.59240e-192
## pigo-1 6608.04 1.46538 0.0504554 29.0431 1.87914e-185 1.67432e-182
## C03D6.1 29648.59 1.26099 0.0438481 28.7582 7.15561e-182 5.66724e-179
## F25H2.16 18758.88 1.45743 0.0518534 28.1067 8.12292e-174 5.79001e-171
The next thing we set out to do was make a volcano plot with this data. This is a very common plot for differential transcriptomics, and you could probably figure it out on your own with ggplot, but lets see what chatGPT can do. The prompt might be:
Which might give you an answer similar to:
# Create a dataframe with log2 fold changes and adjusted p-values
volcano_data <- as.data.frame(results(dds)[, c("log2FoldChange", "padj")])
# Plot volcano plot
ggplot(volcano_data, aes(x = log2FoldChange, y = -log10(padj))) +
geom_point(size = 1) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "red") + # Add significance threshold
labs(x = "log2 Fold Change", y = "-log10(Adjusted p-value)", title = "Volcano Plot")
## Warning: Removed 39806 rows containing missing values or values outside the scale range
## (`geom_point()`).
This looks pretty good, but we can do better! Lets ask chatGPT to add some text labels for out top hits and change some point colors.
library(ggrepel)
volcano_data <- as.data.frame(results(dds)[, c("log2FoldChange", "padj")])
volcano_data$gene <- rownames(volcano_data) # Add gene names as a new column
# Plot volcano plot with customized axes and colors
ggplot(volcano_data, aes(x = log2FoldChange, y = -log10(padj), color = ifelse(log2FoldChange > 0 & padj < 0.05, "red", ifelse(padj < 0.05, "blue", "gray")))) +
geom_point(size = 1) + # Plot points
geom_hline(yintercept = -log10(0.005), linetype = "dashed", color = "black") + # Add significance threshold
geom_text_repel(data = head(volcano_data[order((volcano_data$padj)),], 5), # Top 5 differentially expressed genes
aes(label = gene),
size = 3,
vjust = 0.1,
hjust = 1) + # Add gene labels
labs(x = "log2 Fold Change", y = "-log10(Adjusted p-value)", title = "Volcano Plot") +
theme_classic() +
scale_color_manual(values = c("red", "darkgray", "blue")) + # Set color palette
guides(color = FALSE) # Remove legend
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 39806 rows containing missing values or values outside the scale range
## (`geom_point()`).
To finish, I want to to try and design and write some very simple projects yourself. You can use the lizard data from before, you can use my RNAseq data, or you can use the example datasets provided by dplyr. dplyr is loaded when we load tidyverse, and comes with a couple of small dataframes. One of these is called “starwars” lets take a look at this dataframe:
starwars
## # A tibble: 87 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs D… 183 84 black light brown 24 male mascu…
## 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Another data set is called storms:
storms
## # A tibble: 19,537 × 13
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropical d… NA 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropical d… NA 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropical d… NA 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropical d… NA 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropical d… NA 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropical d… NA 25 1012
## 7 Amy 1975 6 28 12 33.3 -78 tropical d… NA 25 1011
## 8 Amy 1975 6 28 18 34 -77 tropical d… NA 30 1006
## 9 Amy 1975 6 29 0 34.4 -75.8 tropical s… NA 35 1004
## 10 Amy 1975 6 29 6 34 -74.8 tropical s… NA 40 1002
## # ℹ 19,527 more rows
## # ℹ 2 more variables: tropicalstorm_force_diameter <int>,
## # hurricane_force_diameter <int>
What kinds of questions can we ask with these data sets? The possibilities are endless, but try to think of some easy questions on your own, plan out how you might graph them, and then try and use the skills you’ve learned with chatGPT to fill in the blanks to make a quick graph.
I’ve chosen to look at the storms dataset and ask “are severe storms getting more frequent?” To do this, I’ll first map out what I want to do in comment blocks:
# Look at my data set
head(storms)
## # A tibble: 6 × 13
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropical de… NA 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropical de… NA 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropical de… NA 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropical de… NA 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropical de… NA 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropical de… NA 25 1012
## # ℹ 2 more variables: tropicalstorm_force_diameter <int>,
## # hurricane_force_diameter <int>
# Subset this data for only storms of Category 4 or greater
big_storms <- filter(storms, category >= 4)
# Count how many big storms there were per year
storms_per_year <- dplyr::count(big_storms, year)
# Plot this data with year as the x axis and number of storms on the y axis
ggplot(storms_per_year, aes(x = year, y = n)) +
geom_line(color = "blue") +
labs(title = "Number of Storms (Category 4+) by Year",
x = "Year",
y = "Number of Storms") +
theme_classic()
# Add a trend line
ggplot(storms_per_year, aes(x = year, y = n)) +
geom_line(color = "blue") +
labs(title = "Number of Storms (Category 4+) by Year",
x = "Year",
y = "Number of Storms") +
theme_classic() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'