Selection Scan with BAYESCAN

The file conversion will be similar to how we did it for Structure using PGDSpider. We need the following files:

You can copy these files from: /project/inbre-train/2021_popgen_wkshp/data/Bayescan. Copy all files from this folder.


cd /path/to/your/gscratch/folder

cp /project/inbre-train/2021_popgen_wkshp/data/Bayescan/data/* .


ls -lh

-rw-rw-r-- 1 vchhatre  670 Dec  2 10:56 crovir.pop
-rw-r--r-- 1 vchhatre  336 Dec  2 10:57 vcf2bscan.sh
-rw-r--r-- 1 vchhatre 1.5K Dec  2 10:57 vcf2bscan.spid
-rw-rw-r-- 1 vchhatre 2.6M Dec  2 10:58 ruber_reduced_ref.vcf

1.1 Exclude Loci with More than 2 Alleles

You might recall that the VCF file contains some loci with more than 2 alleles. We need to exclude those loci from further analysis because they violate the assumptions of the method (Bayescan) we are about to use.

module load gcc swset perl vcftools

vcftools --vcf ruber_reduced_ref.vcf

VCFtools - 0.1.14
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
    --vcf ruber_reduced_ref.vcf

After filtering, kept 33 out of 33 Individuals
After filtering, kept 4562 out of a possible 4562 Sites
Run Time = 0.00 seconds


vcftools --vcf ruber_reduced_ref.vcf --min-alleles 2 --max-alleles 2

VCFtools - 0.1.14
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
    --vcf ruber_reduced_ref.vcf
    --max-alleles 2
    --min-alleles 2

After filtering, kept 33 out of 33 Individuals
After filtering, kept 4502 out of a possible 4562 Sites
Run Time = 0.00 seconds


vcftools --vcf ruber_reduced_ref.vcf --min-alleles 2 --max-alleles 2 --recode --out ruber_4502snp

VCFtools - 0.1.14
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
    --vcf ruber_reduced_ref.vcf
    --max-alleles 2
    --min-alleles 2
    --out ruber_4502snp
    --recode

After filtering, kept 33 out of 33 Individuals
Outputting VCF file...
After filtering, kept 4502 out of a possible 4562 Sites
Run Time = 0.00 seconds


vcftools --vcf ruber_4502snp.recode.vcf 

VCFtools - 0.1.14
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
    --vcf ruber_4502snp.recode.vcf

After filtering, kept 33 out of 33 Individuals
After filtering, kept 4502 out of a possible 4502 Sites
Run Time = 0.00 seconds

mv ruber_4502snp.recode.vcf ruber_4502snp.vcf

1.2 Check Popfile

vim crovir.pop

SD_Field_0201   north
SD_Field_0255   south
SD_Field_0386   south
SD_Field_0491   south
SD_Field_0492   admixed
SD_Field_0493   south
SD_Field_0557   south
SD_Field_0598   admixed
SD_Field_0599   south
SD_Field_0642   admixed
SD_Field_0666   admixed
SD_Field_0983   north
SD_Field_1079   south
SD_Field_1205   north
SD_Field_1220   south
SD_Field_1225   south
SD_Field_1226   south
SD_Field_1381   north
SD_Field_1878   north
SD_Field_1880   north
SD_Field_1899   north
SD_Field_1961   south
SD_Field_1988   south
SD_Field_1991   south
SD_Field_2127   north
SD_Field_2287   admixed
SD_Field_2383   south
SD_Field_2427   south
SD_Field_2789   north
SD_Field_2914   south
SD_Field_2968   north
SD_Field_3027   north
SD_Field_3124   north

1.3 Modify the SPID File

vim vcf2bscan.spid

# VCF Parser questions
PARSER_FORMAT=VCF

# Only output SNPs with a phred-scaled quality of at least:
VCF_PARSER_QUAL_QUESTION=
# Select population definition file:
VCF_PARSER_POP_FILE_QUESTION=
# What is the ploidy of the data?
VCF_PARSER_PLOIDY_QUESTION=DIPLOID
# Do you want to include a file with population definitions?
VCF_PARSER_POP_QUESTION=true
# Output genotypes as missing if the phred-scale genotype quality is below:
VCF_PARSER_GTQUAL_QUESTION=
# Do you want to include non-polymorphic SNPs?
VCF_PARSER_MONOMORPHIC_QUESTION=FALSE
# Only output following individuals (ind1, ind2, ind4, ...):
VCF_PARSER_IND_QUESTION=
# Only input following regions (refSeqName:start:end, multiple regions: whitespace separated):
VCF_PARSER_REGION_QUESTION=
# Output genotypes as missing if the read depth of a position for the sample is below:
VCF_PARSER_READ_QUESTION=
# Take most likely genotype if "PL" or "GL" is given in the genotype field?
VCF_PARSER_PL_QUESTION=false
# Do you want to exclude loci with only missing data?
VCF_PARSER_EXC_MISSING_LOCI_QUESTION=true

# GESTE / BayeScan Writer questions
WRITER_FORMAT=GESTE_BAYE_SCAN

# Specify which data type should be included in the GESTE / BayeScan file  (GESTE / BayeScan can only analyze one data type per file):
GESTE_BAYE_SCAN_WRITER_DATA_TYPE_QUESTION=SNP

1.4 Modify the vcf2bscan.sh Script

#!/bin/bash

pgdspider   -inputfile      \
        -inputformat VCF    \
        -outputfile         \
        -outputformat GESTE_BAYE_SCAN   \
        -spid vcf2bscan.spid

sbatch vcf2bscan.sh

1.5 Bayescan Input File

vim ruber.bayescan

[loci]=4502     
 
[populations]=3 
 
[pop]=1 
 1      24      2       16 8
 2      24      2       0 24
 3      24      2       23 1
 4      24      2       24 0
 5      18      2       0 18
 6      8       2       0 8 
 7      4       2       1 3 
 8      24      2       0 24
 9      24      2       24 0
 10     24      2       4 20
 11     24      2       13 11
 12     24      2       1 23
 13     2       2       0 2 
 14     22      2       0 22
 15     24      2       0 24
 16     10      2       10 0
 17     20      2       7 13
 18     22      2       19 3

sed -n 6,4507p ruber.bayescan | cut -f3 | sort | uniq -c

   4502  2

2. Run Bayescan

module load gcc swset bayescan

bayescan -h

 --------------------------- 
 | BayeScan 2.0 usage:     | 
 --------------------------- 
 -help        Prints this help 
 --------------------------- 
 | Input                   | 
 --------------------------- 
 alleles.txt  Name of the genotypes data input file 
 -d discarded Optional input file containing list of loci to discard
 -snp         Use SNP genotypes matrix
 --------------------------- 
 | Output                  | 
 --------------------------- 
 -od .        Output file directory, default is the same as program file
 -o alleles   Output file prefix, default is input file without the extension
 -fstat       Only estimate F-stats (no selection)
 -all_trace   Write out MCMC trace also for alpha paremeters (can be a very large file)
 --------------------------- 
 | Parameters of the chain | 
 --------------------------- 
 -threads n   Number of threads used, default is number of cpu available 
 -n 5000      Number of outputted iterations, default is 5000 
 -thin 10     Thinning interval size, default is 10 
 -nbp 20      Number of pilot runs, default is 20 
 -pilot 5000  Length of pilot runs, default is 5000 
 -burn 50000  Burn-in length, default is 50000 
 --------------------------- 
 | Parameters of the model | 
 --------------------------- 
 -pr_odds 10  Prior odds for the neutral model, default is 10 
 -lb_fis 0    Lower bound for uniform prior on Fis (dominant data), default is 0
 -hb_fis 1    Higher bound for uniform prior on Fis (dominant data), default is 1
 -beta_fis    Optional beta prior for Fis (dominant data, m_fis and sd_fis need to be set)
 -m_fis 0.05  Optional mean for beta prior on Fis (dominant data with -beta_fis)
 -sd_fis 0.01 Optional std. deviation for beta prior on Fis (dominant data with -beta_fis)
 -aflp_pc 0.1 Threshold for the recessive genotype as a fraction of maximum band intensity, default is 0.1
 --------------------------- 
 | Output files            | 
 --------------------------- 
 -out_pilot   Optional output file for pilot runs
 -out_freq    Optional output file for allele frequencies

vim run_bayescan.sh


module load gcc swset bayescan/2.1


bayescan ruber.bayescan -snp \
    -od ruber_output    \
    -threads 16 \
    -n 5000 \
    -thin 10    \
    -nbp 20 \
    -pilot 5000 \
    -burn 50000 \
    -pr_odds 1000   \
    -out_freq


mkdir ruber_output

sbatch run_bayescan.sh

3. Understanding Bayescan Output

Bayescan outputs several files depending upon parameters selected during the run. You will see the following files in the results folder:

cd ruber_output

ls -lh

-rw-rw-r-- 1 vchhatre 380K Dec  2 14:38 ruber.baye_Verif.txt
-rw-rw-r-- 1 vchhatre  601 Dec  2 15:17 ruber.baye_AccRte.txt
-rw-rw-r-- 1 vchhatre 252K Dec  2 15:17 ruber.baye.sel
-rw-rw-r-- 1 vchhatre 228K Dec  2 19:16 ruber.baye_fst.txt

vim ruber.baye_fst.txt

:%s/\s\+/ /g
:%s/ /\t/g
:wq

module load gcc swset r

R

bscan <- read.table("ruber.baye_fst.txt", header=T)

head(bscan)

  locus       prob log10.PO.     qval       alpha     fst
1     1 0.00080016  -3.09648 0.998716 -0.00014427 0.32671
2     2 0.00080016  -3.09650 0.998720 -0.00025595 0.32669
3     3 0.00100020  -2.99950 0.998560 -0.00021504 0.32671
4     4 0.00080016  -3.09650 0.998720 -0.00038205 0.32668
5     5 0.00060012  -3.22150 0.998840 -0.00012257 0.32671
6     6 0.00060012  -3.22150 0.998840  0.00085670 0.32693

Here: - prob is the posterior probability of selection acting on a locus - log10.PO. is the log10 of the posterior odds of the model testing for selection. The value in this column is set to 1000 when the posterior probability is 1. - alpha is a coefficient indicating the strength and direction of selection. Positive values indicate diversifying selection and negative values, purifying selection. - qval is the multiple testing p-value obtained after testing each locus for selection.

range(bscan$qval)

[1] 0.90958 0.99897

vim run_bayescan.sh

-pr_odds 100
-od ruber_output_prior100

mkdir ruber_output_prior100

sbatch run_bayescan.sh

4. Visualizing Outliers

Because our current data set does not seem to have any outliers showing extreme allele frequency differences, we will switch to a different data set for illustration purposes. You can find this data inside the data/ folder for this week on Github.

4.1 Download the Data

cd Bayescan

wget https://github.com/wyoibc/popgen_workshop/tree/master/week5/selection/data/test_baye_fst.txt

4.2 Check for Outliers


test <- read.table("test.baye_fst.txt", header=T)

head(test)

        prob  log10.PO.     qval       alpha      fst
1 0.00000000 -1000.0000 0.997644  0.0000e+00 0.022319
2 0.00000000 -1000.0000 0.997640  0.0000e+00 0.022319
3 0.00000000 -1000.0000 0.997640  0.0000e+00 0.022319
4 0.00020004    -3.6988 0.994060  1.5625e-04 0.022324
5 0.00020004    -3.6988 0.994060 -5.8927e-05 0.022318
6 0.00000000 -1000.0000 0.997640  0.0000e+00 0.022319

dim(test)

[1] 107309      5

range(test$qval)

[1] 0.000000 0.997644

library(ggplot2)

ggplot(data=test) + geom_histogram(aes(x=qval), binwidth=0.01)

sum(test$qval < 0.05)

[1] 222

sum(test$qval < 0.001)

145

subset(test, qval < 0.001)

         prob log10.PO.       qval  alpha      fst
1566   0.9918    2.0825 4.6531e-04 1.7248 0.114760
1568   0.9882    1.9229 9.2708e-04 1.7285 0.115480
1700   0.9892    1.9618 8.5156e-04 1.7330 0.115720
1702   0.9964    2.4420 2.5378e-04 1.7605 0.117830
3159   0.9912    2.0516 5.8297e-04 1.6601 0.108270
3192   1.0000 1000.0000 0.0000e+00 1.9705 0.138040
6013   1.0000 1000.0000 0.0000e+00 2.8917 0.278710
6650   1.0000 1000.0000 0.0000e+00 2.0830 0.153020
6931   0.9996    3.3977 2.2613e-05 2.2409 0.174450
12697  1.0000 1000.0000 0.0000e+00 1.8961 0.131350
14156  1.0000 1000.0000 0.0000e+00 2.1488 0.159040
14171  1.0000 1000.0000 0.0000e+00 2.2797 0.176420
14201  0.9998    3.6988 9.0108e-06 1.7465 0.116600
14954  1.0000 1000.0000 0.0000e+00 2.6822 0.239490
18125  1.0000 1000.0000 0.0000e+00 2.1727 0.164620
18300  1.0000 1000.0000 0.0000e+00 1.9129 0.131840
18322  1.0000 1000.0000 0.0000e+00 1.6018 0.101940
18394  1.0000 1000.0000 0.0000e+00 1.8083 0.121450
18396  1.0000 1000.0000 0.0000e+00 1.8114 0.121840
24167  1.0000 1000.0000 0.0000e+00 2.2193 0.167970
24395  1.0000 1000.0000 0.0000e+00 1.8848 0.128070
26773  1.0000 1000.0000 0.0000e+00 1.7408 0.114220
27418  1.0000 1000.0000 0.0000e+00 2.4339 0.196440
27423  1.0000 1000.0000 0.0000e+00 2.0684 0.148980
31272  1.0000 1000.0000 0.0000e+00 2.4578 0.203560
34814  1.0000 1000.0000 0.0000e+00 1.8200 0.122140
34830  1.0000 1000.0000 0.0000e+00 1.8360 0.124590
36086  1.0000 1000.0000 0.0000e+00 1.8240 0.121930
39272  1.0000 1000.0000 0.0000e+00 3.5160 0.396580
39876  1.0000 1000.0000 0.0000e+00 1.9194 0.133520
39889  1.0000 1000.0000 0.0000e+00 1.8886 0.130840
39890  0.9994    3.2215 3.7296e-05 1.8842 0.130450
39910  1.0000 1000.0000 0.0000e+00 1.8167 0.123000
42021  0.9974    2.5838 2.0762e-04 2.3243 0.188730
42351  1.0000 1000.0000 0.0000e+00 2.1141 0.154780
43349  0.9984    2.7951 1.0955e-04 3.3251 0.364780
43516  1.0000 1000.0000 0.0000e+00 1.7734 0.118360
43760  1.0000 1000.0000 0.0000e+00 1.8764 0.129290
43771  1.0000 1000.0000 0.0000e+00 2.1364 0.159980
43775  1.0000 1000.0000 0.0000e+00 2.1471 0.161570
43778  1.0000 1000.0000 0.0000e+00 2.1421 0.160850
45050  0.9998    3.6988 9.0108e-06 1.7325 0.114640
45053  0.9926    2.1275 4.0884e-04 1.6045 0.103530
46669  1.0000 1000.0000 0.0000e+00 1.8368 0.126170
46861  0.9986    2.8532 8.5501e-05 1.8843 0.130930
48480  1.0000 1000.0000 0.0000e+00 1.8787 0.129220
49162  1.0000 1000.0000 0.0000e+00 1.7046 0.111360
50013  1.0000 1000.0000 0.0000e+00 1.8834 0.128270
51912  1.0000 1000.0000 0.0000e+00 1.7774 0.118170
51917  1.0000 1000.0000 0.0000e+00 1.8019 0.120470
51920  1.0000 1000.0000 0.0000e+00 1.8123 0.121500
51936  0.9976    2.6187 1.8935e-04 1.5646 0.099173
53114  1.0000 1000.0000 0.0000e+00 1.7699 0.117520
54009  1.0000 1000.0000 0.0000e+00 2.6047 0.228290
54544  0.9990    2.9995 6.3947e-05 1.7931 0.120470
57035  1.0000 1000.0000 0.0000e+00 2.5100 0.212890
58556  1.0000 1000.0000 0.0000e+00 2.1872 0.168830
58559  0.9998    3.6988 9.0108e-06 1.9893 0.144050
59329  1.0000 1000.0000 0.0000e+00 1.6750 0.108510
59877  1.0000 1000.0000 0.0000e+00 1.8073 0.121550
59878  1.0000 1000.0000 0.0000e+00 1.8124 0.122040
61797  0.9998    3.6988 9.0108e-06 2.4507 0.205690
66090  0.9998    3.6988 9.0108e-06 1.8892 0.131120
66817  0.9992    3.0965 5.6210e-05 1.7872 0.120200
70550  1.0000 1000.0000 0.0000e+00 2.5538 0.215940
70551  1.0000 1000.0000 0.0000e+00 4.2592 0.559140
70553  1.0000 1000.0000 0.0000e+00 3.1107 0.311520
70554  1.0000 1000.0000 0.0000e+00 3.0005 0.290330
70556  1.0000 1000.0000 0.0000e+00 2.2939 0.177830
70557  1.0000 1000.0000 0.0000e+00 4.0050 0.502850
70561  1.0000 1000.0000 0.0000e+00 3.0837 0.309730
70564  1.0000 1000.0000 0.0000e+00 3.1913 0.330880
70578  1.0000 1000.0000 0.0000e+00 2.5996 0.220910
70606  1.0000 1000.0000 0.0000e+00 2.3432 0.184770
70607  1.0000 1000.0000 0.0000e+00 2.3465 0.185260
70616  1.0000 1000.0000 0.0000e+00 2.2087 0.170270
70617  1.0000 1000.0000 0.0000e+00 2.1985 0.167620
70656  1.0000 1000.0000 0.0000e+00 2.0993 0.153160
70684  1.0000 1000.0000 0.0000e+00 2.5772 0.218370
70738  1.0000 1000.0000 0.0000e+00 2.1389 0.159600
71234  1.0000 1000.0000 0.0000e+00 1.9073 0.131420
71265  0.9992    3.0965 5.6210e-05 1.7616 0.117380
71646  0.9996    3.3977 2.2613e-05 1.7175 0.113440
71734  0.9914    2.0617 5.2385e-04 1.5526 0.098594
71796  1.0000 1000.0000 0.0000e+00 1.9577 0.137990
72584  1.0000 1000.0000 0.0000e+00 2.0383 0.146730
73227  1.0000 1000.0000 0.0000e+00 1.6270 0.104190
74211  0.9996    3.3977 2.2613e-05 1.8806 0.129400
74218  1.0000 1000.0000 0.0000e+00 1.8432 0.125520
75864  1.0000 1000.0000 0.0000e+00 1.9789 0.140180
75866  1.0000 1000.0000 0.0000e+00 2.7676 0.249840
75878  1.0000 1000.0000 0.0000e+00 2.1729 0.161350
75883  1.0000 1000.0000 0.0000e+00 2.1389 0.157320
75885  0.9994    3.2215 3.7296e-05 1.8061 0.122380
75888  1.0000 1000.0000 0.0000e+00 2.4178 0.194010
75890  1.0000 1000.0000 0.0000e+00 2.6357 0.227440
75898  1.0000 1000.0000 0.0000e+00 2.3120 0.181250
75905  1.0000 1000.0000 0.0000e+00 1.9456 0.135850
75912  0.9926    2.1275 4.0884e-04 1.7759 0.121080
75926  1.0000 1000.0000 0.0000e+00 1.9559 0.135670
75927  1.0000 1000.0000 0.0000e+00 2.1450 0.157700
75928  1.0000 1000.0000 0.0000e+00 2.1193 0.154380
75930  1.0000 1000.0000 0.0000e+00 2.1279 0.155580
75937  1.0000 1000.0000 0.0000e+00 1.9697 0.137130
75938  1.0000 1000.0000 0.0000e+00 1.9587 0.135820
75944  1.0000 1000.0000 0.0000e+00 2.1021 0.152210
76005  1.0000 1000.0000 0.0000e+00 2.2737 0.173680
76042  1.0000 1000.0000 0.0000e+00 2.4141 0.193430
76057  1.0000 1000.0000 0.0000e+00 1.8734 0.126780
76066  1.0000 1000.0000 0.0000e+00 1.8605 0.125540
77901  1.0000 1000.0000 0.0000e+00 2.8892 0.270770
80748  1.0000 1000.0000 0.0000e+00 1.7523 0.116020
82861  1.0000 1000.0000 0.0000e+00 2.5742 0.218460
82866  1.0000 1000.0000 0.0000e+00 2.4787 0.203660
82964  0.9906    2.0227 6.4552e-04 1.6380 0.107080
84287  1.0000 1000.0000 0.0000e+00 2.1068 0.153700
85403  1.0000 1000.0000 0.0000e+00 3.4112 0.378160
85513  1.0000 1000.0000 0.0000e+00 1.7578 0.116680
90464  0.9896    1.9783 7.8197e-04 1.4717 0.091847
92350  1.0000 1000.0000 0.0000e+00 1.9689 0.139440
92465  0.9928    2.1394 3.0525e-04 1.6781 0.110270
92826  1.0000 1000.0000 0.0000e+00 1.7041 0.111780
93197  0.9978    2.6565 1.7234e-04 1.6590 0.107790
93198  0.9978    2.6565 1.7234e-04 1.6682 0.108520
93199  0.9980    2.6980 1.2443e-04 1.6750 0.109240
93639  1.0000 1000.0000 0.0000e+00 2.2083 0.171480
93640  1.0000 1000.0000 0.0000e+00 2.1260 0.160500
94661  0.9992    3.0965 5.6210e-05 1.7318 0.114580
96296  0.9978    2.6565 1.7234e-04 1.7500 0.116770
97405  1.0000 1000.0000 0.0000e+00 2.2977 0.180220
97407  1.0000 1000.0000 0.0000e+00 2.3241 0.183260
97998  1.0000 1000.0000 0.0000e+00 2.0376 0.147430
98445  0.9984    2.7951 1.0955e-04 1.5071 0.094143
98528  0.9896    1.9783 7.8197e-04 1.6129 0.104540
99021  1.0000 1000.0000 0.0000e+00 1.9148 0.133740
99396  0.9970    2.5215 2.2862e-04 1.6005 0.102600
99397  0.9994    3.2215 3.7296e-05 1.6167 0.103720
99398  0.9986    2.8532 8.5501e-05 1.6158 0.103810
99430  1.0000 1000.0000 0.0000e+00 2.5618 0.220810
99694  1.0000 1000.0000 0.0000e+00 2.9127 0.274370
103772 1.0000 1000.0000 0.0000e+00 1.6848 0.109560
105015 0.9996    3.3977 2.2613e-05 2.4267 0.203410
105025 1.0000 1000.0000 0.0000e+00 2.1251 0.158070
105030 1.0000 1000.0000 0.0000e+00 2.1370 0.159030
106061 1.0000 1000.0000 0.0000e+00 2.0283 0.145150

Selection Scan with BAYESCAN

Table of Contents

1. Preparing Input File

1.1 Exclude Loci with More than 2 Alleles

1.2 Check Popfile

1.3 Modify the SPID File

1.4 Modify the `vcf2bscan.sh` Script

1.5 Bayescan Input File

2. Run Bayescan

3. Understanding Bayescan Output

4. Visualizing Outliers

4.1 Download the Data

4.2 Check for Outliers

4.3 Visualize Outliers

4.4 Chromosome-Wise Distribution of Outliers

Table of Contents

1. Preparing Input File

1.1 Exclude Loci with More than 2 Alleles

1.2 Check Popfile

1.3 Modify the SPID File

1.4 Modify the vcf2bscan.sh Script

1.5 Bayescan Input File

2. Run Bayescan

3. Understanding Bayescan Output

4. Visualizing Outliers

4.1 Download the Data

4.2 Check for Outliers

4.3 Visualize Outliers

4.4 Chromosome-Wise Distribution of Outliers

1.4 Modify the `vcf2bscan.sh` Script