November 12, 2021
The current version of VCF is 4.3. However, your files are likely formatted to version 4.1 because that’s the version available on Teton. The VCF format specification has an extensive manual available here.
The header is usually quite large and contains definitions of each field. The most important header line is the one that starts with #CHROM. Let’s look at it.
zgrep "#CHROM" vcf/CroVir.vcf.gz > vcf.header
vim vcf.header
:%s/\t/\r/g
We first sent the header to a text file and then replaced all tabs with line breaks so the file is easier to read.
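The same result can be had without opening an editor. The sketch below is one possible non-interactive version, not the workflow used above: it builds a tiny stand-in file (toy.vcf.gz, with made-up sample names) so it runs anywhere, then pipes the #CHROM line through tr to put each column name on its own line.

```shell
# Toy stand-in for the real VCF (hypothetical file and sample names).
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\tsample2\n' | gzip > toy.vcf.gz

# Pull out the #CHROM line and turn every tab into a line break.
zgrep '^#CHROM' toy.vcf.gz | tr '\t' '\n' > toy.header
cat toy.header
```

On the real data, swapping toy.vcf.gz for vcf/CroVir.vcf.gz should give one column name per line, with the sample names starting at line 10.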
The FORMAT field is one of the most important fields in the VCF file. It consists of one or more of the following:
DP: Read depth. Number of sequence reads covering the site in an individual.
AD: Allelic depth. Number of sequence reads that support each allele. One value per allele, so two values at a biallelic site.
GT: The called genotype, based on allelic depth, quality scores, and error rate.
GQ: Genotype quality, expressed as a PHRED-scaled score.
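To see how these subfields line up with the per-sample values, here is a minimal sketch using awk. The record is invented for illustration (toy.record and its values are not from CroVir.vcf.gz). Column 9 lists the subfield names and each sample column holds the matching values in the same order, both separated by colons.

```shell
# One made-up VCF record: column 9 is FORMAT, column 10 is the first sample.
printf '2\t1500\t.\tA\tG\t50\tPASS\t.\tGT:AD:DP:GQ\t0/1:7,5:12:99\n' > toy.record

# Split FORMAT and the sample column on ":" and print them as name=value pairs.
awk -F'\t' '{
    n = split($9, keys, ":")
    split($10, vals, ":")
    for (i = 1; i <= n; i++) print keys[i] "=" vals[i]
}' toy.record > toy.genotype
cat toy.genotype
```

In this toy record the individual is heterozygous (0/1), with 7 reads supporting the reference allele and 5 supporting the alternate, for a depth of 12.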
VCFTools is a very fast and wonderful program to parse and filter VCF files. Let’s run a few examples of what this program can do:
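module load swset gcc vcftools perl
# Summary only: reports the number of individuals and sites
vcftools --gzvcf CroVir.vcf.gz
# Keep only biallelic sites (exactly two alleles)
vcftools --gzvcf CroVir.vcf.gz --min-alleles 2 --max-alleles 2
# Write per-site allele frequencies to out.frq
vcftools --gzvcf CroVir.vcf.gz --freq
# Keep only sites on the chromosome named 2
vcftools --gzvcf CroVir.vcf.gz --chr 2
# Thin sites so that no two are within 20,000 bp of each other
vcftools --gzvcf CroVir.vcf.gz --thin 20000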
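To make the biallelic filter concrete, the sketch below mimics what --min-alleles 2 --max-alleles 2 selects, using awk on a toy site list. It is a rough stand-in for illustration, not how VCFtools is implemented. The file toy.sites and its records are invented, and it assumes ALT alleles are comma-separated in column 5, as in the VCF specification.

```shell
# Three made-up sites: the middle one has two ALT alleles (three alleles in total).
printf '#CHROM\tPOS\tID\tREF\tALT\n2\t100\t.\tA\tG\n2\t200\t.\tC\tT,G\n3\t300\t.\tG\tA\n' > toy.sites

# Keep header lines, plus records with exactly one ALT allele (REF + 1 ALT = 2 alleles).
awk -F'\t' '/^#/ || ($5 != "." && $5 !~ /,/)' toy.sites > toy.biallelic
cat toy.biallelic
```

The site at position 200 is dropped because it carries three alleles, and the other two sites are kept.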