Demultiplexing Samples

Wyoming INBRE Data Science Core

November 5, 2021

1. Housekeeping: Where Are All Files?
2. Demultiplex using Barcodes
3. Truncate Quality Scores
4. Other pipelines

1. Housekeeping: Where Are All Files

Let’s take inventory of our files, which should be:

A fastq file containing sequence data on all individuals
Barcodes that link sequence reads with the individuals they came from


ls -lh

-rwxrwxr-x 1 vchhatre 6.0G Nov  4 13:06 all_ruber.fastq
-rw-rw-r-- 1 vchhatre  863 Nov  4 13:09 barcodes2.txt

If you are coming to the workshop for the first time today, please copy these files as follows:


cd /gscratch/YOUR_USER_NAME/

mkdir rattlesnake

cd rattlesnake/

cp /project/inbre-train/2021_popgen_wkshp/data/all_ruber.fastq .

cp /project/inbre-train/2021_popgen_wkshp/data/barcodes2.txt .

Take a quick look inside the barcodes file to know what information we have:


head barcodes2.txt

SD_Field_1453   CTCTCCAG    8
SD_Field_0983   TAATTG  6
SD_Field_1880   ATCTCGT 7
SD_Field_0201   GACAACT 7
SD_Field_2127   CTCGCAA 7
SD_Field_1878   TGGACACT    8
SD_Field_2287   TGTCAAT 7
SD_Field_0598   TCCTGCT 7
SD_Field_1899   GAACTT  6
SD_Field_1205   ATGCT   5

Note that the third field is the length of barcodes. We will need that information in a bit.

2. Demultiplex using Barcodes

We need to write a simple bash script which will do the following:

Read barcodes and their respective sample names
Parse fastq sequence data file and identify reads for a given barcode
Isolate identified reads into their own files and name these files after the sample name
Simultaneously delete the barcodes from each read.

In order to accomplish this, we will make use of two unix utilities:

- ``grep``

- ``sed``

2.1 The `demux.sh` Script

Please type everything out instead of copy pasting. Yes, you will make mistakes and the script won’t work on first try. But you will learn something.


#!/bin/bash

#SBATCH --time=1:00:00
#SBATCH --mem=20G
#SBATCH --nodes=1
#SBATCH --mail-type=NONE
#SBATCH -J demux
#SBATCH --account=YOUR_PROJECT


## Navigate to the data folder
cd /gscratch/YOUR_ACCOUNT/rattlesnake/


## Create shortcut for the fastq file
ruber="all_ruber.fastq"


## Create output directory
mkdir -p demux_out


## Demultiplexing begins

while read name bar leng
do
  grep -A 2 -B 1 "^$bar" $ruber | sed "s/^$bar//" >> demux_out/${name}.fastq
done < barcodes2.txt

At the end of this run, you should have 40 fastq files named after each of the 40 individuals and their barcodes removed.
If you recall from introduction to fastq format, the fourth line of each read record contains quality scores per nucleotide. We removed barcodes, but we haven’t yet removed quality scores for those barcodes. We need to do that before moving on.

3. Truncate Quality Scores

We will be using a strategy similar to section 2.1 above to achieve this. This is a two step process:

Figure out how to capture the fourth line of each read record (i.e. quality score line)
Use unix sed tool to keep all quality scores except for those belonging to the removed barcodes

3.1 The `truncate.sh` Script

The truncate.sh Script

#!/bin/bash


cd /gscratch/YOUR_ACCOUNT/rattlesnake

mkdir -p finalfq

while read name barcode leng
do
  sed -E '0~4 s/^.{'"$leng"'}//' demux_out/${name}.fastq >> finalfq/${name}.fastq
done < barcodes2.txt

This script will produce 40 fastq files in the finalfq folder and each of these files will have the same length for lines 2 and 4 per read entry.

4. Other pipelines

Many pipelines exist that demultiplex data and run additional processing, each with their own set of advantages. These often include the ability to identify barcodes with a tolerance for Ns or mismatches, which our approach does not handle.

One alternative to explore is the program Ultraplex program, which is designed to demultiplex reads, trim adapters, and trim bases/reads by quality scores all from a single command–note that we have not yet tested it ourselves, but it looks useful. What tool you will want to use for each of these steps will depend heavily on your data type. For most RAD projects, you will receive data that needs to be demultiplexed, and you may want to use a program like iPyRad, Stacks, or Ultraplex that combines multiple steps into an easy to use pipeline. If you are sequencing whole genomes on an Illumina platform, my experience so far has been that these are returned already demultiplexed, so you can skip this step, but will need to then handle the remaining steps of trimming, etc.

Table of Contents