Demultiplexing Samples

Wyoming INBRE Data Science Core

November 5, 2021

Table of Contents

1. Housekeeping: Where Are All Files

Let’s take inventory of our files, which should be:

ls -lh

-rwxrwxr-x 1 vchhatre 6.0G Nov  4 13:06 all_ruber.fastq
-rw-rw-r-- 1 vchhatre  863 Nov  4 13:09 barcodes2.txt

If you are coming to the workshop for the first time today, please copy these files as follows:

cd /gscratch/YOUR_USER_NAME/

mkdir rattlesnake

cd rattlesnake/

cp /project/inbre-train/2021_popgen_wkshp/data/all_ruber.fastq .

cp /project/inbre-train/2021_popgen_wkshp/data/barcodes2.txt .

head barcodes2.txt

SD_Field_1453   CTCTCCAG    8
SD_Field_0983   TAATTG  6
SD_Field_1880   ATCTCGT 7
SD_Field_0201   GACAACT 7
SD_Field_2127   CTCGCAA 7
SD_Field_1878   TGGACACT    8
SD_Field_2287   TGTCAAT 7
SD_Field_0598   TCCTGCT 7
SD_Field_1899   GAACTT  6
SD_Field_1205   ATGCT   5

2. Demultiplex using Barcodes

We need to write a simple bash script which will do the following:

In order to accomplish this, we will make use of two unix utilities:

- ``grep``

- ``sed``

2.1 The Script


#SBATCH --time=1:00:00
#SBATCH --mem=20G
#SBATCH --nodes=1
#SBATCH --mail-type=NONE
#SBATCH -J demux

## Navigate to the data folder
cd /gscratch/YOUR_ACCOUNT/rattlesnake/

## Create shortcut for the fastq file

## Create output directory
mkdir -p demux_out

## Demultiplexing begins

while read name bar leng
  grep -A 2 -B 1 "^$bar" $ruber | sed "s/^$bar//" >> demux_out/${name}.fastq
done < barcodes2.txt

3. Truncate Quality Scores

We will be using a strategy similar to section 2.1 above to achieve this. This is a two step process:

3.1 The Script

The Script


cd /gscratch/YOUR_ACCOUNT/rattlesnake

mkdir -p finalfq

while read name barcode leng
  sed -E '0~4 s/^.{'"$leng"'}//' demux_out/${name}.fastq >> finalfq/${name}.fastq
done < barcodes2.txt

This script will produce 40 fastq files in the finalfq folder and each of these files will have the same length for lines 2 and 4 per read entry.

4. Other pipelines

Many pipelines exist that demultiplex data and run additional processing, each with their own set of advantages. These often include the ability to identify barcodes with a tolerance for Ns or mismatches, which our approach does not handle.

One alternative to explore is the program Ultraplex program, which is designed to demultiplex reads, trim adapters, and trim bases/reads by quality scores all from a single command–note that we have not yet tested it ourselves, but it looks useful. What tool you will want to use for each of these steps will depend heavily on your data type. For most RAD projects, you will receive data that needs to be demultiplexed, and you may want to use a program like iPyRad, Stacks, or Ultraplex that combines multiple steps into an easy to use pipeline. If you are sequencing whole genomes on an Illumina platform, my experience so far has been that these are returned already demultiplexed, so you can skip this step, but will need to then handle the remaining steps of trimming, etc.