This workflow shows the basic steps of demultiplexing, filtering, and trimming primers from raw FASTQ files, before any OTU/feature picking. It can only be used to process 16S sequencing FASTQ files generated with a special protocol from the David Miles lab, which uses only the barcode associated with the forward primer. In other words, the reverse (downstream) primer is not barcoded.
In this workflow, the raw paired-end FASTQ files are first demultiplexed with the barcodes to pick up read pairs that have a barcode at the beginning of R1. The unmatched reads (unmatched_R1.fastq, unmatched_R2.fastq) are then demultiplexed again, this time treating the barcode as a reverse barcode, to pick up read pairs that have a barcode at the beginning of R2. fastq-multx can demultiplex without merging the two paired-end reads. The demultiplexed reads (sample01_R1.fastq, sample01_R2.fastq, ...) are then filtered with a Python script to remove read pairs that do not have the primers in the right place. The primers are then cut off from each read by a specified length and, for each read direction, the two FASTQ files that belong to the same sample (one from each demultiplexing pass) are concatenated. In the last step, FastQC is used to check read quality and to decide the truncation length to use in DADA2.
This workflow requires around 40 GB of disk space. The actual amount varies depending on the number of samples. Make sure you have at least 50 GB of free disk space before you start.
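The free-space requirement can be checked from the notebook before starting. The snippet below is a minimal sketch that uses Python's standard `shutil.disk_usage`; the function names and the default threshold of 50 (taken from the recommendation above, and measured here in GiB) are illustrative choices, not part of the workflow.

```python
import shutil

def free_gib(path="."):
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3

def has_enough_space(path=".", required_gib=50):
    """Return True when at least `required_gib` GiB are free at `path`."""
    return free_gib(path) >= required_gib

# Report the free space in the current working directory.
print(f"{free_gib():.1f} GiB free, enough to start: {has_enough_space()}")
```

Run it in the directory where the demultiplexed files will be written, since that is the filesystem that fills up.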
This workflow is written as a Jupyter notebook. If you prefer to run the commands directly in a shell, write them into shell scripts and remove the "!" in front of each command. The "!" is only needed in a Jupyter notebook.
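As a rough illustration of that conversion, the hypothetical helper below strips the leading "!" from each command. It also skips `%%bash` lines, on the assumption that these cell markers are likewise meaningless outside Jupyter. It is a sketch for copying commands into a script, not a full notebook converter.

```python
def notebook_to_shell(lines):
    """Turn notebook command lines into plain shell lines.

    Drops the leading "!" and skips "%%bash" cell markers; every other
    line is passed through unchanged.
    """
    shell_lines = []
    for line in lines:
        if line.strip() == "%%bash":
            continue  # cell magic, only understood by Jupyter
        if line.startswith("!"):
            line = line[1:]
        shell_lines.append(line)
    return shell_lines

cell = ["!mkdir -p demultx_R1", "%%bash", "while read trim"]
print("\n".join(notebook_to_shell(cell)))
```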
Prerequisite tools:
fastq-multx:
Can be installed using the following command:
conda install -c bioconda fastq-multx
paired_end_reads_filter_by_primer.py:
This script must be placed in the same directory as your Jupyter notebook. Ask Trevor (chhzhu@ucdavis.edu) for this script.
Biopython is also required to run this script. It can be installed using:
conda install -c conda-forge biopython
fastqc:
If you are using brew, you can use:
brew install fastqc
If you can't install fastqc, contact Trevor (chhzhu@ucdavis.edu)
fastx-toolkit:
Can be installed using conda:
conda install -c bioconda fastx-toolkit
!mkdir -p demultx_R1
!mkdir -p demultx_R2
!fastq-multx -B 2017_AZ_barcodes_FF.txt -m 0 -x -b \
FFUBS-Run_S1_L001_R1_001.fastq \
FFUBS-Run_S1_L001_R2_001.fastq \
-o demultx_R1/%_R1.fastq \
-o demultx_R1/%_R2.fastq
This command generates around 17 GB of FASTQ files for 40 samples. If you are demultiplexing more samples, make sure you have enough disk space.
!fastq-multx -B 2017_AZ_barcodes_FF.txt -m 0 -x -b \
demultx_R1/unmatched_R2.fastq \
demultx_R1/unmatched_R1.fastq \
-o demultx_R2/%_R2.fastq \
-o demultx_R2/%_R1.fastq
This command generates around 21 GB of FASTQ files for 40 samples.
When this command is done, you can delete the raw FASTQ files, as well as unmatched_R1.fastq and unmatched_R2.fastq in both the demultx_R1 and demultx_R2 directories. This will save you some disk space.
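The logic of the two demultiplexing passes above can be pictured with the short sketch below. It is not how fastq-multx is implemented. It only mimics exact matching (`-m 0`) at the start of a read (`-b`), first against R1 and then, for pairs left unmatched, against R2. The barcode sequences and sample names are made up for illustration.

```python
# Hypothetical barcodes; the real ones live in the barcode file.
BARCODES = {"FF01": "ACGTACGT", "FF02": "TGCATGCA"}

def match_barcode(sequence, barcodes):
    """Return the sample whose barcode is an exact prefix of `sequence`."""
    for sample, barcode in barcodes.items():
        if sequence.startswith(barcode):
            return sample
    return None

def classify_pair(r1, r2, barcodes):
    """Assign a read pair to (sample, output directory).

    Pass 1 looks for the barcode at the start of R1; pairs that fail
    are retried in pass 2 with the barcode expected at the start of R2.
    """
    sample = match_barcode(r1, barcodes)
    if sample is not None:
        return sample, "demultx_R1"
    sample = match_barcode(r2, barcodes)
    if sample is not None:
        return sample, "demultx_R2"
    return "unmatched", None

print(classify_pair("ACGTACGTGTGTGCC", "GGACTACAC", BARCODES))
print(classify_pair("GGACTACAC", "TGCATGCAGTGTGCC", BARCODES))
```

The first pair lands in demultx_R1 and the second in demultx_R2, which is why both directories have to be carried through the later steps.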
!mkdir -p filt_demultx_R1
!mkdir -p filt_demultx_R2
!ls demultx_R1/FF*_R1.fastq | cut -f2 -d '/' |cut -f1 -d '.' >filt_R1.txt
!ls demultx_R1/FF*_R2.fastq | cut -f2 -d '/' |cut -f1 -d '.' >filt_R2.txt
!python paired_end_reads_filter_by_primer.py \
--input-forward-list filt_R1.txt \
--input-reverse-list filt_R2.txt \
--input-path demultx_R1 \
--output-path filt_demultx_R1 \
--barcodes 2017_AZ_barcodes_FF.txt \
--forward-primer GTGTGCCAGCMGCCGCGGTAA \
--reverse-primer GGACTACNVGGGTWTCTAAT
!python paired_end_reads_filter_by_primer.py \
--input-forward-list filt_R2.txt \
--input-reverse-list filt_R1.txt \
--input-path demultx_R2 \
--output-path filt_demultx_R2 \
--barcodes 2017_AZ_barcodes_FF.txt \
--forward-primer GTGTGCCAGCMGCCGCGGTAA \
--reverse-primer GGACTACNVGGGTWTCTAAT
The two demultiplexing steps above only pick up sequences that start with a sample barcode. However, some sequences that have a barcode at the beginning do not have the forward primer right after it, or do not have the reverse primer at the beginning of the other read in the pair. The purpose of this step is to filter out those reads and keep only the pairs that have the barcode as well as both the forward and reverse primers at the correct positions.
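The kind of check described above can be sketched as follows. This is an illustration only and not the code of paired_end_reads_filter_by_primer.py, whose internals are not shown here. It assumes exact matching, with IUPAC degenerate bases in the primers expanded into regular-expression character classes. The barcode in the example is hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def primer_regex(primer):
    """Compile a degenerate primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))

def pair_passes(barcoded_read, other_read, barcode, fwd_primer, rev_primer):
    """Return True when the barcode is followed directly by the forward
    primer, and the other read of the pair starts with the reverse primer."""
    if not barcoded_read.startswith(barcode):
        return False
    after_barcode = barcoded_read[len(barcode):]
    # re.match anchors at the start of the string, so position matters.
    return (primer_regex(fwd_primer).match(after_barcode) is not None
            and primer_regex(rev_primer).match(other_read) is not None)

FWD = "GTGTGCCAGCMGCCGCGGTAA"
REV = "GGACTACNVGGGTWTCTAAT"
good_r1 = "ACGTACGT" + "GTGTGCCAGCAGCCGCGGTAA" + "TTTT"  # hypothetical barcode
good_r2 = "GGACTACACGGGTATCTAAT" + "CCCC"
print(pair_passes(good_r1, good_r2, "ACGTACGT", FWD, REV))
```

For pairs from the second demultiplexing pass the roles are swapped: R2 is the barcoded read and R1 carries the reverse primer.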
!ls filt_demultx_R1/FF*_R1.filt.fastq | cut -f2 -d '/' |cut -f1 -d '.' >trim_R1.txt
!ls filt_demultx_R1/FF*_R2.filt.fastq | cut -f2 -d '/' |cut -f1 -d '.' >trim_R2.txt
!mkdir -p trim_demultx_R1
!mkdir -p trim_demultx_R2
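%%bash
# Pass 1, R1 files: keep from base 30 (drops the first 29 bases: barcode + forward primer).
while read -r trim
do
    fastx_trimmer -f 30 -i "filt_demultx_R1/${trim}.filt.fastq" -o "trim_demultx_R1/${trim}.trim.fastq"
done < trim_R1.txt
%%bash
# Pass 1, R2 files: keep from base 21 (drops the first 20 bases: reverse primer).
while read -r trim
do
    fastx_trimmer -f 21 -i "filt_demultx_R1/${trim}.filt.fastq" -o "trim_demultx_R1/${trim}.trim.fastq"
done < trim_R2.txt
%%bash
# Pass 2, R1 files: here R1 carries the reverse primer, so keep from base 21.
while read -r trim
do
    fastx_trimmer -f 21 -i "filt_demultx_R2/${trim}.filt.fastq" -o "trim_demultx_R2/${trim}.trim.fastq"
done < trim_R1.txt
%%bash
# Pass 2, R2 files: here R2 carries the barcode + forward primer, so keep from base 30.
while read -r trim
do
    fastx_trimmer -f 30 -i "filt_demultx_R2/${trim}.filt.fastq" -o "trim_demultx_R2/${trim}.trim.fastq"
done < trim_R2.txt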
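The values passed to `fastx_trimmer -f` (the first base to keep) follow from the primer lengths. The sketch below reproduces that arithmetic. The primer sequences are the ones used in the filter step, while the barcode length of 8 is an assumption inferred from the value 30 and should be checked against your own barcode file.

```python
FORWARD_PRIMER = "GTGTGCCAGCMGCCGCGGTAA"  # from the filter step
REVERSE_PRIMER = "GGACTACNVGGGTWTCTAAT"   # from the filter step
BARCODE_LENGTH = 8  # assumption inferred from "-f 30"; verify for your data

def first_base_to_keep(primer, barcode_length=0):
    """1-based position of the first base after the barcode and primer."""
    return barcode_length + len(primer) + 1

def trim_front(sequence, first_base):
    """Keep `sequence` from the 1-based position `first_base` onward."""
    return sequence[first_base - 1:]

print(first_base_to_keep(FORWARD_PRIMER, BARCODE_LENGTH))  # 30
print(first_base_to_keep(REVERSE_PRIMER))                  # 21
```

If your barcodes or primers have different lengths, the `-f` values in the trimming loops need to change accordingly.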
!mkdir -p alldemultx
!ls trim_demultx_R1/FF*.fastq | cut -f2 -d '/' | cut -f1 -d '_' | sort -u > sample_list.txt
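%%bash
# Combine the R1 files of both demultiplexing passes for each sample.
while read -r id
do
    cat "trim_demultx_R1/${id}_R1.trim.fastq" "trim_demultx_R2/${id}_R1.trim.fastq" > "alldemultx/${id}_R1.combo.fastq"
done < sample_list.txt
%%bash
# Combine the R2 files of both demultiplexing passes for each sample.
while read -r id
do
    cat "trim_demultx_R1/${id}_R2.trim.fastq" "trim_demultx_R2/${id}_R2.trim.fastq" > "alldemultx/${id}_R2.combo.fastq"
done < sample_list.txt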
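After concatenation, the R1 and R2 files of a sample should contain the same number of reads, because paired-end tools downstream expect matching pairs. The sketch below is one possible sanity check, assuming standard four-line FASTQ records. The function names are illustrative, and the file names in the usage note follow the `*_R1.combo.fastq` / `*_R2.combo.fastq` pattern used in the FastQC step.

```python
def count_records(lines):
    """Count FASTQ records in an iterable of lines (4 lines per record)."""
    n_lines = sum(1 for _ in lines)
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} lines is not a multiple of 4")
    return n_lines // 4

def counts_match(r1_lines, r2_lines):
    """Return True when both inputs hold the same number of records."""
    return count_records(r1_lines) == count_records(r2_lines)

r1 = ["@read1", "ACGT", "+", "IIII", "@read2", "TTGA", "+", "IIII"]
r2 = ["@read1", "GGCA", "+", "IIII", "@read2", "CCAT", "+", "IIII"]
print(counts_match(r1, r2))
```

An open file handle is also an iterable of lines, so real files can be passed directly, for example `counts_match(open("alldemultx/FF01_R1.combo.fastq"), open("alldemultx/FF01_R2.combo.fastq"))`, where FF01 stands for one of your sample IDs.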
!mkdir -p fastqc
!cat alldemultx/FF*_R1.combo.fastq > R1.all.fastq
!cat alldemultx/FF*_R2.combo.fastq > R2.all.fastq
!fastqc R1.all.fastq R2.all.fastq -o fastqc
!rm R1.all.fastq
!rm R2.all.fastq
!ls fastqc