Read Preparation

Although the original SMURF paper relied on a quality filtering protocol, we have elected to recommend the use of existing denoising algorithms. In Sidle, we recommend using existing tools to perform pre-processing. Here, we provide example code for processing samples, however, this can be accomplished any number of ways.

Note

The following steps represent a suggested pipeline for preparing reads to be used with sidle. Alternative pipelines are possible and may be more useful for your specific circumstances.

As an initial example of read preparation, your pipeline may include the following steps. To make navigation easy, what stage are you at?

Demultiplex your reads

Fully Multiplexed sequences

The first step of sample processing is demultiplexing your sequences into sample x region pairs. You may have fully multiplexed sequences, in which case, you will need to demultiplex into both samples and regions. If this is the case, you will likely have two or three fastq files, which likely represent forward, reverse, and index reads. Please refer to the QIIME 2 documentation for demultiplexing EMP sequences and demultiplexing with cutadapt.

Now, you’re ready to denoise the regions.

Sequences barcoded by sample (mixed regions per sample)

If your samples are demultiplexed to include a single set of files (probably forward and reverse), you will need to demultiplex the files into regions. You can use the cutadapt [1] plugin in QIIME 2 to demultiplex into regions and remove the primers in the same step. If you have paired-end reads, you will want to use the trim-paired command; if you have Ion-Torrent or are using a subset of reads, you might choose to use trim-single. (See the q2 cutadapt and cutadapt docs for more details.) Using the --p-discard-untrimmed flag will remove any sequence which does not have the primer region, allowing you to find our sequences of interest.

This will give you a table per region. Next, continue on to denoise the regions.

Sequences barcoded by sample and region

If you are lucky, or planned ahead, you may have added a unique barcode to each region of a sample. So, if, for example, you sequenced 5 samples with 6 regions, after demultiplexing, you should have 30 demultiplexed files and a map which describes the relationship between the name and region.

You’ll need that later. We suggest a file with at least three columns: sample-id, original-sample-id, and regional-id. (Note that sample-id is a QIIME-required header column.) You could potentially combine this with a manifest or barcode file that you used to import and/or demultiplex your data. (If you don’t if your data is demultiplexed already, check in with your sequencing center! Make friends with your sequencing center. Invite them to drink a hot beverage with you and talk cool research.)

Depending on your region lengths and sequencing length, you may need to map the forward and reverse reads separately. So, you may need final sets of demultiplexed reads which represent the

  • paired-end reads for regions which can be merged
  • forward reads only for regions which cannot be merged
  • reverse reads only for regions which cannot be merged

Note

If you have a region which does not overlap, the forward and reverse reads should be processed as separate kmers and then reconstructed, rather than concatenated during denoising.

Next, make sure that you trim your primers. If you plan to denoise all the data together, this should be done sequentially for each primer pair that you plan to use for each block of sequences. (i.e. you should trim the paired reads, forward reads, and reverse reads separately.)

This will give you a table where each sample is named whatever you’ve linked to your barcode. From here, you can either filter your sequences before denoising, or proceed combining all regions, and then filter the table later. If you denoise with Dada2, you may find better performance if you leave the sequences together; this will not affect denoising with deblur.

Denoise reads with your favorite algorithm

Reads should be denoised, possibly merged, and trimmed to a standard length for kmer-based alignment. Depending on the algorithmic approach, Dada2 [2] and Deblur [3] might both be good approaches. (If you’re interested in an independent comparison of the two methods outside reconstruction, Nearing et al provides an independent benchmark [4]).

However, there are some limitations. Ion Torrent and 454 pyrosequencing results should be denoised with dada2. Dada2 tends to retain more high quality sequences than deblur, but may inflate the number of features. Because of the algorithm, it also has a longer run time.

Illumina data which has already been joined or quality filtered should be denoised with deblur. It’s a faster algorithm and highly parallelizable but it’s also more conservative.

DADA2

There are several helpful tutorials on the QIIME 2 website that describe running dada2 on forward reads and dada2 on paired reads. Minimal pre-processing should be applied before DADA2: simply demultiplex your data and pass it into the command.

Once DADA2 has been run, you will need to trim the reads to a consistent length. This can be done using the qiime dada2 parameters during denoising, or with the trim-dada2-posthoc method in q2-sidle.

As an example of the command, we can download the feature table and representative sequences from the qiime2 Moving Pictures Tutorial and then practice.

wget https://docs.qiime2.org/2020.6/data/tutorials/moving-pictures/table-dada2.qza .
wget https://docs.qiime2.org/2020.6/data/tutorials/moving-pictures/rep-seqs-dada2.qza .

If you look at the sequence summary (viewable here), you’ll find the sequences have already been trimmed to 120nt. However, for the alignment we plan to do, it may be useful to trim them to 100nt.

qiime sidle trim-dada2-posthoc \
 --i-table table-dada2.qza \
 --i-representative-sequences rep-seqs-dada2.qza \
 --p-trim-length 100 \
 --o-trimmed-table table-dada2-100nt.qza \
 --o-trimmed-representative-sequences rep-seq-dada2-100nt.qza

You can check the length by tabulating the sequences.

qiime feature-table tabulate-seqs \
 --i-data rep-seq-dada2-100nt.qza \
 --o-visualization rep-seq-dada2-100nt.qzv

You should find the sequences all trimmed to 100nt, and ready for alignment.

Deblur

If you have sequenced using Illumina, Deblur may be easier to use and is recommended by the authors/original developers of SMURF. You can find a tutorial for deblurring single end reads or paired end reads on the QIIME webpage. Simply set your Deblur trim length to the final kmer length you’ll use and proceed.

Check your tables

Before you proceed, make sure that you have what you need. For alignment and reconstruction to work correctly, you will need one feature table and one representative sequence set for each region you plan to align. The ASVs in a feature table should have a consistent length. All the samples in the table should have the same names.

If you need to, trim all the ASVs in your table to a consistent length or rename your samples.

TL;DR: Read Preparation

A quick flowchart for figuring out how to demultiplex and pre-process your reads.

Demultiplexing

Paired End Command

Single End Command

qiime cutadapt trim-single \
 --i-demultiplexed-sequences all_regions_fwd.qza \
 --p-front TCCTACGGGAGGCAGCAG \
 --p-discard-untrimmed \
 --p-error-rate 0.15 \
 --o-trimmed-sequences trimmed-regions/316_484_fwd_demux.qza

Denoising and Table Preparation

Syntax

qiime sidle trim-dada2-posthoc \
 --i-table [table filepath] \
 --i-representative-sequences [sequence filepath] \
 --p-trim-length [trim length] \
 --o-trimmed-table [trimmed table] \
 --o-trimmed-representative-sequences [trimmed sequences]

Example

qiime sidle trim-dada2-posthoc \
 --i-table table-dada2.qza \
 --i-representative-sequences rep-seqs-dada2.qza \
 --p-trim-length 100 \
 --o-trimmed-table table-dada2-100nt.qza \
 --o-trimmed-representative-sequences rep-seq-dada2-100nt.qza

Read Preparation References

[1]Martin, M. (2011). “Cutadapt removes adapter sequences from high-throughput sequencing reads”. EMBnet.journal 17:10. doi: https://doi.org/10.14806/ej.17.1.200
[2]Callahan, B; McMurdie, P; Rosen, M; et al (2016) “Dada2: High resolution sample inference from Illumina amplicon dada.” Nature Methods. 13: 581. doi: https://doi.org/10.1038/nmeth.3869
[3]Amir, A; McDonald, D; Navas-Molina, JA et al. (2017) “Deblur Rapidly Resolves Single-Nucleotide Community Sequence Patterns”. mSystems. 2:e00191 doi: 10.1128/mSystems.00191-16
[4]Nearing, J.T.; Douglas, G.M.; Comeau, A.M.; Langille, M.G.I. (2018) “Denoising the Denoisers: an independent evaluation of microbiome sequencing error-correction approaches.” Peer J. 6: e5364 doi: 10.7717/peerj.5364