Database Preparation

If you do not already have a database for reconstruction, you will need to prepare your own. Database preparation will only need to be done once for each database, primer pair and kmer length, since the reference files can be reused.

Note

Database selection should be considered carefully if you plan to reconstruct a phylogenetic tree. Currently, only greengenes 13_8 and Silva 128 are compatible with tree building. Other versions may fail or the tree may not be constructed correctly.

To construct the database, you will need:

  • A working sidle installation
  • A list of the forward and reverse files used to amplify each region of interest
  • A reference database of your choice
  • Patience to run the extraction

You can start the tutorial by downloading the tutorial database sequences and taxonomy. They are intended as a reasonably sized dataset for the tutorial and should not be re-used for your data. These have already been imported into qiime2 and as Artifacts.

mkdir -p sidle_tutorial
cd sidle_tutorial
wget https://github.com/jwdebelius/q2-sidle/raw/main/docs/tutorial_data/database.zip
unzip database.zip
cd database

After you run this command, you will find two input files in your folder: sidle-db-full-sequences.qza, which is the full length database sequences and sidle-db-taxonomy.qza, the taxonomy for the sequences.

qiime feature-table tabulate-seqs \
 --i-data sidle-db-full-sequences.qza \
 --o-visualization sidle-db-full-sequences.qzv

You can view the artifact using qiime2 view. You should find that there are 5649 sequences when you check the data.

Filtering the Database

Database preparation can optionally begin by filtering the database to remove sequences with too many degenerate nucleotides or with taxonomic assignments that will not be used.

Degenerate filtering limits memory consumption throughout. The authors of SMURF [1] recommend filtering the database to remove sequences with more than 3 degenerate nucleotides. This represents about X% of the greengenes 13_8 database at 99% [2] specificity. (The the RESCRIPt formatted Silva 128 database is filtered to exclude sequences with more than 5 degenerates [3], [4]). Increasing the number of allowed degenerates (the --p-max-degen parameter) will allow more sequences through the filter, and may mean more matches in downstream alignment. However, this comes at a substantial increase in the run time and memory needed, since degenerate sequences have to be expanded, meaning more alignments are required.

For this tutorial, we’ll start by filtering to remove anything with more than 3 degenerate nucleotides, since this was the recommended threshold in the original algorithm. This is done using the RESCRIPt [4] plugin.

qiime rescript cull-seqs \
 --i-sequences sidle-db-full-sequences.qza \
 --p-num-degenerates 3 \
 --o-clean-sequences sidle-db-full-degen-filtered-sequences.qza

Try summarizing your database again. There should be 5440 sequences remaining. How many do you have?

Some users may also want to filter out sequences which may not be relevant to their analysis, for example, mitochondria or chloroplasts or sequences which are undefined at a high taxonomic level. (Phylum or class, for example.) You can learn more about filtering by taxonomy in the QIIME2 tutorial, but as a brief example, we’ll show filtering a greengenes database for features missing a phylum (p__;) or kingdom(k__;) designation.

qiime taxa filter-seqs \
 --i-sequences sidle-db-full-degen-filtered-sequences.qza \
 --i-taxonomy sidle-db-taxonomy.qza \
 --p-exclude "p__;,k__;" \
 --p-mode contains \
 --o-filtered-sequences sidle-db-full-degen-filtered-phylum-def-sequences.qza

If you summarize the sequences, you should find that you now have 5408 sequences remaining.

For databases with more complicated strings that include taxonomy, it will be necessary to include the level designation to avoid removing taxa which may be undefined at lower levels.

Note

The taxonomic filtering should be considered carefully and pre-filtering should be very permissive. Many common databases lack clear taxonomic resolution at lower taxonomic levels (family, genus, species) and these sequences still provide meaningful information in reconstruction.

Once you have finished pre-filtering, you are ready to start extracting regions.

Prepare a regional database for each primer set

The next step is to extract a region of the database. Alignment with the SMURF algorithm relies on extracting the exact kmer to be aligned with your ASVs, so the primer pair and read length must match exactly. Unlike other techniques, there is, unfortunately, no “good enough” approach. To maximize memory efficiency, the database is also prepared by expanding degenerate nucleotides and collapsing duplicated kmers into a single sequence.

First, the region is extracted from the pre-filtered database using the extract-reads function from the feature classifier plugin. As an example, we’ll look at extracting a region between 316F and 484R using the second primer pair from the SMURF paper (5’-TCCTACGGGAGGCAGCAG-3’) and (5’-TATTACCGCGGCTGCTGG-3’).

qiime feature-classifier extract-reads \
 --i-sequences sidle-db-full-degen-filtered-phylum-def-sequences.qza \
 --p-f-primer TCCTACGGGAGGCAGCAG \
 --p-r-primer TATTACCGCGGCTGCTGG \
 --o-reads sidle-db-filt-316F-484R.qza

For this example, we used the default settings, although these are slightly different from the original SMURF algorithm: In QIIME, the primers are extracted if they have at least an 80% match with the sequence by default; the Matlab implementation of SMURF used a maximum difference of 2 nucleotides [1]. If you wish to use a limit closer to the original algorithm, this can be changed using the --p-identity parameter; however, for the sake of this tutorial, we’ll use the defaults.

Once the reads have been extracted, they need to be prepared to be used in alignment. This step will expand any degenerate reads that have been extracted, collapse duplicate reads, and trim them to a consistent length. For the full pipeline to work correctly, the primers need to be specified in this step, so once again, you’ll need to pass your primers. You’ll also need to specify a trim length; let’s use 100nt. Finally, we need to specify a regional identifier in the database using the --region parameter. This should be the same regional parameter that you use during alignment. We’ll call it “WonderWoman” because (a) Diana Prince is amazing and (b) the regional name doesn’t matter.

qiime sidle prepare-extracted-region \
 --i-sequences sidle-db-filt-316F-484R.qza \
 --p-region "WonderWoman" \
 --p-fwd-primer TCCTACGGGAGGCAGCAG \
 --p-rev-primer TATTACCGCGGCTGCTGG \
 --p-trim-length 100 \
 --o-collapsed-kmers sidle-db-wonder-woman-100nt-kmers.qza \
 --o-kmer-map sidle-db-wonder-woman-100nt-map.qza

The command will output the sequences (--o-collapsed-kmers) with degenerate sequences expanded and duplicated sequences removed and a mapping between the original sequence name and the kmer name (--o-kmer-map). You can use qiime to visualize your kmer map, which gives you the relationship between the original database sequence name (db-seq), an expanded name which accounts for degenerates (seq-name), the collapsed regional identifier (kmer), the primers (fwd-primer and rev-primer), the region identifier (region), and the sequence length (trim-length).

qiime metadata tabulate \
 --m-input-file sidle-db-wonder-woman-100nt-map.qza \
 --o-visualization sidle-db-wonder-woman-100nt-map.qzv

In some cases, the reference region and sequence length may not be long enough to cover the full amplicon. If that’s the case, you can extract the read starting from the reverse primer by setting the trim length to a negative value. You can even reverse complement the resultant amplicons using the --reverse_complement_result flag. Let’s do an example using the same primer-pair region as before, but call the region “Batman”. Note we’ve swapped the forward and reverse primer sequences.

qiime sidle prepare-extracted-region \
 --i-sequences sidle-db-filt-316F-484R.qza \
 --p-region "Batman" \
 --p-fwd-primer TATTACCGCGGCTGCTGG \
 --p-rev-primer TCCTACGGGAGGCAGCAG \
 --p-trim-length -100 \
 --p-reverse-complement-result \
 --o-collapsed-kmers sidle-db-batman-100nt-kmers.qza \
 --o-kmer-map sidle-db-batman-100nt-map.qza

As an exercise, try using the 486-650 primers (3-CAGCAGCCGCGGTAATAC-5 forward; 3-CGCATTTCACCGCTACAC-5 reverse) to extract and prepare a 100nt region called “GreenLantern” as we outlined above. Use the same naming convention as the other two extracted regions (sidle-db-green-lantern-100nt-kmers.qza).

Recap

We’ve essentially constructed three amplicon regional databases. Again, Batman is simply a reverse-compliment extraction example of the same region we extracted for WonderWoman. This mimics the case in which we have paired-end reads that we could not merge, so we treat them as separate region data as mentioned above.

Hero (Region Name) Region Forward Primer Reverse Primer
WonderWoman 316F-484R TCCTACGGGAGGCAGCAG TATTACCGCGGCTGCTGG
Batman 316F-484R (rev compliment) TATTACCGCGGCTGCTGG TCCTACGGGAGGCAGCAG
GreenLantern 486F-650R CAGCAGCCGCGGTAATAC CGCATTTCACCGCTACAC

Now, you have a database that’s ready to use for alignment and reconstruction.

TL;DR: Database Preparation

Database Filtering

  • Filtering only needs to be performed once per dataset
  • Degenerate filtering speeds up preparation and alignment
  • You can exclude sequences during database generation that you don’t want included in the final table

Degenerate Filtering

Syntax

qiime rescript cull-seqs \
 --i-sequences [unfiltered sequences].qza \
 --p-num-degenerates [number of degenerate nucleotides] \
 --o-clean-sequences [filtered sequences].qza

Example

qiime sidle filter-degenerate-sequences \
 --i-sequences sidle-db-full-sequences.qza \
 --p-max-degen 3 \
 --o-filtered-sequences sidle-db-full-degen-filtered-sequences.qza

Taxonomic Filtering

Please see the qiime filtering tutorial for more information.

Syntax

qiime taxa filter-seqs \
 --i-sequences [unfiltered sequences].qza \
 --i-taxonomy [taxonomic descriptions].qza \
 --p-exclude [criteria to exclude] \
 --p-mode contains \
 --o-filtered-sequences [filtered sequences].qza

Example

qiime taxa filter-seqs \
 --i-sequences sidle-db-full-degen-filtered-sequences.qza \
 --i-taxonomy ref-taxonomy.qza \
 --p-exclude "p__;,k__;" \
 --p-mode contains \
 --o-filtered-sequences filtered-defined-phylum.qza

Database Region Preparation

  • The primers used to extract regions must be the same as the primers used to amplify your sequences in that region
  • The extraction command must be re-run for each primer-pair and database
  • Read preparation needs to be re-run for each primer-pair, read length, and database
  • A negative trim length to qiime sidle prepare-extracted-region will trim from the reverse primer (right)

Read Extraction

Please see the qiime feature classifier documentation for more information.

Syntax

qiime feature-classifier extract-reads \
 --i-sequences [full length sequences] \
 --p-f-primer [forward primer] \
 --p-r-primer [reverse primer] \
 --o-reads [extracted region]

Example

qiime feature-classifier extract-reads \
 --i-sequences filtered-defined-phylum.qza \
 --p-f-primer TGGCGGACGGGTGAGTAA \
 --p-r-primer CTGCTGCCTCCCGTAGGA \
 --o-reads filtered-defined-phylum-extract-316F-484R.qza

Regional Database Preparation

Syntax

qiime sidle prepare-extracted-region \
 --i-sequences [extracted sequences].qza \
 --p-region [region label] \
 --p-fwd-primer [forward primer for region] \
 --p-rev-primer [reverse primer for region] \
 --p-trim-length [kmer length] \
 --o-collapsed-kmers [kmer sequences].qza \
 --o-kmer-map [kmer to database map].qza

Example

For forward reads (trim from the left)

qiime sidle prepare-extracted-region \
 --i-sequences filtered-defined-phylum-extract-316F-484R.qza \
 --p-region "WonderWoman" \
 --p-fwd-primer TCCTACGGGAGGCAGCAG \
 --p-rev-primer TATTACCGCGGCTGCTGG \
 --p-trim-length 100 \
 --o-collapsed-kmers sidle-db-wonder-woman-100nt-kmers.qza \
 --o-kmer-map sidle-db-wonder-woman-100nt-map.qza

For reverse reads (trim from the right and in this case, reverse complement). The primers should be flipped (we’ll trim from the forward primer)

qiime sidle prepare-extracted-region \
 --i-sequences filtered-defined-phylum-extract-316F-484R.qza \
 --p-region "Batman" \
 --p-fwd-primer TATTACCGCGGCTGCTGG \
 --p-rev-primer TCCTACGGGAGGCAGCAG \
 --p-trim-length -100 \
 --p-reverse-complement-result \
 --o-collapsed-kmers sidle-db-batman-100nt-kmers.qzv \
 --o-kmer-map sidle-db-batman-100nt-map.qzv
Database References
[1](1, 2) Fuks, C; Elgart, M; Amir, A; et al (2018) “Combining 16S rRNA gene variable regions enables high-resolution microbial community profiling.” Microbiome. 6:17. doi: 10.1186/s40168-017-0396-x
[2]McDonald, D; Price, NM; Goodrich, J, et al (2012). “An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea.” ISME J. 6: 610. doi: 10.1038/ismej.2011.139
[3]Quast, C.; Pruesse, E; Yilmaz, P; et al. (2013) “The SILVA ribosomal RNA gene database project: improved data processing and web-based tools.” Nucleic Acids Research. 41:D560. doi: 10.1093/nar/gks1219
[4](1, 2) Michael S Robeson II, Devon R O’Rourke, Benjamin D Kaehler, et al. “RESCRIPt: Reproducible sequence taxonomy reference database management for the masses.”” bioRxiv 2020.10.05.326504; doi: 10.1101/2020.10.05.326504