PICRUSt with de novo Variants Workflow

Chenghao Zhu

2018/04/28

  • This workflow uses the DADA2(or any other de novo clustering algorithm) output to run the PICRUSt function prediction.
  • This workflow was created by the Langille lab. If you want to know more detail, please visit their github tutorial page https://github.com/LangilleLab/microbiome_helper/wiki/PICRUSt-Tutorial-with-de-novo-Variants#add-metadata-to-ko-table
  • This workflow requires the sequence alignment tool and the fasttree software in QIIME2. So MAKE SURE that you activate the qiime2 enviroment before you start this jupyter notebook. The qiime2 version that I used when I wrote thsi workflow is qiime2-2018.2, although the 2018.4 version is already released.

Create New Reference Tree

In [ ]:
!biom convert \
    -i dada2_out/feature_table_rarefied.tsv \
    -o dada2_out/feature_table.biom \
    --to-json
In [ ]:
!biom add-metadata \
    -i dada2_out/feature_table.biom \
    -o dada2_out/feature_table_taxa.biom \
    --observation-metadata-fp dada2_out/taxonomy_for_biom.tsv
In [ ]:
!cat dada2_out/rep_seqs.fasta img_gg_starting_files/gg_13_5_img_subset.fasta > study_seqs_gg_13_5_img_subset.fasta
In [ ]:
!qiime tools import \
    --input-path study_seqs_gg_13_5_img_subset.fasta \
    --output-path rep_seqs_gg_img.qza \
    --type 'FeatureData[Sequence]'
In [ ]:
!qiime alignment mafft \
    --i-sequences rep_seqs_gg_img.qza \
    --o-alignment rep_seqs_gg_img_aligned.qza 
In [ ]:
!qiime alignment mask \
    --i-alignment rep_seqs_gg_img_aligned.qza \
    --o-masked-alignment rep_seqs_gg_img_masked.qza
In [ ]:
!qiime phylogeny fasttree \
    --i-alignment rep_seqs_gg_img_masked.qza \
    --o-tree unrooted-tree.qza
In [ ]:
!qiime tools export \
    unrooted-tree.qza \
    --output-dir .

PICRUSt genome prediction

Start by formatting input tree and trait table. The below commands perform a few different checks of the input files and prepares them for the ASR step.

In [ ]:
!format_tree_and_trait_table.py \
    -t tree.nwk \
    -i img_gg_starting_files/gg_13_5_img_16S_counts.txt \
    -o format/16S
In [ ]:
!format_tree_and_trait_table.py \
    -t tree.nwk \
    -i img_gg_starting_files/img_400_ko.tab \
    -o format/KO -m img_gg_starting_files/gg_13_5_img_fixed.txt

Run ancestral state reconstruction steps for 16S rRNA gene counts to get estimates for every internal node in the tree. The default method is phylogenetic independent contrasts although there are other methods available as well.

In [ ]:
!ancestral_state_reconstruction.py \
    -i format/16S/trait_table.tab \
    -t format/16S/pruned_tree.newick \
    -o asr/16S_asr_counts.tab \
    -c asr/asr_ci_16S.tab

Now that we have the trait predictions for all internal nodes we can extend these 16S counts predictions to the tips of the tree (i.e. our study sequences of interest). Note that the ASR files for the KEGG orthologs have been pre-calculated.

In [ ]:
!predict_traits.py \
    -i format/16S/trait_table.tab \
    -t format/16S/reference_tree.newick \
    -r asr/16S_asr_counts.tab -o predicted/16S_precalculated.tab \
    -a -c asr/asr_ci_16S.tab

You could also run this command to run ASR for the KO table. These commands can take > 30 minutes so we have made the output files of predict_traits.py available in img_gg_starting_files/precalc_predict if you want to skip these commands.

In [ ]:
!ancestral_state_reconstruction.py \
    -i format/KO/trait_table.tab \
    -t format/KO/pruned_tree.newick \
    -o asr/KO_asr_counts.tab \
    -c asr/asr_ci_KO.tab
In [ ]:
!predict_traits.py \
    -i format/KO/trait_table.tab \
    -t format/KO/reference_tree.newick \
    -r asr/KO_asr_counts.tab \
    -o predicted/KO_precalculated.tab \
    -a -c asr/asr_ci_KO.tab

The output of these commands is the table of trait counts across study and reference variants for both 16S copy number and KOs. Since there is variability in the alignment step it is likely your output will differ at least slightly from the results shown below. You can use our pre-calculated predictions if you want to re-generate the below plots, which are in img_gg_starting_files/precalc_predict. If you want to use the previously generated predicted tables for the below commands then you should first run this command to overwrite the tables in predicted/: cp img_gg_starting_files/precalc_predict/* predicted/.

Add metadata to KO table

You will need to add the metadata to the KEGG prediction table before proceeding. For this dataset this can easily be done with the below commands (the echo command is just adding a newline to the end of the trait table).

In [ ]:
%%bash
echo "" >> predicted/KO_precalculated.tab
cat img_gg_starting_files/metadata/precalc_meta >> predicted/KO_precalculated.tab

Get predictions per sample based on the abundance of each variant.

Now that we have the predicted trait abundances per variant (i.e. for their predicted genome) we can get the predicted trait abundances for each sample (i.e. the predicted metagenome). The first step in this process is to normalize the variant abundances by the predicted 16S copy number. This is important so that the abundances of gene families and other functions are comparable in downstream steps.

Note that the precalculated trait tables that we generated above are specified with the -c option in the below commands.

In [ ]:
!normalize_by_copy_number.py \
    -i dada2_out/feature_table_taxa.biom \
    -c predicted/16S_precalculated.tab \
    -o seqtab_tax_rarified_norm.biom

Now that the variant table is normalized by 16S copy number we can determine the predicted number functions in each sample. Essentially this command multiplies each variant's normalized abundance by the abundance of each function in the precalculated table.

In [ ]:
!predict_metagenomes.py \
    -i seqtab_tax_rarified_norm.biom \
    -c predicted/KO_precalculated.tab \
    -o seqtab_tax_rarified_norm_ko.biom

The above command generated predicted metagenomes for KEGG orthologs (i.e. the gene families). However, often we are interested in more readily interpretable categories like pathways. To collapse the KEGG orthologs to KEGG pathways the below command can be used.

In [ ]:
!categorize_by_function.py \
    -i seqtab_tax_rarified_norm_ko.biom \
    -o seqtab_tax_rarified_norm_ko_l3.biom \
    -c KEGG_Pathways \
    -l 3
In [ ]:
!biom head -i seqtab_tax_rarified_norm_ko_l2.biom
In [ ]:
!biom convert \
    -i seqtab_tax_rarified_norm_ko_l1.biom \
    -o seqtab_tax_rarified_norm_ko_l1.tsv \
    --to-tsv
In [ ]:
!head seqtab_tax_rarified_norm_ko_l1.tsv

Breakdown of Taxonomic Contributions

Although predicted metagenomes are useful often users want to know which taxa are actually conferring specific functions. To this end the below command will output a textfile that states what abundance of a function is being conferred by a given variant (and their taxonomic assignment). The below command will output these predicted contributions for two KEGG orthologs: K08602 and K03699.

In [ ]:
!metagenome_contributions.py \
    -i seqtab_tax_rarified_norm.biom \
    -l K14203,K11038,K14188,K01315,K01401,K14200,K14201,K14202,K14204,K14205,K03367,K03740,K03739,K11632,K11631,K14198,K14199,K14194,K14195,K14196,K14197,K14192,K14193,K11041,K11040,K11043 \
    -o metagenome_contributions.txt \
    -c predicted/KO_precalculated.tab