1. Validate inputs and sample metadata
Pepkio verifies FASTQ integrity (MD5), read length, paired-end structure, and platform. Sample matrix, host species, batch, and contrasts are recorded in sample_manifest.csv, and sub-threshold yield is flagged (Quince et al., 2017). MGI read headers are normalized when this is scoped at kickoff.
Tools and outputs
Tools used: md5sum; custom validation scripts
Output: sample_manifest.csv with sample IDs, platform, read counts, host species, and QC flags
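The integrity check and manifest write-out described in this step could be sketched as follows. This is a minimal illustration, not Pepkio's actual validation script: the column names, the flag labels, and the `min_reads` threshold are all assumptions made for the example.

```python
import csv
import hashlib

# Hypothetical column names, chosen to mirror the fields listed for sample_manifest.csv.
MANIFEST_FIELDS = ["sample_id", "platform", "read_count", "host_species", "qc_flags"]


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_row(sample_id, fastq_path, expected_md5, platform, host_species,
                 read_count, min_reads=1_000_000):
    """Build one manifest row; min_reads is an illustrative yield threshold."""
    flags = []
    if md5_of_file(fastq_path) != expected_md5.strip().lower():
        flags.append("md5_mismatch")
    if read_count < min_reads:
        flags.append("low_yield")
    return {
        "sample_id": sample_id,
        "platform": platform,
        "read_count": read_count,
        "host_species": host_species,
        "qc_flags": ";".join(flags) if flags else "pass",
    }


def write_manifest(rows, out_path="sample_manifest.csv"):
    """Write manifest rows to CSV with a fixed header order."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=MANIFEST_FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Reading the file in chunks keeps memory use flat for multi-gigabyte FASTQ files; the actual yield threshold would be whatever is agreed for the project.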
2. QC and trim raw reads
FastQC assesses per-base quality, adapter content, and duplication; fastp trims adapters and low-quality ends when needed (Chen et al., 2018). Low Q30 yield or extreme adapter contamination is flagged before host removal. MultiQC aggregates per-sample metrics (Ewels et al., 2016).
Tools and outputs
Tools used: FastQC 0.12.1; fastp 0.24.0; MultiQC 1.25.1
Output: fastqc/ reports; fastp.json / fastp.html; multiqc_report.html
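As an illustration of how the Q30 and adapter flags could be derived from fastp.json, the sketch below reads the `summary.before_filtering` and `adapter_cutting` sections of the report. The two thresholds and the flag names are hypothetical placeholders, not agreed project values.

```python
import json


def load_fastp_report(path):
    """Load a fastp JSON report from disk."""
    with open(path) as handle:
        return json.load(handle)


def flag_fastp_report(report, min_q30_rate=0.80, max_adapter_fraction=0.50):
    """Return QC flags for one sample; both thresholds are illustrative assumptions."""
    before = report["summary"]["before_filtering"]
    flags = []
    if before["q30_rate"] < min_q30_rate:
        flags.append("low_q30")
    # The adapter_cutting section may be absent, e.g. if adapter trimming was disabled.
    trimmed = report.get("adapter_cutting", {}).get("adapter_trimmed_reads", 0)
    total = before["total_reads"]
    if total and trimmed / total > max_adapter_fraction:
        flags.append("high_adapter")
    return flags
```

A call such as `flag_fastp_report(load_fastp_report("fastp.json"))` returns an empty list for a sample that passes both checks.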
3. Remove host and contaminant reads
KneadData trims and filters reads, then removes host DNA by aligning with Bowtie2 against GRCh38 or GRCm39 (Beghini et al., 2021). Host-depletion rate and post-filter read counts are reported because host DNA confounds shallow metagenomics (Treichel et al., 2026; Franzosa et al., 2018). Samples with insufficient microbial reads after depletion are flagged before profiling.
Tools and outputs
Tools used: KneadData 0.12.0; Bowtie2 2.5.4
Output: Host-depleted FASTQ; host_depletion_summary.csv with pre/post read counts and pct host removed
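The percentage reported in host_depletion_summary.csv is simple arithmetic on the pre- and post-depletion read counts. A minimal sketch follows, assuming `reads_before` is the count entering host filtering and that `min_microbial_reads` is a hypothetical project threshold.

```python
def host_depletion_row(sample_id, reads_before, reads_after,
                       min_microbial_reads=500_000):
    """Summarize host depletion for one sample.

    pct_host_removed = 100 * (reads_before - reads_after) / reads_before
    """
    if reads_before <= 0:
        raise ValueError("reads_before must be positive")
    if not 0 <= reads_after <= reads_before:
        raise ValueError("reads_after must lie between 0 and reads_before")
    pct_removed = 100.0 * (reads_before - reads_after) / reads_before
    return {
        "sample_id": sample_id,
        "reads_before": reads_before,
        "reads_after": reads_after,
        "pct_host_removed": round(pct_removed, 2),
        # Illustrative flag: too few non-host reads remain for profiling.
        "flag": "low_microbial_reads" if reads_after < min_microbial_reads else "pass",
    }
```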
4. Profile taxonomy with MetaPhlAn 4
MetaPhlAn 4 maps reads to clade-specific marker genes in the SGB catalog and estimates relative abundance at species and strain level where markers support it (Blanco-Míguez et al., 2023). The unclassified read fraction is compared against depth expectations before downstream testing.
Tools and outputs
Tools used: MetaPhlAn 4.1.0
Output: Per-sample MetaPhlAn profiles; merged metaphlan4_species.tsv
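To show how species-level rows can be pulled out of a per-sample profile before merging, here is a small sketch. It assumes the default tab-separated layout (clade name in the first column, relative abundance in the third) and that an `UNCLASSIFIED` row is present only when unclassified estimation was requested; the function name is hypothetical.

```python
def parse_metaphlan_profile(profile_text):
    """Return (species_abundance, unclassified) from MetaPhlAn profile text.

    species_abundance maps species names (without the s__ prefix) to relative
    abundance in percent. unclassified is None if no UNCLASSIFIED row exists.
    """
    species = {}
    unclassified = None
    for line in profile_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        clade, abundance = fields[0], float(fields[2])
        if clade == "UNCLASSIFIED":
            unclassified = abundance
            continue
        # Keep rows whose deepest rank is species; t__ (SGB) rows are skipped.
        last_rank = clade.split("|")[-1]
        if last_rank.startswith("s__"):
            species[last_rank[3:]] = abundance
    return species, unclassified
```

Filtering on the deepest rank avoids double counting, since each species also appears inside its SGB-level row and in every parent clade.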
5. Classify reads with Kraken2 and re-estimate abundance with Bracken
Kraken2 assigns reads by k-mer matching against PlusPF (Wood et al., 2019); Bracken re-estimates species-level abundance (Lu et al., 2017). Results are cross-checked against MetaPhlAn to flag phantom-taxa patterns at high depth (Johnson et al., 2022; McGill et al., 2024).
Tools and outputs
Tools used: Kraken2 2.1.3; Bracken 2.9
Output: Kraken2 reports; merged kraken2_bracken_species.tsv
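One way to approach the cross-check is to list species that Bracken reports above some abundance floor but that MetaPhlAn does not detect at all. The sketch below assumes both inputs are already reduced to name-to-abundance mappings; the `min_fraction` floor and the name normalization are illustrative, and real taxonomy harmonization between the two databases is usually less clean than this.

```python
def normalize_species(name):
    """Lowercase a species name and strip the s__ prefix and underscores."""
    name = name.strip()
    if name.startswith("s__"):
        name = name[3:]
    return name.replace("_", " ").lower()


def candidate_phantom_taxa(bracken, metaphlan, min_fraction=0.001):
    """Return Bracken species above min_fraction that MetaPhlAn does not report.

    bracken:   {species name: fraction of total reads}
    metaphlan: {species name: relative abundance}
    """
    detected = {normalize_species(name)
                for name, abundance in metaphlan.items() if abundance > 0}
    return sorted(
        name for name, fraction in bracken.items()
        if fraction >= min_fraction and normalize_species(name) not in detected
    )
```

Species returned here are candidates for review, not confirmed false positives, because marker-based profiling can also miss genuinely present low-abundance taxa.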
6. Quantify functional potential with HUMAnN 3
HUMAnN 3 maps reads to ChocoPhlAn pangenomes and UniRef gene families, then aggregates MetaCyc pathway abundances (Franzosa et al., 2018). Samples below agreed pathway depth are flagged before differential testing (Treichel et al., 2026).
Tools and outputs
Tools used: HUMAnN 3.9
Output: pathabundance.tsv; genefamilies.tsv; per-sample HUMAnN logs with mapping statistics
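HUMAnN tables mix community-level rows with per-species rows, which are marked by a `|` in the feature name, and differential testing normally uses only the former. A minimal sketch of that split for a single-sample table follows; the function name is hypothetical and the value is taken from the second column.

```python
def split_humann_table(table_text):
    """Split a single-sample HUMAnN table into (unstratified, stratified) dicts.

    Rows whose feature name contains '|' are per-species (stratified) rows.
    UNMAPPED and UNINTEGRATED stay in the unstratified dict under those names.
    """
    unstratified, stratified = {}, {}
    for line in table_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        feature, value = line.split("\t")[:2]
        target = stratified if "|" in feature else unstratified
        target[feature] = float(value)
    return unstratified, stratified
```

Keeping `UNMAPPED` and `UNINTEGRATED` available makes it easy to report what share of a sample's signal was actually assigned to pathways.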
7. Co-assemble and bin MAGs when scoped
MEGAHIT co-assembles host-depleted reads; MetaBAT2 bins contigs; CheckM2 assesses completeness and contamination (Li et al., 2015; Kang et al., 2019; Chklovski et al., 2023). MAG chimerism limits are documented because even high-quality MAGs may not represent a single strain (Treichel et al., 2026; Meyer et al., 2022). This step is optional and is scoped at kickoff.
Tools and outputs
Tools used: MEGAHIT 1.2.9; MetaBAT2 2.17; CheckM2 1.0.2
Output: {sample_or_cohort}.contigs.fa; {bin}.fa MAGs; mag_qc_summary.csv with completeness, contamination, and CheckM2 lineage
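Completeness and contamination estimates are often collapsed into a quality tier when summarizing bins. The sketch below uses widely cited MIMAG-style cutoffs as an assumption; the tiers actually applied in mag_qc_summary.csv may differ, and the full MIMAG high-quality definition also requires rRNA and tRNA genes, which this function does not check.

```python
def mag_quality_tier(completeness, contamination):
    """Assign a quality tier from completeness and contamination (both in percent).

    Assumed cutoffs: high   = completeness > 90 and contamination < 5
                     medium = completeness >= 50 and contamination < 10
                     low    = everything else
    """
    if completeness > 90.0 and contamination < 5.0:
        return "high"
    if completeness >= 50.0 and contamination < 10.0:
        return "medium"
    return "low"
```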
8. Compute alpha and beta diversity
Alpha diversity (Shannon, Simpson, observed richness) and beta diversity (Bray-Curtis, Aitchison distance after CLR transform) are computed on rarefied or transformed abundance tables (McMurdie & Holmes, 2013). PCoA or NMDS ordination visualizes community structure, and PERMANOVA tests community separation by metadata factors (Anderson, 2001).
Tools and outputs
Tools used: phyloseq 1.48.0; vegan 2.6-8.1
Output: alpha_diversity.csv; beta_diversity_distance_matrix.csv; PERMANOVA results table
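For readers who want the definitions behind these metrics, the following standard-library sketch spells them out. It is not the phyloseq or vegan code path used in the pipeline, and the CLR pseudocount of 0.5 is an illustrative choice for handling zeros.

```python
import math


def _proportions(counts):
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must have a positive sum")
    return [c / total for c in counts]


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over non-zero proportions."""
    return -sum(p * math.log(p) for p in _proportions(counts) if p > 0)


def gini_simpson(counts):
    """Simpson diversity reported as 1 - sum(p^2)."""
    return 1.0 - sum(p * p for p in _proportions(counts))


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    denominator = sum(x + y for x, y in zip(a, b))
    if denominator == 0:
        raise ValueError("both samples are empty")
    return sum(abs(x - y) for x, y in zip(a, b)) / denominator


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform: log values minus their mean log."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison(a, b, pseudocount=0.5):
    """Aitchison distance: Euclidean distance between CLR-transformed samples."""
    return math.dist(clr(a, pseudocount), clr(b, pseudocount))
```

Because CLR subtracts the mean log, Aitchison distance is unchanged when a sample is multiplied by a constant (with no pseudocount), which is why it suits compositional data where sequencing depth varies.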
9. Test differential abundance
Taxa and pathways are tested with ANCOM-BC or MaAsLin2, which model compositional data with covariate adjustment (Lin & Peddada, 2020; Mallick et al., 2021). P-values are adjusted with the Benjamini-Hochberg FDR procedure unless a pre-specified alternative is agreed at kickoff.
Tools and outputs
Tools used: ANCOM-BC 2.4.0; MaAsLin2 1.18.0
Output: da_results_<contrast>.csv with feature-level coefficients, standard errors, p-values, and q-values
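The q-values in the results table come from the Benjamini-Hochberg adjustment. ANCOM-BC and MaAsLin2 apply it internally; the standalone sketch below only shows the arithmetic, with a hypothetical function name.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order.

    For the p-value at ascending rank r out of n, the raw adjustment is
    p * n / r; a running minimum taken from the largest p-value downward
    enforces monotonicity and caps values at 1.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        q_values[index] = running_min
    return q_values
```

For example, the p-values 0.01, 0.04, 0.03, and 0.005 adjust to approximately 0.02, 0.04, 0.04, and 0.02.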
10. Package figures, scripts, and Methods draft
MultiQC aggregates QC metrics across samples. Figure-ready plots, commented scripts, a README, and a Methods draft listing exact tool versions and database builds are packaged according to the agreed retention policy (Meyer et al., 2022).
Tools and outputs
Tools used: R/Python plotting scripts; MultiQC 1.25.1
Output: PDF/SVG figures; final deliverable bundle with scripts, README, Methods draft, and HTML QC report