1. Validate inputs and sample metadata
Pepkio verifies BAM, CRAM, gVCF, or FASTQ integrity, confirms coordinate sorting and indexing, and checks @RG read groups required by GATK (Van der Auwera et al., 2013). Sample sex, batch, sequencing center, and reference build are recorded in sample_manifest.csv; reference-build mismatches are flagged before calling.
Tools and outputs
Tools used: md5sum; samtools quickcheck; custom validation scripts
Output: sample_manifest.csv with sample IDs, input type, reference build, read counts or BAM size, and QC flags
2. Audit alignment prerequisites
Pepkio inspects duplicate marking and base quality score recalibration (BQSR) status. Calling without MarkDuplicates or ApplyBQSR increases false-positive SNVs and indels (Van der Auwera et al., 2013; Koboldt, 2020). Missing prerequisites trigger scoped Picard MarkDuplicates and GATK BQSR re-runs or documented deviation notes.
Tools and outputs
Tools used: Picard 3.2.0 CollectAlignmentSummaryMetrics; GATK 4.6.0.0 ValidateSamFile; custom BAM audit scripts
Output: bam_prerequisite_audit.csv with duplicate-marking status, BQSR tag presence, mapping rate, and re-preprocessing recommendations
3. Assess coverage and callability
mosdepth and Picard CollectWgsMetrics or CollectHsMetrics report mean depth, breadth at ≥15×/≥20×, and on-target fraction (Rehm et al., 2021; Koboldt, 2020). Callable breadth—not mean depth alone—determines locus-level calling; SNV reproducibility plateaus above ~30× while indel reproducibility continues to improve with depth (Pan et al., 2022). Sub-threshold samples are flagged before discovery.
Tools and outputs
Tools used: mosdepth 0.3.3; Picard 3.2.0 CollectWgsMetrics or CollectHsMetrics
Output: coverage_summary.csv; mosdepth.global.dist.txt or target-coverage tables; coverage histogram and CDF plots
4. Define calling intervals and resources
Intervals are set to whole genome, client exome or panel BED, or agreed subsets. GATK resource bundle 4.6 known sites for BQSR verification and VQSR training match the reference build (Broad Institute, 2024). Interval lists and BED files are version-controlled in the deliverable bundle.
Tools and outputs
Tools used: GATK 4.6.0.0 PreprocessIntervals (when needed); custom interval validation scripts
Output: calling_intervals.list or .bed; interval_validation_report.txt
5. Call germline SNVs and indels per sample
GATK HaplotypeCaller runs in -ERC GVCF mode, emitting gVCF blocks across uncalled regions (Van der Auwera et al., 2013). DeepVariant 1.8.0 or Strelka2 2.9.10 are optional when scoped at kickoff (Poplin et al., 2018; Barbitoff et al., 2022; Kim et al., 2018). Per-sample Ti/Tv ratios are checked against ~2.0–2.1 for whole-genome germline SNVs (Koboldt, 2020).
Tools and outputs
Tools used: GATK 4.6.0.0 HaplotypeCaller; DeepVariant 1.8.0 or Strelka2 2.9.10 when scoped
Output: {sample}.g.vcf.gz and .tbi; per-sample variant count and Ti/Tv summary
6. Joint-genotype cohorts when applicable
For multi-sample projects, gVCFs import into GenomicsDB and joint-genotype with GenotypeGVCFs (Van der Auwera et al., 2013). Single-sample projects skip this step. GLnexus merge is scoped for DeepVariant cohorts when required (Regier et al., 2018).
Tools and outputs
Tools used: GATK 4.6.0.0 GenomicsDBImport; GATK GenotypeGVCFs; GLnexus when scoped for DeepVariant
Output: {cohort}.joint.vcf.gz; GenomicsDB workspace; sample count and site-level summary
7. Filter variants to high-confidence calls
When cohort size supports it, GATK VariantRecalibrator applies VQSR trained on known polymorphism resources (Broad Institute, 2024). Smaller cohorts use documented GATK hard filters with all FILTER labels retained (Koboldt, 2020). DeepVariant projects use quality thresholds documented in the Methods draft. Filtered and pass-only VCFs export separately.
Tools and outputs
Tools used: GATK 4.6.0.0 VariantRecalibrator / ApplyVQSR or VariantFiltration; bcftools 1.21
Output: {cohort}.filtered.vcf.gz; {cohort}.pass-only.vcf.gz; VQSR tranche plots or hard-filter summary table
8. QC variant call sets
Pepkio computes Ti/Tv on pass sites, variant counts by type, and FILTER distributions. Caller and aligner choice affected reproducibility more than sequencing platform in a multi-center WGS benchmark (Pan et al., 2022); multi-caller benchmarks are documented when scoped. Problematic regions (segmental duplications, homopolymers) are flagged.
Tools and outputs
Tools used: bcftools 1.21 stats; custom QC scripts; hap.py when benchmark comparison is scoped
Output: variant_qc_summary.csv; sample_qc_summary.csv; filter_summary.csv; Ti/Tv plot; FILTER status distribution chart; optional benchmark comparison table
9. Annotate variants with transcript consequences
Ensembl VEP annotates consequence terms, gene symbols, and population or clinical fields when databases are configured (McLaren et al., 2016). ClinVar, gnomAD, or client gene lists are added when scoped at kickoff.
Tools and outputs
Tools used: Ensembl VEP 112; bcftools 1.21
Output: variant_annotation_master.tsv; annotated .vcf.gz when requested
10. Package deliverables
MultiQC aggregates QC metrics when multiple stages are in scope (Ewels et al., 2016). Scripts, README, Methods draft with exact versions and filter parameters, and HTML QC report are packaged per agreed retention policy.
Tools and outputs
Tools used: MultiQC 1.25; custom packaging scripts
Output: deliverable bundle with scripts, figures, QC report, and Methods draft