1. Validate inputs, capture kit metadata, and sample manifest
Pepkio verifies FASTQ integrity (MD5 checksums), read length, paired-end structure, and the read groups (@RG tags) required by GATK (Van der Auwera et al., 2013). Capture kit bait, target, and calling intervals are recorded separately because the three files are not interchangeable (Corominas et al., 2022). Sample metadata is logged in sample_manifest.csv, and samples with read yield below threshold are flagged before alignment.
Tools and outputs
Tools used: md5sum; custom validation scripts
Output: sample_manifest.csv with sample IDs, capture kit metadata, interval file paths, read counts, and QC flags
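The checksum and manifest step could be sketched roughly as follows. This is a minimal illustration, not Pepkio's actual validation scripts: the function names, the reduced set of manifest columns, and the min_reads threshold are all assumptions made for the example.

```python
import csv
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Stream a file through MD5 so large FASTQs are never fully loaded."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_row(sample_id, fastq, expected_md5, read_count, min_reads):
    """Build one manifest record; min_reads is a hypothetical project threshold."""
    flags = []
    if md5_of(fastq) != expected_md5.lower():
        flags.append("MD5_MISMATCH")
    if read_count < min_reads:
        flags.append("LOW_YIELD")
    return {
        "sample_id": sample_id,
        "fastq": str(fastq),
        "read_count": read_count,
        "qc_flags": ";".join(flags) or "PASS",
    }


def write_manifest(rows, out_path="sample_manifest.csv"):
    """Write the collected records to a CSV manifest."""
    fields = ["sample_id", "fastq", "read_count", "qc_flags"]
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

A real manifest would also carry the capture kit metadata and interval file paths listed above; they are left out here to keep the sketch short.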
2. QC raw reads
FastQC assesses per-base quality, adapter content, and duplication; fastp trims adapters and low-quality ends when needed. Libraries with low Q30 yield or extreme adapter contamination are flagged before alignment.
Tools and outputs
Tools used: FastQC 0.12.1; fastp 0.23.4
Output: fastqc/ reports; fastp.json / fastp.html trim statistics; trimmed FASTQs when trimming is applied
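One way to turn the fastp report into a flag is to read the pre-filtering Q30 rate from fastp.json. The sketch below assumes the standard fastp JSON layout (summary, then before_filtering, then q30_rate); the 0.85 cutoff and the function names are placeholders, since the actual threshold is project-specific.

```python
import json


def q30_flag(fastp_report, min_q30_rate=0.85):
    """Return (q30_rate, flagged) from a parsed fastp JSON report.

    min_q30_rate is a placeholder; the real cutoff is project-specific.
    """
    q30_rate = fastp_report["summary"]["before_filtering"]["q30_rate"]
    return q30_rate, q30_rate < min_q30_rate


def q30_flag_from_file(path, min_q30_rate=0.85):
    """Load fastp.json from disk and apply the same check."""
    with open(path) as handle:
        return q30_flag(json.load(handle), min_q30_rate)
```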
3. Align reads to the reference genome
Paired-end reads from Illumina, MGI DNBSEQ, or other scoped platforms are aligned to GRCh38 (or another agreed build) with BWA-MEM2 using Picard/GATK-compatible settings (Vasimuddin et al., 2019; Broad Institute, 2024), producing a coordinate-sorted BAM. Mapping rate and insert-size distribution are compared against expected ranges for capture libraries.
Tools and outputs
Tools used: BWA-MEM2 2.2.1; samtools 1.21
Output: {sample}.sorted.bam; alignment summary metrics
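The alignment step might be wired together as in the sketch below, which builds a read group line and pipes bwa-mem2 into samtools sort. The helper names, the read group field values, and the thread count are assumptions for illustration; only the bwa-mem2 mem options -t and -R and the samtools sort options -@ and -o are taken from those tools.

```python
import subprocess


def read_group(sample_id, library, platform="ILLUMINA", unit="unit1"):
    """Build the @RG line; bwa-mem2 expects escaped \\t separators in -R."""
    fields = [
        f"ID:{sample_id}.{unit}",
        f"SM:{sample_id}",
        f"LB:{library}",
        f"PL:{platform}",
        f"PU:{unit}",
    ]
    return "@RG\\t" + "\\t".join(fields)


def align_commands(ref, fq1, fq2, sample_id, library, threads=8):
    """Return the aligner and sorter argument lists for one sample."""
    bwa = ["bwa-mem2", "mem", "-t", str(threads),
           "-R", read_group(sample_id, library), ref, fq1, fq2]
    sort = ["samtools", "sort", "-@", str(threads),
            "-o", f"{sample_id}.sorted.bam", "-"]
    return bwa, sort


def run_alignment(bwa, sort):
    """Pipe bwa-mem2 SAM output straight into samtools sort."""
    aligner = subprocess.Popen(bwa, stdout=subprocess.PIPE)
    sorter = subprocess.run(sort, stdin=aligner.stdout)
    aligner.stdout.close()
    if aligner.wait() != 0 or sorter.returncode != 0:
        raise RuntimeError("alignment or sorting failed")
```

Setting the read group at alignment time means the BAM already carries the @RG fields that the later GATK steps rely on.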
4. Mark duplicates and index BAMs
PCR and optical duplicates are marked with Picard MarkDuplicates so downstream callers do not double-count clustered molecules (Van der Auwera et al., 2013). Elevated duplicate rates trigger review of library prep metadata.
Tools and outputs
Tools used: Picard 3.2.0 MarkDuplicates; samtools 1.21 index
Output: {sample}.dedup.bam and .bai; duplicate metrics table
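To illustrate the duplicate-rate review, the sketch below reads the metrics table that MarkDuplicates writes and lists libraries above a ceiling. It assumes the usual Picard metrics layout (a "## METRICS CLASS" line followed by a tab-separated header and rows) and that PERCENT_DUPLICATION is stored as a fraction; the 0.30 ceiling and the function names are hypothetical.

```python
def parse_picard_metrics(text):
    """Return the metrics rows of a Picard metrics file as a list of dicts."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            header = lines[i + 1].split("\t")
            rows = []
            # Rows run until the blank line that precedes any histogram section.
            for row in lines[i + 2:]:
                if not row.strip():
                    break
                rows.append(dict(zip(header, row.split("\t"))))
            return rows
    return []


def high_duplication(rows, max_fraction=0.30):
    """List libraries above a hypothetical duplication ceiling."""
    return [
        row["LIBRARY"]
        for row in rows
        if float(row["PERCENT_DUPLICATION"]) > max_fraction
    ]
```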
5. Recalibrate base quality scores
GATK BaseRecalibrator and ApplyBQSR adjust per-base qualities using known polymorphism sites from the GATK resource bundle (Van der Auwera et al., 2013). Recalibration reports are inspected for covariate drift across cycles and read groups.
Tools and outputs
Tools used: GATK 4.6.0.0 BaseRecalibrator; GATK ApplyBQSR
Output: {sample}.recal.bam; recalibration report PDF
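A rough sketch of the two recalibration commands follows. The intermediate table name, the helper name, and the known-sites file names passed in are assumptions; the GATK options shown (-R, -I, -O, --known-sites, --bqsr-recal-file) are the ones these two tools accept.

```python
def bqsr_commands(sample, ref, known_sites, in_bam=None):
    """Return (BaseRecalibrator, ApplyBQSR) argument lists for one sample."""
    in_bam = in_bam or f"{sample}.dedup.bam"
    table = f"{sample}.recal.table"  # hypothetical intermediate file name
    recal = ["gatk", "BaseRecalibrator", "-R", ref, "-I", in_bam, "-O", table]
    # One --known-sites argument per resource VCF.
    for vcf in known_sites:
        recal += ["--known-sites", vcf]
    apply_bqsr = ["gatk", "ApplyBQSR", "-R", ref, "-I", in_bam,
                  "--bqsr-recal-file", table, "-O", f"{sample}.recal.bam"]
    return recal, apply_bqsr
```

Each list can be passed to subprocess.run(cmd, check=True) in order, so that ApplyBQSR only runs once the table exists.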
6. Assess on-target coverage and hybrid-selection metrics
Picard CollectHsMetrics and mosdepth report mean target coverage, PCT_TARGET_BASES_20X, PCT_TARGET_BASES_100X, and per-target depth distributions using kit-specific bait and target intervals (Rehder et al., 2021). On-target breadth, not genome-wide mean depth, determines callability; samples below agreed thresholds are flagged before calling. At 20× stringency, Dillon et al. (2018) estimated that the likelihood of missing a clinically relevant variant in a phenotype gene list was at most 8%.
Tools and outputs
Tools used: mosdepth 0.3.3; Picard 3.2.0 CollectHsMetrics
Output: coverage_summary.csv; hs_metrics.txt; mosdepth.targets.dist.txt; on-target coverage histogram and CDF plots
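As an example of the breadth check, the sketch below reads a mosdepth cumulative distribution file, assuming its three tab-separated columns are region name (or "total"), depth, and the proportion of bases covered at that depth or more. The 0.95 breadth requirement and the function names are placeholders for the agreed thresholds.

```python
def breadth_at(dist_text, min_depth=20, region="total"):
    """Fraction of bases in `region` covered at >= min_depth.

    Returns 0.0 when no row exists for that depth, which is the case
    when min_depth exceeds the deepest depth recorded in the file.
    """
    for line in dist_text.splitlines():
        if not line.strip():
            continue
        name, depth, proportion = line.split("\t")
        if name == region and int(depth) == min_depth:
            return float(proportion)
    return 0.0


def below_breadth(dist_text, min_depth=20, required=0.95):
    """True when on-target breadth falls short of a hypothetical requirement."""
    return breadth_at(dist_text, min_depth) < required
```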
7. Call germline SNVs and indels per sample
GATK HaplotypeCaller runs in -ERC GVCF mode with -L capture calling intervals, emitting gVCF blocks restricted to exome targets (Van der Auwera et al., 2013; Broad Institute, 2024). DeepVariant 1.8.0 is available as an alternative single-sample caller when scoped at kickoff (Poplin et al., 2018).
Tools and outputs
Tools used: GATK 4.6.0.0 HaplotypeCaller
Output: {sample}.g.vcf.gz and .tbi; per-sample variant count summary
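The per-sample call described above corresponds to a command along these lines. The helper name and the default input file name are assumptions; the options (-R, -I, -L, -ERC GVCF, -O) are those named in the text or standard for HaplotypeCaller.

```python
def haplotypecaller_command(sample, ref, intervals, in_bam=None):
    """Return the HaplotypeCaller argument list for one sample in GVCF mode."""
    return [
        "gatk", "HaplotypeCaller",
        "-R", ref,
        "-I", in_bam or f"{sample}.recal.bam",
        "-L", intervals,          # capture calling intervals, not bait or target files
        "-ERC", "GVCF",           # emit reference-confidence blocks for joint genotyping
        "-O", f"{sample}.g.vcf.gz",
    ]
```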
8. Joint-genotype cohorts when applicable
For multi-sample projects, gVCFs are imported into GenomicsDB and joint-genotyped with GenotypeGVCFs over capture intervals, rescuing variants weakly supported in individual samples (Van der Auwera et al., 2013). Single-sample projects skip this step.
Tools and outputs
Tools used: GATK 4.6.0.0 GenomicsDBImport; GATK GenotypeGVCFs
Output: {cohort}.joint.vcf.gz; GenomicsDB workspace; sample count and site-level summary
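The import and genotyping pair, including the single-sample skip, might look like the following sketch. The workspace naming and the helper name are assumptions; the options used (--genomicsdb-workspace-path, -V, -L, -R, -O, and the gendb:// prefix) are standard for these two GATK tools.

```python
def joint_genotyping_commands(cohort, gvcfs, ref, intervals, workspace=None):
    """Return [GenomicsDBImport, GenotypeGVCFs] commands, or [] for one sample."""
    if len(gvcfs) < 2:
        return []  # single-sample projects skip joint genotyping
    workspace = workspace or f"{cohort}_genomicsdb"  # hypothetical default name
    import_cmd = ["gatk", "GenomicsDBImport",
                  "--genomicsdb-workspace-path", workspace, "-L", intervals]
    # One -V argument per per-sample gVCF.
    for gvcf in gvcfs:
        import_cmd += ["-V", gvcf]
    genotype_cmd = ["gatk", "GenotypeGVCFs", "-R", ref,
                    "-V", f"gendb://{workspace}", "-L", intervals,
                    "-O", f"{cohort}.joint.vcf.gz"]
    return [import_cmd, genotype_cmd]
```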
9. Filter variants to high-confidence calls
When cohort size supports it, GATK VariantRecalibrator applies VQSR (Broad Institute, 2024); smaller cohorts use documented GATK hard filters. Dataset-specific genotype-quality filters may supplement VQSR when scoped (Carson et al., 2014).
Tools and outputs
Tools used: GATK 4.6.0.0 VariantRecalibrator / ApplyVQSR or VariantFiltration; bcftools 1.21
Output: {cohort}.filtered.vcf.gz; {cohort}.pass-only.vcf.gz; VQSR tranche plots or hard-filter summary
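For the hard-filter branch, the logic can be illustrated in plain Python. The thresholds below are the SNP starting points commonly given in GATK's hard-filtering guidance and are used here as assumptions; the values applied in a given project may differ, and in the pipeline itself the filtering is done by VariantFiltration rather than by code like this.

```python
import operator

# (filter name, INFO key, comparison that means "fail", cutoff)
SNP_HARD_FILTERS = [
    ("QD2", "QD", operator.lt, 2.0),
    ("FS60", "FS", operator.gt, 60.0),
    ("MQ40", "MQ", operator.lt, 40.0),
    ("SOR3", "SOR", operator.gt, 3.0),
    ("MQRankSum-12.5", "MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum-8", "ReadPosRankSum", operator.lt, -8.0),
]


def failed_filters(info, rules=SNP_HARD_FILTERS):
    """Names of the rules a variant fails; missing annotations are skipped here."""
    return [
        name
        for name, key, fails, cutoff in rules
        if key in info and fails(info[key], cutoff)
    ]
```

A variant with an empty result would carry PASS; any returned names would go into the FILTER column.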
10. Annotate variants and package deliverables
Ensembl VEP annotates consequences, gene symbols, and population or clinical fields when reference databases are configured (McLaren et al., 2016). Known pseudogene-prone loci (e.g., SMN1, CYP21A2, PKD1, STRC) are flagged for manual review when variants are reported (Corominas et al., 2022; Mandelker et al., 2016). MultiQC aggregates QC metrics; final scripts, README, Methods draft, and HTML QC report are packaged per agreed retention policy.
Tools and outputs
Tools used: Ensembl VEP 112; bcftools 1.21; MultiQC 1.25
Output: variant_annotation_master.tsv; MultiQC report; final deliverable bundle with scripts and Methods draft
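The manual-review flag for pseudogene-prone loci could be added to the annotation table as sketched below. The gene set contains only the examples named in the text, and the SYMBOL column name and manual_review field are assumptions about the TSV layout.

```python
import csv
import io

# Example loci named in the text; a production list would be longer.
REVIEW_GENES = {"SMN1", "CYP21A2", "PKD1", "STRC"}


def flag_for_review(tsv_text, gene_column="SYMBOL"):
    """Add a manual_review field to each row of a tab-separated annotation table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        in_list = row.get(gene_column) in REVIEW_GENES
        row["manual_review"] = "yes" if in_list else "no"
        rows.append(row)
    return rows
```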