Genomics & Variant Analysis

Whole-Exome Sequencing (WES) Analysis Service — Capture-Aware Germline Variant Calls from Raw FASTQs with On-Target Depth QC

Whole-exome sequencing (WES) discovers coding-region SNVs and indels from hybrid-capture libraries at lower cost than WGS (Corominas et al., 2022). Pepkio delivers version-pinned FASTQ-to-VCF analysis with on-target depth QC and bespoke workflow support for academic, biotech, and pharma clients. Depth targets follow published lab QC standards (~75–100× mean on-target depth, with ≥95% of target bases covered at ≥10×; Rehder et al., 2021). Scripts, figures, and a Methods draft are included.

Key facts

Key facts about Whole-Exome Sequencing
| Fact | Value |
| --- | --- |
| Supported platforms / instruments | Illumina NovaSeq X / 6000 / NextSeq 2000, HiSeq 2500/4000; MGI DNBSEQ-T7 / G400 / G99 when scoped at kickoff; Agilent SureSelect, Twist Exome 2.0, IDT xGen Exome, Roche KAPA HyperExome capture kits when bait/target interval files are provided; Element Biosciences AVITI and Ultima Genomics UG100 when scoped at kickoff |
| Input requirements | Paired-end FASTQ (≥2×100 bp or 2×150 bp typical); germline ≥50–100× on-target depth for research (75–100× mean on-target typical for lab-QC-aligned projects per Rehder et al., 2021); somatic tumor–normal scoped separately (≥200× tumor typical, confirmed at kickoff). Capture kit bait, target, and calling .interval_list files required. Cohort VQSR: typically ≥30 exomes or one high-quality WGS per GATK guidance (Broad Institute, 2024) |
| Reference builds supported | Human GRCh38 primary (GATK resource bundle 4.6); legacy GRCh37/hg19 on request; mouse GRCm39; custom references scoped at kickoff |
| Primary tools (with versions) | BWA-MEM2 2.2.1; GATK 4.6.0.0; Picard 3.2.0; samtools 1.21; bcftools 1.21; mosdepth 0.3.3; Ensembl VEP 112; fastp 0.23.4; FastQC 0.12.1; MultiQC 1.25. DeepVariant 1.8.0 optional |
| Typical turnaround time | 2–4 weeks (single-sample germline); 4–8 weeks (multi-sample cohort with joint genotyping); confirmed at kickoff |
| Deliverable formats | .bam, .g.vcf.gz, filtered .vcf.gz, hs_metrics.txt, annotation .tsv; PDF/SVG figures; HTML QC report; documented bash/R/Python scripts; Methods draft |
| Key cited best-practice references | Van der Auwera et al. (2013), Current Protocols in Bioinformatics; Rehder et al. (2021), Genetics in Medicine; Kong et al. (2018), Genetics in Medicine |
| Custom / bespoke analysis | Non-standard inputs, outputs, and methods scoped at kickoff, e.g., client BAMs, custom gene lists, off-target calling, pedigree filtering, GWAS-ready plink exports, or tumor–normal somatic extensions |

What is whole-exome sequencing (WES)?

WES aligns hybrid-capture-enriched short reads, recalibrates base qualities, and calls SNVs and indels within exome target intervals—not across the full genome. Unlike WGS, WES concentrates depth on coding exons and adjacent intronic regions at lower per-base cost; unlike fixed gene panels, WES surveys most protein-coding genes in one assay (Rehder et al., 2021). In published clinical vendor benchmarks at ≥120× mean depth, SNV sensitivity was 98.9–99.9% and analytic PPV exceeded 99.1% for SNVs and homozygous indels; heterozygous indels showed lower accuracy (Kong et al., 2018). Pepkio starts from FASTQs or client BAMs and returns filtered VCFs with capture-kit–aware on-target QC. Custom inputs and deliverables are agreed at kickoff. See the whole-exome sequencing glossary.

When should you use whole-exome sequencing (WES)?

WES fits when coding-region SNV/indel discovery is the primary goal and uniform non-coding coverage is not required. The table contrasts WES with WGS and targeted gene panels.

Comparison of WES, WGS, and targeted gene panels
| Approach | Best for | Limitations | Approximate cost range |
| --- | --- | --- | --- |
| WES (capture) | Coding-region variant discovery, rare-disease gene identification, cancer predisposition screening at lower sequencing cost than WGS | Gaps in non-coding, mitochondrial, and poorly captured targets; pseudogene mis-mapping in homologous regions (Rehm et al., 2021; Corominas et al., 2022) | Lower sequencing cost than WGS; capture kit and on-target depth drive sensitivity |
| WGS (short-read) | Non-coding variant discovery, uniform genome-wide coverage, CNV/SV add-ons without capture bias | Higher per-sample sequencing and storage cost than WES | Library prep + sequencing + bioinformatics vary by depth and cohort size |
| Targeted gene panel | Known gene sets, ultra-high depth for low-VAF mosaicism in specific loci | Limited to panel genes; misses novel loci outside design | Lowest per-sample sequencing cost; panel breadth drives sensitivity |

  • Undiagnosed monogenic disease: Dillon et al. (2018) found WES diagnoses in genes absent from at least one of three commercial panels in 42% of WES-positive children.
  • Cancer predisposition in BRCA-negative families: Subramanian et al. (2020) sequenced germline exomes from 516 BRCA1/2-negative high-grade serous ovarian carcinoma cases to 126× mean depth with 98.4% of bases >20×, screening rare LoF variants across 1,307 genes.
  • Rare germline susceptibility in case-control cohorts: Liu et al. (2021) validated rare deleterious germline variants in ATM and MPZL2 and three novel lung cancer loci from WES of 1,045 cases and 885 controls, with replication in 26,803 cases and 555,107 controls.

How the analysis works — step by step

  1. Validate inputs, capture kit metadata, and sample manifest

    Pepkio verifies FASTQ integrity (MD5 checksums), read length, paired-end structure, and the read-group (@RG) metadata required by GATK (Van der Auwera et al., 2013). Capture kit bait, target, and calling intervals are recorded as separate files because they are not interchangeable (Corominas et al., 2022). Sample metadata is logged in sample_manifest.csv; sub-threshold yield is flagged before alignment.

    Tools and outputs

    Tools used: md5sum; custom validation scripts

    Output: sample_manifest.csv with sample IDs, capture kit metadata, interval file paths, read counts, and QC flags
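
The checksum part of this step can be sketched in Python. This is a minimal illustration, not Pepkio's validation script: the function names `md5_of_file` and `verify_checksums`, and the assumption that expected checksums arrive as a mapping from file path to hex digest, are made up for the example.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(expected):
    """Compare expected digests (path -> hex MD5) against the files on disk.

    Returns a dict mapping each path to "ok", "mismatch", or "missing".
    The input mapping is a hypothetical stand-in for a provider's MD5 manifest.
    """
    status = {}
    for path, want in expected.items():
        if not Path(path).is_file():
            status[path] = "missing"
        elif md5_of_file(path) == want.lower():
            status[path] = "ok"
        else:
            status[path] = "mismatch"
    return status
```

Any path reported as "mismatch" or "missing" would be a natural candidate for a QC flag in sample_manifest.csv before alignment starts.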

  2. QC raw reads

    FastQC assesses per-base quality, adapter content, and duplication; fastp trims adapters and low-quality ends when needed. Libraries with low Q30 yield or extreme adapter contamination are flagged before alignment.

    Tools and outputs

    Tools used: FastQC 0.12.1; fastp 0.23.4

    Output: fastqc/ reports; fastp.json / fastp.html trim statistics; trimmed FASTQs when trimming is applied
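
To make the "Q30 yield" criterion concrete, the sketch below computes the fraction of bases at or above Phred 30 from FASTQ quality lines. It assumes Phred+33 encoding and plain four-line FASTQ records; the function names are illustrative, and FastQC and fastp report this metric themselves, so this is only a way to see what the number means.

```python
def fastq_quality_lines(lines):
    """Yield the quality line (4th line of every 4-line record) from FASTQ text lines."""
    for index, line in enumerate(lines):
        if index % 4 == 3:
            yield line.rstrip("\n")


def q30_fraction(quality_strings, phred_offset=33):
    """Fraction of bases with Phred quality >= 30 across the given quality strings."""
    total = 0
    high = 0
    for qual in quality_strings:
        for char in qual:
            total += 1
            # Phred score = ASCII code minus the encoding offset (33 assumed here).
            if ord(char) - phred_offset >= 30:
                high += 1
    return high / total if total else 0.0
```

For a file on disk, `q30_fraction(fastq_quality_lines(open(path)))` would give the per-file value; the cutoff below which a library is flagged is a project decision and is not encoded here.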

  3. Align reads to the reference genome

    Paired-end reads from Illumina, MGI DNBSEQ, or other scoped platforms are aligned to GRCh38 (or agreed build) with BWA-MEM2 using Picard/GATK-compatible settings (Vasimuddin et al., 2019; Broad Institute, 2024), producing coordinate-sorted BAM. Mapping rate and insert-size distribution are compared against expected ranges for capture libraries.

    Tools and outputs

    Tools used: BWA-MEM2 2.2.1; samtools 1.21

    Output: {sample}.sorted.bam; alignment summary metrics
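
One way to assemble the alignment command is sketched below. It only builds the shell pipeline as a string; it does not run anything. The read-group field choices (using the platform unit as the ID), the thread count, and both function names are assumptions for illustration. The `-t` and `-R` options of `bwa-mem2 mem` and the `-@` and `-o` options of `samtools sort` are standard, but the exact settings used in a given project may differ.

```python
import shlex


def read_group(sample, library, platform_unit, platform="ILLUMINA"):
    """Build an @RG header line with literal '\\t' separators, as the -R option expects."""
    fields = [
        f"ID:{platform_unit}",
        f"SM:{sample}",
        f"LB:{library}",
        f"PU:{platform_unit}",
        f"PL:{platform}",
    ]
    return "@RG\\t" + "\\t".join(fields)


def alignment_pipeline(reference, fastq1, fastq2, rg_line, out_bam, threads=8):
    """Return a shell pipeline string: bwa-mem2 mem piped into samtools sort."""
    align = ["bwa-mem2", "mem", "-t", str(threads), "-R", rg_line,
             reference, fastq1, fastq2]
    sort = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    # shlex.join quotes each argument so the @RG string survives the shell intact.
    return " | ".join(shlex.join(cmd) for cmd in (align, sort))
```

Building the command from a list and quoting it with `shlex.join` avoids the common failure where the tab escapes in the read-group string are mangled by the shell.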

  4. Mark duplicates and index BAMs

    PCR and optical duplicates are marked with Picard MarkDuplicates so downstream callers do not double-count clustered molecules (Van der Auwera et al., 2013). Elevated duplicate rates trigger review of library prep metadata.

    Tools and outputs

    Tools used: Picard 3.2.0 MarkDuplicates; samtools 1.21 index

    Output: {sample}.dedup.bam and .bai; duplicate metrics table
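
The idea behind duplicate marking can be shown with a toy model. This is a deliberately simplified sketch, not Picard's algorithm: reads are plain dictionaries with hypothetical keys, grouping uses a simple position key, and optical-duplicate detection from tile coordinates is not modeled at all.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    Each read is a dict with keys: name, chrom, start, strand, mate_start,
    and base_quality_sum. Reads sharing the same position key form one
    duplicate set; the read with the highest base-quality sum is kept.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["start"], read["strand"], read["mate_start"])
        groups[key].append(read)

    duplicates = set()
    for members in groups.values():
        best = max(members, key=lambda r: r["base_quality_sum"])
        duplicates.update(r["name"] for r in members if r is not best)
    return duplicates


def duplicate_rate(reads):
    """Fraction of reads flagged as duplicates under the toy model above."""
    return len(mark_duplicates(reads)) / len(reads) if reads else 0.0
```

Under this model, a library in which many reads collapse onto the same position keys yields a high duplicate rate, which is the signal that prompts a look at library prep metadata.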

  5. Recalibrate base quality scores

    GATK BaseRecalibrator and ApplyBQSR adjust per-base qualities using known polymorphism sites from the GATK resource bundle (Van der Auwera et al., 2013). Recalibration reports are inspected for covariate drift across cycles and read groups.

    Tools and outputs

    Tools used: GATK 4.6.0.0 BaseRecalibrator; GATK ApplyBQSR

    Output: {sample}.recal.bam; recalibration report PDF
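
The principle behind recalibration is a comparison of reported and empirically observed error rates within a covariate bin (for example, one read group at one machine cycle). The sketch below uses a simple smoothed estimate; it is an assumption-laden illustration and not the estimator GATK implements.

```python
import math


def empirical_quality(observations, mismatches):
    """Phred-scaled empirical error rate for one bin.

    Mismatches are counted at sites not in the known-polymorphism set.
    The +1/+2 smoothing (an assumption of this sketch) avoids log(0).
    """
    error_rate = (mismatches + 1) / (observations + 2)
    return -10 * math.log10(error_rate)


def recalibration_shift(reported_quality, observations, mismatches):
    """Empirical minus reported quality; negative means the instrument was overconfident."""
    return empirical_quality(observations, mismatches) - reported_quality
```

A bin reported at Q35 that shows 9 mismatches in 998 observations has an empirical quality near Q20, so its shift is about -15. Large shifts concentrated in particular cycles or read groups are the kind of covariate drift the recalibration report is inspected for.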

  6. Assess on-target coverage and hybrid-selection metrics

    Picard CollectHsMetrics and mosdepth report mean target coverage, PCT_TARGET_BASES_20X, PCT_TARGET_BASES_100X, and per-target depth distributions using kit-specific bait and target intervals (Rehder et al., 2021). On-target breadth, not genome-wide mean depth, determines callability; samples below agreed thresholds are flagged before calling. At 20× stringency, Dillon et al. (2018) estimated that the likelihood of missing a clinically relevant variant in a phenotype gene list was at most 8%.

    Tools and outputs

    Tools used: mosdepth 0.3.3; Picard 3.2.0 CollectHsMetrics

    Output: coverage_summary.csv; hs_metrics.txt; mosdepth.targets.dist.txt; on-target coverage histogram and CDF plots
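
Flagging a sample from hs_metrics.txt can be sketched as follows. The parser relies on the usual Picard metrics layout (a "## METRICS CLASS" line followed by a tab-separated header row and a data row). The threshold defaults are placeholders chosen for the example, since actual thresholds are agreed per project, and the function names are illustrative.

```python
def parse_picard_metrics(text):
    """Return the first metrics row of a Picard metrics file as a dict of strings."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            header = lines[index + 1].split("\t")
            values = lines[index + 2].split("\t")
            return dict(zip(header, values))
    raise ValueError("no METRICS CLASS section found")


def coverage_flags(metrics, min_mean=75.0, min_pct_20x=0.90):
    """Return QC flags for one sample; both thresholds are placeholder values."""
    flags = []
    if float(metrics["MEAN_TARGET_COVERAGE"]) < min_mean:
        flags.append("low_mean_target_coverage")
    if float(metrics["PCT_TARGET_BASES_20X"]) < min_pct_20x:
        flags.append("low_breadth_20x")
    return flags
```

Keeping the mean-depth and breadth checks separate reflects the point above: a sample can reach an acceptable mean while still leaving a large share of target bases under 20×.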

  7. Call germline SNVs and indels per sample

    GATK HaplotypeCaller runs in -ERC GVCF mode with -L capture calling intervals, emitting gVCF blocks restricted to exome targets (Van der Auwera et al., 2013; Broad Institute, 2024). DeepVariant 1.8.0 is available as an alternative single-sample caller when scoped at kickoff (Poplin et al., 2018).

    Tools and outputs

    Tools used: GATK 4.6.0.0 HaplotypeCaller

    Output: {sample}.g.vcf.gz and .tbi; per-sample variant count summary
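
The per-sample call described above maps onto a short command line. The sketch below builds the argument list with the `-ERC GVCF` and `-L` options named in this step and adds a small sanity check on the calling interval list. File names, function names, and the decision to omit any other HaplotypeCaller options are assumptions of the example.

```python
def interval_territory(interval_list_text):
    """Sum the bases in a Picard-style interval list (1-based, inclusive coordinates).

    Overlapping intervals are counted twice; this sketch does not merge them.
    """
    total = 0
    for line in interval_list_text.splitlines():
        if not line or line.startswith("@"):
            continue  # skip blank lines and the SAM-style header
        _chrom, start, end = line.split("\t")[:3]
        total += int(end) - int(start) + 1
    return total


def haplotypecaller_gvcf_command(reference, bam, intervals, out_gvcf):
    """Return the argument list for per-sample gVCF calling over capture intervals."""
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-L", intervals,
        "-ERC", "GVCF",
        "-O", out_gvcf,
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)`. Checking the interval territory first is a cheap way to catch an empty or wrong-kit interval file before a long calling job starts.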

  8. Joint-genotype cohorts when applicable

    For multi-sample projects, gVCFs are imported into GenomicsDB and joint-genotyped with GenotypeGVCFs over capture intervals, rescuing variants weakly supported in individual samples (Van der Auwera et al., 2013). Single-sample projects skip this step.

    Tools and outputs

    Tools used: GATK 4.6.0.0 GenomicsDBImport; GATK GenotypeGVCFs

    Output: {cohort}.joint.vcf.gz; GenomicsDB workspace; sample count and site-level summary
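
For cohorts, the two GATK calls can be prepared as shown below. The sketch writes the tab-separated sample map that GenomicsDBImport accepts through `--sample-name-map` and builds both argument lists. Paths, function names, and the omission of performance-related options are assumptions for illustration.

```python
def sample_map_text(gvcf_by_sample):
    """Render a sample map: one 'sample<TAB>gVCF path' line per sample, sorted by name."""
    return "".join(
        f"{sample}\t{path}\n" for sample, path in sorted(gvcf_by_sample.items())
    )


def joint_genotyping_commands(sample_map, workspace, intervals, reference, out_vcf):
    """Return (import_cmd, genotype_cmd) argument lists for joint genotyping."""
    import_cmd = [
        "gatk", "GenomicsDBImport",
        "--sample-name-map", sample_map,
        "--genomicsdb-workspace-path", workspace,
        "-L", intervals,
    ]
    genotype_cmd = [
        "gatk", "GenotypeGVCFs",
        "-R", reference,
        "-V", f"gendb://{workspace}",  # read variants from the GenomicsDB workspace
        "-L", intervals,
        "-O", out_vcf,
    ]
    return import_cmd, genotype_cmd
```

Using a dictionary keyed by sample name makes duplicate sample IDs impossible by construction, which matters because the import step expects each sample name to be unique.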

  9. Filter variants to high-confidence calls

    When cohort size supports it, GATK VariantRecalibrator applies VQSR (Broad Institute, 2024); smaller cohorts use documented GATK hard filters. Dataset-specific genotype-quality filters may supplement VQSR when scoped (Carson et al., 2014).

    Tools and outputs

    Tools used: GATK 4.6.0.0 VariantRecalibrator / ApplyVQSR or VariantFiltration; bcftools 1.21

    Output: {cohort}.filtered.vcf.gz; {cohort}.pass-only.vcf.gz; VQSR tranche plots or hard-filter summary
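
The filtering decision and the hard-filter logic can be sketched in a few lines. The thresholds below follow the commonly cited GATK hard-filtering starting points for germline SNPs and indels; they are included as an assumption of this example and may not match the values used in a specific project. The 30-exome cutoff mirrors the guidance quoted on this page, and all names are illustrative.

```python
import operator

# (filter name, INFO annotation, comparison, threshold): a site fails when the comparison is true.
SNP_FILTERS = [
    ("QD2", "QD", operator.lt, 2.0),
    ("FS60", "FS", operator.gt, 60.0),
    ("MQ40", "MQ", operator.lt, 40.0),
    ("SOR3", "SOR", operator.gt, 3.0),
    ("MQRankSum-12.5", "MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum-8", "ReadPosRankSum", operator.lt, -8.0),
]
INDEL_FILTERS = [
    ("QD2", "QD", operator.lt, 2.0),
    ("FS200", "FS", operator.gt, 200.0),
    ("SOR10", "SOR", operator.gt, 10.0),
    ("ReadPosRankSum-20", "ReadPosRankSum", operator.lt, -20.0),
]


def failed_filters(annotations, is_snp):
    """Return names of hard filters a site fails; missing annotations never fail."""
    rules = SNP_FILTERS if is_snp else INDEL_FILTERS
    failed = []
    for name, key, compare, threshold in rules:
        value = annotations.get(key)
        if value is not None and compare(value, threshold):
            failed.append(name)
    return failed


def filtering_strategy(n_exomes, has_wgs=False, min_exomes=30):
    """Choose VQSR for sufficiently large cohorts, hard filters otherwise."""
    return "VQSR" if has_wgs or n_exomes >= min_exomes else "hard_filters"
```

A site with an empty result from `failed_filters` would carry PASS in the FILTER column; anything else would be labeled with the failed filter names rather than removed, so the pass-only VCF can be derived afterward.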

  10. Annotate variants and package deliverables

    Ensembl VEP annotates consequences, gene symbols, and population or clinical fields when reference databases are configured (McLaren et al., 2016). Known pseudogene-prone loci (e.g., SMN1, CYP21A2, PKD1, STRC) are flagged for manual review when variants are reported (Corominas et al., 2022; Mandelker et al., 2016). MultiQC aggregates QC metrics; final scripts, README, Methods draft, and HTML QC report are packaged per agreed retention policy.

    Tools and outputs

    Tools used: Ensembl VEP 112; bcftools 1.21; MultiQC 1.25

    Output: variant_annotation_master.tsv; MultiQC report; final deliverable bundle with scripts and Methods draft
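
Flagging homology-prone loci in the annotation table is a simple lookup, sketched below. The gene set contains only the four examples named in this step, the `REVIEW_FLAG` column name is hypothetical, and the input is assumed to be a tab-separated table with a SYMBOL column as in variant_annotation_master.tsv.

```python
import csv
import io

# Example genes named in this step; a real project list would likely be longer.
HOMOLOGY_PRONE_GENES = {"SMN1", "CYP21A2", "PKD1", "STRC"}


def flag_homology_prone(tsv_text, genes=HOMOLOGY_PRONE_GENES):
    """Add a REVIEW_FLAG column marking variants in homology-prone genes.

    Returns a list of row dicts; REVIEW_FLAG is "homology_review" or "".
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        row["REVIEW_FLAG"] = "homology_review" if row["SYMBOL"] in genes else ""
        rows.append(row)
    return rows
```

The flag does not change the call itself; it only marks rows where mis-mapped reads are plausible so they receive manual review before being reported.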

What Pepkio delivers

Processed data files

  • .bam/.bai, recalibrated BAM, per-sample .g.vcf.gz, joint and filtered .vcf.gz (cohorts)
  • variant_annotation_master.tsv, coverage_summary.csv, hs_metrics.txt, sample_qc_summary.csv

Figures (PDF/SVG)

  • FastQC heatmaps, on-target depth histogram/CDF, insert-size distribution
  • Ti/Tv summary, variant consequence bar chart
  • VQSR tranche curves when applicable, per-target coverage heatmap when scoped

Tables

  • variant_annotation_master.tsv (CHROM, POS, REF, ALT, QUAL, FILTER, SYMBOL, Consequence, IMPACT, plus gnomAD/ClinVar when configured)
  • sample_qc_summary.csv (mapping rate, mean target coverage, pct targets ≥20×/≥100×, duplicate rate, pass-variant counts)

Code

  • Commented bash, R, and Python scripts with environment lock files
  • Delivery via private Git or agreed file transfer

Documentation

  • HTML/PDF QC report, README, Methods draft with software versions and capture kit intervals
  • Post-delivery reviewer support within agreed scope (typically ≤20% of deliverables)

Technical decisions we make — and why

On-target QC: CollectHsMetrics default
Exome projects use hybrid-selection metrics on kit-specific intervals, not CollectWgsMetrics (Rehder et al., 2021; Broad Institute, 2024).

Calling intervals: kit-specific lists
Calling is restricted to capture intervals; genome-wide calling inflates off-target noise (Corominas et al., 2022). Off-target calling is scoped separately.

Caller: GATK HaplotypeCaller gVCF + joint genotyping
gVCF mode supports cohort joint genotyping via GenomicsDBImport (Van der Auwera et al., 2013; Regier et al., 2018). DeepVariant 1.8.0 is optional for single-sample projects (Poplin et al., 2018).

Filtering: VQSR or hard filters
VQSR is used when cohort size supports it (≥1 WGS or ~30 exomes; Broad Institute, 2024); hard filters are used otherwise. Supplemental GQ filters are added when scoped (Carson et al., 2014).

Pseudogene-prone loci flagged for review
Homologous genes (SMN1, CYP21A2, PKD1, STRC) produce mis-mapped reads in short-read WES (Corominas et al., 2022; Mandelker et al., 2016). Flagged loci are documented; orthogonal validation is scoped when indicated.

Common questions

What is the minimum on-target depth and sample count for WES analysis?

For germline SNV/indel discovery on GRCh38, Pepkio recommends ≥50–100× on-target depth for research cohorts and ~75–100× mean on-target depth for lab-QC-aligned projects (Rehder et al., 2021). VQSR typically requires at least one WGS or ~30 exomes (Broad Institute, 2024). Thresholds are confirmed at kickoff.

Can you analyze low-quality or low-yield WES libraries?

Yes, with caveats. Low Q30 yield or mean on-target depth below ~20× reduces sensitivity for heterozygous indels and rare variants (Kong et al., 2018). Sub-threshold samples are flagged; re-sequencing or partial analysis on priority gene lists is discussed at kickoff.

Do you support Illumina, MGI DNBSEQ, and Agilent, Twist, IDT, and KAPA capture kits?

Illumina NovaSeq X, NovaSeq 6000, NextSeq 2000, and HiSeq FASTQs use the standard BWA-MEM2 + GATK workflow. MGI DNBSEQ-T7, G400, and G99 FASTQs are processed when scoped at kickoff, with adapter/QC validation in the report. Agilent SureSelect, Twist Exome 2.0, IDT xGen Exome, and KAPA HyperExome require kit-specific interval files. Element AVITI and Ultima UG100 data are supported when scoped at kickoff.

How long does WES analysis take at Pepkio?

Single-sample germline projects typically complete in 2–4 weeks; multi-sample cohorts with joint genotyping and VQSR typically require 4–8 weeks. Timelines are confirmed at kickoff.

How do you handle batch effects and joint genotyping in multi-batch exome cohorts?

Version-pinned pipelines aligned with functional-equivalence principles reduce batch-driven call differences (Regier et al., 2018). Pepkio stratifies sequencing center, flowcell, and capture batch in QC reports. Joint genotyping via gVCF aggregation improves sensitivity within capture targets (Van der Auwera et al., 2013). Batch-specific correction beyond standard QC is scoped at kickoff.

Do I own the code — and in what format is it delivered?

Yes — you retain full ownership of code, scripts, and results. Pepkio delivers commented bash, R, and Python scripts with environment lock files. Jupyter or R Markdown delivery is available on request.

Can I be involved during analysis?

Yes. Checkpoint reviews occur after alignment QC, on-target coverage assessment, and before final delivery. A PhD-level scientific contact leads the project.

What does post-delivery reviewer support include?

Methods clarification, QC thresholds, and minor figure or table revisions within agreed scope (typically ≤20% of deliverables). Methods and Supplementary drafts included; substantial new requests scoped separately.

Is co-authorship required?

No. Pepkio does not require co-authorship unless explicitly discussed. Acknowledgment of bioinformatics support is standard practice.

Do you need bait and target interval files from my sequencing provider?

Yes. Bait, target, and calling .interval_list files must match your capture kit version (Corominas et al., 2022). Mismatched intervals produce incorrect coverage and calls. Kit metadata is documented in the Methods draft.

Can you call variants outside capture targets (off-target reads)?

Default calling is restricted to capture intervals. Off-target or genome-wide calling is a separately scoped milestone; depth and sensitivity outside targets are documented in the QC report.

Can Pepkio perform tumor–normal paired exome or custom non-standard WES analyses?

Yes. Tumor–normal somatic exome and bespoke workflows are scoped at kickoff, including client BAMs, custom gene lists, pedigree filtering, plink exports, or CNV integration. Somatic projects typically require ≥200× tumor on-target depth and distinct callers.

Related services

  • Whole-genome sequencing: Uniform genome-wide coverage when non-coding variants or off-target gaps in WES are a concern.
  • Variant calling: Caller selection, filter tuning, and joint genotyping when alignment is already complete.
  • CNV and structural variation: Exonic copy-number and structural variant calling from WES alignments with capture-aware binning.
  • Long-read DNA sequencing: Phased variants and structural events in pseudogene-prone or repeat-rich loci that short-read WES cannot fully resolve.
  • Custom consulting: Pre-sequencing depth, capture kit selection, and cohort-size planning before library prep.

References

  1. Van der Auwera GA, Carneiro MO, Hartl C, et al. From FastQ data to high-confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Current Protocols in Bioinformatics. 2013;43(1):11.10.1–11.10.33. https://doi.org/10.1002/0471250953.bi1110s43 (PMID: 25431634)
  2. Rehder CE, Bean LJ, Bick D, et al. Next-generation sequencing for constitutional variants in the clinical laboratory, 2021 revision: a technical standard of the American College of Medical Genetics and Genomics (ACMG). Genetics in Medicine. 2021;23(8):1399–1415. https://doi.org/10.1038/s41436-021-01139-4 (PMID: 33927380)
  3. Rehm HL, Bale SJ, Bayrak-Toydemir P, et al. Best practices for the analytical validation of clinical whole-genome sequencing intended for the diagnosis of germline disease. npj Genomic Medicine. 2021;6(1):47. https://doi.org/10.1038/s41525-020-00154-9 (PMID: 33110627)
  4. Regier AA, Farjoun Y, Larson DE, et al. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. Nature Communications. 2018;9:4038. https://doi.org/10.1038/s41467-018-06159-4 (PMID: 30279509)
  5. Kong SW, Lee IH, Liu X, et al. Measuring coverage and accuracy of whole-exome sequencing in clinical context. Genetics in Medicine. 2018;20(10):1244–1250. https://doi.org/10.1038/gim.2017.269 (PMID: 29789557)
  6. Dillon OJ, Lunke S, Stark Z, et al. Exome sequencing has higher diagnostic yield compared to simulated disease-specific panels in children with suspected monogenic disorders. European Journal of Human Genetics. 2018;26(7):974–984. https://doi.org/10.1038/s41431-018-0099-1 (PMID: 29453417)
  7. Corominas J, Smeekens SP, Nelen MR, et al. Clinical exome sequencing—mistakes and caveats. Human Mutation. 2022;43(8):976–1000. https://doi.org/10.1002/humu.24360 (PMID: 35191116)
  8. Subramanian DN, Zethoven M, McInerny S, et al. Exome sequencing of familial high-grade serous ovarian carcinoma reveals heterogeneity for rare candidate susceptibility genes. Nature Communications. 2020;11:1640. https://doi.org/10.1038/s41467-020-15461-z (PMID: 32242007)
  9. Liu Y, Xia J, McKay J, et al. Rare deleterious germline variants and risk of lung cancer. npj Precision Oncology. 2021;5:82. https://doi.org/10.1038/s41698-021-00146-7 (PMID: 33594163)
  10. McLaren W, Gil L, Hunt SE, et al. The Ensembl Variant Effect Predictor. Genome Biology. 2016;17(1):122. https://doi.org/10.1186/s13059-016-0974-4 (PMID: 27268795)
  11. Poplin R, Chang PC, Alexander D, et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology. 2018;36(10):983–987. https://doi.org/10.1038/nbt.4235 (PMID: 30247488)
  12. Mandelker D, Schmidt RJ, Ankala A, et al. Navigating highly homologous genes in a molecular diagnostic setting: a resource for clinical next-generation sequencing. Genetics in Medicine. 2016;18(12):1282–1289. https://doi.org/10.1038/gim.2016.58 (PMID: 27228465)
  13. Carson AR, Smith EN, Matsui H, et al. Effective filtering strategies to improve data quality from population-based whole exome sequencing studies. BMC Bioinformatics. 2014;15:125. https://doi.org/10.1186/1471-2105-15-125 (PMID: 24884706)
  14. Vasimuddin M, Misra S, Li H, Aluru S. Efficient architecture-aware acceleration of BWA-MEM for multicore systems. IEEE IPDPS. 2019. https://doi.org/10.1109/IPDPS.2019.00041 (BWA-MEM2)
  15. Broad Institute. GATK 4.6.0.0 release notes, exome germline pipeline, and VQSR documentation. 2024. https://github.com/broadinstitute/gatk/releases/tag/4.6.0.0; https://broadinstitute.github.io/warp/docs/Pipelines/Exome_Germline_Single_Sample_Pipeline/README; https://gatk.broadinstitute.org/hc/en-us/articles/360035531612-Variant-Quality-Score-Recalibration-VQSR
  16. Ensembl. VEP documentation. 2024. https://www.ensembl.org/info/docs/tools/vep/index.html

Let's Talk About Your Science

Tell us:

  • Your biological question
  • Data type and size
  • Timeline constraints

We'll tell you:

  • What's feasible
  • How long it will take
  • Exactly what it will cost

Contact us to start with a free consultation. Need everyday bench calculators? Try our free lab tools.