Genomics & Variant Analysis

Whole-Exome Sequencing (WES) Analysis Service — Capture-Aware Germline Variant Calls from Raw FASTQs with On-Target Depth QC

Whole-exome sequencing (WES) discovers coding-region SNVs and indels from hybrid-capture libraries at lower cost than WGS (Corominas et al., 2022). Pepkio delivers version-pinned FASTQ-to-VCF analysis with on-target depth QC and bespoke workflow support for academic, biotech, and pharma clients. Depth targets follow published laboratory QC standards: roughly 75–100× mean on-target depth, so that ≥95% of target bases reach ≥10× (Rehder et al., 2021). Scripts, figures, and a Methods draft are included.

Key facts

Key facts about Whole-Exome Sequencing
Fact	Value
Supported platforms / instruments	Illumina NovaSeq X / 6000 / NextSeq 2000, HiSeq 2500/4000; MGI DNBSEQ-T7 / G400 / G99 when scoped at kickoff; Agilent SureSelect, Twist Exome 2.0, IDT xGen Exome, Roche KAPA HyperExome capture kits when bait/target interval files are provided; Element Biosciences AVITI and Ultima Genomics UG100 when scoped at kickoff
Input requirements	Paired-end FASTQ (≥2×100 bp or 2×150 bp typical); germline ≥50–100× on-target depth for research (75–100× mean on-target typical for lab-QC-aligned projects per Rehder et al., 2021); somatic tumor–normal scoped separately (≥200× tumor typical, confirmed at kickoff). Capture kit bait, target, and calling .interval_list files required. Cohort VQSR: typically ≥30 exomes or one high-quality WGS per GATK guidance (Broad Institute, 2024)
Reference builds supported	Human GRCh38 primary (GATK resource bundle 4.6); legacy GRCh37/hg19 on request; mouse GRCm39; custom references scoped at kickoff
Primary tools (with versions)	BWA-MEM2 2.2.1; GATK 4.6.0.0; Picard 3.2.0; samtools 1.21; bcftools 1.21; mosdepth 0.3.3; Ensembl VEP 112; fastp 0.23.4; FastQC 0.12.1; MultiQC 1.25. DeepVariant 1.8.0 optional
Typical turnaround time	2–4 weeks (single-sample germline); 4–8 weeks (multi-sample cohort with joint genotyping) — confirmed at kickoff
Deliverable formats	.bam, .g.vcf.gz, filtered .vcf.gz, hs_metrics.txt, annotation .tsv; PDF/SVG figures; HTML QC report; documented bash/R/Python scripts; Methods draft
Key cited best-practice reference	Van der Auwera et al. (2013), Current Protocols in Bioinformatics; Rehder et al. (2021), Genetics in Medicine; Kong et al. (2018), Genetics in Medicine
Custom / bespoke analysis	Non-standard inputs, outputs, and methods scoped at kickoff—e.g., client BAMs, custom gene lists, off-target calling, pedigree filtering, GWAS-ready plink exports, or tumor–normal somatic extensions

What is whole-exome sequencing (WES)?

WES aligns hybrid-capture-enriched short reads, recalibrates base qualities, and calls SNVs and indels within exome target intervals—not across the full genome. Unlike WGS, WES concentrates depth on coding exons and adjacent intronic regions at lower per-base cost; unlike fixed gene panels, WES surveys most protein-coding genes in one assay (Rehder et al., 2021). In published clinical vendor benchmarks at ≥120× mean depth, SNV sensitivity was 98.9–99.9% and analytic PPV exceeded 99.1% for SNVs and homozygous indels; heterozygous indels showed lower accuracy (Kong et al., 2018). Pepkio starts from FASTQs or client BAMs and returns filtered VCFs with capture-kit–aware on-target QC. Custom inputs and deliverables are agreed at kickoff. See the whole-exome sequencing glossary.

When should you use whole-exome sequencing (WES)?

WES fits when coding-region SNV/indel discovery is the primary goal and uniform non-coding coverage is not required. The table contrasts WES with WGS and targeted gene panels.

Comparison of WES, WGS, and targeted gene panels
Approach	Best for	Limitations	Approximate cost range
WES (capture)	Coding-region variant discovery, rare-disease gene identification, cancer predisposition screening at lower sequencing cost than WGS	Gaps in non-coding, mitochondrial, and poorly captured targets; pseudogene mis-mapping in homologous regions (Rehm et al., 2021; Corominas et al., 2022)	Lower sequencing cost than WGS; capture kit and on-target depth drive sensitivity
WGS (short-read)	Non-coding variant discovery, uniform genome-wide coverage, CNV/SV add-ons without capture bias	Higher per-sample sequencing and storage cost than WES	Library prep + sequencing + bioinformatics vary by depth and cohort size
Targeted gene panel	Known gene sets, ultra-high depth for low-VAF mosaicism in specific loci	Limited to panel genes; misses novel loci outside design	Lowest per-sample sequencing cost; panel breadth drives sensitivity

Undiagnosed monogenic disease: Dillon et al. (2018) found WES diagnoses in genes absent from at least one of three commercial panels in 42% of WES-positive children.
Cancer predisposition in BRCA-negative families: Subramanian et al. (2020) sequenced 516 BRCA1/2-negative high-grade serous ovarian carcinoma germlines to 126× mean depth with 98.4% of bases >20×, screening rare LoF variants across 1,307 genes.
Rare germline susceptibility in case-control cohorts: Liu et al. (2021) validated rare deleterious germline variants in ATM and MPZL2 and three novel lung cancer loci from WES of 1,045 cases and 885 controls, with replication in 26,803 cases and 555,107 controls.

How the analysis works — step by step

1. Validate inputs, capture kit metadata, and sample manifest
Pepkio verifies FASTQ integrity (MD5 checksums), read length, paired-end structure, and read groups (@RG tags) required by GATK (Van der Auwera et al., 2013). Capture kit bait, target, and calling intervals are recorded—bait, target, and calling files are not interchangeable (Corominas et al., 2022). Sample metadata is logged in sample_manifest.csv; sub-threshold yield is flagged before alignment.
Tools and outputs
Tools used: md5sum; custom validation scripts
Output: sample_manifest.csv with sample IDs, capture kit metadata, interval file paths, read counts, and QC flags
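The integrity and pairing checks described in this step can be sketched as follows. This is a minimal illustration, not Pepkio's validation script: the function names are hypothetical, and it assumes well-formed four-line FASTQ records with matching read IDs in R1 and R2.

```python
import gzip
import hashlib
from itertools import zip_longest
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _read_ids(path):
    """Yield read IDs from a plain or gzip FASTQ, with /1 or /2 suffixes removed."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 != 0:
                continue  # only the first line of each record is a header
            fields = line[1:].split()
            read_id = fields[0] if fields else ""
            if read_id.endswith(("/1", "/2")):
                read_id = read_id[:-2]
            yield read_id


def check_pair(r1_path, r2_path):
    """Return (pairs_checked, ok); ok is False on a count or read-ID mismatch."""
    pairs_checked = 0
    for id1, id2 in zip_longest(_read_ids(r1_path), _read_ids(r2_path)):
        if id1 is None or id2 is None or id1 != id2:
            return pairs_checked, False
        pairs_checked += 1
    return pairs_checked, True
```

The computed digest would be compared against the checksum supplied by the sequencing provider, and the pair count could populate the read-count column of sample_manifest.csv.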
2. QC raw reads
FastQC assesses per-base quality, adapter content, and duplication; fastp trims adapters and low-quality ends when needed. Libraries with low Q30 yield or extreme adapter contamination are flagged before alignment.
Tools and outputs
Tools used: FastQC 0.12.1; fastp 0.23.4
Output: fastqc/ reports; fastp.json / fastp.html trim statistics; trimmed FASTQs when trimming is applied
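As a rough sketch of how low-Q30 libraries might be flagged from the fastp.json report: the function name and the 0.85 default threshold below are illustrative assumptions, not thresholds stated by Pepkio. The code reads the `summary.before_filtering` and `summary.after_filtering` sections that fastp writes to its JSON report.

```python
import json


def q30_summary(fastp_json_text, min_q30_rate=0.85):
    """Summarise Q30 rate and read retention from a fastp JSON report.

    min_q30_rate is a hypothetical project threshold, agreed per project.
    """
    summary = json.loads(fastp_json_text)["summary"]
    before = summary["before_filtering"]
    after = summary["after_filtering"]
    total_before = before["total_reads"]
    retained = after["total_reads"] / total_before if total_before else 0.0
    return {
        "q30_rate_before": before["q30_rate"],
        "q30_rate_after": after["q30_rate"],
        "reads_retained": retained,
        "flag_low_q30": before["q30_rate"] < min_q30_rate,
    }
```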
3. Align reads to the reference genome
Paired-end reads from Illumina, MGI DNBSEQ, or other scoped platforms are aligned to GRCh38 (or agreed build) with BWA-MEM2 using Picard/GATK-compatible settings (Vasimuddin et al., 2019; Broad Institute, 2024), producing coordinate-sorted BAM. Mapping rate and insert-size distribution are compared against expected ranges for capture libraries.
Tools and outputs
Tools used: BWA-MEM2 2.2.1; samtools 1.21
Output: {sample}.sorted.bam; alignment summary metrics
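The alignment step could be assembled along these lines. This sketch only builds argument lists and does not run anything; the read-group field layout, the thread count, and the function names are assumptions for illustration, and the exact settings used in a project may differ.

```python
def read_group_header(sample, flowcell, lane, library, platform="ILLUMINA"):
    """Build an @RG string for `bwa-mem2 mem -R`.

    Fields are joined with a literal backslash-t, which bwa-mem2 converts to tabs.
    The ID/PU naming scheme here is an illustrative convention.
    """
    unit = f"{flowcell}.{lane}"
    fields = [
        "@RG",
        f"ID:{unit}",
        f"SM:{sample}",
        f"LB:{library}",
        f"PL:{platform}",
        f"PU:{unit}.{sample}",
    ]
    return "\\t".join(fields)


def alignment_commands(sample, r1, r2, reference, read_group, threads=8):
    """Return (align, sort) argument lists; align's stdout feeds sort's stdin."""
    align = ["bwa-mem2", "mem", "-t", str(threads), "-R", read_group,
             reference, r1, r2]
    sort = ["samtools", "sort", "-@", str(threads),
            "-o", f"{sample}.sorted.bam", "-"]
    return align, sort
```

In practice the two lists would be connected with a pipe (for example via `subprocess.Popen`), so that the unsorted SAM stream is never written to disk.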
4. Mark duplicates and index BAMs
PCR and optical duplicates are marked with Picard MarkDuplicates so downstream callers do not double-count clustered molecules (Van der Auwera et al., 2013). Elevated duplicate rates trigger review of library prep metadata.
Tools and outputs
Tools used: Picard 3.2.0 MarkDuplicates; samtools 1.21 index
Output: {sample}.dedup.bam and .bai; duplicate metrics table
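For orientation, the duplicate fraction reported in the Picard duplicate metrics table can be reproduced from its read counts. The sketch below follows the definition of PERCENT_DUPLICATION as we understand it (each pair counts as two reads); the 0.30 review threshold is a placeholder assumption, not a Pepkio cutoff.

```python
def percent_duplication(unpaired_reads_examined, read_pairs_examined,
                        unpaired_read_duplicates, read_pair_duplicates):
    """Fraction of examined reads marked duplicate, counting each pair as two reads."""
    examined = unpaired_reads_examined + 2 * read_pairs_examined
    if examined == 0:
        return 0.0
    duplicates = unpaired_read_duplicates + 2 * read_pair_duplicates
    return duplicates / examined


def needs_library_review(duplicate_fraction, review_threshold=0.30):
    """Flag a library for metadata review; the threshold is a hypothetical default."""
    return duplicate_fraction > review_threshold
```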
5. Recalibrate base quality scores
GATK BaseRecalibrator and ApplyBQSR adjust per-base qualities using known polymorphism sites from the GATK resource bundle (Van der Auwera et al., 2013). Recalibration reports are inspected for covariate drift across cycles and read groups.
Tools and outputs
Tools used: GATK 4.6.0.0 BaseRecalibrator; GATK ApplyBQSR
Output: {sample}.recal.bam; recalibration report PDF
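A possible way to assemble the two recalibration commands is shown below. It is a sketch under assumptions: the known-sites file names are supplied by the caller (none are implied here), the output naming is illustrative, and restricting BaseRecalibrator with `-L` is optional.

```python
def bqsr_commands(sample, in_bam, reference, known_sites, intervals=None):
    """Return (recalibrate, apply) argument lists for GATK BQSR.

    known_sites is a list of VCF paths from the resource bundle chosen for the project.
    """
    table = f"{sample}.recal.table"
    recalibrate = ["gatk", "BaseRecalibrator",
                   "-I", in_bam, "-R", reference, "-O", table]
    for vcf in known_sites:
        recalibrate += ["--known-sites", vcf]
    if intervals:
        recalibrate += ["-L", intervals]  # optional restriction to capture intervals
    apply_bqsr = ["gatk", "ApplyBQSR",
                  "-I", in_bam, "-R", reference,
                  "--bqsr-recal-file", table,
                  "-O", f"{sample}.recal.bam"]
    return recalibrate, apply_bqsr
```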
6. Assess on-target coverage and hybrid-selection metrics
Picard CollectHsMetrics and mosdepth report mean target coverage, PCT_TARGET_BASES_20X, PCT_TARGET_BASES_100X, and per-target depth distributions using kit-specific bait and target intervals (Rehder et al., 2021). On-target breadth, not genome-wide mean depth, determines callability; samples below agreed thresholds are flagged before calling. At a 20× stringency, Dillon et al. (2018) estimated that the likelihood of missing a clinically relevant variant in a phenotype gene list was at most 8%.
Tools and outputs
Tools used: mosdepth 0.3.3; Picard 3.2.0 CollectHsMetrics
Output: coverage_summary.csv; hs_metrics.txt; mosdepth.targets.dist.txt; on-target coverage histogram and CDF plots
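The threshold check on hybrid-selection metrics might look like the sketch below. It parses the METRICS CLASS section of a Picard metrics file and compares two HsMetrics columns against defaults taken from the depth targets quoted on this page (75× mean, ≥95% of bases at ≥10×); the function names are hypothetical and real projects use the thresholds agreed at kickoff.

```python
def parse_picard_metrics(text):
    """Return the rows of the first METRICS CLASS section as a list of dicts."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            header = lines[index + 1].split("\t")
            rows = []
            for row_line in lines[index + 2:]:
                if not row_line.strip():
                    break  # a blank line ends the metrics table
                rows.append(dict(zip(header, row_line.split("\t"))))
            return rows
    raise ValueError("no METRICS CLASS section found")


def coverage_flags(row, min_mean_coverage=75.0, min_pct_10x=0.95):
    """Compare one HsMetrics row against illustrative on-target thresholds."""
    mean_coverage = float(row["MEAN_TARGET_COVERAGE"])
    pct_10x = float(row["PCT_TARGET_BASES_10X"])
    return {
        "mean_target_coverage": mean_coverage,
        "pct_target_bases_10x": pct_10x,
        "passes": mean_coverage >= min_mean_coverage and pct_10x >= min_pct_10x,
    }
```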
7. Call germline SNVs and indels per sample
GATK HaplotypeCaller runs in -ERC GVCF mode with -L capture calling intervals, emitting gVCF blocks restricted to exome targets (Van der Auwera et al., 2013; Broad Institute, 2024). DeepVariant 1.8.0 is available as an alternative single-sample caller when scoped at kickoff (Poplin et al., 2018).
Tools and outputs
Tools used: GATK 4.6.0.0 HaplotypeCaller
Output: {sample}.g.vcf.gz and .tbi; per-sample variant count summary
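The per-sample calling command described above can be sketched as an argument list. The function name and output naming are illustrative, and the optional interval padding is an assumption about how a project might be configured rather than a stated default.

```python
def haplotypecaller_gvcf_command(sample, bam, reference, calling_intervals, padding=0):
    """Build a GATK HaplotypeCaller command in GVCF mode restricted to capture intervals."""
    command = ["gatk", "HaplotypeCaller",
               "-R", reference,
               "-I", bam,
               "-L", calling_intervals,
               "-ERC", "GVCF",
               "-O", f"{sample}.g.vcf.gz"]
    if padding > 0:
        # Optional: extend each interval by a fixed number of bases on both sides.
        command += ["--interval-padding", str(padding)]
    return command
```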
8. Joint-genotype cohorts when applicable
For multi-sample projects, gVCFs are imported into GenomicsDB and joint-genotyped with GenotypeGVCFs over capture intervals, rescuing variants weakly supported in individual samples (Van der Auwera et al., 2013). Single-sample projects skip this step.
Tools and outputs
Tools used: GATK 4.6.0.0 GenomicsDBImport; GATK GenotypeGVCFs
Output: {cohort}.joint.vcf.gz; GenomicsDB workspace; sample count and site-level summary
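A minimal sketch of the cohort step, assuming per-sample gVCFs already exist: the workspace naming and function name are hypothetical, and the code only builds the two argument lists. It refuses single-sample input, mirroring the note that single-sample projects skip this step.

```python
def joint_genotyping_commands(cohort, gvcfs, reference, calling_intervals, workspace=None):
    """Return (import, genotype) argument lists for GenomicsDBImport and GenotypeGVCFs."""
    if len(gvcfs) < 2:
        raise ValueError("joint genotyping expects at least two gVCFs")
    workspace = workspace or f"{cohort}_genomicsdb"
    import_command = ["gatk", "GenomicsDBImport",
                      "--genomicsdb-workspace-path", workspace,
                      "-L", calling_intervals]
    for gvcf in gvcfs:
        import_command += ["-V", gvcf]
    genotype_command = ["gatk", "GenotypeGVCFs",
                        "-R", reference,
                        "-V", f"gendb://{workspace}",
                        "-L", calling_intervals,
                        "-O", f"{cohort}.joint.vcf.gz"]
    return import_command, genotype_command
```

For interval lists with many small targets, GATK also offers a `--merge-input-intervals` option on GenomicsDBImport; whether it is appropriate depends on the project.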
9. Filter variants to high-confidence calls
When cohort size supports it, GATK VariantRecalibrator applies VQSR (Broad Institute, 2024); smaller cohorts use documented GATK hard filters. Dataset-specific genotype-quality filters may supplement VQSR when scoped (Carson et al., 2014).
Tools and outputs
Tools used: GATK 4.6.0.0 VariantRecalibrator / ApplyVQSR or VariantFiltration; bcftools 1.21
Output: {cohort}.filtered.vcf.gz; {cohort}.pass-only.vcf.gz; VQSR tranche plots or hard-filter summary
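The choice between VQSR and hard filters, and the hard filters themselves, can be illustrated as follows. The thresholds are the commonly cited GATK starting-point values as we recall them and should be checked against current GATK documentation; the filter names, the 30-exome default, and the function names are assumptions. Missing annotations are treated as not failing.

```python
import operator

# (filter name, INFO annotation, comparison that FAILS the site, threshold)
HARD_FILTERS = {
    "SNP": [
        ("QD2", "QD", operator.lt, 2.0),
        ("FS60", "FS", operator.gt, 60.0),
        ("MQ40", "MQ", operator.lt, 40.0),
        ("MQRankSum-12.5", "MQRankSum", operator.lt, -12.5),
        ("ReadPosRankSum-8", "ReadPosRankSum", operator.lt, -8.0),
        ("SOR3", "SOR", operator.gt, 3.0),
    ],
    "INDEL": [
        ("QD2", "QD", operator.lt, 2.0),
        ("FS200", "FS", operator.gt, 200.0),
        ("ReadPosRankSum-20", "ReadPosRankSum", operator.lt, -20.0),
        ("SOR10", "SOR", operator.gt, 10.0),
    ],
}


def choose_filter_strategy(n_exomes, min_exomes_for_vqsr=30):
    """Pick VQSR for sufficiently large cohorts, hard filters otherwise."""
    return "VQSR" if n_exomes >= min_exomes_for_vqsr else "hard_filters"


def failed_filters(info, variant_type):
    """Return the names of hard filters a site fails; absent annotations are skipped."""
    failed = []
    for name, annotation, fails_if, threshold in HARD_FILTERS[variant_type]:
        value = info.get(annotation)
        if value is not None and fails_if(value, threshold):
            failed.append(name)
    return failed
```

A site with an empty result would be written as PASS; any returned names would populate the FILTER column.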
10. Annotate variants and package deliverables
Ensembl VEP annotates consequences, gene symbols, and population or clinical fields when reference databases are configured (McLaren et al., 2016). Known pseudogene-prone loci (e.g., SMN1, CYP21A2, PKD1, STRC) are flagged for manual review when variants are reported (Corominas et al., 2022; Mandelker et al., 2016). MultiQC aggregates QC metrics; final scripts, README, Methods draft, and HTML QC report are packaged per agreed retention policy.
Tools and outputs
Tools used: Ensembl VEP 112; bcftools 1.21; MultiQC 1.25
Output: variant_annotation_master.tsv; MultiQC report; final deliverable bundle with scripts and Methods draft
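Flagging pseudogene-prone loci in the annotation table could be done with a pass like the one below. It assumes a tab-separated table with a SYMBOL column, as listed for variant_annotation_master.tsv; the REVIEW_FLAG column name and its value are hypothetical, and the gene set contains only the examples named above.

```python
import csv
import io

# Example homologous genes named in the text; a project list would be longer.
REVIEW_GENES = frozenset({"SMN1", "CYP21A2", "PKD1", "STRC"})


def flag_homology_loci(tsv_text, review_genes=REVIEW_GENES):
    """Add a REVIEW_FLAG column marking variants in pseudogene-prone genes."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        in_review_gene = row.get("SYMBOL", "") in review_genes
        row["REVIEW_FLAG"] = "homology_review" if in_review_gene else ""
        rows.append(row)
    return rows
```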

What Pepkio delivers

Processed data files

.bam/.bai, recalibrated BAM, per-sample .g.vcf.gz, joint and filtered .vcf.gz (cohorts)
variant_annotation_master.tsv, coverage_summary.csv, hs_metrics.txt, sample_qc_summary.csv

Figures (PDF/SVG)

FastQC heatmaps, on-target depth histogram/CDF, insert-size distribution
Ti/Tv summary, variant consequence bar chart
VQSR tranche curves when applicable, per-target coverage heatmap when scoped

Tables

variant_annotation_master.tsv (CHROM, POS, REF, ALT, QUAL, FILTER, SYMBOL, Consequence, IMPACT, plus gnomAD/ClinVar when configured)
sample_qc_summary.csv (mapping rate, mean target coverage, pct targets ≥20×/≥100×, duplicate rate, pass-variant counts)

Code

Commented bash, R, and Python scripts with environment lock files
Delivery via private Git or agreed file transfer

Documentation

HTML/PDF QC report, README, Methods draft with software versions and capture kit intervals
Post-delivery reviewer support within agreed scope (typically ≤20% of deliverables)

Technical decisions we make — and why

On-target QC: CollectHsMetrics by default. Exome projects use hybrid-selection metrics on kit-specific intervals, not CollectWgsMetrics (Rehder et al., 2021; Broad Institute, 2024).
Calling intervals: kit-specific lists. Calling is restricted to capture intervals because genome-wide calling inflates off-target noise (Corominas et al., 2022). Off-target calling is scoped separately.
Caller: GATK HaplotypeCaller gVCF plus joint genotyping. gVCF mode supports cohort joint genotyping via GenomicsDBImport (Van der Auwera et al., 2013; Regier et al., 2018). DeepVariant 1.8.0 is optional for single-sample projects (Poplin et al., 2018).
Filtering: VQSR or hard filters. VQSR is applied when cohort size supports it (≥1 WGS or ~30 exomes; Broad Institute, 2024); hard filters are used otherwise. Supplemental GQ filters are added when scoped (Carson et al., 2014).
Pseudogene-prone loci: flagged for review. Homologous genes (SMN1, CYP21A2, PKD1, STRC) produce mis-mapped reads in short-read WES (Corominas et al., 2022; Mandelker et al., 2016). Flagged loci are documented; orthogonal validation is scoped when indicated.

Common questions

What is the minimum on-target depth and sample count for WES analysis?

For germline SNV/indel discovery on GRCh38, Pepkio recommends ≥50–100× on-target depth for research cohorts and ~75–100× mean on-target depth for lab-QC-aligned projects (Rehder et al., 2021). VQSR typically requires at least one WGS or ~30 exomes (Broad Institute, 2024). Thresholds are confirmed at kickoff.

Can you analyze low-quality or low-yield WES libraries?

Yes, with caveats. Low Q30 yield or a mean on-target depth below ~20× reduces sensitivity for heterozygous indels and rare variants (Kong et al., 2018). Sub-threshold samples are flagged; re-sequencing or partial analysis on priority gene lists is discussed at kickoff.

Do you support Illumina, MGI DNBSEQ, and Agilent, Twist, IDT, and KAPA capture kits?

Illumina NovaSeq X, NovaSeq 6000, NextSeq 2000, and HiSeq FASTQs use the standard BWA-MEM2 + GATK workflow. MGI DNBSEQ-T7, G400, and G99 FASTQs are processed when scoped at kickoff, with adapter/QC validation documented in the report. Agilent SureSelect, Twist Exome 2.0, IDT xGen Exome, and KAPA HyperExome require kit-specific interval files. Element AVITI and Ultima UG100 data are supported when scoped at kickoff.

How long does WES analysis take at Pepkio?

Single-sample germline projects typically complete in 2–4 weeks; multi-sample cohorts with joint genotyping and VQSR typically require 4–8 weeks. Timelines are confirmed at kickoff.

How do you handle batch effects and joint genotyping in multi-batch exome cohorts?

Version-pinned pipelines aligned with functional-equivalence principles reduce batch-driven call differences (Regier et al., 2018). Pepkio stratifies sequencing center, flowcell, and capture batch in QC reports. Joint genotyping via gVCF aggregation improves sensitivity within capture targets (Van der Auwera et al., 2013). Batch-specific correction beyond standard QC is scoped at kickoff.

Do I own the code — and in what format is it delivered?

Yes — you retain full ownership of code, scripts, and results. Pepkio delivers commented bash, R, and Python scripts with environment lock files. Jupyter or R Markdown delivery is available on request.

Can I be involved during analysis?

Yes. Checkpoint reviews occur after alignment QC, on-target coverage assessment, and before final delivery. A PhD-level scientific contact leads the project.

What does post-delivery reviewer support include?

Support covers Methods clarification, questions about QC thresholds, and minor figure or table revisions within agreed scope (typically ≤20% of deliverables). Methods and Supplementary drafts are included; substantial new requests are scoped separately.

Is co-authorship required?

No. Pepkio does not require co-authorship unless explicitly discussed. Acknowledgment of bioinformatics support is standard practice.

Do you need bait and target interval files from my sequencing provider?

Yes. Bait, target, and calling .interval_list files must match your capture kit version (Corominas et al., 2022). Mismatched intervals produce incorrect coverage and calls. Kit metadata is documented in the Methods draft.

Can you call variants outside capture targets (off-target reads)?

Default calling is restricted to capture intervals. Off-target or genome-wide calling is a separately scoped milestone; depth and sensitivity outside targets are documented in the QC report.

Can Pepkio perform tumor–normal paired exome or custom non-standard WES analyses?

Tumor–normal somatic exome and bespoke workflows are scoped at kickoff, including client BAMs, custom gene lists, pedigree filtering, plink exports, or CNV integration. Somatic projects typically require ≥200× tumor on-target depth and callers distinct from the germline workflow.

Related services

Whole-genome sequencing — Uniform genome-wide coverage when non-coding variants or off-target gaps in WES are a concern.
Variant calling — Caller selection, filter tuning, and joint genotyping when alignment is already complete.
CNV and structural variation — Exonic copy-number and structural variant calling from WES alignments with capture-aware binning.
Long-read DNA sequencing — Phased variants and structural events in pseudogene-prone or repeat-rich loci that short-read WES cannot fully resolve.
Custom consulting — Pre-sequencing depth, capture kit selection, and cohort-size planning before library prep.

References

Van der Auwera GA, Carneiro MO, Hartl C, et al. From FastQ data to high-confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Current Protocols in Bioinformatics. 2013;43(1):11.10.1–11.10.33. https://doi.org/10.1002/0471250953.bi1110s43 (PMID: 25431634)
Rehder CE, Bean LJ, Bick D, et al. Next-generation sequencing for constitutional variants in the clinical laboratory, 2021 revision: a technical standard of the American College of Medical Genetics and Genomics (ACMG). Genetics in Medicine. 2021;23(8):1399–1415. https://doi.org/10.1038/s41436-021-01139-4 (PMID: 33927380)
Rehm HL, Bale SJ, Bayrak-Toydemir P, et al. Best practices for the analytical validation of clinical whole-genome sequencing intended for the diagnosis of germline disease. npj Genomic Medicine. 2021;6(1):47. https://doi.org/10.1038/s41525-020-00154-9 (PMID: 33110627)
Regier AA, Farjoun Y, Larson DE, et al. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. Nature Communications. 2018;9:4038. https://doi.org/10.1038/s41467-018-06159-4 (PMID: 30279509)
Kong SW, Lee IH, Liu X, et al. Measuring coverage and accuracy of whole-exome sequencing in clinical context. Genetics in Medicine. 2018;20(10):1244–1250. https://doi.org/10.1038/gim.2017.269 (PMID: 29789557)
Dillon OJ, Lunke S, Stark Z, et al. Exome sequencing has higher diagnostic yield compared to simulated disease-specific panels in children with suspected monogenic disorders. European Journal of Human Genetics. 2018;26(7):974–984. https://doi.org/10.1038/s41431-018-0099-1 (PMID: 29453417)
Corominas J, Smeekens SP, Nelen MR, et al. Clinical exome sequencing—mistakes and caveats. Human Mutation. 2022;43(8):976–1000. https://doi.org/10.1002/humu.24360 (PMID: 35191116)
Subramanian DN, Zethoven M, McInerny S, et al. Exome sequencing of familial high-grade serous ovarian carcinoma reveals heterogeneity for rare candidate susceptibility genes. Nature Communications. 2020;11:1640. https://doi.org/10.1038/s41467-020-15461-z (PMID: 32242007)
Liu Y, Xia J, McKay J, et al. Rare deleterious germline variants and risk of lung cancer. npj Precision Oncology. 2021;5:82. https://doi.org/10.1038/s41698-021-00146-7 (PMID: 33594163)
McLaren W, Gil L, Hunt SE, et al. The Ensembl Variant Effect Predictor. Genome Biology. 2016;17(1):122. https://doi.org/10.1186/s13059-016-0974-4 (PMID: 27268795)
Poplin R, Chang PC, Alexander D, et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology. 2018;36(10):983–987. https://doi.org/10.1038/nbt.4235 (PMID: 30247488)
Mandelker D, Schmidt RJ, Ankala A, et al. Navigating highly homologous genes in a molecular diagnostic setting: a resource for clinical next-generation sequencing. Genetics in Medicine. 2016;18(12):1282–1289. https://doi.org/10.1038/gim.2016.58 (PMID: 27228465)
Carson AR, Smith EN, Matsui H, et al. Effective filtering strategies to improve data quality from population-based whole exome sequencing studies. BMC Bioinformatics. 2014;15:125. https://doi.org/10.1186/1471-2105-15-125 (PMID: 24884706)
Vasimuddin M, Misra S, Li H, Aluru S. Efficient architecture-aware acceleration of BWA-MEM for multicore systems. IEEE IPDPS. 2019. https://doi.org/10.1109/IPDPS.2019.00041 (BWA-MEM2)
Broad Institute. GATK 4.6.0.0 release notes, exome germline pipeline, and VQSR documentation. 2024. https://github.com/broadinstitute/gatk/releases/tag/4.6.0.0; https://broadinstitute.github.io/warp/docs/Pipelines/Exome_Germline_Single_Sample_Pipeline/README; https://gatk.broadinstitute.org/hc/en-us/articles/360035531612-Variant-Quality-Score-Recalibration-VQSR
Ensembl. VEP documentation. 2024. https://www.ensembl.org/info/docs/tools/vep/index.html

Let's Talk About Your Science

Tell us:

• Your biological question
• Data type and size
• Timeline constraints

We'll tell you:

• What's feasible
• How long it will take
• Exactly what it will cost
