Genomics & Variant Analysis

Variant Calling Analysis Service — Joint-Genotyped SNV and Indel Calls from Aligned BAMs to Filtered, Annotated VCFs

Variant calling discovers germline SNVs and indels from aligned NGS reads (Van der Auwera et al., 2013). Pepkio delivers version-pinned gVCF-to-VCF workflows with custom and bespoke support for academic, biotech, and pharma clients starting from BAMs, gVCFs, or scoped FASTQs—VQSR typically requires at least one WGS or ~30 exomes per GATK guidance (Broad Institute, 2024). Scripts, figures, and a Methods draft included.

Key facts

Key facts about Variant Calling
Fact	Value
Supported platforms / instruments	Illumina NovaSeq X / 6000 / NextSeq 2000, HiSeq 2500/4000; MGI DNBSEQ-T7 / G400 / MGISEQ-2000 (BGI) short-read BAMs; Element Biosciences AVITI and Ultima Genomics UG100 when scoped at kickoff. Input: coordinate-sorted .bam/.cram, per-sample .g.vcf.gz, or paired-end FASTQ when upstream alignment is in scope
Input requirements	WGS BAMs: ≥30× mean autosomal depth and >95% callability recommended (Rehm et al., 2021). WES/panel BAMs: ~100× mean on-target depth typical (Illumina, 2024). Cohort VQSR: typically ~30 exomes or one high-quality WGS per GATK guidance (Broad Institute, 2024). Valid @RG read groups required
Reference builds supported	Human GRCh38 primary (GATK resource bundle 4.6); legacy GRCh37/hg19 on request; mouse GRCm39; custom references scoped at kickoff
Primary tools (with versions)	GATK 4.6.0.0; DeepVariant 1.8.0; Strelka2 2.9.10 (optional); Picard 3.2.0; samtools 1.21; bcftools 1.21; mosdepth 0.3.3; Ensembl VEP 112; BWA-MEM2 2.2.1 when alignment is in scope; MultiQC 1.25
Typical turnaround time	1–3 weeks (single-sample from BAM); 3–6 weeks (multi-sample cohort with joint genotyping and VQSR) — confirmed at kickoff
Deliverable formats	.g.vcf.gz, joint and filtered .vcf.gz, variant_annotation_master.tsv, QC tables; PDF/SVG figures; HTML QC report; documented bash/R/Python scripts; Methods draft
Key cited best-practice reference	Van der Auwera et al. (2013), Current Protocols in Bioinformatics; Van der Auwera & O'Connor (2020), Genomics in the Cloud; Barbitoff et al. (2022), BMC Genomics
Custom / bespoke analysis	Non-standard inputs, outputs, and methods scoped at kickoff—e.g., re-calling vendor VCFs, multi-caller benchmarks, GLnexus merge for DeepVariant cohorts, pedigree filtering, plink exports, or integration with existing gVCFs

What is variant calling?

Variant calling identifies SNVs and short indels from aligned reads—not library prep or basecalling. GATK HaplotypeCaller assembles local haplotypes and assigns genotypes; joint genotyping rescues variants weakly supported per sample (Van der Auwera et al., 2013). Unlike CNV/SV calling at segment scale, SNV/indel calling operates at single-base resolution. Documented pipelines pin software versions and filter logic for reproducibility (Regier et al., 2018; Pan et al., 2022). GATK processed ~180,000 WGS samples in TOPMed; DeepVariant was selected for 500,000+ UK Biobank exomes (Abdelwahab et al., 2023). Pepkio starts from client BAMs, gVCFs, or scoped FASTQs and returns filtered, annotated VCFs. Custom scope is agreed at kickoff. See the variant calling glossary.

When should you use variant calling?

Dedicated variant calling fits when alignments exist, vendor VCFs need re-calling, or cohorts require harmonized joint genotyping across batches.

Comparison of dedicated variant calling, full WGS/WES spoke, and vendor default VCF
Approach	Best for	Limitations	Approximate cost range
Dedicated variant calling (BAM/gVCF in)	Re-calling, cohort harmonization, filter tuning, multi-caller benchmarks, plink or GWAS-ready exports	Requires upstream BAM QC audit; cannot recover failed library prep; somatic tumor–normal separately scoped	Bioinformatics-only; lower than full FASTQ-to-VCF when alignment exists
Full WGS/WES spoke	End-to-end FASTQ-to-VCF with modality-specific alignment and depth QC	Higher scope when BAMs already exist	Library prep + sequencing + bioinformatics vary by depth and sample count
Vendor or sequencer default VCF	Fast turnaround bundled with sequencing	Opaque filters, version drift, poor cross-center harmonization (Regier et al., 2018; Pan et al., 2022)	Often bundled with sequencing; re-analysis avoided only if calls are adequate

National rare-disease diagnosis: Stark et al. (2023) applied WGS across 290 critically ill families in Acute Care Genomics, reporting 47% diagnostic yield and variant classes including structural and deep intronic changes.
Cancer germline predisposition at cohort scale: Lu et al. (2015) catalogued rare germline truncations across 4,034 TCGA exomes spanning 12 cancer types in 114 susceptibility genes.
Precision oncology driver discovery: Kinnersley et al. (2024) analyzed 10,478 cancer WGS profiles from the UK 100,000 Genomes Project, identifying 330 candidate driver genes; approximately 55% of patients harbored at least one clinically relevant mutation.

How the analysis works — step by step

1. Validate inputs and sample metadata
Pepkio verifies BAM, CRAM, gVCF, or FASTQ integrity, confirms coordinate sorting and indexing, and checks @RG read groups required by GATK (Van der Auwera et al., 2013). Sample sex, batch, sequencing center, and reference build are recorded in sample_manifest.csv; reference-build mismatches are flagged before calling.
Tools and outputs
Tools used: md5sum; samtools quickcheck; custom validation scripts
Output: sample_manifest.csv with sample IDs, input type, reference build, read counts or BAM size, and QC flags
2. Audit alignment prerequisites
Pepkio inspects duplicate marking and base quality score recalibration (BQSR) status. Calling without MarkDuplicates or ApplyBQSR increases false-positive SNVs and indels (Van der Auwera et al., 2013; Koboldt, 2020). Missing prerequisites trigger scoped Picard MarkDuplicates and GATK BQSR re-runs or documented deviation notes.
Tools and outputs
Tools used: Picard 3.2.0 CollectAlignmentSummaryMetrics; GATK 4.6.0.0 ValidateSamFile; custom BAM audit scripts
Output: bam_prerequisite_audit.csv with duplicate-marking status, BQSR tag presence, mapping rate, and re-preprocessing recommendations
3. Assess coverage and callability
mosdepth and Picard CollectWgsMetrics or CollectHsMetrics report mean depth, breadth at ≥15×/≥20×, and on-target fraction (Rehm et al., 2021; Koboldt, 2020). Callable breadth—not mean depth alone—determines locus-level calling; SNV reproducibility plateaus above ~30× while indel reproducibility continues to improve with depth (Pan et al., 2022). Sub-threshold samples are flagged before discovery.
Tools and outputs
Tools used: mosdepth 0.3.3; Picard 3.2.0 CollectWgsMetrics or CollectHsMetrics
Output: coverage_summary.csv; mosdepth.global.dist.txt or target-coverage tables; coverage histogram and CDF plots
4. Define calling intervals and resources
Intervals are set to whole genome, client exome or panel BED, or agreed subsets. GATK resource bundle 4.6 known sites for BQSR verification and VQSR training match the reference build (Broad Institute, 2024). Interval lists and BED files are version-controlled in the deliverable bundle.
Tools and outputs
Tools used: GATK 4.6.0.0 PreprocessIntervals (when needed); custom interval validation scripts
Output: calling_intervals.list or .bed; interval_validation_report.txt
5. Call germline SNVs and indels per sample
GATK HaplotypeCaller runs in -ERC GVCF mode, emitting gVCF blocks across uncalled regions (Van der Auwera et al., 2013). DeepVariant 1.8.0 or Strelka2 2.9.10 are optional when scoped at kickoff (Poplin et al., 2018; Barbitoff et al., 2022; Kim et al., 2018). Per-sample Ti/Tv ratios are checked against ~2.0–2.1 for whole-genome germline SNVs (Koboldt, 2020).
Tools and outputs
Tools used: GATK 4.6.0.0 HaplotypeCaller; DeepVariant 1.8.0 or Strelka2 2.9.10 when scoped
Output: {sample}.g.vcf.gz and .tbi; per-sample variant count and Ti/Tv summary
6. Joint-genotype cohorts when applicable
For multi-sample projects, gVCFs import into GenomicsDB and joint-genotype with GenotypeGVCFs (Van der Auwera et al., 2013). Single-sample projects skip this step. GLnexus merge is scoped for DeepVariant cohorts when required (Regier et al., 2018).
Tools and outputs
Tools used: GATK 4.6.0.0 GenomicsDBImport; GATK GenotypeGVCFs; GLnexus when scoped for DeepVariant
Output: {cohort}.joint.vcf.gz; GenomicsDB workspace; sample count and site-level summary
7. Filter variants to high-confidence calls
When cohort size supports it, GATK VariantRecalibrator applies VQSR trained on known polymorphism resources (Broad Institute, 2024). Smaller cohorts use documented GATK hard filters with all FILTER labels retained (Koboldt, 2020). DeepVariant projects use quality thresholds documented in the Methods draft. Filtered and pass-only VCFs export separately.
Tools and outputs
Tools used: GATK 4.6.0.0 VariantRecalibrator / ApplyVQSR or VariantFiltration; bcftools 1.21
Output: {cohort}.filtered.vcf.gz; {cohort}.pass-only.vcf.gz; VQSR tranche plots or hard-filter summary table
8. QC variant call sets
Pepkio computes Ti/Tv on pass sites, variant counts by type, and FILTER distributions. Caller and aligner choice affected reproducibility more than sequencing platform in a multi-center WGS benchmark (Pan et al., 2022); multi-caller benchmarks are documented when scoped. Problematic regions (segmental duplications, homopolymers) are flagged.
Tools and outputs
Tools used: bcftools 1.21 stats; custom QC scripts; hap.py when benchmark comparison is scoped
Output: variant_qc_summary.csv; sample_qc_summary.csv; filter_summary.csv; Ti/Tv plot; FILTER status distribution chart; optional benchmark comparison table
9. Annotate variants with transcript consequences
Ensembl VEP annotates consequence terms, gene symbols, and population or clinical fields when databases are configured (McLaren et al., 2016). ClinVar, gnomAD, or client gene lists are added when scoped at kickoff.
Tools and outputs
Tools used: Ensembl VEP 112; bcftools 1.21
Output: variant_annotation_master.tsv; annotated .vcf.gz when requested
10. Package deliverables
MultiQC aggregates QC metrics when multiple stages are in scope (Ewels et al., 2016). Scripts, README, Methods draft with exact versions and filter parameters, and HTML QC report are packaged per agreed retention policy.
Tools and outputs
Tools used: MultiQC 1.25; custom packaging scripts
Output: deliverable bundle with scripts, figures, QC report, and Methods draft

What Pepkio delivers

Processed data files

Per-sample .g.vcf.gz; joint {cohort}.joint.vcf.gz
Filtered {cohort}.filtered.vcf.gz and {cohort}.pass-only.vcf.gz
variant_annotation_master.tsv; coverage_summary.csv; bam_prerequisite_audit.csv
variant_qc_summary.csv; sample_qc_summary.csv; filter_summary.csv

Figures (PDF/SVG)

Coverage histogram and cumulative distribution
Ti/Tv ratio summary; variant consequence bar chart
VQSR tranche curves when applicable; FILTER status pie chart
Optional multi-caller concordance plot when benchmarked

Tables

variant_annotation_master.tsv with CHROM, POS, REF, ALT, QUAL, FILTER, SYMBOL, Consequence, IMPACT, and population/clinical fields when configured
sample_qc_summary.csv with mean depth, pct bases ≥15×/≥20×, duplicate rate, Ti/Tv, pass-variant counts, and caller version
filter_summary.csv with filter labels and site counts

Code

Commented bash, R, and Python scripts per stage
Environment lock files; delivery via private Git or agreed file transfer

Documentation

HTML/PDF QC report; README; Methods draft
Bespoke milestones scoped at kickoff; post-delivery reviewer support within agreed scope (typically ≤20% of deliverables)

Technical decisions we make — and why

Default caller: GATK HaplotypeCaller gVCF for cohorts; DeepVariant when single-sample accuracy is prioritized: GATK gVCF plus GenomicsDBImport is standard for multi-sample joint genotyping (Van der Auwera et al., 2013; Regier et al., 2018). DeepVariant showed the best performance and highest robustness on GIAB gold-standard WES and WGS datasets in a systematic benchmark (Barbitoff et al., 2022; Poplin et al., 2018).
Filtering: VQSR when cohort size supports it; hard filters otherwise: GATK recommends at least one WGS or ~30 exomes for VQSR training (Broad Institute, 2024). Smaller cohorts use documented hard filters with all FILTER labels retained (Koboldt, 2020).
Joint genotyping for n ≥ 2 when compatible: gVCF aggregation rescues weak per-sample evidence (Van der Auwera et al., 2013). Single-sample projects skip GenomicsDBImport.
Reference build: GRCh38 primary with GATK 4.6 resource bundle: Functional-equivalence alignment standards reduce cross-center discordance (Regier et al., 2018). Legacy GRCh37/hg19 on request when BAM headers match.
Prerequisite audit before calling: BAMs missing duplicate marking or BQSR inflate false positives (Van der Auwera et al., 2013; Koboldt, 2020). Pepkio documents upstream QC status rather than delivering calls without disclosure.

Common questions

What is the minimum sequencing depth and cohort size for variant calling?

For WGS BAMs, Pepkio recommends ≥30× mean autosomal depth with >95% callability (Rehm et al., 2021). WES or panel BAMs typically require ~100× mean on-target depth (Illumina, 2024). VQSR requires at least one high-quality WGS or ~30 exomes (Broad Institute, 2024); smaller cohorts use hard filters. Thresholds are confirmed at kickoff.

Can you re-call variants from existing BAMs or gVCFs without re-alignment?

Yes—this is the default entry point. Pepkio audits duplicate marking, BQSR status, read groups, and reference build before calling (Van der Auwera et al., 2013). Compatible client .g.vcf.gz files can enter at joint genotyping. Missing BAM prerequisites can be re-processed when scoped at kickoff.

Can you analyze low-quality or low-coverage BAMs?

Yes, with caveats. Mean depth below ~20× for WGS or ~50× on-target for exome reduces sensitivity, especially for indels >5 bp (Pan et al., 2022). Sub-threshold samples are flagged in sample_qc_summary.csv. Partial calling on high-confidence intervals is possible when re-sequencing is not feasible.

Do you support Illumina, MGI DNBSEQ, Element AVITI, and Ultima BAMs?

Illumina and MGI DNBSEQ short-read BAMs use the standard GATK or DeepVariant workflow on coordinate-sorted alignments. Element AVITI and Ultima UG100 BAMs are supported when scoped at kickoff with validated aligner compatibility. PacBio and Oxford Nanopore SNV/indel calling is referred to the long-read DNA-seq spoke.

How long does a variant calling analysis service project take at Pepkio?

Single-sample projects from client BAMs typically complete in 1–3 weeks. Multi-sample cohorts with joint genotyping and VQSR typically require 3–6 weeks. Timelines depend on sample count, upstream BQSR needs, and annotation scope—all confirmed at kickoff.

How do you handle batch effects across sequencing centers or library prep batches?

Harmonized pipelines aligned to functional-equivalence standards reduce batch-driven differences (Regier et al., 2018). Pepkio stratifies sequencing center, flowcell, and batch in QC reports. Joint genotyping improves sensitivity for weakly supported variants across batches (Van der Auwera et al., 2013). Cross-batch harmonization beyond standard QC is scoped at kickoff when needed.

Do I own the code — and in what format is it delivered?

Yes — you retain full ownership of code, scripts, and results. Pepkio delivers commented bash, R, and Python scripts with environment lock files. Jupyter or R Markdown delivery is available on request.

Can I be involved during analysis?

Yes. Checkpoint reviews occur after BAM prerequisite audit, coverage assessment, and before final delivery. A PhD-level scientific contact leads the project at agreed milestones.

What does post-delivery reviewer support include?

Clarification of methods, filter logic, QC thresholds, and minor figure or table revisions within agreed scope (typically ≤20% of deliverables). Substantial new reviewer requests are scoped separately.

Is co-authorship required?

No. Pepkio operates as a fee-for-service provider and does not require co-authorship unless explicitly discussed in advance.

Should we use GATK HaplotypeCaller or DeepVariant for our cohort?

GATK HaplotypeCaller gVCF is the default for multi-sample joint genotyping and VQSR (Van der Auwera et al., 2013). DeepVariant showed the best performance and highest robustness on GIAB WES and WGS benchmarks and is preferred for single-sample accuracy when scoped (Barbitoff et al., 2022; Poplin et al., 2018). Multi-caller benchmarks against vendor VCFs are scoped at kickoff when discordance is a concern (Pan et al., 2022).

Can you handle custom or non-standard variant calling analyses?

Yes. Bespoke work—multi-caller consensus, re-calling vendor VCFs, tumor–normal somatic extensions, pedigree filtering, plink exports, custom annotation, or gVCF integration—is scoped at kickoff with milestone pricing (Koboldt, 2020).

Related services

Whole-genome sequencing — Full FASTQ-to-VCF pipeline when alignment and preprocessing are not yet complete.
Whole-exome sequencing — Capture-aware alignment and calling for exome libraries.
CNV and structural variation — Copy-number and structural variant detection from the same alignments.
Long-read DNA sequencing — Long-read SNV/indel calling in repetitive or pseudogene-heavy loci where short-read callers fail.
Custom consulting — Caller selection, cohort size, and VQSR planning before sequencing or re-calling.

References

Van der Auwera GA, Carneiro MO, Hartl C, et al. From FastQ data to high-confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Current Protocols in Bioinformatics. 2013;43(1):11.10.1–11.10.33. https://doi.org/10.1002/0471250953.bi1110s43 (PMID: 25431634)
Van der Auwera GA, O'Connor BD. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra (1st ed.). O'Reilly Media; 2020. ISBN: 978-1-4919-7519-0
Koboldt DC. Best practices for variant calling in clinical sequencing. Genome Medicine. 2020;12:91. https://doi.org/10.1186/s13073-020-00791-w (PMID: 33106175)
Regier AA, Farjoun Y, Larson DE, et al. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. Nature Communications. 2018;9:4038. https://doi.org/10.1038/s41467-018-06159-4 (PMID: 30279509)
Poplin R, Chang PC, Alexander D, et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology. 2018;36(10):983–987. https://doi.org/10.1038/nbt.4235 (PMID: 30247488)
Barbitoff YA, Abasov R, Tvorogova VE, et al. Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery. BMC Genomics. 2022;23:155. https://doi.org/10.1186/s12864-022-08365-3 (PMID: 35193511)
Pan B, Kusko R, Xiao W, et al. Assessing reproducibility of inherited variants detected with short-read whole genome sequencing. Genome Biology. 2022;23:2. https://doi.org/10.1186/s13059-021-02569-8 (PMID: 34980216)
McLaren W, Gil L, Hunt SE, et al. The Ensembl Variant Effect Predictor. Genome Biology. 2016;17(1):122. https://doi.org/10.1186/s13059-016-0974-4 (PMID: 27268795)
Kim S, Scheffler K, Halpern AL, et al. Strelka2: fast and accurate calling of germline and somatic variants. Nature Methods. 2018;15(8):591–594. https://doi.org/10.1038/s41592-018-0051-x (PMID: 30013048)
Rehm HL, Bale SJ, Bayrak-Toydemir P, et al. Best practices for the analytical validation of clinical whole-genome sequencing intended for the diagnosis of germline disease. npj Genomic Medicine. 2021;6(1):47. https://doi.org/10.1038/s41525-020-00154-9 (PMID: 33110627)
Stark Z, Lunke S, Brett GR, et al. Integrated multi-omics for rapid rare disease diagnosis on a national scale. Nature Medicine. 2023;29(7):1771–1781. https://doi.org/10.1038/s41591-023-02401-9 (PMID: 37291213)
Lu C, Xie M, Wendl MC, et al. Patterns and functional implications of rare germline variants across 12 cancer types. Nature Communications. 2015;6:10086. https://doi.org/10.1038/ncomms10086 (PMID: 26689913)
Kinnersley B, Sud A, Everall A, et al. Analysis of 10,478 cancer genomes identifies candidate driver genes and opportunities for precision oncology. Nature Genetics. 2024;56(9):1868–1877. https://doi.org/10.1038/s41588-024-01785-9 (PMID: 38890488)
Abdelwahab O, Belzile F, Torkamaneh D. Performance analysis of conventional and AI-based variant callers using short and long reads. BMC Bioinformatics. 2023;24:472. https://doi.org/10.1186/s12859-023-05596-3 (PMID: 38097928)
Broad Institute. GATK 4.6.0.0 release notes and VQSR documentation. 2024. https://github.com/broadinstitute/gatk/releases/tag/4.6.0.0; https://gatk.broadinstitute.org/hc/en-us/articles/360035531612-Variant-Quality-Score-Recalibration-VQSR
Illumina. Sequencing coverage for NGS experiments. 2024. https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/coverage.html
Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32(19):3047–3048. https://doi.org/10.1093/bioinformatics/btw354 (PMID: 27312411)
Ensembl. VEP documentation. 2024. https://www.ensembl.org/info/docs/tools/vep/index.html

Let's Talk About Your Science

Tell us:

• Your biological question
• Data type and size
• Timeline constraints

We'll tell you:

• What's feasible
• How long it will take
• Exactly what it will cost

Variant Calling Analysis Service — Joint-Genotyped SNV and Indel Calls from Aligned BAMs to Filtered, Annotated VCFs

Key facts

What is variant calling?

When should you use variant calling?

How the analysis works — step by step

1. Validate inputs and sample metadata

2. Audit alignment prerequisites

3. Assess coverage and callability

4. Define calling intervals and resources

5. Call germline SNVs and indels per sample

6. Joint-genotype cohorts when applicable

7. Filter variants to high-confidence calls

8. QC variant call sets

9. Annotate variants with transcript consequences

10. Package deliverables