Genomics & Variant Analysis

Variant Calling Analysis Service — Joint-Genotyped SNV and Indel Calls from Aligned BAMs to Filtered, Annotated VCFs

Variant calling discovers germline SNVs and indels from aligned NGS reads (Van der Auwera et al., 2013). Pepkio delivers version-pinned gVCF-to-VCF workflows with custom and bespoke support for academic, biotech, and pharma clients starting from BAMs, gVCFs, or scoped FASTQs—VQSR typically requires at least one WGS or ~30 exomes per GATK guidance (Broad Institute, 2024). Scripts, figures, and a Methods draft included.

Key facts

Key facts about Variant Calling
FactValue
Supported platforms / instrumentsIllumina NovaSeq X / 6000 / NextSeq 2000, HiSeq 2500/4000; MGI DNBSEQ-T7 / G400 / MGISEQ-2000 (BGI) short-read BAMs; Element Biosciences AVITI and Ultima Genomics UG100 when scoped at kickoff. Input: coordinate-sorted .bam/.cram, per-sample .g.vcf.gz, or paired-end FASTQ when upstream alignment is in scope
Input requirementsWGS BAMs: ≥30× mean autosomal depth and >95% callability recommended (Rehm et al., 2021). WES/panel BAMs: ~100× mean on-target depth typical (Illumina, 2024). Cohort VQSR: typically ~30 exomes or one high-quality WGS per GATK guidance (Broad Institute, 2024). Valid @RG read groups required
Reference builds supportedHuman GRCh38 primary (GATK resource bundle 4.6); legacy GRCh37/hg19 on request; mouse GRCm39; custom references scoped at kickoff
Primary tools (with versions)GATK 4.6.0.0; DeepVariant 1.8.0; Strelka2 2.9.10 (optional); Picard 3.2.0; samtools 1.21; bcftools 1.21; mosdepth 0.3.3; Ensembl VEP 112; BWA-MEM2 2.2.1 when alignment is in scope; MultiQC 1.25
Typical turnaround time1–3 weeks (single-sample from BAM); 3–6 weeks (multi-sample cohort with joint genotyping and VQSR) — confirmed at kickoff
Deliverable formats.g.vcf.gz, joint and filtered .vcf.gz, variant_annotation_master.tsv, QC tables; PDF/SVG figures; HTML QC report; documented bash/R/Python scripts; Methods draft
Key cited best-practice referenceVan der Auwera et al. (2013), Current Protocols in Bioinformatics; Van der Auwera & O'Connor (2020), Genomics in the Cloud; Barbitoff et al. (2022), BMC Genomics
Custom / bespoke analysisNon-standard inputs, outputs, and methods scoped at kickoff—e.g., re-calling vendor VCFs, multi-caller benchmarks, GLnexus merge for DeepVariant cohorts, pedigree filtering, plink exports, or integration with existing gVCFs

What is variant calling?

Variant calling identifies SNVs and short indels from aligned reads—not library prep or basecalling. GATK HaplotypeCaller assembles local haplotypes and assigns genotypes; joint genotyping rescues variants weakly supported per sample (Van der Auwera et al., 2013). Unlike CNV/SV calling at segment scale, SNV/indel calling operates at single-base resolution. Documented pipelines pin software versions and filter logic for reproducibility (Regier et al., 2018; Pan et al., 2022). GATK processed ~180,000 WGS samples in TOPMed; DeepVariant was selected for 500,000+ UK Biobank exomes (Abdelwahab et al., 2023). Pepkio starts from client BAMs, gVCFs, or scoped FASTQs and returns filtered, annotated VCFs. Custom scope is agreed at kickoff. See the variant calling glossary.

When should you use variant calling?

Dedicated variant calling fits when alignments exist, vendor VCFs need re-calling, or cohorts require harmonized joint genotyping across batches.

Comparison of dedicated variant calling, full WGS/WES spoke, and vendor default VCF
ApproachBest forLimitationsApproximate cost range
Dedicated variant calling (BAM/gVCF in)Re-calling, cohort harmonization, filter tuning, multi-caller benchmarks, plink or GWAS-ready exportsRequires upstream BAM QC audit; cannot recover failed library prep; somatic tumor–normal separately scopedBioinformatics-only; lower than full FASTQ-to-VCF when alignment exists
Full WGS/WES spokeEnd-to-end FASTQ-to-VCF with modality-specific alignment and depth QCHigher scope when BAMs already existLibrary prep + sequencing + bioinformatics vary by depth and sample count
Vendor or sequencer default VCFFast turnaround bundled with sequencingOpaque filters, version drift, poor cross-center harmonization (Regier et al., 2018; Pan et al., 2022)Often bundled with sequencing; re-analysis avoided only if calls are adequate
  • National rare-disease diagnosis: Stark et al. (2023) applied WGS across 290 critically ill families in Acute Care Genomics, reporting 47% diagnostic yield and variant classes including structural and deep intronic changes.
  • Cancer germline predisposition at cohort scale: Lu et al. (2015) catalogued rare germline truncations across 4,034 TCGA exomes spanning 12 cancer types in 114 susceptibility genes.
  • Precision oncology driver discovery: Kinnersley et al. (2024) analyzed 10,478 cancer WGS profiles from the UK 100,000 Genomes Project, identifying 330 candidate driver genes; approximately 55% of patients harbored at least one clinically relevant mutation.

How the analysis works — step by step

  1. 1. Validate inputs and sample metadata

    Pepkio verifies BAM, CRAM, gVCF, or FASTQ integrity, confirms coordinate sorting and indexing, and checks @RG read groups required by GATK (Van der Auwera et al., 2013). Sample sex, batch, sequencing center, and reference build are recorded in sample_manifest.csv; reference-build mismatches are flagged before calling.

    Tools and outputs

    Tools used: md5sum; samtools quickcheck; custom validation scripts

    Output: sample_manifest.csv with sample IDs, input type, reference build, read counts or BAM size, and QC flags

  2. 2. Audit alignment prerequisites

    Pepkio inspects duplicate marking and base quality score recalibration (BQSR) status. Calling without MarkDuplicates or ApplyBQSR increases false-positive SNVs and indels (Van der Auwera et al., 2013; Koboldt, 2020). Missing prerequisites trigger scoped Picard MarkDuplicates and GATK BQSR re-runs or documented deviation notes.

    Tools and outputs

    Tools used: Picard 3.2.0 CollectAlignmentSummaryMetrics; GATK 4.6.0.0 ValidateSamFile; custom BAM audit scripts

    Output: bam_prerequisite_audit.csv with duplicate-marking status, BQSR tag presence, mapping rate, and re-preprocessing recommendations

  3. 3. Assess coverage and callability

    mosdepth and Picard CollectWgsMetrics or CollectHsMetrics report mean depth, breadth at ≥15×/≥20×, and on-target fraction (Rehm et al., 2021; Koboldt, 2020). Callable breadth—not mean depth alone—determines locus-level calling; SNV reproducibility plateaus above ~30× while indel reproducibility continues to improve with depth (Pan et al., 2022). Sub-threshold samples are flagged before discovery.

    Tools and outputs

    Tools used: mosdepth 0.3.3; Picard 3.2.0 CollectWgsMetrics or CollectHsMetrics

    Output: coverage_summary.csv; mosdepth.global.dist.txt or target-coverage tables; coverage histogram and CDF plots

  4. 4. Define calling intervals and resources

    Intervals are set to whole genome, client exome or panel BED, or agreed subsets. GATK resource bundle 4.6 known sites for BQSR verification and VQSR training match the reference build (Broad Institute, 2024). Interval lists and BED files are version-controlled in the deliverable bundle.

    Tools and outputs

    Tools used: GATK 4.6.0.0 PreprocessIntervals (when needed); custom interval validation scripts

    Output: calling_intervals.list or .bed; interval_validation_report.txt

  5. 5. Call germline SNVs and indels per sample

    GATK HaplotypeCaller runs in -ERC GVCF mode, emitting gVCF blocks across uncalled regions (Van der Auwera et al., 2013). DeepVariant 1.8.0 or Strelka2 2.9.10 are optional when scoped at kickoff (Poplin et al., 2018; Barbitoff et al., 2022; Kim et al., 2018). Per-sample Ti/Tv ratios are checked against ~2.0–2.1 for whole-genome germline SNVs (Koboldt, 2020).

    Tools and outputs

    Tools used: GATK 4.6.0.0 HaplotypeCaller; DeepVariant 1.8.0 or Strelka2 2.9.10 when scoped

    Output: {sample}.g.vcf.gz and .tbi; per-sample variant count and Ti/Tv summary

  6. 6. Joint-genotype cohorts when applicable

    For multi-sample projects, gVCFs import into GenomicsDB and joint-genotype with GenotypeGVCFs (Van der Auwera et al., 2013). Single-sample projects skip this step. GLnexus merge is scoped for DeepVariant cohorts when required (Regier et al., 2018).

    Tools and outputs

    Tools used: GATK 4.6.0.0 GenomicsDBImport; GATK GenotypeGVCFs; GLnexus when scoped for DeepVariant

    Output: {cohort}.joint.vcf.gz; GenomicsDB workspace; sample count and site-level summary

  7. 7. Filter variants to high-confidence calls

    When cohort size supports it, GATK VariantRecalibrator applies VQSR trained on known polymorphism resources (Broad Institute, 2024). Smaller cohorts use documented GATK hard filters with all FILTER labels retained (Koboldt, 2020). DeepVariant projects use quality thresholds documented in the Methods draft. Filtered and pass-only VCFs export separately.

    Tools and outputs

    Tools used: GATK 4.6.0.0 VariantRecalibrator / ApplyVQSR or VariantFiltration; bcftools 1.21

    Output: {cohort}.filtered.vcf.gz; {cohort}.pass-only.vcf.gz; VQSR tranche plots or hard-filter summary table

  8. 8. QC variant call sets

    Pepkio computes Ti/Tv on pass sites, variant counts by type, and FILTER distributions. Caller and aligner choice affected reproducibility more than sequencing platform in a multi-center WGS benchmark (Pan et al., 2022); multi-caller benchmarks are documented when scoped. Problematic regions (segmental duplications, homopolymers) are flagged.

    Tools and outputs

    Tools used: bcftools 1.21 stats; custom QC scripts; hap.py when benchmark comparison is scoped

    Output: variant_qc_summary.csv; sample_qc_summary.csv; filter_summary.csv; Ti/Tv plot; FILTER status distribution chart; optional benchmark comparison table

  9. 9. Annotate variants with transcript consequences

    Ensembl VEP annotates consequence terms, gene symbols, and population or clinical fields when databases are configured (McLaren et al., 2016). ClinVar, gnomAD, or client gene lists are added when scoped at kickoff.

    Tools and outputs

    Tools used: Ensembl VEP 112; bcftools 1.21

    Output: variant_annotation_master.tsv; annotated .vcf.gz when requested

  10. 10. Package deliverables

    MultiQC aggregates QC metrics when multiple stages are in scope (Ewels et al., 2016). Scripts, README, Methods draft with exact versions and filter parameters, and HTML QC report are packaged per agreed retention policy.

    Tools and outputs

    Tools used: MultiQC 1.25; custom packaging scripts

    Output: deliverable bundle with scripts, figures, QC report, and Methods draft

What Pepkio delivers

Processed data files

  • Per-sample .g.vcf.gz; joint {cohort}.joint.vcf.gz
  • Filtered {cohort}.filtered.vcf.gz and {cohort}.pass-only.vcf.gz
  • variant_annotation_master.tsv; coverage_summary.csv; bam_prerequisite_audit.csv
  • variant_qc_summary.csv; sample_qc_summary.csv; filter_summary.csv

Figures (PDF/SVG)

  • Coverage histogram and cumulative distribution
  • Ti/Tv ratio summary; variant consequence bar chart
  • VQSR tranche curves when applicable; FILTER status pie chart
  • Optional multi-caller concordance plot when benchmarked

Tables

  • variant_annotation_master.tsv with CHROM, POS, REF, ALT, QUAL, FILTER, SYMBOL, Consequence, IMPACT, and population/clinical fields when configured
  • sample_qc_summary.csv with mean depth, pct bases ≥15×/≥20×, duplicate rate, Ti/Tv, pass-variant counts, and caller version
  • filter_summary.csv with filter labels and site counts

Code

  • Commented bash, R, and Python scripts per stage
  • Environment lock files; delivery via private Git or agreed file transfer

Documentation

  • HTML/PDF QC report; README; Methods draft
  • Bespoke milestones scoped at kickoff; post-delivery reviewer support within agreed scope (typically ≤20% of deliverables)

Technical decisions we make — and why

Default caller: GATK HaplotypeCaller gVCF for cohorts; DeepVariant when single-sample accuracy is prioritized
GATK gVCF plus GenomicsDBImport is standard for multi-sample joint genotyping (Van der Auwera et al., 2013; Regier et al., 2018). DeepVariant showed the best performance and highest robustness on GIAB gold-standard WES and WGS datasets in a systematic benchmark (Barbitoff et al., 2022; Poplin et al., 2018).
Filtering: VQSR when cohort size supports it; hard filters otherwise
GATK recommends at least one WGS or ~30 exomes for VQSR training (Broad Institute, 2024). Smaller cohorts use documented hard filters with all FILTER labels retained (Koboldt, 2020).
Joint genotyping for n ≥ 2 when compatible
gVCF aggregation rescues weak per-sample evidence (Van der Auwera et al., 2013). Single-sample projects skip GenomicsDBImport.
Reference build: GRCh38 primary with GATK 4.6 resource bundle
Functional-equivalence alignment standards reduce cross-center discordance (Regier et al., 2018). Legacy GRCh37/hg19 on request when BAM headers match.
Prerequisite audit before calling
BAMs missing duplicate marking or BQSR inflate false positives (Van der Auwera et al., 2013; Koboldt, 2020). Pepkio documents upstream QC status rather than delivering calls without disclosure.

Common questions

What is the minimum sequencing depth and cohort size for variant calling?

For WGS BAMs, Pepkio recommends ≥30× mean autosomal depth with >95% callability (Rehm et al., 2021). WES or panel BAMs typically require ~100× mean on-target depth (Illumina, 2024). VQSR requires at least one high-quality WGS or ~30 exomes (Broad Institute, 2024); smaller cohorts use hard filters. Thresholds are confirmed at kickoff.

Can you re-call variants from existing BAMs or gVCFs without re-alignment?

Yes—this is the default entry point. Pepkio audits duplicate marking, BQSR status, read groups, and reference build before calling (Van der Auwera et al., 2013). Compatible client .g.vcf.gz files can enter at joint genotyping. Missing BAM prerequisites can be re-processed when scoped at kickoff.

Can you analyze low-quality or low-coverage BAMs?

Yes, with caveats. Mean depth below ~20× for WGS or ~50× on-target for exome reduces sensitivity, especially for indels >5 bp (Pan et al., 2022). Sub-threshold samples are flagged in sample_qc_summary.csv. Partial calling on high-confidence intervals is possible when re-sequencing is not feasible.

Do you support Illumina, MGI DNBSEQ, Element AVITI, and Ultima BAMs?

Illumina and MGI DNBSEQ short-read BAMs use the standard GATK or DeepVariant workflow on coordinate-sorted alignments. Element AVITI and Ultima UG100 BAMs are supported when scoped at kickoff with validated aligner compatibility. PacBio and Oxford Nanopore SNV/indel calling is referred to the long-read DNA-seq spoke.

How long does a variant calling analysis service project take at Pepkio?

Single-sample projects from client BAMs typically complete in 1–3 weeks. Multi-sample cohorts with joint genotyping and VQSR typically require 3–6 weeks. Timelines depend on sample count, upstream BQSR needs, and annotation scope—all confirmed at kickoff.

How do you handle batch effects across sequencing centers or library prep batches?

Harmonized pipelines aligned to functional-equivalence standards reduce batch-driven differences (Regier et al., 2018). Pepkio stratifies sequencing center, flowcell, and batch in QC reports. Joint genotyping improves sensitivity for weakly supported variants across batches (Van der Auwera et al., 2013). Cross-batch harmonization beyond standard QC is scoped at kickoff when needed.

Do I own the code — and in what format is it delivered?

Yes — you retain full ownership of code, scripts, and results. Pepkio delivers commented bash, R, and Python scripts with environment lock files. Jupyter or R Markdown delivery is available on request.

Can I be involved during analysis?

Yes. Checkpoint reviews occur after BAM prerequisite audit, coverage assessment, and before final delivery. A PhD-level scientific contact leads the project at agreed milestones.

What does post-delivery reviewer support include?

Clarification of methods, filter logic, QC thresholds, and minor figure or table revisions within agreed scope (typically ≤20% of deliverables). Substantial new reviewer requests are scoped separately.

Is co-authorship required?

No. Pepkio operates as a fee-for-service provider and does not require co-authorship unless explicitly discussed in advance.

Should we use GATK HaplotypeCaller or DeepVariant for our cohort?

GATK HaplotypeCaller gVCF is the default for multi-sample joint genotyping and VQSR (Van der Auwera et al., 2013). DeepVariant showed the best performance and highest robustness on GIAB WES and WGS benchmarks and is preferred for single-sample accuracy when scoped (Barbitoff et al., 2022; Poplin et al., 2018). Multi-caller benchmarks against vendor VCFs are scoped at kickoff when discordance is a concern (Pan et al., 2022).

Can you handle custom or non-standard variant calling analyses?

Yes. Bespoke work—multi-caller consensus, re-calling vendor VCFs, tumor–normal somatic extensions, pedigree filtering, plink exports, custom annotation, or gVCF integration—is scoped at kickoff with milestone pricing (Koboldt, 2020).

Related services

References
  1. Van der Auwera GA, Carneiro MO, Hartl C, et al. From FastQ data to high-confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Current Protocols in Bioinformatics. 2013;43(1):11.10.1–11.10.33. https://doi.org/10.1002/0471250953.bi1110s43 (PMID: 25431634)
  2. Van der Auwera GA, O'Connor BD. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra (1st ed.). O'Reilly Media; 2020. ISBN: 978-1-4919-7519-0
  3. Koboldt DC. Best practices for variant calling in clinical sequencing. Genome Medicine. 2020;12:91. https://doi.org/10.1186/s13073-020-00791-w (PMID: 33106175)
  4. Regier AA, Farjoun Y, Larson DE, et al. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. Nature Communications. 2018;9:4038. https://doi.org/10.1038/s41467-018-06159-4 (PMID: 30279509)
  5. Poplin R, Chang PC, Alexander D, et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology. 2018;36(10):983–987. https://doi.org/10.1038/nbt.4235 (PMID: 30247488)
  6. Barbitoff YA, Abasov R, Tvorogova VE, et al. Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery. BMC Genomics. 2022;23:155. https://doi.org/10.1186/s12864-022-08365-3 (PMID: 35193511)
  7. Pan B, Kusko R, Xiao W, et al. Assessing reproducibility of inherited variants detected with short-read whole genome sequencing. Genome Biology. 2022;23:2. https://doi.org/10.1186/s13059-021-02569-8 (PMID: 34980216)
  8. McLaren W, Gil L, Hunt SE, et al. The Ensembl Variant Effect Predictor. Genome Biology. 2016;17(1):122. https://doi.org/10.1186/s13059-016-0974-4 (PMID: 27268795)
  9. Kim S, Scheffler K, Halpern AL, et al. Strelka2: fast and accurate calling of germline and somatic variants. Nature Methods. 2018;15(8):591–594. https://doi.org/10.1038/s41592-018-0051-x (PMID: 30013048)
  10. Rehm HL, Bale SJ, Bayrak-Toydemir P, et al. Best practices for the analytical validation of clinical whole-genome sequencing intended for the diagnosis of germline disease. npj Genomic Medicine. 2021;6(1):47. https://doi.org/10.1038/s41525-020-00154-9 (PMID: 33110627)
  11. Stark Z, Lunke S, Brett GR, et al. Integrated multi-omics for rapid rare disease diagnosis on a national scale. Nature Medicine. 2023;29(7):1771–1781. https://doi.org/10.1038/s41591-023-02401-9 (PMID: 37291213)
  12. Lu C, Xie M, Wendl MC, et al. Patterns and functional implications of rare germline variants across 12 cancer types. Nature Communications. 2015;6:10086. https://doi.org/10.1038/ncomms10086 (PMID: 26689913)
  13. Kinnersley B, Sud A, Everall A, et al. Analysis of 10,478 cancer genomes identifies candidate driver genes and opportunities for precision oncology. Nature Genetics. 2024;56(9):1868–1877. https://doi.org/10.1038/s41588-024-01785-9 (PMID: 38890488)
  14. Abdelwahab O, Belzile F, Torkamaneh D. Performance analysis of conventional and AI-based variant callers using short and long reads. BMC Bioinformatics. 2023;24:472. https://doi.org/10.1186/s12859-023-05596-3 (PMID: 38097928)
  15. Broad Institute. GATK 4.6.0.0 release notes and VQSR documentation. 2024. https://github.com/broadinstitute/gatk/releases/tag/4.6.0.0; https://gatk.broadinstitute.org/hc/en-us/articles/360035531612-Variant-Quality-Score-Recalibration-VQSR
  16. Illumina. Sequencing coverage for NGS experiments. 2024. https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/coverage.html
  17. Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32(19):3047–3048. https://doi.org/10.1093/bioinformatics/btw354 (PMID: 27312411)
  18. Ensembl. VEP documentation. 2024. https://www.ensembl.org/info/docs/tools/vep/index.html

Let's Talk About Your Science

Tell us:

  • • Your biological question
  • • Data type and size
  • • Timeline constraints

We'll tell you:

  • • What's feasible
  • • How long it will take
  • • Exactly what it will cost
Contact Us

Contact us to start with a free consultation. Need everyday bench calculators? Try our free lab tools.