Whole-Genome Sequencing (WGS) Analysis Service — GRCh38-Aligned Germline Variant Calls from Raw FASTQs with Callable-Region QC

Whole-genome sequencing (WGS) analysis calls genome-wide germline SNVs and indels from aligned paired-end reads (Van der Auwera et al., 2013). Pepkio delivers version-pinned FASTQ-to-VCF analysis with custom workflow support for academic, biotech, and pharma clients. Typical targets are ≥30× mean autosomal depth with >95% callability, or ≥95% of autosomes at ≥15×, per consortium specs (Rehm et al., 2021; Genomics England, 2021). Scripts, figures, and a Methods draft are included.

Key facts

Key facts about Whole-Genome Sequencing
Fact	Value
Supported platforms / instruments	Illumina NovaSeq X / 6000 / NextSeq 2000, HiSeq 2500/4000; Element Biosciences AVITI and Ultima Genomics UG100 when scoped at kickoff. PacBio Revio and Oxford Nanopore deferred to long-read DNA-seq spoke
Input requirements	Paired-end FASTQ (≥2×150 bp typical); ≥30× mean autosomal depth and >95% callability recommended (Rehm et al., 2021); ≥95% of autosomes at ≥15× per Genomics England rare-disease QC (Genomics England, 2021). Cohort VQSR: typically at least one WGS sample or ~30 exomes per GATK guidance (Broad Institute, 2024)
Reference builds supported	Human GRCh38 primary (GATK resource bundle 4.6); legacy GRCh37/hg19 on request; mouse GRCm39; custom references scoped at kickoff
Primary tools (with versions)	BWA-MEM2 2.2.1; GATK 4.6.0.0; Picard 3.2.0; samtools 1.21; bcftools 1.21; mosdepth 0.3.3; Ensembl VEP 112; fastp 0.23.4; FastQC 0.12.1; MultiQC 1.25. DeepVariant 1.8.0 optional
Typical turnaround time	2–4 weeks (single-sample germline); 4–8 weeks (multi-sample cohort with joint genotyping) — confirmed at kickoff
Deliverable formats	.bam, .g.vcf.gz, filtered .vcf.gz, annotation .tsv; PDF/SVG figures; HTML QC report; documented bash/R/Python scripts; Methods draft
Key cited best-practice reference	Van der Auwera et al. (2013), Current Protocols in Bioinformatics; Regier et al. (2018), Nature Communications; Rehm et al. (2021), npj Genomic Medicine
Custom / bespoke analysis	Non-standard inputs, outputs, and methods scoped at kickoff—e.g., client BAMs, custom references, custom annotation fields, GWAS-ready plink exports, or tumor–normal somatic extensions

What is whole-genome sequencing (WGS)?

WGS analysis aligns short reads across the full nuclear genome, recalibrates base qualities, and discovers germline SNVs and indels relative to GRCh38, not just in coding exons. Unlike WES, WGS covers introns, promoters, and intergenic regions where pathogenic variants can occur (Rehm et al., 2021). At ~30× mean mapped depth, ~90% of the genome was callable in early benchmarks, so callable fraction matters as much as mean depth (Ajay et al., 2011). Pepkio starts from FASTQs or client BAMs and returns filtered VCFs with coverage QC. WGS also scales to large cancer cohorts: Kinnersley et al. (2024) analyzed 10,478 cancer genomes across 35 tumor types from the UK 100,000 Genomes Project. Custom inputs and deliverables are agreed at kickoff. See the whole-genome sequencing glossary.

When should you use whole-genome sequencing (WGS)?

WGS fits when you need a germline variant catalog across coding and non-coding sequence, uniform coverage outside capture targets, or SNV/indel discovery with optional CNV/SV add-ons.

Comparison of WGS, WES, and targeted gene panels
Approach	Best for	Limitations	Approximate cost range
WGS (short-read)	Rare undiagnosed disease, population cohort catalogs, non-coding variant discovery, uniform genome-wide coverage	Higher sequencing cost per sample than WES; repetitive regions remain challenging; storage and compute heavier than exome	Library prep + sequencing + bioinformatics vary by depth, sample count, and joint-calling scope
WES (capture)	Coding-region variant discovery at lower per-sample sequencing cost	Gaps in non-coding, mitochondrial, and off-target regions; uneven coverage in low-complexity targets (Rehm et al., 2021)	Lower sequencing cost than WGS; capture kit and target breadth affect sensitivity
Targeted gene panel	Known gene sets, high depth for low-VAF mosaicism in specific loci	Limited to panel genes; misses novel loci outside design	Lowest per-sample sequencing cost; panel size drives sensitivity

Undiagnosed rare disease: WGS can detect non-coding variants and broader variant classes relative to WES in diagnostic settings (Rehm et al., 2021).
Cancer predisposition baseline: Light et al. (2023) analyzed WGS from Li-Fraumeni syndrome tumors and reported near-ubiquitous early TP53 loss of heterozygosity with gain of the mutant allele years before diagnosis.
Precision oncology at scale: Kinnersley et al. (2024) analyzed 10,478 cancer WGS profiles from the UK 100,000 Genomes Project, identifying 330 candidate driver genes and estimating that ~55% of patients harbor at least one clinically relevant mutation.

How the analysis works — step by step

1. Validate inputs and sample metadata
Pepkio verifies FASTQ integrity (MD5 checksums), read length, paired-end structure, and read groups (@RG tags) required by GATK (Van der Auwera et al., 2013). Sample sex, batch, and sequencing center are recorded in sample_manifest.csv; sub-threshold yield is flagged before alignment.
Tools and outputs
Tools used: md5sum; custom validation scripts
Output: sample_manifest.csv with sample IDs, flowcell/lane, read counts, and QC flags
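The input checks above can be sketched in a few lines of Python. This is a minimal illustration, not Pepkio's validation script: the function names are hypothetical, and the read-group derivation assumes Illumina Casava 1.8+ FASTQ headers (instrument:run:flowcell:lane:...), which will not hold for every data source.

```python
import hashlib
import re


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(expected):
    """Map each path to True/False depending on whether its MD5 matches."""
    return {path: md5_of_file(path) == md5 for path, md5 in expected.items()}


def read_group_fields(fastq_header, sample):
    """Derive @RG fields from an Illumina-style FASTQ header line.

    Assumption: the header looks like
    @instrument:run:flowcell:lane:tile:x:y ...
    Any other layout returns None so it can be handled manually.
    """
    match = re.match(r"@([^:\s]+):([^:\s]+):([^:\s]+):(\d+):", fastq_header)
    if match is None:
        return None
    flowcell, lane = match.group(3), match.group(4)
    return {
        "ID": f"{flowcell}.{lane}",
        "SM": sample,
        "PU": f"{flowcell}.{lane}.{sample}",
        "PL": "ILLUMINA",
    }
```

A header that does not match the assumed layout (for example, reads re-downloaded from a public archive) returns None, which is a cue to set the read group by hand rather than guess.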
2. QC raw reads
FastQC assesses per-base quality, adapter content, and duplication; fastp trims adapters and low-quality ends when needed. Libraries with low Q30 yield or extreme adapter contamination are flagged before alignment.
Tools and outputs
Tools used: FastQC 0.12.1; fastp 0.23.4
Output: fastqc/ reports; fastp.json / fastp.html trim statistics; trimmed FASTQs when trimming is applied
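To make the Q30 yield check concrete, the sketch below computes the fraction of bases at or above Q30 directly from FASTQ quality strings. It assumes Phred+33 encoding, and the helper names are illustrative; FastQC and fastp report equivalent statistics, so this only shows what the metric means.

```python
def fastq_qualities(lines):
    """Yield the quality string (every fourth line) from FASTQ text lines."""
    for index, line in enumerate(lines):
        if index % 4 == 3:
            yield line.rstrip("\n")


def q30_fraction(quality_strings, offset=33, threshold=30):
    """Fraction of bases whose Phred score is >= threshold.

    Assumes Phred+33 encoding (offset=33). Returns 0.0 for empty input.
    """
    total = passing = 0
    for qual in quality_strings:
        for char in qual:
            total += 1
            if ord(char) - offset >= threshold:
                passing += 1
    return passing / total if total else 0.0
```

For example, `q30_fraction(fastq_qualities(open("sample_R1.fastq")))` would give the Q30 fraction for an uncompressed file; the file name here is a placeholder.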
3. Align reads to the reference genome
Paired-end reads are aligned to GRCh38 (or agreed build) with BWA-MEM2 (-M for Picard/GATK compatibility; Vasimuddin et al., 2019), producing coordinate-sorted BAM. Mapping rate and insert-size distribution are compared against expected ranges.
Tools and outputs
Tools used: BWA-MEM2 2.2.1; samtools 1.21
Output: {sample}.sorted.bam; alignment summary metrics
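The alignment step can be expressed as two piped commands. The sketch below only builds the argument lists and does not run them; file names, the thread count, and the helper names are placeholders, and the project's actual invocation may differ. One detail worth noting is that BWA expects the tab separators in the -R read-group string as the two characters backslash and t, not as literal tabs.

```python
import shlex


def read_group_string(fields):
    """Join @RG fields with an escaped tab (backslash + t), as BWA's -R expects."""
    return "\\t".join(["@RG"] + [f"{key}:{value}" for key, value in fields.items()])


def alignment_commands(ref, fq1, fq2, read_group, out_bam, threads=8):
    """Return (bwa-mem2 command, samtools sort command) as argument lists."""
    bwa = ["bwa-mem2", "mem", "-M", "-t", str(threads),
           "-R", read_group, ref, fq1, fq2]
    # "-" tells samtools sort to read the alignments from standard input.
    sort = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    return bwa, sort


def as_shell(bwa, sort):
    """Render the two commands as a single quoted shell pipeline."""
    return " | ".join(shlex.join(cmd) for cmd in (bwa, sort))
```

Printing `as_shell(...)` into the run log is one simple way to keep the exact command line alongside the alignment summary metrics.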
4. Mark duplicates and index BAMs
PCR and optical duplicates are marked with Picard MarkDuplicates so downstream callers do not double-count clustered molecules (Van der Auwera et al., 2013). Elevated duplicate rates trigger review of library prep metadata. BAM indices are generated for random access.
Tools and outputs
Tools used: Picard 3.2.0 MarkDuplicates; samtools 1.21 index
Output: {sample}.dedup.bam and .bai; duplicate metrics table
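As an illustration of the duplicate-rate review, the sketch below reads the metrics table that Picard MarkDuplicates writes and lists libraries above a cutoff. The 0.20 default is a hypothetical placeholder, not a Pepkio threshold. Note that Picard's PERCENT_DUPLICATION column holds a fraction between 0 and 1 despite its name.

```python
def parse_picard_metrics(text):
    """Parse the first metrics table of a Picard metrics file into dict rows.

    The table starts on the line after '## METRICS CLASS' and ends at the
    first blank line; all values are returned as strings.
    """
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            header = lines[index + 1].split("\t")
            rows = []
            for row in lines[index + 2:]:
                if not row.strip():
                    break
                rows.append(dict(zip(header, row.split("\t"))))
            return rows
    return []


def duplicate_flags(rows, max_fraction=0.20):
    """Return library names whose duplicate fraction exceeds max_fraction.

    max_fraction is an illustrative default, to be replaced by the
    threshold agreed for the project.
    """
    return [row["LIBRARY"] for row in rows
            if float(row["PERCENT_DUPLICATION"]) > max_fraction]
```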
5. Recalibrate base quality scores
GATK BaseRecalibrator and ApplyBQSR adjust per-base qualities using known polymorphism sites from the GATK resource bundle (Van der Auwera et al., 2013). Recalibration reports are inspected for covariate drift across cycles and read groups.
Tools and outputs
Tools used: GATK 4.6.0.0 BaseRecalibrator; GATK ApplyBQSR
Output: {sample}.recal.bam; recalibration report PDF
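The two-pass structure of base quality recalibration (build a model, then apply it) looks roughly like this when written as GATK4 command lines. The sketch only assembles the arguments; all paths are placeholders, the helper name is hypothetical, and the choice of known-sites files depends on the resource bundle agreed for the project.

```python
def bqsr_commands(ref, in_bam, known_sites, recal_table, out_bam):
    """Return (BaseRecalibrator, ApplyBQSR) argument lists for GATK4."""
    recalibrate = ["gatk", "BaseRecalibrator",
                   "-R", ref, "-I", in_bam, "-O", recal_table]
    # Each known-sites VCF (e.g. dbSNP, known indels) gets its own flag.
    for vcf in known_sites:
        recalibrate += ["--known-sites", vcf]
    apply = ["gatk", "ApplyBQSR",
             "-R", ref, "-I", in_bam,
             "--bqsr-recal-file", recal_table, "-O", out_bam]
    return recalibrate, apply
```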
6. Assess coverage and callability
mosdepth and Picard CollectWgsMetrics report mean coverage and the percentage of autosomal bases covered at ≥15× and ≥20×. Callable breadth, not mean depth alone, is reported because ~30× mean depth left ~10% of the genome uncalled in early benchmarks (Ajay et al., 2011). Samples below agreed thresholds are flagged before calling.
Tools and outputs
Tools used: mosdepth 0.3.3; Picard 3.2.0 CollectWgsMetrics
Output: coverage_summary.csv; mosdepth.global.dist.txt; wgs_metrics.txt; coverage histogram and CDF plots
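A breadth check of this kind can be read straight from mosdepth's cumulative distribution file, where each row gives a chromosome (or "total"), a depth, and the proportion of bases covered at that depth or higher. The sketch below is illustrative: the function names are hypothetical, and the "total" row covers every contig in the run, so it only approximates autosomal breadth unless mosdepth was restricted to autosomes.

```python
def breadth_at(dist_lines, min_depth, chrom="total"):
    """Return the fraction of bases covered at >= min_depth, or None.

    dist_lines are lines of a mosdepth *.global.dist.txt file:
    chrom <tab> depth <tab> cumulative fraction.
    """
    for line in dist_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue
        name, depth, fraction = fields
        if name == chrom and int(depth) == min_depth:
            return float(fraction)
    return None


def passes_breadth(dist_lines, min_depth=15, min_fraction=0.95, chrom="total"):
    """True when breadth at min_depth meets min_fraction (defaults mirror the
    >=95% at >=15x criterion cited above)."""
    breadth = breadth_at(dist_lines, min_depth, chrom)
    return breadth is not None and breadth >= min_fraction
```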
7. Call germline SNVs and indels per sample
GATK HaplotypeCaller runs in -ERC GVCF mode per sample, emitting gVCF blocks that retain reference confidence across non-variant regions (Van der Auwera et al., 2013). DeepVariant 1.8.0 is available as an alternative single-sample caller when scoped at kickoff (Poplin et al., 2018). Ti/Tv ratios and variant counts are checked against expectations for the species and build.
Tools and outputs
Tools used: GATK 4.6.0.0 HaplotypeCaller
Output: {sample}.g.vcf.gz and .tbi; per-sample variant count summary
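The Ti/Tv sanity check is simple enough to show in full. The sketch below counts transitions (A<->G, C<->T) and transversions over biallelic SNVs given as (REF, ALT) pairs; the function name is illustrative. For human whole-genome call sets a ratio of roughly 2.0 to 2.1 is commonly expected, and a markedly lower value often points to excess false positives.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_ratio(ref_alt_pairs):
    """Transition/transversion ratio over (REF, ALT) pairs.

    Indels, symbolic alleles, and non-ACGT bases are skipped.
    Returns float('inf') when there are no transversions.
    """
    transitions = transversions = 0
    for ref, alt in ref_alt_pairs:
        ref, alt = ref.upper(), alt.upper()
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue
        if ref not in "ACGT" or alt not in "ACGT":
            continue
        if (ref, alt) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    return transitions / transversions if transversions else float("inf")
```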
8. Joint-genotype cohorts when applicable
For multi-sample projects, gVCFs are imported into GenomicsDB and joint-genotyped with GenotypeGVCFs, rescuing variants weakly supported in individual samples (Van der Auwera et al., 2013). Single-sample projects skip this step and proceed directly to filtering. GenomicsDB workspace paths and interval lists are documented for reproducibility.
Tools and outputs
Tools used: GATK 4.6.0.0 GenomicsDBImport; GATK GenotypeGVCFs
Output: {cohort}.joint.vcf.gz; GenomicsDB workspace; sample count and site-level summary
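For orientation, the import-then-genotype sequence corresponds to two GATK4 commands, sketched below as argument lists. Paths, the interval argument, and the helper name are placeholders; production runs typically shard by interval, which this sketch leaves out.

```python
def joint_genotyping_commands(ref, gvcfs, workspace, intervals, out_vcf):
    """Return (GenomicsDBImport, GenotypeGVCFs) argument lists for GATK4."""
    importer = ["gatk", "GenomicsDBImport",
                "--genomicsdb-workspace-path", workspace,
                "-L", intervals]
    # One -V per per-sample gVCF produced by HaplotypeCaller.
    for gvcf in gvcfs:
        importer += ["-V", gvcf]
    genotype = ["gatk", "GenotypeGVCFs",
                "-R", ref,
                "-V", f"gendb://{workspace}",
                "-O", out_vcf]
    return importer, genotype
```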
9. Filter variants to high-confidence calls
When cohort size supports it, GATK VariantRecalibrator applies VQSR (Broad Institute, 2024); smaller cohorts use documented hard filters. Filtered and pass-only VCFs are exported with Ti/Tv sanity checks on pass sites.
Tools and outputs
Tools used: GATK 4.6.0.0 VariantRecalibrator / ApplyVQSR or VariantFiltration; bcftools 1.21
Output: {cohort}.filtered.vcf.gz; {cohort}.pass-only.vcf.gz; VQSR tranche plots or hard-filter summary
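When hard filters are used instead of VQSR, each variant's annotations are compared against fixed cutoffs and failing labels are written to the FILTER column. The sketch below uses the generic SNP starting thresholds published in GATK's hard-filtering guidance; they are shown as an example, not as the thresholds Pepkio applies to a given project, and the filter labels and function names are illustrative.

```python
# label: (INFO key, comparison that FAILS the site, cutoff)
SNP_HARD_FILTERS = {
    "QD2": ("QD", "<", 2.0),
    "FS60": ("FS", ">", 60.0),
    "MQ40": ("MQ", "<", 40.0),
    "SOR3": ("SOR", ">", 3.0),
    "MQRankSum-12.5": ("MQRankSum", "<", -12.5),
    "ReadPosRankSum-8": ("ReadPosRankSum", "<", -8.0),
}


def failed_filters(info, filters=SNP_HARD_FILTERS):
    """Return labels of every hard filter a SNP's INFO annotations fail.

    Annotations missing from `info` are skipped rather than failed.
    """
    failed = []
    for label, (key, operator, cutoff) in filters.items():
        value = info.get(key)
        if value is None:
            continue
        if (operator == "<" and value < cutoff) or (operator == ">" and value > cutoff):
            failed.append(label)
    return failed


def filter_field(info):
    """Render the VCF FILTER value: 'PASS' or semicolon-joined labels."""
    failed = failed_filters(info)
    return ";".join(failed) if failed else "PASS"
```

Keeping the labels in the output, rather than dropping failing sites, lets a reader see why a call was excluded from the pass-only VCF.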
10. Annotate variants and package deliverables
Ensembl VEP annotates consequences (e.g., missense, frameshift, splice region), gene symbols, and population or clinical fields when reference databases are configured for the project (McLaren et al., 2016). MultiQC aggregates QC metrics across samples. Final scripts, README, Methods draft, and HTML QC report are packaged per agreed retention policy.
Tools and outputs
Tools used: Ensembl VEP 112; bcftools 1.21; MultiQC 1.25
Output: variant_annotation_master.tsv; MultiQC report; final deliverable bundle with scripts and Methods draft
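VEP writes its annotations into a pipe-delimited CSQ entry in the VCF INFO column, with the field order declared in the VCF header. The sketch below shows one way to turn that into per-transcript records, as a step toward a flat table; the function names are illustrative, and which fields are present depends on how VEP was configured.

```python
import re


def csq_fields(header_line):
    """Extract CSQ field names from the '##INFO=<ID=CSQ,...>' header line."""
    match = re.search(r'Format: ([^">]+)', header_line)
    return match.group(1).strip().split("|") if match else []


def parse_csq(csq_value, fields):
    """Split a CSQ value into one dict per annotated transcript.

    Entries are comma-separated; fields within an entry are pipe-separated.
    """
    return [dict(zip(fields, entry.split("|")))
            for entry in csq_value.split(",")]
```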

What Pepkio delivers

Processed data files

Coordinate-sorted .bam and .bai; recalibrated BAM
Per-sample .g.vcf.gz; joint and filtered .vcf.gz for cohort projects
variant_annotation_master.tsv; coverage_summary.csv; wgs_metrics.txt; sample_qc_summary.csv

Figures (PDF/SVG)

Per-base FastQC heatmaps; coverage histogram and cumulative distribution
Insert-size distribution; Ti/Tv ratio summary
Variant consequence bar chart (missense, synonymous, LOF, etc.)
VQSR tranche curves when applicable

Tables

variant_annotation_master.tsv with CHROM, POS, REF, ALT, QUAL, FILTER, SYMBOL, Consequence, IMPACT, and population/clinical fields when configured (e.g., gnomAD allele frequency, ClinVar significance)
sample_qc_summary.csv with mapping rate, mean depth, pct autosomes ≥15×/≥20×, duplicate rate, and pass-variant counts

Code

Commented bash, R, and Python scripts per stage
Environment lock files; delivery via private Git or agreed file transfer

Documentation

HTML/PDF QC report; README with reproduction instructions
Methods draft with exact software versions
Bespoke milestones scoped at kickoff; post-delivery reviewer support within agreed scope (typically ≤20% of deliverables)

Technical decisions we make — and why

Aligner: BWA-MEM2 default: BWA-MEM2 2.2.1 aligns Illumina WGS reads with improved speed over BWA-MEM while maintaining GATK Best Practices compatibility (Vasimuddin et al., 2019; Van der Auwera et al., 2013). minimap2 is used only when long-read data are scoped to the long-read DNA-seq spoke.
Caller: GATK HaplotypeCaller gVCF + joint genotyping for cohorts: gVCF mode preserves reference confidence and supports cohort joint genotyping via GenomicsDBImport (Van der Auwera et al., 2013; Regier et al., 2018). DeepVariant 1.8.0 is an alternative for single-sample projects when clients prefer neural-network calling (Poplin et al., 2018).
Filtering: VQSR when cohort size supports it; hard filters otherwise: VQSR requires sufficient variant sites for Gaussian mixture training; GATK recommends at least one WGS or ~30 exomes (Broad Institute, 2024). Smaller cohorts use documented GATK hard filters with filter labels retained in output.
Annotation: Ensembl VEP: VEP provides consistent transcript-level consequence terms across Ensembl/GENCODE annotations (McLaren et al., 2016). ANNOVAR or custom gene lists are available on request when clients require specific transcript databases.
Coverage QC: breadth at 15×/20×, not mean depth alone: Clinical WGS validation emphasizes callability and depth thresholds alongside mean coverage (Rehm et al., 2021; Ajay et al., 2011). Pepkio flags samples below agreed breadth before variant calling proceeds.

Common questions

What is the minimum sequencing depth and sample count for WGS analysis?

For germline SNV/indel discovery on GRCh38, Pepkio recommends ~30× mean autosomal depth with >95% callability (Rehm et al., 2021) and ≥95% autosomes at ≥15× where Genomics England-style QC applies (Genomics England, 2021). VQSR typically requires at least one WGS or ~30 exomes (Broad Institute, 2024). Thresholds are confirmed at kickoff.
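
As a rough planning aid, raw mean depth follows from read count, read length, and genome size. The sketch below assumes 2×150 bp reads and a genome size of about 3.1 Gb, which is an approximation; it ignores duplicates, unmapped reads, and trimming, so the mapped depth delivered will be lower than this raw estimate.

```python
import math

GENOME_SIZE = 3_100_000_000  # assumed approximate human genome size in bases


def raw_mean_depth(read_pairs, read_length=150, genome_size=GENOME_SIZE):
    """Raw mean depth = total sequenced bases / genome size."""
    return read_pairs * 2 * read_length / genome_size


def read_pairs_for_depth(target_depth, read_length=150, genome_size=GENOME_SIZE):
    """Smallest number of read pairs that reaches target_depth (raw)."""
    return math.ceil(target_depth * genome_size / (2 * read_length))
```

Under these assumptions, 30× raw depth needs about 310 million 2×150 bp read pairs; planning for some margin above that figure helps absorb duplicate and unmapped reads.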

Can you analyze low-quality or low-yield WGS libraries?

Yes, with caveats. Low Q30 yield, high adapter content, or mean depth below ~20× reduce callable fraction (Ajay et al., 2011). Sub-threshold samples are flagged; re-sequencing is discussed before full calling. Partial analysis is possible when re-sequencing is not feasible.

Do you support Illumina NovaSeq, Element AVITI, and Ultima UG100 data?

Illumina NovaSeq X, NovaSeq 6000, NextSeq 2000, and HiSeq data use the standard BWA-MEM2 + GATK workflow. Element AVITI and Ultima UG100 FASTQs can be processed when scoped at kickoff. PacBio and Oxford Nanopore WGS are referred to the long-read DNA-seq spoke.

How long does WGS analysis take at Pepkio?

Single-sample germline projects typically complete in 2–4 weeks; multi-sample cohorts with joint genotyping and VQSR typically require 4–8 weeks. Timelines are confirmed at kickoff.

Do I own the code — and in what format is it delivered?

Yes — you retain full ownership of code, scripts, and results. Pepkio delivers commented bash, R, and Python scripts with environment lock files. Jupyter or R Markdown delivery is available on request.

Can I be involved during analysis?

Yes. Checkpoint reviews occur after alignment QC, coverage assessment, and before final delivery. A PhD-level scientific contact leads the project.

What does post-delivery reviewer support include?

Clarification of methods, QC thresholds, and minor figure or table revisions within agreed scope (typically ≤20% of deliverables). Methods and Supplementary drafts are included for analyses we performed; substantial new reviewer requests are scoped separately.

Is co-authorship required?

No. Pepkio does not require co-authorship unless explicitly discussed. Acknowledgment of bioinformatics support is standard practice.

Should I use GRCh38 or hg19 for my WGS analysis?

Pepkio defaults to GRCh38 with the GATK 4.6 resource bundle (Broad Institute, 2024; Regier et al., 2018). GRCh37/hg19 is supported on request for legacy cohorts with matching annotation builds. Build choice is documented in the Methods draft.

How do you handle batch effects and joint genotyping in multi-batch cohorts?

Harmonized pipelines reduce batch-driven call differences (Regier et al., 2018). Pepkio stratifies sequencing center, flowcell, and batch in QC reports. Joint genotyping via gVCF aggregation improves sensitivity for weakly supported variants (Van der Auwera et al., 2013). Batch-specific correction beyond standard QC is scoped at kickoff when needed.

Can Pepkio perform tumor–normal somatic WGS analysis?

Somatic tumor–normal WGS is a separately scoped milestone—not the default germline workflow. Matched normals, higher tumor depth, and distinct callers are defined at kickoff.

Can you handle custom or non-standard WGS analyses?

Yes. Bespoke work—client BAMs, custom references, plink exports, pedigree filtering, or CNV/SV integration—is scoped at kickoff with milestone pricing.

Related services

Whole-exome sequencing — Lower-cost coding-region variant discovery when non-coding coverage is not required.
Variant calling — Caller selection, filter tuning, and joint genotyping when alignment is already complete.
CNV and structural variation — Genome-wide copy-number and structural variant calling from WGS alignments.
Long-read DNA sequencing — Phased variants and structural events that short-read WGS cannot fully resolve.
Custom consulting — Pre-sequencing depth, cohort size, and reference-build planning before library prep.

References

Van der Auwera GA, Carneiro MO, Hartl C, et al. From FastQ data to high-confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Current Protocols in Bioinformatics. 2013;43(1):11.10.1–11.10.33. https://doi.org/10.1002/0471250953.bi1110s43 (PMID: 25431634)
Regier AA, Farjoun Y, Larson DE, et al. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. Nature Communications. 2018;9:4038. https://doi.org/10.1038/s41467-018-06159-4 (PMID: 30279509)
Rehm HL, Bale SJ, Bayrak-Toydemir P, et al. Best practices for the analytical validation of clinical whole-genome sequencing intended for the diagnosis of germline disease. npj Genomic Medicine. 2021;6(1):47. https://doi.org/10.1038/s41525-020-00154-9 (PMID: 33110627)
Ajay SS, Parker SCJ, Abaan HO, et al. Accurate and comprehensive sequencing of personal genomes. Genome Research. 2011;21(9):1498–1505. https://doi.org/10.1101/gr.123638.111 (PMID: 21771779)
Poplin R, Chang PC, Alexander D, et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology. 2018;36(10):983–987. https://doi.org/10.1038/nbt.4235 (PMID: 30247488)
McLaren W, Gil L, Hunt SE, et al. The Ensembl Variant Effect Predictor. Genome Biology. 2016;17(1):122. https://doi.org/10.1186/s13059-016-0974-4 (PMID: 27268795)
Kinnersley B, Sud A, Everall A, et al. Analysis of 10,478 cancer genomes identifies candidate driver genes and opportunities for precision oncology. Nature Genetics. 2024;56(9):1868–1877. https://doi.org/10.1038/s41588-024-01785-9 (PMID: 38890488)
Light N, Layeghifard M, Attery A, et al. Germline TP53 mutations undergo copy number gain years prior to tumor diagnosis. Nature Communications. 2023;14:77. https://doi.org/10.1038/s41467-022-35727-y (PMID: 36604421)
Vasimuddin M, Misra S, Li H, Aluru S. Efficient architecture-aware acceleration of BWA-MEM for multicore systems. IEEE IPDPS. 2019. https://doi.org/10.1109/IPDPS.2019.00041 (BWA-MEM2)
Broad Institute. GATK 4.6.0.0 release notes and VQSR documentation. 2024. https://github.com/broadinstitute/gatk/releases/tag/4.6.0.0; https://gatk.broadinstitute.org/hc/en-us/articles/360035531612-Variant-Quality-Score-Recalibration-VQSR
Genomics England. Rare Disease Genome Analysis Guide — quality control (genome coverage). 2021. https://pipeline-rd-help.genomicsengland.co.uk/latest/bioinformatics-pipeline/quality-control-and-genomic-identity-checks/quality-control/
Ensembl. VEP documentation. 2024. https://www.ensembl.org/info/docs/tools/vep/index.html
