CNV and Structural Variation Analysis Service — Multi-Caller Copy-Number Segments and Breakpoint-Resolved SV VCFs from WGS or WES BAMs

Copy-number and structural variation (CNV/SV) analysis detects genome-wide dosage changes and rearrangements from read depth and from paired-end and split-read evidence (Collins et al., 2020). Pepkio delivers version-pinned segment tables, SV VCFs, gene-overlap annotations, and scripts, with bespoke workflow support, for academic, biotech, and pharma teams starting from BAMs or scoped alignments. For context, Collins et al. (2020) reported a median of 7,439 high-quality SVs per genome in gnomAD-SV; this is a reference benchmark, not a Pepkio deliverable.

Key facts

Key facts about CNV and Structural Variation
Fact	Value
Supported platforms / instruments	Pre-aligned BAMs or CRAMs from Illumina NovaSeq X / 6000 / NextSeq 2000, HiSeq 2500/4000; MGI DNBSEQ (BGI) short-read WGS/WES; Element Biosciences AVITI and Ultima Genomics UG100 when scoped at kickoff. WGS preferred; WES capture-aware. Re-alignment from FASTQ scoped separately (BWA-MEM2 2.2.1)
Input requirements	WGS: ≥30× mean autosomal depth and >95% callability recommended (Rehm et al., 2021). WES: ≥50–100× on-target depth; all callers perform poorly on sparse WES relative to WGS (Gabrielaite et al., 2021). gCNV cohort model: ≥100 technically matched WGS samples (Broad Institute, 2024). Smaller cohorts or single-sample: CNVkit 0.9.11 or GATK-SV single-sample mode with a pre-computed reference panel (Broad Institute, 2024)
Reference builds supported	Human GRCh38 primary (GATK resource bundle 4.6); legacy GRCh37/hg19 on request; mouse GRCm39; custom references scoped at kickoff
Primary tools (with versions)	GATK 4.6.0.0 (GermlineCNVCaller / gCNV); Manta 1.6.0; DELLY 1.2.6; Lumpy 0.2.13 via smoove 0.2.8; CNVkit 0.9.11 when scoped; cn.MOPS 1.56.0 for cohort depth CNVs; samtools/bcftools 1.21; mosdepth 0.3.3; bedtools 2.31.1; Picard 3.2.0; MultiQC 1.25
Typical turnaround time	2–4 weeks (single-sample from BAM); 4–8 weeks (multi-sample cohort with gCNV model training and multi-caller merge) — confirmed at kickoff
Deliverable formats	.cnv.vcf.gz, .sv.vcf.gz, segment .bed/.tsv; PDF/SVG figures; HTML QC report; documented bash/R/Python scripts; Methods draft
Key cited best-practice reference	Gabrielaite et al. (2021), Cancers; Collins et al. (2020), Nature; Mahmoud et al. (2019), Genome Biology
Custom / bespoke analysis	Non-standard inputs, outputs, and methods scoped at kickoff—e.g., somatic tumor–normal CNV/SV, locus-specific re-calling, client reference panels, array validation overlays, CytoScan comparison tables, or non-standard export formats

What is CNV and structural variation analysis?

CNV and structural variation analysis segments read-depth profiles to call copy-number gains and losses, then integrates discordant read pairs and split reads to detect rearrangements typically ≥50 bp (Mahmoud et al., 2019; Broad Institute, 2024). It measures dosage and large-scale architecture—not SNVs and indels, which require separate variant calling (Gabrielaite et al., 2021). Pepkio returns annotated segment tables and SV VCFs from client BAMs or scoped alignments, with sensitivity limits documented per assay. See the CNV and structural variation glossary.

When should you use CNV and structural variation analysis?

CNV/SV analysis fits when SNV/indel calls alone are insufficient: suspected dosage changes, translocations, or inversions that small-variant callers cannot detect, or rare-disease cases with negative WES/WGS SNV panels. It applies when aligned WGS or WES data already exist or are planned.

Comparison of CNV/SV calling approaches
Approach	Best for	Limitations	Approximate cost range
CNV/SV from WGS/WES BAMs (this service)	Genome-wide dosage changes, breakpoint-resolved SVs, undiagnosed disease after negative SNV panels	WES misses many non-coding and large CNVs; short-read misses complex SVs and repeat-rich loci (Gabrielaite et al., 2021; Ebert et al., 2021)	Bioinformatics-only; lower than full FASTQ pipeline when BAMs exist
SNV/indel variant calling only	Base-pair mutations, cohort SNV harmonization	Does not reliably detect most CNVs or balanced SVs	Lower compute when CNV/SV not required
SNP-array / CytoScan CNV	Clinical-grade CNV at established loci with array validation history	Fixed probe coverage; limited breakpoint resolution and novel SV discovery	Lowest genotyping cost; resolution limited by probe density
Long-read SV (separate spoke)	Complex, repeat-rich, phased structural events	Higher per-base sequencing cost than short-read WGS	Library prep + sequencing + bioinformatics vary by platform and depth

Rare disease trios negative on WES: Gabrielaite et al. (2021) benchmarked 11 CNV callers on matched WES and WGS; WGS outperformed WES for CNV recall, and no single caller captured all events—they recommend combining GATK gCNV, Lumpy, DELLY, and cn.MOPS.
Population-scale SV reference catalogs: Collins et al. (2020) integrated Manta, DELLY, MELT, and cn.MOPS across 14,891 genomes in gnomAD-SV, producing a harmonized SV reference for allele-frequency filtering.
Complex de novo SVs in undiagnosed disease: Jung et al. (2025) analyzed 13,698 probands from the UK 100,000 Genomes Project; 8.4% of de novo SVs were complex, and 22% of array- or WES-called simple deletions/duplications were reclassified as complex on WGS.

How the analysis works — step by step

1. Validate BAM/CRAM inputs and metadata
Pepkio verifies MD5 checksums, coordinate-sorted BAM or CRAM structure, indexed .bai/.crai presence, and @RG tags. Reference build, sample sex, assay type, platform, and batch are recorded in sample_manifest.csv. Mismatched builds or missing indices are flagged before calling.
Tools and outputs
Tools used: md5sum; samtools 1.21 quickcheck; custom validation scripts
Output: sample_manifest.csv with sample IDs, reference build, assay type, platform, input path, and QC flags
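The input checks in this step can be sketched in a few lines of Python. This is a minimal illustration, not Pepkio's validation script: the function names, the flag strings, and the index-naming conventions it accepts are assumptions, and it does not replace samtools quickcheck or the @RG tag inspection.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through MD5 so large BAMs are never loaded fully into memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_candidates(alignment: Path) -> list:
    """Return index paths conventionally paired with a BAM or CRAM (assumed naming)."""
    suffix = alignment.suffix.lower()
    if suffix == ".bam":
        # Both sample.bam.bai and sample.bai are in common use.
        return [alignment.with_name(alignment.name + ".bai"), alignment.with_suffix(".bai")]
    if suffix == ".cram":
        return [alignment.with_name(alignment.name + ".crai"), alignment.with_suffix(".crai")]
    return []


def validate_input(alignment: Path, expected_md5: str) -> list:
    """Collect QC flags for one input; an empty list means no problem was found."""
    if not alignment.exists():
        return ["missing_file"]
    flags = []
    candidates = index_candidates(alignment)
    if not candidates:
        flags.append("unsupported_extension")
    elif not any(path.exists() for path in candidates):
        flags.append("missing_index")
    if md5_of(alignment) != expected_md5.lower():
        flags.append("md5_mismatch")
    return flags
```

The returned flags could populate the QC-flag column of sample_manifest.csv, so that problems are visible before any caller runs.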
2. Audit coverage and callability
mosdepth and Picard CollectWgsMetrics (WGS) or CollectHsMetrics (WES) report mean coverage, the percentage of bases covered at ≥15× and ≥20×, duplicate rate, and insert-size distribution. Samples below the agreed thresholds are flagged (Rehm et al., 2021; Ajay et al., 2011).
Tools and outputs
Tools used: mosdepth 0.3.3; Picard 3.2.0 CollectWgsMetrics or CollectHsMetrics
Output: coverage_summary.csv; mosdepth.global.dist.txt; wgs_metrics.txt or hs_metrics.txt; coverage histogram and CDF plots
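As a rough illustration of how the ≥15× and ≥20× fractions can be read from mosdepth output, the sketch below parses the content of a mosdepth.global.dist.txt file. It assumes the three-column layout (chromosome, depth, cumulative fraction of bases covered at that depth or higher) with a "total" row group; the function name is hypothetical.

```python
def fraction_at_depth(dist_text, thresholds=(15, 20), chrom="total"):
    """Read cumulative coverage fractions from mosdepth *.global.dist.txt content.

    Assumed line layout: chrom <tab> depth <tab> fraction of bases covered at >= depth.
    """
    wanted = set(thresholds)
    found = {}
    for line in dist_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 3 or fields[0] != chrom:
            continue
        depth = int(fields[1])
        if depth in wanted:
            found[depth] = float(fields[2])
    # A depth that never appears in the file is treated as reached by no bases.
    return {threshold: found.get(threshold, 0.0) for threshold in thresholds}


example = "total\t30\t0.50\ntotal\t20\t0.96\ntotal\t15\t0.98\ntotal\t0\t1.00\n"
print(fraction_at_depth(example))  # {15: 0.98, 20: 0.96}
```

Comparing these fractions with the thresholds agreed at kickoff gives the per-sample flags recorded in coverage_summary.csv.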
3. Prepare bin-level read-depth profiles
Genomic bins match the assay—uniform bins for WGS (GATK defaults start at ~1 kb; larger when scoped), capture-aware intervals for WES (Talevich et al., 2016). GC bias and mappability are assessed per bin; blacklisted regions and sex chromosomes are handled per project design.
Tools and outputs
Tools used: bedtools 2.31.1; mosdepth 0.3.3; custom bin-generation scripts
Output: {project}.wgs_bins.bed or {project}.capture_targets.bed; depth_by_bin.tsv; GC bias diagnostic plots
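Uniform WGS binning and a per-bin GC fraction can be sketched as follows. This is a simplified stand-in for the bin-generation scripts, not the scripts themselves: the 1 kb default mirrors the bin size mentioned above, and the function names are assumptions. Mappability and blacklist handling are omitted.

```python
def uniform_bins(chrom, length, bin_size=1000):
    """Yield half-open BED-style (chrom, start, end) bins; the last bin may be shorter."""
    for start in range(0, length, bin_size):
        yield chrom, start, min(start + bin_size, length)


def gc_fraction(sequence):
    """GC fraction over unambiguous bases; None when the bin contains no A/C/G/T."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else None


for chrom, start, end in uniform_bins("chr1", 2500):
    print(chrom, start, end)
```

Bins whose GC fraction is None (all N) are natural candidates for exclusion before depth normalization.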
4. Train or select the CNV reference model
For cohorts with ≥100 matched germline WGS samples, GATK GermlineCNVCaller trains a cohort gCNV model (Broad Institute, 2024). Smaller cohorts, somatic projects, and WES default to CNVkit; GATK-SV single-sample mode with a pre-computed reference panel is the alternative for individual germline WGS samples (Broad Institute, 2024). Training samples should match cases in library prep and batch (Collins et al., 2020).
Tools and outputs
Tools used: GATK 4.6.0.0 GermlineCNVCaller model training; CNVkit 0.9.11 batch reference build when scoped
Output: {batch}.gcnv_model/ or {project}.cnvkit_reference.cnn; model training QC report; cnv_reference_manifest.csv
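The selection logic in this step can be written down as a small decision function. The sketch below is one reading of the rule described above, not a fixed policy: the function name, the returned labels, and the treatment of every non-germline or WES project as CNVkit-only are assumptions, and the actual choice is confirmed at kickoff.

```python
def choose_cnv_strategy(n_matched_samples, assay, design="germline", min_cohort=100):
    """Return candidate CNV approaches for a project (illustrative rule only)."""
    assay = assay.upper()
    if assay not in {"WGS", "WES"}:
        raise ValueError(f"unknown assay: {assay}")
    if n_matched_samples < 1:
        raise ValueError("need at least one sample")
    germline_wgs = assay == "WGS" and design == "germline"
    if germline_wgs and n_matched_samples >= min_cohort:
        # Enough technically matched samples to train a cohort gCNV model.
        return ["gCNV cohort model"]
    if germline_wgs:
        # Small germline WGS projects: either option from the text above.
        return ["CNVkit", "GATK-SV single-sample mode"]
    # WES and somatic designs (assumption: CNVkit only).
    return ["CNVkit"]


print(choose_cnv_strategy(250, "WGS"))
print(choose_cnv_strategy(12, "wes"))
```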
5. Call copy-number variants
GATK gCNV infers copy-number states from cohort-trained depth models; CNVkit applies GC correction and CBS segmentation (Talevich et al., 2016); cn.MOPS models read counts across cohort samples (Klambauer et al., 2012). Per-sample CNV counts are checked before merging.
Tools and outputs
Tools used: GATK 4.6.0.0 PostprocessGermlineCNVCalls; CNVkit 0.9.11 batch / call; cn.MOPS 1.56.0 when scoped
Output: {sample}.gcnv_calls.vcf.gz; {sample}.cnvkit.cns; {sample}.cnmops_segments.bed; per-sample CNV count summary
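To make the per-sample count check concrete, the sketch below converts segment log2 ratios to integer copy numbers and tallies gains and losses per sample. The conversion (ploidy × 2^log2, rounded) is a naive textbook rule for a pure diploid germline sample, not the calling logic of CNVkit, gCNV, or cn.MOPS; the function names are hypothetical and sex chromosomes would need their own ploidy.

```python
from collections import Counter


def copy_number_from_log2(log2_ratio, ploidy=2):
    """Naive integer copy number: ploidy * 2**log2, rounded and floored at 0."""
    return max(0, round(ploidy * 2 ** log2_ratio))


def cnv_counts_per_sample(segments, ploidy=2):
    """Count gain and loss segments per sample from (sample, log2_ratio) pairs."""
    counts = {}
    for sample, log2_ratio in segments:
        per_sample = counts.setdefault(sample, Counter())
        copy_number = copy_number_from_log2(log2_ratio, ploidy)
        if copy_number > ploidy:
            per_sample["gain"] += 1
        elif copy_number < ploidy:
            per_sample["loss"] += 1
    return counts


segments = [("s1", -1.0), ("s1", 0.0), ("s1", 0.585), ("s2", 0.05)]
print(cnv_counts_per_sample(segments))
```

A sample whose gain or loss count is far from the rest of the cohort is worth reviewing before its calls enter the merge.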
6. Call structural variants from read signatures
Manta and DELLY are the default SV callers; both detect events from discordant read pairs and split reads, with read depth as supporting evidence where the caller uses it. Lumpy (via smoove) is added when scoped (Chen et al., 2016; Rausch et al., 2012; Layer et al., 2014). Per-caller VCFs retain QUAL, FILTER, and evidence tags for merge decisions.
Tools and outputs
Tools used: Manta 1.6.0; DELLY 1.2.6; smoove 0.2.8 (Lumpy 0.2.13) when scoped
Output: {sample}.manta.sv.vcf.gz; {sample}.delly.sv.vcf.gz; {sample}.lumpy.sv.vcf.gz when scoped; per-caller variant count summaries
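A per-caller variant count summary amounts to tallying the SVTYPE INFO key across records. The sketch below does this on plain VCF text lines with the standard library only; it assumes uncompressed, tab-delimited records and a hypothetical function name. In practice a dedicated VCF parser or bcftools would be used for bgzipped files.

```python
from collections import Counter


def count_sv_types(vcf_lines, pass_only=True):
    """Tally records per SVTYPE; optionally keep only FILTER == PASS or '.'."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        if pass_only and fields[6] not in ("PASS", "."):
            continue
        info = dict(
            item.split("=", 1) if "=" in item else (item, "")
            for item in fields[7].split(";")
        )
        counts[info.get("SVTYPE", "UNKNOWN")] += 1
    return counts


records = [
    "##fileformat=VCFv4.2",
    "chr1\t1000\tsv1\tN\t<DEL>\t50\tPASS\tSVTYPE=DEL;END=2000",
    "chr1\t5000\tsv2\tN\t<DUP>\t20\tLowQual\tSVTYPE=DUP;END=9000",
    "chr2\t300\tsv3\tN\t<INV>\t60\tPASS\tIMPRECISE;SVTYPE=INV;END=900",
]
print(count_sv_types(records))
```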
7. Merge and filter multi-caller evidence
CNV segments and SV calls are merged using reciprocal overlap thresholds agreed at kickoff (Gabrielaite et al., 2021). Size, quality, and bin-count filters are tuned per assay (Talevich et al., 2016).
Tools and outputs
Tools used: bcftools 1.21; bedtools 2.31.1; SURVIVOR 1.0.6 or custom merge scripts when scoped
Output: {cohort}.merged_cnv.bed; {cohort}.merged_sv.vcf.gz; cnv_sv_filter_summary.csv; caller overlap Venn data
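Reciprocal overlap is the smaller of the two fractions obtained by dividing the shared length by each call's own length, so a small call nested in a large one scores low. The sketch below shows the calculation and a simple match rule; the 0.5 default is only a placeholder for the threshold agreed at kickoff, and the tuple layout and function names are assumptions.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Smaller of (overlap / length of A) and (overlap / length of B); 0.0 if disjoint."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def calls_match(call_a, call_b, min_reciprocal=0.5):
    """Calls are (chrom, start, end, svtype); match requires same chrom and type."""
    chrom_a, start_a, end_a, type_a = call_a
    chrom_b, start_b, end_b, type_b = call_b
    if chrom_a != chrom_b or type_a != type_b:
        return False
    return reciprocal_overlap(start_a, end_a, start_b, end_b) >= min_reciprocal


manta_call = ("chr1", 10_000, 20_000, "DEL")
delly_call = ("chr1", 12_000, 21_000, "DEL")
print(calls_match(manta_call, delly_call))  # True: 8 kb shared, 0.8 of the larger call
```

Using the minimum of the two fractions is what makes the criterion reciprocal: a 100 bp call inside a 1 kb call overlaps 100% of itself but only 10% of its partner, so the pair is not merged at a 0.5 threshold.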
8. Annotate CNV/SV with genes and functional context
CNV segments and SV breakpoints are intersected with Ensembl/GENCODE gene models. gnomAD-SV allele frequencies and ClinVar overlap are added when reference databases are configured (Collins et al., 2020).
Tools and outputs
Tools used: bedtools 2.31.1; bcftools 1.21; Ensembl BioMart or custom annotation scripts
Output: cnv_gene_overlap.tsv; sv_annotation_master.tsv; annotated .vcf.gz when requested
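The gene-overlap step is an interval intersection. The sketch below is a linear-scan version in plain Python for small inputs, using half-open coordinates as in BED; the gene names and coordinates are invented for illustration, and at scale bedtools intersect or an interval tree would do the same job.

```python
def overlapping_genes(segment, genes):
    """Return sorted gene names whose half-open interval intersects the segment.

    segment: (chrom, start, end); genes: iterable of (chrom, start, end, name).
    """
    chrom, start, end = segment
    return sorted(
        name
        for gene_chrom, gene_start, gene_end, name in genes
        if gene_chrom == chrom and gene_start < end and start < gene_end
    )


# Hypothetical gene models; real runs would load Ensembl/GENCODE coordinates.
genes = [
    ("chr1", 1_000, 5_000, "GENE_A"),
    ("chr1", 8_000, 9_000, "GENE_B"),
    ("chr2", 1_000, 5_000, "GENE_C"),
]
print(overlapping_genes(("chr1", 4_500, 8_500), genes))  # ['GENE_A', 'GENE_B']
```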
9. Run CNV/SV sanity QC
Per-sample CNV burden, SV type breakdown, and size distributions are compared across the cohort. Outlier call counts or batch clustering trigger review before delivery (Gabrielaite et al., 2021). CytoScan or MLPA truth sets are overlaid when clients provide coordinates.
Tools and outputs
Tools used: custom R/Python QC scripts; bcftools 1.21 stats
Output: cnv_sv_qc_summary.csv; CNV burden and SV type bar charts; batch stratification report
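One way to flag outlier call counts is a robust z-score based on the median and the median absolute deviation (MAD). The source does not state which statistic is used, so the method, the 3.5 cutoff, and the function name below are all assumptions chosen for illustration.

```python
from statistics import median


def burden_outliers(counts, max_robust_z=3.5):
    """Return sample IDs whose call count is a robust outlier within the cohort.

    counts: dict mapping sample ID -> number of CNV or SV calls.
    """
    values = list(counts.values())
    center = median(values)
    mad = median(abs(value - center) for value in values)
    if mad == 0:
        # More than half the cohort shares one value: flag anything that differs.
        return sorted(sample for sample, value in counts.items() if value != center)
    # 0.6745 rescales the MAD so the score is comparable to a normal z-score.
    return sorted(
        sample
        for sample, value in counts.items()
        if 0.6745 * abs(value - center) / mad > max_robust_z
    )


cohort = {"s1": 100, "s2": 104, "s3": 98, "s4": 102, "s5": 400}
print(burden_outliers(cohort))  # ['s5']
```

The median and MAD are preferred here over mean and standard deviation because a single extreme sample would otherwise inflate the spread and mask itself.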
10. Aggregate QC and package deliverables
MultiQC aggregates coverage, CNV, and SV metrics across samples. Final scripts, README, Methods draft, and HTML QC report are packaged per agreed retention policy.
Tools and outputs
Tools used: MultiQC 1.25
Output: MultiQC report; final deliverable bundle with scripts, figures, and Methods draft

What Pepkio delivers

Processed data files

Per-sample CNV .bed/.vcf.gz; per-caller SV VCFs (Manta and DELLY default; Lumpy when scoped)
Merged cohort outputs when in scope
Annotation and QC tables: cnv_gene_overlap.tsv, sv_annotation_master.tsv, coverage_summary.csv, cnv_sv_qc_summary.csv, cnv_sv_filter_summary.csv

Figures (PDF/SVG)

CNV profile; segment size histogram; CNV burden bar chart
SV type breakdown; caller overlap summary; locus zoom plots when scoped

Tables

cnv_gene_overlap.tsv (chrom, start, end, CN, genes)
sv_annotation_master.tsv (SVTYPE, SVLEN, genes, gnomAD-SV AF when configured)
cnv_sv_qc_summary.csv (depth, CNV/SV counts, batch ID)

Code

Commented bash, R, and Python scripts; environment lock files
Private Git or agreed file transfer

Documentation

QC report; README; Methods draft
Post-delivery reviewer support within agreed scope

Technical decisions we make — and why

Multi-caller default: gCNV + Manta + DELLY (+ Lumpy when scoped): Gabrielaite et al. (2021) recommend combining GATK gCNV, Lumpy, DELLY, and cn.MOPS because no single CNV caller captures all variant classes reliably. Single-caller delivery is available when scoped.
gCNV cohort model vs CNVkit: GATK GermlineCNVCaller trains a cohort model when ≥100 matched germline WGS samples are available (Broad Institute, 2024). CNVkit 0.9.11 is the default for smaller cohorts, somatic designs, and WES (Talevich et al., 2016; Gabrielaite et al., 2021).
WGS vs WES expectations: WGS is the default for genome-wide CNV and SV discovery (Gabrielaite et al., 2021). WES CNV/SV is scoped with capture-kit–aware bins and documented sensitivity limits outside capture targets.
Filter stringency: project-tuned, not universal cutoffs: Filters depend on assay, cohort size, and use case. Pepkio documents thresholds per project rather than applying generic SNV cutoffs (Talevich et al., 2016; Koboldt, 2020).
Batch handling for gCNV and depth models: GATK-SV recommends separate gCNV models per batch when library protocols differ (Broad Institute, 2024). Pepkio stratifies batch in QC and trains batch-specific models when required.

Common questions

What is the minimum sequencing depth for reliable CNV and SV calling?

For germline CNV/SV from WGS BAMs, Pepkio recommends ≥30× mean autosomal depth with >95% callability (Rehm et al., 2021). WES requires ≥50–100× on-target depth (Gabrielaite et al., 2021). On caller choice, Sarwal et al. (2022) found Manta and LUMPY among the callers with the highest F-scores for deletion calling in their WGS benchmark. Thresholds are confirmed at kickoff.

Can you call CNVs from my existing WES BAMs?

Yes, when on-target depth and capture-kit metadata are documented. WES misses many off-target events that WGS detects (Gabrielaite et al., 2021). Pepkio applies capture-aware CNVkit binning when scoped and documents sensitivity limits in the Methods draft.

Can you analyze low-quality or low-coverage BAMs?

Sometimes, with caveats. Mean depth below ~20× on WGS or sparse on-target coverage on WES reduces CNV segment confidence and SV evidence counts (Rehm et al., 2021). Sub-threshold samples are flagged before calling; re-sequencing is discussed when feasible. Partial calling on agreed chromosomal regions or gene lists is possible when full-genome CNV/SV is not justified.

Do you support BAMs from Illumina, MGI DNBSEQ, Element AVITI, and Ultima UG100?

Pepkio processes pre-aligned BAMs from Illumina and MGI DNBSEQ (BGI) platforms when @RG tags and reference build are documented. Element AVITI and Ultima UG100 BAMs are accepted when scoped at kickoff. If only FASTQs are available, re-alignment with BWA-MEM2 2.2.1 is scoped separately or via the WGS/WES spokes.

How long does CNV and structural variation analysis take at Pepkio?

Single-sample CNV/SV calling from client BAMs typically completes in 2–4 weeks; multi-sample cohorts with gCNV training and merge typically require 4–8 weeks. Timelines depend on sample count, assay, and scope—confirmed at kickoff.

How do you handle batch effects in multi-batch CNV/SV cohorts?

CNV depth models are sensitive to library prep, sequencer, and batch-specific coverage biases (Collins et al., 2020; Broad Institute, 2024). Pepkio trains separate gCNV models per batch when dosage scores differ, stratifies batch in QC reports, and flags batch-clustered samples. Batch-aware CNVkit reference panels are constructed from matched normals when available.

Do you use one caller or combine Manta, DELLY, and gCNV?

Pepkio defaults to GATK gCNV (or CNVkit) plus Manta and DELLY because no single tool captures all CNV classes reliably (Gabrielaite et al., 2021). Lumpy and cn.MOPS are added when scoped. Single-caller workflows are available with documented trade-offs.

How do you handle complex structural variants?

Short-read pipelines detect many complex SVs but cannot resolve all breakpoint architectures that long-read sequencing resolves (Ebert et al., 2021; Jung et al., 2025). Pepkio reports complex SVTYPE calls when assigned; unresolved events are flagged for long-read follow-up when scoped.

Do I own the code — and in what format is it delivered?

Yes — you retain full ownership of code, scripts, and results. Pepkio delivers commented bash, R, and Python scripts with environment lock files. Jupyter or R Markdown delivery is available on request.

Can I be involved during analysis?

Yes. Checkpoint reviews occur after coverage audit, post-calling sanity checks, and before final delivery. A PhD-level scientific contact leads the project and documents decisions at each stage.

What does post-delivery reviewer support include?

Clarification of methods, merge thresholds, QC metrics, and minor figure or table revisions within agreed scope. Methods and Supplementary drafts are included for analyses we performed; substantial new reviewer requests are scoped separately.

Is co-authorship required?

No. Pepkio does not require co-authorship unless explicitly discussed. Acknowledgment of bioinformatics support is standard practice.

Related services

Whole-genome sequencing — Full FASTQ-to-BAM pipeline when alignment and raw-read QC are needed before CNV/SV calling.
Whole-exome sequencing — Capture-enriched libraries with on-target depth QC and coding-region variant discovery.
Variant calling — SNV/indel calling from the same BAMs to build a complete small-variant catalog alongside CNV/SV.
Long-read DNA sequencing — Complex, repeat-rich, and phased structural events that short-read callers cannot fully resolve.
Custom consulting — Pre-project depth, cohort size, caller strategy, and array-validation planning before sequencing or CNV/SV analysis.

References

Collins RL, Brand H, Karczewski KJ, et al. A structural variation reference for medical and population genetics. Nature. 2020;581(7809):444–451. https://doi.org/10.1038/s41586-020-2287-8 (PMID: 32461652)
Gabrielaite M, Torp MH, Rasmussen MS, et al. A comparison of tools for copy-number variation detection in germline whole exome and whole genome sequencing data. Cancers. 2021;13(24):6283. https://doi.org/10.3390/cancers13246283 (PMID: 34944901)
Mahmoud M, Gobet N, Cruz-Dávalos DI, et al. Structural variant calling: the long and the short of it. Genome Biology. 2019;20:246. https://doi.org/10.1186/s13059-019-1828-7 (PMID: 31747936)
Rehm HL, Bale SJ, Bayrak-Toydemir P, et al. Best practices for the analytical validation of clinical whole-genome sequencing intended for the diagnosis of germline disease. npj Genomic Medicine. 2021;6(1):47. https://doi.org/10.1038/s41525-020-00154-9 (PMID: 33110627)
Talevich E, Shain AH, Botton T, Bastian BC. CNVkit: genome-wide copy number detection and visualization from targeted DNA sequencing. PLoS Computational Biology. 2016;12(4):e1004873. https://doi.org/10.1371/journal.pcbi.1004873 (PMID: 27100738)
Chen X, Schulz-Trieglaff O, Shaw R, et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics. 2016;32(8):1220–1222. https://doi.org/10.1093/bioinformatics/btv710 (PMID: 26647377)
Rausch T, Zichner T, Klipp F, et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. 2012;28(18):i333–i339. https://doi.org/10.1093/bioinformatics/bts378 (PMID: 22962449)
Layer RM, Chiang C, Quinlan AR, Hall IM. LUMPY: a probabilistic framework for structural variant discovery. Genome Biology. 2014;15(6):R84. https://doi.org/10.1186/gb-2014-15-6-r84 (PMID: 24970577)
Klambauer G, Schwarzbauer K, Mayr A, et al. cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate. Nucleic Acids Research. 2012;40(9):e69. https://doi.org/10.1093/nar/gks003 (PMID: 22302147)
Sarwal V, Niehus S, Ayyala R, et al. A comprehensive benchmarking of WGS-based deletion structural variant callers. Briefings in Bioinformatics. 2022;23(5):bbac221. https://doi.org/10.1093/bib/bbac221 (PMID: 35753701)
Jung H, Yang TP, Walker S, et al. Complex de novo structural variants are an underestimated cause of rare disorders. Nature Communications. 2025;16:9528. https://doi.org/10.1038/s41467-025-64722-2 (PMID: 41184278)
Ebert P, Audano PA, Zhu Q, et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science. 2021;372(6537):eabf7117. https://doi.org/10.1126/science.abf7117 (PMID: 33632895)
Ajay SS, Parker SCJ, Abaan HO, et al. Accurate and comprehensive sequencing of personal genomes. Genome Research. 2011;21(9):1498–1505. https://doi.org/10.1101/gr.123638.111 (PMID: 21771779)
Koboldt DC. Best practices for variant calling in clinical sequencing. Genome Medicine. 2020;12:91. https://doi.org/10.1186/s13073-020-00791-w (PMID: 33106175)
Pedersen BS, Quinlan AR. Mosdepth: quick coverage calculation for genomes and exomes. Bioinformatics. 2018;34(5):867–868. https://doi.org/10.1093/bioinformatics/btx699 (PMID: 29096012)
Ewels P, Magnusson M, Lundin S, et al. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32(19):3047–3048. https://doi.org/10.1093/bioinformatics/btw354 (PMID: 27312411)
Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078–2079. https://doi.org/10.1093/bioinformatics/btp352 (PMID: 19505943)
Broad Institute. GATK-SV structural variation discovery documentation. 2024. https://gatk.broadinstitute.org/hc/en-us/articles/9022487952155-Structural-variant-SV-discovery; https://github.com/broadinstitute/gatk-sv
Jeffares DC, Jolly C, Hoti M, et al. Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nature Communications. 2017;8:14061. https://doi.org/10.1038/ncomms14061 (SURVIVOR toolset; release v1.0.6)

Let's Talk About Your Science

Tell us:

• Your biological question
• Data type and size
• Timeline constraints

We'll tell you:

• What's feasible
• How long it will take
• Exactly what it will cost
