Long-Read DNA Sequencing Analysis Service — Phased Structural Variants and SNV/Indel Calls from PacBio HiFi, Oxford Nanopore, or MGI CycloneSEQ Reads

Long-read DNA sequencing resolves structural variants (SVs), repeats, and phasing that short-read WGS misses (De Coster et al., 2021; Sedlazeck et al., 2018). Pepkio delivers version-pinned FASTQ-to-VCF analysis for academic, biotech, and pharma clients, with custom workflows scoped at kickoff. In simulated benchmarks, multiple SV callers exceeded F1 0.75 at ~20× coverage (Jiang et al., 2021). Scripts, figures, and a Methods draft are included.

Key facts

Key facts about Long-Read DNA Sequencing
Fact	Value
Supported platforms / instruments	PacBio Revio / Sequel II (HiFi CCS); Oxford Nanopore PromethION / GridION / MinION (R10.4.x chemistry when scoped); MGI CycloneSEQ-WT02 / WY01 (nanopore-class, when scoped at kickoff). Ultra-long ONT reads and native-modification basecalling scoped at kickoff
Input requirements	≥15× mean genome coverage for SV discovery; ≥30× recommended for joint small-variant + SV clinical interpretation (De Coster et al., 2021; Höps et al., 2025). PacBio HiFi mean read length ≥10 kb, median per-read accuracy 99.9% at typical CCS pass counts (Wenger et al., 2019). Population SV joint calling: typically ≥3 samples with matched library prep
Reference builds supported	Human GRCh38 primary (default); T2T-CHM13v2.0 for centromeric, telomeric, and gap-region SV projects; mouse GRCm39; custom references scoped at kickoff
Primary tools (with versions)	pbmm2 1.13.1; minimap2 2.28; Dorado 1.4.0 (ONT basecalling when scoped); NanoPlot 1.42.0; Chopper 0.9.0; samtools 1.21; bcftools 1.21; mosdepth 0.3.3; Clair3 1.0.9; DeepVariant 1.8.0; Sniffles2 2.8.0; pbsv 2.9.0; HiPhase 1.6.0; WhatsHap 2.3; hifiasm 0.23.0; Flye 2.9.5; modkit 0.4.0 (ONT methylation when scoped); Ensembl VEP 112; MultiQC 1.25
Typical turnaround time	3–5 weeks (single-sample reference-based SV + small-variant calling); 5–8 weeks (multi-sample population SV joint calling); 6–12 weeks (de novo assembly with variant annotation) — confirmed at kickoff
Deliverable formats	.bam, .vcf.gz, annotation .tsv, optional assembly .fasta/.gfa; PDF/SVG figures; HTML QC report; documented bash/R/Python scripts; Methods draft
Key cited best-practice references	De Coster et al. (2021), Nature Reviews Genetics; Sedlazeck et al. (2024), Nature Biotechnology; Logsdon et al. (2020), Nature Reviews Genetics
Custom / bespoke analysis	Non-standard inputs scoped at kickoff—BAM-only SV re-calling, trio phasing, hybrid HiFi + ONT or DNBSEQ + CycloneSEQ assembly, methylation (5mC/6mA), bacterial/metagenome modules, or custom VCF filters

What is long-read DNA sequencing?

Long-read DNA sequencing computationally aligns or assembles individual molecules spanning kilobases to megabases, enabling direct detection of structural variants (SVs), repeat expansions, and haplotype-resolved small variants that short-read alignments fragment across repeats (Sedlazeck et al., 2018; Logsdon et al., 2020). Pepkio maps PacBio HiFi, Oxford Nanopore, or MGI CycloneSEQ reads (when scoped) with pbmm2 or minimap2, calls SNVs/indels with Clair3 using platform-specific models and SVs with Sniffles2, and optionally phases variants with HiPhase or WhatsHap (De Coster et al., 2021; Sedlazeck et al., 2024; Zheng et al., 2022). PacBio HiFi reads achieve median per-read accuracy of 99.9% at typical CCS pass counts (Wenger et al., 2019). Custom deliverables beyond the standard workflow are scoped at kickoff.

When should you use long-read DNA sequencing?

Long-read DNA sequencing fits when the biological question requires SV resolution, phasing, repeat-aware variant calling, or gap closure in complex regions. The table contrasts long-read WGS with short-read WGS and long-read de novo assembly.

Comparison of long-read WGS, short-read WGS, and long-read de novo assembly
Approach	Best for	Limitations	Approximate cost range
Long-read WGS (PacBio HiFi / ONT / CycloneSEQ)	SVs, repeat expansions, phasing, pseudogene regions, gap-adjacent variant discovery	Higher per-base sequencing cost than short-read; larger storage and compute footprint	Library prep + sequencing + bioinformatics vary by depth, platform, and phasing scope
Short-read WGS	Cohort-scale SNV/indel catalogs at lower per-sample cost	SV sensitivity drops for events >50 bp and in segmental duplications (Sedlazeck et al., 2018)	Lower sequencing cost; SV and phasing add-ons limited
Long-read de novo assembly	New references, T2T goals, structural haplotypes, non-model organisms	Requires higher coverage; assembly QC (BUSCO, QV) adds scope	Highest bioinformatics scope; compute-intensive

Gapless human reference: Nurk et al. (2022) assembled a complete 3.055 Gbp human genome (T2T-CHM13) using PacBio HiFi and ONT ultra-long reads, adding nearly 200 Mbp of sequence absent from GRCh38.
Population structural variation: Reis et al. (2023) profiled SVs across four Indigenous Australian communities with ONT long reads and T2T-CHM13, revealing SV landscapes invisible to short-read catalogs.
Rare-disease diagnostic variants: Höps et al. (2025) tested 100 HiFi genomes at ~30× against 145 challenging pathogenic variants; automated callers detected 83% (120/145), with 93% total after visual review.

How the analysis works — step by step

1. Validate inputs and sample metadata
Pepkio confirms FASTQ or BAM integrity (MD5 checksums), platform, chemistry, and experimental design, recording coverage, trio relationships, and reference build in sample_manifest.csv. Sub-threshold yield or missing read groups are flagged before processing (De Coster et al., 2021).
Tools and outputs
Tools used: md5sum; samtools quickcheck; custom validation scripts
Output: sample_manifest.csv with sample IDs, platform, read counts, batch, and QC flags
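The checksum and manifest step above can be sketched in a few lines of Python. This is a minimal illustration, not Pepkio's validation script: the function names, the column names, and the MD5_MISMATCH flag label are assumptions chosen for the example.

```python
import csv
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest_row(sample_id, platform, path, expected_md5):
    """Compare the observed checksum with the client-supplied one and set a QC flag."""
    observed = md5_of_file(path)
    matches = observed == expected_md5.strip().lower()
    return {
        "sample_id": sample_id,
        "platform": platform,
        "file": Path(path).name,
        "md5": observed,
        # Flag labels are hypothetical; a real manifest may use other codes.
        "qc_flag": "PASS" if matches else "MD5_MISMATCH",
    }


def write_manifest(rows, out_path):
    """Write manifest rows to a CSV file (column names are illustrative)."""
    fields = ["sample_id", "platform", "file", "md5", "qc_flag"]
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

Calling build_manifest_row once per delivered file and passing the collected rows to write_manifest yields a table in the spirit of sample_manifest.csv. A fuller manifest would also carry read counts, batch, and trio relationships.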
2. QC raw long reads
Read-length N50, Q-score distributions, and pass-filter rates are computed per sample; truncated libraries or low full-pass yield are flagged before alignment (Wenger et al., 2019; De Coster et al., 2021).
Tools and outputs
Tools used: NanoPlot 1.42.0; pycoQC 2.5.2 (ONT); PacBio dataset reports (HiFi)
Output: read_qc_summary.csv; read-length histograms; per-sample QC flags
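Read-length N50, the headline metric in this step, is simple to compute once read lengths are known. The sketch below shows the standard definition in Python. The 10,000 bp cutoff in flag_short_library is an illustrative default, not a Pepkio threshold, and both function names are hypothetical.

```python
def read_length_n50(lengths):
    """Return the N50: the length L such that reads of length >= L
    together contain at least half of all sequenced bases."""
    if not lengths:
        raise ValueError("no read lengths supplied")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def flag_short_library(lengths, min_n50=10_000):
    """Return a QC flag for a library; min_n50 is an assumed, project-specific cutoff."""
    return "PASS" if read_length_n50(lengths) >= min_n50 else "LOW_N50"
```

In practice tools such as NanoPlot report N50 directly. A hand-rolled version like this is mainly useful for checking a reported value or for custom QC gates.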
3. Generate or verify HiFi/basecalled reads
When clients deliver raw signal (PacBio subreads, ONT POD5/FAST5, or CycloneSEQ), Pepkio runs CCS/HiFi generation, Dorado, or MGI-compatible basecalling with chemistry-appropriate models (Wenger et al., 2019; MGI Tech, 2024). Pre-delivered FASTQs are validated before filtering.
Tools and outputs
Tools used: PacBio SMRT Link 13.1 (pbccs / dataset); Dorado 1.4.0 (ONT raw signal when scoped); MGI CycloneSEQ basecaller (when scoped)
Output: Platform-normalized .fastq.gz per sample; basecall_stats.csv
4. Filter and normalize read sets
Low-quality and sub-length reads are removed with documented, project-specific thresholds (Chopper for ONT/CycloneSEQ; length filters for HiFi when needed). Read loss is reported before alignment.
Tools and outputs
Tools used: Chopper 0.9.0; Filtlong 0.2.1 (when scoped)
Output: Filtered .fastq.gz; filter_summary.csv with reads retained vs. removed
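The length and quality filter described above can be illustrated with a small pure-Python sketch. It is not a reimplementation of Chopper or Filtlong. It assumes well-formed four-line FASTQ records with Phred+33 qualities, and it averages quality through error probabilities, which is a common convention for long reads. The function names and the summary keys are assumptions.

```python
import math


def mean_read_quality(quality_string, offset=33):
    """Mean Phred quality computed from the mean error probability,
    not from the arithmetic mean of the scores."""
    if not quality_string:
        return 0.0
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def parse_fastq(lines):
    """Yield (name, sequence, quality) from well-formed 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between or after records
        seq = next(it).rstrip("\n")
        next(it)  # '+' separator line
        qual = next(it).rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def filter_reads(records, min_length, min_quality):
    """Split reads into kept and removed, and report the counts."""
    kept = []
    removed = 0
    for name, seq, qual in records:
        if len(seq) >= min_length and mean_read_quality(qual) >= min_quality:
            kept.append((name, seq, qual))
        else:
            removed += 1
    summary = {"reads_retained": len(kept), "reads_removed": removed}
    return kept, summary
```

The summary dictionary mirrors the retained-versus-removed counts reported in filter_summary.csv. Thresholds are passed in explicitly because they are project-specific.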
5. Align reads to the reference genome
HiFi reads align with pbmm2 (--preset CCS); ONT and CycloneSEQ reads align with minimap2 (-x map-ont or -x map-hifi as appropriate; Li, 2021). Mapping rate and soft-clip profiles are audited per sample.
Tools and outputs
Tools used: pbmm2 1.13.1; minimap2 2.28; samtools 1.21 sort/index
Output: Coordinate-sorted, indexed {sample}.bam and .bai; alignment_summary.csv
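As a rough sketch of how the platform-specific alignment choice could be encoded, the Python helper below builds command strings for pbmm2 or minimap2 followed by samtools. It only assembles commands and does not run them. The platform labels, the function name, and the default thread count are assumptions, and a real workflow would also set read groups and choose the minimap2 preset per chemistry.

```python
import shlex


def alignment_commands(platform, reference, reads, out_bam, threads=8):
    """Return shell-ready command strings for aligning one sample.

    platform: "hifi", "ont", or "cycloneseq" (labels are illustrative).
    """
    ref, fq, bam = (shlex.quote(p) for p in (reference, reads, out_bam))
    if platform == "hifi":
        # pbmm2 sorts its own output when --sort is given.
        align = f"pbmm2 align {ref} {fq} {bam} --preset CCS --sort -j {threads}"
    elif platform in ("ont", "cycloneseq"):
        # minimap2 emits SAM (-a); samtools sort reads it from stdin ("-").
        align = (
            f"minimap2 -a -x map-ont -t {threads} {ref} {fq}"
            f" | samtools sort -@ {threads} -o {bam} -"
        )
    else:
        raise ValueError(f"unsupported platform: {platform}")
    # Index the coordinate-sorted BAM so downstream callers can seek by region.
    return [align, f"samtools index {bam}"]
```

Returning strings rather than executing them keeps the sketch testable and makes the exact commands easy to record in a Methods draft.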
6. Assess coverage and mappability
mosdepth reports mean coverage and fraction of genome at ≥15× and ≥30×; samples below agreed thresholds are flagged before variant calling (Jiang et al., 2021; De Coster et al., 2021).
Tools and outputs
Tools used: mosdepth 0.3.3; samtools 1.21 stats
Output: coverage_summary.csv; mosdepth.global.dist.txt; coverage histogram and CDF plots
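The fraction of the genome at ≥15× and ≥30× can be read from the mosdepth cumulative distribution file. The sketch below assumes the usual three-column layout of mosdepth.global.dist.txt (chromosome or "total", depth, proportion of bases covered at or above that depth) and treats any unlisted depth as having no bases at exactly that depth. The function names and output keys are hypothetical.

```python
def parse_global_dist(text):
    """Return {depth: fraction of bases covered at >= depth} from the 'total' rows."""
    dist = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3 and parts[0] == "total":
            dist[int(parts[1])] = float(parts[2])
    return dist


def fraction_at_or_above(dist, depth):
    """Look up the cumulative fraction at a depth.

    If the exact depth is not listed, use the next higher listed depth,
    which is exact when unlisted depths contain no bases.
    """
    candidates = [d for d in dist if d >= depth]
    return dist[min(candidates)] if candidates else 0.0


def coverage_summary(text, thresholds=(15, 30)):
    """Summarize genome-wide coverage at the thresholds used in this workflow."""
    dist = parse_global_dist(text)
    return {f"frac_ge_{t}x": fraction_at_or_above(dist, t) for t in thresholds}
```

A per-sample row built this way corresponds to the pct genome ≥15×/≥30× columns described for coverage_summary.csv.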
7. Call small variants (SNVs and indels)
Clair3 calls germline SNVs and indels with platform-specific models (hifi, ont; Zheng et al., 2022); DeepVariant 1.8.0 is optional for Google pipeline parity (Poplin et al., 2018). Ti/Tv ratios and variant counts are checked per species and build.
Tools and outputs
Tools used: Clair3 1.0.9; DeepVariant 1.8.0 (optional)
Output: {sample}.small_variants.vcf.gz and .tbi; per-sample SNV/indel count summary
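The Ti/Tv check mentioned above reduces to counting transitions (A↔G, C↔T) and transversions among biallelic or multiallelic SNVs. A minimal Python sketch follows, assuming uncompressed VCF text lines. In production the same figure is available from bcftools stats, and the function name here is hypothetical.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_ratio(vcf_lines):
    """Count transitions and transversions among SNV alleles in VCF text lines.

    Indels, symbolic alleles, and '*' alleles are skipped.
    Returns (transitions, transversions, ratio); ratio is NaN if there are no transversions.
    """
    ti = tv = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        ref = fields[3].upper()
        for alt in fields[4].upper().split(","):
            is_snv = (
                len(ref) == 1 and len(alt) == 1
                and ref in "ACGT" and alt in "ACGT" and ref != alt
            )
            if not is_snv:
                continue
            if (ref, alt) in TRANSITIONS:
                ti += 1
            else:
                tv += 1
    ratio = ti / tv if tv else float("nan")
    return ti, tv, ratio
```

The expected ratio depends on species, build, and genomic region, so the value is compared against a project-specific expectation rather than a fixed cutoff.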
8. Call structural variants
Sniffles2 detects DEL, DUP, INV, INS, and BND with --tandem-repeats annotations (Sedlazeck et al., 2024). pbsv 2.9.0 is used for PacBio-native workflows when scoped (Liu et al., 2024); cohort projects use Sniffles2 .snf merge.
Tools and outputs
Tools used: Sniffles2 2.8.0; pbsv 2.9.0 (PacBio-native, when scoped)
Output: {sample}.sv.vcf.gz and .tbi; {sample}.snf for population merge; sv_count_by_type.csv
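A table like sv_count_by_type.csv can be derived by tallying the SVTYPE key in the INFO column of the SV VCF. The Python sketch below assumes uncompressed VCF text lines and a two-column svtype,count layout. Both the layout and the function names are illustrative assumptions.

```python
from collections import Counter


def sv_counts_by_type(vcf_lines):
    """Tally records per SVTYPE (for example DEL, DUP, INV, INS, BND)."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        info = line.rstrip("\n").split("\t")[7]
        svtype = "UNKNOWN"  # used when a record carries no SVTYPE key
        for entry in info.split(";"):
            if entry.startswith("SVTYPE="):
                svtype = entry.split("=", 1)[1]
                break
        counts[svtype] += 1
    return counts


def counts_to_csv(counts):
    """Render the tally as CSV text, sorted by SV type for stable output."""
    rows = ["svtype,count"]
    rows.extend(f"{svtype},{counts[svtype]}" for svtype in sorted(counts))
    return "\n".join(rows) + "\n"
```

Sorting by type keeps the output stable between runs, which makes per-sample tables easier to diff across a cohort.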
9. Phase variants when metadata supports it
When trio, Hi-C, or Strand-seq phase priors are available, Pepkio phases variants with HiPhase (PacBio) or WhatsHap (long-read BAMs and phase-input VCFs; De Coster et al., 2021). Phase block N50 is reported when phasing is in scope.
Tools and outputs
Tools used: HiPhase 1.6.0; WhatsHap 2.3
Output: Phased .vcf.gz with PS/HP tags; phasing_summary.csv; phase-block length distribution plot
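Phase block N50 can be estimated directly from the PS tags of a phased VCF. The Python sketch below assumes a single-sample VCF and defines a block's length as the span from its first to its last phased variant, which is one of several conventions in use. The function names are hypothetical.

```python
def phase_block_spans(vcf_lines):
    """Return the span (last minus first phased position, plus 1) of each phase set.

    Assumes a single-sample VCF whose FORMAT column includes GT and PS.
    """
    blocks = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
        # Skip unphased genotypes and records without a phase set.
        if "|" not in sample.get("GT", "") or sample.get("PS", ".") == ".":
            continue
        key = (chrom, sample["PS"])
        low, high = blocks.get(key, (pos, pos))
        blocks[key] = (min(low, pos), max(high, pos))
    return [high - low + 1 for low, high in blocks.values()]


def n50(values):
    """Return the N50 of a list of block spans, or 0 for an empty list."""
    total = sum(values)
    running = 0
    for value in sorted(values, reverse=True):
        running += value
        if running * 2 >= total:
            return value
    return 0
```

Applying n50 to the output of phase_block_spans gives a single figure comparable to the phase block N50 reported in phasing_summary.csv. Dedicated tools such as whatshap stats report this and related metrics.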
10. Annotate, visualize, and package deliverables
Ensembl VEP annotates small-variant consequences; SVs are overlapped with gene models and repeat annotations (McLaren et al., 2016; Yang et al., 2023). MultiQC aggregates QC metrics; scripts, README, Methods draft, and HTML QC report are packaged per agreed retention policy.
Tools and outputs
Tools used: Ensembl VEP 112; bcftools 1.21; MultiQC 1.25; R 4.4.x / Python 3.12 plotting scripts
Output: variant_annotation_master.tsv; sv_annotation_master.tsv; MultiQC report; final deliverable bundle

What Pepkio delivers

Processed data files

Coordinate-sorted .bam/.bai; small-variant and SV .vcf.gz (indexed)
variant_annotation_master.tsv; sv_annotation_master.tsv
QC tables (coverage_summary.csv, read_qc_summary.csv, alignment_summary.csv, sample_qc_summary.csv)
Optional phased VCF; optional assembly .fasta/.gfa when de novo scope is included

Figures (PDF/SVG)

Read-length histograms; mapping-rate and coverage plots
SV type and small-variant consequence bar charts
Phasing block length distribution when in scope
Locus plots for prioritized SVs when scoped

Tables

Annotated variant masters with gene consequence, impact, and clinical fields when configured
sample_qc_summary.csv with mapping rate, mean depth, pct genome ≥15×/≥30×, and variant counts

Code

Commented bash, R, and Python scripts per stage
Environment lock files; delivery via private Git repository or agreed file transfer

Documentation

HTML/PDF QC report; README; Methods draft
Post-delivery reviewer support for method clarification and minor revisions within agreed scope

Technical decisions we make — and why

Reference: GRCh38 by default; T2T-CHM13 when complex regions are the primary target. GRCh38 supports standard clinical annotation and cross-cohort comparison (De Coster et al., 2021). T2T-CHM13 improves SV and alignment confidence in centromeres, telomeres, and acrocentric short arms where GRCh38 contains gaps (Yang et al., 2023; Nurk et al., 2022).
SV caller: Sniffles2 by default; pbsv when scoped for PacBio. Sniffles2 is 11.8× faster and 29% more accurate than prior long-read SV callers across 5–50× HiFi and ONT data (Sedlazeck et al., 2024). pbsv integrates with PacBio SMRT Link when clients require PacBio-native SV signatures (Liu et al., 2024).
Small-variant caller: Clair3 by default; DeepVariant optional. Clair3 achieves competitive long-read SNV/indel accuracy with lower compute than graph-based short-read callers (Zheng et al., 2022). DeepVariant 1.8.0 is optional for Google benchmark parity (Poplin et al., 2018).
Aligner: pbmm2 for HiFi; minimap2 for ONT and CycloneSEQ. pbmm2 wraps minimap2 with PacBio-native presets (Pacific Biosciences, 2024). minimap2 2.28 with -x map-ont is standard for nanopore-class WGS, including ONT and CycloneSEQ FASTQs (Li, 2021; De Coster et al., 2021). CycloneSEQ tooling is newer than ONT R10.4.x; Pepkio confirms basecalling models and QC gates at kickoff (MGI Tech, 2024).
Coverage gate: ≥15× for SV discovery; ≥30× for joint small-variant + SV clinical interpretation. In simulated benchmarks, multiple SV callers exceeded F1 0.75 at ~20× (Jiang et al., 2021). Höps et al. (2025) validated HiFi diagnostic panels at ~30×; 90% of automatically called variants remained detectable at 15× in titration analysis.
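For readers less familiar with the F1 metric quoted in the coverage benchmarks, the sketch below shows how it is derived from true-positive, false-positive, and false-negative counts, alongside a simple coverage gate using the 15× and 30× thresholds stated above. The function names and the gate labels are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from call counts against a truth set.

    F1 is the harmonic mean of precision and recall.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def coverage_gate(mean_depth, sv_min=15.0, joint_min=30.0):
    """Classify a sample by mean depth; the labels are illustrative only."""
    if mean_depth >= joint_min:
        return "joint_small_variant_and_sv"
    if mean_depth >= sv_min:
        return "sv_discovery"
    return "below_sv_gate"
```

For example, 80 true positives with 20 false positives and 20 false negatives give precision and recall of 0.8 each, and therefore an F1 of 0.8.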

Common questions

What is the minimum coverage, read length, and sample count for long-read DNA analysis?

Pepkio recommends ≥15× mean genome coverage for SV discovery and ≥30× when joint small-variant and SV clinical interpretation is required (De Coster et al., 2021; Höps et al., 2025). PacBio HiFi libraries should yield mean read lengths ≥10 kb (Wenger et al., 2019). Population SV joint calling typically needs ≥3 samples with harmonized library prep. Exact thresholds are confirmed at kickoff.

Can you analyze low-yield or degraded DNA samples?

Yes, with limitations documented in the QC report. Low DNA input or fragmentation reduces read length N50 and SV recall in repetitive regions (De Coster et al., 2021). Pepkio flags sub-threshold yield before calling; partial analysis or targeted re-sequencing is discussed when coverage cannot support the planned variant classes.

Do you support PacBio Revio HiFi, Oxford Nanopore, or MGI CycloneSEQ data?

Yes, when the platform matches the project scope. PacBio HiFi uses pbmm2; ONT (R10.4.x when scoped) and CycloneSEQ-WT02/WY01 FASTQs use minimap2 and Sniffles2 (Li, 2021; Sedlazeck et al., 2024; MGI Tech, 2024). Raw-signal basecalling is scoped separately.

How long does long-read DNA analysis take at Pepkio?

Single-sample SV + small-variant projects typically take 3–5 weeks; population SV joint calling 5–8 weeks; de novo assembly with annotation 6–12 weeks. Timelines are confirmed at kickoff.

How do you handle batch effects in multi-sample long-read cohorts?

Harmonized alignment and caller parameters reduce run-to-run differences (De Coster et al., 2021). Pepkio stratifies instrument, flowcell, and library prep batch in QC reports; Sniffles2 .snf merge improves genotype consistency across samples (Sedlazeck et al., 2024). Batch-specific re-calling is scoped at kickoff.

Do I own the code — and in what format is it delivered?

Yes — you retain full ownership. Pepkio delivers commented bash, R, and Python scripts with environment lock files; Jupyter or R Markdown on request.

Can I be involved during analysis?

Yes. Checkpoint reviews occur after read QC, alignment, coverage assessment, and before final delivery. A PhD-level scientific contact leads the project.

What does post-delivery reviewer support include?

Method clarification and minor figure or table revisions within agreed scope. Substantial new reviewer requests are scoped separately.

Is co-authorship required?

No, unless explicitly discussed. Acknowledgment of bioinformatics support is standard practice.

Should I use GRCh38 or T2T-CHM13 as my reference genome?

GRCh38 is the default for clinical annotation and cross-study comparison (De Coster et al., 2021). T2T-CHM13 is recommended for centromeres, telomeres, acrocentric short arms, or GRCh38 gap regions (Yang et al., 2023; Nurk et al., 2022). Build choice is documented in the Methods draft.

Can you detect DNA methylation (5mC or 6mA) from ONT or PacBio reads?

Yes, when scoped at kickoff. ONT methylation from Dorado or modkit 0.4.0 and PacBio 5mC from kinetic signals are separate milestones from standard variant calling (De Coster et al., 2021).

Should I choose read-based SV calling or de novo assembly for my project?

Read-based SV calling fits when a suitable reference genome exists and mean coverage is ≥15× (Jiang et al., 2021). De novo assembly with hifiasm or Flye fits new references or T2T goals and requires higher coverage plus assembly QC (Logsdon et al., 2020; Nurk et al., 2022). Pepkio advises on the choice at kickoff.

Can you handle custom or non-standard long-read DNA analyses?

Yes. Pepkio scopes bespoke work at kickoff—BAM-only SV re-calling, trio phasing, hybrid HiFi + ONT assembly, CycloneSEQ or DNBSEQ + CycloneSEQ hybrids, bacterial/metagenome modules, or custom VCF filters (De Coster et al., 2021). Milestones and timelines are confirmed before work begins.

Related services

Whole-genome sequencing — Lower-cost short-read SNV/indel catalogs when long-read SV resolution is not required.
CNV and structural variation — Short-read CNV/SV calling when long-read data are unavailable.
Variant calling — Caller selection, filter tuning, and joint genotyping when alignment is already complete.
Long-read RNA-seq — Isoform-resolved transcriptome analysis that complements DNA SV and phasing context.
Custom consulting — Pre-sequencing depth, platform, and reference-build planning before library prep.

References

De Coster W, Weissensteiner MH, Sedlazeck FJ. Towards population-scale long-read sequencing. Nature Reviews Genetics. 2021;22(9):572–587. https://doi.org/10.1038/s41576-021-00367-3 (PMID: 34050336)
Logsdon GA, Vollger MR, Eichler EE. Long-read human genome sequencing and its applications. Nature Reviews Genetics. 2020;21(10):597–614. https://doi.org/10.1038/s41576-020-0236-x (PMID: 32504078)
Sedlazeck FJ, Rescheneder P, Smolka M, et al. Accurate detection of complex structural variations using single-molecule sequencing. Nature Methods. 2018;15(6):461–468. https://doi.org/10.1038/s41592-018-0001-7 (PMID: 29713083)
Sedlazeck FJ, Sabat S, Pasch J, et al. Detection of mosaic and population-level structural variants with Sniffles2. Nature Biotechnology. 2024;42(10):1483–1495. https://doi.org/10.1038/s41587-023-02024-y (PMID: 38168980)
Wenger AM, Peluso P, Rowell WJ, et al. Highly-accurate long-read sequencing improves variant detection and assembly of a human genome. Nature Biotechnology. 2019;37(10):1155–1162. https://doi.org/10.1038/s41587-019-0217-9 (PMID: 31406327)
Nurk S, Koren S, Rhie A, et al. The complete sequence of a human genome. Science. 2022;376(6588):44–53. https://doi.org/10.1126/science.abj6987 (PMID: 35357919)
Reis ALM, Rapadas M, Hammond JM, et al. The landscape of genomic structural variation in Indigenous Australians. Nature. 2023;624(8029):602–610. https://doi.org/10.1038/s41586-023-06842-7 (PMID: 38093003)
Höps WGA, Weiss MM, Derks RC, et al. HiFi long-read genomes for difficult-to-detect, clinically relevant variants. American Journal of Human Genetics. 2025;112(2):450–456. https://doi.org/10.1016/j.ajhg.2024.12.013 (PMID: 39809270)
Jiang T, Liu S, Cao S, et al. Long-read sequencing settings for efficient structural variation detection based on comprehensive evaluation. BMC Bioinformatics. 2021;22:552. https://doi.org/10.1186/s12859-021-04422-y (PMID: 34772337)
Yang X, Wang X, Zou Y, et al. Characterization of large-scale genomic differences in the first complete human genome. Genome Biology. 2023;24(1):157. https://doi.org/10.1186/s13059-023-02995-w (PMID: 37403156)
Liu Z, Xie Z, Li M. Comprehensive and deep evaluation of structural variation detection pipelines with third-generation sequencing data. Genome Biology. 2024;25:173. https://doi.org/10.1186/s13059-024-03324-5 (PMID: 39010145)
Zheng Z, Li S, Su J, et al. Symphonizing pileup and full-alignment for deep learning-based long-read variant calling. Nature Computational Science. 2022;2(12):797–803. https://doi.org/10.1038/s43588-022-00387-x (PMID: 38177392)
Poplin R, Chang PC, Alexander D, et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology. 2018;36(10):983–987. https://doi.org/10.1038/nbt.4235 (PMID: 30247488)
McLaren W, Gil L, Hunt SE, et al. The Ensembl Variant Effect Predictor. Genome Biology. 2016;17(1):122. https://doi.org/10.1186/s13059-016-0974-4 (PMID: 27268795)
Li H. New strategies to improve minimap2 alignment accuracy. Bioinformatics. 2021;37(23):4572–4574. https://doi.org/10.1093/bioinformatics/btab705 (PMID: 34623391)
Pacific Biosciences. pbmm2 documentation and release notes. 2024. https://github.com/PacificBiosciences/pbmm2
Oxford Nanopore Technologies. Dorado basecaller. https://github.com/nanoporetech/dorado
MGI Tech. CycloneSEQ technology. 2024. https://global-mgitech.com/technologies/cycloneseq-technology/

Let's Talk About Your Science

Tell us:

• Your biological question
• Data type and size
• Timeline constraints

We'll tell you:

• What's feasible
• How long it will take
• Exactly what it will cost
