Genomics & Variant Analysis

Long-Read DNA Sequencing Analysis Service — Phased Structural Variants and SNV/Indel Calls from PacBio HiFi, Oxford Nanopore, or MGI CycloneSEQ Reads

Long-read DNA sequencing resolves structural variants (SVs), repeats, and phasing that short-read WGS misses (De Coster et al., 2021; Sedlazeck et al., 2018). Pepkio delivers version-pinned FASTQ-to-VCF analysis with custom workflows scoped at kickoff for academic, biotech, and pharma clients. In simulated benchmarks, multiple SV callers exceeded F1 0.75 at ~20× coverage (Jiang et al., 2021). Scripts, figures, and a Methods draft are included.

Key facts

Key facts about Long-Read DNA Sequencing
Fact | Value
Supported platforms / instruments | PacBio Revio / Sequel II (HiFi CCS); Oxford Nanopore PromethION / GridION / MinION (R10.4.x chemistry when scoped); MGI CycloneSEQ-WT02 / WY01 (nanopore-class, when scoped at kickoff). Ultra-long ONT reads and native-modification basecalling scoped at kickoff
Input requirements | ≥15× mean genome coverage for SV discovery; ≥30× recommended for joint small-variant + SV clinical interpretation (De Coster et al., 2021; Höps et al., 2025). PacBio HiFi mean read length ≥10 kb, median per-read accuracy 99.9% at typical CCS pass counts (Wenger et al., 2019). Population SV joint calling: typically ≥3 samples with matched library prep
Reference builds supported | Human GRCh38 primary (default); T2T-CHM13v2.0 for centromeric, telomeric, and gap-region SV projects; mouse GRCm39; custom references scoped at kickoff
Primary tools (with versions) | pbmm2 1.13.1; minimap2 2.28; Dorado 1.4.0 (ONT basecalling when scoped); NanoPlot 1.42.0; Chopper 0.9.0; samtools 1.21; bcftools 1.21; mosdepth 0.3.3; Clair3 1.0.9; DeepVariant 1.8.0; Sniffles2 2.8.0; pbsv 2.9.0; HiPhase 1.6.0; WhatsHap 2.3; hifiasm 0.23.0; Flye 2.9.5; modkit 0.4.0 (ONT methylation when scoped); Ensembl VEP 112; MultiQC 1.25
Typical turnaround time | 3–5 weeks (single-sample reference-based SV + small-variant calling); 5–8 weeks (multi-sample population SV joint calling); 6–12 weeks (de novo assembly with variant annotation); confirmed at kickoff
Deliverable formats | .bam, .vcf.gz, annotation .tsv, optional assembly .fasta/.gfa; PDF/SVG figures; HTML QC report; documented bash/R/Python scripts; Methods draft
Key cited best-practice references | De Coster et al. (2021), Nature Reviews Genetics; Sedlazeck et al. (2024), Nature Biotechnology; Logsdon et al. (2020), Nature Reviews Genetics
Custom / bespoke analysis | Non-standard inputs scoped at kickoff: BAM-only SV re-calling, trio phasing, hybrid HiFi + ONT or DNBSEQ + CycloneSEQ assembly, methylation (5mC/6mA), bacterial/metagenome modules, or custom VCF filters

What is long-read DNA sequencing?

Long-read DNA sequencing reads individual molecules spanning kilobases to megabases; aligning or assembling those reads enables direct detection of structural variants (SVs), repeat expansions, and haplotype-resolved small variants that short-read alignments fragment across repeats (Sedlazeck et al., 2018; Logsdon et al., 2020). Pepkio maps PacBio HiFi, Oxford Nanopore, or MGI CycloneSEQ reads (when scoped) with pbmm2 or minimap2, calls SNVs/indels with Clair3 using platform-specific models and SVs with Sniffles2, and optionally phases variants with HiPhase or WhatsHap (De Coster et al., 2021; Sedlazeck et al., 2024; Zheng et al., 2022). PacBio HiFi reads achieve median per-read accuracy of 99.9% at typical CCS pass counts (Wenger et al., 2019). Custom deliverables beyond the standard workflow are scoped at kickoff.

When should you use long-read DNA sequencing?

Long-read DNA sequencing fits when the biological question requires SV resolution, phasing, repeat-aware variant calling, or gap closure in complex regions. The table contrasts long-read WGS with short-read WGS and long-read de novo assembly.

Comparison of long-read WGS, short-read WGS, and long-read de novo assembly
Approach | Best for | Limitations | Approximate cost range
Long-read WGS (PacBio HiFi / ONT / CycloneSEQ) | SVs, repeat expansions, phasing, pseudogene regions, gap-adjacent variant discovery | Higher per-base sequencing cost than short-read; larger storage and compute footprint | Library prep + sequencing + bioinformatics vary by depth, platform, and phasing scope
Short-read WGS | Cohort-scale SNV/indel catalogs at lower per-sample cost | SV sensitivity drops for events >50 bp and in segmental duplications (Sedlazeck et al., 2018) | Lower sequencing cost; SV and phasing add-ons limited
Long-read de novo assembly | New references, T2T goals, structural haplotypes, non-model organisms | Requires higher coverage; assembly QC (BUSCO, QV) adds scope | Highest bioinformatics scope; compute-intensive

  • Gapless human reference: Nurk et al. (2022) assembled a complete 3.055 Gbp human genome (T2T-CHM13) using PacBio HiFi and ONT ultra-long reads, adding nearly 200 Mbp of sequence absent from GRCh38.
  • Population structural variation: Reis et al. (2023) profiled SVs across four Indigenous Australian communities with ONT long reads and T2T-CHM13, revealing SV landscapes invisible to short-read catalogs.
  • Rare-disease diagnostic variants: Höps et al. (2025) tested 100 HiFi genomes at ~30× against 145 challenging pathogenic variants; automated callers detected 83% (120/145), with 93% total after visual review.

How the analysis works — step by step

  1. Validate inputs and sample metadata

    Pepkio confirms FASTQ or BAM integrity (MD5 checksums), platform, chemistry, and experimental design, recording coverage, trio relationships, and reference build in sample_manifest.csv. Sub-threshold yield or missing read groups are flagged before processing (De Coster et al., 2021).

    Tools and outputs

    Tools used: md5sum; samtools quickcheck; custom validation scripts

    Output: sample_manifest.csv with sample IDs, platform, read counts, batch, and QC flags
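As an illustration of the integrity check in this step, the following is a minimal sketch of MD5 verification in Python. It is not Pepkio's validation script; the function names and the "ok" / "mismatch" / "missing" status labels are hypothetical, and the expected digests are assumed to come from a client-supplied checksum list.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(expected: dict, data_dir: Path) -> dict:
    """Compare expected MD5 digests (file name -> hex digest) with files on disk.

    Returns a hypothetical status label per file: "ok", "mismatch", or "missing".
    """
    status = {}
    for name, expected_md5 in expected.items():
        path = Path(data_dir) / name
        if not path.is_file():
            status[name] = "missing"
        elif md5_of_file(path) == expected_md5.lower():
            status[name] = "ok"
        else:
            status[name] = "mismatch"
    return status
```

Files flagged "mismatch" or "missing" would be candidates for re-transfer before any processing starts.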

  2. QC raw long reads

    Read-length N50, Q-score distributions, and pass-filter rates are computed per sample; truncated libraries or low full-pass yield are flagged before alignment (Wenger et al., 2019; De Coster et al., 2021).

    Tools and outputs

    Tools used: NanoPlot 1.42.0; pycoQC 2.5.2 (ONT); PacBio dataset reports (HiFi)

    Output: read_qc_summary.csv; read-length histograms; per-sample QC flags
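The read-length N50 reported in this step can be sketched in a few lines of Python. This is an illustrative definition (the length L such that reads of length L or longer contain at least half of all sequenced bases), not the NanoPlot implementation; the function name is hypothetical.

```python
def read_length_n50(lengths):
    """Return the read-length N50 for an iterable of read lengths.

    Reads are sorted longest first; the N50 is the length of the read at
    which the running base total first reaches half of all bases.
    """
    lengths = sorted(lengths, reverse=True)
    if not lengths:
        return 0
    total_bases = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total_bases:
            return length
    return lengths[-1]
```

For example, `read_length_n50([2, 3, 4, 5, 6])` returns 5, because the two longest reads (6 + 5 = 11 bases) already cover more than half of the 20 bases in total.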

  3. Generate or verify HiFi/basecalled reads

    When clients deliver raw signal (PacBio subreads, ONT POD5/FAST5, or CycloneSEQ), Pepkio runs CCS/HiFi generation, Dorado, or MGI-compatible basecalling with chemistry-appropriate models (Wenger et al., 2019; MGI Tech, 2024). Pre-delivered FASTQs are validated before filtering.

    Tools and outputs

    Tools used: PacBio SMRT Link 13.1 (pbccs / dataset); Dorado 1.4.0 (ONT raw signal when scoped); MGI CycloneSEQ basecaller (when scoped)

    Output: Platform-normalized .fastq.gz per sample; basecall_stats.csv

  4. Filter and normalize read sets

    Low-quality and sub-length reads are removed with documented, project-specific thresholds (Chopper for ONT/CycloneSEQ; length filters for HiFi when needed). Read loss is reported before alignment.

    Tools and outputs

    Tools used: Chopper 0.9.0; Filtlong 0.2.1 (when scoped)

    Output: Filtered .fastq.gz; filter_summary.csv with reads retained vs. removed
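To make the retained-versus-removed accounting concrete, here is a minimal Python sketch of length and mean-quality filtering over in-memory FASTQ records. It is not how Chopper or Filtlong work internally; the record layout (name, sequence, quality string), the Phred+33 encoding, and the summary keys are assumptions chosen for illustration, and thresholds are project-specific.

```python
import math


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean read quality, averaged in error-probability space (Phred+33 assumed)."""
    if not quality:
        return 0.0
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality]
    return -10 * math.log10(sum(probs) / len(probs))


def filter_reads(records, min_length, min_mean_q):
    """Keep reads meeting both thresholds and report how many were removed.

    records: iterable of (name, sequence, quality_string) tuples.
    Returns (kept_records, summary_dict); the summary keys are hypothetical.
    """
    records = list(records)
    kept = [
        (name, seq, qual)
        for name, seq, qual in records
        if len(seq) >= min_length and mean_phred(qual) >= min_mean_q
    ]
    summary = {
        "reads_in": len(records),
        "reads_retained": len(kept),
        "reads_removed": len(records) - len(kept),
    }
    return kept, summary
```

Averaging in error-probability space rather than averaging Phred scores directly means a few very low-quality bases pull the mean down, which is usually the intended behavior for a read-level filter.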

  5. Align reads to the reference genome

    HiFi reads align with pbmm2 (--preset CCS); ONT and CycloneSEQ reads align with minimap2 (-x map-ont or -x map-hifi as appropriate; Li, 2021). Mapping rate and soft-clip profiles are audited per sample.

    Tools and outputs

    Tools used: pbmm2 1.13.1; minimap2 2.28; samtools 1.21 sort/index

    Output: Coordinate-sorted, indexed {sample}.bam and .bai; alignment_summary.csv
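The platform-to-aligner choice described above can be sketched as a small Python helper that only builds command lines and does not run them. The presets `--preset CCS` and `-x map-ont` come from the step description; the remaining options (`-a`, `-t`, `--sort`, `-j`, `samtools sort -o`) are assumptions about a typical invocation, and the function name and platform labels are hypothetical.

```python
def build_alignment_commands(platform, reference, reads, out_bam, threads=8):
    """Return the alignment stages for one sample as a list of argv lists.

    For nanopore-class reads the two stages are meant to be piped:
    minimap2 writes SAM to stdout and samtools sort reads it from stdin ("-").
    """
    threads = str(threads)
    if platform == "hifi":
        # pbmm2 can sort its own output, so one stage is enough.
        return [[
            "pbmm2", "align", reference, reads, out_bam,
            "--preset", "CCS", "--sort", "-j", threads,
        ]]
    if platform in {"ont", "cycloneseq"}:
        return [
            ["minimap2", "-a", "-x", "map-ont", "-t", threads, reference, reads],
            ["samtools", "sort", "-@", threads, "-o", out_bam, "-"],
        ]
    raise ValueError(f"unsupported platform: {platform!r}")
```

Keeping command construction separate from execution makes the exact parameters easy to log alongside alignment_summary.csv and to reproduce later.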

  6. Assess coverage and mappability

    mosdepth reports mean coverage and fraction of genome at ≥15× and ≥30×; samples below agreed thresholds are flagged before variant calling (Jiang et al., 2021; De Coster et al., 2021).

    Tools and outputs

    Tools used: mosdepth 0.3.3; samtools 1.21 stats

    Output: coverage_summary.csv; mosdepth.global.dist.txt; coverage histogram and CDF plots
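The fraction of the genome at ≥15× and ≥30× can be read from the mosdepth distribution file. The sketch below assumes the three-column, tab-separated layout of mosdepth.global.dist.txt (region name or "total", depth, cumulative proportion of bases at or above that depth); the function name is hypothetical and the text is passed in as a string to keep the example self-contained.

```python
def fraction_at_or_above(dist_text: str, depth: int, region: str = "total") -> float:
    """Return the proportion of bases covered at >= depth for one region.

    Returns 0.0 when the requested depth is not listed, which is assumed to
    mean that no base reached that depth.
    """
    for line in dist_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        name, level, proportion = fields
        if name == region and int(level) == depth:
            return float(proportion)
    return 0.0
```

A sample whose `fraction_at_or_above(text, 15)` falls below the agreed threshold would be flagged before variant calling.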

  7. Call small variants (SNVs and indels)

    Clair3 calls germline SNVs and indels with platform-specific models (hifi, ont; Zheng et al., 2022); DeepVariant 1.8.0 is optional for Google pipeline parity (Poplin et al., 2018). Ti/Tv ratios and variant counts are checked per species and build.

    Tools and outputs

    Tools used: Clair3 1.0.9; DeepVariant 1.8.0 (optional)

    Output: {sample}.small_variants.vcf.gz and .tbi; per-sample SNV/indel count summary
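The Ti/Tv check mentioned in this step can be illustrated with a short Python sketch. It counts transitions (A<->G, C<->T) and transversions among single-base substitutions and ignores indels and multi-base alleles; the function name and the choice to return None when no transversions are present are assumptions for this example.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_ratio(variants):
    """Compute the transition/transversion ratio from (ref, alt) allele pairs.

    Only single-base substitutions are counted. Returns None if there are
    no transversions, since the ratio is undefined in that case.
    """
    transitions = 0
    transversions = 0
    for ref, alt in variants:
        ref, alt = ref.upper(), alt.upper()
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue  # skip indels, MNVs, and non-variant records
        if (ref, alt) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        return None
    return transitions / transversions
```

A ratio far from the value expected for the species and build is a cue to review the caller model or filtering before annotation.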

  8. Call structural variants

    Sniffles2 detects DEL, DUP, INV, INS, and BND with --tandem-repeats annotations (Sedlazeck et al., 2024). pbsv 2.9.0 is used for PacBio-native workflows when scoped (Liu et al., 2024); cohort projects use Sniffles2 .snf merge.

    Tools and outputs

    Tools used: Sniffles2 2.8.0; pbsv 2.9.0 (PacBio-native, when scoped)

    Output: {sample}.sv.vcf.gz and .tbi; {sample}.snf for population merge; sv_count_by_type.csv
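A table like sv_count_by_type.csv can be derived by tallying the SVTYPE key in the INFO column of the SV VCF. The following Python sketch works on uncompressed VCF text and is only an illustration; the function name and the "UNKNOWN" label for records without SVTYPE are hypothetical.

```python
from collections import Counter


def count_sv_types(vcf_text: str) -> Counter:
    """Count SV records per SVTYPE value found in the INFO column."""
    counts = Counter()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip header and blank lines
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        # INFO is the 8th column: semicolon-separated KEY=VALUE pairs or flags.
        info = dict(
            item.split("=", 1) for item in fields[7].split(";") if "=" in item
        )
        counts[info.get("SVTYPE", "UNKNOWN")] += 1
    return counts
```

For bgzip-compressed deliverables, the same function could be fed text read through Python's `gzip` module, which can read bgzip files.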

  9. Phase variants when metadata supports it

    When trio, Hi-C, or Strand-seq phase priors are available, Pepkio phases variants with HiPhase (PacBio) or WhatsHap (long-read BAMs and phase-input VCFs; De Coster et al., 2021). Phase block N50 is reported when phasing is in scope.

    Tools and outputs

    Tools used: HiPhase 1.6.0; WhatsHap 2.3

    Output: Phased .vcf.gz with PS/HP tags; phasing_summary.csv; phase-block length distribution plot
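Phase block N50 can be approximated from the PS tags in a phased VCF. This Python sketch assumes phased variants have already been extracted as (chromosome, position, PS) tuples and defines a block's length as the span from its first to its last phased variant; phasing tools may define block length differently, and the function names are hypothetical.

```python
def phase_block_spans(variants):
    """Return the span in bp of each phase block.

    variants: iterable of (chrom, pos, ps) for phased variants. Blocks are
    keyed by (chrom, ps) because PS values are only unique per chromosome.
    """
    bounds = {}
    for chrom, pos, ps in variants:
        key = (chrom, ps)
        low, high = bounds.get(key, (pos, pos))
        bounds[key] = (min(low, pos), max(high, pos))
    return [high - low + 1 for low, high in bounds.values()]


def block_n50(spans):
    """N50 of block spans: the span at which the running total reaches half."""
    spans = sorted(spans, reverse=True)
    total = sum(spans)
    running = 0
    for span in spans:
        running += span
        if running * 2 >= total:
            return span
    return 0
```

The list returned by `phase_block_spans` is also the input one would plot for the phase-block length distribution.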

  10. Annotate, visualize, and package deliverables

    Ensembl VEP annotates small-variant consequences; SVs are overlapped with gene models and repeat annotations (McLaren et al., 2016; Yang et al., 2023). MultiQC aggregates QC metrics; scripts, README, Methods draft, and HTML QC report are packaged per agreed retention policy.

    Tools and outputs

    Tools used: Ensembl VEP 112; bcftools 1.21; MultiQC 1.25; R 4.4.x / Python 3.12 plotting scripts

    Output: variant_annotation_master.tsv; sv_annotation_master.tsv; MultiQC report; final deliverable bundle

What Pepkio delivers

Processed data files

  • Coordinate-sorted .bam/.bai; small-variant and SV .vcf.gz (indexed)
  • variant_annotation_master.tsv; sv_annotation_master.tsv
  • QC tables (coverage_summary.csv, read_qc_summary.csv, alignment_summary.csv, sample_qc_summary.csv)
  • Optional phased VCF; optional assembly .fasta/.gfa when de novo scope is included

Figures (PDF/SVG)

  • Read-length histograms; mapping-rate and coverage plots
  • SV type and small-variant consequence bar charts
  • Phasing block length distribution when in scope
  • Locus plots for prioritized SVs when scoped

Tables

  • Annotated variant masters with gene consequence, impact, and clinical fields when configured
  • sample_qc_summary.csv with mapping rate, mean depth, pct genome ≥15×/≥30×, and variant counts

Code

  • Commented bash, R, and Python scripts per stage
  • Environment lock files; delivery via private Git repository or agreed file transfer

Documentation

  • HTML/PDF QC report; README; Methods draft
  • Post-delivery reviewer support for method clarification and minor revisions within agreed scope

Technical decisions we make — and why

Reference: GRCh38 default; T2T-CHM13 when complex regions are the primary target
GRCh38 supports standard clinical annotation and cross-cohort comparison (De Coster et al., 2021). T2T-CHM13 improves SV and alignment confidence in centromeres, telomeres, and acrocentric short arms where GRCh38 contains gaps (Yang et al., 2023; Nurk et al., 2022).

SV caller: Sniffles2 default; pbsv when scoped for PacBio
Sniffles2 is 11.8× faster and 29% more accurate than prior long-read SV callers across 5–50× HiFi and ONT data (Sedlazeck et al., 2024). pbsv integrates with PacBio SMRT Link when clients require PacBio-native SV signatures (Liu et al., 2024).

Small-variant caller: Clair3 default; DeepVariant optional
Clair3 achieves competitive long-read SNV/indel accuracy with lower compute than graph-based short-read callers (Zheng et al., 2022). DeepVariant 1.8.0 is optional for Google benchmark parity (Poplin et al., 2018).

Aligner: pbmm2 for HiFi; minimap2 for ONT and CycloneSEQ
pbmm2 wraps minimap2 with PacBio-native presets (Pacific Biosciences, 2024). minimap2 2.28 with -x map-ont is standard for nanopore-class WGS, including ONT and CycloneSEQ FASTQs (Li, 2021; De Coster et al., 2021). CycloneSEQ tooling is newer than ONT R10.4.x; Pepkio confirms basecalling models and QC gates at kickoff (MGI Tech, 2024).

Coverage gate: ≥15× for SV discovery; ≥30× for joint small-variant + SV clinical interpretation
In simulated benchmarks, multiple SV callers exceeded F1 0.75 at ~20× (Jiang et al., 2021). Höps et al. (2025) validated HiFi diagnostic panels at ~30×; 90% of automatically called variants remained detectable at 15× in titration analysis.
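
The coverage gate can be expressed as a simple rule. The sketch below uses the ≥15× and ≥30× thresholds stated above; the function name and the returned labels are hypothetical, and project-specific thresholds agreed at kickoff would replace the defaults.

```python
SV_MIN_DEPTH = 15.0      # minimum mean coverage for SV discovery
JOINT_MIN_DEPTH = 30.0   # minimum for joint small-variant + SV interpretation


def coverage_gate(mean_depth: float) -> str:
    """Map a sample's mean genome coverage to the analysis tier it supports."""
    if mean_depth >= JOINT_MIN_DEPTH:
        return "joint_small_variant_and_sv"
    if mean_depth >= SV_MIN_DEPTH:
        return "sv_discovery"
    return "below_threshold"
```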

Common questions

What is the minimum coverage, read length, and sample count for long-read DNA analysis?

Pepkio recommends ≥15× mean genome coverage for SV discovery and ≥30× when joint small-variant and SV clinical interpretation is required (De Coster et al., 2021; Höps et al., 2025). PacBio HiFi libraries should yield mean read lengths ≥10 kb (Wenger et al., 2019). Population SV joint calling typically needs ≥3 samples with harmonized library prep. Exact thresholds are confirmed at kickoff.
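
As a rough planning aid, mean coverage is total sequenced bases divided by genome size, so the sequencing yield needed for a target coverage can be estimated as follows. This is a back-of-envelope sketch: the genome size uses the 3.055 Gbp T2T-CHM13 figure cited on this page as an approximation, the function names are hypothetical, and the estimate ignores reads lost to filtering or failing to map.

```python
import math

HUMAN_GENOME_BP = 3_055_000_000  # approximate, from the T2T-CHM13 assembly size


def required_yield_bp(target_coverage, genome_size_bp=HUMAN_GENOME_BP):
    """Total bases needed to reach the target mean coverage."""
    return target_coverage * genome_size_bp


def reads_needed(target_coverage, mean_read_length_bp, genome_size_bp=HUMAN_GENOME_BP):
    """Approximate number of reads needed at a given mean read length."""
    return math.ceil(
        required_yield_bp(target_coverage, genome_size_bp) / mean_read_length_bp
    )
```

Under these assumptions, 15× coverage of a human genome corresponds to about 45.8 Gbp of sequence, or roughly 4.6 million reads at a 10 kb mean read length; 30× doubles both figures.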

Can you analyze low-yield or degraded DNA samples?

Yes, with limitations documented in the QC report. Low DNA input or fragmentation reduces read length N50 and SV recall in repetitive regions (De Coster et al., 2021). Pepkio flags sub-threshold yield before calling; partial analysis or targeted re-sequencing is discussed when coverage cannot support the planned variant classes.

Do you support PacBio Revio HiFi, Oxford Nanopore, or MGI CycloneSEQ data?

Yes, when the platform matches the project scope. PacBio HiFi uses pbmm2; ONT (R10.4.x when scoped) and CycloneSEQ-WT02/WY01 FASTQs use minimap2 and Sniffles2 (Li, 2021; Sedlazeck et al., 2024; MGI Tech, 2024). Raw-signal basecalling is scoped separately.

How long does long-read DNA analysis take at Pepkio?

Single-sample SV + small-variant projects typically take 3–5 weeks; population SV joint calling 5–8 weeks; de novo assembly with annotation 6–12 weeks. Timelines are confirmed at kickoff.

How do you handle batch effects in multi-sample long-read cohorts?

Harmonized alignment and caller parameters reduce run-to-run differences (De Coster et al., 2021). Pepkio stratifies instrument, flowcell, and library prep batch in QC reports; Sniffles2 .snf merge improves genotype consistency across samples (Sedlazeck et al., 2024). Batch-specific re-calling is scoped at kickoff.

Do I own the code — and in what format is it delivered?

Yes, you retain full ownership. Pepkio delivers commented bash, R, and Python scripts with environment lock files; Jupyter or R Markdown notebooks are available on request.

Can I be involved during analysis?

Yes. Checkpoint reviews occur after read QC, alignment, coverage assessment, and before final delivery. A PhD-level scientific contact leads the project.

What does post-delivery reviewer support include?

Method clarification and minor figure or table revisions within agreed scope. Substantial new reviewer requests are scoped separately.

Is co-authorship required?

No, unless explicitly discussed. Acknowledgment of bioinformatics support is standard practice.

Should I use GRCh38 or T2T-CHM13 as my reference genome?

GRCh38 is the default for clinical annotation and cross-study comparison (De Coster et al., 2021). T2T-CHM13 is recommended for centromeres, telomeres, acrocentric short arms, or GRCh38 gap regions (Yang et al., 2023; Nurk et al., 2022). Build choice is documented in the Methods draft.

Can you detect DNA methylation (5mC or 6mA) from ONT or PacBio reads?

Yes, when scoped at kickoff. ONT methylation from Dorado or modkit 0.4.0 and PacBio 5mC from kinetic signals are separate milestones from standard variant calling (De Coster et al., 2021).

Should I choose read-based SV calling or de novo assembly for my project?

Read-based SV calling fits when a suitable reference genome exists and coverage is ≥15× (Jiang et al., 2021). De novo assembly with hifiasm or Flye fits new references or T2T goals and requires higher coverage plus assembly QC (Logsdon et al., 2020; Nurk et al., 2022). Pepkio advises at kickoff.

Can you handle custom or non-standard long-read DNA analyses?

Yes. Pepkio scopes bespoke work at kickoff—BAM-only SV re-calling, trio phasing, hybrid HiFi + ONT assembly, CycloneSEQ or DNBSEQ + CycloneSEQ hybrids, bacterial/metagenome modules, or custom VCF filters (De Coster et al., 2021). Milestones and timelines are confirmed before work begins.

Related services

  • Whole-genome sequencing: Lower-cost short-read SNV/indel catalogs when long-read SV resolution is not required.
  • CNV and structural variation: Short-read CNV/SV calling when long-read data are unavailable.
  • Variant calling: Caller selection, filter tuning, and joint genotyping when alignment is already complete.
  • Long-read RNA-seq: Isoform-resolved transcriptome analysis that complements DNA SV and phasing context.
  • Custom consulting: Pre-sequencing depth, platform, and reference-build planning before library prep.

References
  1. De Coster W, Weissensteiner MH, Sedlazeck FJ. Towards population-scale long-read sequencing. Nature Reviews Genetics. 2021;22(9):572–587. https://doi.org/10.1038/s41576-021-00367-3 (PMID: 34050336)
  2. Logsdon GA, Vollger MR, Eichler EE. Long-read human genome sequencing and its applications. Nature Reviews Genetics. 2020;21(10):597–614. https://doi.org/10.1038/s41576-020-0236-x (PMID: 32504078)
  3. Sedlazeck FJ, Rescheneder P, Smolka M, et al. Accurate detection of complex structural variations using single-molecule sequencing. Nature Methods. 2018;15(6):461–468. https://doi.org/10.1038/s41592-018-0001-7 (PMID: 29713083)
  4. Sedlazeck FJ, Sabat S, Pasch J, et al. Detection of mosaic and population-level structural variants with Sniffles2. Nature Biotechnology. 2024;42(10):1483–1495. https://doi.org/10.1038/s41587-023-02024-y (PMID: 38168980)
  5. Wenger AM, Peluso P, Rowell WJ, et al. Highly-accurate long-read sequencing improves variant detection and assembly of a human genome. Nature Biotechnology. 2019;37(10):1155–1162. https://doi.org/10.1038/s41587-019-0217-9 (PMID: 31406327)
  6. Nurk S, Koren S, Rhie A, et al. The complete sequence of a human genome. Science. 2022;376(6588):44–53. https://doi.org/10.1126/science.abj6987 (PMID: 35357919)
  7. Reis ALM, Rapadas M, Hammond JM, et al. The landscape of genomic structural variation in Indigenous Australians. Nature. 2023;624(8029):602–610. https://doi.org/10.1038/s41586-023-06842-7 (PMID: 38093003)
  8. Höps WGA, Weiss MM, Derks RC, et al. HiFi long-read genomes for difficult-to-detect, clinically relevant variants. American Journal of Human Genetics. 2025;112(2):450–456. https://doi.org/10.1016/j.ajhg.2024.12.013 (PMID: 39809270)
  9. Jiang T, Liu S, Cao S, et al. Long-read sequencing settings for efficient structural variation detection based on comprehensive evaluation. BMC Bioinformatics. 2021;22:552. https://doi.org/10.1186/s12859-021-04422-y (PMID: 34772337)
  10. Yang X, Wang X, Zou Y, et al. Characterization of large-scale genomic differences in the first complete human genome. Genome Biology. 2023;24(1):157. https://doi.org/10.1186/s13059-023-02995-w (PMID: 37403156)
  11. Liu Z, Xie Z, Li M. Comprehensive and deep evaluation of structural variation detection pipelines with third-generation sequencing data. Genome Biology. 2024;25:173. https://doi.org/10.1186/s13059-024-03324-5 (PMID: 39010145)
  12. Zheng Z, Li S, Su J, et al. Symphonizing pileup and full-alignment for deep learning-based long-read variant calling. Nature Computational Science. 2022;2(12):797–803. https://doi.org/10.1038/s43588-022-00387-x (PMID: 38177392)
  13. Poplin R, Chang PC, Alexander D, et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology. 2018;36(10):983–987. https://doi.org/10.1038/nbt.4235 (PMID: 30247488)
  14. McLaren W, Gil L, Hunt SE, et al. The Ensembl Variant Effect Predictor. Genome Biology. 2016;17(1):122. https://doi.org/10.1186/s13059-016-0974-4 (PMID: 27268795)
  15. Li H. New strategies to improve minimap2 alignment accuracy. Bioinformatics. 2021;37(23):4572–4574. https://doi.org/10.1093/bioinformatics/btab705 (PMID: 34623391)
  16. Pacific Biosciences. pbmm2 documentation and release notes. 2024. https://github.com/PacificBiosciences/pbmm2
  17. Oxford Nanopore Technologies. Dorado basecaller. https://github.com/nanoporetech/dorado
  18. MGI Tech. CycloneSEQ technology. 2024. https://global-mgitech.com/technologies/cycloneseq-technology/

Let's Talk About Your Science

Tell us:

  • Your biological question
  • Data type and size
  • Timeline constraints

We'll tell you:

  • What's feasible
  • How long it will take
  • Exactly what it will cost

Contact us to start with a free consultation. Need everyday bench calculators? Try our free lab tools.