Bioinformatics analysis service

Genomics Analysis Services — Version-Pinned SNV, Indel, CNV, and SV Analysis from Short- and Long-Read DNA

Genomics analysis turns DNA sequencing reads into annotated SNV and indel calls following GATK Best Practices (Van der Auwera et al., 2013), with optional CNV, SV, and long-read modules scoped at kickoff. Pepkio provides a genomics analysis service for academic, biotech, and pharma teams: version-pinned pipelines, documented scripts, publication-grade figures, and a Methods draft, with custom inputs, outputs, and non-standard workflows agreed at kickoff.

Key facts

Key facts about genomics & variant analysis
Fact | Value
Data types supported | Illumina short-read paired-end FASTQ or BAM; Element Biosciences AVITI and Ultima UG100 when scoped at kickoff; PacBio and Oxford Nanopore long-read FASTQ for the long-read spoke; pre-aligned BAM or gVCF for variant-calling-only projects
Reference builds or standards used | Human GRCh38 primary with GATK resource bundle 4.6 (Broad Institute, 2024); GRCh38 alternate loci per functional-equivalence standards (Regier et al., 2018); legacy GRCh37/hg19; mouse GRCm39; custom references scoped at kickoff
Primary tools (with versions) | BWA-MEM2 2.2.1; GATK 4.6.0.0; Picard 3.2.0; samtools/bcftools 1.21; mosdepth 0.3.3; Ensembl VEP 112; Manta 1.6.0; DELLY 1.2.6; minimap2 2.28; Clair3 1.0.9; fastp 0.23.4; FastQC 0.12.1; MultiQC 1.25; DeepVariant 1.8.0 optional
Typical turnaround range | 2–4 weeks (single-sample germline); 4–8 weeks (multi-sample cohort with joint genotyping and VQSR); both confirmed at kickoff
Deliverable formats | .bam, .g.vcf.gz, filtered .vcf.gz, annotation .tsv, CNV/SV call tables; PDF/SVG figures; HTML QC report; commented bash/R/Python scripts; Methods draft
Regulatory/reproducibility standards followed | GATK Best Practices (Van der Auwera et al., 2013); functional-equivalence alignment standards (Regier et al., 2018); MD5 checksum validation; version-pinned software with documented parameters; clinical validation guidance where applicable (Rehm et al., 2021; Koboldt, 2020)

What is genomics?

Genomics is the study of an organism's complete DNA sequence and the computational analysis required to detect variation against a reference genome. A single-nucleotide variant (SNV) is a one-base change; an indel is a short insertion or deletion; a copy-number variant (CNV) is a gain or loss of genomic segments; and a structural variant (SV) is a larger rearrangement such as an inversion or translocation. The core biological question is: which DNA sequence changes explain disease risk, drug response, or evolutionary divergence? The NIH All of Us Research Program released 245,350+ whole-genome sequences to researchers by 2023, with planned releases exceeding 400,000 WGS (NIH All of Us, 2025).
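
The SNV and indel definitions above map onto the REF and ALT alleles of a variant record. The sketch below is a minimal illustration that classifies one REF/ALT pair by allele length alone; the function name `classify_small_variant` and its labels are hypothetical, and CNVs and SVs generally need depth or breakpoint evidence that this check does not attempt.

```python
def classify_small_variant(ref: str, alt: str) -> str:
    """Classify one REF/ALT allele pair by length alone (illustrative labels)."""
    # Symbolic alleles such as <DEL> and breakend notation describe larger
    # events; they are set aside here rather than interpreted.
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return "symbolic_or_breakend"
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"            # one-base change
    if len(ref) == len(alt):
        return "MNV"            # several adjacent bases substituted
    # Unequal allele lengths mean bases were inserted or deleted.
    return "insertion" if len(alt) > len(ref) else "deletion"


if __name__ == "__main__":
    for ref, alt in [("A", "G"), ("AT", "A"), ("A", "ATT"), ("N", "<DEL>")]:
        print(ref, alt, classify_small_variant(ref, alt))
```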

What genomics analysis can answer

Genomics analysis identifies inherited and somatic DNA changes that explain disease mechanism, treatment response, and population genetic architecture—from rare undiagnosed disorders to cancer driver catalogs and cohort-scale association studies (Rehm et al., 2021; Kinnersley et al., 2024). The examples below pair specific research questions with published results.

  • Which pathogenic variants explain an undiagnosed rare disease when coding panels and exome sequencing are inconclusive? Clinical WGS validation guidelines emphasize detecting noncoding variants and structural changes that WES may miss (Rehm et al., 2021).
  • Which somatic driver mutations are actionable across a large cancer cohort? Kinnersley et al. (2024) analyzed 10,478 cancer WGS profiles across 35 tumor types, identifying 330 candidate driver genes; approximately 55% of patients harbor at least one clinically relevant mutation.
  • Does germline TP53 loss of heterozygosity precede clinical cancer diagnosis in predisposed individuals? Light et al. (2023) reported near-ubiquitous early TP53 LOH with gain of the mutant allele years before tumor diagnosis in Li-Fraumeni syndrome WGS data.
  • Which CNVs and SVs underlie developmental disorders when SNV/indel panels are negative? Gabrielaite et al. (2021) benchmarked 11 CNV callers on matched WES and WGS data and suggested combining GATK gCNV, Lumpy, DELLY, and cn.MOPS because no single caller captures all CNV classes reliably.
  • How can population cohorts share harmonized variant calls for association studies? Regier et al. (2018) defined functional-equivalence standards—BWA-MEM alignment, GRCh38 with alternate loci, CRAM compression—that reduce cross-center call discordance.

Services included in this category

Genomics & Variant Analysis services offered by Pepkio
Service | Description | Primary tools
Whole-genome sequencing | Genome-wide germline SNV and indel discovery from raw FASTQs with callable-region QC | BWA-MEM2, GATK HaplotypeCaller, Ensembl VEP
Whole-exome sequencing | Targeted coding-region variant calling from capture-enriched libraries at lower sequencing cost than WGS | BWA-MEM2, GATK, VEP; capture-aware depth QC
Variant calling | SNV/indel calling, joint genotyping, and filter tuning when alignment is already complete | GATK HaplotypeCaller, DeepVariant, bcftools
CNV and structural variation | Genome-wide copy-number and structural variant detection from WGS or WES alignments | GATK gCNV, Manta, DELLY; CNVkit when scoped
Long-read DNA sequencing | Phased variants and structural events from PacBio or Oxford Nanopore reads | minimap2, Clair3; long-read SV callers when scoped

What Pepkio delivers

Every genomics project delivers filtered variant files, coverage QC tables, annotated figures, version-pinned scripts, and a Methods draft—plus bespoke outputs when scoped at kickoff. Standard deliverables include:

Variant files

  • Indexed .bam
  • .g.vcf.gz
  • Filtered .vcf.gz
  • variant_annotation_master.tsv with consequence and population/clinical fields when configured

QC tables and figures

  • Coverage summaries, mapping and depth metrics, FastQC/MultiQC reports
  • Coverage distributions, Ti/Tv and consequence plots
  • VQSR tranche curves when applicable
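
The Ti/Tv plot listed above summarizes the ratio of transitions (purine to purine or pyrimidine to pyrimidine) to transversions among SNV calls. A minimal sketch of that ratio follows, assuming SNVs are available as (REF, ALT) pairs; the helper names are hypothetical and this is not the plotting code delivered with a project.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for A<->G or C<->T substitutions."""
    pair = {ref.upper(), alt.upper()}
    return pair == PURINES or pair == PYRIMIDINES


def titv_ratio(snvs):
    """Return transitions / transversions, or None if there are no transversions."""
    transitions = transversions = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        # Skip anything that is not a single-base substitution between A, C, G, T.
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue
        if ref not in "ACGT" or alt not in "ACGT":
            continue
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        return None
    return transitions / transversions


if __name__ == "__main__":
    calls = [("A", "G"), ("C", "T"), ("A", "C"), ("AT", "A")]
    print(titv_ratio(calls))
```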

CNV/SV outputs when scoped

  • Segment tables and SV VCF with breakpoint evidence

Code and documentation

  • Commented bash/R/Python scripts with lock files
  • README and Methods draft
  • Post-delivery reviewer clarification for analyses we performed

How the analysis works — step by step

  1. Validate inputs and sample metadata

    Verify FASTQ or BAM integrity (MD5 checksums), read layout, read groups, and experimental design; record sample_manifest.csv and flag sub-threshold yield.

    Tools used: md5sum; custom validation scripts

  2. QC and trim raw reads

    FastQC 0.12.1 and fastp 0.23.4 assess and trim reads; flag low Q30 or high-adapter libraries before alignment.

  3. Align reads to the reference genome

    BWA-MEM2 2.2.1 aligns short reads to GRCh38, and minimap2 2.28 aligns long reads on the long-read spoke; samtools 1.21 indexes the coordinate-sorted BAMs.

  4. Mark duplicates and recalibrate base qualities

    Picard 3.2.0 MarkDuplicates flags duplicate reads; GATK 4.6.0.0 BaseRecalibrator and ApplyBQSR then recalibrate base qualities using the GATK resource bundle.

  5. Assess coverage and callability

    mosdepth 0.3.3 and Picard CollectWgsMetrics or CollectHsMetrics report mean coverage and breadth at agreed thresholds (Rehm et al., 2021).

  6. Call SNVs and indels

    GATK 4.6.0.0 HaplotypeCaller runs in gVCF mode; cohorts add GenomicsDBImport and GenotypeGVCFs for joint genotyping; DeepVariant 1.8.0 is an optional caller (Poplin et al., 2018).

  7. Call CNVs and structural variants when scoped

    GATK gCNV, Manta 1.6.0, and DELLY 1.2.6 call CNVs and SVs, with CNVkit added when scoped; Gabrielaite et al. (2021) recommend combining multiple CNV methodologies.

  8. Annotate, filter, and package deliverables

    Ensembl VEP 112 annotation; GATK VQSR or hard filters (Broad Institute, 2024); bcftools 1.21; MultiQC 1.25; final scripts, README, Methods draft, and figures.
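
The alignment, duplicate-marking, recalibration, and calling steps above can be sketched as an ordered list of command lines. This is a rough illustration rather than the delivered pipeline: the function name, file names, thread count, and read-group fields are hypothetical, and running MarkDuplicates through the `gatk` wrapper (instead of a standalone Picard 3.2.0 jar) is an assumption made to keep the sketch short. The commands are only built, not executed.

```python
def germline_commands(sample, ref, fq1, fq2, known_sites, outdir, threads=8):
    """Build (step name, argv) pairs for one short-read germline sample."""
    sorted_bam = f"{outdir}/{sample}.sorted.bam"
    dedup_bam = f"{outdir}/{sample}.dedup.bam"
    metrics = f"{outdir}/{sample}.dup_metrics.txt"
    recal_table = f"{outdir}/{sample}.recal.table"
    recal_bam = f"{outdir}/{sample}.recal.bam"
    gvcf = f"{outdir}/{sample}.g.vcf.gz"
    # bwa-mem2 expects a literal backslash-t between read-group fields.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return [
        # "align" writes SAM to stdout, which would be piped into "sort".
        ("align", ["bwa-mem2", "mem", "-t", str(threads), "-R", read_group,
                   ref, fq1, fq2]),
        ("sort", ["samtools", "sort", "-@", str(threads), "-o", sorted_bam, "-"]),
        ("mark_duplicates", ["gatk", "MarkDuplicates", "-I", sorted_bam,
                             "-O", dedup_bam, "-M", metrics]),
        ("index", ["samtools", "index", dedup_bam]),
        ("bqsr_table", ["gatk", "BaseRecalibrator", "-R", ref, "-I", dedup_bam,
                        "--known-sites", known_sites, "-O", recal_table]),
        ("apply_bqsr", ["gatk", "ApplyBQSR", "-R", ref, "-I", dedup_bam,
                        "--bqsr-recal-file", recal_table, "-O", recal_bam]),
        ("call_gvcf", ["gatk", "HaplotypeCaller", "-R", ref, "-I", recal_bam,
                       "-O", gvcf, "-ERC", "GVCF"]),
    ]


if __name__ == "__main__":
    for name, argv in germline_commands(
        "S1", "GRCh38.fa", "S1_R1.fastq.gz", "S1_R2.fastq.gz",
        "known_sites.vcf.gz", "results",
    ):
        print(f"{name}: {' '.join(argv)}")
```

Keeping each step as an argument list rather than a shell string makes the exact parameters easy to log next to the pinned tool versions.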

Tools and standards we use

Core stack for short-read and long-read genomics projects; versions pinned at kickoff and cited in the Methods draft.

Genomics & Variant Analysis tools and standards
Tool | Version | Role | Primary citation
BWA-MEM2 | 2.2.1 | Short-read alignment to GRCh38 | Vasimuddin et al., 2019, https://doi.org/10.1109/IPDPS.2019.00041
GATK | 4.6.0.0 | Duplicate marking, BQSR, variant calling, VQSR | Van der Auwera et al., 2013, https://doi.org/10.1002/0471250953.bi1110s43
Picard | 3.2.0 | MarkDuplicates, coverage metrics | Van der Auwera et al., 2013, https://doi.org/10.1002/0471250953.bi1110s43
samtools / bcftools | 1.21 | BAM indexing, VCF manipulation | Li et al., 2009, https://doi.org/10.1093/bioinformatics/btp352
mosdepth | 0.3.3 | Fast coverage depth and breadth | Pedersen & Quinlan, 2018, https://doi.org/10.1101/517435
Ensembl VEP | 112 | Variant consequence annotation | McLaren et al., 2016, https://doi.org/10.1186/s13059-016-0974-4
DeepVariant | 1.8.0 | Neural-network SNV/indel caller (optional) | Poplin et al., 2018, https://doi.org/10.1038/nbt.4235
Manta | 1.6.0 | SV discovery from paired reads | Chen et al., 2016, https://doi.org/10.1093/bioinformatics/btv710
DELLY | 1.2.6 | SV and indel calling | Rausch et al., 2012, https://doi.org/10.1093/bioinformatics/bts378
minimap2 | 2.28 | Long-read alignment | Li, 2018, https://doi.org/10.1093/bioinformatics/bty191
Clair3 | 1.0.9 | Long-read SNV/indel calling | Zheng et al., 2022, https://doi.org/10.1038/s41467-022-32121-6
MultiQC | 1.25 | Aggregated QC reporting | Ewels et al., 2016, https://doi.org/10.1093/bioinformatics/btw354

Common challenges — and how we handle them

Researchers outsourcing genomics analysis struggle with pipeline reproducibility, caller discordance, unreliable CNV/SV detection, storage costs, and undocumented filters (Pan et al., 2022; Gabrielaite et al., 2021; Koboldt, 2020).

Pipeline choice affects reproducibility more than sequencing platform.
Pan et al. (2022) tested 56 aligner–caller combinations; pipelines had a larger impact on inherited-variant reproducibility than platform or library prep. Pepkio pins versions and documents parameters.
Single pipelines can yield discordant inherited calls.
Pan et al. (2022) found SNV reproducibility varies across pipelines; complementary filters or callers reduce discordance. Pepkio applies documented GATK or DeepVariant filters and discusses multi-caller strategies when needed.
CNV/SV detection varies by assay and tool.
Gabrielaite et al. (2021) found WGS outperforms WES and suggested combining GATK gCNV, Lumpy, DELLY, and cn.MOPS. Pepkio scopes multi-tool workflows at kickoff.
Storage and transfer costs grow at cohort scale.
Regier et al. (2018) showed CRAM reduces a 30× WGS BAM from ~54 GB to ~17 GB (>3-fold). Pepkio delivers CRAM or compressed VCF where agreed.
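
The ">3-fold" figure follows directly from the two per-genome sizes quoted above, and the same arithmetic scales to a cohort. A minimal sketch, using those two values as the only inputs; the cohort size of 100 is an arbitrary illustration, and sizes are kept in whatever unit the per-genome figures use.

```python
def compression_fold(bam_size: float, cram_size: float) -> float:
    """How many times smaller the CRAM is than the BAM."""
    return bam_size / cram_size


def cohort_saving(n_samples: int, bam_size: float, cram_size: float) -> float:
    """Total storage saved across a cohort, in the same unit as the inputs."""
    return n_samples * (bam_size - cram_size)


if __name__ == "__main__":
    bam, cram = 54.0, 17.0  # approximate per-genome sizes quoted in the text
    print(f"fold reduction: {compression_fold(bam, cram):.2f}")
    print(f"saving for 100 genomes: {cohort_saving(100, bam, cram):.0f}")
```
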
Filters are hard to reproduce without documentation.
Koboldt (2020) emphasizes rigorous filtering and benchmarking. Pepkio exports VCF FILTER labels and drafts Methods with exact versions.
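
One way to check exported FILTER labels is to tally them from the seventh column of the VCF, where `PASS`, `.`, or a semicolon-separated list of failed filters is recorded. The sketch below assumes uncompressed, tab-delimited VCF lines; the function name and the example labels `LowQD` and `HighFS` are hypothetical and are not Pepkio's filter names.

```python
from collections import Counter


def count_filter_labels(vcf_lines):
    """Tally FILTER column labels across VCF data lines."""
    counts = Counter()
    for line in vcf_lines:
        # Header and blank lines carry no FILTER value.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # malformed line; skipped in this sketch
        # A record failing several filters lists them separated by ';'.
        for label in fields[6].split(";"):
            counts[label] += 1
    return counts


if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t100\t.\tA\tG\t50\tPASS\t.",
        "chr1\t200\t.\tC\tT\t12\tLowQD;HighFS\t.",
        "chr1\t300\t.\tG\tA\t9\tLowQD\t.",
    ]
    for label, n in sorted(count_filter_labels(example).items()):
        print(label, n)
```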

Common questions

What data do I need to provide for a genomics analysis project?

Paired-end FASTQ or coordinate-sorted BAMs, sample metadata, and target reference build. Clinical WGS guidelines recommend mean coverage and callability metrics agreed at kickoff (Rehm et al., 2021)—often ~30× depth with >95% callability. MD5 checksums are requested at transfer; non-standard inputs are scoped at kickoff.
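
The MD5 check requested at transfer can be reproduced on the receiving side before any analysis starts. A minimal sketch, assuming an `md5sum`-style manifest (one `<hex digest>  <file name>` pair per line) stored next to the files; the function names and the manifest layout are illustrative assumptions, not Pepkio's validation scripts.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Stream a file through MD5 so large FASTQ/BAM files are not loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path):
    """Return {file name: True/False} for each entry in an md5sum-style manifest."""
    manifest = Path(manifest_path)
    results = {}
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        # md5sum marks binary-mode entries with a leading '*'.
        name = name.strip().lstrip("*")
        results[name] = md5_of_file(manifest.parent / name) == expected.lower()
    return results


if __name__ == "__main__":
    import sys

    for name, ok in verify_manifest(sys.argv[1]).items():
        print("OK  " if ok else "FAIL", name)
```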

How long does genomics analysis take at Pepkio?

Single-sample germline projects typically take 2–4 weeks; multi-sample cohorts with joint genotyping and VQSR typically 4–8 weeks. CNV/SV or long-read add-ons extend timelines; milestones are confirmed at kickoff.

What do the deliverables look like?

Filtered VCFs, annotation tables, coverage QC summaries, PDF/SVG figures, MultiQC HTML, commented scripts with lock files, and a Methods draft. Bespoke formats—plink exports, IGV sessions, custom gene lists—are agreed at kickoff.

Can you handle my specific sequencing platform or instrument?

Illumina NovaSeq X, 6000, NextSeq 2000, and HiSeq use BWA-MEM2 + GATK. Element AVITI and Ultima UG100 are processed when scoped at kickoff. PacBio Revio and Oxford Nanopore route to the long-read spoke (minimap2, Clair3).

What if my data quality is poor or depth is below recommended thresholds?

Sub-threshold libraries are flagged before calling. Low Q30, high adapter content, or depth below agreed thresholds reduce callable fraction (Rehm et al., 2021). Pepkio discusses re-sequencing versus stricter filters before proceeding.
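
The effect of shallow depth can be made concrete with a simplified summary: mean depth plus the fraction of positions at or above a minimum depth. This is only a proxy for callability, which in practice also reflects mapping and base quality; the function name and the `min_depth=15` default are illustrative assumptions, since actual thresholds are agreed at kickoff.

```python
def coverage_summary(depths, min_depth=15):
    """Return (mean depth, fraction of positions with depth >= min_depth)."""
    n = len(depths)
    if n == 0:
        raise ValueError("no positions supplied")
    mean_depth = sum(depths) / n
    covered = sum(1 for d in depths if d >= min_depth)
    return mean_depth, covered / n


if __name__ == "__main__":
    per_base_depth = [30, 30, 10, 0]  # toy values, not real data
    mean_depth, fraction = coverage_summary(per_base_depth)
    print(f"mean depth {mean_depth:.1f}, fraction >= 15x: {fraction:.2f}")
```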

Do you provide the analysis code, and do I own it?

Yes—you retain full ownership. Pepkio delivers commented bash, R, and Python scripts with lock files via private Git or agreed transfer; notebooks on request.

Can I be involved during the analysis?

Yes—checkpoint reviews after alignment QC, coverage assessment, and before delivery. A PhD-level bioinformatician is your primary contact.

What happens if a journal reviewer requests changes after delivery?

Clarification of methods, QC thresholds, and minor figure or table revisions for analyses we performed are included. Substantial new reviewer-requested analyses are scoped separately.

Should I use GRCh38 or hg19 for my genomics analysis?

Pepkio defaults to GRCh38 with GATK bundle 4.6 (Broad Institute, 2024) and alternate loci (Regier et al., 2018). GRCh37/hg19 is supported on request for legacy cohorts.
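
When a pre-aligned BAM arrives without a stated build, the chromosome 1 length in its SAM header is a quick hint: 249,250,621 bp in GRCh37/hg19 and 248,956,422 bp in GRCh38. The sketch below assumes the header has already been read as text lines (for example from `samtools view -H`); the function name is hypothetical, and a full check would compare every contig, not just one.

```python
CHR1_LENGTHS = {
    249_250_621: "GRCh37/hg19",
    248_956_422: "GRCh38",
}


def guess_human_build(header_lines):
    """Guess the human reference build from the chr1 @SQ line of a SAM header."""
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        # @SQ fields are TAG:VALUE pairs separated by tabs.
        tags = dict(
            field.split(":", 1)
            for field in line.rstrip("\n").split("\t")[1:]
            if ":" in field
        )
        if tags.get("SN") in ("1", "chr1") and "LN" in tags:
            return CHR1_LENGTHS.get(int(tags["LN"]), "unknown")
    return "unknown"


if __name__ == "__main__":
    header = ["@HD\tVN:1.6\tSO:coordinate", "@SQ\tSN:chr1\tLN:248956422"]
    print(guess_human_build(header))
```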

Can Pepkio handle custom or non-standard genomics analyses?

Yes—client BAMs, custom references, plink exports, somatic extensions, and workflows outside the five spokes are scoped at kickoff with documented deliverables.

Related services

  • Transcriptomics: Link germline or somatic variants to gene expression for eQTL mapping, allele-specific expression, and multi-omics interpretation.
  • Custom consulting: Pre-sequencing depth, cohort size, reference-build, and capture-kit planning before library prep.
  • Whole-genome sequencing: Detailed FASTQ-to-VCF workflow when genome-wide coverage is your primary need.
  • CNV and structural variation: Dedicated CNV/SV calling when SNV/indel analysis is already complete or insufficient for your biological question.

References

  1. Van der Auwera GA, Carneiro MO, Hartl C, et al. From FastQ data to high-confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Current Protocols in Bioinformatics. 2013;43(1):11.10.1–11.10.33. https://doi.org/10.1002/0471250953.bi1110s43 (PMID: 25431634)
  2. Regier AA, Farjoun Y, Larson DE, et al. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. Nature Communications. 2018;9:4038. https://doi.org/10.1038/s41467-018-06159-4 (PMID: 30279509)
  3. Rehm HL, Bale SJ, Bayrak-Toydemir P, et al. Best practices for the analytical validation of clinical whole-genome sequencing intended for the diagnosis of germline disease. npj Genomic Medicine. 2021;6(1):47. https://doi.org/10.1038/s41525-020-00154-9 (PMID: 33110627)
  4. Pan B, Kusko R, Xiao W, et al. Assessing reproducibility of inherited variants detected with short-read whole genome sequencing. Genome Biology. 2022;23:2. https://doi.org/10.1186/s13059-021-02569-8 (PMID: 34980216)
  5. Koboldt DC. Best practices for variant calling in clinical sequencing. Genome Medicine. 2020;12:91. https://doi.org/10.1186/s13073-020-00791-w (PMID: 33106175)
  6. Kinnersley B, Sud A, Everall A, et al. Analysis of 10,478 cancer genomes identifies candidate driver genes and opportunities for precision oncology. Nature Genetics. 2024;56(9):1868–1877. https://doi.org/10.1038/s41588-024-01785-9 (PMID: 38890488)
  7. Light N, Layeghifard M, Attery A, et al. Germline TP53 mutations undergo copy number gain years prior to tumor diagnosis. Nature Communications. 2023;14:77. https://doi.org/10.1038/s41467-022-35727-y (PMID: 36604421)
  8. Gabrielaite M, Torp MH, Rasmussen MS, et al. A comparison of tools for copy-number variation detection in germline whole exome and whole genome sequencing data. Cancers. 2021;13(24):6283. https://doi.org/10.3390/cancers13246283 (PMID: 34944901)
  9. McLaren W, Gil L, Hunt SE, et al. The Ensembl Variant Effect Predictor. Genome Biology. 2016;17(1):122. https://doi.org/10.1186/s13059-016-0974-4 (PMID: 27268795)
  10. Poplin R, Chang PC, Alexander D, et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology. 2018;36(10):983–987. https://doi.org/10.1038/nbt.4235 (PMID: 30247488)
  11. Vasimuddin M, Misra S, Li H, Aluru S. Efficient architecture-aware acceleration of BWA-MEM for multicore systems. IEEE IPDPS. 2019. https://doi.org/10.1109/IPDPS.2019.00041
  12. Ewels P, Magnusson M, Lundin S, et al. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32(19):3047–3048. https://doi.org/10.1093/bioinformatics/btw354 (PMID: 27312411)
  13. NIH All of Us Research Program. Program update — Denny J. 2025. https://dpcpsi.nih.gov/sites/default/files/2025-01/Day-1-215PM-All-of-Us-Program-Update-Denny-v3-508.pdf
  14. Broad Institute. GATK 4.6.0.0 release notes and VQSR documentation. 2024. https://github.com/broadinstitute/gatk/releases/tag/4.6.0.0; https://gatk.broadinstitute.org/hc/en-us/articles/360035531612-Variant-Quality-Score-Recalibration-VQSR

Let's Talk About Your Science

Tell us:

  • Your biological question
  • Data type and size
  • Timeline constraints

We'll tell you:

  • What's feasible
  • How long it will take
  • Exactly what it will cost

Contact us to start with a free consultation. Need everyday bench calculators? Try our free lab tools.