1. Validate inputs and sample metadata
Pepkio confirms FASTQ or BAM integrity, platform, chemistry, and metadata. Read depth, RIN, replicates, and contrast design are recorded in sample_manifest.csv. Sub-threshold yield or truncated reads are flagged before alignment (Pacific Biosciences, 2024; Amarasinghe et al., 2020).
Tools and outputs
Tools used: fastqc / fastp as needed; samtools quickcheck for BAM inputs
Output: sample_manifest.csv with library IDs, platform, read counts, and QC flags
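The flagging logic behind the manifest can be sketched as follows. This is a minimal illustration, not Pepkio's actual implementation: the thresholds (MIN_READS, MIN_RIN), the flag names, the column subset, and the two sample records are all hypothetical and would be set per platform and project.

```python
import csv
import io

# Hypothetical thresholds; real cutoffs depend on platform and project scope.
MIN_READS = 1_000_000
MIN_RIN = 7.0
KNOWN_PLATFORMS = {"ONT", "PacBio"}


def qc_flags(sample):
    """Return a list of QC flag strings for one sample record (a dict)."""
    flags = []
    if sample["read_count"] < MIN_READS:
        flags.append("LOW_YIELD")
    if sample["rin"] is not None and sample["rin"] < MIN_RIN:
        flags.append("LOW_RIN")
    if sample["platform"] not in KNOWN_PLATFORMS:
        flags.append("UNKNOWN_PLATFORM")
    return flags


def write_manifest(samples, handle):
    """Write manifest rows with a semicolon-joined qc_flags column."""
    fields = ["library_id", "platform", "read_count", "rin", "qc_flags"]
    writer = csv.DictWriter(handle, fieldnames=fields)
    writer.writeheader()
    for sample in samples:
        row = {key: sample[key] for key in fields[:-1]}
        row["qc_flags"] = ";".join(qc_flags(sample)) or "PASS"
        writer.writerow(row)


# Made-up records used only to exercise the functions.
samples = [
    {"library_id": "LIB001", "platform": "ONT", "read_count": 2_500_000, "rin": 8.4},
    {"library_id": "LIB002", "platform": "PacBio", "read_count": 400_000, "rin": 6.1},
]
buf = io.StringIO()
write_manifest(samples, buf)
manifest_text = buf.getvalue()
```

Writing to a file handle rather than a fixed path keeps the sketch easy to test; in practice the handle would point at sample_manifest.csv.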
2. QC raw long reads
Read-length distributions, pass-filter rates, and full-length fractions are computed per sample. ONT reports include N50 read length and median quality; PacBio HiFi reports include mean read length and CCS pass rates. Truncation that inflates isoform catalogs is flagged (Amarasinghe et al., 2020; Pardo-Palacios et al., 2024).
Tools and outputs
Tools used: NanoPlot 1.42.0; pycoQC 2.5.2 (ONT); PacBio dataset reports (HiFi)
Output: read_qc_summary.csv; read-length histograms; per-sample QC flags
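For reference, the read-length N50 reported in this step can be computed as sketched below. The function names and the summary fields are illustrative assumptions; NanoPlot and pycoQC produce these metrics themselves, and this block only shows the definition.

```python
from statistics import median


def read_n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def summarize_reads(lengths, mean_quals):
    """Per-sample summary from read lengths and per-read mean quality scores."""
    return {
        "n_reads": len(lengths),
        "n50": read_n50(lengths),
        "median_length": median(lengths) if lengths else 0,
        "median_quality": median(mean_quals) if mean_quals else 0.0,
    }


# Toy input: eight reads with their lengths and mean qualities.
summary = summarize_reads(
    [2, 2, 2, 3, 3, 4, 8, 8],
    [10, 12, 11, 9, 14, 13, 15, 12],
)
```

In the toy input the two longest reads (8 + 8 = 16 bases) already account for half of the 32 total bases, so the N50 is 8 even though the median length is 3.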
3. Process PacBio subreads to HiFi reads (when applicable)
When clients deliver PacBio subreads or CCS BAMs, Pepkio runs the Iso-Seq workflow: CCS generation from subreads where needed, demultiplexing, refinement (poly-A tail trimming and concatemer removal), and clustering (Pacific Biosciences, 2024). Barcode crosstalk and low CCS yield are documented before alignment.
Tools and outputs
Tools used: isoseq3 (SMRT Link 13.1)
Output: Demultiplexed HiFi .fastq.gz per sample; isoseq_stats.csv
4. Align reads to the reference genome
Reads are mapped in splice-aware mode with minimap2: the splice:hq preset (-ax splice:hq) for PacBio HiFi and the splice preset with k-mer size 14 (-ax splice -k14) for ONT. Annotated GENCODE splice junctions are supplied in BED format via --junc-bed (Li, 2018; Prjibelski et al., 2023). Mapping rates, chimeric fractions, and primary vs. secondary alignments are audited per sample.
Tools and outputs
Tools used: minimap2 2.31; samtools 1.21
Output: Coordinate-sorted, indexed .bam per sample; alignment_summary.csv
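The per-sample alignment audit can be sketched from SAM flag bits alone. The bit values (0x4 unmapped, 0x100 secondary, 0x800 supplementary) come from the SAM specification; everything else is an assumption made for illustration, in particular the function name, the (read_name, flag) input shape, and the definition of chimeric fraction as reads with at least one supplementary alignment divided by mapped reads.

```python
from collections import Counter

# Flag bits defined in the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def audit_alignments(records):
    """Summarize (read_name, flag) pairs taken from SAM/BAM records."""
    counts = Counter()
    all_reads, mapped_reads, chimeric_reads = set(), set(), set()
    for name, flag in records:
        all_reads.add(name)
        if flag & UNMAPPED:
            counts["unmapped"] += 1
        elif flag & SUPPLEMENTARY:
            counts["supplementary"] += 1
            chimeric_reads.add(name)
        elif flag & SECONDARY:
            counts["secondary"] += 1
        else:
            counts["primary"] += 1
            mapped_reads.add(name)
    n_reads, n_mapped = len(all_reads), len(mapped_reads)
    return {
        "reads": n_reads,
        "mapping_rate": n_mapped / n_reads if n_reads else 0.0,
        # Assumed definition: reads with a supplementary record / mapped reads.
        "chimeric_fraction": len(chimeric_reads) / n_mapped if n_mapped else 0.0,
        "primary": counts["primary"],
        "secondary": counts["secondary"],
        "supplementary": counts["supplementary"],
    }


# Toy records: r1 has a supplementary hit, r2 a secondary hit, r3 is unmapped.
records = [("r1", 0), ("r1", 2048), ("r2", 16), ("r2", 256), ("r3", 4), ("r4", 0)]
stats = audit_alignments(records)
```

In a real run the pairs would be read from the sorted BAM (for example with samtools view or a BAM library), and the resulting dictionary would become one row of alignment_summary.csv.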
5. Reconstruct and quantify transcript models
IsoQuant runs in reference-guided mode with GENCODE, extending the reference with sample-specific isoforms (Prjibelski et al., 2023). Gene-, isoform-, exon-, and intron-level counts are generated; saturation is compared to platform depth targets (Pacific Biosciences, 2024; Chen et al., 2025).
Tools and outputs
Tools used: IsoQuant 3.13.0
Output: extended_annotation.gtf; isoform_counts.tsv; gene_counts.tsv; saturation curves
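One way to build the saturation curves mentioned above is nested subsampling of read-to-isoform assignments. The sketch below is an assumption about how such a curve could be produced, not IsoQuant functionality: the function name, the depth fractions, the min_reads support cutoff, and the toy assignments are all hypothetical.

```python
import random
from collections import Counter


def saturation_curve(read_assignments, fractions=(0.25, 0.5, 0.75, 1.0),
                     min_reads=2, seed=0):
    """Count isoforms with >= min_reads supporting reads at each depth.

    Reads are shuffled once and each depth takes a prefix of the shuffled
    list, so the subsamples are nested and the curve cannot decrease.
    """
    reads = list(read_assignments)
    random.Random(seed).shuffle(reads)
    curve = []
    for frac in fractions:
        n = int(len(reads) * frac)
        support = Counter(reads[:n])
        detected = sum(1 for c in support.values() if c >= min_reads)
        curve.append((n, detected))
    return curve


# Toy data: one isoform label per assigned read.
assignments = ["tx1"] * 50 + ["tx2"] * 20 + ["tx3"] * 5 + ["tx4"] * 1
curve = saturation_curve(assignments)
```

A curve that is still rising at full depth suggests the library has not saturated isoform discovery; a plateau suggests additional depth would mostly add reads to already-detected isoforms.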
6. Collapse redundant isoform models
Long-read pipelines often emit highly redundant transcript models differing by terminal exons or indels (Pardo-Palacios et al., 2024; ConesaLab SQANTI3 wiki). When redundancy exceeds project thresholds, Pepkio collapses near-identical models before SQANTI3 classification.
Tools and outputs
Tools used: TAMA collapse or cDNA_Cupcake collapse_isoforms_by_sam as appropriate
Output: collapsed_annotation.gtf; collapse audit log
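The idea of collapsing models that share an intron chain and differ only at their ends can be sketched as below. This is a deliberately simplified stand-in and not the algorithm used by TAMA or cDNA_Cupcake: the function names, the input dictionaries, the 100 bp end tolerance, and the choice to keep the longest model as representative are all assumptions for illustration.

```python
from collections import defaultdict


def intron_chain(exons):
    """Introns as (exon_end, next_exon_start) pairs between sorted exons."""
    exons = sorted(exons)
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))


def bounds(tx):
    """Transcript start and end from its exon coordinates."""
    return min(s for s, _ in tx["exons"]), max(e for _, e in tx["exons"])


def collapse_models(transcripts, max_end_diff=100):
    """Merge models with identical intron chains and similar ends."""
    groups = defaultdict(list)
    for tx in transcripts:
        chain = intron_chain(tx["exons"])
        # Mono-exonic models are left uncollapsed in this sketch.
        key = (tx["chrom"], tx["strand"], chain) if chain else ("mono", tx["id"])
        groups[key].append(tx)

    collapsed, audit = [], []
    for members in groups.values():
        reps = []  # (representative model, ids merged into it)
        by_span = sorted(members, key=lambda t: bounds(t)[1] - bounds(t)[0],
                         reverse=True)
        for tx in by_span:
            start, end = bounds(tx)
            for rep, merged in reps:
                r_start, r_end = bounds(rep)
                if (abs(start - r_start) <= max_end_diff
                        and abs(end - r_end) <= max_end_diff):
                    merged.append(tx["id"])
                    break
            else:
                reps.append((tx, [tx["id"]]))
        for rep, merged in reps:
            collapsed.append(rep)
            audit.append({"kept": rep["id"], "merged": merged})
    return collapsed, audit


# Toy models: A and B share an intron chain; C uses a different acceptor.
models = [
    {"id": "A", "chrom": "chr1", "strand": "+",
     "exons": [(100, 200), (300, 400), (500, 650)]},
    {"id": "B", "chrom": "chr1", "strand": "+",
     "exons": [(120, 200), (300, 400), (500, 600)]},
    {"id": "C", "chrom": "chr1", "strand": "+",
     "exons": [(100, 200), (300, 400), (520, 650)]},
]
collapsed, audit = collapse_models(models)
```

The audit list records which models were folded into each representative, which is the kind of information a collapse audit log would carry.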
7. Classify and filter with SQANTI3
SQANTI3 assigns structural categories (including FSM, ISM, NIC, NNC, genic, intergenic, antisense, and fusion) and quality metrics on TSS, TTS, and splice junctions (Pardo-Palacios et al., 2024). Rules-based filtering is the default; when machine-learning filtering is selected instead, the choice is documented.
Tools and outputs
Tools used: SQANTI3 6.0.1
Output: SQANTI3_classification.txt; SQANTI3_filter_report.html; filtered corrected.gtf
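A rules-based filter of the kind described above can be sketched as a predicate over one row of the classification table. This is not SQANTI3's own code: the thresholds (60% downstream A content, junction coverage of 3) and the assumption that rows have already been parsed into typed values are illustrative, and the column names should be checked against the SQANTI3 version in use.

```python
def passes_rules(row, max_perc_a=60.0, min_junction_cov=3):
    """Return True if a classified isoform survives the sketch's rules.

    row: dict for one isoform, with RTS_stage already parsed to a bool.
    Thresholds are illustrative defaults, not project-specific values.
    """
    intrapriming = row["perc_A_downstream_TTS"] >= max_perc_a
    if row["structural_category"] == "full-splice_match":
        # Reference-matching isoforms only need to pass the intrapriming check.
        return not intrapriming
    junctions_ok = (row["all_canonical"] == "canonical"
                    or row["min_cov"] >= min_junction_cov)
    return not intrapriming and not row["RTS_stage"] and junctions_ok


# Made-up rows used only to exercise the rules.
rows = {
    "fsm_ok": {"structural_category": "full-splice_match",
               "perc_A_downstream_TTS": 20.0, "RTS_stage": False,
               "all_canonical": "canonical", "min_cov": 10},
    "fsm_intrapriming": {"structural_category": "full-splice_match",
                         "perc_A_downstream_TTS": 80.0, "RTS_stage": False,
                         "all_canonical": "canonical", "min_cov": 10},
    "nic_ok": {"structural_category": "novel_in_catalog",
               "perc_A_downstream_TTS": 10.0, "RTS_stage": False,
               "all_canonical": "canonical", "min_cov": 0},
    "nnc_weak_junction": {"structural_category": "novel_not_in_catalog",
                          "perc_A_downstream_TTS": 10.0, "RTS_stage": False,
                          "all_canonical": "non_canonical", "min_cov": 1},
    "nnc_rt_switch": {"structural_category": "novel_not_in_catalog",
                      "perc_A_downstream_TTS": 10.0, "RTS_stage": True,
                      "all_canonical": "canonical", "min_cov": 5},
}
kept = sorted(name for name, row in rows.items() if passes_rules(row))
```

Expressing the rules as a single predicate makes it straightforward to log, per isoform, which condition caused removal if the filter report needs that detail.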
8. Re-quantify the filtered transcript catalog
SQANTI3 expression estimates are used for QC only—not for differential testing (ConesaLab SQANTI3 wiki; Pardo-Palacios et al., 2024). Pepkio re-runs IsoQuant quantification against the SQANTI3-filtered GTF per sample to produce final count matrices.
Tools and outputs
Tools used: IsoQuant 3.13.0 (filtered GTF supplied via --genedb; --reference takes the genome FASTA)
Output: filtered_isoform_counts.tsv; filtered_gene_counts.tsv; TPM tables
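The relationship between the count matrices and the TPM tables can be sketched as a simple per-sample scaling. The sketch assumes, as is common for long-read data, that each read represents one transcript molecule so no length normalization is applied; the function name and the toy counts are hypothetical, and IsoQuant's own TPM output should be treated as authoritative.

```python
def counts_to_tpm(counts):
    """Scale per-isoform read counts in one sample to sum to one million.

    Assumes one read per transcript molecule, so no length correction.
    """
    total = sum(counts.values())
    if total == 0:
        return {tx: 0.0 for tx in counts}
    return {tx: c / total * 1_000_000 for tx, c in counts.items()}


# Toy counts for a single sample.
tpm = counts_to_tpm({"tx1": 50, "tx2": 30, "tx3": 20})
```

Because the scaling is per sample, TPM values are suited to visualization and QC; the raw count matrices remain the input for differential testing.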
9. Test differential expression
Gene-level contrasts use DESeq2 with Benjamini–Hochberg FDR correction when ≥3 biological replicates per condition are available. Isoform-level testing with DRIMSeq and stageR is run only when replicate count and read depth support isoform-resolved power (Du et al., 2023; LRGASP Consortium, 2024; Nowicka & Robinson, 2016). Batch is included in the design matrix when the same contrast spans multiple sequencing runs (Love et al., 2014).
Tools and outputs
Tools used: DESeq2 1.44.0; DRIMSeq 1.30.0; stageR 1.28.0
Output: deg_results.csv; differential_isoform_usage.csv; MA and volcano plots
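The Benjamini-Hochberg adjustment named above can be illustrated in a few lines. This is a standalone sketch of the procedure itself, with a hypothetical function name and toy p-values; DESeq2 performs the adjustment internally (and applies independent filtering by default before adjusting), so this block is for understanding only.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values for four genes.
padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

For the toy input the scaled values are 0.02, 0.02, 0.04, and 0.04 in rank order, so the adjusted values in input order are 0.02, 0.04, 0.04, and 0.02.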
10. Package deliverables
Pepkio assembles figures, exports count tables, writes commented scripts, and drafts a Methods section citing software versions. Custom milestones are included when scoped at kickoff.
Tools and outputs
Tools used: R 4.4.x / Python 3.12 scripts; ggplot2 3.5.1
Output: Final deliverable bundle; HTML QC report; README; Methods draft