1. Validate inputs and sample metadata
Pepkio checks FASTQ integrity, read structure (Read 1: 28 bp barcode/UMI; Read 2: ≥90 bp cDNA for 10x 3′), and sample metadata. Chemistry, expected cell recovery, sequencing depth, and covariates are recorded in sample_manifest.csv. Libraries with sub-threshold depth are flagged before alignment (10x Genomics, 2024).
Tools and outputs
Tools used: FastQC and fastp, as needed
Output: sample_manifest.csv with library IDs, chemistry, read counts, and QC flags
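The read-structure and depth checks described in this step can be sketched as follows. This is a minimal illustration, not Pepkio's implementation: the function names, the flag strings, and the 20,000 reads-per-cell threshold are assumptions chosen for the example, and only the 28 bp and 90 bp lengths come from the text above.

```python
import io

# Lengths taken from the read structure stated above for 10x 3' libraries.
R1_MIN_LEN = 28   # barcode + UMI
R2_MIN_LEN = 90   # cDNA read
# Hypothetical depth threshold (reads per expected cell); set per project.
MIN_READS_PER_CELL = 20_000


def read_lengths(handle, max_reads=10_000):
    """Yield sequence lengths from a FASTQ text handle (4 lines per record)."""
    for i, line in enumerate(handle):
        if i // 4 >= max_reads:
            break
        if i % 4 == 1:  # second line of each record is the sequence
            yield len(line.strip())


def check_read_structure(r1_handle, r2_handle):
    """Return a list of QC flag strings for one library's R1/R2 files."""
    flags = []
    r1 = list(read_lengths(r1_handle))
    r2 = list(read_lengths(r2_handle))
    if len(r1) != len(r2):
        flags.append("read_count_mismatch")
    if any(n < R1_MIN_LEN for n in r1):
        flags.append("r1_too_short")
    if any(n < R2_MIN_LEN for n in r2):
        flags.append("r2_too_short")
    return flags


def depth_flag(total_reads, expected_cells):
    """True when mean reads per expected cell falls below the threshold."""
    return total_reads / expected_cells < MIN_READS_PER_CELL


# Toy records: R1 is 28 bp (passes), R2 is only 50 bp (flagged).
r1 = io.StringIO("@read1\n" + "A" * 28 + "\n+\n" + "I" * 28 + "\n")
r2 = io.StringIO("@read1\n" + "C" * 50 + "\n+\n" + "I" * 50 + "\n")
flags = check_read_structure(r1, r2)
```

For real files, the handles could be opened with gzip.open(path, "rt"), and the resulting flags written to the QC column of sample_manifest.csv.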
2. Align reads and generate count matrices
For 10x data, Pepkio runs cellranger count or cellranger multi with GRCh38-2024-A or GRCm39-2024-A references (10x Genomics, 2024). Sequencing saturation, median genes per cell, and fraction of reads in cells are compared against the vendor's expected ranges.
Tools and outputs
Tools used: Cell Ranger 10.0.0
Output: filtered_feature_bc_matrix/, raw_feature_bc_matrix/, metrics_summary.csv, web_summary.html
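As a rough sketch of how the metrics comparison could be automated, the snippet below parses a metrics_summary.csv row and checks three metrics against minimum values. The thresholds in EXPECTED_MIN are hypothetical placeholders, not vendor figures, and the example row is made up; the column names are assumed to match those written by the Cell Ranger version in use.

```python
import csv
import io

# Hypothetical minimums for illustration; real ranges depend on chemistry and tissue.
EXPECTED_MIN = {
    "Sequencing Saturation": 0.40,
    "Median Genes per Cell": 500,
    "Fraction Reads in Cells": 0.70,
}


def parse_metric(value):
    """Convert strings such as '1,234' or '85.2%' to float (percent -> fraction)."""
    value = value.strip().replace(",", "")
    if value.endswith("%"):
        return float(value[:-1]) / 100.0
    return float(value)


def audit_metrics(handle):
    """Return {metric: (value, passed)} for each metric in EXPECTED_MIN."""
    row = next(csv.DictReader(handle))  # the file holds a single data row
    report = {}
    for name, minimum in EXPECTED_MIN.items():
        value = parse_metric(row[name])
        report[name] = (value, value >= minimum)
    return report


# Made-up example row in the same shape as metrics_summary.csv.
example = io.StringIO(
    "Estimated Number of Cells,Median Genes per Cell,"
    "Sequencing Saturation,Fraction Reads in Cells\n"
    '"5,012","1,850",35.1%,88.4%\n'
)
report = audit_metrics(example)
```

In this toy case the saturation value falls below the placeholder minimum, so that metric would be marked as failing while the other two pass.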
3. Import and audit Cell Ranger outputs
Count matrices are imported preserving raw UMI counts in a dedicated layer. Pepkio audits cell calling, saturation curves, and gene detection distributions. Non-10x count matrices (BD Rhapsody, Parse Evercode, SMART-seq2) are imported via anndata or Seurat::CreateSeuratObject when provided.
Tools and outputs
Tools used: Scanpy 1.12.1 or Seurat 5.2.1
Output: Per-sample .h5ad or .rds with counts layer and initial metadata
4. Correct ambient RNA
Cell-free RNA contaminates droplet matrices and can make marker genes appear in cell types that do not express them (Young & Behjati, 2020; Heumos et al., 2023). When raw and filtered Cell Ranger matrices are available, Pepkio estimates contamination with SoupX and produces background-corrected counts. Elevated estimated soup fractions are flagged for review before clustering.
Tools and outputs
Tools used: SoupX 1.6.2
Output: Corrected count matrix; per-sample soup_fraction in metadata; SoupX diagnostic plots
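The idea behind background correction can be illustrated with a deliberately simplified sketch: estimate the ambient profile from empty droplets, then subtract the expected ambient counts from each cell. This is not SoupX's actual procedure (its estimation of the contamination fraction and its count adjustment are more involved); the function names and the fixed rho value are assumptions for illustration only.

```python
def soup_profile(empty_droplet_counts):
    """Estimate the ambient profile as each gene's share of all UMIs
    observed across empty droplets."""
    totals = {}
    for droplet in empty_droplet_counts:
        for gene, n in droplet.items():
            totals[gene] = totals.get(gene, 0) + n
    grand_total = sum(totals.values())
    return {gene: n / grand_total for gene, n in totals.items()}


def subtract_soup(cell_counts, profile, rho):
    """Remove the expected ambient counts from one cell.

    rho is the contamination fraction: the assumed share of the cell's
    UMIs that originate from ambient RNA.
    """
    n_umi = sum(cell_counts.values())
    corrected = {}
    for gene, observed in cell_counts.items():
        expected_soup = rho * n_umi * profile.get(gene, 0.0)
        corrected[gene] = max(0.0, observed - expected_soup)  # never below zero
    return corrected


# Toy data: the soup is dominated by HBB, which leaks into a T cell.
empties = [{"HBB": 8, "ACTB": 2}, {"HBB": 8, "ACTB": 2}]
profile = soup_profile(empties)
cell = {"HBB": 5, "CD3E": 45, "ACTB": 50}
corrected = subtract_soup(cell, profile, rho=0.1)
```

Here the few HBB counts in the T cell are fully explained by the ambient profile and are removed, while CD3E, absent from the soup, is left unchanged.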
5. Detect and flag doublets
When two or more cells are captured in one droplet, they share a barcode and produce a hybrid transcriptome that distorts clustering (Wolock et al., 2019). Pepkio runs scDblFinder or Scrublet on each sample separately, not on merged objects (Germain et al., 2022; Heumos et al., 2023). Expected multiplet rates are ~0.8% per 1,000 cells on Next GEM and ~0.4% per 1,000 cells on GEM-X (10x Genomics, 2024). Predicted doublets are flagged in metadata.
Tools and outputs
Tools used: scDblFinder 1.18.0 or Scrublet 0.2.3
Output: doublet_score, predicted_doublet columns; doublet score histograms
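The per-1,000-cell rates quoted in this step translate into a per-sample expectation by simple linear scaling. The sketch below shows that arithmetic; the function names and chemistry keys are hypothetical, and the linear form is an approximation that is only meant for typical loading ranges.

```python
# Multiplet fraction per 1,000 recovered cells, from the figures quoted above.
MULTIPLET_RATE_PER_1000 = {"next_gem": 0.008, "gem_x": 0.004}


def expected_multiplet_rate(n_cells, chemistry):
    """Linear approximation of the expected multiplet fraction for one sample."""
    return MULTIPLET_RATE_PER_1000[chemistry] * n_cells / 1000.0


def expected_multiplets(n_cells, chemistry):
    """Approximate number of multiplets expected among n_cells."""
    return round(expected_multiplet_rate(n_cells, chemistry) * n_cells)


# A sample with 10,000 recovered cells on each chemistry.
rate_next_gem = expected_multiplet_rate(10_000, "next_gem")   # about 8%
rate_gem_x = expected_multiplet_rate(10_000, "gem_x")         # about 4%
n_next_gem = expected_multiplets(10_000, "next_gem")
```

A per-sample rate computed this way could be passed to the detector as its prior, for example through Scrublet's expected_doublet_rate argument or scDblFinder's dbr argument, which is one reason to run detection per sample instead of on a merged object.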
6. Filter low-quality cells
Cells with extreme mitochondrial fractions, low gene complexity, or empty-droplet profiles are removed using sample-adaptive thresholds, because optimal QC boundaries vary by tissue and dissociation protocol (Luecken & Theis, 2019). Retained and excluded counts are documented per filter rule.
Tools and outputs
Tools used: Scanpy 1.12.1 or Seurat 5.2.1
Output: Filtered object; QC plots for nCount_RNA, nFeature_RNA, percent.mt, percent.ribo
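One common way to make thresholds sample-adaptive is to place them a fixed number of median absolute deviations (MADs) from each sample's median. The text above does not say which rule Pepkio uses, so the MAD approach, the 5-MAD default, and the field names below are assumptions for illustration.

```python
import statistics


def mad_bounds(values, n_mads=5.0):
    """Return (lower, upper) bounds at median -/+ n_mads * MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return med - n_mads * mad, med + n_mads * mad


def filter_cells(cells, n_mads=5.0):
    """Split cells into kept cells and per-rule exclusion counts.

    cells: list of dicts with 'n_genes' and 'pct_mt' keys (hypothetical names).
    A cell failing several rules is counted once under each rule.
    """
    gene_lo, _ = mad_bounds([c["n_genes"] for c in cells], n_mads)
    _, mt_hi = mad_bounds([c["pct_mt"] for c in cells], n_mads)
    kept = []
    excluded = {"low_complexity": 0, "high_mt": 0}
    for c in cells:
        low = c["n_genes"] < gene_lo
        high = c["pct_mt"] > mt_hi
        excluded["low_complexity"] += int(low)
        excluded["high_mt"] += int(high)
        if not (low or high):
            kept.append(c)
    return kept, excluded


# Toy sample: five healthy cells and one damaged cell.
cells = [
    {"n_genes": 2000, "pct_mt": 3.0},
    {"n_genes": 2100, "pct_mt": 4.0},
    {"n_genes": 1900, "pct_mt": 2.0},
    {"n_genes": 2050, "pct_mt": 3.5},
    {"n_genes": 1950, "pct_mt": 2.5},
    {"n_genes": 200, "pct_mt": 40.0},
]
kept, excluded = filter_cells(cells)
```

Because the bounds are computed from each sample's own distribution, a tissue with naturally high mitochondrial content shifts the cutoff upward instead of losing most of its cells to a fixed threshold.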
7. Normalize and select highly variable genes
For R workflows, Pepkio applies SCTransform v2, which models sequencing depth and returns Pearson residuals for PCA (Hafemeister & Satija, 2019). Python workflows use sc.pp.normalize_total and select HVGs with the seurat_v3 flavor, which expects raw counts rather than normalized values (Wolf et al., 2018). HVG sets and parameters are recorded for reproducibility.
Tools and outputs
Tools used: sctransform 0.4.3 (via Seurat 5.2.1) or Scanpy 1.12.1
Output: Normalized layers; highly_variable_genes.csv
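To make the depth-normalization step concrete, the following pure-Python sketch scales each cell to a common total and applies a log(1 + x) transform. It approximates what sc.pp.normalize_total with an explicit target sum followed by a log transform would compute; the log step and the 10,000 target are assumptions, since the text above names only normalize_total.

```python
import math


def normalize_log1p(counts, target_sum=10_000):
    """Scale each cell (row) to target_sum total counts, then apply log(1 + x).

    counts: list of rows, one row of per-gene UMI counts per cell.
    """
    normalized = []
    for row in counts:
        total = sum(row)
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(x * scale) for x in row])
    return normalized


# Two cells with the same composition but a tenfold difference in depth.
counts = [[1, 3], [10, 30]]
normalized = normalize_log1p(counts)
```

After normalization the two toy cells have identical values, which is the point of the step: differences in sequencing depth are removed so that downstream comparisons reflect composition.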
8. Integrate batches across samples
Harmony integrates same-modality batches with shared cell types (Korsunsky et al., 2019). scVI handles atlas-level integration where compositional differences confound linear methods (Lopez et al., 2018; Gayoso et al., 2022; Luecken et al., 2022). A marker-gene preservation check confirms that distinct biological states are not over-merged.
Tools and outputs
Tools used: harmonypy 1.2.3 or scvi-tools 1.4.3
Output: X_harmony or X_scVI embedding; before/after UMAP by batch and condition
9. Cluster, embed, and annotate cell types
Pepkio builds a neighbor graph, runs Leiden community detection at data-driven resolution (validated with marker genes), and computes UMAP. Cell types are assigned by reference mapping with SingleR (Aran et al., 2019), followed by manual marker review. Ambiguous clusters receive provisional labels with supporting evidence.
Tools and outputs
Tools used: Scanpy 1.12.1 or Seurat 5.2.1; SingleR 2.12.0; leidenalg 0.10.2
Output: Cluster assignments; UMAP/t-SNE plots; marker_gene_table.csv
10. Test differential expression and package deliverables
Cluster-wise or condition-wise DE uses Wilcoxon rank-sum tests with Benjamini–Hochberg FDR correction (Luecken & Theis, 2019). Seurat workflows use the presto implementation on large objects when it is installed (Hao et al., 2023). Results are exported as ranked gene lists with log₂ fold-change, detection rate, and adjusted p-values. Pseudotime trajectory analysis is scoped separately when requested.
Tools and outputs
Tools used: Scanpy rank_genes_groups or Seurat FindMarkers with presto
Output: deg_by_cluster.csv; volcano plots; final .h5ad/.rds; scripts; Methods draft
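The Benjamini–Hochberg adjustment applied to the Wilcoxon p-values can be written in a few lines. The sketch below is a plain-Python illustration of the standard step-up procedure, not the code path used inside Scanpy or Seurat; the function name and the toy p-values are made up for the example.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input.

    Each p-value is multiplied by n / rank (rank 1 = smallest), then a
    running minimum from the largest rank downward enforces monotonicity.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0  # also caps adjusted values at 1
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values for four genes.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)

# Rank genes by adjusted p-value, as in an exported gene list.
ranked = sorted(zip(genes, adj_p), key=lambda pair: pair[1])
```

The running minimum is what keeps the adjusted values monotone: a gene with a smaller raw p-value never ends up with a larger adjusted p-value than a gene ranked below it.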