
Biomarker Discovery Analysis Service — Stability-Scored Signatures from Nested Cross-Validation

Biomarker discovery identifies compact multivariate signatures from high-dimensional omics or clinical matrices for diagnosis, prognosis, or treatment stratification (Díaz-Uriarte et al., 2022). Typical omics studies have far more features than samples (p >> n), so selection and validation must be designed to avoid overfitting (Díaz-Uriarte et al., 2022). Pepkio delivers nested cross-validation, stability-scored feature panels, commented code, and a Methods draft for academic, biotech, and pharma clients; custom inputs and non-standard methods are scoped at kickoff.

Key facts

Key facts about Biomarker Discovery
Fact	Value
Supported platforms / instruments	Feature matrices from bulk RNA-seq (variance-stabilized or voom-normalized tables), microarray, Olink NPX, MaxQuant/DIA-NN/MSstats protein tables, metabolomics peak tables, methylation beta values, clinical/laboratory tables; single-cell pseudobulk or embedding summaries when scoped
Input requirements	Sample-level matrix (samples × features); binary or multiclass outcome labels (≥2 outcome groups); enough samples per class for stratified nested CV without empty folds, with feasibility confirmed at kickoff (Vabalas et al., 2019; Lewis et al., 2023); p >> n designs are common
Reference builds supported	Human GRCh38 (GENCODE v44 / Ensembl 110); mouse GRCm39 (GENCODE vM33 / Ensembl 110); UniProt Swiss-Prot for protein features; not applicable for purely clinical tables
Primary tools (with versions)	scikit-learn 1.6.1; glmnet 4.1-8; Boruta 11.0.0; XGBoost 2.1.3; SHAP 0.46.0; clusterProfiler 4.20.0; limma 3.60; scikit-survival 0.24.1 when survival endpoints scoped; nestedcv (R) when client prefers R-native nested CV — pinned per project
Typical turnaround time	3–5 weeks (single cohort, one outcome, ≤500 features, internal nested CV); 5–8 weeks (external cohort validation, multi-omics harmonization, or survival signatures) — confirmed at kickoff
Deliverable formats	selected_features.csv; signature_performance_summary.csv; nested-CV prediction tables; PDF/SVG ROC, precision-recall, stability, and SHAP plots; serialized models (.pkl, .rds); DOME summary table; commented R/Python scripts with renv.lock or conda export; Methods draft
Key cited best-practice reference	Díaz-Uriarte et al. (2022), PLOS Computational Biology; Walsh et al. (2021), DOME — Nature Methods
Custom / bespoke analysis	Non-standard outcomes, custom feature engineering, client-specified selection methods (e.g., Boruta vs glmnet), multi-omics fusion, ensemble selection, or alternate deliverable formats scoped at kickoff

What is biomarker discovery?

Biomarker discovery applies supervised machine learning to high-dimensional feature matrices to identify a minimal multivariate panel that separates outcome groups or ranks risk (Díaz-Uriarte et al., 2022). Unlike univariate differential testing followed by a top-N cutoff, discovery optimizes correlated feature combinations under regularization or wrapper selection embedded within cross-validation (Haury et al., 2011). Mirza et al. (2023) applied seven ML methods to 701 breast cancer samples across 11 GEO datasets, yielding nine-gene diagnostic and eight-gene prognostic signatures validated by qRT-PCR. Pepkio starts from client-supplied or Pepkio-processed matrices; custom workflows are agreed at kickoff. See the biomarker discovery glossary.

When should you use biomarker discovery?

Biomarker discovery fits when labeled samples provide a high-dimensional feature matrix and the research question requires a compact, multivariate signature—not when only univariate ranking or a pre-specified analyte panel is needed (Díaz-Uriarte et al., 2022).

Comparison of ML biomarker discovery, univariate DE + top-N, and hypothesis-driven fixed panels
Approach	Best for	Limitations	Approximate cost range
ML biomarker discovery (nested CV + stability scoring)	High-dimensional omics or clinical data; compact multi-feature signatures; pilot panels for targeted follow-up	Requires labeled outcomes; small n limits signature reliability; strict leakage control required	Quote-based; bioinformatics effort scales with cohort size, feature count, and validation scope
Univariate DE + top-N cutoff	Exploratory ranking; large univariate effect sizes	No multivariate optimization; ignores correlation structure; unstable at high dimensionality	Lower bioinformatics cost
Hypothesis-driven fixed panel	Known analytes with prior clinical or mechanistic evidence	Cannot discover novel feature combinations	Assay cost dominates; minimal bioinformatics

Neoadjuvant breast cancer response: Sammut et al. (2022) built a multi-omic ML predictor from 168 neoadjuvant patients; the integrated model achieved AUC 0.87 on a 75-patient external cohort.
Breast cancer transcriptomic signatures: Mirza et al. (2023) derived nine-gene diagnostic and eight-gene prognostic signatures from 701 samples across 11 GEO datasets, with qRT-PCR validation.
HCC prognostic panel: Feng et al. (2023) used 77 algorithms on scRNA-seq and bulk RNA-seq to derive an 11-gene NK-related signature validated across TCGA, GEO, and ICGC cohorts.

How the analysis works — step by step

1. Scope outcome, signature target, and validation plan
Pepkio confirms the prediction target, endpoint type (binary, multiclass, or survival when scoped), signature size range, and whether an independent holdout or external cohort exists (Walsh et al., 2021; Díaz-Uriarte et al., 2022). Confounded batch-by-outcome designs and duplicate patients are flagged before modeling.
Tools and outputs
Tools used: Custom scoping templates; DOME checklist (Walsh et al., 2021)
Output: Signed scope memo with primary metrics (AUROC, AUPRC, balanced accuracy, or C-index when survival is scoped)
2. Validate feature matrix and harmonize metadata
Sample IDs are matched between the feature matrix and outcome table; orientation, missingness, duplicate features, and class balance are audited. Feature identifiers are mapped to GENCODE v44 gene IDs or UniProt accessions when annotation is in scope.
Tools and outputs
Tools used: Custom validation scripts; clusterProfiler 4.20.0 for ID mapping when scoped
Output: sample_manifest.csv; feature_qc_summary.csv
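The checks in this step can be sketched in a few lines of pandas. This is a minimal illustration, not Pepkio's validation script: the function name `audit_matrix`, the report keys, and the toy sample and feature IDs are all hypothetical, and the matrix is assumed to be oriented samples × features with sample IDs as the index.

```python
import pandas as pd

def audit_matrix(features: pd.DataFrame, outcomes: pd.Series) -> dict:
    """Match sample IDs and summarize basic QC for a samples x features matrix."""
    shared = features.index.intersection(outcomes.index)
    return {
        "n_shared_samples": len(shared),
        # Samples present in one table but not the other
        "missing_in_outcomes": sorted(features.index.difference(outcomes.index)),
        "missing_in_features": sorted(outcomes.index.difference(features.index)),
        # Duplicate identifiers that need resolving before modeling
        "duplicate_sample_ids": sorted(features.index[features.index.duplicated()].unique()),
        "duplicate_feature_ids": sorted(features.columns[features.columns.duplicated()].unique()),
        # Overall fraction of missing cells among matched samples
        "missing_fraction": float(features.loc[shared].isna().mean().mean()),
        # Class balance among matched samples
        "class_counts": outcomes.loc[shared].value_counts().to_dict(),
    }

# Toy data (hypothetical IDs): S3 has no outcome label, S4 has no feature row
features = pd.DataFrame(
    {"GENE_A": [1.0, 2.0, None], "GENE_B": [0.5, 0.1, 0.9]},
    index=["S1", "S2", "S3"],
)
outcomes = pd.Series({"S1": "case", "S2": "control", "S4": "case"})
report = audit_matrix(features, outcomes)
print(report)
```

In practice the unmatched and duplicate IDs would be resolved with the client before any split is defined, since silently dropping samples changes class balance.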
3. Profile missingness and unsupervised variance
Missing values, near-zero-variance features, and outlier samples are summarized. Variance filtering is strictly unsupervised: thresholds are documented and never use outcome labels, because supervised filtering before cross-validation inflates performance estimates (Vabalas et al., 2019).
Tools and outputs
Tools used: scikit-learn 1.6.1 (VarianceThreshold); custom R diagnostics
Output: missingness_profile.csv; variance_filter_log.csv; sample and feature QC plots (PDF/SVG)
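As a rough sketch of label-free variance profiling with scikit-learn's VarianceThreshold: the threshold value, the synthetic matrix, and the feature IDs below are illustrative assumptions. The key point is that `fit` receives no outcome labels.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))   # synthetic samples x features matrix
X[:, 2] = 1.0                  # make the third feature constant

feature_ids = ["F1", "F2", "F3", "F4", "F5"]  # hypothetical identifiers

# Fit without y: the filter only sees feature variances, never outcome labels
selector = VarianceThreshold(threshold=1e-8)
selector.fit(X)

kept = selector.get_support()
removed = [f for f, keep in zip(feature_ids, kept) if not keep]
print("variances:", np.round(selector.variances_, 3))
print("removed:", removed)
```

When the filter is actually applied for modeling, it would be refit inside each training fold rather than on the full matrix.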
4. Audit data leakage
Pepkio applies the seven guiding questions from Bernett et al. (2024) before splitting to confirm preprocessing, feature selection, and tuning occur only on training folds.
Tools and outputs
Tools used: Documented leakage checklist (Bernett et al., 2024)
Output: leakage_audit.md
5. Define patient-level nested cross-validation splits
Patient-level nested CV separates inner-loop feature selection and tuning from outer-loop performance estimation (Walsh et al., 2021; Vabalas et al., 2019). Stratification preserves class proportions when sample size permits.
Tools and outputs
Tools used: scikit-learn 1.6.1 (StratifiedKFold, GridSearchCV); nestedcv (R) when client prefers R-native nested CV (Lewis et al., 2023)
Output: cv_fold_assignments.csv with outer and inner fold IDs per sample
6. Preprocess within inner folds
Scaling, imputation, and optional variance filtering are fit on inner training data only, then applied unchanged to the corresponding held-out samples. Batch effects are removed with limma removeBatchEffect on training folds when batch is not confounded with outcome (Whalen et al., 2022).
Tools and outputs
Tools used: scikit-learn 1.6.1 (Pipeline, ColumnTransformer); limma 3.60
Output: Fold-specific preprocessed matrices logged in pipeline parameters
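The fit-on-training-only rule can be illustrated with a scikit-learn Pipeline. This is a minimal sketch on synthetic data: median imputation and standard scaling are illustrative choices, and the limma batch step (R) is omitted.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_train = rng.normal(loc=5.0, scale=2.0, size=(40, 3))  # one training fold
X_test = rng.normal(loc=5.0, scale=2.0, size=(10, 3))   # its held-out fold
X_train[0, 0] = np.nan
X_test[0, 1] = np.nan

preprocess = Pipeline(
    [
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]
)

# Medians, means, and standard deviations are learned from training data only
X_train_pp = preprocess.fit_transform(X_train)
# The held-out fold is transformed with the training-derived parameters
X_test_pp = preprocess.transform(X_test)

imputer = preprocess.named_steps["impute"]
scaler = preprocess.named_steps["scale"]
print("training medians:", np.round(imputer.statistics_, 3))
print("training means:", np.round(scaler.mean_, 3))
```

Wrapping the steps in a single Pipeline means any cross-validation utility that clones and refits the estimator repeats this discipline automatically in every fold.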
7. Run embedded feature selection in inner loops
Feature selection runs inside inner CV folds—never on pooled train and test data (Vabalas et al., 2019). Default: glmnet LASSO or elastic-net (glmnet 4.1-8); Boruta 11.0.0 when all-relevant identification is scoped; optional univariate prefilter inside inner folds only (Haury et al., 2011).
Tools and outputs
Tools used: glmnet 4.1-8; Boruta 11.0.0; scikit-learn 1.6.1
Output: Per-fold selected feature lists; inner-CV tuning metrics
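A sketch of embedded selection tuned inside an inner loop follows. To stay in one language it swaps scikit-learn's L1-penalized LogisticRegression in for glmnet, so it is an analogue of the default, not the glmnet workflow itself. The synthetic data, the grid of C values, and the three-fold inner CV are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic p >> n data standing in for ONE outer-training fold
X, y = make_classification(
    n_samples=80, n_features=200, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)

pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        # The L1 penalty drives most coefficients to exactly zero (LASSO-style)
        ("lasso", LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000)),
    ]
)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    pipeline, {"lasso__C": [0.1, 0.5, 1.0]}, scoring="roc_auc", cv=inner_cv
)
search.fit(X, y)  # scaling and selection are refit inside every inner split

coef = search.best_estimator_.named_steps["lasso"].coef_.ravel()
selected = np.flatnonzero(coef)  # column indices with non-zero coefficients
print("best C:", search.best_params_["lasso__C"])
print("n selected:", selected.size)
```

The indices in `selected` are what would be logged as the per-fold feature list for this outer-training fold.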
8. Train classifiers and evaluate on outer folds
A regularized logistic or elastic-net classifier is trained on inner-selected features and evaluated on outer-fold samples. XGBoost 2.1.3 is compared when sample size and scope permit (Greener et al., 2022); simpler baselines are reported alongside complex models.
Tools and outputs
Tools used: scikit-learn 1.6.1; XGBoost 2.1.3; scikit-survival 0.24.1 when survival endpoints scoped
Output: signature_performance_summary.csv (AUROC, AUPRC, balanced accuracy, or C-index per outer fold)
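The outer loop can be sketched as follows, under the assumption of an L2-regularized logistic classifier, synthetic data, and a two-value tuning grid. Each outer test fold is scored by a model that was tuned and fit without ever seeing it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=100, n_features=50, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)

def make_search():
    """Inner loop: tune regularization strength on training data only."""
    pipe = Pipeline(
        [("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))]
    )
    inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    return GridSearchCV(pipe, {"clf__C": [0.1, 1.0]}, scoring="roc_auc", cv=inner_cv)

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer_auroc = []
for train_idx, test_idx in outer_cv.split(X, y):
    search = make_search()
    search.fit(X[train_idx], y[train_idx])            # inner CV sees training rows only
    scores = search.predict_proba(X[test_idx])[:, 1]  # untouched outer test fold
    outer_auroc.append(roc_auc_score(y[test_idx], scores))

print("per-fold AUROC:", np.round(outer_auroc, 3))
print("mean AUROC:", round(float(np.mean(outer_auroc)), 3))
```

Reporting the per-fold values alongside the mean shows how much the estimate varies across outer folds, which matters most in small cohorts.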
9. Compute signature stability across folds
Selection frequency—the proportion of outer folds in which each feature is selected—is computed for every candidate feature (Haury et al., 2011; Meinshausen & Bühlmann, 2010). A consensus signature includes features meeting a kickoff-agreed stability threshold (often ≥50–60% of outer folds).
Tools and outputs
Tools used: Custom R/Python stability scripts; scikit-learn 1.6.1
Output: selected_features.csv (feature_id, selection_frequency, mean_coefficient, annotation); stability barplot and heatmap (PDF/SVG)
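Selection frequency and the consensus filter reduce to a simple count. A minimal sketch, with hypothetical function names, toy gene IDs, and a 60% threshold chosen only as an example of a kickoff-agreed value:

```python
from collections import Counter

def selection_frequency(per_fold_features):
    """Proportion of outer folds in which each feature was selected."""
    n_folds = len(per_fold_features)
    # set() guards against a feature being listed twice within one fold
    counts = Counter(f for fold in per_fold_features for f in set(fold))
    return {feature: count / n_folds for feature, count in counts.items()}

def consensus_signature(frequencies, threshold=0.6):
    """Features whose selection frequency meets the agreed threshold."""
    return sorted(f for f, freq in frequencies.items() if freq >= threshold)

# Toy per-fold selections from five outer folds (hypothetical gene IDs)
per_fold = [
    ["GENE_A", "GENE_B", "GENE_C"],
    ["GENE_A", "GENE_B"],
    ["GENE_A", "GENE_D"],
    ["GENE_A", "GENE_B", "GENE_D"],
    ["GENE_A", "GENE_C"],
]

freqs = selection_frequency(per_fold)
consensus = consensus_signature(freqs, threshold=0.6)
print(freqs)
print(consensus)
```

Here GENE_A appears in all five folds and GENE_B in three, so both meet the 60% threshold, while GENE_C and GENE_D (two folds each) would be documented but left out of the primary panel.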
10. Interpret signature biology and package deliverables
Consensus features receive SHAP attribution and pathway enrichment (SHAP 0.46.0; clusterProfiler 4.20.0). Deliverables are assembled; external or holdout evaluation is reported separately when a cohort was reserved at kickoff (Whalen et al., 2022).
Tools and outputs
Tools used: SHAP 0.46.0; clusterProfiler 4.20.0
Output: pathway_enrichment_signature.csv; SHAP summary plots; final deliverable bundle; Methods draft

What Pepkio delivers

Processed data files

selected_features.csv, signature_performance_summary.csv, nested_cv_predictions.csv, cv_fold_assignments.csv
Serialized models (.pkl, .rds) with documented decision thresholds

Figures (PDF/SVG)

ROC and precision-recall curves; selection-frequency barplot; stability heatmap
SHAP summary and beeswarm plots; optional Kaplan–Meier plots when survival signatures are scoped

Tables and code

QC and enrichment tables listed in pipeline steps
Commented R/Python scripts with renv.lock or conda export via private Git or agreed file transfer—you retain full ownership

Documentation

HTML QC report, DOME summary table (Walsh et al., 2021), README, and journal-formatted Methods draft

Post-delivery support

Milestone check-ins; clarification of methods and minor revisions within agreed scope (typically ≤20% of deliverables)
Substantial new analyses are scoped as separate milestones

Technical decisions we make — and why

Validation design: nested CV vs single train/test split: Nested CV is the default. Vabalas et al. (2019) showed that running feature selection on pooled training and test data biases accuracy estimates more than non-nested hyperparameter tuning does, and DOME recommends nested CV for unbiased estimation (Walsh et al., 2021). A single holdout is used only when an external cohort is reserved at kickoff.
Feature selection: glmnet vs Boruta vs univariate filter: glmnet is the default for sparse signatures in p >> n settings (Friedman et al., 2010). Boruta identifies all relevant features when broader sets are needed (Kursa & Rudnicki, 2010). Univariate prefilters run only inside inner folds; Haury et al. (2011) found t-test filters competitive with wrappers on breast cancer data.
Stability threshold: Features meeting a kickoff-agreed selection frequency (often ≥50–60% of outer folds) enter the consensus panel (Meinshausen & Bühlmann, 2010).
Classifier: regularized logistic vs XGBoost: Regularized logistic or elastic-net is the default for small tabular cohorts (Greener et al., 2022; Whalen et al., 2022). XGBoost is compared when sample size and scope support nonlinear boundaries.
Batch correction: limma removeBatchEffect on training folds when batch and outcome are separable (Whalen et al., 2022). Fully confounded designs are flagged at kickoff.

Common questions

What is the minimum sample size for biomarker discovery?

There is no universal minimum, but nested CV requires enough samples per class to populate folds without empty strata—feasibility confirmed at kickoff (Vabalas et al., 2019; Lewis et al., 2023). Smaller cohorts support exploratory work but should not claim external validity without a holdout (Whalen et al., 2022).

Can you run biomarker discovery on poor-quality or low-yield data?

Yes, within documented limits. High missingness, extreme imbalance, or batch confounded with outcome are flagged before modeling. Pepkio does not claim robust signatures from severely underpowered or confounded designs without a feasibility discussion at kickoff.

Which omics platforms and data formats do you accept?

Pepkio accepts feature matrices from bulk RNA-seq, microarray, Olink NPX, MaxQuant/DIA-NN/MSstats protein tables, metabolomics peak tables, methylation beta values, and clinical/laboratory tables. Single-cell pseudobulk or embedding summaries are accepted when scoped. Upstream Olink QC or MS preprocessing can be scoped through the related proteomics services. Matrices must be sample-level with documented upstream normalization.

How long does biomarker discovery take at Pepkio?

Standard single-cohort projects with one binary outcome and ≤500 features typically complete in 3–5 weeks from data receipt. External cohort validation, multi-omics harmonization, survival endpoints, or bespoke ensemble selection may require 5–8 weeks—confirmed at kickoff.

How do you handle batch effects in biomarker discovery?

Batch is diagnosed with PCA and correlation checks. When batch is not confounded with outcome, Pepkio regresses batch with limma removeBatchEffect within training folds only (Whalen et al., 2022). Confounded designs are flagged at kickoff.

Do I own the code—and in what format is it delivered?

Yes—you retain full ownership of all code, scripts, and results. Pepkio delivers commented R and/or Python scripts with renv.lock or conda environment exports, organized by pipeline stage with a README. Delivery is via private Git repository or agreed file transfer.

Can I be involved during the biomarker discovery analysis?

Yes. Checkpoint reviews occur after QC, leakage audit, consensus signature selection, and before final delivery. You can review outcome definitions, selection methods, stability thresholds, and figures within agreed scope.

What does post-delivery reviewer support include?

Support covers clarification of CV design, feature selection methods, stability metrics, and minor figure or table revisions within agreed scope (typically ≤20% of deliverables). Pepkio drafts Methods and Supplementary text for analyses we performed. Substantial new outcomes, cohorts, or selection methods requested by reviewers are scoped as separate milestones.

Is co-authorship required?

No. Pepkio operates as a fee-for-service provider and does not require co-authorship unless explicitly discussed in advance. Acknowledgment of bioinformatics support in the Acknowledgments section is standard practice.

Should I use LASSO or Boruta for biomarker discovery?

LASSO or elastic-net via glmnet is the default when you need a compact, sparse signature in p >> n settings (Friedman et al., 2010). Boruta is appropriate when the goal is to identify all statistically relevant features rather than a minimal panel (Kursa & Rudnicki, 2010). Pepkio compares both within nested CV when scoped and reports selection stability for each.

How do you report feature stability across cross-validation folds?

Pepkio reports selection frequency—the proportion of outer CV folds in which each feature is selected—and visualizes stability with barplots and feature-by-fold heatmaps (Haury et al., 2011). The consensus signature lists features meeting a kickoff-agreed threshold (often ≥50–60% of folds). Unstable features selected in only one or two folds are documented but excluded from the primary panel unless you request otherwise.

How many features should be in a publishable biomarker panel?

Panel size depends on assay feasibility and cohort size, not a fixed rule. Published examples range from eight-gene panels to multi-omic predictors integrating dozens of features (Mirza et al., 2023; Sammut et al., 2022). Pepkio prioritizes parsimony—glmnet LASSO and stability filtering favor smaller panels—and documents trade-offs between signature size, stability, and outer-fold performance in the deliverable report.

Related services

Predictive modeling — Full classifier or survival model validation, calibration assessment, and TRIPOD+AI-aligned reporting after a signature is defined.
Multi-omics integration — Matched multi-layer feature matrices from MOFA2 or DIABLO as inputs for supervised signature construction.
Bulk RNA-seq — Expression count matrices and differential-expression preprocessing as transcriptomic feature inputs.
Olink proximity extension assay — Bridge-normalized NPX matrices and QC for targeted protein signature discovery.
Bioinformatics consulting — Experimental design, power estimation, and covariate planning before committing to ML signature work.

References

Díaz-Uriarte R, Gómez de Lope E, Giugno R, et al. Ten quick tips for biomarker discovery and validation analyses using machine learning. PLOS Computational Biology. 2022;18(8):e1010357. https://doi.org/10.1371/journal.pcbi.1010357 (PMID: 35951526)
Walsh I, Fishman D, Garcia-Gasulla D, et al. DOME: recommendations for supervised machine learning validation in biology. Nature Methods. 2021;18(10):1122–1127. https://doi.org/10.1038/s41592-021-01205-4 (PMID: 34316068)
Vabalas A, Gowen E, Poliakoff E, Casson AJ. Machine learning algorithm validation with a limited sample size. PLOS ONE. 2019;14(11):e0224365. https://doi.org/10.1371/journal.pone.0224365 (PMID: 31697686)
Bernett J, Blumenthal DB, Grimm DG, et al. Guiding questions to avoid data leakage in biological machine learning applications. Nature Methods. 2024;21(8):1444–1453. https://doi.org/10.1038/s41592-024-02362-y (PMID: 39122953)
Whalen S, Schreiber J, Noble WS, Pollard KS. Navigating the pitfalls of applying machine learning in genomics. Nature Reviews Genetics. 2022;23(3):169–181. https://doi.org/10.1038/s41576-021-00434-9 (PMID: 34837041)
Greener JG, Kandathil SM, Moffat L, Jones DT. A guide to machine learning for biologists. Nature Reviews Molecular Cell Biology. 2022;23(1):40–55. https://doi.org/10.1038/s41580-021-00407-0 (PMID: 34518686)
Haury A-C, Gestraud P, Vert J-P. The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PLOS ONE. 2011;6(12):e28210. https://doi.org/10.1371/journal.pone.0028210 (PMID: 22205940)
Meinshausen N, Bühlmann P. Stability selection. Journal of the Royal Statistical Society Series B. 2010;72(4):417–473. https://doi.org/10.1111/j.1467-9868.2010.00740.x
Lewis MJ, Spiliopoulou A, Goldmann K, et al. nestedcv: an R package for fast implementation of nested cross-validation with embedded feature selection designed for transcriptomics and high-dimensional data. Bioinformatics Advances. 2023;3(1):vbad048. https://doi.org/10.1093/bioadv/vbad048 (PMID: 37113250)
Sammut SJ, Crispin-Ortuzar M, Chin S-F, et al. Multi-omic machine learning predictor of breast cancer therapy response. Nature. 2022;601(7894):623–629. https://doi.org/10.1038/s41586-021-04278-5 (PMID: 34875674)
Mirza Z, Ansari MS, Iqbal MS, et al. Identification of novel diagnostic and prognostic gene signature biomarkers for breast cancer using artificial intelligence and machine learning assisted transcriptomics analysis. Cancers. 2023;15(12):3237. https://doi.org/10.3390/cancers15123237 (PMID: 37370847)
Feng Q, Huang Z, Song L, Wang L, Lu H, Wu L. Combining bulk and single-cell RNA-sequencing data to develop an NK cell-related prognostic signature for hepatocellular carcinoma based on an integrated machine learning framework. European Journal of Medical Research. 2023;28(1):300. https://doi.org/10.1186/s40001-023-01300-6 (PMID: 37649103)
Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. Journal of Machine Learning Research. 2011;12:2825–2830. https://jmlr.org/papers/v12/pedregosa11a.html
Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software. 2010;33(1):1–22. https://doi.org/10.18637/jss.v033.i01
Kursa MB, Rudnicki WR. Feature selection with the Boruta package. Journal of Statistical Software. 2010;36(11):1–13. https://doi.org/10.18637/jss.v036.i11
Chen T, Guestrin C. XGBoost: a scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD. 2016:785–794. https://doi.org/10.1145/2939672.2939785
Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems. 2017;30:4765–4774. https://proceedings.neurips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions
Yu G, Wang L-G, Han Y, He Q-Y. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 2012;16(5):284–287. https://doi.org/10.1089/omi.2011.0118 (PMID: 22455463)
Ritchie ME, Phipson B, Wu D, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research. 2015;43(7):e47. https://doi.org/10.1093/nar/gkv007 (PMID: 25605792)
Pölsterl S. scikit-survival: a library for time-to-event analysis built on top of scikit-learn. Journal of Machine Learning Research. 2020;21(212):1–6. https://jmlr.org/papers/v21/20-729.html

Let's Talk About Your Science

Tell us:

• Your biological question
• Data type and size
• Timeline constraints

We'll tell you:

• What's feasible
• How long it will take
• Exactly what it will cost
