
Untargeted Metabolomics Analysis Service — Annotated Feature Matrices from Raw LC-MS Spectra to limma-Tested Contrasts

Untargeted metabolomics profiles small-molecule abundance by LC-MS or GC-MS to discover pathway perturbations and candidate biomarkers. Pepkio delivers version-pinned preprocessing, annotation-tier tables, limma-tested contrasts, figures, documented R code, and a Methods draft for academic, biotech, and pharma teams. Custom and bespoke workflows—non-standard inputs, outputs, or analyses—are scoped at kickoff. HMDB 5.0 catalogs 217,920 metabolite entries (Wishart et al., 2022).

Key facts

Key facts about Untargeted Metabolomics
Fact	Value
Supported platforms / instruments	Thermo Orbitrap (Q Exactive, Exploris, Fusion Lumos); SCIEX TripleTOF and ZenoTOF; Agilent Q-TOF (`.d`); Bruker timsTOF and Waters SYNAPT/Xevo when scoped at kickoff; mzML/mzXML; GC-MS and vendor peak tables on request
Input requirements	Thermo `.raw`, SCIEX `.wiff`, Agilent `.d`, Bruker `.d`, Waters `.raw`, or mzML/mzXML; multiple biological replicates per condition (often ≥3) for variance estimation (Reinhold et al., 2019); pooled QC injections recommended; polarity mode and chromatography method documented at kickoff; typical standard cohort 10–40 samples
Reference databases / standards	HMDB 5.0; METLIN; KEGG; LipidMaps; GNPS spectral libraries; organism-specific in-house libraries on request
Primary tools (with versions)	xcms 4.x+; MS-DIAL 5.x+; MZmine 3.x+; MetaboAnalystR 4.x+; limma 3.60+; ropls 1.32+; msconvert (ProteoWizard) 3.x+ — pinned per project
Typical turnaround time	2–4 weeks (standard untargeted LC-MS cohort, one contrast); 4–8 weeks (GC-MS extensions, multi-batch correction, or multi-contrast designs) — confirmed at kickoff
Deliverable formats	Feature/intensity matrices (`.csv`, `.tsv`); annotation tables with confidence tiers; PDF/SVG PCA, PLS-DA, and volcano plots; pathway enrichment tables; HTML QC report; commented R scripts; Methods draft
Key cited best-practice reference	Alseekh et al. (2021), Nature Methods (MS metabolomics reporting and annotation tiers)
Custom / bespoke analysis	Non-standard inputs, contrasts, cross-cohort harmonization, lipid class-specific workflows, GNPS networking, or client-specified ML models scoped at kickoff

What is untargeted metabolomics?

Untargeted metabolomics detects and quantifies m/z–retention time (RT) features from LC-MS or GC-MS without a predefined analyte list, then aligns peaks, normalizes intensities, annotates features, and tests differential abundance (Smith et al., 2006; Alseekh et al., 2021). Unlike targeted LC-MS/MS with calibrated standards, untargeted profiling is semi-quantitative and discovery-oriented (Alseekh et al., 2021). For example, Lloyd-Price et al. (2019) generated 546 fecal metabolite profiles from 132 participants (Crohn's disease, ulcerative colitis, and non-IBD controls) alongside metagenomics and host transcriptomics. Pepkio processes raw vendor files or documented peak tables with locked preprocessing parameters; custom entry points are agreed at kickoff. See the untargeted metabolomics glossary.

When should you use untargeted metabolomics?

Untargeted metabolomics fits when the research question requires broad biochemical discovery—identifying which pathways or metabolite classes shift under disease, treatment, or exposure—before committing to a targeted panel (Alseekh et al., 2021; Reinhold et al., 2019).

Comparison of untargeted LC-MS/GC-MS, targeted LC-MS/MS, and NMR metabolomics approaches
Approach	Best for	Limitations	Approximate cost range
Untargeted LC-MS / GC-MS	Hypothesis generation, pathway profiling, biomarker discovery across unknown metabolites	Incomplete annotation; semi-quantitative; pipeline choice shifts feature sets (Aigensberger et al., 2025)	Moderate per-sample MS cost; moderate bioinformatics cost
Targeted LC-MS/MS	Known metabolite panels, absolute quantification with standards	Limited to predefined compounds; requires panel design upfront	Lower per-compound MS cost; panel design and standard procurement cost
NMR metabolomics (on request)	Structure elucidation, highly reproducible quantification of abundant metabolites	Lower sensitivity and metabolome coverage than LC-MS for low-abundance features	Instrument-dependent; moderate per-sample cost

Which fecal metabolites track IBD activity? Lloyd-Price et al. (2019) generated 546 fecal metabolite profiles from 132 participants with or without IBD, linking bile acids, acylcarnitines, and short-chain fatty acids to disease activity.
Which urine metabolites predict incident type 2 diabetes? Salihovic et al. (2020) analyzed UPLC-MS data from 1,424 adults; six metabolites improved incident T2D prediction beyond clinical risk factors (C-statistic 0.866 to 0.892).
When should metabolomics follow proteogenomics in ccRCC? Clark et al. (2019) profiled 103 tumors and 80 normal adjacent tissues by proteogenomics only and found oxidative phosphorylation dysregulation at the protein but not the RNA level, which motivates matched metabolomics when sample IDs align.

How the analysis works — step by step

1. Scope study design and select preprocessing pipeline
Pepkio confirms the biological question, sample matrix, platform, polarity mode, contrast definitions, and pooled-QC design. Confounded batch is flagged before processing (Alseekh et al., 2021; Reinhold et al., 2019). Preprocessing tool—xcms, MS-DIAL, or MZmine—is locked at kickoff (Aigensberger et al., 2025).
Tools and outputs
Tools used: Project scoping checklist; MetaboAnalystR design module when applicable
Output: Signed scope document with pipeline choice and `analysis_parameters.yaml` draft
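The confounding check in step 1 can be pictured as a batch-by-condition cross-tabulation. The following is a minimal Python sketch for illustration only (the delivered scripts are in R). It assumes each sample is described by a `(batch, condition)` pair; the function names and status labels are hypothetical and are not part of any Pepkio tool.

```python
from collections import defaultdict


def batch_condition_table(samples):
    """Count samples per (batch, condition); samples is a list of (batch, condition)."""
    table = defaultdict(lambda: defaultdict(int))
    for batch, condition in samples:
        table[batch][condition] += 1
    return {batch: dict(counts) for batch, counts in table.items()}


def confounding_status(samples):
    """Classify the design using hypothetical labels chosen for this sketch."""
    table = batch_condition_table(samples)
    if len(table) <= 1:
        return "single batch"
    # A batch that holds only one condition carries no within-batch contrast.
    single = [b for b, counts in table.items() if len(counts) == 1]
    if len(single) == len(table):
        # Condition is nested in batch, so the two effects cannot be separated.
        return "fully confounded"
    if single:
        return "partially confounded"
    return "crossed"


design = [("B1", "control"), ("B1", "control"), ("B2", "treated"), ("B2", "treated")]
print(confounding_status(design))
```

A design where every batch contains a single condition is the case that cannot be rescued statistically, which is why it is worth detecting before any processing starts.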
2. Validate inputs and record metadata
Raw file integrity, polarity, injection order, and sample IDs are verified against the design. Vendor export parameters are recorded for pre-processed peak tables. Pooled QC and blank injections are confirmed (Alseekh et al., 2021; Mosley et al., 2024).
Tools and outputs
Tools used: Custom validation scripts; `sample_manifest.csv` schema check
Output: `sample_manifest.csv` with file paths, condition labels, batch, injection order, and replicate IDs
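A schema check on `sample_manifest.csv` might look like the sketch below. It is a Python illustration under stated assumptions: the column names (`file_path`, `condition`, `batch`, `injection_order`, `replicate_id`) are hypothetical stand-ins for the fields listed above, and the real schema is whatever is agreed at kickoff.

```python
import csv
import io

# Hypothetical column names standing in for the manifest fields described above.
REQUIRED_COLUMNS = ["file_path", "condition", "batch", "injection_order", "replicate_id"]


def validate_manifest(csv_text):
    """Return a list of human-readable problems found in a manifest CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return ["missing columns: " + ", ".join(missing)]

    problems = []
    seen_paths = set()
    seen_slots = set()
    for line_no, row in enumerate(reader, start=2):  # the header is line 1
        cells = {c: (row[c] or "").strip() for c in REQUIRED_COLUMNS}
        for column, value in cells.items():
            if not value:
                problems.append(f"line {line_no}: empty {column}")
        path = cells["file_path"]
        if path and path in seen_paths:
            problems.append(f"line {line_no}: duplicate file_path")
        seen_paths.add(path)
        order = cells["injection_order"]
        if order and not order.isdigit():
            problems.append(f"line {line_no}: injection_order is not an integer")
        slot = (cells["batch"], order)
        if order.isdigit() and slot in seen_slots:
            problems.append(f"line {line_no}: injection_order reused within batch")
        seen_slots.add(slot)
    return problems
```

Returning a list of problems rather than raising on the first one lets every issue in the manifest be reported in a single pass.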
3. Convert vendor formats and run raw spectral QC
Vendor raw files convert to mzML when cross-platform ingestion is required; SCIEX `.wiff` loads in MS-DIAL when scoped. TIC/BPC overlays, mass accuracy, and blank-feature flags are inspected before peak picking (Chambers et al., 2012; Pang et al., 2024; Reinhold et al., 2019).
Tools and outputs
Tools used: msconvert (ProteoWizard); MetaboAnalystR QC module
Output: `raw_qc_report.html`; per-run TIC/BPC plots (PDF/SVG); mzML files when applicable
4. Detect, align, and integrate peaks
Features are extracted with consistent m/z and RT tolerances; missing peaks are gap-filled. Peak-picking parameters are documented because they are instrument- and matrix-dependent (Tsugawa et al., 2015; Heuckeroth et al., 2024).
Tools and outputs
Tools used: xcms; MS-DIAL; or MZmine (locked at kickoff)
Output: `peak_matrix.csv` (`feature_id`, `mz`, `rt`, sample intensity columns)
5. Normalize and diagnose batch and run-order effects
When pooled-QC samples are available, QC-based robust LOESS signal correction (QC-RLSC) corrects run-order drift and probabilistic quotient normalization (PQN) adjusts for sample-wise dilution differences; total-sum scaling is documented when QC samples are absent (Reinhold et al., 2019). PCA and correlation heatmaps separate condition from batch and injection order before differential testing.
Tools and outputs
Tools used: MetaboAnalystR; limma `removeBatchEffect` when batch is not confounded with condition (Ritchie et al., 2015)
Output: `normalized_matrix.csv`; PCA scores plot; pooled-QC drift plot (PDF/SVG)
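As a rough illustration of PQN, the Python sketch below builds a reference profile from the feature-wise median of the pooled-QC injections, computes each sample's feature-wise quotients against that reference, and divides the sample by the median quotient. Using the pooled-QC median as the reference is an assumption of this sketch, and the function and argument names are hypothetical; production pipelines in R would typically rely on an established package instead.

```python
from statistics import median


def pqn_normalize(samples, qc_injections):
    """Probabilistic quotient normalization.

    samples: dict mapping sample name -> list of intensities (shared feature order)
    qc_injections: list of intensity lists from pooled-QC runs (same feature order)
    """
    n_features = len(qc_injections[0])
    # Reference profile: feature-wise median across pooled-QC injections.
    reference = [median(qc[j] for qc in qc_injections) for j in range(n_features)]

    normalized = {}
    for name, values in samples.items():
        # Quotients are only defined where both intensities are positive.
        quotients = [v / r for v, r in zip(values, reference) if v > 0 and r > 0]
        dilution_factor = median(quotients)
        normalized[name] = [v / dilution_factor for v in values]
    return normalized


qc = [[100.0, 200.0, 400.0], [100.0, 200.0, 400.0]]
raw = {"s1": [200.0, 400.0, 800.0], "s2": [50.0, 100.0, 1000.0]}
print(pqn_normalize(raw, qc))
```

Because the dilution factor is a median, a single strongly changed feature (the third feature of `s2` in the example) does not distort the scaling of the rest of the sample.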
6. Filter features and handle missing values
Low-abundance and high-missingness features are filtered using kickoff-agreed thresholds. Missingness is profiled by batch to detect batch effect associated missing values (BEAMs) before imputation (Goh et al., 2023; Reinhold et al., 2019); the strategy is logged in `analysis_parameters.yaml`.
Tools and outputs
Tools used: MetaboAnalystR; custom R filtering scripts
Output: `filtered_matrix.csv`; `missing_value_profile.csv`; missingness heatmap (PDF/SVG)
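Profiling missingness by batch before filtering could be sketched as follows. This is an illustrative Python example, not the delivered R code: missing intensities are represented as `None`, the 0.5 threshold is a placeholder for the kickoff-agreed value, and the `batch_associated` flag is a deliberately simple heuristic (a feature entirely absent in one batch but observed in another) rather than a formal test for BEAMs.

```python
def missingness_by_batch(values, batches):
    """Fraction of missing (None) values per batch for a single feature."""
    counts = {}
    for value, batch in zip(values, batches):
        total, missing = counts.get(batch, (0, 0))
        counts[batch] = (total + 1, missing + (value is None))
    return {batch: missing / total for batch, (total, missing) in counts.items()}


def profile_features(matrix, batches, max_missing=0.5):
    """Summarize missingness per feature; max_missing is a placeholder threshold."""
    report = {}
    for feature_id, values in matrix.items():
        per_batch = missingness_by_batch(values, batches)
        overall = sum(v is None for v in values) / len(values)
        fractions = per_batch.values()
        report[feature_id] = {
            "overall": overall,
            "per_batch": per_batch,
            # Simple heuristic: absent in at least one batch, observed in another.
            "batch_associated": max(fractions) == 1.0 and min(fractions) < 1.0,
            "keep": overall <= max_missing,
        }
    return report


batches = ["B1", "B1", "B2", "B2"]
matrix = {"F1": [1.0, 2.0, None, None], "F2": [1.0, None, 2.0, 3.0]}
print(profile_features(matrix, batches)["F1"])
```

Keeping the per-batch fractions in the report, rather than only the overall rate, is what makes batch-linked gaps visible before any imputation is chosen.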
7. Test differential abundance and multivariate structure
limma linear models fit the agreed design matrix with Benjamini–Hochberg FDR control (Ritchie et al., 2015). PLS-DA or OPLS-DA with cross-validated metrics is run when scoped for classification-oriented questions (Thévenot et al., 2015; Eriksson et al., 2008).
Tools and outputs
Tools used: limma; ropls
Output: `diff_results_<contrast>.csv` (`feature_id`, `log2FC`, `AveExpr`, `t`, `P.Value`, `adj.P.Val`); PLS-DA/OPLS-DA scores and VIP tables when scoped
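The `adj.P.Val` column comes from the Benjamini–Hochberg procedure. The Python sketch below shows the step-up adjustment on a plain list of p-values; it is meant only to make the arithmetic concrete and corresponds to what R's `p.adjust(method = "BH")` computes. The function name is chosen for this example.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum ensures that a feature with a smaller raw p-value never ends up with a larger adjusted value than one ranked after it.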
8. Annotate features and run pathway enrichment
Features and their spectra are matched against HMDB 5.0, METLIN, KEGG, LipidMaps, and GNPS, and each match is assigned an Alseekh-aligned confidence tier (Alseekh et al., 2021). Pathway over-representation analysis or metabolite set enrichment analysis (MSEA) is run on annotated features (Pang et al., 2024).
Tools and outputs
Tools used: MS-DIAL; MetaboAnalystR; SIRIUS/CSI:FingerID or GNPS molecular networking on request
Output: `annotation_table.csv` (`feature_id`, `mz`, `rt`, `compound_name`, `HMDB_ID`, `KEGG_ID`, `annotation_level`, `MS2_match_score`); `pathway_enrichment.csv`
9. Generate figure-ready outputs
Volcano plots, top-feature heatmaps, and pathway enrichment dot plots are exported at publication resolution. Annotation tier counts and QC summary statistics are compiled for the Methods draft (Alseekh et al., 2021).
Tools and outputs
Tools used: MetaboAnalystR; custom R visualization scripts
Output: PDF/SVG figure bundle; `figure_manifest.csv`
10. Package scripts, QC report, and Methods draft
Commented R scripts, environment lock files, README, and Methods draft are bundled with all processed tables and figures. Post-delivery reviewer support scope is documented (Alseekh et al., 2021; Mosley et al., 2024).
Tools and outputs
Tools used: `renv` or conda environment export; custom packaging scripts
Output: Final deliverable bundle; README; Methods draft; HTML QC report

What Pepkio delivers

Processed data

Raw and normalized feature matrices (`.csv`, `.tsv`)
`annotation_table.csv`; `diff_results_<contrast>.csv`
`analysis_parameters.yaml` documenting preprocessing, normalization, and imputation choices

Figures (PDF/SVG)

TIC/BPC overlays; pooled-QC drift plot
PCA scores plot, correlation heatmap, missingness heatmap
Volcano plots, top-feature heatmap, PLS-DA/OPLS-DA scores when scoped
Pathway enrichment dot plot

Tables

`sample_manifest.csv`, `raw_qc_summary.csv`, `missing_value_profile.csv`
`feature_filter_log.csv`, `pathway_enrichment.csv`, `figure_manifest.csv`

Code

Commented R scripts per stage with `renv.lock` or conda export
Delivery via private Git repository or agreed file transfer

Documentation

HTML QC report with thresholds and annotation tier counts
README with reproduction instructions
Methods draft citing software versions and preprocessing parameters
Post-delivery reviewer support within agreed scope (typically ≤20% of project scope)

Technical decisions we make — and why

Preprocessing tool: MS-DIAL is often selected for MS/MS-rich data, xcms for R-native reproducible pipelines, and MZmine for GNPS networking workflows. Only approximately 8% of features overlapped across four workflows on a shared LC-HRMS dataset, so the tool is locked at kickoff (Aigensberger et al., 2025; Tsugawa et al., 2015; Smith et al., 2006).
Normalization: QC-RLSC or PQN when pooled QC is available; total-sum scaling when not. Pooled QC samples monitor run-order drift (Reinhold et al., 2019; Alseekh et al., 2021).
Missing values: profile BEAMs before imputation; complete-case analysis as a fallback when needed. Imputation before batch correction can inflate false positives when missingness correlates with batch (Goh et al., 2023; Reinhold et al., 2019).
Differential testing: limma with Benjamini–Hochberg FDR; ropls PLS-DA/OPLS-DA when scoped. limma handles complex designs with covariates (Ritchie et al., 2015); PLS-DA complements univariate testing for classification (Thévenot et al., 2015).
Annotation tiers: Alseekh-aligned levels from MS/MS-confirmed to putative formula. MS1-only matches carry higher false-discovery risk than MS/MS-confirmed identifications; tier counts are reported in the QC summary (Alseekh et al., 2021).

Common questions

What is the minimum sample size and replicate count for untargeted metabolomics?

Multiple biological replicates per condition—often three or more—are recommended to estimate feature-level variance for limma testing (Reinhold et al., 2019). Pepkio can analyze smaller designs but documents reduced power in the QC report. Pooled QC injections should bracket sample runs when feasible; sample-size targets are confirmed at kickoff based on contrast design and expected effect sizes.

Can you analyze poor-quality or high-missingness samples?

Yes, with caveats documented in the QC report. Runs with unstable TIC, poor mass accuracy, or high blank contamination are flagged before differential testing. Samples with extreme missingness or outlier positions in PCA are discussed with you; re-measurement is recommended when QC threatens the study question. Imputation strategy is adjusted based on missingness patterns rather than applied blindly (Reinhold et al., 2019; Goh et al., 2023).

Do you support Thermo Orbitrap, SCIEX, Agilent, Waters, and Bruker LC-MS data?

Yes, for formats we can load after kickoff review. Thermo `.raw`, SCIEX `.wiff`, Agilent `.d`, Bruker `.d`, and Waters `.raw` are processed in xcms or MS-DIAL when scoped; because xcms reads open formats such as mzML, vendor files are first converted with msconvert for that route (Chambers et al., 2012). Feasibility depends on file format, acquisition mode, and metadata completeness. The Methods draft records preprocessing tool, m/z tolerance, and RT alignment parameters.

How long does untargeted metabolomics analysis take at Pepkio?

A standard project (roughly 10–40 samples, one primary contrast, LC-MS peak picking, normalization, annotation, and pathway enrichment) typically completes in 2–4 weeks from data receipt. GC-MS extensions, multi-contrast designs, large batch-correction studies, or cohorts exceeding 60 samples may require 4–8 weeks. Milestone check-ins occur during the project; exact timelines are confirmed at kickoff.

How do you handle batch effects and run-order drift in metabolomics data?

When batch is known and not fully confounded with condition, Pepkio includes batch in the limma design formula for testing; `removeBatchEffect` is applied for visualization such as PCA and heatmaps rather than before linear modeling, in line with the limma documentation (Ritchie et al., 2015). QC-RLSC corrects run-order drift using pooled QC samples when available, and PQN adjusts for sample-wise dilution differences (Reinhold et al., 2019). PCA and correlation heatmaps are reviewed before modeling. Batch effect associated missing values (BEAMs) are profiled before imputation to avoid confounding batch correction with missing-value handling (Goh et al., 2023). Fully confounded batch-by-condition designs cannot be corrected statistically and are flagged at kickoff.

Do I own the code — and in what format is it delivered?

Yes — you retain full ownership of all code, scripts, and results delivered under the project agreement. Pepkio provides commented R scripts with `renv.lock` or conda export files so you can rerun analyses when the execution environment matches the pinned setup. Matrices use standard `.csv` and `.tsv` formats readable in R, Python, or Excel; R Markdown delivery is available on request.

Can I be involved during analysis?

Yes. Checkpoint reviews occur after raw spectral QC, after normalization and batch diagnostics, and before final delivery. You can review contrast definitions, normalization method, imputation thresholds, and annotation cutoffs within the agreed project scope. A dedicated scientific contact leads the project and incorporates your matrix-specific knowledge into annotation and interpretation decisions.

What does post-delivery reviewer support include?

Support covers clarification of preprocessing methods, QC thresholds, normalization choices, and minor figure or table revisions within agreed scope (typically ≤20% of project scope). Pepkio drafts Methods and Supplementary text for analyses we performed. Substantial new analyses requested by reviewers—additional contrasts, alternate preprocessing, or cross-cohort extensions—are scoped as separate milestones with updated deliverables and timeline.

Is co-authorship required?

No. Pepkio operates as a fee-for-service provider and does not require co-authorship unless explicitly discussed in advance. Standard practice is acknowledgment of bioinformatics support in the Acknowledgments section. Co-authorship is considered only when Pepkio scientists make substantial intellectual contributions beyond routine analysis execution agreed in the project scope.

Should I use xcms, MS-DIAL, or MZmine for my dataset?

MS-DIAL suits MS/MS-rich untargeted data with built-in deconvolution and annotation (Tsugawa et al., 2015). xcms suits R-native, fully scriptable pipelines with Bioconductor integration (Smith et al., 2006; Pang et al., 2024). MZmine suits workflows requiring GNPS molecular networking or modular lipidomics steps (Heuckeroth et al., 2024). Because only ~8% of features overlapped across four workflows on a benchmark dataset, Pepkio locks the tool at kickoff (Aigensberger et al., 2025).

How are metabolites annotated when MS/MS spectra are unavailable?

MS1-only features are matched by accurate mass and RT against HMDB 5.0, METLIN, and KEGG, assigned putative annotation tiers per Alseekh et al. (2021), and reported separately from MS/MS-confirmed identifications. Putative matches carry higher false-discovery risk; the QC report lists confirmed versus putative counts. SIRIUS/CSI:FingerID or GNPS networking can be scoped when MS/MS data or public spectral matches are available (Alseekh et al., 2021; Pang et al., 2024).
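Accurate-mass matching for MS1-only features reduces to a parts-per-million (ppm) comparison between the observed m/z and the theoretical m/z of a candidate ion. The Python sketch below is illustrative only. It assumes positive mode with the [M+H]+ adduct, a 5 ppm tolerance, and a tiny hand-written library; in practice the masses come from the reference databases, several adducts are considered, and RT is checked as well. The function names are hypothetical.

```python
PROTON_MASS = 1.007276  # Da, added to the neutral monoisotopic mass for [M+H]+

# Illustrative mini-library of neutral monoisotopic masses (Da).
EXAMPLE_LIBRARY = {
    "D-Glucose": 180.06339,
    "L-Tryptophan": 204.08988,
}


def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def match_ms1(observed_mz, library, tolerance_ppm=5.0):
    """Return (name, ppm error) candidates within tolerance, best match first."""
    hits = []
    for name, mono_mass in library.items():
        theoretical_mz = mono_mass + PROTON_MASS  # [M+H]+ only in this sketch
        error = ppm_error(observed_mz, theoretical_mz)
        if abs(error) <= tolerance_ppm:
            hits.append((name, round(error, 2)))
    return sorted(hits, key=lambda hit: abs(hit[1]))


print(match_ms1(205.0972, EXAMPLE_LIBRARY))
```

A match found this way is only a putative annotation: isomers share the same exact mass, which is why such hits are reported in a separate tier from MS/MS-confirmed identifications.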

Can you accept pre-processed peak tables from vendor software?

Yes, when export parameters are documented. Peak tables from MS-DIAL, MZmine, Compound Discoverer, or Progenesis QI are accepted with m/z, RT, and intensity columns plus export settings. Pepkio validates feature counts, missingness, and sample ID alignment before modeling. Re-processing from raw files is recommended when export parameters are unknown or inconsistent across batches.

Related services

Multi-omics integration — Joint modeling of metabolomics with transcriptomics, proteomics, or metagenomics on matched sample IDs to identify cross-layer pathway drivers.
DDA/DIA proteomics — Protein–metabolite concordance when matched proteomics and metabolomics data share sample IDs.
Bulk RNA-seq — Pathway-level comparison between gene expression and metabolite abundance on the same cohort.
Shotgun metagenomics — Microbiome functional context for fecal or gut metabolomics studies.
Machine learning — Classifier development and feature selection on metabolite panels identified by untargeted profiling.

References

Alseekh S, Aharoni A, Brotman Y, et al. Mass spectrometry-based metabolomics: a guide for annotation, quantification and best reporting practices. Nature Methods. 2021;18(7):747–756. https://doi.org/10.1038/s41592-021-01197-1 (PMID: 34239102)
Smith CA, Want EJ, O'Maille G, Abagyan R, Siuzdak G. XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Analytical Chemistry. 2006;78(3):779–787. https://doi.org/10.1021/ac051437y (PMID: 16448051)
Tsugawa H, Cajka T, Kind T, et al. MS-DIAL: data-independent MS/MS deconvolution for comprehensive metabolome analysis. Nature Methods. 2015;12(6):523–526. https://doi.org/10.1038/nmeth.3393 (PMID: 25938372)
Heuckeroth S, Damiani T, Smirnov A, et al. Reproducible mass spectrometry data processing and compound annotation in MZmine 3. Nature Protocols. 2024;19(9):2597–2641. https://doi.org/10.1038/s41596-024-00996-y (PMID: 38769143)
Pang Z, Xu L, Viau C, et al. MetaboAnalystR 4.0: a unified LC-MS workflow for global metabolomics. Nature Communications. 2024;15:3675. https://doi.org/10.1038/s41467-024-48009-6 (PMID: 38693118)
Reinhold D, Pielke-Lombardo H, Jacobson S, Ghosh D, Kechris K. Pre-analytic considerations for mass spectrometry based untargeted metabolomics data. Methods in Molecular Biology. 2019;1978:323–340. https://doi.org/10.1007/978-1-4939-9236-2_20 (PMID: 31119672)
Wishart DS, Guo AC, Oler E, et al. HMDB 5.0: the Human Metabolome Database for 2022. Nucleic Acids Research. 2022;50(D1):D622–D631. https://doi.org/10.1093/nar/gkab1062 (PMID: 34986597)
Aigensberger M, Bueschl C, Castillo-Lopez E, et al. Modular comparison of untargeted metabolomics processing steps. Analytica Chimica Acta. 2025;1336:343491. https://doi.org/10.1016/j.aca.2024.343491 (PMID: 39788662)
Lloyd-Price J, Arze C, Ananthakrishnan AN, et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature. 2019;569(7758):655–662. https://doi.org/10.1038/s41586-019-1237-9 (PMID: 31142855)
Salihovic S, Broeckling CD, Ganna A, et al. Non-targeted urine metabolomics and associations with prevalent and incident type 2 diabetes. Scientific Reports. 2020;10:16474. https://doi.org/10.1038/s41598-020-72456-y (PMID: 33020500)
Clark DJ, Dhanasekaran SM, Petralia F, et al. Integrated proteogenomic characterization of clear cell renal cell carcinoma. Cell. 2019;179(4):964–983.e31. https://doi.org/10.1016/j.cell.2019.10.007 (PMID: 31675502)
Goh WWB, Hui HWH, Wong L. How missing value imputation is confounded with batch effects and what you can do about it. Drug Discovery Today. 2023;28(9):103661. https://doi.org/10.1016/j.drudis.2023.103661 (PMID: 37301250)
Mosley JD, Schock TB, Beecher CW, et al. Establishing a framework for best practices for quality assurance and quality control in untargeted metabolomics. Metabolomics. 2024;20:20. https://doi.org/10.1007/s11306-023-02080-0 (PMID: 38345679)
Ritchie ME, Phipson B, Wu D, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research. 2015;43(7):e47. https://doi.org/10.1093/nar/gkv007 (PMID: 25605792)
Thévenot EA, Roux A, Xu Y, et al. Analysis of the human adult urinary metabolome variations with age, body mass index, and gender by implementing a comprehensive workflow for univariate and OPLS statistical analyses. Journal of Proteome Research. 2015;14(8):3322–3335. https://doi.org/10.1021/acs.jproteome.5b00354 (PMID: 26088811)
Eriksson L, Trygg J, Wold S. CV-ANOVA for significance testing of PLS and OPLS models. Journal of Chemometrics. 2008;22(11–12):594–600. https://doi.org/10.1002/cem.1187
Chambers MC, Maclean B, Burke R, et al. A cross-platform toolkit for mass spectrometry and proteomics. Nature Biotechnology. 2012;30(10):918–920. https://doi.org/10.1038/nbt.2377 (PMID: 23051804)

Let's Talk About Your Science

Tell us:

• Your biological question
• Data type and size
• Timeline constraints

We'll tell you:

• What's feasible
• How long it will take
• Exactly what it will cost
