Multi-Omics Integration Analysis Service — Matched-Sample Factor Modeling Across Metabolomics, Transcriptomics, and Proteomics

Multi-omics integration jointly models matched metabolomics, transcriptomics, proteomics, or metagenomics matrices to identify cross-layer drivers and pathway concordance (Hasin et al., 2017). Pepkio delivers version-pinned MOFA2 or DIABLO workflows with documented R scripts, factor-weight tables, and a Methods draft; custom inputs, outputs, and non-standard designs are scoped at kickoff. The service is offered to academic, biotech, and pharma clients. MOFA2 requires more than 15 matched samples (MOFA2 developers, n.d.).

Key facts

Key facts about Multi-Omics Integration
Fact	Value
Supported platforms / instruments	Metabolomics: Thermo Orbitrap, SCIEX TripleTOF, Agilent, Bruker, Waters LC-MS (raw or peak tables). Transcriptomics: Illumina bulk RNA-seq, microarray. Proteomics: MaxQuant, DIA-NN, or MSstats protein matrices. Metagenomics: shotgun or 16S functional/taxonomic tables on request
Input requirements	>15 matched samples with overlapping omics layers for MOFA2 (MOFA2 developers, n.d.); ≥3 biological replicates per condition recommended for supervised DIABLO contrasts; consistent sample_id across matrices; missing omics per sample documented in manifest
Reference builds supported	Human GRCh38 (GENCODE v44 / Ensembl 110); mouse GRCm39 (GENCODE vM33 / Ensembl 110); UniProt Swiss-Prot reviewed proteomes; HMDB 5.0 for metabolite annotation
Primary tools (with versions)	MOFA2 1.12; mixOmics (DIABLO) 6.28; limma 3.60; MetaboAnalystR 4.0 (Pang et al., 2024); xcms 4.2.3; WGCNA and pairwise correlation modules on request — pinned per project
Typical turnaround time	4–8 weeks (two to three matched omics layers, one cohort); longer for raw preprocessing of all layers or bespoke ML extensions — confirmed at kickoff
Deliverable formats	Factor score and weight tables (.csv, .rds); cross-omics correlation matrices; PDF/SVG factor, heatmap, and network plots; HTML QC report; commented R scripts with renv lock; Methods draft
Key cited best-practice reference	Hasin et al. (2017), Genome Biology; Argelaguet et al. (2018, 2020); Singh et al. (2019), Bioinformatics
Custom / bespoke analysis	Non-standard omics blocks, integration methods, output formats, or analyses beyond standard MOFA2/DIABLO scoped at kickoff—e.g., metagenomics functional profiles, spatial layers, or client-specified ML models

What is multi-omics integration?

Multi-omics integration statistically couples two or more omics feature matrices measured on shared samples—learning latent factors or sparse components that explain coordinated variation across layers (Argelaguet et al., 2018). Unlike single-layer differential testing followed by pathway overlay, integration models cross-layer covariance so metabolite, transcript, and protein shifts are interpreted together (Wörheide et al., 2020). Lloyd-Price et al. (2019) profiled 132 inflammatory bowel disease subjects with metagenomics, metabolomics, and host transcriptomics, linking microbial and host molecular shifts to disease activity. Pepkio starts from matched matrices or scopes per-layer preprocessing first; custom block definitions are agreed at kickoff. See the multi-omics integration glossary.

When should you use multi-omics integration?

Multi-omics integration fits when matched samples span two or more molecular layers and the question requires cross-layer drivers—not when only one layer is available or sample IDs do not align (Hasin et al., 2017).

Comparison of multi-omics integration, single-omics pathway overlay, and correlation-only approaches
Approach	Best for	Limitations	Approximate cost range
Multi-omics integration (MOFA2 / DIABLO)	Matched samples; cross-layer biomarker panels; unsupervised disease subtyping or supervised outcome prediction	Requires aligned sample IDs; MOFA2 needs >15 samples; per-layer QC and normalization before integration	Quote-based; higher bioinformatics effort than single-omics (Hasin et al., 2017; Krassowski et al., 2020)
Single-omics analysis + pathway overlay	Exploratory interpretation when only one layer is measured	No statistical coupling across layers; concordance is post hoc	Lower per-project bioinformatics cost
Correlation or WGCNA-only integration	Hypothesis-light screening across two layers	No shared latent factor structure; multiple-testing burden at high dimensionality	Moderate; scoped separately when MOFA/DIABLO is not justified

Published examples of multi-omics integration:
Inflammatory bowel disease: Lloyd-Price et al. (2019) linked longitudinal metagenomic, metabolomic, and host transcriptomic shifts to IBD activity in 132 subjects.
Clear cell renal cell carcinoma: Clark et al. (2019) integrated proteogenomic data from 103 ccRCC tumors and identified protein-level oxidative phosphorylation dysregulation alongside genomic and transcriptomic alterations.
Breast cancer metabolism: Katzir et al. (2019) integrated transcriptomics, proteomics, phosphoproteomics, and fluxomics in MCF7 cells to map tiered regulation of metabolic enzymes across regulatory layers.

How the analysis works — step by step

1. Scope study design and verify matched sample IDs
Pepkio confirms the biological question, omics blocks in scope, outcome labels (if supervised DIABLO is planned), and that sample_id values match across layers. Samples missing entire omics blocks are flagged; partial missingness within a block is documented for MOFA2 (MOFA2 developers, n.d.). Confounded batch-by-condition designs are flagged before preprocessing (Hasin et al., 2017).
Tools and outputs
Tools used: Custom validation scripts
Output: integration_manifest.csv with sample IDs, omics availability, condition, batch, and design notes
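As an illustration only, the matched-ID check could be sketched in Python as follows. The delivered scripts are in R; the function names, layer names, and sample IDs here are hypothetical, and the real manifest also carries condition, batch, and design notes.

```python
def build_manifest(layers):
    """Return one row per sample with a True/False availability flag per layer."""
    all_ids = sorted(set().union(*layers.values()))
    rows = []
    for sample_id in all_ids:
        row = {"sample_id": sample_id}
        for name, ids in layers.items():
            row[name] = sample_id in ids
        row["n_layers"] = sum(row[name] for name in layers)
        rows.append(row)
    return rows


def fully_matched(rows, layer_names):
    """Sample IDs that are present in every layer."""
    return [r["sample_id"] for r in rows if all(r[n] for n in layer_names)]


# Hypothetical sample-ID sets, as they might be read from each matrix header.
layers = {
    "metabolomics": {"S01", "S02", "S03", "S04"},
    "transcriptomics": {"S01", "S02", "S03"},
    "proteomics": {"S01", "S03", "S04", "S05"},
}
manifest = build_manifest(layers)
matched = fully_matched(manifest, list(layers))
print(matched)            # ['S01', 'S03']
# The MOFA2 guidance cited above asks for more than 15 matched samples.
print(len(matched) > 15)  # False
```

Samples with fewer layers than the total (here S02, S04, S05) are the ones that would be flagged for discussion before preprocessing.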
2. Validate per-omics inputs and metadata
Each layer is audited for orientation (samples × features), missing-value fraction, and metadata completeness. Metabolomics total ion current (TIC), RNA-seq library complexity, and proteomics detection rates are summarized before normalization (Reinhold et al., 2019; Alseekh et al., 2021).
Tools and outputs
Tools used: MetaboAnalystR 4.0 QC module (Pang et al., 2024); custom R diagnostics
Output: sample_qc_summary.csv; per-layer QC flags and exclusion log
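A minimal Python sketch of the missing-value audit, assuming a samples × features layout. The 0.5 exclusion threshold and the function names are hypothetical placeholders, not Pepkio's fixed criteria.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_fraction_per_sample(matrix):
    """matrix: dict mapping sample_id to its list of feature values."""
    return {
        sample_id: sum(is_missing(v) for v in values) / len(values)
        for sample_id, values in matrix.items()
    }


def flag_samples(matrix, max_missing=0.5):
    """Sample IDs whose missing fraction exceeds the (assumed) threshold."""
    fractions = missing_fraction_per_sample(matrix)
    return sorted(s for s, frac in fractions.items() if frac > max_missing)


# Hypothetical protein intensities for three samples and four features.
protein = {
    "S01": [10.2, 11.5, None, 9.8],
    "S02": [None, None, None, 9.1],
    "S03": [10.0, float("nan"), 8.7, 9.5],
}
print(missing_fraction_per_sample(protein))  # {'S01': 0.25, 'S02': 0.75, 'S03': 0.25}
print(flag_samples(protein))                 # ['S02']
```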
3. Preprocess metabolomics layer
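When raw LC-MS is in scope, Pepkio aligns peaks, normalizes with quality-control-based robust LOESS signal correction (QC-RLSC) or probabilistic quotient normalization (PQN), and filters low-abundance features (Smith et al., 2006; Reinhold et al., 2019). Client-supplied peak tables are accepted when their preprocessing is documented.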
Tools and outputs
Tools used: xcms 4.2.3; MetaboAnalystR 4.0 (Pang et al., 2024)
Output: metabolite_matrix_normalized.csv; metabolomics PCA and missingness plots
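To make the PQN step concrete, here is a simplified Python sketch. It assumes complete, positive intensities and uses the per-feature median across all samples as the reference spectrum; pooled-QC references, missing values, and QC-RLSC drift correction are not covered.

```python
from statistics import median


def pqn_normalize(matrix):
    """matrix: list of samples, each a list of positive feature intensities."""
    n_features = len(matrix[0])
    # Reference spectrum: median intensity of each feature across samples.
    reference = [median(sample[j] for sample in matrix) for j in range(n_features)]
    normalized, factors = [], []
    for sample in matrix:
        quotients = [x / ref for x, ref in zip(sample, reference) if ref > 0]
        factor = median(quotients)  # most probable dilution factor
        factors.append(factor)
        normalized.append([x / factor for x in sample])
    return normalized, factors


# Three hypothetical samples with the same profile at different dilutions.
intensities = [
    [100.0, 200.0, 300.0],
    [200.0, 400.0, 600.0],
    [50.0, 100.0, 150.0],
]
normalized, factors = pqn_normalize(intensities)
print(factors)        # [1.0, 2.0, 0.5]
print(normalized[1])  # [100.0, 200.0, 300.0]
```

Because the three samples differ only by dilution, all of them collapse onto the same profile after normalization.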
4. Preprocess transcriptomics layer
Count matrices undergo size-factor normalization and a variance-stabilizing transformation (VST) before integration; microarray data are log-transformed and checked for batch effects (Ritchie et al., 2015; MOFA2 developers, n.d.). Highly variable gene (HVG) filtering follows MOFA2 guidance when RNA is a primary view (MOFA2 developers, n.d.).
Tools and outputs
Tools used: limma 3.60
Output: transcript_matrix_vst.csv; hvg_transcript_list.csv
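The sketch below illustrates the idea in Python under stated assumptions: size factors are computed with the median-of-ratios approach (one common choice, not confirmed by the text above), a plain log2(x + 1) stands in for the VST, and features are ranked by variance. Gene IDs and counts are invented.

```python
import math
from statistics import median, pvariance


def size_factors(counts):
    """counts: list of genes, each a list of raw counts (one per sample)."""
    n_samples = len(counts[0])
    # Genes with any zero count are skipped so the geometric mean is defined.
    usable = [gene for gene in counts if all(c > 0 for c in gene)]
    log_geo_means = [sum(math.log(c) for c in gene) / n_samples for gene in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(gene[j]) - m for gene, m in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def log_normalize(counts, factors):
    """Divide by size factors, then log2(x + 1); a simple stand-in for a VST."""
    return [[math.log2(c / f + 1.0) for c, f in zip(gene, factors)] for gene in counts]


def top_variable(log_expr, gene_ids, n_top):
    """Gene IDs ranked by variance across samples, highest first."""
    ranked = sorted(zip(gene_ids, log_expr), key=lambda pair: pvariance(pair[1]), reverse=True)
    return [gene_id for gene_id, _ in ranked[:n_top]]


gene_ids = ["g1", "g2", "g3", "g4"]
counts = [[10, 20], [100, 200], [5, 10], [0, 3]]  # sample 2 is sequenced twice as deep
factors = size_factors(counts)
log_expr = log_normalize(counts, factors)
print([round(f, 3) for f in factors])             # [0.707, 1.414]
print(top_variable(log_expr, gene_ids, n_top=1))  # ['g4']
```

After depth correction, g1 to g3 are flat across the two samples, so only g4 carries variance worth keeping.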
5. Preprocess proteomics layer
Protein matrices from MaxQuant, DIA-NN, or MSstats exports are log-transformed, filtered by detection rate, and missing-value patterns are profiled before imputation decisions (Alseekh et al., 2021). Low-detection proteins are removed with thresholds logged in the QC report.
Tools and outputs
Tools used: Perseus-style filtering scripts; limma 3.60
Output: protein_matrix_log2.csv; detection-rate and missingness heatmaps
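A minimal Python sketch of the detection-rate filter and log2 transform. The 0.7 detection threshold, the function name, and the protein IDs are hypothetical; actual thresholds are project-specific and logged in the QC report.

```python
import math


def log2_and_filter(matrix, protein_ids, min_detection=0.7):
    """matrix: list of proteins, each a list of intensities (None or 0 = not detected)."""
    kept, log_rows, removed = [], [], []
    for protein_id, row in zip(protein_ids, matrix):
        detected = [v for v in row if v is not None and v > 0]
        rate = len(detected) / len(row)
        if rate >= min_detection:
            kept.append(protein_id)
            # Undetected values stay missing; no imputation happens here.
            log_rows.append([math.log2(v) if v is not None and v > 0 else None for v in row])
        else:
            removed.append((protein_id, rate))
    return kept, log_rows, removed


protein_ids = ["P1", "P2"]
matrix = [
    [1024.0, 2048.0, 512.0, None],  # detected in 3 of 4 samples
    [None, None, 256.0, 0],         # detected in 1 of 4 samples
]
kept, log_rows, removed = log2_and_filter(matrix, protein_ids)
print(kept)      # ['P1']
print(log_rows)  # [[10.0, 11.0, 9.0, None]]
print(removed)   # [('P2', 0.25)]
```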
6. Regress batch effects per layer
When batch is not confounded with condition, Pepkio regresses batch per omics block with limma removeBatchEffect before integration. Batch is not passed as a MOFA covariate, which the MOFA2 developers advise against (MOFA2 developers, n.d.). PCA before and after correction is used to check that batch-driven clustering is reduced while condition separation is preserved.
Tools and outputs
Tools used: limma 3.60
Output: Batch-corrected matrices; PCA plots colored by batch and condition
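The intuition behind per-layer batch regression can be shown with a deliberately simplified Python sketch: each feature is centered within its batch and shifted back to the overall mean. This is not limma's implementation, it does not protect condition through a design matrix, and it is only reasonable when conditions are balanced across batches, as in the invented example.

```python
from statistics import mean


def center_batches(values, batches):
    """values: one feature across samples; batches: batch label per sample."""
    grand = mean(values)
    batch_means = {
        b: mean(v for v, label in zip(values, batches) if label == b)
        for b in set(batches)
    }
    # Remove each batch's offset, then restore the overall level.
    return [v - batch_means[b] + grand for v, b in zip(values, batches)]


values = [1.0, 2.0, 5.0, 6.0]
batches = ["A", "A", "B", "B"]
conditions = ["ctrl", "case", "ctrl", "case"]  # balanced within each batch
corrected = center_batches(values, batches)
print(corrected)  # [3.0, 4.0, 3.0, 4.0]
```

The batch offset of 4 units disappears, while the case-versus-control difference of 1 unit is unchanged.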
7. Harmonize feature sets and align sample order
Each view is filtered to its most variable features so that feature counts are comparable across views and large modalities do not dominate the factors (MOFA2 developers, n.d.). Sample order and column names are synchronized across blocks; samples failing per-layer QC are excluded consistently from all views.
Tools and outputs
Tools used: Custom R harmonization scripts
Output: Aligned multi-block input list (.rds); feature_filter_log.csv
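A small Python sketch of the harmonization logic, assuming a nested-dictionary layout chosen purely for illustration. For simplicity it keeps only samples present in every view; MOFA2 itself can also accept samples that lack a view, so this is a stricter variant.

```python
from statistics import pvariance


def harmonize(views, excluded, n_top):
    """views: dict view -> dict feature -> dict sample_id -> value."""
    sample_sets = [
        set.intersection(*(set(values) for values in features.values()))
        for features in views.values()
    ]
    # Shared samples, minus QC failures, in one fixed order for every view.
    shared = sorted(set.intersection(*sample_sets) - set(excluded))
    aligned = {}
    for view, features in views.items():
        ranked = sorted(
            features,
            key=lambda f: pvariance([features[f][s] for s in shared]),
            reverse=True,
        )
        aligned[view] = {f: [features[f][s] for s in shared] for f in ranked[:n_top]}
    return shared, aligned


views = {
    "rna": {
        "g1": {"S1": 1.0, "S2": 1.0, "S3": 1.0},
        "g2": {"S1": 0.0, "S2": 4.0, "S3": 8.0},
    },
    "met": {
        "m1": {"S1": 2.0, "S2": 3.0, "S3": 9.0},
        "m2": {"S1": 5.0, "S2": 5.5, "S3": 5.0},
    },
}
shared, aligned = harmonize(views, excluded=["S3"], n_top=1)
print(shared)   # ['S1', 'S2']
print(aligned)  # {'rna': {'g2': [0.0, 4.0]}, 'met': {'m1': [2.0, 3.0]}}
```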
8. Select and train integration model
Unsupervised discovery uses MOFA2; supervised outcome prediction with labeled cohorts uses mixOmics DIABLO block sPLS-DA (Argelaguet et al., 2018; Singh et al., 2019). Factor count and DIABLO component sparsity are tuned with cross-validation when sample size permits. Concatenation-only baselines are not used as the primary deliverable (Singh et al., 2019).
Tools and outputs
Tools used: MOFA2 1.12; mixOmics (DIABLO) 6.28
Output: MOFA2_model.hdf5 or DIABLO object (.rds); variance-explained and cross-validation metrics
9. Interpret factors and cross-omics weights
Factor scores are correlated with metadata covariates; top-weighted features per view are exported for biological interpretation (Argelaguet et al., 2018). Cross-omics feature–feature correlations within significant factors are computed and visualized.
Tools and outputs
Tools used: MOFA2 1.12; mixOmics 6.28; corrplot
Output: factor_scores.csv; factor_weights_<view>.csv; cross-omics correlation heatmaps
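Correlating factor scores with metadata can be sketched in Python as below. The factor scores and the 0/1 encodings of condition and batch are invented for illustration; how categorical covariates are encoded in a real project is an analysis decision.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


factor_1 = [-1.0, -0.5, 0.5, 1.0]  # hypothetical scores for four samples
covariates = {
    "condition": [0, 0, 1, 1],     # 0 = control, 1 = case
    "batch": [0, 1, 0, 1],
}
for name, values in covariates.items():
    print(name, round(pearson(factor_1, values), 3))
# condition 0.949
# batch 0.316
```

In this toy case the factor tracks condition far more strongly than batch, which is the pattern one hopes to see after batch regression.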
10. Assess pathway concordance and package deliverables
Metabolite-set and gene-set enrichment analyses test pathway overlap across layers; concordant and discordant pathways are tabulated (Wörheide et al., 2020). Final scripts, figures, QC report, and Methods draft cite exact software versions and preprocessing parameters.
Tools and outputs
Tools used: MetaboAnalystR 4.0 (Pang et al., 2024); clusterProfiler on request
Output: pathway_concordance.csv; final deliverable bundle; Methods draft
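One way to picture the concordance table is the Python sketch below. It assumes a hypergeometric over-representation test per layer and a hypothetical flagging rule (enriched in at least two layers at an uncorrected alpha of 0.05); the actual enrichment method, multiple-testing correction, and flag definition are set per project.

```python
from math import comb


def hypergeom_pvalue(n_universe, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected features are drawn without replacement."""
    upper = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def concordance_rows(pvalues, alpha=0.05):
    """pvalues: dict pathway_id -> dict layer -> enrichment p-value."""
    rows = []
    for pathway_id, by_layer in pvalues.items():
        n_significant = sum(p < alpha for p in by_layer.values())
        if n_significant >= 2:
            flag = "concordant"
        elif n_significant == 1:
            flag = "discordant"
        else:
            flag = "not_enriched"
        for layer, p in by_layer.items():
            rows.append((pathway_id, layer, p, flag))
    return rows


# Hypothetical inputs: 4 of 5 top metabolites fall in a 5-member pathway (universe of 20).
p_met = hypergeom_pvalue(n_universe=20, n_pathway=5, n_selected=5, n_overlap=4)
pvalues = {
    "glycolysis": {"metabolomics": p_met, "transcriptomics": 0.01},
    "tca_cycle": {"metabolomics": 0.20, "transcriptomics": 0.03},
}
for row in concordance_rows(pvalues):
    print(row)
```

The row layout mirrors the pathway_concordance.csv columns listed above: pathway_id, layer, enrichment_pvalue, concordance_flag.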

What Pepkio delivers

Processed data files

factor_scores.csv (sample_id, factor_1…factor_K, condition, batch); factor_weights_<view>.csv
Aligned input matrices per scoped view; cross_omics_correlation.csv; MOFA2_model.hdf5 or DIABLO object (.rds)

Figures (PDF/SVG)

MOFA variance-explained barplot; factor scatter plots colored by condition or batch
Cross-omics weight heatmaps; sample clustering dendrogram; DIABLO sample projection and circos/network plots when supervised integration is scoped
Per-layer PCA before and after batch correction

Tables

integration_manifest.csv; sample_qc_summary.csv; feature_filter_log.csv
factor_top_features_<view>.csv; pathway_concordance.csv (pathway_id, layer, enrichment_pvalue, concordance_flag)

Code

Commented R scripts per pipeline stage; renv.lock or conda environment export
Delivery via private Git repository or agreed file transfer

Documentation

HTML QC report with per-layer exclusion counts and normalization parameters; README with reproduction instructions
Journal-formatted Methods draft; bespoke milestones defined at kickoff
Post-delivery support within agreed scope (typically ≤20% of deliverables)

Technical decisions we make — and why

Integration method (MOFA2 vs DIABLO): MOFA2 for exploratory subtyping without a primary outcome label; DIABLO when a categorical outcome supports supervised block sPLS-DA with cross-validated sparsity (Singh et al., 2019). Correlation-only or WGCNA modules are available on request for simpler two-layer designs.
Early per-layer normalization vs naive concatenation: Each omics block is normalized and filtered independently before late integration. Naive feature concatenation underweights smaller views and mixes incompatible scales (Singh et al., 2019; Wörheide et al., 2020).
Batch correction (limma regression per layer before MOFA2): Technical batch is regressed with limma when not confounded with condition, rather than included as a MOFA covariate, because the MOFA2 developers advise regressing out technical variation before model fitting (MOFA2 developers, n.d.).
Metabolomics normalization (QC-RLSC or PQN): Run-order and matrix effects are corrected with pooled-QC-based methods where QC injections exist (Reinhold et al., 2019). When QC pools are unavailable, total-intensity normalization is used instead and documented in the QC report.
Feature filtering (variance-based top features per view): Uninformative features are filtered so views contribute comparable signal; RNA views use highly variable feature selection after regressing known covariates (MOFA2 developers, n.d.). Aigensberger et al. (2025) report approximately 8% metabolomics feature overlap across four preprocessing pipelines, which motivates locking per-layer parameters at kickoff.

Common questions

What is the minimum number of matched samples for multi-omics integration?

MOFA2 requires more than 15 matched samples with overlapping omics measurements to produce stable latent factors (MOFA2 developers, n.d.). Supervised DIABLO projects benefit from multiple biological replicates per condition—typically ≥3—and larger cohorts improve cross-validation of sparse component selection (Singh et al., 2019). Exact minimums are confirmed at kickoff based on your block count and outcome design.

Can you integrate data if one omics layer has poor quality or missing samples?

Yes, with documented scope limits. Samples failing per-layer QC are excluded consistently across all views. MOFA2 handles missing values within a view by ignoring them in the likelihood rather than imputing them (MOFA2 developers, n.d.). If an entire omics block is unreliable, Pepkio discusses reducing to two-layer integration or re-measurement before training the model.

Which platforms and data formats do you support for each omics layer?

Metabolomics: Thermo .raw, SCIEX .wiff, Agilent .d, Bruker .d, Waters .raw, mzML, or MS-DIAL/MZmine peak tables when documented at kickoff. Transcriptomics: Illumina bulk RNA-seq count matrices or .bam-derived counts; microarray when scoped. Proteomics: MaxQuant proteinGroups.txt, DIA-NN reports, or MSstats matrices. Metagenomics: taxonomic or functional abundance tables on request. Preprocessed .csv/.tsv matrices are accepted when preprocessing parameters and sample IDs are supplied.

How long does multi-omics integration take at Pepkio?

Standard matched-sample projects with two to three preprocessed layers typically complete in 4–8 weeks from data receipt. Projects requiring raw preprocessing of all layers, multiple contrasts, or bespoke ML extensions may take longer—timelines are confirmed at kickoff with milestone check-ins.

How do you handle batch effects across sequencing runs or MS injection batches?

Batch is diagnosed per omics layer with PCA and correlation checks. When batch is not confounded with condition, Pepkio regresses batch with limma removeBatchEffect on each layer before MOFA2 or DIABLO training (Ritchie et al., 2015; MOFA2 developers, n.d.). Fully confounded designs are flagged at kickoff; integration proceeds only when a defensible contrast remains.

Do I own the code—and in what format is it delivered?

Yes—you retain full ownership of all code, scripts, and results. Pepkio delivers commented R scripts with renv.lock files listing exact package versions, organized by pipeline stage with a README. Objects use standard .csv and .rds formats readable in R; private Git delivery or agreed file transfer is available.

Can I be involved during the analysis?

Yes. Checkpoint reviews occur after per-layer QC, after batch diagnostics, and after factor or component interpretation. You can review contrast definitions, normalization choices, factor count, and feature-selection thresholds within agreed scope. A dedicated scientific contact leads the project.

What does post-delivery reviewer support include?

Support covers clarification of integration methods, preprocessing parameters, factor interpretation, and minor figure or table revisions within agreed scope (typically ≤20% of deliverables). Pepkio drafts Methods and Supplementary text for analyses we performed. Substantial new omics layers or contrasts requested by reviewers are scoped as separate milestones.

Is co-authorship required?

No. Pepkio operates as a fee-for-service provider and does not require co-authorship unless explicitly discussed in advance. Standard practice is acknowledgment of bioinformatics support in the Acknowledgments section; co-authorship is considered only when Pepkio scientists make substantial intellectual contributions beyond routine analysis.

Should I use MOFA2 or DIABLO for my project?

Use MOFA2 for unsupervised shared-variation discovery without a primary outcome label (Argelaguet et al., 2018). Use DIABLO when you have a categorical outcome and want a sparse multi-omics biomarker panel; Singh et al. (2019) reported a balanced error rate of 17.9 ± 1.9% for a TCGA breast cancer panel under cross-validation. The Methods draft states which method was used and why.
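For readers unfamiliar with the metric, the balanced error rate is the average of the per-class misclassification rates, so small classes count as much as large ones. A short Python sketch with invented labels (not data from Singh et al.):

```python
def balanced_error_rate(y_true, y_pred):
    """Mean of per-class error rates."""
    per_class = []
    for label in sorted(set(y_true)):
        indices = [i for i, y in enumerate(y_true) if y == label]
        errors = sum(1 for i in indices if y_pred[i] != label)
        per_class.append(errors / len(indices))
    return sum(per_class) / len(per_class)


# Hypothetical imbalanced cohort: 8 samples of subtype A, 2 of subtype B.
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]  # one B sample is misclassified as A
overall_error = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
print(overall_error)                        # 0.1
print(balanced_error_rate(y_true, y_pred))  # 0.25
```

The overall error looks low because subtype A dominates; the balanced rate exposes that half of subtype B was misclassified.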

Can MOFA2 integrate samples that are missing one omics layer?

Yes. MOFA2 ignores missing values in the likelihood without hidden imputation (MOFA2 developers, n.d.). A substantial fraction of samples must share each omics pair, and sample order must align across available blocks. Pepkio documents per-sample omics availability in integration_manifest.csv before training.

Can I supply preprocessed matrices instead of raw data for each layer?

Yes. Pepkio accepts client-supplied normalized matrices when preprocessing parameters, feature identifiers, and sample IDs are documented. We audit orientation, missingness, and batch structure before integration; re-normalization from raw data is scoped separately when QC flags inconsistencies (Aigensberger et al., 2025).

Related services

Untargeted metabolomics — Peak detection, normalization, and annotation for the metabolite layer when raw LC-MS is available.
Bulk RNA-seq — Differential expression and count-matrix preprocessing for the transcript layer.
DDA/DIA proteomics — Protein quantification matrices for the proteomics block on matched sample IDs.
Shotgun metagenomics — Functional and taxonomic profiles as an additional integration block for microbiome–host studies.
Bioinformatics consulting — Matched-sample study design and omics-layer planning before data collection.

References

Hasin Y, Seldin M, Lusis A. Multi-omics approaches to disease. Genome Biology. 2017;18:83. https://doi.org/10.1186/s13059-017-1215-1 (PMID: 28476144)
Krassowski M, Das V, Sahu SK, Misra BB. State of the Field in Multi-Omics Research: From Computational Needs to Data Mining and Sharing. Frontiers in Genetics. 2020;11:610798. https://doi.org/10.3389/fgene.2020.610798 (PMID: 33362867)
Wörheide MA, Krumsiek J, Kastenmüller G, Arnold M. Multi-omics integration in biomedical research – A metabolomics-centric review. Analytica Chimica Acta. 2020;1141:144–162. https://doi.org/10.1016/j.aca.2020.10.038 (PMID: 33248648)
Argelaguet R, Velten B, Arnol D, et al. Multi-Omics Factor Analysis—a framework for unsupervised integration of multi-omics data sets. Molecular Systems Biology. 2018;14(6):e8124. https://doi.org/10.15252/msb.20178124 (PMID: 29925568)
Argelaguet R, Arnol D, Bredikhin D, et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biology. 2020;21:111. https://doi.org/10.1186/s13059-020-02015-1 (PMID: 32393329)
Singh A, Shannon CP, Gautier B, et al. DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics. 2019;35(17):3055–3062. https://doi.org/10.1093/bioinformatics/bty1054 (PMID: 30657866)
Lloyd-Price J, Arze C, Ananthakrishnan AN, et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature. 2019;569(7758):655–662. https://doi.org/10.1038/s41586-019-1237-9 (PMID: 31142855)
Clark DJ, Dhanasekaran SM, Petralia F, et al. Integrated proteogenomic characterization of clear cell renal cell carcinoma. Cell. 2019;179(4):964–983.e31. https://doi.org/10.1016/j.cell.2019.10.007 (PMID: 31675502)
Katzir R, Polat IH, Harel M, et al. The landscape of tiered regulation of breast cancer cell metabolism. Scientific Reports. 2019;9:17760. https://doi.org/10.1038/s41598-019-54221-y (PMID: 31780802)
MOFA2 developers. FAQ — Multi-Omics Factor Analysis. n.d. https://biofam.github.io/MOFA2/faq.html
Alseekh S, Aharoni A, Brotman Y, et al. Mass spectrometry-based metabolomics: a guide for annotation, quantification and best reporting practices. Nature Methods. 2021;18(7):747–756. https://doi.org/10.1038/s41592-021-01197-1 (PMID: 34239102)
Reinhold D, Pielke-Lombardo H, Jacobson S, Ghosh D, Kechris K. Pre-analytic considerations for mass spectrometry based untargeted metabolomics data. Methods in Molecular Biology. 2019;1978:323–340. https://doi.org/10.1007/978-1-4939-9236-2_20 (PMID: 31119672)
Smith CA, Want EJ, O'Maille G, Abagyan R, Siuzdak G. XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Analytical Chemistry. 2006;78(3):779–787. https://doi.org/10.1021/ac051437y (PMID: 16448051)
Pang Z, Xu L, Viau C, et al. MetaboAnalystR 4.0: a unified LC-MS workflow for global metabolomics. Nature Communications. 2024;15:3675. https://doi.org/10.1038/s41467-024-48009-6 (PMID: 38693118)
Ritchie ME, Phipson B, Wu D, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research. 2015;43(7):e47. https://doi.org/10.1093/nar/gkv007 (PMID: 25605792)
Aigensberger M, Bueschl C, Castillo-Lopez E, et al. Modular comparison of untargeted metabolomics processing steps. Analytica Chimica Acta. 2025;1336:343491. https://doi.org/10.1016/j.aca.2024.343491 (PMID: 39788662)

Let's Talk About Your Science

Tell us:

• Your biological question
• Data type and size
• Timeline constraints

We'll tell you:

• What's feasible
• How long it will take
• Exactly what it will cost
