Statistical Analysis Services — STROBE-MR-Aligned Causal Inference and Pre-Study Power Planning

Statistical analysis applies formal inference to test causal exposure–outcome relationships and to check whether an omics study is adequately powered before data collection. Pepkio's statistical analysis service delivers version-pinned Mendelian randomization workflows and prospective experimental-design reports with full R code, sensitivity analyses, and a Methods draft for academic, biotech, and pharma teams. Custom inputs, outputs, and non-standard analyses are scoped at kickoff.

Key facts

Key facts about statistical analysis
Fact	Value
Data types supported	GWAS summary statistics (OpenGWAS IDs or client .tsv/.txt files); exposure and outcome GWAS metadata; pilot RNA-seq count matrices; cohort sample manifests; pre-specified instrument SNP lists; grant or protocol design briefs
Reference builds or standards used	STROBE-MR reporting checklist (Skrivankova et al., 2021); Burgess MR guidelines (2023); GRCh38 for variant harmonization; Conesa et al. (2016) RNA-seq replicate guidance where omics design applies
Primary tools (with versions)	TwoSampleMR 0.7.4; ieugwasr 1.1.0; MendelianRandomization 0.9.2; MR-PRESSO 1.0; PROPER 1.43.0; RNASeqDesign 0.1.0; scPower 1.0.0; RNASeqPower 1.52.0; DESeq2 1.52.0
Typical turnaround range	1–2 weeks (experimental design memo from pilot or public priors); 2–4 weeks (standard two-sample MR); 3–5 weeks (multivariable MR, colocalization extensions, or multi-contrast design optimization)—confirmed at kickoff
Deliverable formats	Harmonized GWAS .tsv; MR result tables (.csv, .xlsx); scatter, funnel, and leave-one-out plots (PDF/SVG); power-vs-sample-size curves; design memo; STROBE-MR mapping document; commented R scripts; Methods draft
Regulatory/reproducibility standards followed	STROBE-MR checklist mapping; Burgess MR sensitivity-analysis recommendations; version-pinned software with sessionInfo() or conda lock files; private Git or Zenodo archival on request
Custom / bespoke analysis	Non-standard MR methods (MR-RAPS, multivariable MR, mediation MR), custom FDR or clumping thresholds, alternative power models, client-specified figure or table formats, and analyses beyond standard two-sample MR—scoped at kickoff

Key terms: Mendelian randomization (MR) uses genetic variants as instrumental variables to estimate causal effects of modifiable exposures. An instrumental variable must associate with the exposure, be independent of confounders, and affect the outcome only through the exposure. The F-statistic measures instrument strength; values below 10 are conventionally treated as weak. False discovery rate (FDR) is the expected proportion of false positives among results declared significant. A negative binomial model captures the overdispersion of RNA-seq counts and is the basis for power simulation.
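
To make the FDR term concrete, here is a minimal Python sketch of the Benjamini-Hochberg adjustment, one common way to control FDR. The function name and the five p-values are hypothetical and chosen only for illustration; delivered analyses use the R packages listed above, not this code.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value is the minimum of its own scaled p-value and those ranked above.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for five tests
p_values = [0.001, 0.008, 0.039, 0.041, 0.20]
q_values = benjamini_hochberg(p_values)
hits = [i for i, q in enumerate(q_values) if q <= 0.05]
print(q_values, hits)
```

With these inputs only the first two tests pass a 5% FDR threshold, even though four of the raw p-values are below 0.05.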

What is statistical analysis?

Statistical analysis applies probability models to quantify uncertainty in biological conclusions. This service focuses on two areas: causal inference from genetic instruments and prospective study design for omics experiments. Unlike observational regression alone, MR exploits the random allocation of alleles at conception to estimate whether a modifiable exposure causally affects a disease outcome (Burgess et al., 2023). Prospective design analysis asks whether a planned RNA-seq or scRNA-seq study has sufficient power before money is spent on sequencing (Jeon et al., 2023). Adoption is substantial: the NHGRI-EBI GWAS Catalog hosts 176,855 full summary-statistics datasets across 7,714 publications (GWAS Catalog, 2026), and MR submissions to European Journal of Epidemiology rose from 3.1% of all submissions in 2020 to 13.0% in 2024 (Hemani et al., 2025).

What statistical analysis can answer

Published examples of biological questions statistical analysis can address:

Does LDL cholesterol causally increase coronary heart disease risk? Holmes et al. (2015) applied weighted allele scores in 17 studies (62,199 participants; 12,099 CHD events) and found both unrestricted and restricted LDL-C instrument sets associated with CHD, supporting a causal role for LDL-C.
Which phenotypes are causally affected by BMI across the human phenome? Millard et al. (2019) performed an MR phenome-wide association study (MR-pheWAS) in 334,968 UK Biobank participants, identifying 587 associations at 5% FDR—including adverse effects on diabetes and hypertension.
How many biological replicates are needed for RNA-seq differential expression? Ching et al. (2014) showed that increasing sample size yields more power than increasing sequencing depth once depth exceeds approximately 20 million reads per sample.
What sample size and cell count optimize scRNA-seq cell-type DE detection? Schmid et al. (2021) modelled power as a function of donors, cells per donor, and sequencing depth, finding that shallow sequencing of more cells often outperforms deep sequencing of fewer cells for cell-type-specific DE.
How should budget be split between replicates and sequencing depth before a grant submission? Lin et al. (2019) developed RNASeqDesign to optimize sample size and depth jointly from pilot RNA-seq data under a fixed budget constraint.

Services included in this category

Pepkio's statistical analysis category covers Mendelian randomization and experimental design consulting—each with a dedicated spoke page for inputs, tools, and deliverables.

Statistical analysis services offered by Pepkio
Service	Description	Primary tools
Mendelian randomization	Two-sample MR from GWAS summary data with pleiotropy sensitivity analyses and STROBE-MR-aligned reporting	TwoSampleMR 0.7.4, MendelianRandomization 0.9.2, MR-PRESSO 1.0
Experimental design	Pre-study power, sample-size, and replicate planning for bulk RNA-seq, scRNA-seq, and related omics	PROPER 1.43.0, RNASeqDesign 0.1.0, scPower 1.0.0, DESeq2 1.52.0

What Pepkio delivers

Pepkio returns reproducible, analysis-ready outputs—not summary slides alone.

MR outputs

Harmonized exposure and outcome summary-statistics tables (.tsv); primary MR tables (inverse-variance weighted, weighted median, MR-Egger) with beta, SE, p, and nsnp columns; MR-PRESSO global and outlier tests; scatter plots, funnel plots, and leave-one-out influence plots (PDF/SVG); instrument F-statistic and R² summary; STROBE-MR checklist mapping document

Experimental design outputs

Power-vs-sample-size and power-vs-depth curves (PDF/SVG); recommended biological replicate count, sequencing depth, and expected detectable fold change at target FDR; simulation parameter log from pilot data; design memo suitable for grant Methods or Power sections citing tool versions and assumptions

Code and documentation

Commented R scripts; conda or renv lock files; README with rerun instructions; Methods draft citing exact software versions and GWAS accession IDs—you retain full ownership

Support

Milestone check-ins with a dedicated PhD-level scientific contact; reviewer clarification and minor revisions within agreed scope (typically ≤20% of deliverables)

Non-standard MR extensions, custom power models, or client-specified table formats are scoped at kickoff.

How the analysis works — step by step

1. Scope the causal or design question
Confirm the exposure–outcome pair for MR, or the contrast, effect size, and target FDR for a design project; assess biological plausibility and pre-register sensitivity analyses where feasible (Burgess et al., 2023).
Tools and outputs
Output: signed scope document with primary and secondary estimands
2. Inventory inputs and metadata
For MR: collect OpenGWAS IDs or client summary-stat files, population ancestry, and genome build. For design: collect pilot count matrix, sample manifest, and budget constraints.
Tools and outputs
Tools used: sample manifest template
Output: input_manifest.csv
3. Select instruments or estimate pilot parameters
MR: extract genome-wide significant SNPs, LD clump (r² < 0.001, 10,000 kb per TwoSampleMR convention), and compute per-SNP F-statistics (Burgess et al., 2023). Design: estimate mean expression, dispersion, and effect-size priors from pilot counts.
Tools and outputs
Tools used: TwoSampleMR 0.7.4; PROPER 1.43.0 estParam()
Output: instrument list or simulation parameter file
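
As an illustration of the per-SNP F-statistic in this step, the Python sketch below uses the common summary-data approximation F ≈ (beta / SE)² and flags instruments below the conventional threshold of 10. The SNP identifiers and effect sizes are hypothetical, and the approximation is an assumption of this sketch rather than a description of how any listed R package computes it.

```python
def f_statistic(beta, se):
    """Approximate single-variant F-statistic from summary data: (beta / SE)^2."""
    return (beta / se) ** 2


# Hypothetical exposure associations for three candidate instruments
instruments = [
    {"snp": "rs_example_1", "beta": 0.12, "se": 0.010},
    {"snp": "rs_example_2", "beta": -0.05, "se": 0.010},
    {"snp": "rs_example_3", "beta": 0.02, "se": 0.010},
]

for row in instruments:
    row["f_stat"] = f_statistic(row["beta"], row["se"])
    row["weak"] = row["f_stat"] < 10  # conventional weak-instrument cutoff

mean_f = sum(row["f_stat"] for row in instruments) / len(instruments)

for row in instruments:
    print(row["snp"], round(row["f_stat"], 1), "weak" if row["weak"] else "ok")
print("mean F:", round(mean_f, 1))
```

Reporting both per-SNP and aggregate values matters because a high mean F can hide individual weak instruments, as the third SNP shows here.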
4. Harmonize alleles or simulate count data
MR: align effect alleles, resolve palindromic SNPs with allele frequency checks, and log harmonization decisions (Burgess et al., 2023). Design: simulate negative-binomial counts under proposed replicate counts and depths.
Tools and outputs
Tools used: TwoSampleMR 0.7.4; PROPER 1.43.0 or RNASeqDesign 0.1.0
Output: harmonized GWAS tables or simulation draws
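
The negative-binomial simulation in this step can be sketched in Python as a gamma-Poisson mixture, which gives variance = mean + dispersion × mean². This is an illustration using only the standard library, not the PROPER or RNASeqDesign implementation; the function names, seed, mean of 20, and dispersion of 0.2 are assumptions made for the example.

```python
import math
import random
import statistics


def sample_poisson(rng, lam):
    """Knuth's multiplication method; suitable for small means only."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_negative_binomial(rng, mean, dispersion):
    """Draw one count with Var = mean + dispersion * mean^2."""
    # The gamma rate has mean `mean` and variance `dispersion * mean^2`.
    rate = rng.gammavariate(1.0 / dispersion, mean * dispersion)
    return sample_poisson(rng, rate)


rng = random.Random(2024)  # fixed seed so the sketch is repeatable
counts = [sample_negative_binomial(rng, mean=20, dispersion=0.2)
          for _ in range(20000)]

sample_mean = statistics.mean(counts)
sample_var = statistics.pvariance(counts)
print("mean:", sample_mean, "variance:", sample_var)
```

The simulated variance sits well above the mean (close to 20 + 0.2 × 20² = 100), which is the overdispersion a Poisson model would miss and the reason replicate counts drive power.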
5. Run primary analysis
MR: inverse-variance weighted two-sample MR as primary estimator (Burgess et al., 2013). Design: compute power at target FDR across replicate and depth grids.
Tools and outputs
Tools used: MendelianRandomization 0.9.2; PROPER 1.43.0; scPower 1.0.0 (scRNA-seq)
Output: mr_primary_results.csv or power summary table
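
For readers who want to see what the primary MR estimator computes, the Python sketch below implements the fixed-effect inverse-variance weighted estimate from summary data: the weighted average of per-SNP ratio estimates with weights beta_exposure² / se_outcome². The input values are hypothetical, and the code is an illustration rather than the MendelianRandomization package's implementation.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate and standard error from summary data."""
    numerator = sum(bx * by / se ** 2
                    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    denominator = sum(bx ** 2 / se ** 2
                      for bx, se in zip(beta_exposure, se_outcome))
    return numerator / denominator, math.sqrt(1.0 / denominator)


# Hypothetical harmonized associations for three instruments
beta_exposure = [0.10, 0.20, 0.15]
beta_outcome = [0.05, 0.10, 0.075]
se_outcome = [0.01, 0.01, 0.01]

estimate, se = ivw_estimate(beta_exposure, beta_outcome, se_outcome)
ci_low, ci_high = estimate - 1.96 * se, estimate + 1.96 * se
print(round(estimate, 3), round(se, 4), (round(ci_low, 3), round(ci_high, 3)))
```

Every SNP in this toy input implies the same ratio of 0.5, so the pooled estimate is 0.5; real data show heterogeneity, which is what the sensitivity analyses in the next step probe.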
6. Run sensitivity and robustness checks
MR: weighted median, MR-Egger, MR-PRESSO outlier removal, and leave-one-out analysis (Verbanck et al., 2018; Burgess & Thompson, 2017). Design: sweep depth vs replicate tradeoffs (Ching et al., 2014; Jeon et al., 2023).
Tools and outputs
Tools used: MR-PRESSO 1.0; MendelianRandomization 0.9.2; RNASeqPower 1.52.0
Output: sensitivity tables and diagnostic plots
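
A leave-one-out analysis can be sketched in a few lines of Python: recompute the IVW estimate with each SNP removed and look for a single variant that moves the result. The data are hypothetical, with the fourth SNP made a deliberate outlier, and the function names are illustrative only.

```python
def ivw(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW point estimate from summary data."""
    num = sum(bx * by / se ** 2
              for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    den = sum(bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome))
    return num / den


def leave_one_out(snps, beta_exposure, beta_outcome, se_outcome):
    """Return (dropped_snp, estimate_without_it) for every SNP."""
    results = []
    for i, snp in enumerate(snps):
        keep = [j for j in range(len(snps)) if j != i]
        results.append((snp, ivw([beta_exposure[j] for j in keep],
                                 [beta_outcome[j] for j in keep],
                                 [se_outcome[j] for j in keep])))
    return results


# Hypothetical data: the last SNP has an outcome effect far off the others
snps = ["rs_a", "rs_b", "rs_c", "rs_outlier"]
beta_exposure = [0.10, 0.20, 0.15, 0.10]
beta_outcome = [0.05, 0.10, 0.075, 0.20]
se_outcome = [0.01, 0.01, 0.01, 0.01]

full_estimate = ivw(beta_exposure, beta_outcome, se_outcome)
loo = leave_one_out(snps, beta_exposure, beta_outcome, se_outcome)
for snp, est in loo:
    print(f"without {snp}: {est:.3f}")
```

Dropping the outlier returns the estimate to 0.5, while dropping any other SNP leaves it inflated; that pattern is what a leave-one-out plot is meant to reveal.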
7. Interpret assumptions and document limitations
Map results to STROBE-MR checklist items; state where exclusion restriction or instrument-strength assumptions may be violated (Skrivankova et al., 2021). For design memos, document pilot-data limitations and recommend minimum viable replicate counts.
Tools and outputs
Output: interpretation memo with assumption audit
8. Package deliverables and support reviewer requests
Export figures at publication resolution; bundle scripts, lock files, README, and Methods draft. Transfer via agreed secure channels. Post-delivery support covers methods clarification; substantial new exposure–outcome pairs or design revisions are scoped separately.
Tools and outputs
Output: final deliverable bundle

Tools and standards we use

Pepkio pins software versions at kickoff and cites primary references in the Methods draft. Representative tools across MR and experimental design:

Statistical analysis tools and standards
Tool	Version	Role	Primary citation
TwoSampleMR	0.7.4	Two-sample MR orchestration and OpenGWAS retrieval	https://doi.org/10.7554/eLife.34408
ieugwasr	1.1.0	Programmatic access to OpenGWAS summary statistics	https://mrcieu.github.io/ieugwasr/
MendelianRandomization	0.9.2	IVW, weighted median, MR-Egger estimators	https://doi.org/10.1093/ije/dyx034
MR-PRESSO	1.0	Pleiotropy outlier detection and distortion test	https://doi.org/10.1038/s41588-018-0304-7
PROPER	1.43.0	Simulation-based RNA-seq power evaluation	https://doi.org/10.1093/bioinformatics/btu640
RNASeqDesign	0.1.0	Multi-dimensional sample-size and depth optimization from pilot data	https://doi.org/10.1111/rssc.12330
scPower	1.0.0	scRNA-seq cell-type DE and eQTL power planning	https://doi.org/10.1038/s41467-021-26779-7
RNASeqPower	1.52.0	Closed-form RNA-seq sample-size estimation	https://doi.org/10.1089/cmb.2012.0283
DESeq2	1.52.0	Pilot dispersion estimation from count matrices	https://doi.org/10.1186/s13059-014-0550-8

MR projects follow Burgess et al. (2023) for harmonization and sensitivity reporting, TwoSampleMR conventions for LD clumping (r² < 0.001, 10,000 kb), an F-statistic > 10 threshold for instrument strength, and STROBE-MR (Skrivankova et al., 2021) for reporting. Design projects follow Jeon et al. (2023) tool recommendations and Conesa et al. (2016) guidance of ≥3 biological replicates per condition where feasible.

Common challenges — and how we handle them

Researchers often struggle with weak instruments, pleiotropy, harmonization errors, underpowered designs, and automated analyses lacking biological grounding. Pepkio addresses each with documented sensitivity workflows and kickoff scoping.

Weak instruments and low F-statistics: Instruments with F < 10 (a conventional weak-IV threshold) can bias MR estimates; Burgess et al. (2023) recommend reporting F-statistics. Pepkio reports per-SNP and aggregate F-statistics, applies strict clumping, and documents leave-one-out stability before interpretation.
Horizontal pleiotropy violating exclusion restriction: SNPs may influence outcomes through pathways independent of the exposure (Verbanck et al., 2018). Pepkio runs MR-Egger, weighted median, and MR-PRESSO outlier tests and reports when estimates diverge materially.
Variant harmonization errors: Allele flipping, strand mismatches, and palindromic SNPs cause spurious MR associations (Burgess et al., 2023). Pepkio logs every harmonization decision and flags SNPs removed for ambiguity.
Underpowered omics designs before sequencing: Many RNA-seq studies are planned without simulation-based power assessment (Jeon et al., 2023). Pepkio runs PROPER or RNASeqDesign sweeps and recommends a replicate count before library prep.
Hypothesis-free MR without biological plausibility: Automated exposure–outcome scans can yield technically valid but biologically nonsensical pairs (Hemani et al., 2025). Pepkio scopes MR questions with explicit biological rationale at kickoff and discourages undirected phenome scans unless pre-specified.
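
To illustrate the pleiotropy check mentioned in the list above, the Python sketch below fits MR-Egger point estimates: a weighted regression of outcome effects on exposure effects with an intercept, where a non-zero intercept suggests directional pleiotropy. It returns point estimates only (no standard errors), uses hypothetical data constructed with an intercept of 0.02 and a slope of 0.5, and is not the MendelianRandomization package's implementation.

```python
def mr_egger(beta_exposure, beta_outcome, se_outcome):
    """Return (intercept, slope) from inverse-variance weighted regression."""
    xs, ys, ws = [], [], []
    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome):
        # Orient every SNP so its exposure effect is positive.
        sign = -1.0 if bx < 0 else 1.0
        xs.append(sign * bx)
        ys.append(sign * by)
        ws.append(1.0 / se ** 2)
    sw = sum(ws)
    swx = sum(w * x for w, x in zip(ws, xs))
    swy = sum(w * y for w, y in zip(ws, ys))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
    intercept = (swy - slope * swx) / sw
    return intercept, slope


# Hypothetical data lying on the line outcome = 0.02 + 0.5 * exposure
beta_exposure = [0.10, -0.20, 0.15, 0.30]
beta_outcome = [0.07, -0.12, 0.095, 0.17]
se_outcome = [0.01, 0.02, 0.015, 0.01]

intercept, slope = mr_egger(beta_exposure, beta_outcome, se_outcome)
print("intercept:", round(intercept, 4), "slope:", round(slope, 4))
```

An IVW fit forces the line through the origin, so it would absorb the 0.02 offset into a biased slope; comparing the two estimates is one reason to report when methods diverge.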

Common questions

What data do I need to provide for a statistical analysis project at Pepkio?

For MR: exposure and outcome GWAS summary statistics (OpenGWAS IDs or .tsv files), population ancestry, and genome build. For experimental design: a pilot count matrix or public reference dataset, target fold change, FDR, and budget constraints. Pepkio confirms scope at kickoff. Custom summary-stat formats are accepted when documented in advance.

How long does statistical analysis take at Pepkio?

Experimental design memos typically take 1–2 weeks. Standard two-sample MR takes 2–4 weeks. Multivariable MR, colocalization extensions, or multi-contrast design optimization take 3–5 weeks. Exact timelines are confirmed at kickoff.

What do Pepkio statistical analysis deliverables look like?

MR projects receive harmonized GWAS tables, primary and sensitivity MR result spreadsheets, scatter/funnel/leave-one-out plots, STROBE-MR mapping, commented R scripts, and a Methods draft. Design projects receive power curves, recommended replicate counts and sequencing depth, a simulation log, and a design memo suitable for grant Methods or Power sections.

Can Pepkio use my in-house GWAS summary statistics instead of OpenGWAS?

Yes. Provide per-SNP beta, SE, effect allele, other allele, and sample size in a documented format. Pepkio harmonizes in-house files against the same allele-alignment rules used for OpenGWAS retrieval (Burgess et al., 2023). Mixed-ancestry or non-GRCh38 builds are supported when scoped at kickoff.
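
The allele-alignment idea can be sketched in Python as follows. This is a simplified illustration that assumes uppercase single-nucleotide alleles; the function name, the 0.08 ambiguity band around an allele frequency of 0.5, and the decision labels are assumptions of the sketch, not the rules of any specific package.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def harmonize(exp_ea, exp_oa, out_ea, out_oa, out_beta,
              exp_eaf, out_eaf, ambiguity_band=0.08):
    """Align an outcome beta to the exposure effect allele.

    Returns (beta, decision); beta is None when the SNP is dropped.
    """
    if COMPLEMENT[exp_ea] == exp_oa:
        # Palindromic (A/T or C/G): letters cannot reveal the strand,
        # so fall back on effect-allele frequencies.
        if (abs(exp_eaf - 0.5) < ambiguity_band
                or abs(out_eaf - 0.5) < ambiguity_band):
            return None, "dropped: palindromic, ambiguous frequency"
        same_allele = (exp_eaf < 0.5) == (out_eaf < 0.5)
        beta = out_beta if same_allele else -out_beta
        return beta, "palindromic: aligned by frequency"
    candidates = (
        ("forward", out_ea, out_oa),
        ("reverse", COMPLEMENT[out_ea], COMPLEMENT[out_oa]),
    )
    for strand, ea, oa in candidates:
        if (ea, oa) == (exp_ea, exp_oa):
            return out_beta, f"kept ({strand} strand)"
        if (ea, oa) == (exp_oa, exp_ea):
            return -out_beta, f"flipped ({strand} strand)"
    return None, "dropped: incompatible alleles"


# Hypothetical examples; each decision string would go into the log
print(harmonize("A", "G", "G", "A", 0.10, 0.30, 0.70))
print(harmonize("A", "T", "A", "T", 0.10, 0.48, 0.47))
```

Returning a decision string alongside the beta is what makes a per-SNP harmonization log possible, including the list of SNPs removed for ambiguity.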

How many RNA-seq replicates do I need for differential expression?

There is no fixed number: the answer depends on effect size, dispersion, and sequencing depth. Ching et al. (2014) showed replicate count dominates power once depth exceeds ~20 million reads. Pepkio runs PROPER or RNASeqDesign simulations and recommends a replicate count at your target FDR.
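
As a rough illustration of that dependence, the Python sketch below applies the closed-form approximation of Hart et al. (2013), n = 2 (z₁₋α/₂ + z_power)² (1/depth + CV²) / (ln effect)², where depth is the average read count for a gene and CV is the biological coefficient of variation. The input values are hypothetical, and alpha here is a per-gene significance level rather than an FDR target, so this is a first-pass estimate and not a replacement for simulation.

```python
import math
from statistics import NormalDist


def samples_per_group(depth, cv, effect, alpha=0.05, power=0.9):
    """Closed-form samples per group for a two-group comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (2 * (z_alpha + z_power) ** 2
            * (1 / depth + cv ** 2) / math.log(effect) ** 2)


# Hypothetical design: 2-fold change, biological CV of 0.4
n_at_20 = samples_per_group(depth=20, cv=0.4, effect=2)
n_at_40 = samples_per_group(depth=40, cv=0.4, effect=2)
n_at_1000 = samples_per_group(depth=1000, cv=0.4, effect=2)

for depth, n in ((20, n_at_20), (40, n_at_40), (1000, n_at_1000)):
    print(f"depth {depth}: {n:.2f} -> {math.ceil(n)} samples per group")
```

Doubling per-gene coverage from 20 to 40 saves about one sample per group, and even very deep coverage cannot push the requirement below roughly seven, because the CV² term does not shrink with depth.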

What is the difference between Mendelian randomization and experimental design at Pepkio?

Mendelian randomization estimates causal effects of an exposure on an outcome using genetic instruments and existing GWAS summary data—typically after data collection. Experimental design estimates whether a planned omics experiment can detect specified effects before you sequence—preventing underpowered studies. Both deliver version-pinned R code and a Methods draft.

Can Pepkio run multivariable or mediation Mendelian randomization?

Yes, when scoped at kickoff. Pepkio supports multivariable MR for correlated exposures and mediation-style extensions using TwoSampleMR and MendelianRandomization, with sensitivity analyses per Burgess et al. (2023). Complex network MR or sample-overlap corrections require explicit scoping.

Do you help with grant power-analysis sections?

Yes. Pepkio delivers a design memo with power curves, assumed effect sizes, FDR targets, recommended replicate counts, and cited tool versions suitable for grant Methods or Power sections. Simulations use pilot data when available or published reference datasets when not (Lin et al., 2019; Wu et al., 2015).

Do I receive the analysis code—and do I own it?

Yes—you retain full ownership. Pepkio delivers commented R scripts with conda or renv lock files via private Git or agreed file transfer. You can rerun, extend, or publish the code without restriction.

Can I be involved during the statistical analysis?

Yes. Checkpoint reviews take place after instrument selection or simulation setup, after primary analysis, and before final delivery. You can review exposure–outcome pairs, clumping thresholds, FDR targets, and power assumptions within agreed scope.

What happens if a journal reviewer requests changes after delivery?

Methods clarification and minor revisions within agreed scope (typically ≤20% of deliverables) are covered. Adding new exposure–outcome pairs, rerunning with different GWAS releases, or expanding design to new modalities are scoped and priced separately.

Can Pepkio run custom or non-standard statistical analyses?

Yes—when scoped at kickoff: MR-RAPS, multivariable or mediation MR, custom clumping or FDR thresholds, alternative power simulators (ssizeRNA, powsimR), non-RNA-seq count models, or client-specified output formats.

Related services

Transcriptomics — Generate pilot count matrices for power analysis or expression data for MR colocalization with GWAS loci.
Genomics & variant analysis — Variant-level annotation and LD reference panels to support instrument selection and harmonization.
Machine learning — Build predictive models from omics features after causal screening or adequately powered data collection.
Bioinformatics consulting — Feasibility assessment, modality selection, and scoping before committing to MR or design projects.
Custom analysis — Non-standard statistical extensions, multi-trait integration, or bespoke reporting beyond standard MR and design workflows.

References

Burgess S, Davey Smith G, Davies NM, et al. Guidelines for performing Mendelian randomization investigations: update for summer 2023. Wellcome Open Research. 2023;4:186. https://doi.org/10.12688/wellcomeopenres.15555.3 (PMID: 32760811)
Skrivankova VW, Richmond RC, Woolf BAR, et al. Strengthening the reporting of observational studies in epidemiology using Mendelian randomisation (STROBE-MR): explanation and elaboration. BMJ. 2021;375:n2233. https://doi.org/10.1136/bmj.n2233 (PMID: 34702754)
Burgess S, Butterworth A, Thompson SG. Mendelian randomization analysis with multiple genetic variants using summarized data. Genetic Epidemiology. 2013;37(7):658–665. https://doi.org/10.1002/gepi.21758 (PMID: 24114802)
Burgess S, Thompson SG. Interpreting findings from Mendelian randomization using the MR-Egger method. European Journal of Epidemiology. 2017;32(5):377–389. https://doi.org/10.1007/s10654-017-0259-x (PMID: 28527048)
Verbanck M, Chen C-Y, Neale B, Do R. Detection of widespread horizontal pleiotropy in causal relationships inferred from Mendelian randomization between complex traits and diseases. Nature Genetics. 2018;50(5):693–698. https://doi.org/10.1038/s41588-018-0304-7 (PMID: 29686387)
Hemani G, Stender S, Wolters FJ, et al. The rapid growth in Mendelian randomization studies. European Journal of Epidemiology. 2025;40(10):1165–1171. https://doi.org/10.1007/s10654-025-01317-7 (PMID: 41196509)
Holmes MV, Asselbergs FW, Palmer TM, et al. Mendelian randomization of blood lipids for coronary heart disease. European Heart Journal. 2015;36(9):539–550. https://doi.org/10.1093/eurheartj/eht571 (PMID: 24474739)
Millard LAC, Davies NM, Tilling K, Gaunt TR, Davey Smith G. Searching for the causal effects of body mass index in over 300,000 participants in UK Biobank, using Mendelian randomization. PLoS Genetics. 2019;15(2):e1007951. https://doi.org/10.1371/journal.pgen.1007951 (PMID: 30707692)
Jeon H, Xie J, Jeon Y, et al. Statistical power analysis for designing bulk, single-cell, and spatial transcriptomics experiments: review, tutorial, and perspectives. Biomolecules. 2023;13(2):221. https://doi.org/10.3390/biomolecules13020221 (PMID: 36830591)
Wu H, Wang C, Wu Z. PROPER: comprehensive power evaluation for differential expression using RNA-seq. Bioinformatics. 2015;31(2):233–241. https://doi.org/10.1093/bioinformatics/btu640 (PMID: 25273110)
Lin CW, Liao SG, Liu P, Lee MLT, Park YS, Tseng GC. RNASeqDesign: a framework for RNA-Seq genome-wide power calculation and study design issues. Journal of the Royal Statistical Society Series C. 2019;68(3):683–704. https://doi.org/10.1111/rssc.12330 (PMID: 33692596)
Hart SN, Therneau TM, Zhang Y, et al. Calculating sample size estimates for RNA sequencing data. Journal of Computational Biology. 2013;20(12):970–978. https://doi.org/10.1089/cmb.2012.0283 (PMID: 23961961)
Ching T, Huang S, Garmire LX, et al. Power analysis and sample size estimation for RNA-Seq differential expression. RNA. 2014;20(11):1684–1693. https://doi.org/10.1261/rna.046011.114 (PMID: 25246651)
Schmid KT, Höllbacher B, Cruceanu C, et al. scPower accelerates and optimizes the design of multi-sample single cell transcriptomic studies. Nature Communications. 2021;12:6625. https://doi.org/10.1038/s41467-021-26779-7 (PMID: 34785648)
Hemani G, Zheng J, Elsworth B, et al. The MR-Base platform supports systematic causal inference across the human phenome. eLife. 2018;7:e34408. https://doi.org/10.7554/eLife.34408 (PMID: 29846171)
Conesa A, Madrigal P, Tarazona S, et al. A survey of best practices for RNA-seq data analysis. Genome Biology. 2016;17(1):13. https://doi.org/10.1186/s13059-016-0881-8 (PMID: 26813401)
ENCODE Consortium. ENCODE Guidelines and Best Practices for RNA-Seq (Revised December 2016). https://www.encodeproject.org/documents/cede0cbe-d324-4ce7-ace4-f0c3eddf5972/@@download/attachment/ENCODE%20Best%20Practices%20for%20RNA_v2.pdf
GWAS Catalog. NHGRI-EBI GWAS Catalog statistics (accessed June 2026). https://www.ebi.ac.uk/gwas/
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology. 2014;15(12):550. https://doi.org/10.1186/s13059-014-0550-8 (PMID: 25516281)

Individual services

Deep-dive pages for specific statistical analysis methods and workflows.

Let's Talk About Your Science

Tell us:

• Your biological question
• Data type and size
• Timeline constraints

We'll tell you:

• What's feasible
• How long it will take
• Exactly what it will cost
