
Statistical Analysis Services — STROBE-MR-Aligned Causal Inference and Pre-Study Power Planning

Statistical analysis applies formal inference to test causal exposure–outcome relationships and to check whether an omics study is adequately powered before data collection. Pepkio's statistical analysis service delivers version-pinned Mendelian randomization workflows and prospective experimental-design reports, each with full R code, sensitivity analyses, and a Methods draft, for academic, biotech, and pharma teams. Custom inputs, outputs, and non-standard analyses are scoped at kickoff.

Key facts

Key facts about statistical analysis
Fact | Value
Data types supported | GWAS summary statistics (OpenGWAS IDs or client .tsv/.txt files); exposure and outcome GWAS metadata; pilot RNA-seq count matrices; cohort sample manifests; pre-specified instrument SNP lists; grant or protocol design briefs
Reference builds or standards used | STROBE-MR reporting checklist (Skrivankova et al., 2021); Burgess MR guidelines (2023); GRCh38 for variant harmonization; Conesa et al. (2016) RNA-seq replicate guidance where omics design applies
Primary tools (with versions) | TwoSampleMR 0.7.4; ieugwasr 1.1.0; MendelianRandomization 0.9.2; MR-PRESSO 1.0; PROPER 1.43.0; RNASeqDesign 0.1.0; scPower 1.0.0; RNASeqPower 1.52.0; DESeq2 1.52.0
Typical turnaround range | 1–2 weeks (experimental design memo from pilot or public priors); 2–4 weeks (standard two-sample MR); 3–5 weeks (multivariable MR, colocalization extensions, or multi-contrast design optimization); all confirmed at kickoff
Deliverable formats | Harmonized GWAS .tsv; MR result tables (.csv, .xlsx); scatter, funnel, and leave-one-out plots (PDF/SVG); power-vs-sample-size curves; design memo; STROBE-MR mapping document; commented R scripts; Methods draft
Regulatory/reproducibility standards followed | STROBE-MR checklist mapping; Burgess MR sensitivity-analysis recommendations; version-pinned software with sessionInfo() or conda lock files; private Git or Zenodo archival on request
Custom / bespoke analysis | Non-standard MR methods (MR-RAPS, multivariable MR, mediation MR), custom FDR or clumping thresholds, alternative power models, client-specified figure or table formats, and analyses beyond standard two-sample MR; all scoped at kickoff

Key terms: Mendelian randomization (MR) uses genetic variants as instrumental variables to estimate causal effects of modifiable exposures. An instrumental variable must associate with the exposure, be independent of confounders, and affect the outcome only through the exposure. The F-statistic measures instrument strength; values below 10 are conventionally treated as weak. False discovery rate (FDR) control limits the expected proportion of false positives among the tests called significant. A negative binomial model captures the overdispersion of RNA-seq counts for power simulation.
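To make the FDR term concrete, the sketch below implements the Benjamini-Hochberg step-up adjustment in plain Python. It is an illustration only: the p-values are made up, the function name `benjamini_hochberg` is hypothetical, and this page does not state which FDR procedure a given project uses.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values for five tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
q_values = benjamini_hochberg(p_values)
significant = [q <= 0.05 for q in q_values]
```

With these inputs only the first two tests pass a 5% FDR target, even though four raw p-values are below 0.05.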

What is statistical analysis?

Statistical analysis applies probability models to quantify uncertainty in biological conclusions. Within this service it covers two areas: causal inference from genetic instruments and prospective study design for omics experiments. Unlike observational regression alone, MR exploits the random allocation of alleles to estimate whether a modifiable exposure causally affects a disease outcome (Burgess et al., 2023). Prospective design analysis asks whether a planned RNA-seq or scRNA-seq study has sufficient power before sequencing spend (Jeon et al., 2023). Adoption is substantial: the NHGRI-EBI GWAS Catalog hosts 176,855 full summary-statistics datasets across 7,714 publications (GWAS Catalog, 2026), and MR submissions to European Journal of Epidemiology rose from 3.1% of all submissions in 2020 to 13.0% in 2024 (Hemani et al., 2025).

What statistical analysis can answer

Published examples of biological questions statistical analysis can address:

  • Does LDL cholesterol causally increase coronary heart disease risk? Holmes et al. (2015) applied weighted allele scores in 17 studies (62,199 participants; 12,099 CHD events) and found both unrestricted and restricted LDL-C instrument sets associated with CHD, supporting a causal role for LDL-C.
  • Which phenotypes are causally affected by BMI across the human phenome? Millard et al. (2019) performed an MR phenome-wide association study (MR-pheWAS) in 334,968 UK Biobank participants, identifying 587 associations at 5% FDR—including adverse effects on diabetes and hypertension.
  • How many biological replicates are needed for RNA-seq differential expression? Ching et al. (2014) showed that increasing sample size yields more power than increasing sequencing depth once depth exceeds approximately 20 million reads per sample.
  • What sample size and cell count optimize scRNA-seq cell-type DE detection? Schmid et al. (2021) modelled power as a function of donors, cells per donor, and sequencing depth, finding that shallow sequencing of more cells often outperforms deep sequencing of fewer cells for cell-type-specific DE.
  • How should budget be split between replicates and sequencing depth before a grant submission? Lin et al. (2019) developed RNASeqDesign to optimize sample size and depth jointly from pilot RNA-seq data under a fixed budget constraint.

Services included in this category

Pepkio's statistical analysis category covers Mendelian randomization and experimental design consulting—each with a dedicated spoke page for inputs, tools, and deliverables.

Statistical analysis services offered by Pepkio
Service | Description | Primary tools
Mendelian randomization | Two-sample MR from GWAS summary data with pleiotropy sensitivity analyses and STROBE-MR-aligned reporting | TwoSampleMR 0.7.4, MendelianRandomization 0.9.2, MR-PRESSO 1.0
Experimental design | Pre-study power, sample-size, and replicate planning for bulk RNA-seq, scRNA-seq, and related omics | PROPER 1.43.0, RNASeqDesign 0.1.0, scPower 1.0.0, DESeq2 1.52.0

What Pepkio delivers

Pepkio returns reproducible, analysis-ready outputs—not summary slides alone.

MR outputs

  • Harmonized exposure and outcome summary-statistics tables (.tsv); primary MR tables (inverse-variance weighted, weighted median, MR-Egger) with beta, SE, p, and nsnp columns; MR-PRESSO global and outlier tests; scatter plots, funnel plots, and leave-one-out influence plots (PDF/SVG); instrument F-statistic and R² summary; STROBE-MR checklist mapping document

Experimental design outputs

  • Power-vs-sample-size and power-vs-depth curves (PDF/SVG); recommended biological replicate count, sequencing depth, and expected detectable fold change at target FDR; simulation parameter log from pilot data; design memo suitable for grant Methods or Power sections citing tool versions and assumptions

Code and documentation

  • Commented R scripts; conda or renv lock files; README with rerun instructions; Methods draft citing exact software versions and GWAS accession IDs—you retain full ownership

Support

  • Milestone check-ins with a dedicated PhD-level scientific contact; reviewer clarification and minor revisions within agreed scope (typically ≤20% of deliverables)

Non-standard MR extensions, custom power models, or client-specified table formats are scoped at kickoff.

How the analysis works — step by step

  1. Scope the causal or design question

    Confirm exposure–outcome pair for MR or contrast, effect size, and target FDR; assess biological plausibility and pre-register sensitivity analyses where feasible (Burgess et al., 2023).

    Tools and outputs

    Output: signed scope document with primary and secondary estimands

  2. Inventory inputs and metadata

    For MR: collect OpenGWAS IDs or client summary-stat files, population ancestry, and genome build. For design: collect pilot count matrix, sample manifest, and budget constraints.

    Tools and outputs

    Tools used: sample manifest template

    Output: input_manifest.csv

  3. Select instruments or estimate pilot parameters

    MR: extract genome-wide significant SNPs, LD-clump them (r² < 0.001 within a 10,000 kb window, per TwoSampleMR convention), and compute per-SNP F-statistics (Burgess et al., 2023). Design: estimate mean expression, dispersion, and effect-size priors from pilot counts.

    Tools and outputs

    Tools used: TwoSampleMR 0.7.4; PROPER 1.43.0 estParam()

    Output: instrument list or simulation parameter file
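As a rough illustration of the MR half of this step, the sketch below filters made-up exposure associations at the conventional genome-wide threshold (p < 5e-8) and computes the common summary-statistic approximation F = (beta / se)². It is plain Python, not the TwoSampleMR API; the SNP IDs, field names, and helper names are hypothetical.

```python
GENOME_WIDE_P = 5e-8  # conventional genome-wide significance threshold
WEAK_F = 10.0         # conventional weak-instrument cutoff


def per_snp_f(beta, se):
    """Approximate per-SNP F-statistic from summary statistics."""
    return (beta / se) ** 2


def select_instruments(snps, p_threshold=GENOME_WIDE_P):
    """Keep SNPs below the p-value threshold and annotate each with F."""
    selected = []
    for snp in snps:
        if snp["pval"] < p_threshold:
            f_stat = per_snp_f(snp["beta"], snp["se"])
            selected.append({**snp, "F": f_stat, "weak": f_stat < WEAK_F})
    return selected


# Made-up exposure associations (placeholder IDs, not real rsIDs).
exposure = [
    {"id": "snp_1", "beta": 0.10, "se": 0.010, "pval": 1.5e-23},
    {"id": "snp_2", "beta": 0.03, "se": 0.005, "pval": 2e-9},
    {"id": "snp_3", "beta": 0.02, "se": 0.010, "pval": 0.0455},
]
instruments = select_instruments(exposure)
mean_f = sum(s["F"] for s in instruments) / len(instruments)
```

Under this approximation any SNP passing p < 5e-8 has F of roughly 30 or more, so the weak-instrument flag matters mainly when a relaxed p-value threshold is used.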

  4. Harmonize alleles or simulate count data

    MR: align effect alleles, resolve palindromic SNPs with allele frequency checks, and log harmonization decisions (Burgess et al., 2023). Design: simulate negative-binomial counts under proposed replicate counts and depths.

    Tools and outputs

    Tools used: TwoSampleMR 0.7.4; PROPER 1.43.0 or RNASeqDesign 0.1.0

    Output: harmonized GWAS tables or simulation draws
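The allele-alignment logic in this step can be sketched in a few lines. The example below is a simplified illustration, not TwoSampleMR's harmonization routine: the record fields (`ea`, `oa`, `eaf`, `beta`), the decision labels, and the 0.42 frequency cutoff for ambiguous palindromic SNPs are all assumptions chosen for the example.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindromic(allele_1, allele_2):
    """A/T and C/G pairs read the same on both strands."""
    return COMPLEMENT[allele_1] == allele_2


def harmonize(exp, out, ambiguous_from=0.42):
    """Align the outcome beta to the exposure effect allele.

    Returns (aligned_beta, decision); aligned_beta is None if the SNP is dropped.
    """
    e_pair = (exp["ea"], exp["oa"])
    o_pair = (out["ea"], out["oa"])
    if is_palindromic(*e_pair):
        if set(o_pair) != set(e_pair):
            return None, "dropped: allele mismatch"
        # Strand cannot be read from the alleles, so fall back on frequencies.
        for eaf in (exp["eaf"], out["eaf"]):
            if ambiguous_from <= eaf <= 1 - ambiguous_from:
                return None, "dropped: ambiguous palindromic"
        if (exp["eaf"] < 0.5) == (out["eaf"] < 0.5):
            return out["beta"], "kept: palindromic, frequencies agree"
        return -out["beta"], "flipped: palindromic, inferred from frequency"
    o_comp = (COMPLEMENT[o_pair[0]], COMPLEMENT[o_pair[1]])
    if o_pair == e_pair:
        return out["beta"], "kept"
    if o_pair == e_pair[::-1]:
        return -out["beta"], "flipped: alleles swapped"
    if o_comp == e_pair:
        return out["beta"], "kept: strand flipped"
    if o_comp == e_pair[::-1]:
        return -out["beta"], "flipped: strand and alleles"
    return None, "dropped: allele mismatch"


# Made-up records; each decision is logged alongside the aligned beta.
pairs = [
    ({"ea": "A", "oa": "G", "eaf": 0.30}, {"ea": "G", "oa": "A", "eaf": 0.70, "beta": 0.05}),
    ({"ea": "A", "oa": "G", "eaf": 0.30}, {"ea": "T", "oa": "C", "eaf": 0.30, "beta": 0.05}),
    ({"ea": "A", "oa": "T", "eaf": 0.48}, {"ea": "A", "oa": "T", "eaf": 0.47, "beta": 0.05}),
    ({"ea": "A", "oa": "T", "eaf": 0.20}, {"ea": "T", "oa": "A", "eaf": 0.80, "beta": 0.05}),
]
harmonization_log = [harmonize(exp, out) for exp, out in pairs]
```

Returning a decision string with every SNP mirrors the practice of logging harmonization choices so that dropped or flipped variants can be audited later.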

  5. Run primary analysis

    MR: inverse-variance weighted two-sample MR as primary estimator (Burgess et al., 2013). Design: compute power at target FDR across replicate and depth grids.

    Tools and outputs

    Tools used: MendelianRandomization 0.9.2; PROPER 1.43.0; scPower 1.0.0 (scRNA-seq)

    Output: mr_primary_results.csv or power summary table
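For orientation, the fixed-effect inverse-variance weighted estimate can be written directly from summary statistics: each SNP contributes a Wald ratio (outcome beta over exposure beta) weighted by beta_exposure² / se_outcome². The sketch below is a standalone Python illustration with made-up numbers, not a call into MendelianRandomization; the function name `ivw` is hypothetical, and the random-effects variant is not shown.

```python
import math


def ivw(beta_exp, beta_out, se_out):
    """Fixed-effect IVW estimate and standard error from per-SNP summary data."""
    # First-order weights: inverse variance of each Wald ratio.
    weights = [bx ** 2 / sy ** 2 for bx, sy in zip(beta_exp, se_out)]
    ratios = [by / bx for bx, by in zip(beta_exp, beta_out)]
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total
    standard_error = math.sqrt(1.0 / total)
    return estimate, standard_error


# Made-up harmonized effects for three instruments.
beta_exp = [0.10, 0.20, 0.30]
beta_out = [0.05, 0.10, 0.15]
se_out = [0.01, 0.01, 0.01]

estimate, standard_error = ivw(beta_exp, beta_out, se_out)
```

Every Wald ratio here equals 0.5, so the pooled estimate is 0.5; instruments with larger exposure effects or smaller outcome standard errors carry more weight.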

  6. Run sensitivity and robustness checks

    MR: weighted median, MR-Egger, MR-PRESSO outlier removal, and leave-one-out analysis (Verbanck et al., 2018; Burgess & Thompson, 2017). Design: sweep depth vs replicate tradeoffs (Ching et al., 2014; Jeon et al., 2023).

    Tools and outputs

    Tools used: MR-PRESSO 1.0; MendelianRandomization 0.9.2; RNASeqPower 1.52.0

    Output: sensitivity tables and diagnostic plots
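Two of the MR checks named in this step are simple enough to sketch by hand. The example below fits MR-Egger as a weighted regression of outcome effects on exposure effects with an intercept (a non-zero intercept suggests directional pleiotropy) and recomputes the IVW estimate with each SNP left out in turn. It is a plain-Python illustration on made-up data, without standard errors or p-values, and is not the MR-PRESSO or MendelianRandomization implementation.

```python
def mr_egger(beta_exp, beta_out, se_out):
    """Return (intercept, slope) from weighted regression with an intercept."""
    # Orient every SNP so its exposure effect is positive.
    x = [abs(bx) for bx in beta_exp]
    y = [by if bx >= 0 else -by for bx, by in zip(beta_exp, beta_out)]
    w = [1.0 / sy ** 2 for sy in se_out]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def ivw_estimate(beta_exp, beta_out, se_out):
    """Fixed-effect IVW point estimate."""
    num = sum(bx * by / sy ** 2 for bx, by, sy in zip(beta_exp, beta_out, se_out))
    den = sum(bx ** 2 / sy ** 2 for bx, sy in zip(beta_exp, se_out))
    return num / den


def leave_one_out(beta_exp, beta_out, se_out):
    """IVW estimate with each SNP removed in turn."""
    results = []
    for i in range(len(beta_exp)):
        keep = [j for j in range(len(beta_exp)) if j != i]
        results.append(ivw_estimate([beta_exp[j] for j in keep],
                                    [beta_out[j] for j in keep],
                                    [se_out[j] for j in keep]))
    return results


# Made-up data built so that oriented outcome = 0.01 + 0.5 * exposure.
beta_exp = [0.10, 0.20, -0.30, 0.40]
beta_out = [0.06, 0.11, -0.16, 0.21]
se_out = [0.01, 0.02, 0.01, 0.02]

intercept, slope = mr_egger(beta_exp, beta_out, se_out)
loo_estimates = leave_one_out(beta_exp, beta_out, se_out)
```

In this toy data the Egger slope recovers 0.5 while every leave-one-out IVW estimate sits above 0.5, which is the kind of divergence a constant pleiotropic offset produces.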

  7. Interpret assumptions and document limitations

    Map results to STROBE-MR checklist items; state where exclusion restriction or instrument-strength assumptions may be violated (Skrivankova et al., 2021). For design memos, document pilot-data limitations and recommend minimum viable replicate counts.

    Tools and outputs

    Output: interpretation memo with assumption audit

  8. Package deliverables and support reviewer requests

    Export figures at publication resolution; bundle scripts, lock files, README, and Methods draft. Transfer via agreed secure channels. Post-delivery support covers methods clarification; substantial new exposure–outcome pairs or design revisions are scoped separately.

    Tools and outputs

    Output: final deliverable bundle

Tools and standards we use

Pepkio pins software versions at kickoff and cites primary references in the Methods draft. Representative tools across MR and experimental design:

Statistical analysis tools and standards
Tool | Version | Role | Primary citation
TwoSampleMR | 0.7.4 | Two-sample MR orchestration and OpenGWAS retrieval | https://doi.org/10.7554/eLife.34408
ieugwasr | 1.1.0 | Programmatic access to OpenGWAS summary statistics | https://mrcieu.github.io/ieugwasr/
MendelianRandomization | 0.9.2 | IVW, weighted median, MR-Egger estimators | https://doi.org/10.1093/ije/dyx034
MR-PRESSO | 1.0 | Pleiotropy outlier detection and distortion test | https://doi.org/10.1038/s41588-018-0304-7
PROPER | 1.43.0 | Simulation-based RNA-seq power evaluation | https://doi.org/10.1093/bioinformatics/btu640
RNASeqDesign | 0.1.0 | Multi-dimensional sample-size and depth optimization from pilot data | https://doi.org/10.1111/rssc.12330
scPower | 1.0.0 | scRNA-seq cell-type DE and eQTL power planning | https://doi.org/10.1038/s41467-021-26779-7
RNASeqPower | 1.52.0 | Closed-form RNA-seq sample-size estimation | https://doi.org/10.1089/cmb.2012.0283
DESeq2 | 1.52.0 | Pilot dispersion estimation from count matrices | https://doi.org/10.1186/s13059-014-0550-8

MR projects follow Burgess et al. (2023) for harmonization and sensitivity reporting, TwoSampleMR conventions for LD clumping (r² < 0.001, 10,000 kb) and F-stat > 10 instrument strength, and STROBE-MR (Skrivankova et al., 2021) for reporting. Design projects follow Jeon et al. (2023) tool recommendations and Conesa et al. (2016) guidance of ≥3 biological replicates per condition where feasible.
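The clumping convention quoted above (r² < 0.001 within 10,000 kb) amounts to a greedy pass over SNPs ordered by p-value. The sketch below shows that idea in plain Python under loud assumptions: pairwise r² values come from a hypothetical lookup table rather than a reference panel, missing pairs are treated as r² = 0, and the SNP IDs and positions are made up. It is not the clumping code used by TwoSampleMR.

```python
def ld_clump(snps, r2_lookup, r2_threshold=0.001, window_kb=10_000):
    """Greedily keep the most significant SNP in each LD cluster."""
    kept = []
    for snp in sorted(snps, key=lambda s: s["pval"]):
        in_ld_with_kept = any(
            snp["chr"] == k["chr"]
            and abs(snp["pos"] - k["pos"]) <= window_kb * 1000
            # Pairs absent from the lookup are assumed to be uncorrelated.
            and r2_lookup.get(frozenset((snp["id"], k["id"])), 0.0) >= r2_threshold
            for k in kept
        )
        if not in_ld_with_kept:
            kept.append(snp)
    return kept


# Made-up associations and a hypothetical pairwise r-squared table.
snps = [
    {"id": "snp_1", "chr": 1, "pos": 1_000_000, "pval": 1e-30},
    {"id": "snp_2", "chr": 1, "pos": 1_050_000, "pval": 1e-12},
    {"id": "snp_3", "chr": 1, "pos": 50_000_000, "pval": 1e-10},
    {"id": "snp_4", "chr": 2, "pos": 1_000_000, "pval": 1e-9},
]
r2_lookup = {
    frozenset(("snp_1", "snp_2")): 0.8,
    frozenset(("snp_1", "snp_3")): 0.5,  # ignored: outside the 10,000 kb window
}

clumped_ids = [s["id"] for s in ld_clump(snps, r2_lookup)]
```

Here snp_2 is removed because it sits close to and correlates with the stronger snp_1, while snp_3 survives despite its listed r² because it lies outside the window.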

Common challenges — and how we handle them

Researchers often struggle with weak instruments, pleiotropy, harmonization errors, underpowered designs, and automated analyses lacking biological grounding. Pepkio addresses each with documented sensitivity workflows and kickoff scoping.

Weak instruments and low F-statistics
Instruments with F < 10 (a conventional weak-IV threshold) can bias MR estimates; Burgess et al. (2023) recommend reporting F-statistics. Pepkio reports per-SNP and aggregate F-statistics, applies strict clumping, and documents leave-one-out stability before interpretation.
Horizontal pleiotropy violating exclusion restriction
SNPs may influence outcomes through pathways independent of the exposure (Verbanck et al., 2018). Pepkio runs MR-Egger, weighted median, and MR-PRESSO outlier tests and reports when estimates diverge materially.
Variant harmonization errors
Allele flipping, strand mismatches, and palindromic SNPs cause spurious MR associations (Burgess et al., 2023). Pepkio logs every harmonization decision and flags SNPs removed for ambiguity.
Underpowered omics designs before sequencing
Many RNA-seq studies are planned without simulation-based power assessment (Jeon et al., 2023). Pepkio runs PROPER or RNASeqDesign sweeps and recommends replicate count before library prep.
Hypothesis-free MR without biological plausibility
Automated exposure–outcome scans can yield technically valid but biologically nonsensical pairs (Hemani et al., 2025). Pepkio scopes MR questions with explicit biological rationale at kickoff and discourages undirected phenome scans unless pre-specified.
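For the underpowered-design challenge listed above, the core of a simulation-based power sweep can be sketched for a single gene. The example draws negative-binomial counts as a Gamma-Poisson mixture, applies a crude z-test on log counts, and reports the rejection rate across replicate counts. Everything specific here is an assumption for illustration: the mean (100), dispersion (0.1), fold change (1.5), the per-test alpha in place of FDR control, and the simple test itself. PROPER and RNASeqDesign work genome-wide with proper DE methods.

```python
import math
import random
from statistics import NormalDist, mean, variance


def sample_poisson(lam, rng):
    """Knuth's multiplication method; fine for the moderate means used here."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def sample_nb(mu, dispersion, rng):
    """Negative binomial draw with variance mu + dispersion * mu^2."""
    lam = rng.gammavariate(1.0 / dispersion, mu * dispersion)
    return sample_poisson(lam, rng)


def rejects(control, treated, critical_z):
    """Two-sided z-test on log(count + 1) with unequal-variance standard error."""
    a = [math.log(c + 1.0) for c in control]
    b = [math.log(t + 1.0) for t in treated]
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return se > 0 and abs(mean(b) - mean(a)) / se > critical_z


def simulate_power(n_per_group, fold_change, mu=100.0, dispersion=0.1,
                   alpha=0.05, n_sims=300, seed=1):
    """Fraction of simulated experiments in which the gene is called DE."""
    rng = random.Random(seed)
    critical_z = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_sims):
        control = [sample_nb(mu, dispersion, rng) for _ in range(n_per_group)]
        treated = [sample_nb(mu * fold_change, dispersion, rng)
                   for _ in range(n_per_group)]
        hits += rejects(control, treated, critical_z)
    return hits / n_sims


# Power-vs-replicates curve for an assumed 1.5-fold change.
curve = {n: simulate_power(n, fold_change=1.5) for n in (3, 5, 10, 20)}
```

Because the normal approximation is optimistic at very small replicate counts, the low end of such a curve should be read as an upper bound rather than a planning figure.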

Common questions

What data do I need to provide for a statistical analysis project at Pepkio?

For MR: exposure and outcome GWAS summary statistics (OpenGWAS IDs or .tsv files), population ancestry, and genome build. For experimental design: a pilot count matrix or public reference dataset, target fold change, FDR, and budget constraints. Pepkio confirms scope at kickoff. Custom summary-stat formats are accepted when documented in advance.

How long does statistical analysis take at Pepkio?

Experimental design memos typically take 1–2 weeks. Standard two-sample MR takes 2–4 weeks. Multivariable MR, colocalization extensions, or multi-contrast design optimization take 3–5 weeks. Exact timelines are confirmed at kickoff.

What do Pepkio statistical analysis deliverables look like?

MR projects receive harmonized GWAS tables, primary and sensitivity MR result spreadsheets, scatter/funnel/leave-one-out plots, STROBE-MR mapping, commented R scripts, and a Methods draft. Design projects receive power curves, recommended replicate counts and sequencing depth, a simulation log, and a design memo suitable for grant Methods or Power sections.

Can Pepkio use my in-house GWAS summary statistics instead of OpenGWAS?

Yes. Provide per-SNP beta, SE, effect allele, other allele, and sample size in a documented format. Pepkio harmonizes in-house files against the same allele-alignment rules used for OpenGWAS retrieval (Burgess et al., 2023). Mixed-ancestry or non-GRCh38 builds are supported when scoped at kickoff.

How many RNA-seq replicates do I need for differential expression?

It depends on effect size, dispersion, and sequencing depth; there is no fixed number. Ching et al. (2014) showed replicate count dominates power once depth exceeds ~20 million reads per sample. Pepkio runs PROPER or RNASeqDesign simulations and recommends replicate count at your target FDR.
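A closed-form approximation shows how these inputs interact. The sketch below is a plain-Python reading of the sample-size formula described by Hart et al. (2013), n = 2(z_{1-α/2} + z_β)²(1/μ + σ²) / (ln Δ)², which underlies RNASeqPower; it is not a call into that package. Here `depth` (μ) is the average read coverage of a gene rather than total reads per sample, `cv` (σ) is the biological coefficient of variation, and the value 0.4 is an assumed input for the example.

```python
import math
from statistics import NormalDist


def samples_per_group(depth, cv, effect, alpha=0.05, power=0.8):
    """Approximate samples per group to detect a fold change `effect`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # 1/depth is the counting (Poisson) noise; cv^2 is the biological noise.
    noise = 1.0 / depth + cv ** 2
    return 2 * (z_alpha + z_beta) ** 2 * noise / math.log(effect) ** 2


# Required replicates for a 2-fold change at increasing per-gene coverage.
by_depth = {d: samples_per_group(depth=d, cv=0.4, effect=2.0)
            for d in (5, 20, 100, 1000)}
```

As coverage grows the 1/depth term vanishes and the requirement flattens at the value set by biological variation, which is why extra depth stops buying power while extra replicates continue to help.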

What is the difference between Mendelian randomization and experimental design at Pepkio?

Mendelian randomization estimates causal effects of an exposure on an outcome using genetic instruments and existing GWAS summary data—typically after data collection. Experimental design estimates whether a planned omics experiment can detect specified effects before you sequence—preventing underpowered studies. Both deliver version-pinned R code and a Methods draft.

Can Pepkio run multivariable or mediation Mendelian randomization?

Yes, when scoped at kickoff. Pepkio supports multivariable MR for correlated exposures and mediation-style extensions using TwoSampleMR and MendelianRandomization, with sensitivity analyses per Burgess et al. (2023). Complex network MR or sample-overlap corrections require explicit scoping.

Do you help with grant power-analysis sections?

Yes. Pepkio delivers a design memo with power curves, assumed effect sizes, FDR targets, recommended replicate counts, and cited tool versions suitable for grant Methods or Power sections. Simulations use pilot data when available or published reference datasets when not (Lin et al., 2019; Wu et al., 2015).

Do I receive the analysis code—and do I own it?

Yes—you retain full ownership. Pepkio delivers commented R scripts with conda or renv lock files via private Git or agreed file transfer. You can rerun, extend, or publish the code without restriction.

Can I be involved during the statistical analysis?

Yes. Checkpoint reviews take place after instrument selection or simulation setup, after primary analysis, and before final delivery. You can review exposure–outcome pairs, clumping thresholds, FDR targets, and power assumptions within agreed scope.

What happens if a journal reviewer requests changes after delivery?

Methods clarification and minor revisions within agreed scope (typically ≤20% of deliverables) are covered. Adding new exposure–outcome pairs, rerunning with different GWAS releases, or expanding design to new modalities are scoped and priced separately.

Can Pepkio run custom or non-standard statistical analyses?

Yes—when scoped at kickoff: MR-RAPS, multivariable or mediation MR, custom clumping or FDR thresholds, alternative power simulators (ssizeRNA, powsimR), non-RNA-seq count models, or client-specified output formats.

Related services

  • Transcriptomics: Generate pilot count matrices for power analysis or expression data for MR colocalization with GWAS loci.
  • Genomics & variant analysis: Variant-level annotation and LD reference panels to support instrument selection and harmonization.
  • Machine learning: Build predictive models from omics features after causal screening or adequately powered data collection.
  • Bioinformatics consulting: Feasibility assessment, modality selection, and scoping before committing to MR or design projects.
  • Custom analysis: Non-standard statistical extensions, multi-trait integration, or bespoke reporting beyond standard MR and design workflows.

References
  1. Burgess S, Davey Smith G, Davies NM, et al. Guidelines for performing Mendelian randomization investigations: update for summer 2023. Wellcome Open Research. 2023;4:186. https://doi.org/10.12688/wellcomeopenres.15555.3 (PMID: 32760811)
  2. Skrivankova VW, Richmond RC, Woolf BAR, et al. Strengthening the reporting of observational studies in epidemiology using Mendelian randomisation (STROBE-MR): explanation and elaboration. BMJ. 2021;375:n2233. https://doi.org/10.1136/bmj.n2233 (PMID: 34702754)
  3. Burgess S, Butterworth A, Thompson SG. Mendelian randomization analysis with multiple genetic variants using summarized data. Genetic Epidemiology. 2013;37(7):658–665. https://doi.org/10.1002/gepi.21758 (PMID: 24114802)
  4. Burgess S, Thompson SG. Interpreting findings from Mendelian randomization using the MR-Egger method. European Journal of Epidemiology. 2017;32(5):377–389. https://doi.org/10.1007/s10654-017-0259-x (PMID: 28527048)
  5. Verbanck M, Chen C-Y, Neale B, Do R. Detection of widespread horizontal pleiotropy in causal relationships inferred from Mendelian randomization between complex traits and diseases. Nature Genetics. 2018;50(5):693–698. https://doi.org/10.1038/s41588-018-0304-7 (PMID: 29686387)
  6. Hemani G, Stender S, Wolters FJ, et al. The rapid growth in Mendelian randomization studies. European Journal of Epidemiology. 2025;40(10):1165–1171. https://doi.org/10.1007/s10654-025-01317-7 (PMID: 41196509)
  7. Holmes MV, Asselbergs FW, Palmer TM, et al. Mendelian randomization of blood lipids for coronary heart disease. European Heart Journal. 2015;36(9):539–550. https://doi.org/10.1093/eurheartj/eht571 (PMID: 24474739)
  8. Millard LAC, Davies NM, Tilling K, Gaunt TR, Davey Smith G. Searching for the causal effects of body mass index in over 300,000 participants in UK Biobank, using Mendelian randomization. PLoS Genetics. 2019;15(2):e1007951. https://doi.org/10.1371/journal.pgen.1007951 (PMID: 30707692)
  9. Jeon H, Xie J, Jeon Y, et al. Statistical power analysis for designing bulk, single-cell, and spatial transcriptomics experiments: review, tutorial, and perspectives. Biomolecules. 2023;13(2):221. https://doi.org/10.3390/biomolecules13020221 (PMID: 36830591)
  10. Wu H, Wang C, Wu Z. PROPER: comprehensive power evaluation for differential expression using RNA-seq. Bioinformatics. 2015;31(2):233–241. https://doi.org/10.1093/bioinformatics/btu640 (PMID: 25273110)
  11. Lin CI, Liao SG, Liu P, Lee MLT, Park YS, Tseng GC. RNASeqDesign: a framework for RNA-Seq genome-wide power calculation and study design issues. Journal of the Royal Statistical Society Series C. 2019;68(3):683–704. https://doi.org/10.1111/rssc.12330 (PMID: 33692596)
  12. Hart SN, Therneau TM, Zhang Y, et al. Calculating sample size estimates for RNA sequencing data. Journal of Computational Biology. 2013;20(12):970–978. https://doi.org/10.1089/cmb.2012.0283 (PMID: 23961961)
  13. Ching T, Huang S, Garmire LX, et al. Power analysis and sample size estimation for RNA-Seq differential expression. RNA. 2014;20(11):1684–1693. https://doi.org/10.1261/rna.046011.114 (PMID: 25246651)
  14. Schmid KT, Höllbacher B, Cruceanu C, et al. scPower accelerates and optimizes the design of multi-sample single cell transcriptomic studies. Nature Communications. 2021;12:6625. https://doi.org/10.1038/s41467-021-26779-7 (PMID: 34785648)
  15. Hemani G, Zheng J, Elsworth B, et al. The MR-Base platform supports systematic causal inference across the human phenome. eLife. 2018;7:e34408. https://doi.org/10.7554/eLife.34408 (PMID: 29846171)
  16. Conesa A, Madrigal P, Tarazona S, et al. A survey of best practices for RNA-seq data analysis. Genome Biology. 2016;17(1):13. https://doi.org/10.1186/s13059-016-0881-8 (PMID: 26813401)
  17. ENCODE Consortium. ENCODE Guidelines and Best Practices for RNA-Seq (Revised December 2016). https://www.encodeproject.org/documents/cede0cbe-d324-4ce7-ace4-f0c3eddf5972/@@download/attachment/ENCODE%20Best%20Practices%20for%20RNA_v2.pdf
  18. GWAS Catalog. NHGRI-EBI GWAS Catalog statistics (accessed June 2026). https://www.ebi.ac.uk/gwas/
  19. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology. 2014;15(12):550. https://doi.org/10.1186/s13059-014-0550-8 (PMID: 25516281)

Individual services

Deep-dive pages for specific statistical analysis methods and workflows.

Let's Talk About Your Science

Tell us:

  • Your biological question
  • Data type and size
  • Timeline constraints

We'll tell you:

  • What's feasible
  • How long it will take
  • Exactly what it will cost

Contact us to start with a free consultation. Need everyday bench calculators? Try our free lab tools.