
Experimental Design Consulting Analysis Service — Prospective Power and Sample-Size Planning Before Sequencing

Experimental design consulting uses simulation- or voom-based power analysis to estimate whether a planned omics study can detect specified effects before sequencing (Jeon et al., 2023). Pepkio delivers design memos with power curves, replicate recommendations, and commented R scripts; custom designs are scoped at kickoff. The service supports academic, biotech, and pharma grant planning. For bulk RNA-seq, biological replicate count dominates power once depth reaches roughly 20 million reads per sample (Ching et al., 2014).

Key facts

Key facts about Experimental Design
Fact	Value
Supported platforms / instruments	Illumina NovaSeq, NextSeq, and HiSeq bulk RNA-seq (poly(A)+ and rRNA-depleted); 10x Genomics Chromium scRNA-seq; 10x Visium and Xenium spatial transcriptomics scoped separately; proteomics and metagenomics count matrices scoped separately
Input requirements	Pilot gene-level count matrix (≥2 biological replicates per ENCODE minimum; ≥3 preferred for dispersion estimates) or documented public reference dataset; target FDR; minimum detectable fold change or effect size; budget or per-sample cost constraints; sample manifest with biological vs technical replicate labels
Reference builds supported	Human GRCh38 (GENCODE v44 / Ensembl 110); mouse GRCm39 (GENCODE vM33 / Ensembl 110); custom references on request for pilot annotation alignment
Primary tools (with versions)	PROPER 1.44+; RNASeqDesign (GitHub); scPower 1.0+ (GitHub); RNASeqPower 1.52+; DESeq2 1.52+; ssizeRNA 1.3.3 (scoped on request)—pinned per project
Typical turnaround time	1–2 weeks (standard design memo from pilot or public priors); 2–3 weeks (multi-contrast designs, scRNA-seq cell-type power, or budget optimization across modalities)—confirmed at kickoff
Deliverable formats	Power-vs-sample-size and power-vs-depth curves (PDF/SVG); recommended replicate and depth tables (.csv, .xlsx); simulation parameter log; grant-ready design memo; commented R scripts; Methods draft
Key cited best-practice reference	Jeon et al. (2023), Biomolecules; Ching et al. (2014), RNA; Schurch et al. (2016), RNA
Custom / bespoke analysis	Non-standard contrasts, proteomics or spatial power models, multi-omics replicate planning, client-specified FDR or effect-size targets, and analyses beyond standard bulk or scRNA-seq design workflows—scoped at kickoff

What is experimental design consulting?

Experimental design consulting applies negative-binomial or cell-level models to estimate statistical power, sample size, and sequencing depth before data collection, not after differential expression on finished datasets (Jeon et al., 2023). With three biological replicates, Schurch et al. (2016) found that most common DE tools recover only 20–40% of the significantly differentially expressed (SDE) genes identified with the full set of clean replicates from a 48-replicate-per-condition experiment. Pepkio scopes each project to the modality and estimand at kickoff; custom designs beyond standard bulk or scRNA-seq workflows are supported when documented in advance. See the experimental design glossary.

When should you use experimental design consulting?

Experimental design consulting fits when you have not yet committed to final sample counts or sequencing depth and need a defensible power justification for a grant, protocol, or sequencing core quote.

Comparison of simulation-based experimental design consulting, rule-of-thumb guidelines, and post-hoc analysis
Approach	Best for	Limitations	Approximate cost range
Simulation-based experimental design consulting	Grant power sections, pilot-to-production scaling, budget allocation between replicates and depth	Requires effect-size and dispersion assumptions; spatial and proteomics frameworks less standardized than bulk RNA-seq	Lower than full downstream analysis; 1–2 week design memo typical
Rule-of-thumb guidelines only (e.g., ENCODE n≥2, ~20–30M reads)	Quick feasibility screening	Ignores study-specific dispersion and effect size; often underpowered for comprehensive DE (Schurch et al., 2016)	No consulting fee; higher risk of failed or inconclusive study
Post-hoc analysis after sequencing	When data already exist and the question is inference, not planning	Cannot recover missing biological replication; sunk sequencing cost if underpowered	Full analysis cost; may require additional samples

Budget-constrained bulk RNA-seq: Lin et al. (2019) developed RNASeqDesign to jointly optimize biological replicate count and sequencing depth under a fixed per-study budget, showing that local power optima depend on dispersion and cost ratios rather than maximum depth alone.
Multi-sample scRNA-seq cell-type DE: Schmid et al. (2021) applied scPower to design multi-donor scRNA-seq studies, modeling power as a function of donors, cells per donor, and sequencing depth for cell-type-specific differential expression.
Replicate planning before a large DE study: Schurch et al. (2016) sequenced 48 biological replicates per condition and subsampled them to benchmark DE tools; three replicates recover only 20–40% of SDE genes, and the authors suggest at least 12 replicates when SDE genes must be identified across all fold changes.

How the analysis works — step by step

1. Scope the design question and estimands
Pepkio confirms modality, primary contrast, target FDR, minimum detectable effect size, and unit of inference (sample, donor, or cell type). Paired, blocked, or multi-factor designs are documented before simulation (Jeon et al., 2023). Confounded batch-by-condition layouts are flagged.
Tools and outputs
Tools used: Custom scoping template
Output: design_scope.md with primary estimand, contrast definition, and assumption log
2. Inventory inputs and constraints
Client pilot count matrices, public reference datasets, sample manifests, and budget ceilings are catalogued. Technical replicates from the same library are not treated as independent biological units (ENCODE Consortium, 2016).
Tools and outputs
Tools used: Custom validation scripts
Output: input_manifest.csv with sample IDs, condition, batch, replicate type, and budget fields
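The replicate rule in this step can be checked mechanically. The following is a minimal Python sketch, not Pepkio's validation script: it assumes a manifest with the hypothetical fields `condition`, `biological_unit`, and `replicate_type` (the real input_manifest.csv schema is agreed at kickoff) and counts independent biological units per condition, so that technical replicates of one library are not counted twice.

```python
from collections import defaultdict

def biological_n(manifest):
    """Count distinct biological units per condition.

    manifest: list of dicts with the hypothetical keys
    'condition' and 'biological_unit' (illustrative schema only).
    """
    units = defaultdict(set)
    for row in manifest:
        units[row["condition"]].add(row["biological_unit"])
    return {cond: len(ids) for cond, ids in units.items()}

# Illustrative manifest: S1 and S2 are two libraries from the same animal.
manifest = [
    {"sample_id": "S1", "condition": "ctrl", "biological_unit": "mouse1", "replicate_type": "biological"},
    {"sample_id": "S2", "condition": "ctrl", "biological_unit": "mouse1", "replicate_type": "technical"},
    {"sample_id": "S3", "condition": "ctrl", "biological_unit": "mouse2", "replicate_type": "biological"},
    {"sample_id": "S4", "condition": "trt", "biological_unit": "mouse3", "replicate_type": "biological"},
]
print(biological_n(manifest))
```

In this example the control group has three libraries but only two biological units, which is the number that should enter any power calculation.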
3. Validate pilot count matrix QC
When a pilot matrix is provided, Pepkio assesses library sizes, gene detection rates, and outlier libraries before parameter estimation. Matrices with fewer than three biological replicates per condition are accepted with documented uncertainty (ENCODE Consortium, 2016; Schurch et al., 2016).
Tools and outputs
Tools used: DESeq2 1.52+ (diagnostic size factors and dispersion previews)
Output: pilot_qc_summary.csv with library size, detected genes, and outlier flags per sample
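As an illustration of the per-sample summary described in this step, here is a minimal Python sketch. It is an assumption-laden stand-in, not the DESeq2-based diagnostics named above: the outlier rule (log10 library size more than three scaled median absolute deviations from the median) and the function name `pilot_qc` are hypothetical choices made for the example.

```python
import math
import statistics

def pilot_qc(counts, mad_threshold=3.0):
    """Summarize each pilot library.

    counts: dict mapping sample ID -> list of raw gene counts.
    The MAD-based outlier rule is an illustrative assumption.
    """
    log_sizes = {s: math.log10(sum(c) + 1) for s, c in counts.items()}
    med = statistics.median(log_sizes.values())
    mad = statistics.median(abs(v - med) for v in log_sizes.values())
    rows = []
    for sample, c in counts.items():
        deviation = abs(log_sizes[sample] - med)
        # 1.4826 scales the MAD to match a standard deviation under normality.
        outlier = mad > 0 and deviation / (1.4826 * mad) > mad_threshold
        rows.append({
            "sample": sample,
            "library_size": sum(c),
            "detected_genes": sum(1 for x in c if x > 0),
            "outlier": outlier,
        })
    return rows
```

A very low-yield library stands out on the log scale even when the other libraries differ by ten percent or so, which is why the sketch works on log10 library size rather than raw totals.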
4. Estimate dispersion and effect-size priors
Mean expression, dispersion, and fold-change distributions are estimated from pilot counts using PROPER estParam() or DESeq2 dispersion fitting; public reference datasets (e.g., Bottomly, Cheung) are used when no pilot exists (Wu et al., 2015).
Tools and outputs
Tools used: PROPER 1.44+; DESeq2 1.52+
Output: simulation_priors.rds and simulation_priors.csv with mean, dispersion, and DE-gene fraction parameters
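To make the dispersion parameter concrete: under a negative-binomial model the variance of a gene's count is mu + phi * mu^2, so a crude per-gene dispersion can be read off the sample mean and variance. The Python sketch below shows only this method-of-moments idea. It is not what PROPER estParam() or DESeq2 do internally (both use more careful estimation with shrinkage across genes), and it assumes the counts have already been scaled to a common library size.

```python
import statistics

def moment_dispersion(gene_counts):
    """Method-of-moments negative-binomial estimate for one gene.

    Returns (mean, dispersion), where variance = mean + dispersion * mean**2.
    Assumes counts are already normalized to a common library size.
    """
    mu = statistics.mean(gene_counts)
    if mu == 0:
        return mu, 0.0
    var = statistics.variance(gene_counts)  # sample variance (n - 1 denominator)
    # Negative estimates mean no overdispersion beyond Poisson; clamp to zero.
    return mu, max((var - mu) / mu ** 2, 0.0)

print(moment_dispersion([80, 100, 120]))
```

With only two or three replicates this estimate is very noisy, which is one reason pilots with at least three biological replicates per condition are preferred.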
5. Select modality-specific power framework
Bulk RNA-seq uses PROPER 1.44+ or RNASeqPower 1.52+; budget-constrained studies add RNASeqDesign. scRNA-seq cell-type DE uses scPower 1.0+ (Jeon et al., 2023; Schmid et al., 2021). ssizeRNA 1.3.3 provides faster voom-based estimates when scoped (Bi and Liu, 2016).
Tools and outputs
Tools used: PROPER 1.44+; RNASeqPower 1.52+; scPower 1.0+; RNASeqDesign
Output: framework_selection_log.txt documenting tool choice and rationale
6. Run prospective power simulations
Negative-binomial counts are simulated under proposed replicate counts and depths; DE detection runs within each iteration. PROPER recommends at least 20 iterations for stable estimates (Wu et al., 2015).
Tools and outputs
Tools used: PROPER 1.44+ (runSims(), comparePower()); scPower 1.0+ (scRNA-seq)
Output: power_simulation_results.rds; per-iteration DE detection summaries
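The simulation loop in this step can be sketched for a single gene in plain Python. This is a simplified illustration under stated assumptions, not the PROPER runSims() implementation: counts are drawn as a gamma-Poisson mixture, the dispersion `phi` is treated as known, the test is a Wald test on the log ratio of group means, and there is no multiple-testing correction because only one gene is simulated. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

def poisson(rng, lam):
    """Draw one Poisson variate (Knuth's method; normal approximation for large lam)."""
    if lam > 500:
        return max(0, round(rng.gauss(lam, math.sqrt(lam))))
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def nb_sample(rng, mu, phi):
    """Negative-binomial draw as a gamma-Poisson mixture (variance mu + phi*mu^2)."""
    rate = rng.gammavariate(1.0 / phi, mu * phi)
    return poisson(rng, rate)

def simulated_power(n, mu, phi, fold_change, alpha=0.05, n_sims=1000, seed=1):
    """Fraction of simulated two-group experiments in which the gene is detected."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_sims):
        a = [nb_sample(rng, mu, phi) for _ in range(n)]
        b = [nb_sample(rng, mu * fold_change, phi) for _ in range(n)]
        m_a = sum(a) / n + 0.5  # small offset guards against log(0)
        m_b = sum(b) / n + 0.5
        se = math.sqrt((1 / m_a + phi) / n + (1 / m_b + phi) / n)
        if abs(math.log(m_b / m_a)) / se > z_crit:
            hits += 1
    return hits / n_sims

for n in (3, 6, 12):
    print(n, simulated_power(n, mu=50, phi=0.1, fold_change=2.0))
```

A production simulation differs in the ways that matter most for real designs: it draws thousands of genes from pilot-calibrated mean and dispersion distributions, runs the planned DE method in each iteration, and reports power at a target FDR rather than at a per-gene alpha.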
7. Sweep replicate vs depth tradeoffs
Pepkio evaluates power across a grid of replicate counts and reads per sample. Ching et al. (2014) showed that adding samples raises power more than adding depth once depth exceeds roughly 20 million reads per sample. RNASeqDesign finds the local budget optimum when a fixed budget is supplied (Lin et al., 2019).
Tools and outputs
Tools used: PROPER 1.44+ (power.seqDepth()); RNASeqDesign
Output: replicate_depth_grid.csv with columns: n_replicates_per_group, sequencing_depth_m_reads, marginal_power, estimated_cost
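The shape of such a grid can be illustrated with a closed-form approximation instead of simulation. The Python sketch below rearranges the sample-size formula of Hart et al. (2013), n = 2 (z_{1-alpha/2} + z_power)^2 (1/mu + cv^2) / (ln fold change)^2, to give power for a single gene. It is not the PROPER or RNASeqDesign computation. The assumptions are hypothetical and chosen only for the example: the gene's mean count scales linearly with depth (`counts_per_m` counts per million reads), the biological coefficient of variation is 0.4, alpha is a per-gene level rather than an FDR, and cost is a fixed prep cost plus a per-million-read cost.

```python
import math
from statistics import NormalDist

def analytic_power(n, depth_m, fold_change, cv=0.4, counts_per_m=1.0, alpha=0.05):
    """Approximate single-gene power for n replicates per group at depth_m million reads."""
    mu = counts_per_m * depth_m  # assumed mean count for the gene at this depth
    se = math.sqrt(2 * (1 / mu + cv ** 2) / n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(math.log(fold_change)) / se - z_crit)

def replicate_depth_grid(reps, depths_m, fold_change, prep_cost=100.0, cost_per_m=5.0):
    """Rows mirroring the replicate_depth_grid.csv columns for a two-group design."""
    rows = []
    for n in reps:
        for d in depths_m:
            rows.append({
                "n_replicates_per_group": n,
                "sequencing_depth_m_reads": d,
                "marginal_power": round(analytic_power(n, d, fold_change), 3),
                "estimated_cost": 2 * n * (prep_cost + d * cost_per_m),
            })
    return rows

for row in replicate_depth_grid([3, 6], [20, 40], fold_change=2.0):
    print(row)
```

Under these assumptions the 1/mu term shrinks as depth grows while cv^2 does not, so doubling depth from 20 to 40 million reads changes power much less than doubling replicates from three to six. That is the pattern Ching et al. (2014) describe.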
8. Model batch and blocking factors
Randomized block designs and balanced batch allocation are evaluated for effective power (ENCODE Consortium, 2016). Pepkio recommends blocking at the design stage rather than relying on post-hoc correction after confounded library prep.
Tools and outputs
Tools used: Custom design scripts; PROPER 1.44+ (paired and multi-factor extensions when scoped)
Output: blocking_recommendations.md with batch layout diagram and confounding audit
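A balanced layout and a confounding audit are simple to express in code. The Python sketch below is one possible approach, not Pepkio's design script: it shuffles samples within each condition and deals them round-robin into batches, then reports any batch that lacks at least one condition. Function names and the `(sample_id, condition)` input shape are hypothetical.

```python
import random
from collections import defaultdict

def assign_batches(samples, n_batches, seed=0):
    """Randomize samples to batches while balancing conditions.

    samples: list of (sample_id, condition) tuples.
    Returns a dict mapping sample_id -> batch index.
    """
    rng = random.Random(seed)
    by_condition = defaultdict(list)
    for sample_id, condition in samples:
        by_condition[condition].append(sample_id)
    layout = {}
    for condition in sorted(by_condition):
        ids = by_condition[condition]
        rng.shuffle(ids)  # random order within the condition
        for i, sample_id in enumerate(ids):
            layout[sample_id] = i % n_batches  # deal round-robin across batches
    return layout

def confounded_batches(samples, layout):
    """Return batch indices that do not contain every condition."""
    all_conditions = {condition for _, condition in samples}
    seen = defaultdict(set)
    for sample_id, condition in samples:
        seen[layout[sample_id]].add(condition)
    return sorted(b for b, conds in seen.items() if conds != all_conditions)

samples = [(f"C{i}", "ctrl") for i in range(6)] + [(f"T{i}", "trt") for i in range(6)]
layout = assign_batches(samples, n_batches=3)
print(layout)
print(confounded_batches(samples, layout))
```

Running the audit on a proposed layout before library prep is the cheap part; a batch that contains only one condition cannot be separated from the biological effect afterwards.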
9. Compute detectable effect size at target power and FDR
For each design point, Pepkio reports minimum detectable log₂ fold change (or cell-type DE effect) at 80% power and target FDR. Underpowered designs are flagged with alternatives (Ching et al., 2014).
Tools and outputs
Tools used: PROPER 1.44+; RNASeqPower 1.52+; scPower 1.0+
Output: power_summary_table.csv with columns: modality, n_replicates_per_group, sequencing_depth_m_reads, target_fdr, power, detectable_log2fc
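For intuition about how the detectable effect shrinks with replication, the Hart et al. (2013) approximation can be solved for the fold change instead of the sample size. The Python sketch below does this for one gene. It is an illustration under assumptions, not the PROPER or scPower output: `alpha` is a per-gene significance level (a target FDR implies a stricter per-gene threshold), and `mean_count` and `cv` are hypothetical inputs that would come from the priors estimated earlier.

```python
import math
from statistics import NormalDist

def detectable_log2fc(n, mean_count, cv, alpha=0.05, power=0.80):
    """Smallest log2 fold change detectable with n replicates per group.

    Solves n = 2 (z_a + z_b)^2 (1/mean_count + cv^2) / (ln FC)^2 for FC.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    ln_fc = z * math.sqrt(2 * (1 / mean_count + cv ** 2) / n)
    return ln_fc / math.log(2)

for n in (3, 6, 12):
    print(n, round(detectable_log2fc(n, mean_count=20, cv=0.4), 2))
```

Because the detectable effect scales with 1 / sqrt(n), halving it requires roughly four times as many replicates, which is why underpowered designs are usually reported alongside a relaxed effect-size target as an alternative.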
10. Draft grant-ready design memo and package deliverables
Power curves, assumption tables, and recommended replicate counts are compiled into a grant-ready design memo. Commented R scripts reproduce simulations; a Methods draft cites software versions and reference datasets.
Tools and outputs
Tools used: PROPER 1.44+; RNASeqDesign; scPower 1.0+; ggplot2
Output: design_memo.pdf; power curve figures (PDF/SVG); commented R scripts; README; Methods draft

What Pepkio delivers

Processed data files

power_summary_table.csv, replicate_depth_grid.csv, simulation_priors.csv, and pilot_qc_summary.csv when a pilot matrix is supplied

Figures (PDF/SVG)

Power-vs-sample-size curves; power-vs-sequencing-depth curves
Detectable log₂ fold change vs replicate count
Stratified power histograms when multiple effect-size tiers are modeled

Tables

input_manifest.csv; power_summary_table.csv; replicate_depth_grid.csv
simulation_priors.csv (columns: parameter, estimate, source_dataset)

Code

Commented R scripts per analysis stage; environment lock files (renv.lock, sessionInfo(), or conda export)
Delivery via private Git repository or agreed file transfer

Documentation

Grant-ready design memo with assumptions audit and limitation section; README with rerun instructions
Methods draft citing software versions, reference builds, and public datasets used for priors
Post-delivery reviewer support for clarification and minor revisions within agreed scope (typically ≤20% of project scope)

Technical decisions we make — and why

Power engine: PROPER simulation vs ssizeRNA analytical: Pepkio defaults to PROPER 1.44+ for pilot-calibrated negative-binomial simulation (Wu et al., 2015). ssizeRNA 1.3.3 provides faster voom-based estimates when scoped (Bi and Liu, 2016); Jeon et al. (2023) recommend ssizeRNA for bulk RNA-seq when pilot parameters are available.
Prior source: client pilot vs public reference: Client pilots are preferred because dispersion varies by tissue and library prep (Wu et al., 2015). Without a pilot, Pepkio uses published reference matrices (e.g., Bottomly, Cheung) and documents mismatch risk.
DE test in simulations: DESeq2 Wald vs edgeR: PROPER runSims() defaults to edgeR; DESeq2 1.52+ is used when the downstream plan specifies DESeq2 (Love et al., 2014).
FDR control: Benjamini–Hochberg at 0.05: Unless pre-specified otherwise, simulations target BH-adjusted FDR ≤0.05 (Wu et al., 2015). Stricter thresholds are scoped separately.
scRNA-seq unit of inference: donor-level or scPower cell-type models: Cells from the same donor are not independent biological replicates (Schmid et al., 2021). scPower 1.0+ models donors, cells per donor, and cell-type prevalence; pseudobulk per donor is recommended when donor count is limiting.
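For readers unfamiliar with the Benjamini–Hochberg adjustment used as the default FDR control, a minimal Python sketch of the procedure follows. It is a plain reimplementation for illustration (in R the equivalent is p.adjust with method "BH"); the function name is hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.005]
significant = [q <= 0.05 for q in bh_adjust(pvals)]
print(bh_adjust(pvals), significant)
```

Within each simulation iteration, genes whose adjusted value falls at or below the target (0.05 by default) count as detected, and power is the fraction of truly DE genes among them.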

Common questions

What is the minimum input required for experimental design consulting at Pepkio?

A pilot gene-level count matrix with metadata is preferred; ENCODE requires ≥2 biological replicates, and ≥3 per condition improve dispersion estimates (ENCODE Consortium, 2016). Without a pilot, Pepkio uses published reference datasets plus your target FDR, effect size, and budget. A sample manifest and contrast definition are required at kickoff.

Can Pepkio run power analysis if I have no pilot data yet?

Yes. Pepkio estimates dispersion from published reference matrices matched to tissue and library type (Wu et al., 2015; Lin et al., 2019). The design memo documents literature-derived priors and recommends a confirmatory pilot when budget allows.

How do you handle poor-quality or low-yield pilot libraries?

Outlier pilot samples are flagged in pilot_qc_summary.csv before parameter estimation. Very low-yield libraries may not represent production data; Pepkio discusses re-sequencing or using public priors with inflated dispersion (Hart et al., 2013).

Do you support Illumina bulk RNA-seq, 10x Chromium scRNA-seq, and spatial platforms?

Yes, for formats we can parameterize after kickoff. Bulk RNA-seq uses PROPER 1.44+ or RNASeqDesign; scRNA-seq cell-type DE uses scPower 1.0+ for 10x Chromium. Spatial (Visium, Xenium) and proteomics models are scoped separately—frameworks are less mature (Jeon et al., 2023).

How long does experimental design consulting take at Pepkio?

Standard memos typically complete in 1–2 weeks; multi-contrast, multi-population scRNA-seq, or cross-modality budget optimization may take 2–3 weeks. Timelines are confirmed at kickoff.

How are batch effects handled in experimental design—not after sequencing?

Pepkio recommends randomized, balanced blocking at the library-prep and sequencing stage so batch is not confounded with condition (ENCODE Consortium, 2016). The design memo includes a batch layout diagram and flags fully confounded designs. Post-hoc ComBat or surrogate variable correction is not a substitute for proper blocking and is out of scope for this service.

Do I own the code—and in what format is it delivered?

Yes—you retain full ownership of all scripts and outputs. Pepkio delivers commented R scripts with renv.lock or equivalent environment locks via private Git or agreed file transfer. Simulation objects use standard .rds and .csv formats reproducible on R ≥4.3 installations matching the locked environment.

Can I be involved during the design analysis?

Yes. When scoped at kickoff, checkpoint reviews occur after pilot QC and parameter estimation, after the replicate-vs-depth sweep, and before final memo delivery. You can review target FDR, effect-size assumptions, budget constraints, and blocking layout within agreed scope.

What does post-delivery reviewer support cover?

Support covers clarification of simulation methods, assumption tables, and minor figure or table revisions within agreed scope (typically ≤20% of project scope). Requests to re-run with a different modality, new contrasts, or updated pilot data after sequencing begins are scoped and priced separately.

Is co-authorship required?

No. Pepkio operates strictly as a fee-for-service provider unless co-authorship is explicitly discussed in advance.

Should I prioritize more biological replicates or higher sequencing depth?

Increasing biological replicate count yields more power than increasing sequencing depth once depth reaches approximately 20 million reads per sample (Ching et al., 2014). Schurch et al. (2016) suggest at least six biological replicates per condition, rising to at least 12 when identifying SDE genes across all fold changes. Pepkio quantifies the tradeoff for your pilot dispersion and budget using PROPER and RNASeqDesign sweeps.

How many donors vs cells per donor do I need for scRNA-seq cell-type DE?

Individual cells are not independent biological replicates; donor-level variation drives inferential power (Schmid et al., 2021). scPower 1.0+ models power as a function of donor count, cells per donor, cell-type prevalence, and sequencing depth. Under a fixed budget, shallow sequencing of more cells often outperforms deep sequencing of fewer cells for cell-type-specific DE (Schmid et al., 2021).
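One common way to respect the donor as the unit of inference is to aggregate cells into one pseudobulk profile per donor and cell type before testing. The Python sketch below shows only that aggregation step, with a hypothetical input shape of `(donor, cell_type, counts)` tuples; it is not part of scPower and does not compute power.

```python
def pseudobulk(cells):
    """Sum gene counts over cells for each (donor, cell_type) pair.

    cells: iterable of (donor, cell_type, counts) where counts is a list
    of per-gene counts in a fixed gene order.
    """
    aggregated = {}
    for donor, cell_type, counts in cells:
        key = (donor, cell_type)
        if key not in aggregated:
            aggregated[key] = [0] * len(counts)
        aggregated[key] = [a + b for a, b in zip(aggregated[key], counts)]
    return aggregated

cells = [
    ("donor1", "T cell", [1, 0]),
    ("donor1", "T cell", [2, 3]),
    ("donor2", "T cell", [0, 1]),
    ("donor1", "B cell", [5, 5]),
]
print(pseudobulk(cells))
```

After aggregation the sample size for a given cell type is the number of donors that contribute cells of that type, however many cells each donor supplied.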

Can Pepkio write the power-analysis section for my grant application?

Pepkio supplies the technical content rather than the final prose. The design memo includes power curves, effect-size assumptions, FDR targets, replicate and depth recommendations, and cited tool versions suitable for a grant's power section (Lin et al., 2019; Wu et al., 2015). Final grant wording remains the client's responsibility.

Related services

Mendelian randomization — Causal inference from GWAS summary statistics after data collection, complementing pre-study power planning.
Bulk RNA-seq — Execute differential expression analysis once an adequately powered bulk design is finalized.
Single-cell RNA-seq — Cell-type resolution analysis when scPower design targets heterogeneous tissue.
Bioinformatics consulting — Modality and assay selection before committing to a power-analysis project.
Custom analysis — Non-standard power models, multi-omics replicate planning, or bespoke simulation frameworks beyond standard bulk and scRNA-seq design.

References

Jeon H, Xie J, Jeon Y, et al. Statistical power analysis for designing bulk, single-cell, and spatial transcriptomics experiments: review, tutorial, and perspectives. Biomolecules. 2023;13(2):221. https://doi.org/10.3390/biom13020221 (PMID: 36830591)
Hart SN, Therneau TM, Zhang Y, et al. Calculating sample size estimates for RNA sequencing data. Journal of Computational Biology. 2013;20(12):970–978. https://doi.org/10.1089/cmb.2012.0283 (PMID: 23961961)
Ching T, Huang S, Garmire LX. Power analysis and sample size estimation for RNA-Seq differential expression. RNA. 2014;20(11):1684–1696. https://doi.org/10.1261/rna.046011.114 (PMID: 25246651)
Schurch NJ, Schofield P, Gierliński M, et al. How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA. 2016;22(6):839–851. https://doi.org/10.1261/rna.053959.115 (PMID: 27022035)
Wu H, Wang C, Wu Z. PROPER: comprehensive power evaluation for differential expression using RNA-seq. Bioinformatics. 2015;31(2):233–241. https://doi.org/10.1093/bioinformatics/btu640 (PMID: 25273110)
Lin CW, Liao SG, Liu P, et al. RNASeqDesign: a framework for RNA-Seq genome-wide power calculation and study design issues. Journal of the Royal Statistical Society Series C: Applied Statistics. 2019;68(3):683–704. https://doi.org/10.1111/rssc.12330 (PMID: 33692596)
Schmid KT, Höllbacher B, Cruceanu C, et al. scPower accelerates and optimizes the design of multi-sample single cell transcriptomic studies. Nature Communications. 2021;12:6625. https://doi.org/10.1038/s41467-021-26779-7 (PMID: 34785648)
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology. 2014;15(12):550. https://doi.org/10.1186/s13059-014-0550-8 (PMID: 25516281)
ENCODE Consortium. ENCODE Guidelines and Best Practices for RNA-Seq (Revised December 2016). https://www.encodeproject.org/documents/cede0cbe-d324-4ce7-ace4-f0c3eddf5972/@@download/attachment/ENCODE%20Best%20Practices%20for%20RNA_v2.pdf
Bi R, Liu P. Sample size calculation while controlling false discovery rate for differential expression analysis with RNA-sequencing experiments. BMC Bioinformatics. 2016;17:146. https://doi.org/10.1186/s12859-016-0994-9 (PMID: 27029470)
Bioconductor. PROPER. https://bioconductor.org/packages/release/bioc/html/PROPER.html
GitHub. scPower (heiniglab/scPower). https://github.com/heiniglab/scPower
CRAN. ssizeRNA 1.3.3. https://cran.r-project.org/package=ssizeRNA
Bioconductor. RNASeqPower. https://bioconductor.org/packages/release/bioc/html/RNASeqPower.html
GitHub. RNASeqDesign (MasakiLin/RNASeqDesign). https://github.com/MasakiLin/RNASeqDesign
Bioconductor. DESeq2. https://bioconductor.org/packages/release/bioc/html/DESeq2.html

Let's Talk About Your Science

Tell us:

• Your biological question
• Data type and size
• Timeline constraints

We'll tell you:

• What's feasible
• How long it will take
• Exactly what it will cost
