Transcriptomics

Bulk RNA Sequencing (RNA-seq) Analysis Service — Strand-Aware Quantification from Raw FASTQs to DESeq2 Contrasts

Bulk RNA sequencing (RNA-seq) measures pooled-tissue gene expression to compare conditions or treatments (Conesa et al., 2016). Pepkio delivers version-pinned bulk RNA-seq analysis from FASTQs or count matrices to DESeq2 contrasts, with full support for custom inputs, outputs, and non-standard workflows. For academic, biotech, and pharma clients at ≥30 million aligned reads per replicate (ENCODE Consortium, 2016); documented scripts, figures, and a Methods draft included.

Key facts

Key facts about Bulk RNA-seq
FactValue
Supported platforms / instrumentsIllumina NovaSeq, NextSeq, and HiSeq (paired- and single-end); poly(A)+ and rRNA-depleted total RNA; strand-specific and non-strand-specific libraries; pre-built count matrices accepted
Input requirements≥30 million aligned reads per replicate for standard gene-level DE (ENCODE Consortium, 2016); ≥50 bp read length; ≥3 biological replicates per condition recommended (Conesa et al., 2016); higher depth than the ENCODE minimum may be needed for isoform- or fusion-focused work—confirmed at kickoff
Reference builds supportedHuman GRCh38 (GENCODE v44 / Ensembl 110); mouse GRCm39 (GENCODE vM33 / Ensembl 110); custom references on request
Primary tools (with versions)STAR 2.7.11b; Subread/featureCounts 2.0.8; DESeq2 1.52.0; Salmon 1.10.3; tximport 1.34.0; fastp 0.24.0; FastQC 0.12.1; MultiQC 1.25.1; samtools 1.21; RSeQC 5.0.3; clusterProfiler 4.20.0
Typical turnaround time2–4 weeks (standard cohort, 4–12 samples, one contrast); 4–6 weeks (multi-contrast designs, large cohorts, or extensive pathway modeling) — confirmed at kickoff
Deliverable formats.csv count matrices; .rds DESeq2 objects; sorted BAM on request; PDF/SVG figures; HTML QC report; documented R scripts; Methods draft
Key cited best-practice referenceConesa et al. (2016), Genome Biology; Love et al. (2014), Genome Biology (DESeq2)
Custom / bespoke analysisNon-standard inputs, outputs, and methods scoped at kickoff—e.g., Salmon-only quantification, edgeR/limma-voom, rMATS splicing, fusion calling, or client-specified contrasts and figures

What is bulk RNA sequencing (RNA-seq)?

Bulk RNA-seq aligns or quasi-maps short reads to a reference genome or transcriptome and aggregates signal into a sample-by-gene count matrix—measuring average expression across millions of cells per library rather than per-cell resolution (Conesa et al., 2016). It differs from scRNA-seq, which assigns UMIs to individual barcodes, and from microarrays by capturing novel splice junctions and unannotated transcripts when depth permits. The GTEx v8 resource sequenced 17,382 RNA-seq samples across 54 tissue sites from 948 donors, establishing bulk RNA-seq as the standard for tissue-level expression atlases (GTEx Consortium, 2020). Pepkio starts from FASTQs or existing count matrices and returns QC-audited DE results with documented parameters at each stage. Custom inputs, deliverable formats, and analyses beyond the standard workflow are agreed at kickoff. See the bulk RNA-seq glossary.

When should you use bulk RNA sequencing (RNA-seq)?

Bulk RNA-seq fits when biological variation operates at sample or condition level—drug response, genotype, disease stage, or treatment arms—and tissue composition is stable or deconvolution is acceptable.

Comparison of bulk RNA-seq, scRNA-seq, and spatial transcriptomics
ApproachBest forLimitationsApproximate cost range
Bulk RNA-seqCondition contrasts (treated vs. control), pathway enrichment, eQTL-style cohorts, homogeneous cell linesAverages across cell types; rare populations below detection; composition shifts can confound DELower per-sample sequencing and analysis cost than scRNA-seq for modest cohorts
scRNA-seq (droplet)Cell-type discovery, rare populations, trajectory inference in heterogeneous tissueDissociation artifacts; higher per-cell cost; no native spatial contextLibrary prep + sequencing and bioinformatics vary by cell count and integration scope
Spatial transcriptomics (e.g., 10x Visium, Xenium)Tissue architecture, microenvironment niches, region-specific expressionLower depth per cell; sectioning constraints; platform-specific captureHigher per-study cost than dissociated bulk or scRNA-seq for equivalent gene coverage
  • Cross-tissue regulation: GTEx v8 profiled 17,382 samples across 54 tissues, linking cis-eQTLs to tissue-specific gene regulation (GTEx Consortium, 2020).
  • Primary vs. metastatic melanoma: Sun et al. (2019) analyzed TCGA bulk RNA-seq from 103 primary and 368 metastatic melanoma samples, identifying 246 differentially expressed lncRNAs and 856 differentially expressed mRNAs (Sun et al., 2019).
  • Hypoxia-driven plasticity: Guo et al. (2019) used RNA-seq in LNCaP and PC3 prostate cancer lines to show ONECUT2 and hypoxia synergistically activate neuroendocrine marker programs.

How the analysis works — step by step

  1. 1. Validate inputs and sample metadata

    Pepkio confirms FASTQ integrity (or count-matrix dimensions), read layout, library type (poly(A)+, rRNA-depleted, strand-specific), and experimental design. Sample IDs, batch, condition, and covariates are recorded in sample_manifest.csv. Designs with fewer than three biological replicates per condition are flagged before DE testing (Conesa et al., 2016).

    Tools and outputs

    Tools used: FastQC 0.12.1 / custom validation scripts

    Output: sample_manifest.csv with library IDs, read counts, strandness flags, and QC notes

  2. 2. QC and trim raw reads

    Adapter contamination, low-quality tails, and overrepresented sequences are assessed per library. When trimming is warranted, reads are filtered before alignment. Aggregated metrics are compiled for review (Ewels et al., 2016).

    Tools and outputs

    Tools used: FastQC 0.12.1; fastp 0.24.0; MultiQC 1.25.1

    Output: Per-sample FastQC/fastp reports; multiqc_report.html

  3. 3. Align reads to the reference genome

    Paired- or single-end reads are aligned with splice-aware mapping to GRCh38 or GRCm39 with GENCODE annotation. Strand-specific library orientation is set from metadata or inferred from RSeQC when ambiguous (Wang et al., 2012). Mapping rates, uniquely mapped reads, and splice junction counts are compared against expected ranges for the library type.

    Tools and outputs

    Tools used: STAR 2.7.11b; samtools 1.21

    Output: Coordinate-sorted, indexed BAM files; star_log_final.txt per sample

  4. 4. Quantify gene-level counts

    Gene-level read counts are summarized from aligned BAMs using exon–gene overlap rules consistent with the annotation build. Multi-mapping reads are handled per Subread defaults; parameters are recorded for reproducibility (Liao et al., 2014).

    Tools and outputs

    Tools used: Subread/featureCounts 2.0.8

    Output: Raw count matrix (counts_raw.csv); featureCounts_summary.txt

  5. 5. Audit count matrix quality

    Sample-level library sizes, gene detection rates, and pairwise correlations are computed. PCA and hierarchical clustering separate biological condition from sequencing batch when possible. RSeQC reports read distribution over genomic features and confirms strandness (Wang et al., 2012). Outlier samples are flagged before modeling.

    Tools and outputs

    Tools used: DESeq2 1.52.0 (diagnostics); RSeQC 5.0.3

    Output: sample_qc_summary.csv; PCA and correlation heatmaps; RSeQC read-distribution plots

  6. 6. Filter low-count genes

    Genes with insufficient counts across samples are removed using a pre-specified rule—typically ≥10 counts in at least n samples, where n equals the smallest replicate group size (Love et al., 2014). Retained and excluded gene counts are logged.

    Tools and outputs

    Tools used: DESeq2 1.52.0

    Output: Filtered count matrix; gene_filter_log.csv

  7. 7. Model and test differential expression

    A design matrix encodes condition, batch, and covariates (e.g., sex, RIN, paired blocking). DESeq2 estimates size factors, dispersions, and Wald statistics for each contrast (Love et al., 2014). Paired designs use ~ subject + condition; multi-factor designs include interaction terms when biologically justified.

    Tools and outputs

    Tools used: DESeq2 1.52.0

    Output: DESeqDataSet .rds object; dds_results_<contrast>.csv

  8. 8. Apply shrinkage and multiple-testing correction

    Log₂ fold-changes can be shrunk with DESeq2 lfcShrink using apeglm or ashr on request. Benjamini–Hochberg adjusted p-values from DESeq2 control the false discovery rate across tested genes (Love et al., 2014). Contrasts with no genes passing a pre-agreed FDR threshold are reported with full ranked tables.

    Tools and outputs

    Tools used: DESeq2 1.52.0; Bioconductor apeglm on request

    Output: deg_results.csv with columns: gene_id, gene_name, baseMean, log2FoldChange, lfcSE, stat, pvalue, padj

  9. 9. Run pathway and GO enrichment

    Significant genes (padj threshold agreed at kickoff) are tested for over-representation in GO and KEGG gene sets. Dot plots and bar charts summarize enriched terms with gene counts and adjusted p-values (Yu et al., 2012).

    Tools and outputs

    Tools used: clusterProfiler 4.20.0

    Output: go_enrichment.csv, kegg_enrichment.csv; enrichment dot plots

  10. 10. Package figures, scripts, and Methods draft

    MA plots, volcano plots, and top-gene heatmaps are exported at publication resolution. Commented R scripts reproduce each stage from raw inputs. A journal-formatted Methods section cites exact software versions and reference builds. Salmon 1.10.3 + tximport 1.34.0 quantification is available as an alternative entry point when clients provide transcriptome references or prefer quasi-mapping without full BAM alignment (Patro et al., 2017)—scoped at kickoff.

    Tools and outputs

    Tools used: DESeq2 1.52.0; pheatmap; EnhancedVolcano on request

    Output: PDF/SVG figures; R scripts; README; Methods draft; final deliverable bundle

What Pepkio delivers

Processed data files

  • Raw and filtered count matrices (.csv), DESeq2 objects (.rds), alignment and counting summaries, and coordinate-sorted BAM files on request.

Figures (PDF/SVG)

  • MultiQC summary, read-quality plots, mapping rates, sample PCA, correlation heatmap, MA and volcano plots, top-DE gene heatmaps, GO/KEGG enrichment plots.

Tables

  • sample_manifest.csv, sample_qc_summary.csv, gene_filter_log.csv, deg_results.csv, go_enrichment.csv, kegg_enrichment.csv.

Code

  • Commented R scripts per analysis stage
  • Environment lock files (sessionInfo(), conda export, or equivalent)
  • Delivery via private Git repository or agreed file transfer

Documentation

  • HTML/PDF QC report with thresholds and outlier flags
  • README with reproduction instructions
  • Methods draft citing software versions and reference builds
  • Post-delivery reviewer support: clarification of methods and minor revisions within agreed scope (typically ≤20% of deliverables)

Technical decisions we make — and why

Alignment + counting: STAR 2.7.11b + featureCounts
Splice-aware genome alignment with explicit exon–gene overlap counting; Salmon + tximport on request for quasi-mapping workflows (Dobin et al., 2013; Liao et al., 2014; Patro et al., 2017).
DE engine: DESeq2 median-ratio normalization + Wald test
Operates on raw counts with a negative-binomial model; edgeR QL F-test or limma-voom on request (Love et al., 2014; Robinson et al., 2010; Law et al., 2014).
Batch handling: include batch in the design formula
Preferred over post-hoc correction when batch is known and not confounded with condition; ComBat-seq or surrogate variable analysis scoped separately when the design requires it (Love et al., 2014).
Gene filtering: ≥10 counts in ≥n samples
Removes genes with near-zero signal before dispersion estimation; DESeq2 independent filtering retained for the test step (Love et al., 2014).
Enrichment: clusterProfiler ORA on padj-filtered gene lists
GSEA or ReactomePA extensions on request (Yu et al., 2012).

Common questions

What is the minimum sample size, sequencing depth, and replicate count for bulk RNA-seq?

ENCODE recommends ≥30 million aligned reads per replicate for long poly(A)+ RNA-seq libraries and read lengths ≥50 bp (ENCODE Consortium, 2016). Conesa et al. (2016) recommend at least three biological replicates per condition. Pepkio can analyze smaller designs but documents reduced statistical power in the QC report. Targets are confirmed at kickoff.

Can you analyze low-quality or low-yield libraries?

Yes, with caveats documented in the QC report. Libraries below 30 million aligned reads may lack power for low-abundance genes or isoform-level work (ENCODE Consortium, 2016). Outlier samples in PCA are flagged before DE; re-sequencing is discussed when yield threatens the study question.

Do you support Illumina data from poly(A)+, rRNA-depleted, and strand-specific libraries?

Yes. Pepkio processes paired- and single-end Illumina FASTQs from NovaSeq, NextSeq, and HiSeq instruments. Poly(A)+ and rRNA-depleted libraries are supported; strand-specific orientation is set from library metadata or RSeQC inference (Wang et al., 2012). Pre-built count matrices in .csv or .txt can be imported when alignment is complete.

How long does bulk RNA-seq analysis take at Pepkio?

A standard project (roughly 4–12 samples, one primary contrast, gene-level DE, and GO/KEGG enrichment) typically completes in 2–4 weeks from data receipt. Multi-contrast designs, large cohorts (>24 samples), or extensions such as splicing or fusion analysis may take 4–6 weeks. Milestone check-ins during the project; exact timelines confirmed at kickoff.

How do you handle batch effects across sequencing runs or processing dates?

When batch is known and not fully confounded with condition, Pepkio includes batch in the DESeq2 design formula (Love et al., 2014). PCA and correlation heatmaps are reviewed before modeling. Post-hoc batch adjustment with ComBat-seq or surrogate variable analysis is scoped separately when required.

Do I own the code — and in what format is it delivered?

Yes — you retain full ownership of all code, scripts, and results. Pepkio delivers commented R scripts and environment lock files. Objects use standard .rds and .csv formats; R Markdown delivery is available on request.

Can I be involved during analysis?

Yes. Checkpoint reviews occur after QC, after sample-level audit, and before final delivery. You can review contrast definitions, covariate choices, and gene-filtering thresholds within agreed scope. A dedicated scientific contact leads the project.

What does post-delivery reviewer support include?

Support covers clarification of methods, QC thresholds, and minor figure or table revisions within agreed scope (typically ≤20% of deliverables). Substantial new analyses requested by reviewers are scoped separately.

Is co-authorship required?

No. Pepkio operates strictly as a fee-for-service provider unless co-authorship is explicitly discussed in advance.

Can you start from an existing count matrix instead of raw FASTQs?

Yes. Pepkio imports gene-level count matrices in .csv, .tsv, or SummarizedExperiment-compatible formats, validates names against the agreed annotation, and proceeds from sample QC through DESeq2. The Methods draft states the upstream quantification source you provide.

Should I use DESeq2, edgeR, or limma-voom for my bulk RNA-seq DE analysis?

Pepkio defaults to DESeq2 for standard negative-binomial DE on raw counts (Love et al., 2014). edgeR quasi-likelihood F-tests suit some designs with multiple factors; limma-voom fits when voom transformation is preferred or for microarray-style workflows (Robinson et al., 2010; Law et al., 2014). The Methods draft states the engine and design formula used.

How do you handle heterogeneous tissues where cell-type composition may shift between conditions?

Bulk RNA-seq averages across cell types; composition shifts can mimic or mask condition effects (Conesa et al., 2016). Pepkio documents this limitation in the QC report and recommends scRNA-seq or deconvolution when heterogeneity is expected. Cell-type deconvolution, rMATS splicing, fusion calling, and other extensions are scoped as separate milestones at kickoff.

Related services

References
  1. Conesa A, Madrigal P, Tarazona S, et al. A survey of best practices for RNA-seq data analysis. Genome Biology. 2016;17(1):13. https://doi.org/10.1186/s13059-016-0881-8 (PMID: 26813401)
  2. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology. 2014;15(12):550. https://doi.org/10.1186/s13059-014-0550-8 (PMID: 25516281)
  3. ENCODE Consortium. ENCODE Guidelines and Best Practices for RNA-Seq (Revised December 2016). https://www.encodeproject.org/documents/cede0cbe-d324-4ce7-ace4-f0c3eddf5972/@@download/attachment/ENCODE%20Best%20Practices%20for%20RNA_v2.pdf
  4. GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369(6509):1318–1330. https://doi.org/10.1126/science.aaz1776 (PMID: 32913098)
  5. Dobin A, Davis CA, Schlesinger F, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15–21. https://doi.org/10.1093/bioinformatics/bts635 (PMID: 23104886)
  6. Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30(7):923–930. https://doi.org/10.1093/bioinformatics/btt656 (PMID: 24227677)
  7. Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods. 2017;14(4):417–419. https://doi.org/10.1038/nmeth.4197 (PMID: 28263959)
  8. Wang L, Wang S, Li W. RSeQC: quality control of RNA-seq experiments. Bioinformatics. 2012;28(16):2184–2185. https://doi.org/10.1093/bioinformatics/bts356 (PMID: 22743226)
  9. Yu G, Wang L-G, Han Y, He Q-Y. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 2012;16(5):284–287. https://doi.org/10.1089/omi.2011.0118 (PMID: 22455463)
  10. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–140. https://doi.org/10.1093/bioinformatics/btp616 (PMID: 19910308)
  11. Law CW, Chen Y, Shi W, Smyth GK. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology. 2014;15(2):R29. https://doi.org/10.1186/gb-2014-15-2-r29 (PMID: 24485249)
  12. Sun L, Guan Z, Wei S, Tan R, Li P, Yan L. Identification of long non-coding and messenger RNAs differentially expressed between primary and metastatic melanoma. Frontiers in Genetics. 2019;10:292. https://doi.org/10.3389/fgene.2019.00292 (PMID: 31024618)
  13. Guo H, Ahmed M, Zhang F, et al. ONECUT2 is a driver of neuroendocrine prostate cancer. Nature Communications. 2019;10:278. https://doi.org/10.1038/s41467-018-08133-6 (PMID: 30655535)
  14. Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32(19):3047–3048. https://doi.org/10.1093/bioinformatics/btw354 (PMID: 27312411)
  15. Dobin A, Davis CA, Schlesinger F, et al. STAR 2.7.11b release. https://github.com/alexdobin/STAR/releases/tag/2.7.11b
  16. Bioconductor. DESeq2 1.52.0 release manual. https://bioconductor.org/packages/release/bioc/html/DESeq2.html
  17. Bioconductor. clusterProfiler 4.20.0. https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html

Let's Talk About Your Science

Tell us:

  • • Your biological question
  • • Data type and size
  • • Timeline constraints

We'll tell you:

  • • What's feasible
  • • How long it will take
  • • Exactly what it will cost
Contact Us

Contact us to start with a free consultation. Need everyday bench calculators? Try our free lab tools.