
Metatranscriptomics Analysis Service — Active Pathway Expression from rRNA-Filtered FASTQs to HUMAnN Profiles and Differential-Activity Tables

Metatranscriptomics profiles actively expressed microbial genes and pathways, showing which community members are metabolically active rather than merely present (Zhang et al., 2021). Pepkio delivers version-pinned QC, rRNA filtering, MetaPhlAn and HUMAnN quantification, differential-activity testing, code, and a Methods draft for academic, biotech, and pharma teams. Custom inputs, outputs, and non-standard analyses are scoped at kickoff. Human gut projects typically target at least 40–50 million raw paired-end reads per sample (Westreich et al., 2016).

Key facts

Key facts about Metatranscriptomics
Fact	Value
Supported platforms / instruments	Illumina NovaSeq X / 6000 / NextSeq 2000, HiSeq 2500/4000; MGI DNBSEQ-T7 / G400 / G99 when scoped at kickoff; ribo-depleted total RNA (e.g., Ribo-Zero, QIAseq FastSelect, NEBNext rRNA Depletion); stranded libraries when metadata provided; pre-built HUMAnN or MetaPhlAn outputs accepted on request
Input requirements	At least 40–50 million raw paired-end reads per human gut sample for >90% accuracy on low-abundance transcript estimates after annotation (Westreich et al., 2016); read length of at least 2×100 bp (2×100 bp or 2×150 bp typical); ≥3 biological replicates per condition recommended for differential testing, with smaller designs flagged at kickoff; RIN and RNA extraction metadata encouraged; paired shotgun metagenomes optional for gene-copy normalization
Reference builds supported	ChocoPhlAn SGB (Jun 2023) for HUMAnN 3.9; MetaPhlAn 4.1.0 marker database; SILVA 138 (SortMeRNA rRNA filtering); UniRef90 for HUMAnN; host subtraction against GRCh38 or GRCm39 when scoped
Primary tools (with versions)	SortMeRNA 4.3.7; HUMAnN 3.9; MetaPhlAn 4.1.0; Kraken2 2.1.3; Bracken 2.9; Bowtie2 2.5.4; Salmon 1.10.3; fastp 0.24.0; FastQC 0.12.1; MultiQC 1.25.1; MaAsLin2 1.18.0; ANCOM-BC 2.4.0; MEGAHIT 1.2.9 (assembly scoped on request)
Typical turnaround time	5–8 weeks (standard cohort, ≤24 samples, one contrast, profiling through differential activity); multi-contrast or paired DNA+RNA designs may extend timeline — confirmed at kickoff
Deliverable formats	HUMAnN pathabundance.tsv and genefamilies.tsv; MetaPhlAn relative-abundance profiles; differential-activity tables (.csv); PDF/SVG figures; HTML MultiQC report; commented R/Python scripts; Methods draft
Key cited best-practice reference	Zhang et al. (2021), Annual Review of Biomedical Data Science; Franzosa et al. (2018), Nature Methods (HUMAnN functional profiling)
Custom / bespoke analysis	Paired DNA normalization, Salmon assembly quantification, co-expression networks, AMR/virulence panels, custom references, or client-specified models — scoped at kickoff

What is metatranscriptomics?

Metatranscriptomics aligns and quantifies RNA-seq reads from mixed microbial communities to measure which genes and pathways are actively transcribed at sampling time, not merely encoded in community DNA. Unlike shotgun metagenomics, which reports gene catalog presence and copy number, metatranscriptomics captures dynamic activity such as nutrient utilization and stress responses (Franzosa et al., 2018). Unlike 16S amplicon sequencing, it resolves functional expression without inferring activity from taxonomy. Automated platforms have profiled cohorts exceeding 10,000 human stool samples (Hatch et al., 2019). Pepkio processes ribo-depleted FASTQs through host and rRNA filtering, taxonomic activity profiling, and HUMAnN pathway quantification with documented parameters; custom entry points are agreed at kickoff. See the metatranscriptomics glossary.

When should you use metatranscriptomics?

Metatranscriptomics fits when the research question requires active microbial function—pathway upregulation after treatment, transcriptional response to host immune activation, or time-resolved community activity—rather than static community membership.

Comparison of metatranscriptomics, shotgun metagenomics, and 16S amplicon sequencing
Approach	Best for	Limitations	Approximate cost range
Metatranscriptomics	Active pathway and gene-family expression; perturbation and treatment-response studies; pairing with host phenotypes	RNA degradation; rRNA depletion kit bias; requires several-fold greater sequencing depth than metagenomics to detect rare transcripts (Ojala et al., 2023)	Higher per-sample sequencing and bioinformatics cost than 16S or shallow shotgun
Shotgun metagenomics	Species/strain catalog, gene presence, MAG recovery, gene copy number	Does not distinguish expressed from silent genes	Moderate–high sequencing and storage cost
16S rRNA amplicon	Large cohorts, cost-effective taxonomy, longitudinal membership tracking	No direct functional expression; species resolution limited	Lowest per-sample cost

Drug metabolism by gut microbes: Javdan et al. (2020) mapped microbial drug-metabolizing activity in human communities and validated robust expression of a widespread 20β-HSDH gene in metatranscriptomic data from a patient-derived consortium.
Immune activation without compositional shift: Becattini et al. (2021) showed commensal transcription reprogrammed within 6 hours of host innate or adaptive immune activation—stress genes up, carbohydrate-utilization genes down—while 16S-based community composition remained stable.
Paired DNA and RNA in the human gut: Franzosa et al. (2018) demonstrated that metatranscriptomic pathway profiles resolve active metabolic responses that taxonomic or metagenomic DNA profiles alone do not capture.

How the analysis works — step by step

1. Validate inputs and sample metadata
Pepkio confirms FASTQ integrity (MD5 checksums), read layout, platform, ribo-depletion method, and experimental design. Sample metadata are recorded in sample_manifest.csv. Designs with fewer than three biological replicates per condition are flagged before differential testing.
Tools and outputs
Tools used: Custom validation scripts; md5sum
Output: sample_manifest.csv with library IDs, read counts, depletion kit, host species, and QC flags
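The two checks in this step can be sketched in a few lines of Python. This is a minimal illustration, not Pepkio's validation scripts: the function names `md5_of_file` and `under_replicated` are hypothetical, and the default threshold of three replicates simply mirrors the recommendation stated above.

```python
import hashlib
from collections import Counter


def md5_of_file(path, chunk_size=1 << 20):
    """Stream a file through MD5 so large FASTQs are never loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def under_replicated(conditions, min_replicates=3):
    """Return the conditions that have fewer than min_replicates samples."""
    counts = Counter(conditions)
    return sorted(c for c, n in counts.items() if n < min_replicates)
```

The hex digest returned by `md5_of_file` would be compared with the checksum supplied alongside each FASTQ; `under_replicated` takes one condition label per sample, as it might appear in a manifest column.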
2. QC and trim raw reads
Adapter contamination, low-quality tails, and overrepresented sequences are assessed per library. When trimming is warranted, reads are filtered before downstream steps (Chen et al., 2018). Aggregated metrics are compiled for review (Ewels et al., 2016).
Tools and outputs
Tools used: FastQC 0.12.1; fastp 0.24.0; MultiQC 1.25.1
Output: Per-sample FastQC/fastp reports; multiqc_report.html
3. Remove host reads
In mucosal swabs and tissue, host RNA can dominate without subtraction (Ojala et al., 2023). Stool often shows low host fractions (Westreich et al., 2016); host mapping is applied when scoped. Excessive host fractions are flagged before profiling.
Tools and outputs
Tools used: Bowtie2 2.5.4; GRCh38 or GRCm39 reference index
Output: Host-depleted FASTQ; host_removal_summary.csv with columns: sample_id, total_reads, host_reads, host_fraction
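As a sketch of how one row of host_removal_summary.csv could be derived from read counts, assuming the four columns listed above: the helper names and the 0.5 cutoff are hypothetical placeholders, since the actual flagging threshold is agreed per project.

```python
def host_removal_row(sample_id, total_reads, host_reads):
    """Build one summary row; host_fraction = host_reads / total_reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= host_reads <= total_reads:
        raise ValueError("host_reads must lie between 0 and total_reads")
    return {
        "sample_id": sample_id,
        "total_reads": total_reads,
        "host_reads": host_reads,
        "host_fraction": host_reads / total_reads,
    }


def flag_excessive_host(rows, max_host_fraction=0.5):
    """Return sample IDs whose host fraction exceeds a (hypothetical) cutoff."""
    return [r["sample_id"] for r in rows if r["host_fraction"] > max_host_fraction]
```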
4. Filter ribosomal RNA
rRNA that survives experimental depletion is removed computationally. Depletion kits leave residual rRNA in fractions that vary by species, so SortMeRNA filtering is applied to improve functional yield (Westreich et al., 2016; Ojala et al., 2023).
Tools and outputs
Tools used: SortMeRNA 4.3.7; SILVA 138 SSU/LSU databases
Output: rRNA-depleted FASTQ; rrna_filter_summary.csv with columns: sample_id, reads_pre_filter, rrna_reads, rrna_fraction, reads_post_filter
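A summary table with these columns lends itself to a quick internal-consistency audit. The sketch below assumes that reads_post_filter equals reads_pre_filter minus rrna_reads and that rrna_fraction equals rrna_reads divided by reads_pre_filter; both relationships are assumptions about how the columns are defined, and `audit_rrna_summary` is a hypothetical helper name.

```python
import csv
import io


def audit_rrna_summary(csv_text, tolerance=1e-6):
    """Return sample IDs whose rRNA summary row is internally inconsistent."""
    inconsistent = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        pre = int(row["reads_pre_filter"])
        rrna = int(row["rrna_reads"])
        post = int(row["reads_post_filter"])
        fraction = float(row["rrna_fraction"])
        # Assumed invariants: post = pre - rrna, fraction = rrna / pre.
        counts_ok = pre > 0 and post == pre - rrna
        fraction_ok = pre > 0 and abs(fraction - rrna / pre) <= tolerance
        if not (counts_ok and fraction_ok):
            inconsistent.append(row["sample_id"])
    return inconsistent
```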
5. Profile transcriptionally active taxa
MetaPhlAn 4.1.0 estimates relative abundance of transcriptionally active lineages via marker genes (Blanco-Míguez et al., 2023). Kraken2 2.1.3 + Bracken 2.9 is run as an optional cross-check when scoped (Wood et al., 2019; Lu et al., 2017).
Tools and outputs
Tools used: MetaPhlAn 4.1.0; Kraken2 2.1.3 + Bracken 2.9 (on request)
Output: metaphlan_profile.tsv; optional kraken2_bracken_abundance.csv
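Downstream plots usually need species-level rows only. The sketch below pulls them from a MetaPhlAn-style profile held as text, assuming tab-separated columns in the order clade_name, NCBI_tax_id, relative_abundance, with `#` comment lines and `|`-delimited lineages whose species rank starts with `s__`. The function name is hypothetical, and the column order should be checked against the header of the actual file.

```python
def species_abundances(profile_text):
    """Map species name to relative abundance from MetaPhlAn-style text."""
    species = {}
    for line in profile_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        clade, abundance = fields[0], float(fields[2])
        last_rank = clade.split("|")[-1]
        # Keep rows that end at species rank; deeper (t__) rows are skipped.
        if last_rank.startswith("s__"):
            species[last_rank[3:]] = abundance
    return species
```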
6. Quantify pathway and gene-family expression
HUMAnN 3.9 profiles microbial pathways and gene families from filtered reads, stratified by MetaPhlAn community composition (Franzosa et al., 2018). Reads are mapped to ChocoPhlAn SGB (Jun 2023) and UniRef90; pathway abundances are reported in reads per kilobase (RPK).
Tools and outputs
Tools used: HUMAnN 3.9; ChocoPhlAn SGB (Jun 2023); UniRef90
Output: pathabundance.tsv; genefamilies.tsv; per-sample HUMAnN logs
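HUMAnN tables interleave community totals with per-taxon rows, where the stratified rows carry the contributing taxon after a `|` in the feature name. A minimal sketch of separating the two, assuming a single-sample table with a `#` header line and tab-separated feature and value columns (the function name `split_stratified` is hypothetical):

```python
def split_stratified(table_text):
    """Split HUMAnN-style rows into community totals and per-taxon strata."""
    totals, strata = {}, {}
    for line in table_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        feature, value = line.split("\t")[:2]
        if "|" in feature:
            # Stratified row: "<pathway>|<taxon>"
            pathway, taxon = feature.split("|", 1)
            strata.setdefault(pathway, {})[taxon] = float(value)
        else:
            totals[feature] = float(value)
    return totals, strata
```

Differential testing is typically run on the unstratified totals, while the strata answer which organisms contribute to a pathway's expression.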
7. Normalize and audit expression tables
Pathway and gene-family tables are transformed (CPM, log-CPM, or CLR as agreed at kickoff). Library sizes, detection rates, and sample correlations are audited; PCA or NMDS ordination is reviewed for batch effects. Samples below agreed depth thresholds are flagged before testing (Westreich et al., 2016).
Tools and outputs
Tools used: R vegan 2.6-8.1; custom Python/R audit scripts
Output: expression_qc_summary.csv; PCA/NMDS ordination plots; sample correlation heatmap
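The CPM and CLR transforms named in this step can be illustrated for a single sample's abundance vector. This is a sketch using only the standard library; the pseudocount of 0.5 used to handle zeros in CLR is an assumption for illustration, not a documented project default.

```python
import math


def to_cpm(values):
    """Scale abundances so they sum to one million (copies per million)."""
    total = sum(values)
    if total <= 0:
        raise ValueError("total abundance must be positive")
    return [v / total * 1e6 for v in values]


def to_clr(values, pseudocount=0.5):
    """Centered log-ratio: log of each value minus the mean of the logs."""
    logs = [math.log(v + pseudocount) for v in values]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]
```

CLR values sum to zero within a sample by construction, which is one way to sanity-check a transformed table.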
8. Test differential pathway and taxon activity
MaAsLin2 1.18.0 fits multivariable models with covariates (condition, batch, age, BMI, etc.) on transformed tables for pathway and taxonomic features as agreed at kickoff (Mallick et al., 2021). ANCOM-BC 2.4.0 is used for compositional taxonomic features when appropriate (Lin & Peddada, 2020). Benjamini–Hochberg q-values control FDR across tested features.
Tools and outputs
Tools used: MaAsLin2 1.18.0; ANCOM-BC 2.4.0
Output: da_results_<contrast>.csv with columns: feature, coef, stderr, pval, qval, N; MaAsLin2 coefficient plots
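For readers who want to see how the qval column relates to pval, the Benjamini–Hochberg adjustment can be written out directly. This is a standalone illustration of the procedure, not the code path inside MaAsLin2 or ANCOM-BC, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues
```

Each p-value is multiplied by the number of tests divided by its rank, and the running minimum keeps a smaller p-value from receiving a larger q-value than the one ranked after it.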
9. Normalize by paired metagenome gene copy (when scoped)
When matched shotgun metagenomes are available, HUMAnN expression estimates can be adjusted for gene copy number to distinguish transcriptional upregulation from DNA abundance changes (Franzosa et al., 2018). This step is scoped at kickoff when paired DNA FASTQs or pre-computed metagenomic profiles are provided.
Tools and outputs
Tools used: HUMAnN 3.9 --taxonomic-profile; MetaPhlAn 4.1.0 on paired DNA
Output: pathabundance_copy_normalized.tsv; normalization log documenting paired sample mapping
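The idea behind copy-number adjustment can be shown as a simple RNA-to-DNA ratio per feature. The sketch below is a deliberate simplification and not HUMAnN's internal procedure: it assumes two dictionaries of feature abundances from a paired sample, converts each to within-sample proportions, and leaves the ratio undefined where the feature has no DNA support. The function name is hypothetical.

```python
def relative_expression(rna_abundance, dna_abundance):
    """Per-feature ratio of RNA proportion to DNA proportion for one sample."""
    rna_total = sum(rna_abundance.values())
    dna_total = sum(dna_abundance.values())
    if rna_total <= 0 or dna_total <= 0:
        raise ValueError("both profiles need a positive total abundance")
    ratios = {}
    for feature, rna_value in rna_abundance.items():
        dna_value = dna_abundance.get(feature, 0.0)
        if dna_value <= 0:
            ratios[feature] = None  # no DNA support: ratio is undefined
        else:
            ratios[feature] = (rna_value / rna_total) / (dna_value / dna_total)
    return ratios
```

A ratio above 1 suggests a feature is transcribed more than its gene copy number alone would predict; a ratio near 1 suggests expression tracks DNA abundance.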
10. Package figures, scripts, and Methods draft
Pathway heatmaps, taxonomic barplots, and differential-activity plots are exported at publication resolution. Commented scripts reproduce agreed pipeline stages within project scope. A Methods draft cites software versions and database builds (Meyer et al., 2022). Salmon 1.10.3 transcript-level quantification against de novo or reference assemblies is available when clients require gene-level counts beyond HUMAnN gene families—scoped at kickoff (Patro et al., 2017).
Tools and outputs
Tools used: R/Python plotting scripts; documented workflow archive
Output: PDF/SVG figures; R/Python scripts; README; Methods draft; final deliverable bundle

What Pepkio delivers

Processed data files

HUMAnN pathabundance.tsv and genefamilies.tsv; MetaPhlAn metaphlan_profile.tsv
da_results_<contrast>.csv; optional copy-normalized pathway tables; host/rRNA QC summaries

Figures (PDF/SVG)

MultiQC summary; read-quality, host/rRNA removal, and taxonomic activity plots
Pathway heatmaps; PCA/NMDS ordination; MaAsLin2 coefficient plots

Tables

sample_manifest.csv; da_results_<contrast>.csv
Optional Kraken2/Bracken abundances

Code

Commented R and Python scripts per stage; conda lockfiles or sessionInfo()
Delivery via private Git or agreed transfer

Documentation

QC report; README; Methods draft
Post-delivery support within agreed scope (typically ≤20% of deliverables)

Technical decisions we make — and why

rRNA removal: SortMeRNA 4.3.7 after experimental depletion. Kits leave species-skewed residual rRNA; remaining reads should be discarded before functional profiling (Westreich et al., 2016; Kopylova et al., 2012). Kit depletion alone is rejected because uneven depletion skews functional yield.
Functional quantification: HUMAnN 3.9 with MetaPhlAn-informed stratification. Widely used for cross-study pathway comparison with community-aware stratification (Franzosa et al., 2018). Assembly-based or Kraken-only alternatives are scoped when reference coverage is insufficient.
Differential testing: MaAsLin2 1.18.0 with explicit covariates. Multivariable models with batch and continuous covariates on transformed meta-omic tables (Mallick et al., 2021). ANCOM-BC 2.4.0 for compositional taxonomic features when appropriate (Lin & Peddada, 2020).
Host subtraction: Bowtie2 2.5.4 against GRCh38 or GRCm39. Host RNA can dominate mucosal and tissue samples (Ojala et al., 2023); stool often shows low host fractions (Westreich et al., 2016). Environmental samples skip this step; host species is confirmed at kickoff.
Paired metagenome normalization: scoped when DNA is available. Paired DNA+RNA enables copy-number correction for pathway activity (Franzosa et al., 2018). RNA-only projects document this limitation in the QC report.

Common questions

What is the minimum sequencing depth and replicate count for metatranscriptomics?

For human gut metatranscriptomes, Westreich et al. (2016) recommend ribo-depleted, 100 bp paired-end sequencing with 40–50 million raw reads per sample—yielding roughly 5–10 million annotated reads and >90% accuracy on low-abundance transcript estimates. At least three biological replicates per condition are recommended for differential testing; fewer are flagged at kickoff. Environmental or low-biomass matrices may require project-specific depth targets; Pepkio confirms thresholds at kickoff based on sample matrix and study goals.

Can you analyze low-quality or low-yield RNA libraries?

Yes, with caveats documented in the QC report. Samples with low RIN, insufficient reads after host and rRNA filtering, or high residual rRNA fractions may lack power for rare pathway detection (Ojala et al., 2023). Outlier samples in ordination are flagged before differential testing; re-sequencing is discussed when yield threatens the study question.

Do you support Illumina and MGI DNBSEQ metatranscriptomic data?

Yes. Pepkio processes ribo-depleted Illumina FASTQs from NovaSeq X, 6000, NextSeq 2000, and HiSeq instruments using the standard workflow. MGI DNBSEQ-T7, G400, and G99 FASTQs are processed when scoped at kickoff with adapter and QC validation in the report. Pre-built HUMAnN or MetaPhlAn outputs from either platform can be imported when upstream processing is complete.

How long does metatranscriptomics analysis take at Pepkio?

A standard project (roughly 4–24 samples, one primary contrast, profiling through differential activity) typically completes in 5–8 weeks from data receipt. Multi-contrast designs, paired DNA+RNA normalization, assembly-based quantification, or >24 samples may extend the timeline. Milestone check-ins occur during QC, after profiling, and before delivery; exact timelines are confirmed at kickoff.

How do you handle batch effects across sequencing runs or rRNA depletion kit lots?

When batch is known and not fully confounded with condition, Pepkio includes batch as a covariate in MaAsLin2 models (Mallick et al., 2021). PCA and correlation heatmaps are reviewed before modeling. rRNA depletion kit lot and extraction batch are recorded in sample_manifest.csv; post-hoc correction beyond the design formula is scoped separately when required.
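Whether batch is fully confounded with condition can be checked from the manifest before any model is fitted. A minimal sketch, assuming one (condition, batch) pair per sample and treating the design as fully confounded when no batch contains more than one condition; the function name is hypothetical.

```python
from collections import defaultdict


def batch_fully_confounded(samples):
    """True if every batch contains samples from only one condition."""
    conditions_per_batch = defaultdict(set)
    for condition, batch in samples:
        conditions_per_batch[batch].add(condition)
    if not conditions_per_batch:
        raise ValueError("no samples provided")
    return all(len(conds) == 1 for conds in conditions_per_batch.values())
```

When this returns True, a batch covariate cannot be separated from the condition effect, so adding it to the model does not help.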

Do I own the code — and in what format is it delivered?

Yes — you retain full ownership of all code, scripts, and results. Pepkio delivers commented R and Python scripts with conda lockfiles or sessionInfo() exports so you can rerun agreed stages on Linux or HPC. Tables use standard .csv and .tsv formats; deliverables are organized by pipeline stage with README instructions. R Markdown or Jupyter notebooks are available on request.

Can I be involved during analysis?

Yes. Checkpoint reviews occur after QC, after host/rRNA filtering, and before final delivery. Within agreed scope, you can review metadata, covariate choices, filtering thresholds, and contrast definitions before final statistics. A PhD-level scientific contact leads the project, coordinates milestone feedback, and records decisions in the shared project file throughout the engagement.

What does post-delivery reviewer support include?

Post-delivery support covers clarification of methods, QC thresholds, database builds, and minor figure or table revisions within agreed scope (typically ≤20% of deliverables). Methods drafts cover analyses Pepkio performed. Substantial new reviewer requests—additional contrasts, assembly-based requantification, or new covariate models—are scoped as separate milestones with updated pricing and timeline estimates.

Is co-authorship required?

No. Pepkio does not require co-authorship unless explicitly discussed and agreed in writing before project start. We operate as a fee-for-service CRO with no authorship conditions in standard statements of work. Acknowledgment of bioinformatics support in Methods or Acknowledgments is standard practice on our projects and appreciated by our team.

Do I need paired shotgun metagenomic DNA for metatranscriptomics?

No, but paired DNA improves interpretation. Metagenomic DNA reports gene presence and copy number; metatranscriptomics reports expression. Franzosa et al. (2018) showed that copy-number normalization with paired DNA distinguishes transcriptional upregulation from abundance changes. Pepkio analyzes RNA-only projects and documents this limitation; paired DNA normalization is scoped when metagenomic FASTQs or profiles are available.

How do you handle remaining rRNA after kit depletion?

Pepkio runs SortMeRNA 4.3.7 against SILVA 138 after experimental ribodepletion to remove residual rRNA reads (Kopylova et al., 2012; Westreich et al., 2016). rRNA fractions before and after filtering are reported per sample. Samples with high post-filter rRNA or low mRNA yield are flagged in the QC report before HUMAnN profiling.

Can you quantify virulence, AMR, or custom pathway expression?

Yes, when scoped at kickoff. HUMAnN gene-family tables can be filtered to client-specified gene sets annotated against CARD, VFDB, or custom reference databases. Assembly-based Salmon quantification and co-expression networks are supported as bespoke extensions beyond the standard HUMAnN workflow; annotation scope, reference versions, and feasibility are confirmed before analysis begins.

Related services

Shotgun metagenomics — Paired DNA for gene-copy normalization and species-level catalog alongside active expression.
16S amplicon — Cost-effective taxonomic profiling when functional expression data are not required.
Bulk RNA-seq — Host transcriptome profiling alongside microbiome activity in the same cohort.
Metabolomics — Small-molecule validation of pathway activity inferred from metatranscriptomic data.
Experimental design — Replicate planning and sequencing depth estimation before library prep.

References

Zhang Y, Thompson KN, Branck T, et al. Metatranscriptomics for the Human Microbiome and Microbial Community Functional Profiling. Annual Review of Biomedical Data Science. 2021;4:279–311. https://doi.org/10.1146/annurev-biodatasci-031121-103035 (PMID: 34465175)
Franzosa EA, McIver LJ, Rahnavard G, et al. Species-level functional profiling of metagenomes and metatranscriptomes. Nature Methods. 2018;15(11):962–968. https://doi.org/10.1038/s41592-018-0176-y (PMID: 30377376)
Westreich ST, Korf I, Mills DA, Lemay DG. SAMSA: a comprehensive metatranscriptome analysis pipeline. BMC Bioinformatics. 2016;17:399. https://doi.org/10.1186/s12859-016-1270-8 (PMID: 27687690)
Ojala T, Häkkinen A-E, Kankuri E, Kankainen M. Current concepts, advances, and challenges in deciphering the human microbiota with metatranscriptomics. Trends in Genetics. 2023;39(9):686–702. https://doi.org/10.1016/j.tig.2023.05.004
Javdan B, Lopez JG, Chankhamjon P, et al. Personalized mapping of drug metabolism by the human gut microbiome. Cell. 2020;181(7):1661–1679.e22. https://doi.org/10.1016/j.cell.2020.05.001 (PMID: 32526207)
Becattini S, Sorbara MT, Kim SG, et al. Rapid transcriptional and metabolic adaptation of intestinal microbes to host immune activation. Cell Host & Microbe. 2021;29(3):378–393.e5. https://doi.org/10.1016/j.chom.2021.01.003 (PMID: 33539766)
Hatch A, Horne J, Toma R, et al. A robust metatranscriptomic technology for population-scale studies of diet, gut microbiome, and human health. International Journal of Genomics. 2019;2019:1718741. https://doi.org/10.1155/2019/1718741 (PMID: 31662956)
Blanco-Míguez A, Beghini F, Cumbo F, et al. Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4. Nature Biotechnology. 2023;41(4):555–568. https://doi.org/10.1038/s41587-023-01688-w (PMID: 36823356)
Kopylova E, Noé L, Touzet H. SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics. 2012;28(24):3211–3217. https://doi.org/10.1093/bioinformatics/bts611 (PMID: 23071270)
Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biology. 2019;20(1):257. https://doi.org/10.1186/s13059-019-1891-0 (PMID: 31779668)
Lu J, Breitwieser FP, Thielen P, Salzberg SL. Bracken: estimating species abundance in metagenomics data. PeerJ Computer Science. 2017;3:e104. https://doi.org/10.7717/peerj-cs.104
Mallick H, Rahnavard G, McIver LJ, et al. Multivariable association discovery in population-scale meta-omics studies. PLOS Computational Biology. 2021;17(11):e1009442. https://doi.org/10.1371/journal.pcbi.1009442 (PMID: 34784344)
Lin H, Peddada SD. Analysis of compositions of microbiomes with bias correction. Nature Communications. 2020;11:3514. https://doi.org/10.1038/s41467-020-17041-7 (PMID: 32665548)
Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods. 2017;14(4):417–419. https://doi.org/10.1038/nmeth.4197 (PMID: 28263959)
Meyer F, Fritz A, Deng Z-L, et al. Critical Assessment of Metagenome Interpretation: the second round of challenges. Nature Methods. 2022;19(4):429–440. https://doi.org/10.1038/s41592-022-01431-4 (PMID: 35396482)
Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32(19):3047–3048. https://doi.org/10.1093/bioinformatics/btw354 (PMID: 27312411)
Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34(17):i884–i890. https://doi.org/10.1093/bioinformatics/bty624 (PMID: 30423086)

Let's Talk About Your Science

Tell us:

• Your biological question
• Data type and size
• Timeline constraints

We'll tell you:

• What's feasible
• How long it will take
• Exactly what it will cost
