Day 1 — Who Is Active, Not Just Present? An Introduction to Metatranscriptomics
The central question of microbial ecology is not “who is there?” — it’s “who is doing what, right now?”
For years, metagenomics gave us a powerful answer to the first question. Sequence the DNA from an environmental sample and you can reconstruct which microorganisms are present, which genes they carry, and what they’re theoretically capable of. But there’s a fundamental problem with that picture: DNA is a blueprint, not a timesheet.
A gene sitting on a chromosome is silent. It might be expressed constantly, occasionally, never, or only in response to conditions that no longer exist by the time you sampled. Two communities could have identical genomic potential but operate in completely opposite ways — one active, stressed, growing; the other dormant, maintenance-mode, barely ticking over. Metagenomics alone cannot tell you the difference.
That’s where metatranscriptomics enters.
What metatranscriptomics actually measures
Metatranscriptomics captures the mRNA pool of an entire microbial community at the moment of sampling. Instead of sequencing DNA, you extract RNA, convert it to cDNA, and sequence that. The result is a snapshot of gene expression — not what could happen, but what is happening.
This matters enormously. In estuarine systems like the ones we study — coastal habitats spanning salinity gradients from Pennsylvania marshes to Florida mangroves — community composition might look similar along a transect while the metabolic activities are entirely different. A sulfur-oxidizing bacterium present in both environments might be actively oxidizing sulfide in the anoxic zone of a PA salt marsh but effectively dormant in the shallow, well-oxygenated FL seagrass bed. Metagenomics sees the organism in both places. Metatranscriptomics tells you which one is actually working.
The central biological question shifts from:
“Who is present?”
to:
“Who is active, and what are they doing?”
Metagenomics vs. metatranscriptomics: the key differences
| | Metagenomics | Metatranscriptomics |
|---|---|---|
| Template | DNA | RNA (as cDNA) |
| What it captures | Genomic potential | Active gene expression |
| Stability | Stable; survives harsh extraction | Fragile; degrades rapidly |
| Reference | Any assembled sequence | mRNA only (after rRNA removal) |
| Temporal signal | None — accumulated over cell lifetime | Snapshot — moment of sampling |
| Dominant signal | All genes in all organisms | Highly expressed genes only |
The RNA fragility issue is not just a lab inconvenience — it’s biologically meaningful. Because RNA degrades fast, your metatranscriptome reflects what cells were actively transcribing at the time of freezing. That’s a feature, not a bug. But it means sample handling is everything. Flash-freezing in liquid nitrogen in the field, or immediate preservation in RNAlater, is non-negotiable.
The challenges you’ll face
Challenge 1: rRNA dominance
Even in a “total RNA” extraction, 80–95% of your reads will be ribosomal RNA. This is not a problem you can sequence your way out of: you have to actively remove rRNA, either at the bench before library prep (depletion kits) or computationally after sequencing (SortMeRNA, bbduk). Day 2 of this series covers this in full.
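To make the 80–95% figure concrete, here is a minimal arithmetic sketch of how many reads are left for mRNA analysis at each end of that range. The sequencing depth of 20 million reads is a hypothetical number chosen for illustration, not a value from our runs, and `mrna_read_budget` is a helper name invented for this example.

```python
def mrna_read_budget(total_reads: int, rrna_fraction: float) -> int:
    """Return the number of reads left over if rrna_fraction of them are rRNA."""
    if not 0.0 <= rrna_fraction <= 1.0:
        raise ValueError("rrna_fraction must be between 0 and 1")
    return round(total_reads * (1.0 - rrna_fraction))


# Hypothetical sequencing depth, chosen only for illustration.
TOTAL_READS = 20_000_000

for rrna_fraction in (0.80, 0.95):
    remaining = mrna_read_budget(TOTAL_READS, rrna_fraction)
    print(f"rRNA fraction {rrna_fraction:.0%}: {remaining:,} non-rRNA reads remain")
```

Under these assumptions, the difference between 80% and 95% rRNA is a fourfold difference in usable reads from the same run.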
Challenge 2: low mapping rates
Even after QC and rRNA removal, your mRNA reads face another challenge: finding a reference to map against. In Day 4 we show real alignment stats from our own pipeline — a whole-community metatranscriptome mapped against four MAGs yielding roughly 2% overall alignment. To the uninitiated, this looks catastrophic. It isn’t. It reflects a fundamental truth: your reference can only capture what you’ve already assembled. Understanding why low alignment rates happen — and what they mean — is one of the most important lessons in this series.
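If you want to track alignment rates across many samples, one option is to pull the summary line out of each Bowtie2 log. The sketch below assumes the log ends with Bowtie2's usual "overall alignment rate" line. The log text itself is made up for illustration (it is not the Day 4 output), and `overall_alignment_rate` is a hypothetical helper name.

```python
import re

# Invented example log in the style of a Bowtie2 unpaired-read summary.
# The numbers are illustrative only, not real results from this series.
EXAMPLE_LOG = """\
10000 reads; of these:
  10000 (100.00%) were unpaired; of these:
    9800 (98.00%) aligned 0 times
    150 (1.50%) aligned exactly 1 time
    50 (0.50%) aligned >1 times
2.00% overall alignment rate
"""


def overall_alignment_rate(log_text: str) -> float:
    """Extract the overall alignment rate (in percent) from a Bowtie2-style summary."""
    match = re.search(r"([\d.]+)% overall alignment rate", log_text)
    if match is None:
        raise ValueError("no 'overall alignment rate' line found in log text")
    return float(match.group(1))


rate = overall_alignment_rate(EXAMPLE_LOG)
print(f"Overall alignment rate: {rate}%")
```

A low percentage here says as much about how much of the community your reference covers as it does about read quality.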
Challenge 3: normalization and interpretation
Raw counts are not expression levels. A highly expressed gene in a rare organism and a moderately expressed gene in an abundant organism can produce similar read counts. Disentangling expression from abundance requires normalization strategies (TPM, RPKM) and, ideally, pairing your metatranscriptome with a metagenome to compute DNA:RNA ratios. This is where the real biology lives, and we cover it fully on Day 5.
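Here is a minimal sketch of that confound, using the standard TPM definition (reads per kilobase of gene, rescaled so the sample sums to one million). All gene names, counts, and lengths are invented for illustration, and the RNA/DNA ratio at the end is one simple way to express the pairing idea, not necessarily the calculation we use on Day 5.

```python
def tpm(counts: dict, lengths_bp: dict) -> dict:
    """Transcripts per million: length-normalize counts, then scale to sum to 1e6."""
    per_kb = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    total = sum(per_kb.values())
    return {gene: value / total * 1_000_000 for gene, value in per_kb.items()}


# Hypothetical data: two genes of interest plus everything else lumped together.
lengths_bp = {"rare_org_geneA": 1000, "abundant_org_geneB": 1000, "other": 1000}
rna_counts = {"rare_org_geneA": 500, "abundant_org_geneB": 500, "other": 1000}
dna_counts = {"rare_org_geneA": 10, "abundant_org_geneB": 200, "other": 190}

rna_tpm = tpm(rna_counts, lengths_bp)  # metatranscriptome
dna_tpm = tpm(dna_counts, lengths_bp)  # paired metagenome

# RNA/DNA ratio: expression relative to how abundant the gene is in the community.
rna_dna_ratio = {gene: rna_tpm[gene] / dna_tpm[gene] for gene in rna_tpm}

for gene in ("rare_org_geneA", "abundant_org_geneB"):
    print(f"{gene}: RNA TPM={rna_tpm[gene]:.0f}, RNA/DNA={rna_dna_ratio[gene]:.2f}")
```

In this toy example both genes have the same RNA TPM, but the gene from the rare organism has an RNA/DNA ratio of roughly 10 while the gene from the abundant organism sits near 0.5. TPM alone cannot separate the two cases; the paired metagenome can.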
Our system: estuarine microbiomes under salinity gradients
The samples in this series come from tidal estuaries in Pennsylvania and Florida. Our samples follow a naming convention you’ll learn to read:
- CP_Spr15G08: Cape (location), Spring (season), Station 15G, Replicate 08
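If you want to split sample names programmatically, a regular expression along these lines would work for the example above. The field widths (two-letter location, three-letter season, two digits plus a letter for station, two-digit replicate) are assumptions inferred from this single example, and the lookup tables contain only the codes named in this post, so treat this as a sketch to adapt rather than the full convention.

```python
import re

# Assumed layout, inferred from the one example name "CP_Spr15G08".
SAMPLE_PATTERN = re.compile(
    r"^(?P<location>[A-Z]{2})_"   # e.g. CP
    r"(?P<season>[A-Za-z]{3})"    # e.g. Spr
    r"(?P<station>\d{2}[A-Z])"    # e.g. 15G
    r"(?P<replicate>\d{2})$"      # e.g. 08
)

# Only the codes mentioned in the text; unknown codes are passed through unchanged.
LOCATIONS = {"CP": "Cape"}
SEASONS = {"Spr": "Spring"}


def parse_sample_name(name: str) -> dict:
    """Split a sample name into location, season, station, and replicate fields."""
    match = SAMPLE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"sample name does not fit the assumed pattern: {name!r}")
    fields = match.groupdict()
    fields["location"] = LOCATIONS.get(fields["location"], fields["location"])
    fields["season"] = SEASONS.get(fields["season"], fields["season"])
    return fields


print(parse_sample_name("CP_Spr15G08"))
```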
The PA and FL sites differ in salinity gradient structure, temperature regime, and organic matter inputs. These environmental differences create distinct selection pressures on microbial communities — and our metatranscriptomes capture how those communities respond in real time.
Across this gradient, we’re particularly interested in hydrocarbon-degrading guilds, sulfur cycling communities, and the organisms mediating metabolic coupling between primary producers and heterotrophs.
A conceptual map: DNA → RNA → function
The three layers below frame where metatranscriptomics sits relative to metagenomics and why neither alone is sufficient:
- DNA (metagenomics): stable, complete blueprint. Tells you what’s possible.
- mRNA (metatranscriptomics): fragile, moment-specific snapshot. Tells you what’s happening.
- Function (ecological outcome): what we actually care about — metabolic activity, niche occupation, ecosystem process.
Metatranscriptomics bridges the gap between blueprint and behavior. It’s not a replacement for metagenomics — it’s the layer that tells you which parts of the blueprint are running right now.
What’s next
Tomorrow we tackle the step most tutorials skip entirely: raw data preprocessing. FastQC diagnostics, adapter trimming with fastp, and rRNA removal with SortMeRNA and bbduk. The mapping success you get on Day 4 is decided here — long before a single read touches Bowtie2.
Day 2: QC and Preprocessing
📦 All code, SLURM scripts, and toy datasets used in this series are available in the companion GitHub repository.
_Questions about this workflow? Drop a comment below. The complete R script and SLURM submission scripts are available in the companion GitHub repository._
