Day 10: Multi-Omics Integration - Metagenomics & Metatranscriptomics

Status: 🚧 Active Learning | Time: 4-6 hours
Focus: Integrate processed metagenomics and metatranscriptomics data for comprehensive functional analysis

Note: This tutorial assumes you have already processed metatranscriptomics (MTX) data and have final count tables ready. A complete blog series on MTX data analysis (from raw reads to count tables) will be released soon. Today we focus on integration of MG and MTX datasets.

📚 Learning Objectives

By the end of this session, you will:

Understand multi-omics integration principles
Compare gene abundance (DNA) vs transcript abundance (RNA)
Perform differential expression analysis with DESeq2
Calculate and interpret expression ratios
Create comprehensive visualizations (heatmaps, correlation plots, networks)
Identify active metabolic pathways in microbial communities

🧭 Workflow at a Glance (DNA + RNA Integration)

MG (DNA) raw counts → CPM (for comparability)
MTX (RNA) raw counts → DESeq2 (stats) + CPM (ratios/plots)
Gene activity: log2((RNA+1)/(DNA+1))
Visual QC: histogram, MA plot, boxplot
Higher-level integration: pathways + taxonomy → “active functions” & “active taxa”

🧬 Part 1: Introduction to Multi-Omics Integration

Why Integrate Metagenomics and Metatranscriptomics?

Data Type	What It Tells Us	Limitation
Metagenomics (DNA)	Who’s present? What can they do?	Doesn’t show actual activity
Metatranscriptomics (RNA)	What are they actually doing?	Hard to assign to specific organisms
Integrated (DNA + RNA)	Active organisms + their functions	Complete functional picture!

Key Questions Answered by Integration

Which pathways are active beyond what genomic potential suggests?
Which microbes are dormant vs actively functioning?
How does community respond to environmental changes?
What metabolic processes are upregulated or downregulated?

The Expression Ratio Concept

Expression Ratio = RNA abundance / DNA abundance

High ratio (>1): Gene is highly expressed relative to its abundance
Low ratio (<1): Gene is under-expressed despite being present
Ratio ≈ 1: Expression matches genomic potential

📊 Part 2: Data Preparation

Required Input Files

Before starting, we assume that we have the following data:

project/
├── metagenomics/
│   ├── gene_abundance_table.tsv      # Gene counts from MG
│   ├── pathway_abundance_table.tsv   # Pathway abundances from MG
│   └── taxonomy_abundance.tsv        # Taxonomic profiles from MG
├── metatranscriptomics/
│   ├── transcript_counts.tsv         # Transcript counts from MTX
│   ├── pathway_abundance_table.tsv   # Pathway abundances from MTX
│   └── taxonomy_abundance.tsv        # Taxonomic profiles from MTX
└── metadata.csv                      # Sample information

Setting Up Environment

# Create project structure
mkdir -p ~/multi_omics_integration/{data,scripts,results,figures}
cd ~/multi_omics_integration

# Install required R packages
R -e "install.packages(c('BiocManager', 'tidyverse', 'pheatmap', 'ggplot2', 'corrplot'))"
R -e "BiocManager::install(c('DESeq2', 'EnhancedVolcano', 'ComplexHeatmap'))"

# Install Python packages
pip install pandas numpy scipy matplotlib seaborn scikit-learn networkx --break-system-packages

Example Metadata File

Create metadata.csv:

SampleID,Condition,Timepoint,Replicate
Sample1_MG,Control,Day0,1
Sample2_MG,Control,Day0,2
Sample3_MG,Control,Day0,3
Sample4_MG,Treatment,Day7,1
Sample5_MG,Treatment,Day7,2
Sample6_MG,Treatment,Day7,3
Sample1_MTX,Control,Day0,1
Sample2_MTX,Control,Day0,2
Sample3_MTX,Control,Day0,3
Sample4_MTX,Treatment,Day7,1
Sample5_MTX,Treatment,Day7,2
Sample6_MTX,Treatment,Day7,3

🔗 Part 3: Gene-Level Integration

Python Scripts for Gene Expression Integration

Overview

Four standalone Python scripts that perform step-by-step integration of metagenomics (DNA) and metatranscriptomics (RNA) data to identify active genes and metabolic pathways. Please find all scripts in repo Day 10

Script 1: Load and Prepare Data

File: 01_load_and_prepare_data.py

What it does:

Loads raw count tables from metagenomics and metatranscriptomics experiments, identifies genes present in both datasets, and creates aligned matrices for fair comparison.

Key functions:

Reads TSV/CSV count tables
Finds overlapping genes between MG and MTX
Filters datasets to common genes only
Reports data quality metrics

Output:

Aligned gene count matrices
Summary statistics (genes, samples, overlap)

When to use:

First step of any multi-omics integration analysis

Script 2: Normalize Data

File: 02_normalize_data.py

What it does:

Converts raw counts to CPM (Counts Per Million) to account for differences in sequencing depth between samples. Essential for comparing samples with different library sizes.

Key functions:

CPM normalization: (count / total_reads) × 1,000,000
Validates normalization (all samples = 1M)
Reports sequencing depth statistics

Output:

CPM-normalized gene abundance tables
Sequencing depth summary

When to use:

After loading data, before calculating ratios

Why it matters:

Without normalization, a sample with 10M reads appears 2x more abundant than a sample with 5M reads, even if biological abundance is identical.

Script 3: Calculate Expression Ratios

File: 03_calculate_expression_ratios.py

What it does:

Calculates log2(RNA/DNA) ratios for each gene to identify which genes are more (or less) expressed than expected based on their genomic abundance.

Key functions:

Computes log2((RNA + 1) / (DNA + 1)) for each gene
Identifies highly expressed genes (ratio > 2 = 4x more RNA)
Identifies under-expressed genes (ratio < -2)
Calculates mean and variability across samples

Output:

Complete expression ratio matrix
List of highly expressed genes
List of under-expressed genes
Summary statistics

When to use:

Core integration step after normalization

Interpretation:

Ratio > 2: Active metabolism, stress response
Ratio ≈ 0: Expression matches abundance
Ratio < -2: Dormancy, energy conservation

Script 4: Visualize Expression Ratios

File: 04_visualize_expression_ratios.py

What it does:

Creates three publication-quality figures to visualize expression patterns across all genes and samples.

Figures generated:

Histogram (expression_ratio_histogram.png)
- Shows distribution of expression ratios
- Identifies overall community activity
- Reference lines at key thresholds (0, 2)
MA Plot (ma_plot_expression.png)
- X-axis: Gene abundance (DNA + RNA average)
- Y-axis: Expression ratio (RNA/DNA)
- Reveals if low/high abundance genes behave differently
- Detects technical biases
Boxplot (expression_ratio_boxplot.png)
- Compares expression ratios across samples
- Shows sample-to-sample variability
- Identifies outlier samples

Output:

Three high-resolution PNG figures (300 DPI)
Ready for publication or presentations

When to use:

Final step for visual interpretation and quality control

Running the Scripts

Sequential execution:

python3 01_load_and_prepare_data.py
python3 02_normalize_data.py
python3 03_calculate_expression_ratios.py
python3 04_visualize_expression_ratios.py

Quick Reference Table

Script	Input	Output	Purpose
01	Raw count tables	Aligned matrices	Prepare datasets
02	Aligned matrices	CPM-normalized data	Account for sequencing depth
03	Normalized data	Expression ratios	Identify active genes
04	Expression ratios	Figures (PNG)	Visualize patterns

Expected Outputs Summary

After running all scripts, you’ll have:

Data files:

mg_genes_aligned.csv - Metagenomics aligned
mtx_transcripts_aligned.csv - Metatranscriptomics aligned
mg_genes_cpm.csv - MG normalized to CPM
mtx_transcripts_cpm.csv - MTX normalized to CPM
expression_ratios.csv - Complete ratio matrix
highly_expressed_genes.csv - Top active genes
under_expressed_genes.csv - Dormant genes

Figures:

expression_ratio_histogram.png
ma_plot_expression.png
expression_ratio_boxplot.png

Next Steps

After running these scripts:

Review highly expressed genes for biological significance
Perform functional enrichment analysis (KEGG, GO)
Run DESeq2 for statistical differential expression
Create pathway-level and taxonomic-level integrations
Generate comprehensive multi-omics reports

These scripts lay the foundation for deeper multi-omics analysis!

📈 Part 4: Differential Expression Analysis with DESeq2

Why DESeq2 for Metatranscriptomics?

DESeq2 is designed for RNA-seq data and handles:

Count data normalization
Variance estimation across replicates
Statistical testing for differential expression
Multiple testing correction

Step 1: Prepare Count Matrix for DESeq2

import pandas as pd

# Load raw transcript counts (not normalized)
mtx_raw = pd.read_csv('data/transcript_counts.tsv', sep='\t', index_col=0)

# Filter low-abundance genes (at least 10 reads across all samples)
gene_sums = mtx_raw.sum(axis=1)
mtx_filtered = mtx_raw[gene_sums >= 10]

print(f"Genes before filtering: {mtx_raw.shape[0]}")
print(f"Genes after filtering: {mtx_filtered.shape[0]}")

# Save filtered count matrix
mtx_filtered.to_csv('results/mtx_counts_filtered.csv')

# Load and prepare metadata
metadata = pd.read_csv('metadata.csv', index_col=0)
metadata_mtx = metadata[metadata.index.str.contains('MTX')]

# Ensure sample order matches
mtx_filtered = mtx_filtered[metadata_mtx.index]

# Save metadata for R
metadata_mtx.to_csv('results/metadata_mtx.csv')

print("\nData prepared for DESeq2!")

Step 2: Run DESeq2 Analysis in R

Please see deseq2_analysis.R in Day 10**

Run the R script:

⚠️ Important: DESeq2 must use raw integer counts (not CPM/TPM/RPKM).
CPM is great for visualization and ratio-style summaries, but statistical DE testing should stay in DESeq2.

🔥 Part 5: Comparative Analysis- DNA vs RNA Abundance

🧰 Day 10 Scripts (Quick Index)

Step	Script	Purpose	Outputs
1	`01_load_and_prepare_data.py`	Align MG/MTX gene tables	aligned matrices
2	`02_normalize_data.py`	CPM normalization	`*_cpm.csv`
3	`03_calculate_expression_ratios.py`	log2(RNA/DNA) ratios	`expression_ratios.csv`
4	`04_visualize_expression_ratios.py`	QC + overview plots	histogram, MA, boxplot
5	`comparative_heatmaps.py`	DNA vs RNA heatmaps	comparison PNG
6	`ratio_heatmap.py`	ratio heatmap of top genes	ratio heatmap PNG
7	`pathway.py`	pathway DNA vs RNA activity	pathway ratios + barplots
8	`taxonomy_integration.py`	taxa abundance vs activity	activity scores + plots

Day 10 scripts comes with toy data for practicing Day 10

🎯 Part 8: Biological Interpretation Guide

Understanding Your Results

Expression Ratio Interpretation

Log2(RNA/DNA)	Interpretation	Biological Meaning
> 3	Very highly active	Stress response, active metabolism
2 to 3	Highly active	Active growth, core functions
0 to 2	Moderately active	Steady-state metabolism
-1 to 0	Slightly under-expressed	Energy conservation
< -2	Under-expressed	Dormancy, unfavorable conditions

Common Biological Patterns

High DNA + High RNA: Core metabolic genes, abundant active populations
Low DNA + High RNA: Induced stress responses, specific adaptations
High DNA + Low RNA: Dormant populations, genetic potential not utilized
Low DNA + Low RNA: Rare inactive functions

Example Interpretation

Finding: Nitrogen fixation genes show high expression ratios (log2 = 3.5)
         but low genomic abundance (0.5% of community)

Interpretation: Although nitrogen-fixing bacteria are rare in the community,
                they are highly active, suggesting nitrogen limitation and
                an important ecological role for these minority members.

✅ Deliverables Checklist

Expression ratios calculated for all genes
DESeq2 differential expression analysis complete
Comparative heatmaps (DNA vs RNA) generated
Pathway activity analysis complete
Taxonomic activity scores calculated
All visualization figures saved
Summary statistics report generated
Biological interpretation written

🎉 Conclusion

Congratulations! You’ve now integrated metagenomics and metatranscriptomics data to:

Identify which genes and pathways are actively expressed
Discover which microbes are metabolically active
Quantify differential expression between conditions
Create publication-quality visualizations

Next Steps:

Apply these methods to your own data
Explore temporal multi-omics (time-series)
Integrate with meta-metabolomics data
Investigate strain-level activity patterns

Remember: Integration reveals the complete functional story - who’s there (DNA), what they’re doing (RNA), and how they respond to their environment!

Repo for today’s code and other details

🔗 Day 10

Last updated: February 2026

day10 mg