10-Day Metagenome Analysis Series - Complete Summary

Complete workflow from raw sequencing data to publication-ready insights

📊 Overview

Day 1: Metagenome QC & Taxonomic Profiling

From Raw Reads to Clean, Profiled Data

What You’ll Do:

Quality control of raw sequencing reads
Adapter trimming and host contamination removal
PhiX spike-in removal
Read-based taxonomic profiling

Key Tools:

FastQC / MultiQC - Quality assessment
Trimmomatic - Adapter trimming
Kraken2 - k-mer-based taxonomic classification
Kaiju - Protein-level classification
MetaPhlAn - Clade-specific marker gene profiling
mOTUs - Universal marker gene profiling
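
To make the idea of k-mer matching concrete, here is a minimal toy sketch in Python. It is not how Kraken2 is implemented (Kraken2 uses minimizers, a compact hash table, and lowest-common-ancestor logic); the reference sequences, taxon names, and the choice of k=8 are all hypothetical and only illustrate the principle of voting by shared k-mers.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all k-mers found in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k):
    """Map each k-mer to the set of taxa whose reference contains it."""
    index = {}
    for taxon, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(taxon)
    return index


def classify_read(read, index, k):
    """Assign a read to the taxon with the most taxon-specific k-mer hits."""
    votes = Counter()
    for kmer in kmers(read, k):
        taxa = index.get(kmer, ())
        if len(taxa) == 1:  # skip k-mers shared between taxa (ambiguous)
            votes[next(iter(taxa))] += 1
    if not votes:
        return "unclassified"
    return votes.most_common(1)[0][0]


# Hypothetical toy references; real databases hold thousands of genomes.
references = {
    "taxon_A": "ATGGCGTACGTTAGCCGATAGGCTAACGT",
    "taxon_B": "TTGACCGGTATCCGGAATTCGCATGCAAG",
}
index = build_index(references, k=8)

print(classify_read("GCGTACGTTAGCCGAT", index, k=8))
print(classify_read("AAAAAAAAAAAA", index, k=8))
```

Reads with no informative k-mers are reported as unclassified, which mirrors why the unclassified fraction rises when the reference database lacks close relatives.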

Time Estimate: 10-20 minutes (QC) + 30 min-1 hr (Taxonomy)

Deliverables:

Clean FASTQ files
Taxonomic profiles
Quality reports
Ready for assembly

📝 Day 1

📂 Metagenome Analysis Series Repo

Day 2: Metagenome Assembly

Reconstructing Genomes from Short Reads

What You’ll Do:

Assemble clean reads into contigs
Quality assessment of assemblies
Calculate coverage
Choose best assembler

Key Tools:

metaSPAdes - Multi-k-mer assembly (high quality)
MEGAHIT - Memory-efficient (fast)
Flye - Long-read assembly
MetaQUAST - Assembly quality assessment
Bowtie2 - Read mapping
SAMtools - Coverage calculation

Time Estimate: 6-12 hours (MEGAHIT: 2-8 hrs)

Deliverables:

Assembled contigs (contigs.fasta)
Quality reports (N50, L50, total length)
Coverage files
Ready for binning
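
The N50 and L50 values in the quality reports can be reproduced by hand. The sketch below assumes you already have a list of contig lengths (for example, parsed from contigs.fasta); the function name and example lengths are illustrative only.

```python
def assembly_stats(contig_lengths):
    """Return total length, N50, and L50 for a list of contig lengths.

    N50 is the length of the contig at which the running sum of lengths
    (longest first) reaches at least half of the total assembly length.
    L50 is the number of contigs needed to reach that point.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running * 2 >= total:
            return {"total_length": total, "N50": length, "L50": count}
    # Empty input: nothing to summarize.
    return {"total_length": 0, "N50": 0, "L50": 0}


# Illustrative contig lengths in bp.
stats = assembly_stats([100, 200, 300, 400, 500])
print(stats)
```

A higher N50 together with a lower L50 indicates a more contiguous assembly, but neither metric says anything about correctness, which is why MetaQUAST reports them alongside other statistics.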

📝 Day 2

📂 Metagenome Analysis Series Repo

Day 3: Genome Binning

Recovering Individual Genomes (MAGs) from Metagenomes

What You’ll Do:

Bin contigs into individual genomes
Refine bins using multiple binners
Assess bin quality
Profile abundance across samples

Key Tools:

MetaBAT2 - Binning by tetranucleotide frequency and coverage
MaxBin2 - Binning guided by single-copy marker genes
CONCOCT - Binning by sequence composition and coverage
MetaWRAP - Bin refinement (combines all 3)
CheckM2 - Quality assessment
CoverM - Abundance profiling
SingleM - Validation

Time Estimate: 12-24 hours total

Initial binning: 4-8 hrs
Refinement: 2-6 hrs
Quality check: 1-2 hrs
Abundance: 2-4 hrs

Quality Metrics:

Completeness (target >50%, ideally >90%)
Contamination (target <10%, ideally <5%)
N50 (contig length)
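
The completeness and contamination targets above can be applied as a simple filter. This is a minimal sketch: the tier names ("high", "medium", "low") follow the common MIMAG-style convention and are an assumption here, the rRNA/tRNA criteria of the full standard are ignored, and the MAG names and values are made up.

```python
def mag_quality(completeness, contamination):
    """Classify a MAG from completeness and contamination (both in percent)."""
    if completeness > 90 and contamination < 5:
        return "high"
    if completeness > 50 and contamination < 10:
        return "medium"
    return "low"


# Hypothetical quality estimates: name -> (completeness %, contamination %).
bins = {
    "bin.1": (96.3, 1.2),
    "bin.2": (72.0, 4.8),
    "bin.3": (91.5, 12.0),
    "bin.4": (35.0, 0.5),
}

tiers = {name: mag_quality(comp, cont) for name, (comp, cont) in bins.items()}
kept = sorted(name for name, tier in tiers.items() if tier != "low")
print(tiers)
print(kept)
```

Note that bin.3 is rejected despite high completeness, because contamination above 10% usually signals a mixed bin.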

Deliverables:

Medium-quality or better MAGs (>50% complete, <10% contamination)
Abundance tables
Coverage metrics
Ready for annotation

📝 Day 3

📂 Metagenome Analysis Series Repo

Day 4: Dereplication & Taxonomy

Identifying Unique Species and Classifying Genomes

What You’ll Do:

Remove redundant/duplicate genomes
Classify MAGs taxonomically
Build phylogenetic trees
Identify novel species

Key Tools:

dRep - Dereplication (primary tool; 95% ANI threshold for species-level clusters)
GTDB-Tk - GTDB taxonomy classification
pyani - Average Nucleotide Identity
iTOL - Interactive phylogenetic tree visualization

Time Estimate: 6-17 hours

dRep (Mash + ANI): 1-4 hrs
GTDB-Tk: 4-12 hrs (identify + classify)
Visualization: 1 hr

Outputs:

Species representatives (40-70% fewer genomes)
Full GTDB taxonomy
ANI matrices
Phylogenetic trees
Novel species identified (ANI <95%)
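
The idea behind picking species representatives at 95% ANI can be sketched with a simple greedy procedure. This is a simplified illustration, not dRep's actual algorithm (which combines a fast Mash pre-clustering with a more precise ANI step and its own scoring formula); the genome names, quality scores, and ANI values are hypothetical.

```python
def dereplicate(genomes, ani, threshold=95.0):
    """Greedily pick one representative per species-level cluster.

    genomes: dict of genome name -> quality score (higher is better).
    ani: dict of frozenset({name_a, name_b}) -> ANI in percent.
    A genome is dropped if it shares ANI >= threshold with an already
    chosen representative; missing pairs are treated as unrelated.
    """
    ranked = sorted(genomes, key=lambda name: genomes[name], reverse=True)
    representatives = []
    for genome in ranked:
        is_redundant = any(
            ani.get(frozenset((genome, rep)), 0.0) >= threshold
            for rep in representatives
        )
        if not is_redundant:
            representatives.append(genome)
    return representatives


# Hypothetical inputs: quality scores and pairwise ANI values.
genomes = {"MAG_1": 95.0, "MAG_2": 88.0, "MAG_3": 91.0}
ani = {
    frozenset(("MAG_1", "MAG_2")): 98.7,
    frozenset(("MAG_1", "MAG_3")): 82.1,
    frozenset(("MAG_2", "MAG_3")): 81.5,
}

reps = dereplicate(genomes, ani)
print(reps)
```

Here MAG_2 is dropped because it shares 98.7% ANI with the higher-scoring MAG_1, leaving two species representatives.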

Success Rates:

Classification: 95-99% success
Novel species: 10-30% (ANI <95%)
Novel genera: 1-5% (roughly ANI <85%; genus boundaries are not defined by ANI alone)

Deliverables:

Dereplicated MAG set
Complete taxonomy assignments
Species catalog
Phylogenetic trees

📝 Day 4

📂 Metagenome Analysis Repo

Day 5: Genome Annotation

Understanding Metabolic Potential

What You’ll Do:

Predict genes in MAGs
Annotate gene functions
Map to metabolic pathways
Identify biosynthetic potential

Key Tools:

Prodigal - Gene prediction (all protein-coding genes)
Prokka - Rapid annotation (functions assigned to 60-80% of genes)
eggNOG-mapper - Functional annotation (functions assigned to 70-90% of genes)
DRAM - Metabolic distillation (40-60% pathways)
METABOLIC - Comprehensive pathway profiling (40+ pathways)

Time Estimate: about 9-10 hours for 10 genomes if the steps below run sequentially

Prodigal: 5 min
Prokka: 50 min
eggNOG: 5 hrs
DRAM: 3-4 hrs

Annotation Coverage:

Genes predicted: 100% (Prodigal)
Functions assigned: 60-80% (Prokka), 70-90% (eggNOG)
Pathways identified: 40+ (METABOLIC)

Deliverables:

Annotated gene files (.gbk, .gff, .faa)
Functional summaries (KEGG, COG, GO, EC)
Metabolic pathway maps
Interactive diagrams
Ready for comparative genomics

📝 Day 5

📂 Metagenome Analysis Repo

Day 6: Specialized Genomic Functions

Discovering Hidden Capabilities

What You’ll Do:

Identify biosynthetic gene clusters (BGCs)
Find antimicrobial resistance genes
Detect CRISPR-Cas systems
Identify prophages
Discover mobile genetic elements
Find protein domains

Key Tools:

antiSMASH - Secondary metabolites (BGCs: 5-30 per genome)
CARD-RGI - Antimicrobial resistance (AMR genes)
ABRicate - Resistance gene screening
dbCAN2 - Carbohydrate-active enzymes (CAZymes: 50-500)
VirSorter2 - Prophage detection (0-10 per genome)
MinCED - CRISPR arrays
IntegronFinder - Mobile genetic elements
InterProScan - Protein domains (Pfam, PROSITE)

Time Estimate: 15-20 hours (10 genomes)

antiSMASH: 5 min (fast screen) to 2 hrs (complete run)
CARD-RGI: about 1 hr (standard run)
All tools, 10 genomes: 15-20 hrs

Expected Discoveries (per genome):

BGCs: 5-30 (antibiotics, siderophores, etc.)
AMR genes: 0-50 (KPC, NDM, mcr-1, etc.)
CAZymes: 50-500 (cellulose, starch, chitin degradation)
Prophages: 0-10 (horizontal gene transfer)
CRISPR arrays: 0-5 (adaptive immunity)

Key Applications:

Antibiotic biosynthesis potential
Drug resistance surveillance
Carbohydrate degradation capabilities
Phage-mediated horizontal gene transfer
Novel biosynthetic clusters

Deliverables:

Complete specialized function catalog
BGCs with predicted product classes
AMR gene inventory
CAZyme families
Prophage sequences
CRISPR arrays
Mobile genetic elements
Protein domain architectures

📝 Day 6

📂 Metagenome Analysis Repo

Day 7: Comparative Genomics & Statistics

Connecting Genomes to Environment

What You’ll Do:

Pangenome analysis (core vs accessory genes)
Statistical testing of community differences
Environmental driver identification
Network analysis
Module detection

Key Tools & Methods:

Part I: Pangenome Analysis

PanX - Interactive pangenome browser
Anvi’o - Visual pangenome analysis
Roary - Rapid pangenome pipeline
BPGA - Windows GUI tool

Outputs: Core genes (present in all), Accessory genes (variable), Unique genes (strain-specific)
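
The core/accessory/unique split is easy to compute from a gene presence/absence table. The sketch below assumes a dictionary mapping each gene family to the set of genomes that carry it; the gene and genome names are placeholders, and the total genome count is inferred from the table itself.

```python
def classify_pangenome(presence):
    """Split gene families into core, accessory, and unique.

    presence: dict of gene family -> set of genomes containing it.
    Core genes occur in every genome, unique genes in exactly one,
    and accessory genes in anything in between.
    """
    all_genomes = set().union(*presence.values()) if presence else set()
    n_genomes = len(all_genomes)
    result = {"core": [], "accessory": [], "unique": []}
    for gene, genomes in sorted(presence.items()):
        if len(genomes) == n_genomes:
            result["core"].append(gene)
        elif len(genomes) == 1:
            result["unique"].append(gene)
        else:
            result["accessory"].append(gene)
    return result


# Hypothetical presence/absence data for three genomes.
presence = {
    "dnaA": {"g1", "g2", "g3"},
    "nifH": {"g1", "g2"},
    "geneX": {"g3"},
}

pangenome = classify_pangenome(presence)
print(pangenome)
```

Tools such as Roary often relax the core definition (for example, present in at least 99% of genomes) to tolerate incomplete assemblies, which matters for MAGs that are rarely 100% complete.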

Part II: Statistical Tests

PERMANOVA (vegan R package)
- Test: Do communities differ between groups?
- Output: P-value, R² (variance explained)
- Interpretation: Variance explained by treatment
RDA / db-RDA (vegan R package)
- Test: Which environmental factors drive differences?
- Output: Ordination plot, environmental vectors
- Interpretation: pH, temperature, nutrients effects
LEfSe / ANCOM (Python/R)
- Test: Which taxa are enriched in treatments?
- Output: Biomarkers, effect sizes
- Interpretation: Treatment-specific taxa
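
To show what the PERMANOVA P-value and R² listed above actually measure, here is a pure-Python sketch of the test on Bray-Curtis distances. In practice you would use adonis2 from the vegan R package; this version only illustrates the pseudo-F statistic and label permutation, and the abundance values, group labels, permutation count, and seed are all made up.

```python
import random


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(a) + sum(b)
    return num / den if den else 0.0


def distance_matrix(samples):
    """Full pairwise Bray-Curtis matrix as a list of lists."""
    n = len(samples)
    return [[bray_curtis(samples[i], samples[j]) for j in range(n)] for i in range(n)]


def pseudo_f(dist, labels):
    """Return (pseudo-F, R^2) from squared distances within and among groups."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for group in groups:
        members = [i for i, lab in enumerate(labels) if lab == group]
        ss = sum(dist[i][j] ** 2 for x, i in enumerate(members) for j in members[x + 1:])
        ss_within += ss / len(members)
    ss_among = ss_total - ss_within
    a = len(groups)
    f_stat = (ss_among / (a - 1)) / (ss_within / (n - a))
    return f_stat, ss_among / ss_total


def permanova(dist, labels, permutations=999, seed=0):
    """Permutation test: how often do shuffled labels give F >= observed F?"""
    rng = random.Random(seed)
    f_obs, r2 = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        f_perm = pseudo_f(dist, shuffled)[0]
        if f_perm >= f_obs * (1 - 1e-9):  # small tolerance for float rounding
            hits += 1
    return f_obs, r2, (hits + 1) / (permutations + 1)


# Hypothetical taxon counts: five control and five treated samples.
samples = [
    [50, 30, 20, 0], [48, 32, 18, 2], [52, 28, 19, 1], [49, 31, 20, 0], [51, 29, 17, 3],
    [10, 20, 30, 40], [12, 18, 32, 38], [9, 22, 29, 40], [11, 19, 31, 39], [8, 21, 33, 38],
]
labels = ["control"] * 5 + ["treated"] * 5

f_stat, r2, p_value = permanova(distance_matrix(samples), labels)
print(f"pseudo-F={f_stat:.1f}  R2={r2:.3f}  p={p_value:.3f}")
```

R² is the share of total variation attributable to the grouping, while the P-value only says whether that share is larger than expected under random label assignment.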

Part III: Networks

SparCC (Python) - Co-occurrence networks
WGCNA (R) - Module detection
igraph / Cytoscape - Network visualization

Outputs: Co-occurrence networks, Keystone species, Functional modules

Time Estimate: 6-8 hours

Pangenome (10 genomes): 1-2 hrs
PERMANOVA + RDA: 30 min
Network construction: 1-2 hrs
WGCNA: 1 hr

Statistical Requirements:

Sample size: n > 20 for robust statistics
Replicates: Prioritize biological over technical replicates
Multiple testing: Benjamini-Hochberg (BH) correction
Compositional data: Use compositionally aware methods (SparCC, ANCOM)
Confounders: Control for batch effects
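
The BH correction mentioned above is short enough to write out. This sketch implements the standard Benjamini-Hochberg step-up adjustment in plain Python; in R the equivalent is p.adjust(p, method = "BH"). The example p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(p_values)
    # Indices of p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values, e.g. one per taxon tested.
raw = [0.01, 0.04, 0.03, 0.20]
q_values = benjamini_hochberg(raw)
print([round(q, 4) for q in q_values])
```

With hundreds of taxa tested at once, reporting these adjusted values rather than raw p-values keeps the expected proportion of false discoveries under control.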

Deliverables:

Pangenome summary (Core/Accessory/Unique)
Statistical test results (P-values, R², biomarkers)
Environmental drivers identified
Co-occurrence networks
Keystone species
Functional modules
Trait associations

📝 Day 7

📂 Metagenome Analysis Repo

Day 8: Workflow Wrappers & Platforms

Streamlining Analysis - Automate Everything!

What You’ll Learn:

Automate Days 1-7 with single commands
Use web-based platforms (no installation!)
Scale to 100+ samples efficiently
Ensure reproducibility

Workflow Wrappers (Command-line):

MetaWRAP
- Complete pipeline (QC → Assembly → Binning → Annotation)
- All-in-one solution
- Bin refinement built-in
- Time: 2-3 days for 50 samples
nf-core/mag
- Modern Nextflow pipeline
- Automatic resume (restart from failures)
- Reproducible & scalable
- MultiQC reports
- Time: 2-3 days for 50 samples
Anvi’o
- Interactive platform
- Real-time binning
- Beautiful visualizations
- Pangenomics integration
ATLAS
- Snakemake pipeline
- Large datasets (100+ samples)
- Metatranscriptomics support

Web Platforms (Browser-based, No Installation):

Galaxy
- 1000+ tools available
- User-friendly GUI
- Free compute resources
- Educational tutorials
- Best for: Learning, quick analysis
KBase
- DOE platform
- Metabolic modeling (unique!)
- Narrative interface
- Best for: Systems biology
IMG/M
- JGI database (40,000+ metagenomes)
- Pre-computed annotations
- Compare to public data
- Best for: Database queries
BV-BRC (PATRIC)
- Bacterial/viral focus
- AMR database
- Clinical/pathogen analysis
- Best for: Pathogens, AMR studies

Time Savings:

Manual (Days 1-7): 2-4 weeks per sample
Automated: 2-3 days for 50 samples!
10x-100x faster!

When to Use Each:

Workflow Wrappers: HPC access, large datasets, customization
Web Platforms: No Linux skills, learning, collaboration

Key Benefits:

✅ Automation: 1 command runs entire pipeline
✅ Reproducibility: Same input = same output
✅ Best practices: Tools integrated properly
✅ Scalability: 1 to 1000 samples
✅ Documentation: Clear workflows for papers

Deliverables:

Complete analysis in days (not weeks!)
Automated quality control
Reproducible workflows
Publication-ready results

📝 Day 8

📂 Metagenome Analysis Repo

Day 9: Visualization & Publication

Making Data Tell Stories

What You’ll Learn:

Create publication-quality figures (300 DPI)
Master both Python and R visualization
Use specialized genomics tools
Make interactive plots

Python Visualization:

Matplotlib - Publication static plots
- Taxonomy barplots
- Alpha diversity boxplots
- Multi-panel figures
Seaborn - Statistical visualizations
- Clustered heatmaps
- Correlation matrices
- Pairplots
Plotly - Interactive HTML
- Sunburst charts (taxonomy hierarchy)
- Interactive scatter plots
- Hover tooltips
NetworkX / igraph - Network graphs
- Co-occurrence networks
- Community detection
- Hub identification

R Visualization:

ggplot2 - Grammar of graphics
pheatmap - Publication heatmaps
vegan - Ordination plots
See the complete R visualization series at jojyjohn28.github.io

Specialized Tools:

Krona
- Interactive taxonomic wheels
- Zoomable HTML output
- Hierarchy navigation
Circos / pycirclize
- Circular genome plots
- GC content tracks
- Synteny links
Cytoscape
- Professional network viz
- Publication-quality layouts
- Advanced styling

Plot Types Covered:

Stacked barplots (taxonomy)
Boxplots with statistics
Heatmaps (clustered)
Ordination (PCA, NMDS)
Networks (co-occurrence)
Circular genomes
Interactive HTML plots

Publication Standards:

✅ 300 DPI minimum
✅ PDF (vector) preferred
✅ Colorblind-safe palettes
✅ Clear axis labels
✅ Multi-panel figures (A, B, C, D)
✅ Error bars where appropriate

Time Estimate: 4-6 hours

Deliverables:

Publication-ready figures (PDF + PNG + SVG)
Interactive HTML plots
Multi-panel composite figures
Journal-compliant formatting
Colorblind-friendly
Ready for submission!

📝 Day 9

📂 Metagenome Analysis Repo

Day 10: Multi-Omics Integration

Connecting Metagenomes to Function

What You’ll Learn:

Integrate metagenomics + metatranscriptomics
Link genes to expression
Map to metabolic pathways
Identify active vs inactive functions

Multi-Omics Workflow:

Load & Align
- Metagenome data (DNA)
- Metatranscriptome data (RNA)
- Align RNA reads to MAG genes
Normalize
- CPM (Counts Per Million)
- Account for sequencing depth
Expression Ratios
- log2(RNA/DNA) for each gene
- Identifies active genes
Differential Expression
- DESeq2 analysis
- Find treatment-responsive genes
Visualize
- Heatmaps (gene expression)
- Pathway maps (active functions)
Pathway Analysis
- Map to KEGG pathways
- Identify active functions
Taxonomy Integration
- Link expression to species
- Find active taxa
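
The normalization and expression-ratio steps above can be sketched in a few lines. This is an illustration under stated assumptions: counts are per-gene read counts from mapping DNA and RNA reads to the same MAG genes, a pseudocount of 1 CPM is added to avoid division by zero (a common but arbitrary choice), and the gene names and counts are invented. CPM corrects for sequencing depth only, not gene length.

```python
import math


def cpm(counts):
    """Scale raw read counts to counts per million mapped reads."""
    total = sum(counts.values())
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


def log2_rna_dna(rna_counts, dna_counts, pseudocount=1.0):
    """Return log2(RNA CPM / DNA CPM) for every gene in the DNA table.

    Positive values suggest a gene is transcribed more than its genomic
    abundance alone would predict; negative values suggest the opposite.
    """
    rna_cpm = cpm(rna_counts)
    dna_cpm = cpm(dna_counts)
    return {
        gene: math.log2(
            (rna_cpm.get(gene, 0.0) + pseudocount) / (dna_cpm[gene] + pseudocount)
        )
        for gene in dna_cpm
    }


# Hypothetical read counts for two genes.
dna_counts = {"geneA": 500, "geneB": 500}
rna_counts = {"geneA": 900, "geneB": 100}

ratios = log2_rna_dna(rna_counts, dna_counts)
for gene, value in ratios.items():
    print(f"{gene}: {value:+.2f}")
```

Both genes are equally abundant in the metagenome, yet geneA ends up with a positive ratio and geneB with a negative one, which is exactly the potential-versus-activity distinction this day is about.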

Key Questions Answered:

Which genes are transcribed?
Which pathways are active?
Which species are metabolically active?
How does expression change with treatment?

Tools:

Salmon / Kallisto - RNA quantification
DESeq2 - Differential expression
KEGG Mapper - Pathway visualization
R / Python - Integration & plotting

Time Estimate: 4-6 hours

Deliverables:

Expression ratios (RNA/DNA)
Differentially expressed genes
Active metabolic pathways
Species-specific activity
Publication figures showing:
- Gene expression heatmaps
- Active pathway diagrams
- Taxa-function links
- Statistical results

Biological Insights:

Distinguish potential (DNA) from activity (RNA)
Identify metabolically active species
Find treatment-responsive pathways
Discover unexpected functions
Link community structure to function

📝 Day 10

📂 Metagenome Analysis Repo

📊 Series Statistics

Total Tools Covered: 50+

Total Time Investment: 100-150 hours (learning + doing)

Deliverables: Complete publication-ready dataset

Skill Progression:

Days 1-3: Core skills (QC → Assembly → Binning)
Days 4-6: Advanced annotation & specialization
Days 7-9: Analysis, statistics, visualization
Day 10: Integration & insights

Quick Reference Card

Day	Topic	Key Tool	Time	Output
1	QC & Taxonomy	FastQC, Kraken2	1 hr	Clean reads + profiles
2	Assembly	metaSPAdes, MEGAHIT	6-12 hrs	Contigs
3	Binning	MetaBAT2	12-24 hrs	MAGs
4	Dereplication	dRep, GTDB-Tk	6-17 hrs	Species catalog
5	Annotation	Prokka, DRAM	9-10 hrs	Functional genes
6	Specialized	antiSMASH, CARD	15-20 hrs	BGCs, AMR
7	Comparative	PERMANOVA, RDA	6-8 hrs	Statistics
8	Workflows	MetaWRAP, Galaxy	Weeks → Days	Automation
9	Visualization	Python, R	4-6 hrs	Figures
10	Multi-omics	DESeq2	4-6 hrs	Integration

🎉 Congratulations on completing the 10-day series! 🎉

Last updated: February 2026
