Metagenome Analysis - Graduate Course Materials
Beginner-Level Practical Training
Welcome to the beginner-friendly companion to the 10-Day Metagenome Analysis Series!
This repository contains materials for a graduate-level metagenome analysis course, co-taught as part of a university graduate program. It’s designed for students with little to no prior bioinformatics experience who want hands-on practice with real workflows using manageable toy datasets.
Why this course?
- ✅ Quick results - Complete workflows in 1-2 hours (not days!)
- ✅ Laptop-friendly - Runs on 4-8 GB RAM
- ✅ Realistic expectations - Learn with toy data, understand limitations
- ✅ Foundation for advanced work - Bridge to the full 10-Day Series
📚 Course Overview
A hands-on introduction to metagenomics covering the complete workflow from raw sequencing reads to taxonomic classification, genome recovery, and phylogenetic analysis.
Level: Beginner
Prerequisites: Basic command-line knowledge
Duration: Full semester practical course
Instructors: Co-taught graduate course
📁 Repository Contents
metagenome-analysis-class/
├── README.md # This file
├── metagenome_practical_full_version_jojy.md # Complete course guide
├── data.md # Data download instructions
├── metawrap_micromamba.md # MetaWRAP setup guide
│
├── Toy Data (< 100 MB - Included)
│ ├── toy-R1.fastq.gz # Forward reads
│ ├── toy-R2.fastq.gz # Reverse reads
│ └── assembly.fasta # Pre-assembled contigs
│
├── Phylogenetic Analysis
│ ├── class_practice_tree/ # Tree files
│ └── tree_annotation_data.xlsx # Annotation metadata
│
└── Metagenome_M8130_Feb16_JJ.pdf # Full class materials
🎯 Learning Objectives
After completing this course, students will be able to:
✅ Perform quality control on raw sequencing data
✅ Conduct taxonomic profiling of microbial communities
✅ Assemble metagenome reads into contigs
✅ Recover individual genomes (MAGs) through binning
✅ Classify recovered genomes taxonomically
✅ Annotate and visualize phylogenetic relationships
🧬 Toy Dataset (< 100 MB)
What’s Included:
- toy-R1.fastq.gz - Forward paired-end reads
- toy-R2.fastq.gz - Reverse paired-end reads
- assembly.fasta - Pre-assembled contigs for binning practice
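Before running any tool on the paired files, it can help to confirm that both mates hold the same number of reads. The sketch below is one way to do that with standard Unix tools. It assumes the usual four-lines-per-record FASTQ layout; the function names `count_fastq_reads` and `check_pairs` are made up for this example and are not part of any tool.

```shell
# Count reads in a gzipped FASTQ file (assumes 4 lines per record).
count_fastq_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Print both counts and succeed only when the two mates agree.
check_pairs() {
    n1=$(count_fastq_reads "$1")
    n2=$(count_fastq_reads "$2")
    echo "R1=${n1} R2=${n2}"
    [ "$n1" = "$n2" ]
}

# Usage from inside the repository:
# check_pairs toy-R1.fastq.gz toy-R2.fastq.gz
```

A mismatch usually means a truncated download or a file that was filtered without its mate, and paired-end tools such as fastp will typically complain about it later anyway.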
What This Dataset is Good For:
✅ Quality Control
- FastQC analysis
- Adapter trimming (fastp / Trimmomatic)
- Quality score visualization
✅ Taxonomic Profiling
- Kraken2 / Bracken classification
- Kaiju protein-level profiling
- K-mer profiling (Mash/sourmash)
✅ Read Mapping
- Bowtie2 / BWA alignment
- Coverage calculation
- Mapping to reference genomes
✅ Assembly Demonstration
- MEGAHIT or metaSPAdes assembly
- Note: Results will be fragmented (toy-scale data) but demonstrate the workflow
✅ Binning Practice
- Use the provided assembly.fasta for binning exercises
- MetaBAT2, MaxBin2, or CONCOCT
- Bin refinement and quality assessment
What to Expect:
⚠️ Important Notes:
- Dataset is intentionally small for quick processing
- Assembly will be fragmented - this is expected for toy data
- Binning will produce ~2 bins only - sufficient for learning
- Bins contain contaminated genomes (included intentionally) - for educational purposes only
- Results are not publication-quality - this is for learning workflows
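To see what "fragmented" means in numbers, you can summarize any assembly by contig count, total length, and N50. The following is a minimal sketch using only awk and sort. The function name `assembly_stats` is hypothetical, and dedicated tools report the same metrics in more detail.

```shell
# Print contig count, total length, and N50 for a FASTA file.
assembly_stats() {
    awk '
        /^>/ { if (len) print len; len = 0; next }   # header: flush previous contig
        { gsub(/[ \t\r]/, ""); len += length($0) }   # sequence line: add its length
        END { if (len) print len }
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            run = 0; n50 = 0
            # N50: length of the contig at which the running sum
            # (longest first) reaches half of the total assembly length
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run * 2 >= total) { n50 = lens[i]; break }
            }
            printf "contigs=%d total_bp=%d N50=%d\n", NR, total, n50
        }'
}

# Usage from inside the repository:
# assembly_stats assembly.fasta
```

A low N50 relative to typical gene or operon lengths is the signature of a fragmented assembly, which is what toy-scale data is expected to give.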
🧪 Pre-Assembled Data
assembly.fasta
A toy assembly file provided for students to practice:
- Genome binning
- Bin quality assessment (CheckM)
- Dereplication (dRep)
- Taxonomic classification (GTDB-Tk)
- Functional annotation
Expected Output: ~2 bins (intentionally limited for faster processing)
Use Case:
- Practice binning workflows without waiting hours for assembly
- Learn downstream analysis steps quickly
- Understand MAG quality metrics
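As a way to get a feel for MAG quality metrics, the sketch below sorts bins into rough quality tiers from completeness and contamination values. It assumes a simplified three-column, tab-separated table (bin name, completeness %, contamination %) with a header row. This layout is hypothetical and is not CheckM's native output, so you would first need to extract those columns yourself. The cut-offs follow the commonly cited MIMAG thresholds (high: >90% complete and <5% contaminated; medium: at least 50% complete and <10% contaminated). The full standard also considers rRNA and tRNA genes, which this sketch ignores, and "low" here simply means everything else.

```shell
# Assign a rough quality tier to each bin in a 3-column TSV:
# bin <TAB> completeness <TAB> contamination (header row expected).
classify_bins() {
    awk -F '\t' '
        NR == 1 { next }                      # skip the header row
        {
            comp = $2 + 0; cont = $3 + 0      # force numeric comparison
            tier = "low"
            if (comp > 90 && cont < 5)        tier = "high"
            else if (comp >= 50 && cont < 10) tier = "medium"
            printf "%s\t%s\n", $1, tier
        }' "$1"
}

# Usage (bin_quality.tsv is a placeholder file name):
# classify_bins bin_quality.tsv
```

With the toy assembly, expect the bins to land in the lower tiers because the genomes are intentionally contaminated.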
🌳 Phylogenetic Tree Materials
For learning phylogenetic tree construction and annotation:
Files:
- class_practice_tree/ - Tree files in various formats
- tree_annotation_data.xlsx - Metadata for tree annotation
Purpose:
- Visualize evolutionary relationships
- Annotate trees with metadata
- Practice using iTOL or similar tools
Note: Use these tree files for annotation practice, as toy data binning produces limited genomes.
📊 Real Dataset (Optional)
For students wanting to work with real-world data:
NCBI BioProject: PRJNA432171
Download Instructions:
See data.md for detailed download instructions.
Important:
- Real data is significantly larger (several GB)
- Processing time: hours to days depending on sample
- Requires more computational resources
- Can yield publication-quality results when processed with the full workflow
🚀 Getting Started
1. Prerequisites
Software Requirements:
- FastQC
- fastp or Trimmomatic
- MEGAHIT or metaSPAdes
- MetaBAT2
- CheckM
- (Optional) MetaWRAP for integrated workflow
See metawrap_micromamba.md for complete setup instructions.
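A quick way to see which of the required tools are already installed is to look for their executables on your PATH. This is a small sketch, not an installer. The function name `check_tools` is made up for this example, and the executable names in the final line are assumed from each tool's usual installation, so adjust them if yours differ.

```shell
# Report which of the given executables are missing from PATH.
# Returns 0 when everything is found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names are assumptions; edit the list to match your setup.
check_tools fastqc fastp megahit metabat2 checkm \
    || echo "Some tools are missing; see metawrap_micromamba.md"
```

If you work inside a micromamba environment, activate it before running the check, since the tools are only on PATH while the environment is active.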
2. Quick Start with Toy Data
# 1. Clone this repository
git clone https://github.com/jojyjohn28/metagenome-analysis-class.git
cd metagenome-analysis-class
# 2. Quality control
fastqc toy-R1.fastq.gz toy-R2.fastq.gz
# 3. Trim adapters
fastp -i toy-R1.fastq.gz -I toy-R2.fastq.gz \
-o toy-R1.clean.fastq.gz -O toy-R2.clean.fastq.gz
# 4. Taxonomic profiling
kraken2 --db /path/to/db --paired \
toy-R1.clean.fastq.gz toy-R2.clean.fastq.gz \
--output toy.kraken --report toy.report
# 5. Binning (using provided assembly)
metabat2 -i assembly.fasta -o bins/bin
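Once step 4 above has written toy.report, you may want a quick look at the most abundant taxa without opening the whole file. The sketch below assumes the standard six-column, tab-separated Kraken2 report (percentage, clade reads, direct reads, rank code, taxid, indented name). The function name `top_taxa` is hypothetical. Reports written with extra options such as minimizer data have additional columns and would need the field numbers adjusted.

```shell
# Print the top N taxa at a given rank code from a Kraken2 report.
# Arguments: report file, rank code (default S = species), N (default 5).
top_taxa() {
    report=$1; rank=${2:-S}; n=${3:-5}
    awk -F '\t' -v rank="$rank" '
        $4 == rank {
            pct = $1; name = $6
            gsub(/ /, "", pct)          # drop padding around the percentage
            sub(/^ +/, "", name)        # drop the indentation that encodes tree depth
            printf "%s\t%s\t%s\n", pct, $2, name
        }' "$report" | sort -t "$(printf '\t')" -k2,2nr | head -n "$n"
}

# Usage after running the Kraken2 step:
# top_taxa toy.report S 10    # ten most abundant species
# top_taxa toy.report G       # five most abundant genera
```

Each output line holds the percentage of reads, the read count for the clade, and the taxon name, sorted by read count.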
3. Follow the Full Tutorial
Open metagenome_practical_full_version_jojy.md for complete step-by-step instructions.
📖 Course Materials
Main Tutorial
metagenome_practical_full_version_jojy.md
- Complete workflow from QC to phylogenetic analysis
- Code examples for each step
- Troubleshooting tips
- Expected outputs
Setup Guides
metawrap_micromamba.md
- MetaWRAP installation with micromamba
- Database setup
- Configuration instructions
data.md
- Download instructions for real data
- File organization
- Storage requirements
💡 Teaching Tips
For Instructors:
- Start with Toy Data
- Fast processing for demonstrations
- Students see results quickly
- Reduces frustration with long wait times
- Use Pre-Assembled Contigs
- Skip assembly for initial binning lessons
- Focus on binning concepts first
- Students can assemble later if interested
- Provide Tree Files
- Binning produces limited results
- Use provided trees for annotation practice
- Demonstrates publication-quality phylogenies
- Optional Real Data
- Advanced students can download real data
- Compare toy vs. real results
- Understand computational requirements
For Students:
- Master the Workflow First
- Use toy data to understand each step
- Don’t worry about fragmented assembly
- Focus on learning the process
- Understand Limitations
- Toy data = toy results
- Real data requires more time/resources
- Concepts are the same at any scale
- Practice Makes Perfect
- Run the workflow multiple times
- Experiment with parameters
- Try real data when confident
⚠️ Important Disclaimers
About Toy Data:
- Not for publication - Educational purposes only
- Contaminated genomes - Intentionally included for learning
- Limited results - Expect about 2 bins
- Fragmented assembly - Expected behavior for small dataset
- Simplified workflow - Real projects are more complex
About Computational Resources:
- Toy data: Runs on laptops (4-8 GB RAM)
- Real data: Requires HPC or workstation (64+ GB RAM)
- Processing time: Toy=minutes, Real=hours to days
📚 Additional Resources
Databases:
- GTDB: https://gtdb.ecogenomic.org/
- NCBI Taxonomy: https://www.ncbi.nlm.nih.gov/taxonomy
🤝 Contributing
This is a course repository. If you’re a student or instructor using these materials:
- Students: Report issues or ask questions via GitHub Issues
- Instructors: Feel free to adapt materials for your courses
- Improvements: Pull requests welcome for corrections or enhancements
📧 Contact
Course Instructor: Jojy John
GitHub: jojyjohn28
Website: jojyjohn28.github.io
For course-related questions, open an issue in this repository.
📝 Citation
If you use these materials in your course, please cite:
Jojy John. (2026). Metagenome Analysis Graduate Course Materials.
GitHub repository: https://github.com/jojyjohn28/metagenome-analysis-class
Part of the 10-Day Metagenome Analysis Series.
Related work:
- 10-Day Series: https://github.com/jojyjohn28/metagenome-analysis-series
- Blog: https://jojyjohn28.github.io/blog/
📜 License
Educational materials provided for academic use.
- Course materials: Free to use with attribution
- Toy data: Educational purposes only
- Code examples: MIT License
🎉 Acknowledgments
This course was developed as part of a graduate-level metagenomics training program.
Special Thanks:
- Students who provided feedback
- Co-instructors for collaboration
- Metagenomics community for tool development
Last Updated: February 2026
Version: 1.0 - Beginner Level
Quick Reference Card
| Task | Tool | Input | Output | Time (Toy Data) |
|---|---|---|---|---|
| QC | FastQC | FASTQ | HTML reports | 1-2 min |
| Trim | fastp | FASTQ | Clean FASTQ | 2-3 min |
| Taxonomy | Kraken2 | FASTQ | Classification | 5-10 min |
| Assembly | MEGAHIT | FASTQ | Contigs | 10-15 min |
| Binning | MetaBAT2 | Contigs | MAGs (~2) | 5-10 min |
| Quality | CheckM | MAGs | Completeness | 5-10 min |
| Classify | GTDB-Tk | MAGs | Taxonomy | 15-30 min |
Total workflow time with toy data: ~1-2 hours
🔗 Related Resources
10-Day Metagenome Analysis Series
For comprehensive production-level workflows, check out the complete series:
📖 Day 1: QC & Taxonomic Profiling
📖 Day 2: Genome Assembly
📖 Day 3: Genome Binning
📖 Day 4: Dereplication & Taxonomy
📖 Day 5: Genome Annotation
📖 Day 6: Specialized Functions
📖 Day 7: Comparative Genomics
📖 Day 8: Workflow Platforms
📖 Day 9: Visualization
📖 Day 10: Multi-Omics Integration
Repository
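🔗 Course Materials GitHub: https://github.com/jojyjohn28/metagenome-analysis-class
🔗 10-Day Series GitHub: https://github.com/jojyjohn28/metagenome-analysis-series
🔗 10-Day Series blog: https://jojyjohn28.github.io/blog/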

Happy Learning! 🧬