Day 5: Genome Annotation - Understanding Metabolic Potential

Estimated Time: 6-10 hours

Difficulty: Intermediate

Prerequisites: Day 4 (Taxonomy)

📚 Table of Contents

Introduction
Annotation Overview
Software Installation
Basic Annotation
Advanced Annotation
Metabolic Reconstruction
Comparative Analysis
Best Practices
Troubleshooting

🎯 Introduction

Welcome to Day 5! You’ve recovered high-quality MAGs and classified them taxonomically. Now the exciting question: What can these organisms do?

Today’s goals:

✅ Predict genes - Where are the genes in your genomes?
✅ Assign functions - What do these genes do?
✅ Reconstruct pathways - What metabolic capabilities exist?
✅ Compare genomes - How do species differ functionally?

Why Annotate?

Unannotated genome: Just DNA sequences
Annotated genome: Blueprint of an organism’s capabilities

Answers you’ll get:

Can it fix nitrogen?
Does it produce secondary metabolites?
What carbon sources can it use?
Does it have antibiotic resistance genes?
Can it degrade pollutants?

🔬 Annotation Overview

The Annotation Hierarchy

Level 1: Gene Prediction
├─ Prodigal    (Fast, accurate)
└─ Prokka      (Gene prediction + basic annotation)

Level 2: Functional Annotation
├─ eggNOG-mapper  (Orthology-based)
└─ Prokka         (Swiss-Prot, Pfam)

Level 3: Metabolic Annotation
├─ DRAM          (Metabolic distillation)
└─ METABOLIC     (Comprehensive pathways)

Tools Comparison

Tool	Speed	Depth	Best For
Prodigal	⚡⚡⚡	Basic	Gene prediction only
Prokka	⚡⚡	Good	Quick annotation, publication
eggNOG-mapper	⚡	Excellent	Detailed functional annotation
DRAM	⚡	Excellent	Metabolic reconstruction
METABOLIC	⚡	Comprehensive	Full metabolic pathways

📦 Software Installation

Basic Tools

# Prodigal (gene prediction)
conda create -n prodigal
conda activate prodigal
conda install -c bioconda prodigal

# Prokka (rapid annotation)
conda create -n prokka python=3.8
conda activate prokka
conda install -c bioconda prokka

# Verify installations
prodigal -v
prokka --version

Advanced Tools

# eggNOG-mapper
conda create -n eggnog python=3.9
conda activate eggnog
conda install -c bioconda eggnog-mapper

# Download eggNOG database (~45 GB)
download_eggnog_data.py -y

# DRAM
conda create -n dram python=3.10
conda activate dram
conda install -c bioconda dram

# Setup DRAM databases (~200 GB!)
DRAM-setup.py prepare_databases --output_dir ~/DRAM_data

# METABOLIC
conda create -n metabolic python=3.7
conda activate metabolic
conda install -c bioconda metabolic

# Download METABOLIC databases
perl METABOLIC-G.pl -test

🚀 Basic Annotation

Level 1: Gene Prediction with Prodigal

Prodigal is the gold standard for bacterial gene prediction.

conda activate prodigal

# Single genome
prodigal -i genome.fasta \
         -o genes.gbk \
         -a proteins.faa \
         -d genes.fna \
         -f gbk

# Batch processing
for genome in species_representatives/*.fa; do
    name=$(basename $genome .fa)
    prodigal -i $genome \
             -o prodigal_output/${name}.gbk \
             -a prodigal_output/${name}.faa \
             -d prodigal_output/${name}.fna \
             -f gbk
    echo "✓ ${name} completed"
done

Output files:

.gbk - GenBank format (annotations)
.faa - Protein sequences (amino acids)
.fna - Gene sequences (nucleotides)

Understanding output:

# Count genes
grep -c ">" prodigal_output/*.faa

# Gene statistics
python << 'EOF'
from Bio import SeqIO

for record in SeqIO.parse("proteins.faa", "fasta"):
    print(f"Gene: {record.id}, Length: {len(record.seq)} aa")
EOF

Level 2: Quick Annotation with Prokka

Prokka does gene prediction AND functional annotation in one step.

conda activate prokka

# Single genome annotation
prokka \
    --outdir prokka_output/genome1 \
    --prefix genome1 \
    --kingdom Bacteria \
    --cpus 8 \
    --force \
    genome1.fa

# With metadata for better annotation
prokka \
    --outdir prokka_output/genome1 \
    --prefix genome1 \
    --kingdom Bacteria \
    --genus Bacteroides \
    --species uniformis \
    --strain isolate123 \
    --cpus 8 \
    --force \
    genome1.fa

Prokka output files:

prokka_output/genome1/
├── genome1.faa      # Protein sequences
├── genome1.ffn      # Gene sequences
├── genome1.fna      # Genome sequence
├── genome1.gbk      # GenBank format
├── genome1.gff      # GFF3 format (for visualization)
├── genome1.tsv      # Feature table
├── genome1.txt      # Statistics
└── genome1.sqn      # Sequin format (NCBI submission)

Parse Prokka results:

# View statistics
cat prokka_output/genome1/genome1.txt

# Extract annotations
grep -v "^#" prokka_output/genome1/genome1.tsv | \
    awk -F'\t' '{print $1"\t"$2"\t"$3}' | head -20

# Count annotated vs hypothetical
total=$(grep -c "CDS" prokka_output/genome1/genome1.tsv)
hypo=$(grep -c "hypothetical protein" prokka_output/genome1/genome1.tsv)
annotated=$((total - hypo))

echo "Total CDSs: ${total}"
echo "Annotated: ${annotated} ($(echo "scale=1; ${annotated}*100/${total}" | bc)%)"
echo "Hypothetical: ${hypo} ($(echo "scale=1; ${hypo}*100/${total}" | bc)%)"

🔬 Advanced Annotation

Level 3: Functional Annotation with eggNOG-mapper

eggNOG-mapper provides deep functional annotation using orthology.

conda activate eggnog

# Annotate proteins from Prodigal
emapper.py \
    -i proteins.faa \
    --itype proteins \
    -m diamond \
    --output eggnog_output \
    --output_dir eggnog_results \
    --cpu 16 \
    --data_dir ~/eggnog_data

# Batch annotation
for faa in prodigal_output/*.faa; do
    name=$(basename $faa .faa)
    emapper.py \
        -i $faa \
        --itype proteins \
        -m diamond \
        --output ${name} \
        --output_dir eggnog_results \
        --cpu 16
    echo "✓ ${name} annotated"
done

eggNOG output columns:

Column	Description
query	Gene ID
eggNOG_OGs	Orthologous groups
max_annot_lvl	Taxonomic level
COG_category	COG functional category
Description	Gene function
Preferred_name	Gene name
GOs	Gene Ontology terms
EC	Enzyme Commission numbers
KEGG_ko	KEGG orthology
KEGG_Pathway	KEGG pathways
KEGG_Module	KEGG modules
KEGG_Reaction	KEGG reactions
CAZy	Carbohydrate-active enzymes

Parse eggNOG results:

#!/usr/bin/env python3
# parse_eggnog.py

import pandas as pd

# Read eggNOG results
df = pd.read_csv('eggnog_results/genome1.emapper.annotations',
                 sep='\t', comment='#', header=None)

# Add column names
columns = ['query', 'seed_ortholog', 'evalue', 'score', 'eggNOG_OGs',
           'max_annot_lvl', 'COG_category', 'Description', 'Preferred_name',
           'GOs', 'EC', 'KEGG_ko', 'KEGG_Pathway', 'KEGG_Module',
           'KEGG_Reaction', 'KEGG_rclass', 'BRITE', 'KEGG_TC', 'CAZy',
           'BiGG_Reaction', 'PFAMs']
df.columns = columns

# Summary statistics
print("="*60)
print("  eggNOG Annotation Summary")
print("="*60)
print(f"Total genes: {len(df)}")
print(f"With function: {df['Preferred_name'].notna().sum()}")
print(f"With EC number: {df['EC'].notna().sum()}")
print(f"With KEGG pathway: {df['KEGG_Pathway'].notna().sum()}")
print(f"With CAZy: {df['CAZy'].notna().sum()}")

# COG category distribution
print("\nCOG Category Distribution:")
for cat in sorted(df['COG_category'].value_counts().head(10).items()):
    print(f"  {cat[0]}: {cat[1]}")

# Save functional summary
functional = df[df['Preferred_name'].notna()][['query', 'Preferred_name',
                                                'Description', 'KEGG_ko',
                                                'KEGG_Pathway', 'EC']]
functional.to_csv('functional_summary.csv', index=False)
print(f"\n✓ Functional summary saved: functional_summary.csv")

Level 4: Metabolic Annotation with DRAM

DRAM (Distilled and Refined Annotation of Metabolism) specializes in metabolic reconstruction.

conda activate dram

# Annotate MAGs
DRAM.py annotate \
    -i 'species_representatives/*.fa' \
    -o dram_output \
    --threads 16 \
    --min_contig_size 1000

# Distill annotations into metabolic summary
DRAM.py distill \
    -i dram_output/annotations.tsv \
    -o dram_distillate \
    --trna_path dram_output/trnas.tsv \
    --rrna_path dram_output/rrnas.tsv

DRAM outputs:

dram_output/
├── annotations.tsv           # All annotations
├── genes.faa                 # Predicted proteins
├── genes.fna                 # Predicted genes
├── genes.gff                 # GFF format
├── trnas.tsv                # tRNA predictions
└── rrnas.tsv                # rRNA predictions

dram_distillate/
├── metabolism_summary.xlsx   # Metabolic capabilities
├── product.html             # Interactive visualization
└── genome_stats.tsv         # Quality statistics

Key DRAM features:

Metabolic pathways - Carbon, nitrogen, sulfur metabolism
Transporters - What can organisms import/export
CAZymes - Carbohydrate degradation capabilities
Secondary metabolites - Antibiotic production
Stress response - Environmental adaptation

Interpret DRAM results:

#!/usr/bin/env python3
# parse_dram.py

import pandas as pd

# Read annotations
df = pd.read_csv('dram_output/annotations.tsv', sep='\t', low_memory=False)

print("="*60)
print("  DRAM Annotation Summary")
print("="*60)

# Count by category
categories = {
    'CAZymes': df['cazy_hits'].notna().sum(),
    'Peptidases': df['peptidase_family'].notna().sum(),
    'Transporters': df['transporter_classification'].notna().sum(),
    'VOG': df['vogdb_categories'].notna().sum(),
    'KEGG': df['kegg_hit'].notna().sum(),
}

for cat, count in categories.items():
    print(f"{cat}: {count}")

# Extract carbon metabolism genes
carbon_meta = df[df['module_name'].str.contains('carbon', case=False, na=False)]
print(f"\nCarbon metabolism genes: {len(carbon_meta)}")

# Extract nitrogen metabolism genes
nitrogen_meta = df[df['module_name'].str.contains('nitrogen', case=False, na=False)]
print(f"Nitrogen metabolism genes: {len(nitrogen_meta)}")

print("="*60)

Level 5: Comprehensive Annotation with METABOLIC

METABOLIC provides the most comprehensive metabolic annotation.

conda activate metabolic

# Run METABOLIC on MAGs
perl METABOLIC-G.pl \
    -in-gn species_representatives \
    -o metabolic_output \
    -t 16 \
    -m-cutoff 0.75

# Generate visualization
perl METABOLIC-G.pl \
    -in metabolic_output \
    -o metabolic_viz \
    -p viz

METABOLIC outputs:

metabolic_output/
├── METABOLIC_result.xlsx              # Main results
├── METABOLIC_result_each_spreadsheet/ # Individual tables
│   ├── HMM_hit_table.xlsx
│   ├── CAZy_hit_table.xlsx
│   ├── R_protein_hit_table.xlsx
│   └── MW-CMRT_profile.xlsx
├── intermediate_files/
└── log.txt

metabolic_viz/
├── METABOLIC_Diagram/                 # Pathway diagrams
└── METABOLIC_result_each_spreadsheet/

METABOLIC features:

✅ 40+ metabolic pathways - Carbon, nitrogen, sulfur, methane, etc.
✅ Energy metabolism - Respiration, fermentation
✅ Biogeochemical cycles - Full cycling capabilities
✅ Interactive diagrams - Heatmaps of pathway presence
✅ Module completeness - % of pathway present

Parse METABOLIC results:

#!/usr/bin/env python3
# parse_metabolic.py

import pandas as pd

# Read main result
df = pd.read_excel('metabolic_output/METABOLIC_result.xlsx',
                   sheet_name='Sheet1')

print("="*60)
print("  METABOLIC Analysis Summary")
print("="*60)

# Count genomes
print(f"Genomes analyzed: {len(df)}")

# Key pathways
pathways = [
    'Carbon fixation',
    'Nitrogen fixation',
    'Denitrification',
    'Sulfate reduction',
    'Methanogenesis',
    'Aerobic respiration'
]

print("\nPathway Presence:")
for pathway in pathways:
    cols = [c for c in df.columns if pathway.lower() in c.lower()]
    if cols:
        present = df[cols].notna().any(axis=1).sum()
        print(f"  {pathway}: {present} genomes")

print("="*60)

🔄 Complete Annotation Workflow

Recommended Pipeline

# Step 1: Gene prediction (Prodigal)
prodigal -i genome.fa -a proteins.faa -d genes.fna -f gbk -o genes.gbk

# Step 2: Functional annotation (eggNOG-mapper)
emapper.py -i proteins.faa -m diamond --output genome --cpu 16

# Step 3: Metabolic annotation (DRAM or METABOLIC)
# Option A: DRAM
DRAM.py annotate -i 'genomes/*.fa' -o dram_output --threads 16
DRAM.py distill -i dram_output/annotations.tsv -o dram_distillate

# Option B: METABOLIC
perl METABOLIC-G.pl -in-gn genomes/ -o metabolic_output -t 16

# Step 4: Quick annotation for NCBI submission (Prokka)
prokka --outdir prokka_out --prefix genome --cpus 8 genome.fa

📊 Comparative Analysis

Compare Metabolic Capabilities Across Genomes

#!/usr/bin/env python3
# compare_metabolic_capabilities.py

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read DRAM distillate or METABOLIC results
# Example with DRAM
metabolism = pd.read_excel('dram_distillate/metabolism_summary.xlsx')

# Create heatmap of pathway presence/absence
pathways = metabolism.iloc[:, 1:]  # Skip genome column
genome_names = metabolism.iloc[:, 0]

# Binary presence/absence
binary = (pathways > 0).astype(int)

# Create heatmap
plt.figure(figsize=(14, 10))
sns.heatmap(binary.T,
            xticklabels=genome_names,
            yticklabels=pathways.columns,
            cmap=['white', 'darkblue'],
            cbar_kws={'label': 'Presence'})
plt.title('Metabolic Pathway Presence Across Genomes',
          fontsize=14, fontweight='bold')
plt.xlabel('Genome')
plt.ylabel('Metabolic Pathway')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig('metabolic_comparison.pdf', dpi=300)
print("✓ Metabolic comparison heatmap saved")

Identify Unique Metabolic Capabilities

#!/usr/bin/env python3
# find_unique_capabilities.py

import pandas as pd

# Read annotations
dram = pd.read_csv('dram_output/annotations.tsv', sep='\t')

# Group by genome
genomes = dram.groupby('scaffold').apply(lambda x: set(x['kegg_hit'].dropna()))

# Find unique capabilities per genome
for genome, kegg_hits in genomes.items():
    unique = kegg_hits
    for other_genome, other_hits in genomes.items():
        if other_genome != genome:
            unique = unique - other_hits

    if len(unique) > 0:
        print(f"\n{genome} unique capabilities:")
        for hit in sorted(unique)[:10]:  # Top 10
            print(f"  {hit}")

💡 Best Practices

Before Annotation

✅ Use dereplicated, high-quality MAGs (Day 4)
✅ Check genome quality (>50% comp, <10% cont)
✅ Remove short contigs (<1000 bp)
✅ Ensure sufficient disk space (databases are large!)

During Annotation

✅ Start with Prodigal (fastest)
✅ Use Prokka for quick functional overview
✅ Use DRAM/METABOLIC for metabolic focus
✅ Save all intermediate files
✅ Document parameters used

After Annotation

✅ Validate results (check gene counts are reasonable)
✅ Compare to related organisms
✅ Create summary tables
✅ Visualize key pathways
✅ Back up annotation files

🔧 Troubleshooting

Prodigal Issues

Problem: No genes predicted

# Check input FASTA format
head genome.fa

# Ensure DNA sequences (not protein)
# Try closed mode for complete genomes
prodigal -i genome.fa -a proteins.faa -c

# Or meta mode for metagenomes/fragments
prodigal -i genome.fa -a proteins.faa -p meta

Prokka Issues

Problem: Prokka fails with kingdom error

# List available databases
prokka --listdb

# Force specific kingdom
prokka --kingdom Bacteria ...

# Or use --rawproduct to skip complex annotations
prokka --rawproduct ...

Problem: Low annotation rate

# Update databases
prokka --setupdb

# Use genus/species to improve annotation
prokka --genus Escherichia --species coli ...

eggNOG-mapper Issues

Problem: Database not found

# Download databases
download_eggnog_data.py -y --data_dir ~/eggnog_data

# Set data directory
export EGGNOG_DATA_DIR=~/eggnog_data
emapper.py --data_dir ~/eggnog_data ...

Problem: Out of memory

# Use diamond (less memory) instead of mmseqs
emapper.py -m diamond ...

# Reduce CPUs
emapper.py --cpu 4 ...

DRAM Issues

Problem: Database setup fails

# Check disk space (need ~200 GB)
df -h

# Download databases separately
DRAM-setup.py prepare_databases \
    --output_dir ~/DRAM_data \
    --verbose

# Set database location
export DRAM_DB_LOC=~/DRAM_data

Problem: Annotation takes too long

# Skip optional databases
DRAM.py annotate --skip_trnascan ...

# Increase min contig size
DRAM.py annotate --min_contig_size 2500 ...

📈 Expected Results

Gene Prediction

Genome Size	Expected Genes	Genes per Mb
2 Mb	1,800-2,200	~1,000
4 Mb	3,600-4,400	~1,000
6 Mb	5,400-6,600	~1,000

Typical bacteria: ~1,000 genes per Mb

Functional Annotation

Expected annotation rates:

Prodigal: 100% genes (no function)
Prokka: 60-80% with function
eggNOG-mapper: 70-90% with function
DRAM: 40-60% metabolically relevant
METABOLIC: 30-50% in pathways

Processing Time

Tool	10 Genomes	50 Genomes	100 Genomes
Prodigal	5 min	20 min	40 min
Prokka	30 min	2-3 hrs	5-6 hrs
eggNOG	1-2 hrs	6-8 hrs	12-15 hrs
DRAM	2-4 hrs	10-15 hrs	20-30 hrs
METABOLIC	1-2 hrs	5-8 hrs	10-15 hrs

✅ Success Checklist

Before moving forward:

📚 Key Papers & Resources

Essential Reading

Prodigal:
- Hyatt et al. (2010) - BMC Bioinformatics
Prokka:
- Seemann (2014) - Bioinformatics
eggNOG-mapper:
- Cantalapiedra et al. (2021) - Molecular Biology and Evolution
DRAM:
- Shaffer et al. (2020) - Nucleic Acids Research
METABOLIC:
- Zhou et al. (2022) - Nature Communications

Helpful Links

➡️ What’s Next?

Congratulations! You’ve annotated your MAGs and understand their metabolic potential!

Next Steps: Explore specialized genomic features and secondary metabolites.

🚧Topics Covered:

antiSMASH for secondary metabolite detection
CARD-RGI for antimicrobial resistance genes
dbCAN for carbohydrate-active enzymes
Prophage detection (PHASTER, VirSorter2)
CRISPR detection
Mobile genetic elements

💬 Feedback

Found this helpful? Have suggestions?

Repo for today’s code and other details

🔗 Day 5 →

Last updated: February 2026

day5 mg