Day 3: Genome Binning - Recovering Individual Genomes from Metagenomes

Estimated Time: 8-12 hours

Difficulty: Intermediate

Prerequisites: Day 2 (Assembly)

📚 Table of Contents

Introduction
What is Genome Binning?
Modern MetaWRAP Setup
Binning Workflow
SemiBin2 Overview
Bin Refinement
Quality Assessment with CheckM2
MAG Classification
Best Practices
Troubleshooting

🎯 Introduction

Welcome to Day 3 of the metagenome analysis series! After assembling contigs in Day 2, we now face the challenge of separating individual genomes from the complex mixture. This process, called binning, is crucial for recovering Metagenome-Assembled Genomes (MAGs).

What You’ll Learn

By the end of this tutorial, you’ll be able to:

✅ Set up MetaWRAP with Python 3 (modern, stable approach)
✅ Run complementary binning algorithms (MetaBAT2, MaxBin2, CONCOCT)
✅ Use SemiBin2 for deep learning-based binning
✅ Refine bins to improve quality
✅ Assess MAG completeness and contamination with CheckM2
✅ Classify and annotate recovered genomes

Prerequisites

From Day 2, you need:

✅ Assembled contigs (contigs.fasta)
✅ Sorted BAM file (contigs.bam)
✅ Coverage depth file (depth.txt)

🔬 What is Genome Binning?

Genome binning groups contigs that likely originated from the same organism based on:

Key Features Used for Binning

Sequence Composition (Tetranucleotide Frequency)
- GC content patterns
- Codon usage bias
- k-mer frequencies
- Different organisms have unique “genomic signatures”
Coverage Patterns (Abundance)
- Contigs from the same organism have similar coverage
- Multi-sample binning: co-abundance across samples
- More samples = better resolution
Taxonomic Markers
- Single-copy marker genes
- Phylogenetic placement
- Reference database similarity
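
To make the composition signal concrete, here is a minimal Python sketch that turns a contig into a tetranucleotide frequency profile. It is an illustration only: the function names are mine, and production binners use their own implementations (typically merging reverse-complementary k-mers), which this sketch does not reproduce.

```python
from collections import Counter
from itertools import product

# All 256 possible 4-mers over the DNA alphabet, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(sequence):
    """Return {4-mer: relative frequency} for one contig.

    Windows containing N or any other non-ACGT character are ignored.
    """
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + 4] for i in range(len(sequence) - 3))
    total = sum(counts[kmer] for kmer in TETRAMERS)
    if total == 0:
        return {kmer: 0.0 for kmer in TETRAMERS}
    return {kmer: counts[kmer] / total for kmer in TETRAMERS}


def gc_content(sequence):
    """Return the fraction of G and C among the A/C/G/T bases of a sequence."""
    sequence = sequence.upper()
    acgt = sum(sequence.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / acgt


if __name__ == "__main__":
    contig = "ACGTACGTNNACGT"  # toy sequence, not real data
    profile = tetranucleotide_frequencies(contig)
    print(f"GC content: {gc_content(contig):.2f}")
    print(f"ACGT frequency: {profile['ACGT']:.2f}")
```

Two contigs from the same genome should give similar profiles, which is what lets a binner group them even when their coverage differs slightly.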

🛠️ Modern MetaWRAP Setup

⚠️ Important Note on Installation

DO NOT use metawrap-mg! It’s outdated and has numerous dependency conflicts. Instead, we’ll set up MetaWRAP properly with Python 3.

Why This Approach is Better

✅ Python 3 compatible - All major dependencies now support Python 3
✅ Stable - Avoid dependency hell and version conflicts
✅ Modular - Separate environments for problematic tools (Prokka, Salmon)
✅ Maintainable - Easy to update individual components
✅ Future-proof - Python 2 reached end of life in January 2020

Step 1: Create Main MetaWRAP Environment

# Create clean Python 3 environment
conda create -n metawrap python=3.9
conda activate metawrap

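# Install MetaWRAP from its GitHub repository (not the metawrap-mg conda package)
git clone https://github.com/bxlab/metaWRAP.git ~/metaWRAP
export PATH="$HOME/metaWRAP/bin:$PATH"   # add this line to ~/.bashrc to make it permanent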

# Install core dependencies manually
conda install -c bioconda -c conda-forge \
    bwa \
    samtools \
    metabat2 \
    maxbin2 \
    concoct \
    checkm-genome \
    pplacer \
    blast \
    megahit \
    spades \
    quast \
    biopython \
    pandas \
    seaborn \
    matplotlib

Step 2: Create Separate Environments for Problematic Tools

Prokka Environment (for annotation):

conda create -n prokka python=3.8
conda activate prokka
conda install -c bioconda prokka

# Get Prokka path
which prokka
# Example: /home/user/miniconda3/envs/prokka/bin/prokka

Salmon Environment (for quantification):

conda create -n salmon python=3.8
conda activate salmon
conda install -c bioconda salmon

# Get Salmon path
which salmon
# Example: /home/user/miniconda3/envs/salmon/bin/salmon

Step 3: Configure MetaWRAP Paths

Edit MetaWRAP configuration to point to separate environments:

# Activate main environment
conda activate metawrap

# Find MetaWRAP installation
METAWRAP_PATH=$(which metawrap)
METAWRAP_DIR=$(dirname "$(dirname "$METAWRAP_PATH")")

# Edit configuration (if needed)
# Most scripts will auto-detect conda environments
# If you encounter issues, manually edit scripts to use full paths:

# Example: Edit bin_refinement.sh
nano $METAWRAP_DIR/bin/metawrap-scripts/bin_refinement.sh

# Replace:
#   prokka
# With:
#   /full/path/to/prokka/bin/prokka

Step 4: Handle Python 2 Legacy Scripts

Some auxiliary scripts still use Python 2. Create a Python 2 environment:

conda create -n python2 python=2.7
conda activate python2
conda install biopython numpy

# Get Python 2 path
which python
# Example: /home/user/miniconda3/envs/python2/bin/python

Update shebang lines in legacy scripts:

# Find Python 2 scripts
grep -r "#!/usr/bin/env python$" $METAWRAP_DIR/bin/metawrap-scripts/

# Edit each script to use Python 2 explicitly
# Change:
#   #!/usr/bin/env python
# To:
#   #!/home/user/miniconda3/envs/python2/bin/python
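
If many scripts need the change, editing them by hand gets tedious. The sketch below is one hedged way to automate it in Python: it lists files whose first line is exactly the generic shebang and rewrites that line. The helper names and the dry-run default are my own choices, not part of MetaWRAP, so review the list before letting it modify anything.

```python
from pathlib import Path

GENERIC_SHEBANG = "#!/usr/bin/env python"


def find_generic_shebang_scripts(script_dir):
    """Return files under script_dir whose first line is the generic shebang."""
    matches = []
    for path in sorted(Path(script_dir).rglob("*")):
        if not path.is_file():
            continue
        try:
            with path.open("r", encoding="utf-8", errors="replace") as handle:
                first_line = handle.readline().rstrip("\r\n")
        except OSError:
            continue  # unreadable file: skip it
        if first_line == GENERIC_SHEBANG:
            matches.append(path)
    return matches


def rewrite_shebang(path, interpreter):
    """Replace the first line of a script with '#!<interpreter>'."""
    path = Path(path)
    lines = path.read_text(encoding="utf-8").splitlines(keepends=True)
    lines[0] = f"#!{interpreter}\n"
    path.write_text("".join(lines), encoding="utf-8")


def update_scripts(script_dir, interpreter, dry_run=True):
    """List matching scripts and, unless dry_run is set, rewrite them."""
    scripts = find_generic_shebang_scripts(script_dir)
    if not dry_run:
        for script in scripts:
            rewrite_shebang(script, interpreter)
    return scripts
```

Call `update_scripts(...)` once with the default `dry_run=True` to see which files would change, then again with `dry_run=False` and the Python 2 path reported by `which python`.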

Step 5: Install CheckM2 (Modern Alternative to CheckM)

conda activate metawrap

# Install CheckM2 (faster, more accurate than CheckM)
conda install -c bioconda checkm2

# Download database (~3.5 GB)
checkm2 database --download --path ~/checkm2_db/

Verification

# Activate main environment
conda activate metawrap

# Test installations
metawrap --help
metabat2 --help
run_MaxBin.pl -h   # MaxBin2's executable is run_MaxBin.pl
concoct --help
checkm2 -h

# Test separate environments
conda activate prokka
prokka --version

conda activate salmon
salmon --version
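
If you would rather check everything in one pass, a small Python sketch like the one below can report which executables are visible on the PATH of the currently active environment. The function name and the tool list are assumptions for illustration; add MaxBin2 under whatever executable name your installation provides.

```python
import shutil


def check_tools(tool_names):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in tool_names}


if __name__ == "__main__":
    # Executable names taken from the verification commands in this section.
    tools = ["metawrap", "metabat2", "concoct", "checkm2"]
    for name, path in check_tools(tools).items():
        status = path if path else "NOT FOUND"
        print(f"{name:<10} {status}")
```

Run it once per conda environment, since each environment has its own PATH.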

🔄 Binning Workflow

Overview of the Complete Pipeline

Clean FASTQ Files + Contigs (from Day 1 & 2)
                ↓
┌───────────────────────────────────┐
│  MetaWRAP Initial Binning         │
│  (All-in-One Command)             │
│  • Automatic read mapping         │
│  • Coverage calculation           │
│  • MetaBAT2 binning               │
│  • MaxBin2 binning                │
│  • CONCOCT binning                │
│  • CheckM quality assessment      │
└───────────────────────────────────┘
            ↓
┌───────────────────────────────────┐
│  Bin Refinement (MetaWRAP)        │
│  • Consolidate bins               │
│  • Remove contamination           │
│  • Improve completeness           │
│  • Re-validate with CheckM        │
└───────────────────────────────────┘
            ↓
┌───────────────────────────────────┐
│  Quality Assessment (CheckM2)     │
│  • Completeness > 50%             │
│  • Contamination < 10%            │
│  • Quality tiers (HQ/MQ/LQ)       │
└───────────────────────────────────┘
            ↓
┌───────────────────────────────────┐
│  Abundance Profiling (CoverM)     │
│  • Relative abundance             │
│  • Mean coverage depth            │
│  • RPKM / TPM normalization       │
│  • Multi-sample comparison        │
└───────────────────────────────────┘
            ↓
┌───────────────────────────────────┐
│  Coverage Validation (SingleM)    │
│  • MAG detection in samples       │
│  • % genome covered by reads      │
│  • Identify unbinned organisms    │
│  • Strain-level analysis          │
└───────────────────────────────────┘
            ↓
┌───────────────────────────────────┐
│  Bin Quantification (MetaWRAP)    │
│  • Track across time series       │
│  • Core vs variable microbiome    │
│  • Abundance heatmaps             │
└───────────────────────────────────┘
            ↓
High-Quality MAGs + Abundance + Coverage Data

Step 1: Initial Binning with MetaWRAP (All-in-One)

Modern Approach: MetaWRAP can run all three binners (MetaBAT2, MaxBin2, CONCOCT) in a single command!

Why use multiple binners?

Each algorithm has different strengths
Combining results improves MAG quality
Captures more of the community diversity
MetaWRAP automatically handles coverage calculation

Run MetaWRAP Binning Module

conda activate metawrap

# Run all three binners + CheckM in one command
metawrap binning \
    -o INITIAL_BINNING \
    -t 32 \
    -l 1500 \
    -a contigs.fasta \
    --metabat2 \
    --maxbin2 \
    --concoct \
    --run-checkm \
    clean_reads_1.fastq \
    clean_reads_2.fastq

What this does:

✅ Maps reads back to contigs automatically
✅ Calculates coverage depth
✅ Runs MetaBAT2 binning
✅ Runs MaxBin2 binning
✅ Runs CONCOCT binning
✅ Runs CheckM on all bins
✅ Generates summary statistics

Parameters explained:

-o INITIAL_BINNING: Output directory
-t 32: Number of threads
-l 1500: Minimum contig length to bin, in bp (in this module -m sets the available RAM in GB, not a size cutoff)
-a contigs.fasta: Assembled contigs from Day 2
--metabat2 --maxbin2 --concoct: Enable all three binners
--run-checkm: Run quality assessment immediately
clean_reads_1.fastq clean_reads_2.fastq: Clean paired reads from Day 1 (MetaWRAP expects each pair to end in _1.fastq and _2.fastq, so rename or symlink *_R1/*_R2 files first)

Expected output structure:

INITIAL_BINNING/
├── metabat2_bins/          # MetaBAT2 results
│   ├── bin.1.fa
│   ├── bin.2.fa
│   └── ...
├── maxbin2_bins/           # MaxBin2 results
│   ├── bin.001.fasta
│   ├── bin.002.fasta
│   └── ...
├── concoct_bins/           # CONCOCT results
│   ├── 0.fa
│   ├── 1.fa
│   └── ...
├── metabat2_bins.stats     # CheckM results
├── maxbin2_bins.stats
├── concoct_bins.stats
└── work_files/             # Intermediate files
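
Before moving on, it helps to see how many bins each binner produced and how much sequence they contain. The following is a small sketch, assuming the directory layout shown above; the function names are mine and the script only reads the FASTA files.

```python
from pathlib import Path


def fasta_stats(fasta_path):
    """Return (number of contigs, total bases) for one FASTA file."""
    contigs = 0
    bases = 0
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                contigs += 1
            else:
                bases += len(line)
    return contigs, bases


def summarize_bins(bin_dir, extensions=(".fa", ".fasta")):
    """Return {bin file name: (contigs, bases)} for every bin in bin_dir."""
    summary = {}
    for path in sorted(Path(bin_dir).iterdir()):
        if path.suffix in extensions:
            summary[path.name] = fasta_stats(path)
    return summary


if __name__ == "__main__":
    for binner in ("metabat2_bins", "maxbin2_bins", "concoct_bins"):
        bin_dir = Path("INITIAL_BINNING") / binner
        if not bin_dir.is_dir():
            print(f"{binner}: directory not found")
            continue
        summary = summarize_bins(bin_dir)
        total_mb = sum(bases for _, bases in summary.values()) / 1e6
        print(f"{binner}: {len(summary)} bins, {total_mb:.1f} Mb binned")
```

Large differences between binners are normal at this stage; refinement is what reconciles them.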

Time: 4-8 hours for a typical sample, depending on data size. Runtime grows with the number of samples: in my case, binning 32 metagenomes as a batch job took many days.

Algorithm Strengths:

Binner	Best For	Approach
MetaBAT2	High-coverage bins	Coverage + tetranucleotide frequency
MaxBin2	Low-abundance organisms	Marker genes (107 genes) + coverage
CONCOCT	Complex communities	Gaussian mixture model on coverage + sequence composition

Step 2: Bin Refinement with MetaWRAP

Bin refinement consolidates the three binning results to produce the best possible MAGs.

conda activate metawrap

metawrap bin_refinement \
    -o BIN_REFINEMENT \
    -t 32 \
    -m 40 \
    -A INITIAL_BINNING/metabat2_bins/ \
    -B INITIAL_BINNING/maxbin2_bins/ \
    -C INITIAL_BINNING/concoct_bins/ \
    -c 50 \
    -x 10

# Alternative: Stricter quality thresholds for publication
metawrap bin_refinement \
    -o BIN_REFINEMENT_STRICT \
    -t 32 \
    -m 40 \
    -A INITIAL_BINNING/metabat2_bins/ \
    -B INITIAL_BINNING/maxbin2_bins/ \
    -C INITIAL_BINNING/concoct_bins/ \
    -c 70 \
    -x 5

Parameters:

-o: Output directory
-t 32: Number of threads
-m 40: Memory available in GB (in this module -m is a RAM setting, not a minimum bin size; adjust it to your machine)
-A/-B/-C: Paths to the three binning results
-c 50: Minimum completeness (50%, or 70% for strict)
-x 10: Maximum contamination (10%, or 5% for strict)

What refinement does:

Identifies overlapping bins across methods
Picks best contigs from each method
Removes contaminating contigs
Validates with CheckM
Produces consensus bins
Creates visualization plots

Expected output:

BIN_REFINEMENT/
├── metawrap_50_10_bins/          # Final refined bins
│   ├── bin.1.fa
│   ├── bin.2.fa
│   └── ...
├── metawrap_50_10_bins.stats     # Quality statistics
├── metawrap_50_10_bins.contigs   # Contig assignments
├── figures/
│   ├── binning_results.png       # Venn diagram
│   └── bin_refinement_stats.png  # Quality metrics
└── work_files/

Step 3: Bin Quality Assessment

# Run CheckM on refined bins
checkm lineage_wf \
    -t 16 \
    -x fa \
    BIN_REFINEMENT/metawrap_50_10_bins/ \
    checkm_output/

# Generate summary table
checkm qa \
    checkm_output/lineage.ms \
    checkm_output/ \
    -o 2 \
    -f checkm_summary.txt \
    --tab_table

Interpreting Results:

Quality Tier	Completeness	Contamination	Use Case
High-Quality (HQ)	>90%	<5%	Publications, reference genomes
Medium-Quality (MQ)	50-90%	<10%	Most analyses, gene mining
Low-Quality (LQ)	<50%	or >10%	Exploratory, presence/absence
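
The tiers in this table are easy to apply programmatically. Here is a minimal sketch, with one assumption of mine: a bin is treated as LQ when it fails either the completeness or the contamination criterion, and a bin at exactly 50% completeness counts as MQ. The function name is hypothetical.

```python
def quality_tier(completeness, contamination):
    """Assign HQ / MQ / LQ from completeness and contamination (both in %)."""
    if completeness > 90 and contamination < 5:
        return "HQ"
    if completeness >= 50 and contamination < 10:
        return "MQ"
    return "LQ"


if __name__ == "__main__":
    # Illustrative values only, not results from a real run.
    examples = [(95.2, 2.1), (87.3, 8.5), (45.8, 15.3)]
    for completeness, contamination in examples:
        tier = quality_tier(completeness, contamination)
        print(f"{completeness:5.1f}% complete, {contamination:4.1f}% contaminated: {tier}")
```

Note that a very complete bin with 5-10% contamination lands in MQ under these rules, because it misses the HQ contamination cutoff.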

Step 4: Bin Quantification (Abundance)

Calculate the abundance of each MAG across your samples:

conda activate metawrap

# Quantify bins across samples
metawrap quant_bins \
    -b BIN_REFINEMENT/metawrap_50_10_bins \
    -o QUANT_BINS \
    -a contigs.fasta \
    clean_reads_*.fastq \
    -t 32

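# For multiple samples (time series, spatial)
# MetaWRAP expects each read pair to be named *_1.fastq and *_2.fastq
metawrap quant_bins \
    -b BIN_REFINEMENT/metawrap_50_10_bins \
    -o QUANT_BINS \
    -a contigs.fasta \
    -t 32 \
    sample1_1.fastq sample1_2.fastq \
    sample2_1.fastq sample2_2.fastq \
    sample3_1.fastq sample3_2.fastq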

What quantification does:

Maps reads from each sample to refined bins
Calculates coverage and abundance
Generates abundance table
Identifies dominant vs rare community members

Output:

QUANT_BINS/
├── bin_abundance_table.tab        # Abundance matrix
├── bin_abundance_heatmap.png      # Visualization
└── sample_logs/                   # Per-sample mapping logs

Use cases:

Track MAG abundance across time series
Compare communities between samples
Identify core vs variable microbiome members
Calculate relative abundance

Step 5: MAG Abundance Profiling with CoverM

CoverM provides comprehensive abundance metrics including relative abundance, mean coverage, and RPKM.

Installation

# Install CoverM
conda install -c bioconda coverm

# Verify installation
coverm --version

Calculate MAG Abundance Across Multiple Samples

# Create directory for results
mkdir -p mag_abundance

# Run CoverM genome mode
coverm genome \
    --coupled \
        sample1_R1.fastq.gz sample1_R2.fastq.gz \
        sample2_R1.fastq.gz sample2_R2.fastq.gz \
        sample3_R1.fastq.gz sample3_R2.fastq.gz \
    --genome-fasta-directory BIN_REFINEMENT/metawrap_50_10_bins \
    --genome-fasta-extension fa \
    --output-file mag_abundance/mag_abundance_table.tsv \
    --threads 32 \
    --methods relative_abundance mean trimmed_mean covered_fraction \
    --min-covered-fraction 0.1

# Alternative: If you already have BAM files
coverm genome \
    --bam-files sample1.bam sample2.bam sample3.bam \
    --genome-fasta-directory BIN_REFINEMENT/metawrap_50_10_bins \
    --genome-fasta-extension fa \
    --output-file mag_abundance/mag_abundance_table.tsv \
    --methods relative_abundance mean covered_fraction

CoverM Methods Explained

Method	Description	When to Use
relative_abundance	% of total reads mapping to MAG	Community composition
mean	Mean coverage depth	Abundance estimation
trimmed_mean	Mean after trimming outliers	Robust abundance
covered_fraction	% of genome covered by reads	MAG detection confidence
rpkm	Reads Per Kilobase per Million	Cross-sample normalization
tpm	Transcripts Per Million	RNA-seq abundance
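
To show what the length- and depth-normalized metrics mean, here is a short sketch of the textbook RPKM and TPM formulas applied to per-genome read counts. This is not CoverM's code, and CoverM's exact definitions (for example, which reads count toward the total) may differ in detail; the function names and the numbers are made up for illustration.

```python
def rpkm(read_count, genome_length_bp, total_mapped_reads):
    """Reads per kilobase of genome per million mapped reads."""
    length_kb = genome_length_bp / 1_000
    reads_millions = total_mapped_reads / 1_000_000
    return read_count / (length_kb * reads_millions)


def tpm(read_counts, lengths_bp):
    """Length-normalized read rates rescaled so that they sum to one million.

    read_counts and lengths_bp are dicts keyed by genome name.
    """
    rates = {
        name: read_counts[name] / (lengths_bp[name] / 1_000)
        for name in read_counts
    }
    total_rate = sum(rates.values())
    return {name: rate / total_rate * 1_000_000 for name, rate in rates.items()}


if __name__ == "__main__":
    # Hypothetical counts and genome sizes.
    counts = {"bin.1": 60_000, "bin.2": 30_000}
    lengths = {"bin.1": 4_000_000, "bin.2": 2_000_000}
    total = sum(counts.values())
    for name in counts:
        print(name, f"RPKM={rpkm(counts[name], lengths[name], total):.1f}")
    print(tpm(counts, lengths))
```

Both metrics divide by genome length, which is why a large genome with many reads can still have the same normalized abundance as a small genome with fewer reads.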

Analyze CoverM Output

# View abundance table
column -t mag_abundance/mag_abundance_table.tsv | less -S

# Extract relative abundance only
# (columns 2, 6 and 10 when four methods are reported for three samples)
cut -f1,2,6,10 mag_abundance/mag_abundance_table.tsv > mag_relative_abundance.tsv

Example output:

Genome              Sample1_RA  Sample2_RA  Sample3_RA  Sample1_Mean  Sample2_Mean
bin.1.fa            15.2        18.4        12.1        45.3          52.1
bin.2.fa            8.7         6.2         9.3         28.4          21.7
bin.3.fa            2.1         3.4         1.8         8.9           12.3
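
Selecting columns by position with `cut` breaks as soon as the number of methods or samples changes. A more robust option is to select by header name, as in this standard-library sketch. It assumes, like the plotting script in this tutorial, that relative abundance columns contain the text "Relative Abundance" in their header; the function names and file paths are placeholders.

```python
import csv


def relative_abundance_columns(header, marker="Relative Abundance"):
    """Return the index of the genome column (0) plus every column matching marker."""
    return [0] + [i for i, name in enumerate(header) if i > 0 and marker in name]


def extract_columns(in_path, out_path, marker="Relative Abundance"):
    """Copy the genome column and all matching columns to a new TSV file."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        header = next(reader)
        keep = relative_abundance_columns(header, marker)
        writer.writerow([header[i] for i in keep])
        for row in reader:
            writer.writerow([row[i] for i in keep])
    return keep
```

For example, `extract_columns("mag_abundance/mag_abundance_table.tsv", "mag_relative_abundance.tsv")` would write the same table that the `cut` command aims for, whatever the column order.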

Visualize MAG Abundance

#!/usr/bin/env python3
# visualize_mag_abundance.py

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read CoverM output
df = pd.read_csv('mag_abundance/mag_abundance_table.tsv', sep='\t')

# Extract relative abundance columns by header name
ra_cols = [col for col in df.columns if 'Relative Abundance' in col]
abundance_df = df[['Genome'] + ra_cols].copy()

# Clean column names (keep only the sample name)
abundance_df.columns = ['MAG'] + [col.split()[0] for col in ra_cols]

# Set MAG as index
abundance_df = abundance_df.set_index('MAG')

# Create heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(abundance_df, annot=True, fmt='.2f', cmap='YlOrRd',
            cbar_kws={'label': 'Relative Abundance (%)'})
plt.title('MAG Relative Abundance Across Samples', fontsize=14, fontweight='bold')
plt.xlabel('Sample', fontsize=12)
plt.ylabel('MAG', fontsize=12)
plt.tight_layout()
plt.savefig('mag_abundance/abundance_heatmap.pdf', dpi=300)
print("✓ Heatmap saved: mag_abundance/abundance_heatmap.pdf")

# Create stacked bar chart
abundance_df.T.plot(kind='bar', stacked=True, figsize=(10, 6),
                     colormap='tab20')
plt.title('MAG Community Composition', fontsize=14, fontweight='bold')
plt.xlabel('Sample', fontsize=12)
plt.ylabel('Relative Abundance (%)', fontsize=12)
plt.legend(title='MAG', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.savefig('mag_abundance/composition_barplot.pdf', dpi=300)
print("✓ Barplot saved: mag_abundance/composition_barplot.pdf")

Step 6: MAG Detection and Coverage with SingleM

SingleM analyzes what percentage of each MAG is actually represented in your samples using single-copy marker genes.

Why Use SingleM?

✅ Strain-level detection: Identify MAG presence even at low abundance
✅ Coverage estimation: % of MAG covered by sample reads
✅ Completeness check: Validate MAG recovery across samples
✅ Compare samples: Track MAG presence/absence across conditions

Installation

# Install SingleM
conda install -c bioconda singlem

# Download database (~2 GB) into a directory of your choice
singlem data --output-directory ~/singlem_db/

Run SingleM Analysis

# Create directory for results
mkdir -p singlem_results

# Run SingleM pipe on the reads of one sample
# (singlem pipe takes paired reads via -1/--forward and -2/--reverse)
singlem pipe \
    -1 sample1_R1.fastq.gz \
    -2 sample1_R2.fastq.gz \
    --otu-table singlem_results/sample1_otu_table.tsv \
    --threads 32

# Process all samples
for sample in sample1 sample2 sample3; do
    singlem pipe \
        -1 "${sample}_R1.fastq.gz" \
        -2 "${sample}_R2.fastq.gz" \
        --otu-table "singlem_results/${sample}_otu_table.tsv" \
        --threads 32
done

# Run SingleM on MAGs to get reference profiles
singlem pipe \
    --genome-fasta-files BIN_REFINEMENT/metawrap_50_10_bins/*.fa \
    --otu-table singlem_results/mags_otu_table.tsv \
    --threads 32

Appraise MAG Coverage in Samples

# Compare sample OTUs to MAG OTUs
singlem appraise \
    --metagenome-otu-tables singlem_results/sample*_otu_table.tsv \
    --genome-otu-tables singlem_results/mags_otu_table.tsv \
    --output-binned-otu-table singlem_results/binned_otus.tsv \
    --output-unbinned-otu-table singlem_results/unbinned_otus.tsv

# Calculate MAG recovery percentage
singlem summarise \
    --input-otu-tables singlem_results/sample*_otu_table.tsv \
    --output-otu-table singlem_results/summary_otu_table.tsv

# Detailed appraisal with stats
singlem appraise \
    --metagenome-otu-tables singlem_results/sample*_otu_table.tsv \
    --genome-otu-tables singlem_results/mags_otu_table.tsv \
    --plot singlem_results/appraisal_plot.png \
    --output-found-in-metagenome singlem_results/found_mags.tsv

Interpreting SingleM Results

Binned OTUs (binned_otus.tsv):

MAG        Sample      Gene    Coverage    Marker_count
bin.1.fa   sample1     riboL2  95.2%       14/14
bin.1.fa   sample2     riboL2  78.4%       11/14
bin.2.fa   sample1     riboL2  45.8%       6/14

What this tells you:

95.2% coverage: MAG is well-represented in sample1
78.4% coverage: MAG present but lower abundance in sample2
45.8% coverage: Partial MAG detection (low abundance or incomplete)

Unbinned OTUs (unbinned_otus.tsv):

Sequences in samples that don’t match any MAG
Indicates organisms you haven’t recovered
Useful for identifying missing community members

Visualize SingleM Results

#!/usr/bin/env python3
# visualize_singlem_coverage.py

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read SingleM appraisal results
df = pd.read_csv('singlem_results/found_mags.tsv', sep='\t')

# Create pivot table: MAGs vs Samples
pivot = df.pivot_table(values='Coverage', index='MAG', columns='Sample', aggfunc='mean')

# Plot coverage heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(pivot, annot=True, fmt='.1f', cmap='RdYlGn',
            vmin=0, vmax=100,
            cbar_kws={'label': 'Coverage (%)'})
plt.title('MAG Coverage Across Samples (SingleM)', fontsize=14, fontweight='bold')
plt.xlabel('Sample', fontsize=12)
plt.ylabel('MAG', fontsize=12)
plt.tight_layout()
plt.savefig('singlem_results/mag_coverage_heatmap.pdf', dpi=300)
print("✓ Coverage heatmap saved")

# Calculate summary statistics
print("\n" + "="*60)
print("  MAG DETECTION SUMMARY")
print("="*60)
print(f"\nTotal MAGs analyzed: {len(pivot)}")
print(f"Total samples: {len(pivot.columns)}")
print(f"\nMean coverage: {pivot.mean().mean():.1f}%")
print(f"MAGs with >90% coverage: {(pivot > 90).sum().sum()}")
print(f"MAGs with >50% coverage: {(pivot > 50).sum().sum()}")
print(f"MAGs with <10% coverage: {(pivot < 10).sum().sum()} (likely false positives)")
print("="*60)

Combine CoverM and SingleM Results

Create a comprehensive abundance + coverage table:

#!/usr/bin/env python3
# combine_abundance_coverage.py

import pandas as pd

# Read CoverM abundance
coverm_df = pd.read_csv('mag_abundance/mag_abundance_table.tsv', sep='\t')
ra_cols = [col for col in coverm_df.columns if 'Relative Abundance' in col]
abundance = coverm_df[['Genome'] + ra_cols]

# Read SingleM coverage
singlem_df = pd.read_csv('singlem_results/found_mags.tsv', sep='\t')
coverage_pivot = singlem_df.pivot_table(values='Coverage',
                                         index='MAG',
                                         columns='Sample',
                                         aggfunc='mean')

# Merge
combined = abundance.set_index('Genome').join(coverage_pivot, rsuffix='_coverage')

# Save
combined.to_csv('mag_abundance/combined_abundance_coverage.tsv', sep='\t')

print("✓ Combined table saved: mag_abundance/combined_abundance_coverage.tsv")
print("\nExample output:")
print(combined.head())

Example combined output:

MAG        Sample1_RA  Sample1_Cov  Sample2_RA  Sample2_Cov  Interpretation
bin.1.fa   15.2%       95.2%        18.4%       92.1%        High abundance, well-covered
bin.2.fa   8.7%        78.4%        6.2%        71.3%        Medium abundance, good coverage
bin.3.fa   0.5%        15.2%        0.3%        8.7%         Low abundance, poor coverage

Decision Matrix: Abundance + Coverage

Relative Abundance	Coverage	Interpretation	Action
High (>5%)	High (>80%)	Abundant, well-recovered	✓ Use for analysis
High (>5%)	Low (<50%)	Abundant but incomplete MAG	Reassemble or refine
Low (<1%)	High (>80%)	Rare but complete MAG	✓ Use with caution
Low (<1%)	Low (<50%)	Rare and incomplete	Consider removing
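
The decision matrix translates directly into a small function. The sketch below uses the thresholds from the table and, as an assumption of mine, returns a "review manually" answer for values that fall between the table's categories (for example 1-5% abundance or 50-80% coverage), since the matrix does not define them.

```python
def classify_mag(relative_abundance, coverage):
    """Return the suggested action for a MAG (both inputs in %)."""
    high_abundance = relative_abundance > 5
    low_abundance = relative_abundance < 1
    high_coverage = coverage > 80
    low_coverage = coverage < 50

    if high_abundance and high_coverage:
        return "Use for analysis"
    if high_abundance and low_coverage:
        return "Reassemble or refine"
    if low_abundance and high_coverage:
        return "Use with caution"
    if low_abundance and low_coverage:
        return "Consider removing"
    return "Not covered by the matrix; review manually"


if __name__ == "__main__":
    # Values from the example combined table in this section.
    for name, ra, cov in [("bin.1.fa", 15.2, 95.2),
                          ("bin.2.fa", 8.7, 78.4),
                          ("bin.3.fa", 0.5, 15.2)]:
        print(f"{name}: {classify_mag(ra, cov)}")
```

The middle case (bin.2.fa) shows why the fallback matters: many real MAGs sit between the clean categories.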

Summary: Abundance Analysis Workflow

# 1. Basic quantification (MetaWRAP)
metawrap quant_bins ...

# 2. Detailed abundance (CoverM)
coverm genome --methods relative_abundance mean covered_fraction ...

# 3. Coverage validation (SingleM)
singlem pipe ...
singlem appraise ...

# 4. Visualize and combine
python visualize_mag_abundance.py
python visualize_singlem_coverage.py
python combine_abundance_coverage.py

What you learn:

📊 Relative abundance: Which MAGs dominate the community
📈 Coverage depth: How well each MAG is sampled
✅ Detection confidence: % of MAG represented in reads
🔍 Missing organisms: Unbinned OTUs indicate gaps in MAG recovery

🤖 SemiBin2 Overview

SemiBin2 is a modern, deep learning-based binner that often outperforms traditional methods.

Why Use SemiBin2?

✅ Deep learning-based - Neural networks learn genomic patterns
✅ Self-supervised - No need for labeled training data
✅ Multi-sample aware - Excellent for time series/spatial data
✅ Fast - GPU acceleration available
✅ High precision - Lower contamination rates

Quick Start

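# Single-sample binning
SemiBin2 single_easy_bin \
    -i contigs.fasta \
    -b contigs.bam \
    -o semibin_output \
    --sequencing-type short_read

# Multi-sample binning (recommended for ≥3 samples)
# Here contigs.fasta must be the combined FASTA of all per-sample assemblies
# (SemiBin2 provides a concatenate_fasta subcommand for this), and each BAM
# file must contain one sample's reads mapped to that combined FASTA.
SemiBin2 multi_easy_bin \
    -i contigs.fasta \
    -b sample1.bam sample2.bam sample3.bam \
    -o semibin_multi_output \
    --sequencing-type short_read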

For Detailed Tutorial

🔗 See my complete SemiBin2 guide →

This covers:

Installation and setup
Single vs multi-sample strategies
GPU acceleration
Parameter optimization
Integration with MetaWRAP
Benchmarking against other binners

🔍 Quality Assessment with CheckM2

CheckM2 is the modern successor to CheckM - faster, more accurate, and easier to use.

Why CheckM2 > CheckM?

Feature	CheckM2	CheckM (old)
Speed	13x faster	Slow
Database	21,000 genomes	5,000 genomes
Memory	Low	High
Accuracy	Higher	Good
Installation	Easy	Complex

Installation & Setup

conda activate metawrap

# Install CheckM2
conda install -c bioconda checkm2

# Download database (~3.5 GB, one-time)
checkm2 database --download --path ~/checkm2_db/

# Verify installation
checkm2 --version

Running CheckM2

# Assess all refined bins
checkm2 predict \
    --threads 16 \
    --input BIN_REFINEMENT/metawrap_50_10_bins \
    --output-directory checkm2_output \
    -x fa

# checkm2 predict already writes the quality report; view it with:
column -t -s $'\t' checkm2_output/quality_report.tsv | less -S

Interpreting Output

Key Metrics:

Completeness - % of expected marker genes found
- >90% = Excellent
- 70-90% = Good
- 50-70% = Fair
- <50% = Incomplete
Contamination - % of duplicated marker genes
- <5% = Excellent
- 5-10% = Acceptable
- >10% = Needs refinement
Strain Heterogeneity - Mixed strains indicator (a CheckM metric; CheckM2's quality_report.tsv does not include it)
- Low = Single strain
- High = Multiple strains mixed

Example output:

Name      Completeness  Contamination  Strain_heterogeneity
bin.1     95.2%        2.1%           0.0%          # HQ MAG
bin.2     87.3%        8.5%           10.2%         # MQ MAG
bin.3     45.8%        15.3%          25.6%         # LQ MAG - discard

Quality Filtering

# Keep only high/medium quality MAGs
python << 'EOF'
import pandas as pd

# Read CheckM2 output
df = pd.read_csv('checkm2_output/quality_report.tsv', sep='\t')

# Define quality thresholds (mq also includes the HQ MAGs)
hq = df[(df['Completeness'] > 90) & (df['Contamination'] < 5)]
mq = df[(df['Completeness'] > 50) & (df['Contamination'] < 10)]

print(f"High-quality MAGs: {len(hq)}")
print(f"Medium-quality or better MAGs: {len(mq)}")

# Save filtered list
mq.to_csv('quality_mags.csv', index=False)
EOF
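
Once you have the filtered list, you will usually want the passing bins collected in one place. Here is a hedged standard-library sketch that reads the quality report and copies matching bins into a new directory. It assumes the report has Name, Completeness and Contamination columns and that Name is the bin file name without its extension; the function names and the final_mags directory are placeholders of mine.

```python
import csv
import shutil
from pathlib import Path


def select_bins(report_path, min_completeness=50.0, max_contamination=10.0):
    """Return names of bins in the quality report that pass both thresholds."""
    passing = []
    with open(report_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            if (float(row["Completeness"]) > min_completeness
                    and float(row["Contamination"]) < max_contamination):
                passing.append(row["Name"])
    return passing


def copy_bins(names, bin_dir, out_dir, extension=".fa"):
    """Copy the named bins from bin_dir to out_dir; return copied file names."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        source = Path(bin_dir) / f"{name}{extension}"
        if source.is_file():  # silently skip names without a matching file
            shutil.copy2(source, out_dir / source.name)
            copied.append(source.name)
    return copied
```

A typical call would be `copy_bins(select_bins("checkm2_output/quality_report.tsv"), "BIN_REFINEMENT/metawrap_50_10_bins", "final_mags")`. Copying rather than moving keeps the refinement output intact until you have verified the result.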

📊 Visualizing Results

Create Bin Quality Plot

#!/usr/bin/env python3
# plot_bin_quality.py

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read CheckM2 output
df = pd.read_csv('checkm2_output/quality_report.tsv', sep='\t')

# Create scatter plot
plt.figure(figsize=(10, 8))
scatter = plt.scatter(df['Contamination'], df['Completeness'],
                     s=100, alpha=0.6, c=df['Completeness']-df['Contamination'],
                     cmap='RdYlGn')

# Add quality thresholds
plt.axhline(y=90, color='green', linestyle='--', label='HQ threshold (90%)')
plt.axhline(y=50, color='orange', linestyle='--', label='MQ threshold (50%)')
plt.axvline(x=5, color='green', linestyle='--', label='HQ contamination (5%)')
plt.axvline(x=10, color='orange', linestyle='--', label='MQ contamination (10%)')

plt.xlabel('Contamination (%)', fontsize=12)
plt.ylabel('Completeness (%)', fontsize=12)
plt.title('MAG Quality Assessment', fontsize=14, fontweight='bold')
plt.colorbar(scatter, label='Quality Score (Comp - Cont)')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('mag_quality.pdf', dpi=300)
plt.savefig('mag_quality.png', dpi=300)
print("✓ Plot saved: mag_quality.pdf/png")

Generate Summary Report

python << 'EOF'
import pandas as pd
from collections import Counter

# Read data (the GTDB-Tk summary must already exist; classification is covered in Day 4)
checkm = pd.read_csv('checkm2_output/quality_report.tsv', sep='\t')
gtdbtk = pd.read_csv('gtdbtk_output/gtdbtk.bac120.summary.tsv', sep='\t')

# Quality summary
hq = len(checkm[(checkm['Completeness']>90) & (checkm['Contamination']<5)])
mq = len(checkm[(checkm['Completeness']>50) & (checkm['Contamination']<10) &
                ~((checkm['Completeness']>90) & (checkm['Contamination']<5))])
lq = len(checkm) - hq - mq

print("="*60)
print("  MAG Recovery Summary")
print("="*60)
print(f"Total bins recovered: {len(checkm)}")
print(f"  High-quality (HQ):    {hq} ({hq/len(checkm)*100:.1f}%)")
print(f"  Medium-quality (MQ):  {mq} ({mq/len(checkm)*100:.1f}%)")
print(f"  Low-quality (LQ):     {lq} ({lq/len(checkm)*100:.1f}%)")
print()

# Taxonomy summary (classification strings look like d__...;p__...;c__...)
phyla = [tax.split(';')[1] if ';' in str(tax) else 'Unclassified'
         for tax in gtdbtk['classification']]
phyla_counts = Counter(phyla)

print("Top 5 Phyla:")
for phylum, count in phyla_counts.most_common(5):
    print(f"  {phylum}: {count} MAGs")
print()

# Size statistics
print("Genome Size Statistics:")
print(f"  Mean: {checkm['Genome_Size'].mean()/1e6:.2f} Mb")
print(f"  Median: {checkm['Genome_Size'].median()/1e6:.2f} Mb")
print(f"  Range: {checkm['Genome_Size'].min()/1e6:.2f} - {checkm['Genome_Size'].max()/1e6:.2f} Mb")
print("="*60)
EOF

💡 Best Practices

Before Binning

✅ Good assembly quality (N50 > 5kb recommended)
✅ Sufficient coverage (>10x mean coverage minimum)
✅ Multiple samples (if available) improve binning
✅ Long contigs (>2500 bp) bin better
✅ Clean data from Day 1 QC

During Binning

✅ Use multiple binners (3+ recommended)
✅ Refine bins with MetaWRAP or DAS Tool
✅ Document parameters for reproducibility
✅ Monitor resources (disk space, memory)
✅ Save intermediate files until verified

After Binning

✅ Quality check ALL bins with CheckM2
✅ Filter by quality (keep MQ+ for most analyses)
✅ Classify taxonomically with GTDB-Tk
✅ Dereplicate if many similar MAGs
✅ Backup MAGs - they’re valuable!

Quality Thresholds by Use Case

Use Case	Min Completeness	Max Contamination
Reference genome	95%	2%
Publication	90%	5%
Gene mining	70%	10%
Metabolic analysis	50%	10%
Presence/absence	50%	15%

🔧 Troubleshooting

Problem: Few or No Bins Recovered

Possible causes:

Low sequencing depth
Poor assembly quality
High community complexity
Incorrect coverage calculation

Solutions:

Check assembly N50 (should be >1kb)
Verify coverage file is correct

Adjust binning parameters:

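# Lower minimum bin size (MetaBAT2's -s/--minClsSize)
metabat2 -s 100000  # instead of the default 200000

# Lower minimum contig length (MetaBAT2 does not accept values below 1500)
metabat2 -m 1500  # -m/--minContig, instead of the default 2500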

Try SemiBin2 (often better for complex samples)
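
If you do not have the assembly N50 at hand, it is quick to compute. The sketch below is a plain-Python illustration (function names are mine): N50 is the length of the contig at which the running total of lengths, sorted from longest to shortest, first reaches half of the assembly size.

```python
def contig_lengths(fasta_path):
    """Return the length of every sequence in a FASTA file."""
    lengths = []
    current = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0
```

For the Day 2 assembly this would be `n50(contig_lengths("contigs.fasta"))`; compare the result with the thresholds mentioned in this tutorial.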

Problem: High Contamination in Bins

Solutions:

Stricter refinement:

metawrap bin_refinement -c 70 -x 5  # More stringent

Manual curation:

# Use RefineM to flag contigs with outlying GC, tetranucleotide or coverage profiles
conda install -c bioconda refinem
refinem scaffold_stats -c 16 -x fa contigs.fasta bins/ refinem_stats/ contigs.bam

Re-bin with different parameters

Problem: MetaWRAP Refinement Fails

Common issues:

Issue 1: Prokka fails

# Verify Prokka environment
conda activate prokka
prokka --version

# Re-install if needed
conda remove prokka
conda install -c bioconda prokka

Issue 2: CheckM database missing

# Point CheckM at an already downloaded and extracted reference database
checkm data setRoot ~/checkm_db/

Issue 3: Memory issues

# Reduce threads
metawrap bin_refinement -t 8  # instead of 32

# Or use --quick, which runs CheckM with its reduced tree (faster, less memory)
metawrap bin_refinement --quick

Problem: CheckM2 Fails

Solutions:

# Re-download database
rm -rf ~/checkm2_db/
checkm2 database --download --path ~/checkm2_db/

# Verify database integrity
checkm2 testrun

# Check available memory
free -h  # Need at least 16GB

# Reduce parallel jobs if memory limited
checkm2 predict --threads 8  # instead of 16

📈 Expected Results

Typical Recovery Rates

Sample Type	Expected HQ MAGs	Expected MQ MAGs
Soil	5-20	20-50
Human gut	10-40	30-80
Marine	5-15	15-40
Wastewater	10-30	30-70

Factors affecting recovery:

Community complexity (α-diversity)
Sequencing depth (>20M reads = better)
Assembly quality (N50 >5kb = better)
Number of samples (more = better for SemiBin2)

Benchmark Comparison

On a mock community (known composition):

Binner	Precision	Recall	F1-Score
SemiBin2	96%	94%	95%
MetaWRAP Refined	94%	91%	92.5%
MetaBAT2	90%	88%	89%
MaxBin2	88%	89%	88.5%
CONCOCT	85%	86%	85.5%

Higher is better; refinement improves both metrics
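
For reference, the F1-score in this table is the harmonic mean of precision and recall. Here is a minimal sketch that recomputes it from the table's precision and recall values (the function name is mine); the results agree with the F1 column after rounding.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall (in %) copied from the benchmark table.
    table = {
        "SemiBin2": (96, 94),
        "MetaWRAP Refined": (94, 91),
        "MetaBAT2": (90, 88),
        "MaxBin2": (88, 89),
        "CONCOCT": (85, 86),
    }
    for binner, (precision, recall) in table.items():
        print(f"{binner:<17} F1 = {f1_score(precision, recall):.1f}%")
```

Because it is a harmonic mean, F1 is pulled toward the lower of the two values, so a binner cannot score well by maximizing only precision or only recall.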

✅ Success Checklist

Before moving to Day 4:

Multiple binners run successfully (3+ recommended)
Bins refined with MetaWRAP or DAS Tool
CheckM2 quality assessment completed
At least 5-10 MQ+ MAGs recovered
Quality plots generated
MAGs organized in final directory

📚 Key Papers & Resources

Essential Reading

Binning Methods:
- Kang et al. (2019) - MetaBAT2: PeerJ
- Wu et al. (2016) - MaxBin2: Bioinformatics
- Alneberg et al. (2014) - CONCOCT: Nature Methods
- Pan et al. (2023) - SemiBin2: Bioinformatics
Quality Assessment:
- Chklovski et al. (2023) - CheckM2: Nature Methods
- Parks et al. (2015) - CheckM: Genome Research
Refinement:
- Uritskiy et al. (2018) - MetaWRAP: Microbiome
- Sieber et al. (2018) - DAS Tool: Nature Microbiology

➡️ What’s Next?

Congratulations! You’ve recovered individual genomes from your metagenome!

Day 4: Genome Dereplication & Taxonomic Classification (Coming Soon)

Dereplicate - Remove redundant genomes, keep best representatives
Classify - Assign accurate taxonomic names using GTDB
Visualize - Create beautiful phylogenetic trees
Curate - Select species representatives for downstream analysis

💬 Feedback

Found this helpful? Have suggestions?

Repo for today’s code and other details

🔗 See my complete repo →

Last updated: February 2025
