Machine-learning MAG binning with SemiBin2 and Snakemake (soil metagenomes)

Today’s post is about one of my favorite combinations right now:
machine-learning–guided binning with SemiBin2 + a fully automated Snakemake workflow for recovering MAGs from highly complex soil metagenomes.

This is part of my broader goal of machine-learning–guided multi-omics analysis.

🧬 Why soil metagenomes are painful (and exciting)

I’m working with soil metagenomes, which are notoriously challenging:

1 g of soil can harbor millions to billions of microbial cells (bacteria, archaea, and fungi)
Community richness and evenness are extremely high
Assemblies become highly fragmented with low N50
Short contigs make binning much harder
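To make "low N50" concrete, here is a minimal sketch of how N50 is computed from contig lengths: sort the contigs from longest to shortest and report the length at which the running total first reaches half of the assembly size. The two length lists are made-up toy values, not numbers from my assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum past half the
        # total assembly length defines the N50.
        if running >= half_total:
            return length
    return 0


# Toy data (hypothetical): same total size, very different fragmentation.
less_fragmented = [50_000, 40_000, 30_000, 20_000]
fragmented = [5_000] * 4 + [1_000] * 120

print("N50, less fragmented:", n50(less_fragmented))
print("N50, fragmented:", n50(fragmented))
```

Both toy assemblies contain 140 kb of sequence, but the fragmented one is dominated by 1 kb contigs, which is the situation that makes soil binning hard.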

After exploring different strategies, I moved to a co-assembly approach:

Co-assembly of ~30 samples with MEGAHIT
Producing a single final.contigs.fa

(I will cover the MEGAHIT parameters in a separate post.)

🧱 Classical binners: what failed and what worked

I initially tested the usual multi-binner approaches:

❌ CONCOCT

Likely failed because:

The assembly was extremely large
Cutting the contigs into 10 kb fragments did not complete consistently
Coverage patterns were too noisy

❌ MaxBin2

Struggled due to:

Too many contigs
Sparse coverage across samples

⚠️ MetaBAT2 (worked partially)

Produced ~500 bins
Good, but clearly under-represented soil diversity
Bins were fragmented and incomplete

To do better, I needed something that:

Scales to large, fragmented assemblies
Leverages machine learning rather than fixed heuristics

🤖 Enter SemiBin2: machine-learning–guided binning

SemiBin2 integrates:

Sequence composition (k-mers)
Coverage profiles from multiple samples
Deep-learning-based embeddings
Density-based clustering for final bins
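To illustrate the sequence-composition signal, here is a small sketch that turns one contig into a vector of tetranucleotide (4-mer) frequencies. It is only an illustration of the idea: the function name and the choice to skip windows containing N are my assumptions, and SemiBin2's own feature extraction may differ (for example in how it treats reverse complements).

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return normalized k-mer frequencies over A/C/G/T for one contig."""
    sequence = sequence.upper()
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Only windows made purely of A/C/G/T contribute; windows with N
    # or other ambiguity codes are ignored (an assumption of this sketch).
    total = sum(counts[kmer] for kmer in all_kmers)
    if total == 0:
        return {kmer: 0.0 for kmer in all_kmers}
    return {kmer: counts[kmer] / total for kmer in all_kmers}


# Toy contig (hypothetical sequence).
freqs = kmer_frequencies("ACGTACGTNACGT")
print(len(freqs), "features; ACGT frequency:", freqs["ACGT"])
```

Contigs from the same genome tend to have similar composition vectors, so a vector like this can be combined with per-sample coverage before any learning or clustering step.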

🔗 Repo: https://github.com/BigDataBiology/SemiBin

For complex and diverse soil data, this ML approach gave me a clear improvement over the classical binners (numbers in the results section below).

I automated everything through Snakemake so the full workflow runs with one command.

🔁 My SemiBin2 + Snakemake workflow (high-level)

Inputs

Co-assembly: final.contigs.fa
Paired-end reads: sample_R1_nodup.fq, sample_R2_nodup.fq

Workflow steps

Build Bowtie2 index
Map each sample → sorted BAM
Create bam_list.txt
SemiBin2 generate features
SemiBin2 train_self (machine-learning step)
SemiBin2 bin (produces bins)
Unzip and convert .fa.gz → .fa
MetaWRAP refinement
dRep dereplication

All managed by Snakemake.
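As a rough sketch of the mapping steps, the helper below builds the shell commands for one sample and the contents of bam_list.txt. The index prefix `coassembly_index`, the `.sorted.bam` naming, and the thread count are assumptions for illustration; the Snakefile in the repository may name and parameterize things differently.

```python
import shlex


def index_command(assembly="final.contigs.fa", index_prefix="coassembly_index"):
    """Shell command that builds the Bowtie2 index for the co-assembly."""
    return f"bowtie2-build {shlex.quote(assembly)} {shlex.quote(index_prefix)}"


def mapping_commands(sample, index_prefix="coassembly_index", threads=8):
    """Return (bam_name, commands) to map one sample to a sorted, indexed BAM."""
    r1 = f"{sample}_R1_nodup.fq"
    r2 = f"{sample}_R2_nodup.fq"
    bam = f"{sample}.sorted.bam"  # hypothetical naming scheme
    # bowtie2 writes SAM to stdout; samtools sort reads it from stdin ("-").
    align = (
        f"bowtie2 -p {threads} -x {shlex.quote(index_prefix)} "
        f"-1 {shlex.quote(r1)} -2 {shlex.quote(r2)} "
        f"| samtools sort -@ {threads} -o {shlex.quote(bam)} -"
    )
    return bam, [align, f"samtools index {shlex.quote(bam)}"]


def bam_list_text(bams):
    """One BAM path per line, as it would be written to bam_list.txt."""
    return "".join(f"{bam}\n" for bam in bams)


# Hypothetical sample names.
bams = []
print(index_command())
for sample in ["sampleA", "sampleB"]:
    bam, commands = mapping_commands(sample, threads=4)
    bams.append(bam)
    print("\n".join(commands))
print(bam_list_text(bams), end="")
```

In a Snakemake rule, the same command strings would sit in a `shell:` block, with the sample name supplied by a wildcard.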

🚀 Running the workflow

Once the Snakefile is created:

snakemake -j 32

Snakemake will:

Detect required jobs
Build the DAG
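Run independent jobs in parallel, within the 32-core budget set by -j 32
On a rerun, execute only jobs whose outputs are missing or out of date, such as failed steps
Keep the run reproducible, since every step is declared in the Snakefile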
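The dependency logic can be pictured with a toy model. This sketch uses Python's standard `graphlib` to order the workflow steps and to work out what still has to run after a partial failure. The step names are hypothetical labels for the stages listed earlier, and real Snakemake reasons about input and output files and their timestamps, not step names.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it needs.
dependencies = {
    "bowtie2_index": set(),
    "map_reads": {"bowtie2_index"},
    "bam_list": {"map_reads"},
    "semibin_features": {"bam_list"},
    "semibin_train": {"semibin_features"},
    "semibin_bin": {"semibin_train"},
    "unzip_bins": {"semibin_bin"},
    "metawrap_refine": {"unzip_bins"},
    "drep": {"metawrap_refine"},
}


def jobs_to_run(dependencies, finished):
    """Return the steps that still need to run, in a valid execution order."""
    pending = []
    for step in TopologicalSorter(dependencies).static_order():
        # A step reruns if it never finished, or if anything upstream reruns.
        upstream_pending = any(dep in pending for dep in dependencies.get(step, ()))
        if step not in finished or upstream_pending:
            pending.append(step)
    return pending


# Suppose training crashed after feature generation had completed.
finished = {"bowtie2_index", "map_reads", "bam_list", "semibin_features"}
print(jobs_to_run(dependencies, finished))
```

Here the rerun starts at the training step and continues downstream, while the index, mapping, and feature steps are left alone.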

📈 Results from this workflow

⭐ SemiBin2 initial output

6,563 bins generated from the co-assembly.

Using:

MetaWRAP bin refinement
dRep dereplication
≥ 70% completeness
≤ 5% contamination

I recovered:

🔥 ~1,000 high-quality MAGs

A major improvement over the ~500 bins from MetaBAT2 alone.
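The completeness and contamination cutoffs amount to a simple filter over a per-bin quality table. Below is a minimal sketch, assuming a tab-separated table with columns named `bin`, `completeness`, and `contamination`; these column names and the example rows are hypothetical, and the actual quality reports from MetaWRAP or dRep may be laid out differently.

```python
import csv
import io


def passes_quality(completeness, contamination,
                   min_completeness=70.0, max_contamination=5.0):
    """Apply the thresholds from the post: >= 70% complete, <= 5% contaminated."""
    return completeness >= min_completeness and contamination <= max_contamination


def filter_bins(table_text):
    """Return names of bins that pass, given a tab-separated quality table."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return [
        row["bin"]
        for row in reader
        if passes_quality(float(row["completeness"]), float(row["contamination"]))
    ]


# Made-up example rows, including values exactly on both thresholds.
example = "\n".join([
    "bin\tcompleteness\tcontamination",
    "bin.1\t95.2\t1.1",
    "bin.2\t70.0\t5.0",
    "bin.3\t69.9\t0.5",
    "bin.4\t88.0\t7.3",
]) + "\n"

print(filter_bins(example))
```

Both comparisons are inclusive, so a bin sitting exactly on a threshold is kept.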

These high-quality MAGs will be used for:

Taxonomy
Functional annotation
CAZyme profiling
Energy metabolism markers
Metagenome–metatranscriptome (MG–MTX) expression integration
Functional redundancy modeling (future post)

🔚 Wrap-up

This combined SemiBin2 + Snakemake workflow gave me:

A reproducible, automated MAG recovery pipeline
Better binning in extremely complex soil metagenomes
A strong foundation for multi-omics integration
ML-powered binning without manual tuning

💾 Full workflow on GitHub

I’ll maintain and update the full Snakemake pipeline here:

👉 https://github.com/jojyjohn28/semibin2-soil-mag-workflow