Day 8: Workflow Wrappers & Web Platforms

Estimated Time: 4-6 hours

Difficulty: Beginner to Intermediate

Prerequisites: Understanding of Days 1-7 concepts

📚 Table of Contents

Introduction
Part I: Workflow Wrappers
Part II: Web-Based Platforms
Part III: Choosing the Right Tool
Best Practices

🎯 Introduction

You’ve learned the individual tools. Now let’s automate everything.

The Challenge

Days 1-7 taught you:

Quality control
Assembly
Binning
Annotation
Comparative analysis

But running each tool separately is:

❌ Time-consuming
❌ Error-prone
❌ Hard to reproduce
❌ Requires deep Linux knowledge

The Solution

Workflow wrappers = Pre-built pipelines that automate everything
Web platforms = Run analyses in your browser (no installation!)

🔧 Part I: Workflow Wrappers

Overview

Tool	Type	Best For
MetaWRAP	All-in-one	Complete workflows, HPC
nf-core/mag	Nextflow	Reproducible, scalable
Anvi’o	Interactive	Visualization + analysis
ATLAS	Snakemake	Large datasets, metatranscriptomics
SqueezeMeta	Integrated	Co-assembly, functional annotation

🎁 1. MetaWRAP - Complete Metagenome Pipeline

MetaWRAP wraps Days 1-7 into simple commands.

Installation

# Create conda environment
conda create -n metawrap-env python=2.7
conda activate metawrap-env

# Install MetaWRAP
conda install -c ursky metawrap-mg

# Download databases (WARNING: ~300 GB!)
metaWRAP_db_setup.sh

Complete Workflow

please use metaWrap_complete_workflow.sh from Day 8 →

MetaWRAP Modules

Available modules:

# Quality control
metawrap read_qc

# Assembly
metawrap assembly --metaspades  # or --megahit

# Binning
metawrap binning --metabat2 --maxbin2 --concoct

# Refinement
metawrap bin_refinement -c 50 -x 10

# Reassembly
metawrap reassemble_bins

# Quantification
metawrap quant_bins

# Classification
metawrap classify_bins

# Annotation
metawrap annotate_bins

# Bin visualization
metawrap blobology

Advantages

✅ All-in-one solution
✅ Integrates multiple tools
✅ Automatic bin refinement
✅ Good documentation

Disadvantages

❌ Large database requirements (~300 GB)
❌ Python 2.7 dependency
❌ Can be slow
❌ Less flexible than individual tools

🔁 2. nf-core/mag - Nextflow Pipeline

Nextflow = Modern workflow manager (reproducible, scalable)

Installation

# Install Nextflow
curl -s https://get.nextflow.io | bash
mv nextflow ~/bin/

# Or via conda
conda install -c bioconda nextflow

Running nf-core/mag

# Basic run
nextflow run nf-core/mag \
    --input samplesheet.csv \
    --outdir results \
    -profile docker

# Full options
nextflow run nf-core/mag \
    --input samplesheet.csv \
    --outdir results \
    --skip_spades \
    --megahit_fix_cpu_1 \
    --min_contig_size 1500 \
    --min_length_unbinned_contigs 1000000 \
    --max_unbinned_contigs 100 \
    --busco_db bacteria_odb10 \
    -profile docker,test

Sample Sheet Format

sample,group,short_reads_1,short_reads_2,long_reads
sample1,group1,s1_R1.fastq.gz,s1_R2.fastq.gz,
sample2,group1,s2_R1.fastq.gz,s2_R2.fastq.gz,s2_long.fastq.gz
sample3,group2,s3_R1.fastq.gz,s3_R2.fastq.gz,

nf-core/mag Features

✅ Modern workflow engine
✅ Automatic resume (restart from failures)
✅ Multiple assemblers (MEGAHIT, SPAdes)
✅ Multiple binners (MetaBAT2, MaxBin2, CONCOCT)
✅ BUSCO quality checks
✅ GTDB-Tk classification
✅ MultiQC reports

Output Structure

results/
├── QC/
│   ├── fastp/
│   └── fastqc/
├── Assembly/
│   ├── MEGAHIT/
│   └── SPAdes/
├── Binning/
│   ├── MetaBAT2/
│   ├── MaxBin2/
│   └── bins_summary.tsv
├── GenomeBinning/
│   ├── bins/
│   └── checkm2_results/
├── Taxonomy/
│   └── GTDB-Tk/
└── MultiQC/
    └── multiqc_report.html

🖥️ 3. Anvi’o - Interactive Metagenomics

Anvi’o = Analysis + Visualization platform

Installation

conda create -y -n anvio-8 python=3.10
conda activate anvio-8
conda install -y -c conda-forge -c bioconda anvio=8
anvi-self-test

Quick Workflow

# 1. Create contigs database
anvi-gen-contigs-database -f contigs.fa -o contigs.db

# 2. Annotate
anvi-run-hmms -c contigs.db -T 8
anvi-run-ncbi-cogs -c contigs.db -T 8
anvi-run-kegg-kofams -c contigs.db -T 8

# 3. Map reads
bowtie2-build contigs.fa contigs
bowtie2 -x contigs -1 R1.fastq -2 R2.fastq -S sample.sam -p 8
samtools view -bS sample.sam | samtools sort -o sample.bam
anvi-init-bam sample.bam -o sample_sorted.bam

# 4. Profile
anvi-profile -i sample_sorted.bam -c contigs.db -o PROFILE

# 5. Binning (automatic)
anvi-cluster-contigs -p PROFILE/PROFILE.db -c contigs.db \
    -C CONCOCT --driver CONCOCT -T 8

# 6. Interactive visualization
anvi-interactive -p PROFILE/PROFILE.db -c contigs.db

Anvi’o for Pangenomics

# Create genomes storage
anvi-gen-genomes-storage -e external-genomes.txt -o GENOMES.db

# Run pangenome
anvi-pan-genome -g GENOMES.db \
    --project-name my_pangenome \
    --num-threads 8

# Visualize
anvi-display-pan -g GENOMES.db -p my_pangenome/my_pangenome-PAN.db

Anvi’o Strengths

✅ Beautiful interactive interface
✅ Real-time binning (click to bin!)
✅ Metabolic reconstruction
✅ Pangenomics integration
✅ Publication-quality figures

🐍 4. ATLAS - Scalable Metagenomics

ATLAS uses Snakemake for large-scale analyses.

Installation

conda create -n atlas -c bioconda -c conda-forge metagenome-atlas
conda activate atlas
atlas init --db-dir databases
atlas download --db-dir databases

Running ATLAS

# Initialize project
atlas init --db-dir ~/databases my_project

# Edit config.yaml and samples.tsv

# Run complete workflow
atlas run all \
    --working-dir my_project \
    --config-file my_project/config.yaml \
    --cores 32 \
    --jobs 5

# Or specific steps
atlas run qc            # Quality control
atlas run assembly      # Assembly
atlas run binning       # Binning
atlas run genomes       # Genome refinement

ATLAS Features

✅ Automatic database downloads
✅ Co-assembly support
✅ Metatranscriptomics (RNA-seq integration)
✅ DRAM annotation
✅ Easy configuration

🧩 5. SqueezeMeta - Integrated Pipeline

SqueezeMeta = Complete pipeline with co-assembly.

Installation

git clone https://github.com/jtamames/SqueezeMeta.git
cd SqueezeMeta
./install_linux.sh

Running SqueezeMeta

# Single sample
SqueezeMeta.pl \
    -m sequential \
    -p project_name \
    -s samples.txt \
    -f raw_reads \
    -t 24

# Co-assembly mode
SqueezeMeta.pl \
    -m coassembly \
    -p coassembly_project \
    -s samples.txt \
    -f raw_reads \
    -t 24 \
    --assembly megahit

SqueezeMeta Outputs

Taxonomic classification (multiple databases)
Functional annotation (KEGG, COG, Pfam)
Metabolic pathways
Interactive HTML reports
Abundance tables

🌐 Part II: Web-Based Platforms

Why Use Web Platforms?

✅ No installation (works in browser)
✅ Pre-configured (databases ready)
✅ User-friendly (GUI interface)
✅ Collaborative (share projects easily)
✅ HPC access (use their servers)

❌ Limited customization
❌ Data upload time (can be slow)
❌ Storage limits (usually)
❌ Privacy concerns (your data on their servers)

🌌 1. Galaxy - Web-Based Analysis

Galaxy = Most popular web platform for bioinformatics

Getting Started

Public servers:

Main: https://usegalaxy.org
Europe: https://usegalaxy.eu
Australia: https://usegalaxy.org.au

Create account → Free!

Galaxy Metagenomics Workflow

Step 1: Upload Data

Click “Upload Data”
Choose files or paste URLs
Select “fastqsanger.gz” format
Click “Start”

Step 2: Quality Control

Tool: FastQC

Tools → FASTQ Quality Control → FastQC
- Input: Your FASTQ files
- Execute

Tool: Trim Galore!

Tools → FASTQ Quality Control → Trim Galore!
- Single-end or Paired-end: Paired-end
- Reads 1: R1.fastq
- Reads 2: R2.fastq
- Execute

Step 3: Assembly

Tool: MEGAHIT

Tools → Assembly → MEGAHIT
- Single/Paired-end: Paired
- Forward: trimmed_R1.fastq
- Reverse: trimmed_R2.fastq
- Minimum contig length: 1500
- Execute

Step 4: Binning

Tool: MetaBAT2

Tools → Metagenomics → MetaBAT2
- Assembly: contigs.fasta
- BAM files: mapped_reads.bam
- Minimum contig length: 2500
- Execute

Step 5: Quality Check

Tool: CheckM

Tools → Metagenomics → CheckM lineage_wf
- Bins: MetaBAT2 bins
- Execute

Step 6: Annotation

Tool: Prokka

Tools → Annotation → Prokka
- For each bin
- Kingdom: Bacteria
- Execute

Galaxy Workflows

Create reusable workflows:

Go to Workflow menu
Click “Extract Workflow”
Select history
Name workflow
Save

Run on new data:

Upload new files
Go to Workflows
Select your workflow
Map inputs
Run!

Galaxy Advantages

✅ No installation
✅ 1000+ tools available
✅ Workflows can be shared
✅ Educational tutorials
✅ Free compute resources

🧬 2. KBase - Systems Biology Platform

KBase = DOE Joint Genome Institute platform

Website: https://kbase.us

Getting Started

Create account (free)
Create narrative (like Jupyter notebook)
Upload data
Add analysis apps

KBase Metagenomics Apps

Available Apps:

Trim Reads (Trimmomatic)
Assemble Reads (MEGAHIT, SPAdes)
Annotate Assembly (RAST, PROKKA)
Bin Contigs (MaxBin2, MetaBAT2)
Classify Taxonomy (GTDB-Tk)
Compare Genomes (Pangenome)
Build Metabolic Model (ModelSEED)

Example KBase Workflow

1. Import FASTQ Data
   App: "Import FASTQ File as Reads"

2. Quality Control
   App: "Assess Read Quality with FastQC"

3. Trim Reads
   App: "Trim Reads with Trimmomatic"

4. Assemble
   App: "Assemble Reads with MEGAHIT"

5. Annotate
   App: "Annotate Assembly with PROKKA"

6. Bin
   App: "Bin Contigs with MetaBAT2"

7. Check Quality
   App: "Assess Genome Quality with CheckM"

8. Classify
   App: "Classify Genomes with GTDB-Tk"

9. Build Model
   App: "Build Metabolic Model with ModelSEED"

KBase Strengths

✅ Integrated platform (data + analysis + models)
✅ Metabolic modeling (unique feature!)
✅ Reproducible narratives
✅ Collaborative
✅ DOE compute resources

KBase Limitations

❌ Steeper learning curve
❌ Fewer tools than Galaxy
❌ US-focused (slower internationally)

🗄️ 3. IMG/M - Integrated Microbial Genomes & Microbiomes

IMG/M = JGI’s comprehensive database + analysis platform

Website: https://img.jgi.doe.gov

What is IMG/M?

Not just analysis - It’s a massive database!

300,000+ bacterial genomes
40,000+ metagenomes
Pre-computed annotations
Comparative tools

Using IMG/M

Upload Your Data

Create account (free)
Submit metagenome
Wait for processing (days-weeks)
Receive annotated results

Or Browse Public Data

Search by:

Ecosystem
Taxonomy
Genes
Functions
Pathways

IMG/M Features

Automatic annotation includes:

COG/KOG categories
Pfam domains
KEGG pathways
Enzyme functions
Signal peptides
Transmembrane helices

Comparative tools:

Gene phylogenetic profiler
Abundance profiles
Function comparisons
BLAST searches

IMG/M Best For

✅ Comparing your data to public datasets
✅ Exploring gene families
✅ Metabolic pathway analysis
✅ Publication-quality annotations

❌ Not for quick turnaround
❌ Can’t customize pipeline

🦠 4. BV-BRC (formerly PATRIC) - Bacterial & Viral Database

BV-BRC = Bacterial and Viral Bioinformatics Resource Center

Website: https://www.bv-brc.org

BV-BRC Services

Genome Assembly

Services → Assembly
- Upload reads
- Choose assembler
- Submit job

Genome Annotation

Services → Annotation
- Upload genome
- Select taxonomy
- Choose annotation options
- Submit

Comprehensive Genome Analysis

Services → Comprehensive Genome Analysis
- Upload reads OR genome
- One-stop analysis:
  - Assembly
  - Annotation
  - Quality assessment
  - Specialized analyses

BV-BRC Analysis Tools

Proteome Comparison - Compare protein families
Phylogenetic Tree - Build trees from genomes
Protein Family Sorter - Find conserved proteins
Comparative Pathway Analysis - KEGG pathway comparison
Variation Analysis - SNP/indel identification
RNA-Seq Analysis - Transcriptomics
Tn-Seq Analysis - Transposon sequencing

BV-BRC Strengths

✅ Bacterial/viral focus
✅ AMR gene database (critical for pathogens)
✅ Drug target identification
✅ Extensive genome collection
✅ NIH-funded (reliable)

Best For

Clinical/pathogen studies
AMR research
Vaccine development
Outbreak analysis

🎯 Part III: Choosing the Right Tool

Decision Matrix

Your Situation	Best Choice
New to bioinformatics	Galaxy
Need complete automation	MetaWRAP or nf-core/mag
Want interactive binning	Anvi’o
Large dataset (100+ samples)	ATLAS or nf-core/mag
Need metabolic modeling	KBase
Comparing to public data	IMG/M
Clinical/pathogen focus	BV-BRC
Publication-quality analysis	Anvi’o or nf-core/mag
Teaching/learning	Galaxy
Collaborative project	KBase or Galaxy

💡 Best Practices

For Workflow Wrappers

Start with default parameters
- Don’t over-customize initially
- Understand baseline before tweaking

Use version control

git init my_project
git add config.yaml
git commit -m "Initial config"

Document your choices
- Why this assembler?
- Why these QC thresholds?
- Keep a lab notebook 6
Test on subset
- Don’t run 100 samples immediately
- Test 1-3 samples first

Monitor resource usage

htop  # Watch CPU/memory
df -h # Check disk space

For Web Platforms

Start with tutorials
- Galaxy has excellent training materials
- KBase has guided narratives
Save workflows
- Create reusable workflows
- Share with collaborators
Check data limits
- Know upload size limits
- Understand storage quotas
Export results
- Download key files locally
- Don’t rely solely on web storage
Cite properly
- Each platform has citation requirements
- Acknowledge compute resources

🎓 Learning Resources

MetaWRAP

nf-core/mag

Anvi’o

Galaxy

KBase

✅ Success Checklist

Chose appropriate workflow wrapper OR web platform
Successfully ran test dataset
Understood output structure
Validated results against manual analysis (Days 1-7)
Documented workflow/parameters
Saved/exported results
Ready to scale up to full dataset

📚 Key Takeaways

Workflow wrappers save time - Automate Days 1-7
Web platforms lower barriers - No Linux/HPC needed
Choose based on needs - No one-size-fits-all
Start simple - Use defaults before customizing
Reproducibility matters - Document everything

Repo for today’s code and other details

🔗 Day 8 →

Last updated: February 2026

day8 mg