From Metagenome to Viral OTUs: Identifying, Filtering, and Clustering Viral Genomes
Part 1 — Why Viromics?
🧬 𝐷𝑎𝑦 78 𝑜𝑓 𝐷𝑎𝑖𝑙𝑦 𝐵𝑖𝑜𝑖𝑛𝑓𝑜𝑟𝑚𝑎𝑡𝑖𝑐𝑠 𝑓𝑟𝑜𝑚 𝐽𝑜𝑗𝑦’𝑠 desk
The invisible majority
When microbiologists talk about a microbiome — whether from the ocean, soil, or a human gut — they usually mean the bacteria and archaea. But there is an entire parallel community that rarely makes it onto the plot: the viruses.
In any spoonful of seawater, viral particles outnumber bacterial cells by roughly ten to one. In soil and sediment, the ratio is similar. These are not passive bystanders. Viruses infect and lyse microbial cells, releasing nutrients and carbon back into the environment in a process sometimes called the viral shunt. They transfer genes between hosts through transduction, accelerating microbial evolution. And a surprising number of them carry auxiliary metabolic genes (AMGs) — functional copies of host metabolic genes that the virus expresses during infection to boost its own replication, effectively reprogramming the host cell’s biochemistry mid-infection.
In short, you cannot fully understand a microbial ecosystem without understanding its viruses.
Why viruses are hard to detect
Bacteria and archaea have a universal marker gene — the 16S rRNA gene — that makes community profiling relatively straightforward. Amplify it, sequence it, count it.
Viruses have no such universal marker. They are phenomenally diverse, evolve rapidly, and share very few genes across groups. This means the only reliable way to detect them is to look at the raw sequence data itself: assemble the metagenome into contigs, then ask which contigs look viral.
This is the core challenge of viromics, and it is what this post walks through end to end.
The six-post series at a glance
| Post | Topic |
|---|---|
| 1 (this post) | Metagenome context, viral identification (VirSorter2 + VITAP), vOTU clustering, CheckV quality |
| 2 | Taxonomic classification (geNomad + BLAST) and abundance estimation (CoverM) |
| 3 | Auxiliary metabolic genes with DRAM-v |
| 4 | Host prediction with iPHoP |
| 5 | Viral lifestyle prediction (PhaBOX2) and network analysis (vConTACT2) |
| 6 | Active viral communities from metatranscriptomes and expressed AMGs |
Part 2 — Starting Point: The Assembled Metagenome
A metagenome is the collective DNA extracted from an environmental sample and sequenced without culturing. You homogenize the sample, extract total DNA, fragment it, and sequence millions of short paired-end reads. Those reads represent every organism present: bacteria, archaea, viruses, eukaryotes, and free-floating environmental DNA.
The first computational step is assembly — reconstructing longer contiguous sequences (contigs) from overlapping short reads using tools like MEGAHIT or metaSPAdes.
This series starts from already-assembled contigs. If you still need to assemble, a standard MEGAHIT run looks like this:
# Quality trim first (always worth doing) fastp \ -i sample_R1.fastq.gz -I sample_R2.fastq.gz \ -o sample_R1_trimmed.fastq.gz -O sample_R2_trimmed.fastq.gz \ --thread 16 --detect_adapter_for_pe # Assemble megahit \ -1 sample_R1_trimmed.fastq.gz \ -2 sample_R2_trimmed.fastq.gz \ -o megahit_output \ --min-contig-len 500 \ -t 16
After assembly you will have one FASTA file containing hundreds of thousands of contigs — bacterial, archaeal, viral, and eukaryotic sequences all mixed together. The goal of everything that follows is to pull out the viral fraction.
Why viruses are hard to find inside a metagenome
Viral contigs tend to be shorter than bacterial ones (most viral genomes are 5–200 kb), their genes are highly divergent from cellular genes, and in bulk DNA extractions they are often under-represented because cellular DNA swamps them numerically. Two complementary approaches are needed to find them reliably:
- Machine learning on genome features — VirSorter2 learns patterns of gene density, coding strand bias, and viral hallmark gene content that distinguish viral from cellular contigs.
- Alignment-based taxonomy — VITAP translates contigs into predicted proteins, aligns them against a viral reference database, and assigns lineage. Contigs placed in Riboviria or Duplodnaviria are flagged as viral.
Running both and combining their output is standard practice. Each tool has complementary strengths and blind spots.
Part 3 — VirSorter2: Machine Learning Viral Prediction
VirSorter2 uses a combination of viral hallmark gene HMMs and a random forest classifier to score each contig for viral likelihood. It supports multiple viral groups: dsDNA phage, ssDNA viruses, RNA viruses, NCLDVs (giant viruses), and Lavidaviridae (virophages).
Environment setup
module load anaconda3/2023.09-0
conda activate /path/to/virsorter2_env
Running VirSorter2
virsorter run \
-i assembly.fa \
-w virsorter2_output \
--db-dir /path/to/virsorter2_db \
--min-length 1500 \
--include-groups "dsDNAphage,ssDNA,RNA,NCLDV,lavidaviridae" \
-j 16 \
all
Key parameters explained:
| Parameter | What it does |
|---|---|
--min-length 1500 | ignore contigs shorter than 1500 bp — too short to reliably classify |
--include-groups | viral categories to screen for; include all unless you have a reason to exclude |
-j 16 | threads |
all | run the full pipeline including score calculation |
Key output file
virsorter2_output/final-viral-combined.fa
This FASTA contains all contigs that passed VirSorter2’s viral score threshold. This is the file you carry forward.
Note for DRAM-v users (Day 3): If you plan to run DRAM-v for AMG identification, you need to rerun VirSorter2 with
--prep-for-dramvadded to the command. This generates additional annotation files that DRAM-v requires. We cover this in the AMG post.
Part 4 — VITAP: Alignment-Based Viral Detection
VITAP predicts open reading frames on each contig and aligns the translated proteins against a curated viral reference database using DIAMOND. It assigns a taxonomic lineage to every contig, including reference genomes from the ICTV database alongside your sample contigs.
Environment setup
conda activate /path/to/VITAP_env
module load diamond/2.0.14
If VITAP has an older bundled DIAMOND version that conflicts with the module, resolve it before running:
ENV=/path/to/VITAP_env
mv "$ENV/bin/diamond" "$ENV/bin/diamond.old_0.9.19" 2>/dev/null || true
hash -r
module load diamond/2.0.14
which diamond # should point to the module version
diamond version
Running VITAP
VITAP assignment \
-i assembly.fa \
-d /path/to/VITAP_db \
-o vitap_output
Filtering viral contigs from VITAP output
VITAP assigns taxonomy to all contigs — it does not output a viral-only FASTA directly. You need to filter it yourself.
In this dataset, viral sequences belong to two realms: Riboviria (RNA viruses) and Duplodnaviria (dsDNA viruses including most phages).
Step 1 — inspect the output
head -n 5 vitap_output/best_determined_lineages.tsv
The file has four columns: Genome_ID, lineage, lineage_score, Confidence_level. IDs starting with k141_ are your assembled contigs; alphanumeric IDs like AJ428555.1 are ICTV reference genomes bundled in the database.
Step 2 — extract viral contig IDs
awk -F '\t' 'NR>1 && ($2 ~ /Riboviria|Duplodnaviria/) {print $1}' \
vitap_output/best_determined_lineages.tsv > viral_ids.txt
wc -l viral_ids.txt
Step 3 — remove reference genomes
grep "^k" viral_ids.txt > viral_ids_clean.txt
wc -l viral_ids_clean.txt
Step 4 — extract sequences
module load seqkit
seqkit grep -f viral_ids_clean.txt \
assembly.fa \
> viral_contigs_vitap.fa
grep -c "^>" viral_contigs_vitap.fa # should match wc -l viral_ids_clean.txt
Step 5 — optional length filter
For higher-confidence downstream analysis, filter to contigs ≥ 5 kb:
seqkit seq -m 5000 viral_contigs_vitap.fa > viral_contigs_vitap_min5kb.fa
Part 5 — Combining Both Tools
After running VirSorter2 and VITAP you have two viral FASTA files. Combining them captures the union of both detection strategies.
cat virsorter2_output/final-viral-combined.fa \
viral_contigs_vitap_min5kb.fa \
> viral_contigs_combined.fa
Because the two tools may have detected some of the same contigs, the combined file will contain duplicate sequences with different or identical headers. The clustering step below handles this, but if you want a clean deduplicated FASTA before clustering, the companion Python script combine_viral_contigs.py (in the repository) does this properly — it tracks which tool detected each contig, removes exact duplicates, and writes a summary table.
python scripts/combine_viral_contigs.py \
--virsorter virsorter2_output/final-viral-combined.fa \
--vitap viral_contigs_vitap_min5kb.fa \
--out viral_contigs_combined.fa \
--summary combine_summary.tsv
Part 6 — Clustering into vOTUs with CD-HIT
Individual viral contigs from different samples, or from different tools, may represent the same viral genome at slightly different coverage or with minor assembly errors. Clustering groups these into viral operational taxonomic units (vOTUs) — one representative per distinct viral genome.
The community standard for vOTU clustering is 95% nucleotide identity over 85% of the shorter contig’s length, following the Minimum Information about an Uncultivated Virus (MIUViG) guidelines.
module load cd-hit/4.8.1
cd-hit-est \
-i viral_contigs_combined.fa \
-o vOTUs.fa \
-c 0.95 \
-aS 0.85 \
-T 20 \
-M 0
Parameter reference:
| Parameter | Value | Meaning |
|---|---|---|
-c | 0.95 | 95% nucleotide identity threshold |
-aS | 0.85 | 85% alignment coverage of the shorter sequence |
-T | 20 | threads |
-M | 0 | unlimited memory usage |
The output vOTUs.fa contains one representative contig per cluster. This file is the input for all downstream steps in this series.
Part 7 — Quality Assessment with CheckV
Not all contigs in vOTUs.fa are complete viral genomes. Some are fragments, and some may carry host contamination (cellular DNA dragged along with a prophage). CheckV estimates genome completeness, detects contamination, and assigns a quality tier to each vOTU.
conda activate checkv_env
checkv end_to_end \
vOTUs.fa \
checkv_output \
-d /path/to/checkv_db \
-t 16
Key output files
| File | What it contains |
|---|---|
checkv_output/quality_summary.tsv | completeness, contamination, quality tier per vOTU |
checkv_output/viruses.fna | predicted free-living viral genomes |
checkv_output/proviruses.fna | viral regions integrated in host scaffolds |
checkv_output/contamination.tsv | detailed contamination estimates |
Quality tiers
| Tier | Completeness |
|---|---|
| Complete | circular or DTR-confirmed complete genome |
| High quality | ≥ 90% complete |
| Medium quality | 50–90% complete |
| Low quality | < 50% complete |
| Not determined | insufficient information |
For most downstream analyses — taxonomy, abundance, AMG identification — medium quality and above is a reasonable minimum. For network analysis and host prediction, high quality or complete genomes give the most reliable results.
Proviruses vs free-living viruses
CheckV distinguishes two types of viral sequences:
- Free-living viruses (
viruses.fna) — contigs that appear to be standalone viral genomes not integrated in a host - Proviruses (
proviruses.fna) — viral sequences embedded within a longer host scaffold; CheckV trims the host flanking regions
For most ecological analyses you will work with viruses.fna. Proviruses are important for understanding lysogeny (covered in Post 5) and should not be discarded.
Part 8 — What You Have at the End of This Step
After running through this pipeline you have:
-
vOTUs.fa— representative viral genomes, one per cluster, ready for taxonomy and abundance estimation -
checkv_output/quality_summary.tsv— quality metadata for every vOTU -
checkv_output/viruses.fna— free-living subset -
checkv_output/proviruses.fna— integrated prophage subset -
combine_summary.tsv— which tool detected each contig (useful for understanding detection overlap)
In the next post we assign taxonomy to these vOTUs using geNomad and BLAST, and estimate their abundance across samples using CoverM.
Companion repository: metagenome-to-viromics Toy assembly, batch scripts, the combine Python script, and installation guides for every tool used here.
