Size Fractionated Microbiome Analysis – Day 2: Kaiju Classification and Extraction of Bacterial & Archaeal Reads
Size Fractionated Microbiome Analysis – Day 2
Kaiju taxonomic classification, summary tables, and read extraction
In Day 1, we focused on raw metagenome preprocessing (FastQC and trimming).
Today, in Day 2, we move to taxonomic classification using Kaiju.
Kaiju is a fast and sensitive classifier that operates at the protein level, translating sequencing reads and matching them against a reference protein database. This approach improves taxonomic resolution for divergent organisms commonly found in complex estuarine metagenomes.
Beyond classification, today's workflow has two key goals:
- Generate structured summary tables that will be used directly for heatmap-based visualization and cross-sample comparisons.
- Extract and save bacterial and archaeal reads for potential reuse in assembly- and MAG-based workflows.
🔬 Why Kaiju?
Kaiju is particularly useful for shotgun metagenomic datasets because it:
- Uses protein-level alignment (higher sensitivity than nucleotide-only methods)
- Scales well to large datasets
- Produces taxonomy-aware summary tables at multiple ranks
In this series, Kaiju is one of three tools (alongside MetaPhlAn and mOTUs) that I am benchmarking before selecting a single method for consistent application across all samples.
🛠️ Installation
Conda installation (recommended)
conda install -c bioconda kaiju
conda install -c bioconda seqtk
📦 Step 1: Database setup
Kaiju requires a reference database and NCBI taxonomy files.
mkdir -p ~/kaiju_db
cd ~/kaiju_db
kaiju-makedb -s refseq
Kaiju offers several database options, and we can choose one according to the aim of the analysis (see the overview table below).
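This step downloads and builds:
- kaiju_db.fmi (the database index; depending on the Kaiju version, it may be written to a subdirectory under a name such as refseq/kaiju_db_refseq.fmi)
- nodes.dmp
- names.dmp
⚠️ Database construction can take considerable time and disk space.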
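Before moving on, a quick check that all three files exist can save a failed run later. The helper below is a minimal sketch: `check_kaiju_db` is a hypothetical name (not a Kaiju tool), and it assumes that nodes.dmp and names.dmp sit in the database directory while the .fmi index may sit in a subdirectory.

```shell
# Sketch: confirm the files Kaiju needs are present before classification.
# check_kaiju_db is a hypothetical helper, not part of Kaiju.
check_kaiju_db() {
    local db_dir="$1"
    local missing=0
    local f

    # Taxonomy files used by kaiju, kaiju-addTaxonNames and kaiju2table
    for f in nodes.dmp names.dmp; do
        if [ ! -s "$db_dir/$f" ]; then
            echo "missing or empty: $db_dir/$f" >&2
            missing=1
        fi
    done

    # The index file name varies, so accept any *.fmi file under the directory
    if [ -z "$(find "$db_dir" -type f -name '*.fmi' 2>/dev/null | head -n 1)" ]; then
        echo "no .fmi index found under $db_dir" >&2
        missing=1
    fi

    return "$missing"
}
```

Calling `check_kaiju_db ~/kaiju_db && echo "database looks complete"` prints the confirmation only when nothing is missing.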
Kaiju Databases: Overview and Recommended Use
The table below summarizes the commonly used Kaiju reference databases, their characteristics, and recommended use cases. Database choice directly affects taxonomic resolution, runtime, and downstream visualization, particularly when generating heatmaps for size-fractionated microbiomes.
| Database | Build Command | Taxonomic Coverage | Sensitivity | Runtime / Size | Recommended Use |
|---|---|---|---|---|---|
| RefSeq | kaiju-makedb -s refseq | Complete, annotated RefSeq genomes of Bacteria, Archaea, and viruses | High | Moderate | Default choice for ecological profiling, FL vs PA comparisons, and heatmap generation |
| NR | kaiju-makedb -s nr | Broad protein database including uncultured and environmental sequences | Very high | Very large / slow | Discovery-driven studies; detection of rare or highly divergent taxa |
| ProGenomes | kaiju-makedb -s progenomes | Genome-derived, non-redundant microbial proteins | Moderate–High | Smaller / fast | Comparative genomics and benchmarking against genome-resolved tools |
| Fungi | kaiju-makedb -s fungi | Fungal proteins only | High (fungi-specific) | Moderate | Targeted fungal read extraction from metagenomes |
| Custom | Custom FASTA build | User-defined | Variable | Variable | Targeted analyses (e.g., habitat-specific or clade-focused studies) |
🧠 Database choice for this study
For this size-fractionated estuarine microbiome analysis, I use the RefSeq database throughout the series because it provides the best balance between:
- Taxonomic accuracy
- Computational efficiency
- Interpretability of abundance tables used for heatmaps and cross-sample comparisons
Using a consistent, curated database also ensures fair benchmarking when comparing Kaiju outputs with MetaPhlAn and mOTUs in later steps.
🧬 Step 2: Run Kaiju on paired-end reads
This command runs Kaiju on paired-end shotgun metagenomic reads, translating sequences into protein space and matching them against the reference database for taxonomic classification. The taxonomy file (nodes.dmp) and database index (kaiju_db.fmi) provide the taxonomic framework; -z 8 runs the search on eight threads, and -v switches on verbose output, which adds columns such as the match score and the matching database sequences to each classified read.
kaiju -t nodes.dmp -f kaiju_db.fmi \
-i reads_1.fastq.gz -j reads_2.fastq.gz \
-o kaiju.out -z 8 -v
Key parameters
- -t: taxonomy tree (nodes.dmp)
- -f: Kaiju database index (.fmi file)
- -i / -j: paired-end reads (first and second mate files)
- -z: number of threads
- -v: verbose output
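With several size fractions to classify, it can help to generate the same command for every sample and review it before running anything. The sketch below only prints commands. `kaiju_cmd` is a hypothetical helper, and the `<sample>_1.fastq.gz` / `<sample>_2.fastq.gz` naming and the `<sample>.kaiju.out` output name are assumptions to adapt to your own files.

```shell
# Sketch: print the Kaiju command for one sample (dry run, nothing is executed).
# Assumed layout: <sample>_1.fastq.gz and <sample>_2.fastq.gz in the current directory.
kaiju_cmd() {
    local sample="$1"
    local db_dir="$2"
    local threads="${3:-8}"   # default to 8 threads, as in the command above

    printf 'kaiju -t %s/nodes.dmp -f %s/kaiju_db.fmi -i %s_1.fastq.gz -j %s_2.fastq.gz -o %s.kaiju.out -z %s -v\n' \
        "$db_dir" "$db_dir" "$sample" "$sample" "$sample" "$threads"
}

# Example (sample names are placeholders):
#   for s in FL_01 PA_01; do kaiju_cmd "$s" "$HOME/kaiju_db"; done
# Once the printed commands look right, pipe the loop into `sh` to run them.
```

Printing first keeps the loop transparent; note that piping to `sh` only works as written for paths without spaces.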
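🏷️ Step 3: Add taxonomy labels
This step appends a human-readable taxon name to each classified read. With -r superkingdom, the appended column contains the superkingdom (for example Bacteria or Archaea), which enables targeted filtering.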
kaiju-addTaxonNames -t nodes.dmp -n names.dmp \
-i kaiju.out \
-o kaiju_labeled.out \
-r superkingdom
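It is worth looking at the labeled file before filtering on it, because the position of the label column depends on whether Kaiju was run with -v. The two helpers below are a minimal sketch with hypothetical names (`label_column`, `label_counts`); they assume a tab-separated file whose last column is the label, possibly followed by a trailing "; " separator.

```shell
# Sketch: inspect a labeled Kaiju file before filtering on it.

# Print the number of tab-separated columns in the first line,
# i.e. the position of the appended label column.
label_column() {
    head -n 1 "$1" | awk -F'\t' '{ print NF }'
}

# Count reads per label found in the last column.
label_counts() {
    awk -F'\t' '{
        label = $NF
        sub(/[; ]+$/, "", label)   # strip a trailing "; " if present
        n[label]++
    }
    END { for (l in n) printf "%s\t%d\n", l, n[l] }' "$1" | sort
}

# Example:
#   label_column kaiju_labeled.out
#   label_counts kaiju_labeled.out
```

If the counts show labels other than the expected superkingdoms, adjust the filtering pattern before extracting read IDs.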
Step 4: Generate Kaiju summary tables (core output)
Kaiju summary tables form the quantitative backbone for downstream visualization, including heatmaps comparing free-living and particle-attached communities.
Superkingdom-level summary (quality control)
kaiju2table -t nodes.dmp -n names.dmp -r superkingdom \
-o kaiju_superkingdom_summary.tsv kaiju.out
This table is used to verify overall classification structure (Bacteria vs Archaea).
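As a complementary check, the share of classified reads can be computed directly from the raw Kaiju output, whose first column is C (classified) or U (unclassified). This is a small sketch; `classified_fraction` is a hypothetical helper name.

```shell
# Sketch: summarize classified vs unclassified reads from raw Kaiju output.
# Column 1 of kaiju.out is "C" or "U" for every read.
classified_fraction() {
    awk -F'\t' '
        $1 == "C" { c++ }
        $1 == "U" { u++ }
        END {
            total = c + u
            if (total == 0) { print "no reads found"; exit 1 }
            printf "classified=%d unclassified=%d percent_classified=%.2f\n", c, u, 100 * c / total
        }' "$1"
}

# Example:
#   classified_fraction kaiju.out
```

A very low classified percentage in one size fraction is a hint to revisit trimming or database choice before interpreting the heatmaps.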
Phylum-level summary (used for heatmaps)
kaiju2table -t nodes.dmp -n names.dmp -r phylum -m 1.0 \
-o kaiju_phylum_summary.tsv kaiju.out
Why this step is critical
- Produces a clean, tabular abundance matrix
- Reduces noise using a minimum abundance threshold (-m 1.0 reports only taxa that account for at least 1% of reads)
- Directly compatible with R-based heatmap workflows
These tables will be reused in later posts for:
- FL vs PA comparisons
- Seasonal and bay-specific patterns
- Cross-method benchmarking
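For heatmaps, the per-sample summaries usually need to be combined into one taxon-by-sample matrix. The sketch below does this pivot with awk. `kaiju_wide` is a hypothetical helper, and it assumes the summary files are tab-separated with a header line and the column order file, percent, reads, taxon_id, taxon_name; please check this against your own kaiju2table output.

```shell
# Sketch: pivot long-format summary tables into a taxon x sample matrix of percentages.
# Assumed columns: file, percent, reads, taxon_id, taxon_name (with a header line).
kaiju_wide() {
    awk -F'\t' '
        FNR == 1 && $1 == "file" { next }    # skip the header of each input file
        {
            sample = $1; pct = $2; taxon = $5
            if (!(sample in seen_s)) { seen_s[sample] = 1; samples[++ns] = sample }
            if (!(taxon in seen_t))  { seen_t[taxon] = 1;  taxa[++nt] = taxon }
            val[taxon, sample] = pct
        }
        END {
            printf "taxon"
            for (j = 1; j <= ns; j++) printf "\t%s", samples[j]
            printf "\n"
            for (i = 1; i <= nt; i++) {
                printf "%s", taxa[i]
                for (j = 1; j <= ns; j++) {
                    key = taxa[i] SUBSEP samples[j]
                    # taxa absent from a sample are filled with 0
                    printf "\t%s", ((key in val) ? val[key] : 0)
                }
                printf "\n"
            }
        }' "$@"
}

# Example (file names are placeholders):
#   kaiju_wide FL_phylum_summary.tsv PA_phylum_summary.tsv > phylum_matrix.tsv
```

Rows and columns keep the order in which taxa and samples first appear, and catch-all rows such as unclassified reads are carried through unchanged.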
🧫 Step 5: Extract bacterial and archaeal read IDs
To enable downstream analyses such as genome assembly and MAG binning, we next extract reads classified as Bacteria and Archaea. Isolating these reads reduces complexity and ensures that subsequent genome-resolved workflows focus on the core prokaryotic community.
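awk -F"\t" '$NF ~ /^(Bacteria|Archaea)/ { print $2 }' \
kaiju_labeled.out > bac_arch_reads.txt
This first step identifies the read IDs that Kaiju classified as Bacteria or Archaea at the superkingdom level and writes them to a text file, which is then used to subset the original FASTQ files.
- The last column ($NF) holds the superkingdom label appended by kaiju-addTaxonNames. Because Kaiju was run with -v, the output has extra columns and the label typically lands in column 8 rather than column 7, so matching on $NF avoids hard-coding the position.
- The pattern match (~) tolerates a trailing separator such as "Bacteria; " that kaiju-addTaxonNames may write after the label.
- Column 2 contains the read ID
- The output file (bac_arch_reads.txt) is a simple list of read names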
Next, extract the FASTQ reads. seqtk is a lightweight and fast toolkit for manipulating FASTQ/FASTA files. Here, it extracts only those reads whose IDs appear in bac_arch_reads.txt; because the same ID list is applied to both files, read pairing is preserved.
Install seqtk (skip this if you already installed it during setup)
conda install -c bioconda seqtk
Extract reads
seqtk subseq reads_1.fastq.gz bac_arch_reads.txt > filtered_reads_1.fastq
seqtk subseq reads_2.fastq.gz bac_arch_reads.txt > filtered_reads_2.fastq
- subseq extracts sequences matching a list of read IDs
- Both R1 and R2 files are filtered using the same ID list
- Paired-end structure is maintained
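Since assemblers expect the two files to list mates in the same order, a quick pairing check on the filtered files is cheap insurance. This is a minimal sketch with hypothetical helper names (`fastq_ids`, `check_pairing`); it assumes standard four-line FASTQ records and read names that match once any /1 or /2 suffix is removed.

```shell
# Sketch: verify that two filtered FASTQ files contain the same read IDs in the same order.
# Assumes four-line FASTQ records.

# Print the read ID of every record: first word of the header,
# without the leading "@" and without a trailing /1 or /2.
fastq_ids() {
    awk 'NR % 4 == 1 { id = $1; sub(/^@/, "", id); sub(/\/[12]$/, "", id); print id }' "$1"
}

check_pairing() {
    local r1_ids r2_ids status
    r1_ids=$(mktemp)
    r2_ids=$(mktemp)
    fastq_ids "$1" > "$r1_ids"
    fastq_ids "$2" > "$r2_ids"

    if cmp -s "$r1_ids" "$r2_ids"; then
        echo "paired: $(wc -l < "$r1_ids" | tr -d ' ') read pairs"
        status=0
    else
        echo "pairing mismatch between $1 and $2" >&2
        status=1
    fi

    rm -f "$r1_ids" "$r2_ids"
    return "$status"
}

# Example:
#   check_pairing filtered_reads_1.fastq filtered_reads_2.fastq
```

If the check reports a mismatch, compare the read-name format in the ID list with the headers of both FASTQ files before re-running the extraction.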
These filtered reads can later be used for:
- Assembly (MEGAHIT / metaSPAdes)
- Binning (MetaWRAP, SemiBin2)
- Genome-resolved ecological analysis
What's Next?
🖼️ Figure description
Today's figure represents the final, polished visualization generated in R using taxonomic abundance tables derived from Kaiju. The stacked bar plot shows relative abundances at the order level, enabling direct comparison of community composition across samples.
All data processing and visualization were performed using R (v4.x) with packages from the tidyverse ecosystem, and both the input data and complete R code are openly available in the project's GitHub repository for full reproducibility.
Data and code availability: All Kaiju summary tables and R scripts used to generate this figure are available in the GitHub project repository. https://github.com/jojyjohn28/Size_Fractionated_Microbiome_Analysis
📦 R packages used
# Core tidyverse packages
library(ggplot2) # visualization
library(readxl) # reading Excel files
library(dplyr) # data manipulation
library(tidyr) # data reshaping
Day 3: MetaPhlAn, marker gene-based profiling
We will compare MetaPhlAn-derived abundance tables against Kaiju outputs and begin systematic benchmarking across size fractions.
