Finding Fungi in Shotgun Metagenomes: A Kaiju-Based Mycobiome Workflow

🧬 𝐷𝑎𝑦 𝟪𝟨 𝑜𝑓 𝐷𝑎𝑖𝑙𝑦 𝐵𝑖𝑜𝑖𝑛𝑓𝑜𝑟𝑚𝑎𝑡𝑖𝑐𝑠 𝑓𝑟𝑜𝑚 𝐽𝑜𝑗𝑦’𝑠 𝖽𝖾𝗌𝗄

Standard metagenomic pipelines are optimized for bacteria. Kraken2, MetaPhlAn, HUMAnN — they are all built around prokaryotic reference databases, and fungi rarely appear in their output even when fungal DNA is clearly present. This is not because fungi are absent from the samples. It is because the tools are not looking for them. This post documents a workflow that does look: Kaiju with the nr_euk database, applied to five stool metagenomes from an ulcerative colitis FMT cohort. The pilot detected fungi in all five samples, found a ~55-fold bloom in one FMT recipient between baseline and week 8, and produced everything needed to scale the analysis to 285 samples on a SLURM cluster.


Why Kaiju, and why protein-level classification?

Most taxonomic classifiers work at the nucleotide level — they match k-mers from sequencing reads against a reference genome database. This works well for well-represented organisms like gut bacteria, but breaks down for fungi for two reasons. First, fungal genomes are underrepresented in standard bacterial databases. Second, fungi are evolutionarily distant from bacteria, so nucleotide-level similarity between a query read and a reference sequence is often too low to confidently classify.

Kaiju takes a different approach: it translates sequencing reads into all six amino acid reading frames and searches against a protein database. Protein sequences are more conserved than DNA sequences across evolutionary distance, which makes Kaiju substantially more sensitive for divergent eukaryotes. The nr_euk database — built from NCBI’s non-redundant protein database, filtered to fungi, protists, and other microbial eukaryotes — provides the reference.

A read that shares 40% nucleotide identity with a fungal gene might share 70% amino acid identity with the same gene’s protein product. Kaiju finds the second match. k-mer classifiers miss it.


Dataset

Study: ERP108339 — Shotgun metagenomes from a multi-donor FMT trial for active ulcerative colitis (University of New South Wales)

Pilot cohort: Five stool metagenomes

Sample Total Fungal Reads
ERR2589084 94
ERR2589087 154
ERR2589089 8,517
ERR2589095 42
ERR2589100 2,471

The range — from 42 reads to 8,517 — immediately signals that fungal load is highly variable between individuals, which is consistent with what is known about the gut mycobiome. ERR2589089 is particularly interesting: it is the week-8 sample from participant 1035, whose baseline sample (ERR2589087) had only 154 fungal reads. That is a ~55-fold increase following FMT.


Database

kaiju_db_nr_euk_2023-05-10

Three files are required:

kaiju_db_nr_euk.fmi   # Burrows-Wheeler index of protein sequences
nodes.dmp             # NCBI taxonomy tree
names.dmp             # Human-readable taxon names

Set a database directory variable at the top of every script to avoid hardcoding paths:

DBDIR=/project/bcampb7/camplab/Jojy/other/kaiju

Step 1 — Run Kaiju classification

SAMPLE=ERR2589084

kaiju \
  -t $DBDIR/nodes.dmp \
  -f $DBDIR/kaiju_db_nr_euk.fmi \
  -i ${SAMPLE}_1.fastq.gz \
  -j ${SAMPLE}_2.fastq.gz \
  -o ${SAMPLE}.out \
  -z 16 \
  -v

-z 16 sets the thread count. -v enables verbose output so you can monitor progress. The output ${SAMPLE}.out contains a line per read: classified/unclassified flag, read ID, and taxonomic ID. The IDs are NCBI integers at this stage — not human-readable yet.


Step 2 — Add taxonomic names

kaiju-addTaxonNames \
  -t $DBDIR/nodes.dmp \
  -n $DBDIR/names.dmp \
  -i ${SAMPLE}.out \
  -o ${SAMPLE}.names.out \
  -r superkingdom,phylum,class,order,family,genus,species

This converts the raw taxonomic IDs into a full lineage string. The -r argument controls which ranks are included in the output. A classified fungal read will now look like:

Eukaryota; Ascomycota; Saccharomycetes; Saccharomycetales; Pichiaceae; Pichia; Pichia kudriavzevii

This lineage string is what all downstream grep and awk steps parse.


Step 3 — Summarize phylum-level composition

kaiju2table \
  -t $DBDIR/nodes.dmp \
  -n $DBDIR/names.dmp \
  -r phylum \
  -m 0.001 \
  -o ${SAMPLE}_phylum_summary.tsv \
  ${SAMPLE}.out

The -m threshold sets the minimum percentage of reads a taxon must represent to appear in the output. Use -m 0.001, not the default -m 1.0. At -m 1.0, a phylum needs to account for at least 1% of all reads in the sample to be reported — in a stool metagenome where fungi may be 0.01% of the total reads, that threshold silently hides every fungal phylum. The permissive threshold is essential for any low-abundance group.


Step 4 — Count reads per fungal phylum

grep -c "Eukaryota; Ascomycota"    ${SAMPLE}.names.out
grep -c "Eukaryota; Mucoromycota"  ${SAMPLE}.names.out
grep -c "Eukaryota; Basidiomycota" ${SAMPLE}.names.out

Combine into a summary table:

Sample | Ascomycota | Mucoromycota | Basidiomycota | Total_Fungal

This table drives Figure 1 and Figure 2 in R downstream.


Step 5 — Extract genus-level abundance

grep "Eukaryota; Ascomycota\|Eukaryota; Basidiomycota\|Eukaryota; Mucoromycota" \
  ${SAMPLE}.names.out \
| awk -F';' '{gsub(/^ +| +$/,"",$6); if($6!="NA" && $6!="") print $6}' \
| sort | uniq -c | sort -nr

The awk command splits the lineage on semicolons, strips whitespace from field 6 (genus), and drops empty or NA entries. This produces a ranked count of genera per sample — the raw material for the genus abundance table.


Step 6 — Build a long-format table for R

grep "Eukaryota; Ascomycota\|Eukaryota; Basidiomycota\|Eukaryota; Mucoromycota" \
  ${SAMPLE}.names.out \
| awk -F';' '{gsub(/^ +| +$/,"",$6); if($6!="NA" && $6!="") print $6}' \
| sort | uniq -c | sort -nr \
| awk -v sample=$SAMPLE 'BEGIN{OFS="\t"; print "Sample","Genus","Reads"} {print sample,$2,$1}' \
> ${SAMPLE}_fungal_genera_long.tsv

Merge all samples into one file:

cat ERR*_fungal_genera_long.tsv \
| awk 'NR==1 || $1!="Sample"' \
> fungal_genera_long.tsv

The awk guard keeps only the header from the first file, preventing duplicate header rows when concatenating.


Step 7 — Visualization in R

library(tidyverse)
library(scales)
library(RColorBrewer)
library(patchwork)
library(pheatmap)

phylum    <- read_tsv("fungal_phylum_summary.tsv")
genus_long <- read_tsv("fungal_genera_long.tsv")

Figure 1: Total fungal reads per sample

p1 <- ggplot(phylum,
             aes(x = reorder(Sample, Total_Fungal), y = Total_Fungal)) +
  geom_col(fill = "#3DA5FF", width = 0.75) +
  coord_flip() +
  theme_classic(base_size = 14) +
  labs(title = "Total fungal reads detected", x = NULL, y = "Fungal reads")

ggsave("Fig1_total_fungal_reads.png", p1, width = 7, height = 4, dpi = 300)

Figure 2: Phylum composition

phylum_long <- phylum %>%
  select(Sample, Ascomycota, Mucoromycota, Basidiomycota) %>%
  pivot_longer(cols = -Sample, names_to = "Phylum", values_to = "Reads") %>%
  group_by(Sample) %>%
  mutate(Relative_Abundance = Reads / sum(Reads)) %>%
  ungroup()

p2 <- ggplot(phylum_long, aes(x = Sample, y = Relative_Abundance, fill = Phylum)) +
  geom_col(width = 0.85, color = "white") +
  scale_y_continuous(labels = percent) +
  scale_fill_brewer(palette = "Set2") +
  theme_classic(base_size = 14) +
  labs(title = "Fungal phylum composition", x = NULL, y = "Relative abundance") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave("FigA_fungal_phylum_composition.png", p2, width = 8, height = 5, dpi = 300)

Figure 3: Genus composition

top_genera <- genus_long %>%
  group_by(Genus) %>%
  summarise(Total_Reads = sum(Reads), .groups = "drop") %>%
  arrange(desc(Total_Reads)) %>%
  slice_head(n = 15) %>%
  pull(Genus)

genus_plot <- genus_long %>%
  mutate(Genus = ifelse(Genus %in% top_genera, Genus, "Others")) %>%
  group_by(Sample, Genus) %>%
  summarise(Reads = sum(Reads), .groups = "drop") %>%
  group_by(Sample) %>%
  mutate(Relative_Abundance = Reads / sum(Reads)) %>%
  ungroup()

p3 <- ggplot(genus_plot, aes(x = Sample, y = Relative_Abundance, fill = Genus)) +
  geom_col(width = 0.85, color = "white") +
  scale_y_continuous(labels = percent) +
  theme_classic(base_size = 14) +
  labs(title = "Fungal genus composition", x = NULL, y = "Relative abundance") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave("FigB_fungal_genus_composition.png", p3, width = 10, height = 5.5, dpi = 300)

Figure 4: Participant 1035 before and after FMT

reads_1035 <- tibble(
  Timepoint    = c("Baseline", "Week 8"),
  Fungal_Reads = c(154, 8517)
)

ggplot(reads_1035, aes(Timepoint, Fungal_Reads)) +
  geom_col(fill = "#2C7FB8") +
  geom_text(aes(label = scales::comma(Fungal_Reads)), vjust = -0.5) +
  theme_classic(base_size = 14) +
  labs(title = "Participant 1035 fungal abundance", subtitle = "Baseline vs Week 8")

154 reads at baseline. 8,517 at week 8. A ~55-fold increase following FMT. Whether this reflects donor-derived fungal engraftment, transient expansion of a resident strain, or a diet effect cannot be determined from five samples — but it is exactly the kind of signal that justifies scaling to the full 285-sample cohort.


Biological interpretation: what Kaiju can and cannot tell you

Kaiju detects DNA. DNA detection does not confirm active growth, colonization, or clinical relevance. This distinction matters especially for the environmental and dietary fungi that appear consistently in gut metagenomes:

Environmental/dietary taxa — likely transient, no colonization implied:

  • Rhizopus (common bread mold)
  • Hanseniaspora (fermentation yeast, found in fruit)
  • Gigaspora, Claroideoglomus (soil arbuscular mycorrhizal fungi)

Host-associated taxa — more likely to reflect true gut mycobiome members:

  • Candida, Pichia, Geotrichum — commensal yeasts
  • Malassezia — skin-associated but detectable in stool
  • Saccharomyces — common in fermented food, can persist in gut

This distinction should inform how you interpret differential abundance between FMT and placebo, or baseline and week 8. A bloom in Rhizopus may reflect a dietary change. A bloom in Candida is more likely to be biologically meaningful in an IBD context.


Scaling to 285 samples on SLURM

Download FASTQ files from ENA

# From ENA FTP file report
cut -f<fastq_ftp_col> filereport_read_run_PRJEB26357.tsv \
| tail -n +2 \
| tr ';' '\n' \
| sed 's#^#https://#' \
| xargs -n 1 -P 6 wget -c

Or using enaDataGet:

conda create -n ena_download -c bioconda -c conda-forge ena-browser-tools -y
conda activate ena_download

cut -f1 filereport_read_run_PRJEB26357.tsv | tail -n +2 > run_accessions.txt

while read RUN; do
  enaDataGet -f fastq $RUN
done < run_accessions.txt

Build a sample sheet

ls *_1.fastq.gz | sed 's/_1.fastq.gz//' > sample_ids.txt

awk 'BEGIN{OFS="\t"} {print $1, $1"_1.fastq.gz", $1"_2.fastq.gz"}' \
  sample_ids.txt > samples.tsv

wc -l samples.tsv  # should be 285

SLURM array job — one job per sample

A loop script running 285 samples sequentially would take days. A SLURM array runs them in parallel, one job per sample:

#!/bin/bash
#SBATCH --job-name=kaiju_array
#SBATCH --array=1-285%10
#SBATCH --cpus-per-task=16
#SBATCH --mem=180G
#SBATCH --time=24:00:00
#SBATCH --output=logs/kaiju_array_%A_%a.out
#SBATCH --error=logs/kaiju_array_%A_%a.err

module load kaiju/1.10.1

BASE=/scratch/jojyj/binod
RAWDIR=$BASE/raw_data
OUTDIR=$BASE/kaiju_all_results
DBDIR=/project/bcampb7/camplab/Jojy/other/kaiju

mkdir -p $OUTDIR logs

LINE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" $RAWDIR/samples.tsv)
SAMPLE=$(echo $LINE | awk '{print $1}')
R1=$(echo $LINE |     awk '{print $2}')
R2=$(echo $LINE |     awk '{print $3}')

echo "Processing: $SAMPLE"

kaiju \
  -t $DBDIR/nodes.dmp \
  -f $DBDIR/kaiju_db_nr_euk.fmi \
  -i $RAWDIR/$R1 \
  -j $RAWDIR/$R2 \
  -o $OUTDIR/${SAMPLE}.out \
  -z 16 -v

kaiju-addTaxonNames \
  -t $DBDIR/nodes.dmp \
  -n $DBDIR/names.dmp \
  -i $OUTDIR/${SAMPLE}.out \
  -o $OUTDIR/${SAMPLE}.names.out \
  -r superkingdom,phylum,class,order,family,genus,species

--array=1-285%10 submits all 285 tasks but runs at most 10 simultaneously — useful for being a polite cluster user when the database loading is memory-intensive.

Submit and monitor:

sbatch run_kaiju_array.sh
squeue -u jojyj
tail -f logs/kaiju_array_JOBID_1.out

Aggregating results across all 285 samples

Phylum summary table

echo -e "Sample\tAscomycota\tMucoromycota\tBasidiomycota\tTotal_Fungal" \
  > fungal_phylum_summary_all.tsv

for f in *.names.out; do
  SAMPLE=$(basename $f .names.out)
  ASC=$(grep -c "Eukaryota; Ascomycota"    $f)
  MUC=$(grep -c "Eukaryota; Mucoromycota"  $f)
  BAS=$(grep -c "Eukaryota; Basidiomycota" $f)
  TOTAL=$((ASC+MUC+BAS))
  echo -e "${SAMPLE}\t${ASC}\t${MUC}\t${BAS}\t${TOTAL}" \
    >> fungal_phylum_summary_all.tsv
done

Genus matrix (wide format for R)

echo -e "Sample\tGenus\tReads" > fungal_genera_long_all.tsv

for f in *.names.out; do
  SAMPLE=$(basename $f .names.out)
  grep "Eukaryota; Ascomycota\|Eukaryota; Basidiomycota\|Eukaryota; Mucoromycota" $f \
  | awk -F';' '{gsub(/^ +| +$/,"",$6); if($6!="NA" && $6!="") print $6}' \
  | sort | uniq -c | sort -nr \
  | awk -v sample=$SAMPLE 'BEGIN{OFS="\t"} {print sample,$2,$1}' \
  >> fungal_genera_long_all.tsv
done

Convert to wide matrix in R:

library(tidyverse)
genus_long <- read_tsv("fungal_genera_long_all.tsv")

genus_matrix <- genus_long %>%
  pivot_wider(names_from = Genus, values_from = Reads, values_fill = 0)

write_tsv(genus_matrix, "fungal_genus_matrix_all.tsv")

Diversity analysis

Once the genus matrix is merged with metadata, standard community ecology statistics apply directly:

library(vegan)

taxa_matrix      <- genus %>% select(where(is.numeric))
sample_metadata  <- genus %>% select(any_of(c("Sample","participant","group","timepoint","response")))

# Alpha diversity
sample_metadata$Observed_Genera <- specnumber(taxa_matrix)
sample_metadata$Shannon          <- diversity(taxa_matrix, index = "shannon")

# Beta diversity
bray  <- vegdist(taxa_matrix, method = "bray")
pcoa  <- cmdscale(bray, eig = TRUE, k = 2)

# PERMANOVA
adonis2(bray ~ group + timepoint + response, data = sample_metadata, permutations = 999)

Identifying fungal-rich samples for genome recovery

Most samples will have too few fungal reads for assembly. A practical threshold for exploring fungal MAG recovery is >10,000 fungal reads; a more realistic threshold for actually obtaining a binnable fungal genome is >100,000 reads. Filter early:

fungal_rich <- phylum %>%
  arrange(desc(Total_Fungal)) %>%
  filter(Total_Fungal >= 10000)

write_tsv(fungal_rich, "fungal_rich_samples.tsv")

For samples that pass the threshold, the workflow extends into eukaryotic assembly and binning:

Step Recommended tools
Fungal read extraction seqtk, BBTools
Assembly MEGAHIT, metaSPAdes
Eukaryotic contig detection EukRep, Tiara
Binning CONCOCT, MetaBAT2
Completeness assessment BUSCO, EukCC
Gene annotation funannotate, eggNOG-mapper

Proposed next steps for the full ERP108339 cohort

The pilot validates the approach. Scaling it produces the data needed to answer the biological question: does FMT alter the gut mycobiome, and does the direction of change predict treatment response in UC?

Phase 1 — Cohort-wide mycobiome profiling: 285 metagenomes → Kaiju → genus matrix → prevalence analysis → alpha/beta diversity → identify fungal-rich samples.

Phase 2 — Clinical associations: FMT vs placebo, baseline vs week 8, treatment response stratification, candidate fungal biomarker identification. Cross-validate significant associations with Kraken2/Bracken using a eukaryote-inclusive database.

Phase 3 — Genome-resolved mycobiome: Select fungal-rich samples → extract fungal reads → assemble → EukRep/Tiara for contig filtering → bin → BUSCO/EukCC quality → annotate.


Future direction: toward genome-resolved fungal analysis

This workflow is the foundation, not the destination. Read-level classification tells you what fungi are present and in what relative proportions — but it cannot tell you anything about the functional capacity of those fungi, their genetic diversity, or whether the same Candida strain persists across timepoints or is replaced by a phylogenetically distinct isolate after FMT. Answering those questions requires genomes.

The roadmap from this pilot to genome-resolved fungal analysis has three stages, each building directly on the outputs of the last:

┌─────────────────────────────┐
│ Pilot Study                 │
│ 5 samples · Kaiju · nr_euk  │
│ Proof of concept            │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Phase 1 · Mycobiome         │
│ Profiling                   │
│                             │
│ • Kaiju → 285 metagenomes   │
│ • Genus/phylum abundance    │
│ • Alpha & beta diversity    │
│ • Prevalence analysis       │
│ • Identify fungal-rich      │
│   samples                   │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Phase 2 · Validation &      │
│ Clinical Associations       │
│                             │
│ • FMT vs Placebo            │
│ • Baseline vs Week 8        │
│ • Treatment response        │
│ • Candidate biomarkers      │
│ • Kraken2/Bracken cross-    │
│   validation                │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Phase 3 · Fungal Genomics   │
│ (future work)               │
│                             │
│ • Extract fungal reads      │
│ • Assemble (MEGAHIT /       │
│   metaSPAdes)               │
│ • Eukaryotic contig ID      │
│   (EukRep, Tiara)           │
│ • MAG binning (MetaBAT2,    │
│   CONCOCT)                  │
│ • Quality: BUSCO, EukCC     │
│ • Annotation: funannotate,  │
│   eggNOG-mapper             │
└─────────────────────────────┘

Phase 3 is where this analysis becomes genuinely novel. Genome-resolved mycobiome analysis in human gut metagenomes is still a largely open field — most published studies stop at read-level classification. Recovering high-quality fungal MAGs from stool metagenomes, linking them to treatment response in IBD, and characterizing their functional gene content would represent a meaningful contribution.

Getting there requires Phase 1 and Phase 2 to work first. The fungal-rich samples that Phase 1 identifies are the input to Phase 3. The clinical associations that Phase 2 establishes determine which samples and which organisms are worth the computational investment of assembly and binning. This post covers Phase 1 in full. Phases 2 and 3 will be documented as the work develops.

This section will be updated as the genome-resolved analysis progresses.


Expected outputs

fungal_phylum_summary_all.tsv
fungal_genera_long_all.tsv
fungal_genus_matrix_all.tsv
fungal_phylum_summary_with_metadata.tsv
fungal_genus_matrix_with_metadata.tsv
fungal_alpha_diversity.tsv
fungal_genus_prevalence.tsv
fungal_rich_samples.tsv

Kaiju reference: Menzel P, Ng KL, Krogh A. “Fast and sensitive taxonomic classification for metagenomics with Kaiju.” Nature Communications 7:11257 (2016). Database: kaiju_db_nr_euk_2023-05-10. Study cohort: ERP108339 / PRJEB26357.

see_your_plot