Size-Fractionated Microbiome Analysis β Day 3: Species-Level Profiling with mOTUs
π Size-Fractionated Microbiome Analysis β Day 3
In Day 1 (https://jojyjohn28.github.io/blog/size-fractionated-microbiome-analysis-day1/), we focused on metagenome preprocessing, and in Day 2 (https://jojyjohn28.github.io/blog/size-fractionated-microbiome-analysis-day2/), we explored taxonomic profiling using Kaiju.
Today, in Day 3, I introduce mOTUs v3, a marker-geneβbased approach for species-level profiling of prokaryotes directly from short reads.
This method is particularly powerful for read-based analyses, where genomes are not assembled or binned, and is equally applicable to DNA (total community) and RNA (active community) datasets.
𧬠Primary tool: mOTUs v3
mOTUs profiles microbial communities using 40 universal single-copy marker genes, providing consistent, genome-informed species resolution.
Why mOTUs?
Uses conserved marker genes instead of whole-genome alignment
β Quantifies both known taxa (linked to reference genomes) and unknown mOTUs (derived from metagenomes)
β Produces directly comparable profiles across DNA and RNA datasets
β Outputs normalized relative abundances suitable for downstream statistics and visualization
This makes mOTUs ideal for size-fractionated microbiome studies, where consistent taxonomic resolution across multiple datasets is critical.
π οΈ Installation and database setup
Install mOTUs
conda install -c bioconda motus
Verify installation:
motus --version
mOTUs ships with prebuilt marker-gene databases, so no separate database construction is required. for more details refer : https://motu-tool.org/tutorial.html
π§ͺ Running mOTUs on paired-end reads
Example command
motus profile \
-f Control_23_S13_R1.fq \
-r Control_23_S13_R2.fq \
-o Control_23_S13_motus.tsv
General usage
# Total community (DNA)
motus profile \
-f reads/dna/S1_R1.fq.gz \
-r reads/dna/S1_R2.fq.gz \
-o motus/S1_dna.motus.tsv
# Active community (RNA)
motus profile \
-f reads/rna/S1_R1.fq.gz \
-r reads/rna/S1_R2.fq.gz \
-o motus/S1_rna.motus.tsv
The output table contains consensus taxonomy and relative abundance per sample.
π Batch processing on HPC (example)
Below is the batch script used to run mOTUs across Chesapeake Bay metagenomes.
#!/bin/bash
#SBATCH --job-name=motus_mg_CP
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=1500G
#SBATCH --time=72:00:00
#SBATCH -o motus_%A_%a.out
#SBATCH -e motus_%A_%a.err
#SBATCH -A camplab
#SBATCH -p camplab
set -euo pipefail
cd /project/bcampb7/camplab/AL_JJ_oct23
input_dir="/project/bcampb7/camplab/0_Raw_Data/Metagenome_estury/CB_DNA_trimmomatic"
output_dir="/project/bcampb7/camplab/AL_JJ_oct23/motus_output/cp"
mkdir -p "${output_dir}"/{profiles,logs}
source ~/.bashrc
conda activate /home/jojyj/.conda/envs/motus_env
for R1 in "${input_dir}"/*_R1.fastq; do
sample=$(basename "${R1/_R1.fastq/}")
R2="${input_dir}/${sample}_R2.fastq"
motus profile \
-f "$R1" \
-r "$R2" \
-o "${output_dir}/profiles/${sample}_motus.txt" \
-n 32 \
2> "${output_dir}/logs/${sample}.log"
done
π§© Merging mOTUs profiles across samples
Motus do have an built-in motus merge command (for me it failed for this dataset due to a error so instead of rerunning all my metegenomes;I used a robust bash-based merging approach, which is transparent and easy to repeat.)
Step 1 β Navigate to profiles directory
cd path/to/output/folder
Step 2 β Fix headers (add sample name)
for f in *_motus.txt; do
sample=$(basename "$f" _motus.txt)
awk -v s="$sample" 'NR==3{print "#consensus_taxonomy\t"s; next} {print}' "$f" > tmp && mv tmp "$f"
done
Step 3 β Create clean tables for merging
mkdir -p merged_clean
for f in *_motus.txt; do
sample=$(basename "$f" _motus.txt)
tail -n +3 "$f" > merged_clean/"$sample".tsv
done
Step 4 β Merge all samples
cd merged_clean
first=$(ls *.tsv | sort | head -n 1)
cut -f1 "$first" > clades.txt
for f in $(ls *.tsv | sort); do
cut -f2 "$f" > "${f%.tsv}.ab"
done
paste clades.txt *.ab > merged_raw.tsv
sed '1s/#consensus_taxonomy/clade_name/' merged_raw.tsv > motus_merged_profiles.tsv
The resulting file (motus_merged_profiles.tsv) is analysis-ready.
𧬠Generating the final order-level abundance table from mOTUs profiles
mOTUs outputs taxonomic assignments as consensus lineage strings, typically spanning multiple ranks (kingdom β species). For downstream ecological visualization (e.g., stacked bar plots and heatmaps), I summarized the data at the order level, which provides a good balance between resolution and interpretability for estuarine microbiomes.
Overview of the approach
-
Start from the merged mOTUs profile table (motus_merged_profiles.tsv)
-
Parse the consensus taxonomy lineage to extract the order-level annotation
-
Collapse species- and genus-level entries into their corresponding orders
-
Sum relative abundances across all mOTUs belonging to the same order
-
Produce a sample Γ order matrix for visualization and statistical analysis
This approach ensures that:
β Taxonomy is consistent across samples
β Unknown or poorly resolved taxa are handled transparently
β The resulting table is directly compatible with ggplot2, heatmaps, and multivariate analyses
π§© Step-by-step: Order-level table construction in R
I used βRβ for this goal:
library(tidyverse) #Load required packages
motus <- read_tsv("motus_merged_profiles.tsv") #Read merged mOTUs table
#The first column (clade_name) contains the consensus taxonomy lineage, followed by one column per sample.
motus_long <- motus %>%
pivot_longer(
cols = -clade_name,
names_to = "Sample",
values_to = "Abundance"
)
#Reshape to long format
#Extract order-level taxonomy from lineage strings
#mOTUs taxonomy strings typically follow this structure:
#k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Alteromonadales|...
#We extract the o__ component and clean it for plotting.
motus_long <- motus_long %>%
mutate(
order = str_extract(clade_name, "o__[^|]+"),
order = str_remove(order, "^o__"),
order = if_else(is.na(order), "Unclassified", order)
)
#Summarize abundances at the order level
order_level <- motus_long %>%
group_by(order, Sample) %>%
summarise(
Abundance = sum(Abundance),
.groups = "drop"
)
#At this stage, each sample has a relative abundance profile summarized by order.
This table was used as the final order-level input for:
β Stacked bar plots (Day 3)
β Heatmaps and multivariate analyses (later days)
β network and co-occurance modelling in coming months
π Visualization in R
Stacked bar plots (order level, bay-wise)
library(tidyverse)
library(readxl)
df <- read_excel(
"/home/jojy-john/Jojy_Research_Sync/Fl_vs_Pa/total_and_active_community/order_level_summed.xlsx"
)
df_long <- df %>%
pivot_longer(
cols = -order,
names_to = "Sample",
values_to = "Abundance"
) %>%
mutate(
Bay = case_when(
str_detect(Sample, "^CP") ~ "Chesapeake",
str_detect(Sample, "^DE") ~ "Delaware",
TRUE ~ "Unknown"
),
order = str_trim(order),
order = ifelse(order %in% c("Other", "Others"), "Other", order)
)
top_orders <- df_long %>%
filter(order != "Other") %>%
group_by(order) %>%
summarise(total = sum(Abundance), .groups = "drop") %>%
arrange(desc(total)) %>%
slice_head(n = 15) %>%
pull(order)
df_long2 <- df_long %>%
mutate(order2 = ifelse(order %in% top_orders, order, "Other")) %>%
group_by(order2, Sample, Bay) %>%
summarise(Abundance = sum(Abundance), .groups = "drop")
ggplot(df_long2, aes(x = Sample, y = Abundance, fill = order2)) +
geom_bar(stat = "identity", position = "fill") +
facet_wrap(~Bay, scales = "free_x") +
labs(
x = "Samples",
y = "Relative abundance",
fill = "Order"
) +
theme_minimal(base_size = 13) +
theme(
axis.text.x = element_text(angle = 60, hjust = 1, face = "bold"),
strip.text = element_text(face = "bold"),
legend.title = element_text(face = "bold")
)
β Strengths of mOTUs in this study
β Species-level resolution without MAGs
β Consistent taxonomy across DNA and RNA
β Robust handling of unknown taxa
β Clean, normalized outputs for statistical analysis
π Kaiju vs mOTUs: Consistency across methods
To assess methodological consistency, I compared Kaiju (protein-level classification) and mOTUs (marker-geneβbased profiling) outputs at higher taxonomic ranks. Despite relying on fundamentally different approaches, both methods recovered highly similar dominant taxa across samples, particularly when comparisons were made at the order level.
