Day 4 — Non-Parametric Tests for Microbiome Data: Wilcoxon, Kruskal-Wallis & Effect Sizes

*This is Day 4 of a 7-day series: Applied Statistics for Microbiome & Genomics Data.* and *🧬 Day 54 of Daily Bioinformatics from Jojy’s Desk All data and R code → github.com/jojyjohn28/microbiome_stats

Days 2 and 3 were heavy — ordination, constrained axes, permutation matrices, interaction models. Today we deliberately step back and cover something simpler but equally important: comparing groups at the level of individual MAGs.

PERMANOVA told us that Season explains 32% of community variance overall. But which specific MAGs are driving that signal? Which taxa are significantly more abundant in Summer than Spring? Which ones are enriched in Particle-attached fractions versus Free-living? Which MAGs thrive at high salinity and which disappear?

These are the questions biologists actually ask. And the answers come from non-parametric tests — the workhorses of microbiome data analysis.

🧬 The Biological Questions

Using our 61 MAGs across 27 metagenomes from Chesapeake and Delaware Bay, we want to know:

Which MAGs differ significantly between Spring and Summer? (Wilcoxon, 2 groups)
Which MAGs differ across Salinity regimes — low (<15 psu) vs high (≥15 psu)? (Wilcoxon with split continuous variable)
Do Particle-attached vs Free-living fractions harbour different MAG assemblages at the taxon level?
When testing 61 MAGs simultaneously, how do we avoid false positives? (FDR correction)
For significant MAGs — is the difference biologically meaningful? (Cliff’s delta effect size)

📊 The Statistical Concepts

Why not a t-test?

The t-test assumes your data is normally distributed. MAG abundance data is:

Zero-inflated — most MAGs are absent from most samples
Right-skewed — a few samples with very high abundance pull the mean far from the median
Compositional — values are relative, so the same MAG can appear to change because a different MAG changed

Running a t-test on these distributions will give you wrong p-values. The standard error estimate is biased, the test statistic doesn’t follow a t-distribution, and your conclusions will be unreliable.

Non-parametric tests make no distributional assumptions. They work by ranking the values rather than using the raw numbers — so outliers, skew, and zeros don’t distort the result.

Wilcoxon rank-sum test (Mann-Whitney U)

Use when comparing two groups (Season: Spring vs Summer; SF: Particle-attached vs Free-living).

What it does: Ranks all observations from both groups together (1 to n), then asks whether one group’s ranks tend to be higher than the other’s. If Group A consistently has higher ranks, its values tend to be higher — without assuming anything about the shape of the distribution.

The null hypothesis: The two groups are drawn from the same distribution (no difference in abundance).

Kruskal-Wallis test

Use when comparing three or more groups. Conceptually identical to Wilcoxon but extended to k groups — the non-parametric equivalent of one-way ANOVA.

In our dataset, all grouping variables have two levels, so we use Wilcoxon throughout. But if you had three seasons or three salinity zones, Kruskal-Wallis followed by post-hoc pairwise Wilcoxon tests is the right approach.

The multiple testing problem

When you test 61 MAGs, each at p < 0.05, you expect ~3 false positives by chance alone even if nothing is real (61 × 0.05 ≈ 3). Test across three comparisons (Season, SF, Salinity) and that’s 183 tests — expect ~9 false positives.

This is p-hacking, even unintentional. The solution is FDR correction (Benjamini-Hochberg): adjusts all p-values simultaneously so that the expected proportion of false positives among your significant results stays at 5%. The corrected values are called adjusted p-values or q-values.

Always use p.adjust(p_values, method = "BH") after running multiple tests.

Cliff’s delta — the effect size you must report

A p-value tells you whether a difference exists. It does not tell you how large that difference is.

Cliff’s delta (δ) is the non-parametric effect size for Wilcoxon tests. It measures the probability that a randomly selected value from Group A is larger than a randomly selected value from Group B, minus the reverse:

δ = P(X > Y) − P(X < Y)

δ = 0 → groups are identical
δ = +1 → every value in Group A exceeds every value in Group B
δ = −1 → every value in Group B exceeds every value in Group A

Conventional thresholds:

	δ		Interpretation
< 0.147	Negligible
0.147 – 0.330	Small
0.330 – 0.474	Medium
≥ 0.474	Large

A MAG can be statistically significant (small p) but have δ = 0.15 — a tiny, biologically irrelevant difference. Always report both. A result like “Pelagibacterales_MAG_18 was significantly more abundant in Summer (q = 0.003, Cliff’s δ = 0.72, large effect)” is scientifically meaningful. “Pelagibacterales_MAG_18 was significant (p = 0.003)” is not.

⚙️ The R Code

Step 1 — Load libraries and data

library(vegan)
library(ggplot2)
library(ggrepel)
library(effsize)    # cliff.delta()
library(dplyr)
library(tidyr)
library(patchwork)

# Load and align data
metadata <- read.csv("../day1/data/metadata.csv",
                     header = TRUE, row.names = 1, sep = ",")
mag_raw  <- read.csv("../day1/data/mag_abundance.csv",
                     header = TRUE, row.names = 1, sep = ",")

mag_t <- t(mag_raw)   # 27 × 61

# Align row order (same fix as Day 3)
metadata <- metadata[match(rownames(mag_t), rownames(metadata)), ]
stopifnot(all(rownames(mag_t) == rownames(metadata)))

# Group variables
Season   <- metadata$Season    # Spring (n=8), Summer (n=19)
Bay      <- metadata$Bay       # Chesapeake (n=11), Delaware (n=16)
Fraction <- metadata$SF        # Particle-attached (n=14), Free_living (n=13)

# Salinity split: <15 psu = low, ≥15 psu = high
# Natural estuarine threshold; our range is 0.18–30.52 psu
Salinity_group <- ifelse(metadata$Salinity < 15, "Low_salinity", "High_salinity")
table(Salinity_group)

Step 2 — Single MAG walkthrough

Before testing all 61 MAGs, understand the test on one example:

mag_name  <- "Pelagibacterales_MAG_18"
mag_abund <- mag_t[, mag_name]

spring_abund <- mag_abund[Season == "Spring"]
summer_abund <- mag_abund[Season == "Summer"]

# Always visualise before testing
boxplot(list(Spring = spring_abund, Summer = summer_abund),
        main  = paste(mag_name, "— by Season"),
        ylab  = "Abundance",
        col   = c("#22c55e", "#ef4444"),
        notch = FALSE)

# Wilcoxon rank-sum test
# exact = FALSE required when ties exist (shared zero abundances)
wt <- wilcox.test(spring_abund, summer_abund, exact = FALSE)
cat("p-value:", wt$p.value, "\n")

# Cliff's delta — direction: positive = higher in summer
cd <- cliff.delta(summer_abund, spring_abund)
cat("Cliff's delta:", round(cd$estimate, 3),
    "(", cd$magnitude, ")\n")

Step 3 — All 61 MAGs: Season comparison

mag_names <- colnames(mag_t)

results_season <- lapply(mag_names, function(mag) {

  spring_vals <- mag_t[Season == "Spring",  mag]
  summer_vals <- mag_t[Season == "Summer", mag]

  wt <- wilcox.test(spring_vals, summer_vals, exact = FALSE)
  cd <- cliff.delta(summer_vals, spring_vals)

  data.frame(
    MAG       = mag,
    p_value   = wt$p.value,
    cliffs_d  = cd$estimate,
    magnitude = as.character(cd$magnitude)
  )
})

results_season <- do.call(rbind, results_season)

# FDR correction
results_season$p_adj <- p.adjust(results_season$p_value, method = "BH")
results_season <- results_season[order(results_season$p_adj), ]

sig_season <- results_season[results_season$p_adj < 0.05, ]
cat("MAGs significantly different by Season:", nrow(sig_season), "\n")
print(sig_season)

Step 4 — Size fraction comparison

results_sf <- lapply(mag_names, function(mag) {

  pa_vals <- mag_t[Fraction == "Particle-attached", mag]
  fl_vals <- mag_t[Fraction == "Free_living",       mag]

  wt <- wilcox.test(pa_vals, fl_vals, exact = FALSE)
  cd <- cliff.delta(pa_vals, fl_vals)  # positive = higher in Particle-attached

  data.frame(
    MAG       = mag,
    p_value   = wt$p.value,
    cliffs_d  = cd$estimate,
    magnitude = as.character(cd$magnitude)
  )
})

results_sf <- do.call(rbind, results_sf)
results_sf$p_adj <- p.adjust(results_sf$p_value, method = "BH")
results_sf <- results_sf[order(results_sf$p_adj), ]

sig_sf <- results_sf[results_sf$p_adj < 0.05, ]
cat("MAGs significantly different by Size Fraction:", nrow(sig_sf), "\n")
print(sig_sf)

Step 5 — Salinity regime comparison

results_sal <- lapply(mag_names, function(mag) {

  low_vals  <- mag_t[Salinity_group == "Low_salinity",  mag]
  high_vals <- mag_t[Salinity_group == "High_salinity", mag]

  wt <- wilcox.test(low_vals, high_vals, exact = FALSE)
  cd <- cliff.delta(high_vals, low_vals)  # positive = higher at high salinity

  data.frame(
    MAG       = mag,
    p_value   = wt$p.value,
    cliffs_d  = cd$estimate,
    magnitude = as.character(cd$magnitude)
  )
})

results_sal <- do.call(rbind, results_sal)
results_sal$p_adj <- p.adjust(results_sal$p_value, method = "BH")
results_sal <- results_sal[order(results_sal$p_adj), ]

sig_sal <- results_sal[results_sal$p_adj < 0.05, ]
cat("MAGs significantly different by Salinity:", nrow(sig_sal), "\n")
print(sig_sal)

Step 6 — Kruskal-Wallis with three salinity zones

# Split salinity into three ecologically meaningful zones
Salinity_3 <- cut(metadata$Salinity,
                  breaks = c(-Inf, 5, 18, Inf),
                  labels = c("Oligohaline", "Mesohaline", "Polyhaline"))
table(Salinity_3)

results_kw <- lapply(mag_names, function(mag) {
  kw <- kruskal.test(mag_t[, mag] ~ Salinity_3)
  data.frame(MAG = mag, p_value = kw$p.value, chi_sq = kw$statistic)
})

results_kw <- do.call(rbind, results_kw)
results_kw$p_adj <- p.adjust(results_kw$p_value, method = "BH")
sig_kw <- results_kw[results_kw$p_adj < 0.05, ]

# Post-hoc pairwise Wilcoxon for significant MAGs
for (mag in sig_kw$MAG) {
  cat("\n--- Post-hoc:", mag, "---\n")
  pw <- pairwise.wilcox.test(mag_t[, mag], Salinity_3,
                              p.adjust.method = "BH",
                              exact = FALSE)
  print(pw$p.value)
}

Step 7 — Save full results table

all_results <- data.frame(
  MAG          = mag_names,
  season_padj  = results_season$p_adj[match(mag_names, results_season$MAG)],
  season_d     = results_season$cliffs_d[match(mag_names, results_season$MAG)],
  sf_padj      = results_sf$p_adj[match(mag_names, results_sf$MAG)],
  sf_d         = results_sf$cliffs_d[match(mag_names, results_sf$MAG)],
  sal_padj     = results_sal$p_adj[match(mag_names, results_sal$MAG)],
  sal_d        = results_sal$cliffs_d[match(mag_names, results_sal$MAG)]
)

all_results$any_sig <- with(all_results,
  season_padj < 0.05 | sf_padj < 0.05 | sal_padj < 0.05)

write.csv(all_results, "results/day4_wilcoxon_results.csv", row.names = FALSE)

Step 8 — Volcano plot

volcano_df <- results_season
volcano_df$neg_log10_padj <- -log10(volcano_df$p_adj + 1e-10)
volcano_df$sig   <- volcano_df$p_adj < 0.05
volcano_df$large <- abs(volcano_df$cliffs_d) >= 0.474
volcano_df$label <- ifelse(volcano_df$sig & volcano_df$large,
                            as.character(volcano_df$MAG), NA)

p_volcano <- ggplot(volcano_df,
                    aes(x = cliffs_d, y = neg_log10_padj,
                        colour = sig, size = sig)) +
  geom_point(alpha = 0.8) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", colour = "grey50") +
  geom_vline(xintercept = c(-0.474, 0.474), linetype = "dashed", colour = "grey50") +
  geom_text_repel(aes(label = label), size = 3,
                  max.overlaps = 15, colour = "grey20") +
  scale_colour_manual(values = c("FALSE" = "grey75", "TRUE" = "#ef4444"),
                      labels = c("Not significant", "q < 0.05")) +
  scale_size_manual(values = c("FALSE" = 1.5, "TRUE" = 2.5), guide = "none") +
  annotate("text", x = -0.9, y = 0.3, label = "Spring-enriched",
           colour = "#22c55e", fontface = "italic", size = 3.5) +
  annotate("text", x =  0.9, y = 0.3, label = "Summer-enriched",
           colour = "#ef4444", fontface = "italic", size = 3.5) +
  labs(x      = "Cliff's delta  (negative = Spring, positive = Summer)",
       y      = "-log10(adjusted p-value)",
       title  = "Differential abundance — Spring vs Summer",
       colour = NULL) +
  theme_classic(base_size = 13) +
  theme(legend.position = "top",
        plot.title = element_text(face = "bold"))

print(p_volcano)

Step 9 — Boxplots of top MAGs

top_mags <- head(sig_season$MAG, 6)

plot_df <- as.data.frame(mag_t[, top_mags])
plot_df$Season <- Season
plot_long <- pivot_longer(plot_df, -Season,
                           names_to = "MAG", values_to = "Abundance")

p_box <- ggplot(plot_long, aes(x = Season, y = Abundance, fill = Season)) +
  geom_boxplot(width = 0.5, outlier.shape = 21,
               outlier.size = 1.5, alpha = 0.85) +
  geom_jitter(width = 0.1, size = 1.5, alpha = 0.5, colour = "grey30") +
  scale_fill_manual(values = c("Spring" = "#22c55e", "Summer" = "#ef4444")) +
  facet_wrap(~ MAG, scales = "free_y", ncol = 3) +
  labs(x = NULL, y = "Abundance",
       title = "Top differentially abundant MAGs — Spring vs Summer") +
  theme_classic(base_size = 11) +
  theme(legend.position = "none",
        strip.text = element_text(face = "italic", size = 8),
        plot.title = element_text(face = "bold"))

print(p_box)

Step 10 — Save combined figure

combined <- p_volcano / p_box +
  plot_annotation(tag_levels = "A")

ggsave("figures/day4_nonparametric_figure.tiff",
       combined, width = 12, height = 10,
       dpi = 300, compression = "lzw")

⚠️ Common Mistakes

Mistake 1: Using a t-test on MAG abundance data. The most common mistake in the field. MAG abundances are zero-inflated, right-skewed, and compositional. The t-test’s normality assumption is violated. Use Wilcoxon. Every time.

Mistake 2: Reporting raw p-values across 61 MAGs. With 61 tests at α = 0.05, you expect ~3 false positives by chance. Always apply p.adjust(..., method = "BH") and report adjusted p-values. This is not optional.

Mistake 3: Reporting p-values without effect sizes. A MAG with q = 0.001 and Cliff’s δ = 0.12 (negligible effect) is statistically significant but biologically irrelevant. Always pair your p-value with Cliff’s delta and its magnitude category.

Mistake 4: Not visualising before testing. Always plot distributions before running the test. A quick boxplot reveals if the “significant” difference is driven by a single outlier, which the test statistic alone will not tell you.

Mistake 5: Treating non-significant as “no difference”. With only 8 Spring samples, power is limited. Non-significant means “insufficient evidence” — not “proven identical.” Report effect sizes even for non-significant results.

Mistake 6: Forgetting exact = FALSE with ties. Microbiome data has many zeros — ties in the rank matrix. Always set exact = FALSE in wilcox.test() to avoid errors or unreliable exact p-values.

🧭 Key Takeaways

Wilcoxon rank-sum is the correct test for two-group comparisons of MAG abundances — not the t-test, which requires normality that microbiome data never satisfies.
Kruskal-Wallis extends to three or more groups, followed by post-hoc pairwise Wilcoxon with FDR correction.
FDR correction (BH method) is mandatory when testing 61 MAGs simultaneously — without it, your results contain an unknown number of false positives.
Cliff’s delta is the effect size for Wilcoxon — always report it alongside your p-value. A result without effect size is scientifically incomplete.
Volcano plots combine effect size and significance into a single, standard visualisation for differential abundance analysis.
Non-parametric tests and PERMANOVA answer complementary questions: PERMANOVA asks “do communities differ overall?”, Wilcoxon asks “which specific MAGs differ, and by how much?”

🔗 What’s Next

Tomorrow on Day 5, we move from group comparisons to regression — asking not just “which MAGs differ between groups” but “how does a specific MAG’s abundance change continuously with Salinity, Temperature, or ChlA?” Linear models for MAG abundance, model diagnostics, and biological interpretation of regression coefficients.

All R code and data: github.com/jojyjohn28/microbiome_stats
Found this useful? Share it with someone learning microbiome statistics.

ERMANOVA analysis of MAG community composition across seasons and estuaries