Metadata & Diversity Explainer

A metadata file links each sample ID to its context: treatment, season, location, and measured variables. Without it, QIIME 2 and R cannot group or compare your samples.

my_metadata.txt

#SampleID · Treatment · Replicate · Season · Shannon

#SampleID	Treatment	Rep	Season	Shannon

What each column does

#SampleID: must match feature table exactly
Treatment: categorical → boxplot groups
Replicate: numeric or categorical
Season: categorical → Kruskal-Wallis
Shannon: numeric (calculated separately)

How to use in QIIME 2

qiime diversity core-metrics-phylogenetic \
  --i-phylogeny rooted-tree.qza \
  --i-table table-no-contam.qza \
  --p-sampling-depth 50000 \
  --m-metadata-file my_metadata.txt \
  --output-dir core-metrics/

Alpha diversity measures how diverse each individual sample is. The metadata file's grouping column (Treatment) is what lets you compare distributions — without it, you just have 9 numbers with no biological meaning.
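As a rough illustration of why the grouping column matters, the sketch below reads a small tab-separated metadata table and collects the Shannon values per treatment. The sample IDs and Shannon values are invented for illustration, and `shannon_by_group` is a hypothetical helper, not part of QIIME 2.

```python
import csv
import io
from collections import defaultdict
from statistics import mean, median

# Hypothetical metadata: IDs and Shannon values are made up for illustration.
EXAMPLE_TSV = "\n".join([
    "#SampleID\tTreatment\tRep\tSeason\tShannon",
    "S1a\tNC\t1\tSpring\t4.1",
    "S1b\tNC\t2\tSpring\t4.3",
    "S1c\tNC\t3\tSpring\t4.2",
    "S2a\tNPK\t1\tSpring\t4.6",
    "S2b\tNPK\t2\tSpring\t4.8",
    "S2c\tNPK\t3\tSpring\t4.7",
    "S3a\tnOB9\t1\tSpring\t5.2",
    "S3b\tnOB9\t2\tSpring\t5.0",
    "S3c\tnOB9\t3\tSpring\t5.1",
]) + "\n"


def shannon_by_group(tsv_text, group_col="Treatment", value_col="Shannon"):
    """Return {group: [values]} from tab-separated metadata text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    id_col = reader.fieldnames[0]
    groups = defaultdict(list)
    for row in reader:
        if row[id_col].startswith("#"):
            continue  # skip comment rows and the #q2:types row
        groups[row[group_col]].append(float(row[value_col]))
    return dict(groups)


groups = shannon_by_group(EXAMPLE_TSV)
for name, values in groups.items():
    print(f"{name}: n={len(values)} mean={mean(values):.2f} median={median(values):.2f}")
```

Without the Treatment column the same nine values would land in a single list, and there would be nothing to compare.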

[Figure: Shannon index by treatment, with a group statistics panel. Groups: NC (negative control), NPK fertiliser, nOB9 bioinoculant.]

Three metrics explained

Shannon index: richness + evenness combined (most common)
Observed features: raw ASV count after rarefaction (simple count)
Faith's PD: phylogenetic branch length covered (phylogenetic)
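The first two metrics can be computed directly from one sample's feature counts. The sketch below assumes a plain list of counts per ASV; the function names and example counts are illustrative. Tools differ in the log base they use for Shannon, so `base` is left as a parameter. Faith's PD needs a phylogenetic tree and is not covered here.

```python
import math


def observed_features(counts):
    """Number of features (ASVs) with a non-zero count."""
    return sum(1 for c in counts if c > 0)


def shannon(counts, base=math.e):
    """Shannon index H = -sum(p_i * log(p_i)) over features with p_i > 0."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample has no reads")
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log(p, base)
    return h


# Two hypothetical samples with the same richness but different evenness.
even = [25, 25, 25, 25]
skewed = [97, 1, 1, 1]

print(observed_features(even), observed_features(skewed))
print(round(shannon(even), 3), round(shannon(skewed), 3))
```

Both samples have four observed features, but the skewed sample has a much lower Shannon index because one ASV dominates.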

R code to plot

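library(ggplot2)

# alpha_df: one row per sample, with Treatment and Shannon columns
ggplot(alpha_df,
  aes(Treatment, Shannon,
      fill = Treatment)) +
  # outlier.shape = NA: the jitter layer already draws every sample
  geom_boxplot(alpha = .5, outlier.shape = NA) +
  geom_jitter(width = .1, size = 2) +
  theme_bw()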

Beta diversity asks how different communities are between samples. The PCoA plot is coloured by a metadata column (Treatment by default); colouring by a different column can reveal different structure. PERMANOVA tests whether community composition differs between groups more than expected by chance.

[Figure: Bray-Curtis PCoA, points coloured by the selected metadata column]

PERMANOVA (Treatment): R²=0.68, p=0.001 — treatment explains 68% of community variation

Three beta metrics

Bray-Curtis: relative abundance differences (most used)
Unweighted UniFrac: presence/absence only, phylogenetic (presence/absence)
Weighted UniFrac: abundance + phylogeny combined (phylogenetic)
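To make the abundance versus presence/absence distinction concrete, here is a minimal sketch of Bray-Curtis dissimilarity between two samples. Because UniFrac needs a tree, plain Jaccard distance is swapped in as a non-phylogenetic stand-in for the presence/absence idea; it is not UniFrac. The counts are hypothetical.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    if len(a) != len(b):
        raise ValueError("samples must cover the same features")
    total = sum(a) + sum(b)
    if total == 0:
        raise ValueError("both samples are empty")
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def jaccard(a, b):
    """Jaccard distance on presence/absence (non-phylogenetic stand-in)."""
    if len(a) != len(b):
        raise ValueError("samples must cover the same features")
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, x in enumerate(b) if x > 0}
    union = present_a | present_b
    if not union:
        raise ValueError("both samples are empty")
    return 1 - len(present_a & present_b) / len(union)


# Same two ASVs present in both samples, but with swapped dominance.
sample_1 = [90, 10, 0]
sample_2 = [10, 90, 0]

print(bray_curtis(sample_1, sample_2))
print(jaccard(sample_1, sample_2))
```

The abundance-based metric sees these samples as very different, while the presence/absence metric sees them as identical. That is why the choice of beta metric changes what a PCoA can show.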

PERMANOVA in R

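library(vegan)

# check.names = FALSE keeps sample IDs exactly as written in the file
dist_mat <- read.delim(
  "bray_distance/distance-matrix.tsv",
  row.names = 1, check.names = FALSE)
dist_obj <- as.dist(as.matrix(dist_mat))

# adonis2 pairs rows by position, not by name, so align the metadata first
# (assumes meta has a SampleID column)
meta <- meta[match(rownames(dist_mat), meta$SampleID), ]

set.seed(42)  # reproducible permutation p-value
adon <- adonis2(
  dist_obj ~ Treatment,
  data = meta,
  permutations = 999)
adon

# R² = variance explained
# p < 0.05 → significant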
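For readers curious about where R² and p come from, the sketch below is a hand-rolled one-way PERMANOVA on a small distance matrix. It illustrates the sum-of-squares partition and the label-shuffling test; it is not vegan's implementation. The function name, the toy 9 × 9 matrix (within-group distance 0.1, between-group 0.9) and the seed are all assumptions made for illustration.

```python
import math
import random


def permanova(dist, groups, permutations=999, seed=42):
    """Return (pseudo_F, R2, p) for a square distance matrix and group labels.

    `groups` must list one label per sample, in the same order as `dist`.
    """
    n = len(dist)
    if len(groups) != n:
        raise ValueError("need one group label per row of the distance matrix")
    labels = sorted(set(groups))
    a = len(labels)
    if a < 2 or a >= n:
        raise ValueError("need at least two groups and some replication")

    # Total sum of squares: squared distances over all pairs, divided by N.
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    if ss_total == 0:
        raise ValueError("all distances are zero")

    def partition(labelling):
        """Pseudo-F and R2 for one assignment of labels to samples."""
        ss_within = 0.0
        for g in labels:
            idx = [i for i, lab in enumerate(labelling) if lab == g]
            pairs = sum(dist[i][j] ** 2 for k, i in enumerate(idx) for j in idx[k + 1:])
            ss_within += pairs / len(idx)
        ss_among = ss_total - ss_within
        if ss_within == 0:
            return math.inf, 1.0
        f = (ss_among / (a - 1)) / (ss_within / (n - a))
        return f, ss_among / ss_total

    f_obs, r2 = partition(groups)

    # Shuffle labels to build the null distribution of pseudo-F.
    rng = random.Random(seed)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if partition(shuffled)[0] >= f_obs * (1 - 1e-9):
            hits += 1
    return f_obs, r2, (hits + 1) / (permutations + 1)


# Toy data: three tight, well-separated groups of three samples each.
labels = ["NC"] * 3 + ["NPK"] * 3 + ["nOB9"] * 3
dist = [[0.0 if i == j else (0.1 if labels[i] == labels[j] else 0.9)
         for j in range(9)] for i in range(9)]

f_stat, r2, p = permanova(dist, labels)
print(f"pseudo-F={f_stat:.1f} R2={r2:.3f} p={p:.3f}")
```

R² is the among-group share of the total sum of squares. The p-value is the fraction of label shuffles whose pseudo-F is at least as large as the observed one, so it depends on sample order being matched correctly between the distance matrix and the metadata.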

What the PCoA shows

Clusters close together: similar communities
Clusters far apart: different communities
Mixed/overlapping: treatment likely not significant

A minimal lab metadata file has Treatment, Replicate, and Season. A real environmental study adds physical and chemical measurements. Each extra column unlocks a new class of analysis. Below is a real-world example from an estuarine microbiome study.

Environmental metadata example

#SampleID	Bay	Season	Fraction	Salinity	Temp °C
CP_Sp_FL_01	CP	Spring	FL	8.2	14.3
CP_Su_PA_01	CP	Summer	PA	12.7	26.8
DE_Su_FL_01	DE	Summer	FL	28.6	25.4
DE_Fa_PA_01	DE	Fall	PA	30.2	17.8

FL = Free Living · PA = Particle Attached

FL (0.2–0.8 µm): planktonic bacteria
PA (>0.8 µm): particle-attached bacteria

What each variable enables

Salinity (continuous, 0–40 PSU): Spearman ρ with Shannon
Temperature (continuous, °C): dbRDA gradient analysis
Bay (Chesapeake / Delaware): PERMANOVA factor
Season (Spring / Summer / Fall): Kruskal-Wallis
Size fraction (FL / PA): Wilcoxon test
Fraction × Season (interaction term): two-way PERMANOVA
pH, DO, depth (continuous drivers): indicator species analysis
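As one example of what a continuous column enables, the sketch below computes Spearman's ρ between salinity and Shannon diversity using only the standard library. The salinity values are the four from the example table; the Shannon values are invented for illustration, and `ranks` and `spearman` are hypothetical helper names.

```python
from math import sqrt
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences of at least 2 values")
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / (sx * sy)


salinity = [8.2, 12.7, 28.6, 30.2]   # from the example table
shannon_h = [5.1, 4.8, 4.0, 3.6]     # hypothetical values

print(spearman(salinity, shannon_h))
```

Because ρ works on ranks, it only asks whether diversity rises or falls monotonically with salinity, not whether the relationship is linear.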

Complexity comparison

Lab experiment: 3 treatments × 3 reps = 9 rows
Environmental study: 2 bays × 3 seasons × 2 fractions × 3 reps = 36 rows
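The row counts follow from crossing every factor level with every other. The sketch below enumerates both designs. The ID pattern (bay, season, fraction, then a two-digit number) is inferred from the example rows, and treating that trailing number as the replicate is an assumption.

```python
from itertools import product

# Lab experiment: 3 treatments x 3 replicates.
treatments = ["NC", "NPK", "nOB9"]
lab_rows = list(product(treatments, [1, 2, 3]))

# Environmental study: 2 bays x 3 seasons x 2 fractions x 3 replicates.
bays = ["CP", "DE"]
seasons = ["Sp", "Su", "Fa"]
fractions = ["FL", "PA"]
replicates = [1, 2, 3]

env_ids = [
    f"{bay}_{season}_{fraction}_{rep:02d}"
    for bay, season, fraction, rep in product(bays, seasons, fractions, replicates)
]

print(len(lab_rows), len(env_ids))
print(env_ids[:3])
```

Generating the expected IDs this way also gives a list to compare against the metadata file, which helps catch missing or mistyped samples.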

QIIME 2 has strict requirements for metadata files. Many analysis failures trace back to a formatting mistake here. Read these once and save yourself hours of debugging.

Required format

✓ First column header must be a recognised ID header such as #SampleID, sample-id, or id
✓ Tab-separated (.tsv), not comma-separated (.csv)
✓ UTF-8 encoding. Use Google Sheets or Excel "Tab Delimited Text" export
✓ Column names must be unique regardless of case
✗ Avoid spaces in column names; use underscores: Sample_ID, not Sample ID
✗ IDs cannot start with #; rows starting with # are treated as comments

Common mistakes

✗ Sample IDs differ between metadata and feature table (extra space, underscore, capital letter)
✗ Forgot to remove the #q2:types row before converting columns with as.numeric() in R
✗ Numeric treatment IDs (1, 2, 3) inferred as numeric; declare the column as categorical in the #q2:types row
✗ Saved as CSV instead of TSV from Excel
✗ Stale metadata rows for samples removed from the feature table

Validation tips

✓ Use Keemei in Google Sheets to validate before importing
✓ In R: setdiff(meta$SampleID, colnames(table)) and setdiff(colnames(table), meta$SampleID) should both return character(0)
✓ Use qiime metadata tabulate to check column types after import
✓ Keep a backup of the original metadata before editing
✓ When in doubt, collect more metadata in the field; you cannot retroactively measure salinity
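Several of the checks above can be automated before the file reaches QIIME 2. The sketch below is a hypothetical pre-flight checker, not a replacement for Keemei or QIIME 2's own validation. It accepts only the three ID headers named earlier, and it takes the feature table's sample IDs as a plain list.

```python
import csv
import io

# ID headers named in the format rules; QIIME 2 may accept others.
ID_HEADERS = {"#SampleID", "sample-id", "id"}


def check_metadata(tsv_text, feature_table_ids):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return ["file is empty"]

    header = rows[0]
    if len(header) == 1 and "," in header[0]:
        problems.append("header has no tabs but contains commas: saved as CSV?")
    elif header[0] not in ID_HEADERS:
        problems.append(f"unexpected first column header: {header[0]!r}")
    lowered = [name.lower() for name in header]
    if len(set(lowered)) != len(lowered):
        problems.append("column names are not unique (ignoring case)")
    for name in header[1:]:
        if " " in name:
            problems.append(f"column name contains a space: {name!r}")

    ids = []
    for row in rows[1:]:
        if not row or row[0].startswith("#"):
            continue  # blank lines, comments, and the #q2:types row
        if row[0] != row[0].strip():
            problems.append(f"sample ID has surrounding whitespace: {row[0]!r}")
        ids.append(row[0])
    if len(set(ids)) != len(ids):
        problems.append("duplicate sample IDs")

    table_ids = set(feature_table_ids)
    for sample_id in sorted(table_ids - set(ids)):
        problems.append(f"in feature table but not in metadata: {sample_id}")
    for sample_id in sorted(set(ids) - table_ids):
        problems.append(f"in metadata but not in feature table (stale row?): {sample_id}")
    return problems


GOOD = (
    "#SampleID\tTreatment\tpH\n"
    "#q2:types\tcategorical\tnumeric\n"
    "S1a\tNC\t6.8\n"
    "S1b\tNC\t6.7\n"
)
print(check_metadata(GOOD, ["S1a", "S1b"]))
```

Running the check in both directions (table against metadata and metadata against table) catches missing samples as well as stale rows.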

The #q2:types row

#SampleID  Treatment  pH
#q2:types  categorical  numeric
S1a        NC          6.8
S1b        NC          6.7

This row tells QIIME 2 the column type explicitly. Always filter it out before calling as.numeric() in R.
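The same filtering step applies in any language that reads the file directly. As a sketch, the hypothetical `load_metadata` helper below reads the example above, sets the #q2:types row aside, and uses it to convert only the columns declared numeric.

```python
import csv
import io

EXAMPLE = (
    "#SampleID\tTreatment\tpH\n"
    "#q2:types\tcategorical\tnumeric\n"
    "S1a\tNC\t6.8\n"
    "S1b\tNC\t6.7\n"
)


def load_metadata(tsv_text):
    """Return (records, types): data rows as dicts, plus declared column types."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    types = {}
    records = []
    for row in reader:
        if not row:
            continue
        if row[0] == "#q2:types":
            # Keep the declared types, but do not treat this row as a sample.
            types = dict(zip(header[1:], row[1:]))
            continue
        if row[0].startswith("#"):
            continue  # other comment rows
        records.append(dict(zip(header, row)))

    # Convert only the columns declared numeric; leave empty cells as they are.
    for record in records:
        for column, kind in types.items():
            if kind == "numeric" and record[column] != "":
                record[column] = float(record[column])
    return records, types


records, types = load_metadata(EXAMPLE)
print(types)
print(records)
```

If the types row were left in the data, the pH column would contain the string "numeric", and any numeric conversion of that column would fail or produce missing values.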