What is a Metadata File — and Why Does Your Microbiome Analysis Depend on It?
🧬 𝐷𝑎𝑦 75 𝑜𝑓 𝐷𝑎𝑖𝑙𝑦 𝐵𝑖𝑜𝑖𝑛𝑓𝑜𝑟𝑚𝑎𝑡𝑖𝑐𝑠 𝑓𝑟𝑜𝑚 𝐽𝑜𝑗𝑦’𝑠
You have run your 16S samples through QIIME 2. You have a feature table with thousands of ASVs. You have a taxonomy file. You have alpha diversity vectors.
Now what?
You run qiime diversity core-metrics-phylogenetic and it asks for a metadata file. You open a blank spreadsheet and wonder: what exactly goes in this? Why does every tutorial spend three paragraphs warning you about it? Why does the analysis crash if the column names look slightly wrong?
This post explains what metadata actually is, why it is the most important file in your entire analysis, how to build one from scratch, and how it directly drives the diversity figures you are trying to make.
What is a metadata file?
A metadata file is simply a table that links each of your sample IDs to information about that sample — what it is, where it came from, what treatment it received, when it was collected.
Think of it this way. Your sequencing machine gives you 36 FASTQ files. Each file is just a pile of DNA reads. On its own, the machine has no idea whether sample S01 came from a nitrogen-fertilized pot or an untreated control, whether it was collected in spring or fall, or whether it came from Chesapeake Bay or Delaware Bay. That context lives entirely in the metadata file — and without it, every diversity comparison, every statistical test, and every figure you want to make is impossible.
The metadata file is not a supplementary file. It is the file that connects biology to data.
The minimum viable metadata file
Here is the simplest possible metadata file, for a soil amendment experiment with three treatments: a negative control (NC), NPK fertiliser (NPK), and a Bacillus-based bioinoculant (nOB9).
#SampleID Treatment Replicate
S1a NC 1
S1b NC 2
S1c NC 3
S2a NPK 1
S2b NPK 2
S2c NPK 3
S3a nOB9 1
S3b nOB9 2
S3c nOB9 3
That is it. Two columns beyond the sample IDs. This is enough to:
- Group samples by treatment in alpha diversity boxplots
- Test whether treatments differ in community composition (PERMANOVA)
- Colour a PCoA plot by treatment
- Generate faceted taxonomy barplots
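If you would rather generate this table than type it, a few lines of Python will do. The sketch below is only an illustration: `build_metadata` is a hypothetical helper, and the ID scheme (S1a, S1b, ...) is copied from the table above rather than required by QIIME 2.

```python
import csv
import io
import string

def build_metadata(treatments, replicates=3):
    """Return tab-separated metadata text with one row per replicate."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["#SampleID", "Treatment", "Replicate"])
    for t_index, treatment in enumerate(treatments, start=1):
        for rep in range(1, replicates + 1):
            # S1a, S1b, S1c, S2a, ... (naming scheme taken from the example table)
            sample_id = f"S{t_index}{string.ascii_lowercase[rep - 1]}"
            writer.writerow([sample_id, treatment, rep])
    return out.getvalue()

metadata_text = build_metadata(["NC", "NPK", "nOB9"])
print(metadata_text)
```

Writing `metadata_text` to a file gives you a tab-separated table with the same nine rows as the example.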
The rules you must follow:
First, the first column must be the sample ID column, and its header must be one that QIIME 2 recognises, such as #SampleID, sample-id, id, or sampleid. If you use a header it does not recognise, QIIME 2 will refuse to load the file, and the error message can be confusing the first time you see it.
Second, sample IDs must match exactly between your metadata and your feature table. IDs are case-sensitive, so S1a and s1a are different samples, and a stray space or underscore is enough to break the match. Depending on the tool, a mismatch either stops the run with an error or quietly leaves samples out of your figures.
Third, save the file as TSV (tab-separated values), not CSV. Excel exports TSV if you choose “Tab Delimited Text” as the format. Google Sheets exports TSV if you choose “Tab-separated values (.tsv)”.
Fourth, avoid the #q2:types row unless you need it. QIIME 2 uses a second header row to specify column types (categorical vs numeric). If you include it, it must be the second row exactly — and it is the row that confuses R users who forget to filter it out before calling as.numeric().
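Most of these rules can be checked mechanically before QIIME 2 ever sees the file. Here is a minimal Python sketch using only the standard library. `check_metadata` is a hypothetical helper, not part of QIIME 2, and its header list covers only the ID headers named in this post, so treat it as a pre-flight check rather than a replacement for QIIME 2's own validation.

```python
import csv
import io

# ID headers mentioned in this post; QIIME 2 itself recognises more than these.
ID_HEADERS = {"#SampleID", "sample-id", "id", "sampleid"}

def check_metadata(text):
    """Return a list of problems found in tab-separated metadata text."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows or not rows[0]:
        return ["file is empty or starts with a blank line"]
    problems = []
    header = rows[0]
    if header[0] not in ID_HEADERS:
        problems.append(f"unexpected ID header: {header[0]!r}")
    if len(header) < 2:
        problems.append("only one column found; is the file really tab-separated?")
    seen = set()
    for line_no, row in enumerate(rows[1:], start=2):
        if not row or row[0].startswith("#q2:"):
            continue  # skip blank lines and the optional #q2:types row
        sample_id = row[0]
        if sample_id != sample_id.strip():
            problems.append(f"line {line_no}: ID has leading or trailing spaces")
        if sample_id in seen:
            problems.append(f"line {line_no}: duplicate ID {sample_id!r}")
        seen.add(sample_id)
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} fields, got {len(row)}"
            )
    return problems

good = "#SampleID\tTreatment\tReplicate\nS1a\tNC\t1\nS1b\tNC\t2\n"
bad = "SampleName,Treatment\nS1a,NC\nS1a,NC\n"  # CSV, wrong header, repeated row

print(check_metadata(good))
for problem in check_metadata(bad):
    print(problem)
```

The Keemei validator linked at the end of this post does a more thorough job inside Google Sheets.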
A slightly richer metadata file
In practice, you will want more columns. Here is the same experiment with a few extra variables that make the analysis more interesting:
#SampleID Treatment Replicate Season SoilpH MoisturePercent
S1a NC 1 Spring 6.8 32.1
S1b NC 2 Spring 6.7 31.4
S1c NC 3 Spring 6.9 33.0
S2a NPK 1 Spring 7.1 29.8
S2b NPK 2 Spring 7.0 30.2
S2c NPK 3 Spring 7.3 28.9
S3a nOB9 1 Spring 6.8 34.5
S3b nOB9 2 Spring 6.8 35.1
S3c nOB9 3 Spring 6.9 33.8
Now you can ask: does soil pH correlate with community diversity? Is moisture a better predictor of community structure than treatment? These are the questions that turn a simple experiment into a paper.
From metadata to diversity — how it actually works
Alpha diversity: within-sample diversity
Alpha diversity asks: how diverse is each individual sample? Three common metrics:
- Shannon index — accounts for richness (how many taxa) and evenness (how equally abundant they are). A sample with 100 equally abundant taxa scores higher than a sample with 100 taxa where 95% of reads come from one organism.
- Observed features — simply counts the unique ASVs detected in each sample after rarefaction. Easy to interpret, sensitive to sequencing depth.
- Faith’s phylogenetic diversity (PD) — sums the branch lengths of the phylogenetic tree covered by the detected ASVs. A sample with many distantly related organisms scores higher than one with many closely related organisms.
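To make the first two metrics concrete, here is a small Python sketch that computes them from per-taxon read counts. The two samples are invented to mirror the example above (100 equally abundant taxa versus 100 taxa with 95% of reads in one). The natural log is an assumption: the log base varies between tools, which rescales the values but not the ranking. Faith's PD is left out because it needs a phylogenetic tree.

```python
import math

def shannon(counts):
    """Shannon index (natural log) from a list of per-taxon read counts."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)

def observed_features(counts):
    """Number of taxa with at least one read."""
    return sum(1 for c in counts if c > 0)

# Hypothetical samples, 100 taxa each.
even = [100] * 100              # every taxon equally abundant
dominated = [1881] + [1] * 99   # one taxon holds 1881 / 1980 = 95% of reads

for name, counts in [("even", even), ("dominated", dominated)]:
    print(name, observed_features(counts), round(shannon(counts), 3))
```

Both samples have the same number of observed features, yet the even sample has a much higher Shannon index. That gap is the evenness component at work.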
The metadata file enters here in one key step: after calculating alpha diversity values per sample, you join them to the metadata to ask whether diversity differs between your groups.
In QIIME 2:
qiime diversity alpha-group-significance \
--i-alpha-diversity core-metrics-results/shannon_vector.qza \
--m-metadata-file my_metadata.txt \
--o-visualization shannon_significance.qzv
In R:
library(dplyr)    # left_join() and %>%
library(ggplot2)  # plotting

# Assumes meta and shannon are data frames that share a SampleID column
alpha_df <- meta %>%
  left_join(shannon, by = "SampleID")

ggplot(alpha_df, aes(x = Treatment, y = Shannon, fill = Treatment)) +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  geom_jitter(width = 0.1, size = 2) +
  theme_bw() +
  labs(title = "Alpha diversity by treatment", x = NULL, y = "Shannon index")
In our Bacillus experiment, if the nOB9 inoculant is promoting microbial diversity in the soil, you would expect the nOB9 group to show higher Shannon values than NC. The metadata file is what connects Shannon values (which are just numbers attached to sample IDs) to the treatment grouping that makes that comparison meaningful.
Beta diversity: between-sample community differences
Beta diversity asks: how different are communities between samples? The most common metrics:
- Bray-Curtis dissimilarity — compares the relative abundances of taxa between two samples. Two samples sharing all the same taxa at the same abundances have Bray-Curtis = 0. Two samples sharing no taxa have Bray-Curtis = 1.
- Unweighted UniFrac — a phylogenetic metric that asks whether the same lineages are present or absent. Does not account for abundance differences.
- Weighted UniFrac — like unweighted UniFrac but weighted by relative abundance. Accounts for both presence/absence and how dominant each lineage is.
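Bray-Curtis is simple enough to compute by hand, which makes the 0 and 1 endpoints easy to verify. The sketch below uses invented counts and assumes every sample has already been rarefied to the same depth (100 reads here). The UniFrac metrics are not shown because they need a phylogenetic tree.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples given as {taxon: count}."""
    taxa = set(a) | set(b)
    difference = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    total = sum(a.values()) + sum(b.values())
    return difference / total

# Hypothetical samples, 100 reads each.
s1 = {"ASV1": 50, "ASV2": 30, "ASV3": 20}
s2 = {"ASV1": 50, "ASV2": 30, "ASV3": 20}   # identical to s1
s3 = {"ASV4": 60, "ASV5": 40}               # shares no taxa with s1
s4 = {"ASV1": 20, "ASV2": 30, "ASV4": 50}   # partial overlap with s1

print(bray_curtis(s1, s2))  # 0.0
print(bray_curtis(s1, s3))  # 1.0
print(bray_curtis(s1, s4))  # 0.5
```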
Beta diversity is visualised with PCoA (Principal Coordinates Analysis). PCoA takes a pairwise distance matrix and projects samples into 2D or 3D space so that similar samples cluster together and dissimilar samples are spread apart.
The metadata file drives PCoA in two ways: it provides the colour coding for the plot (e.g., colour points by treatment), and it provides the grouping variable for PERMANOVA, which tests whether the between-group distances are significantly larger than within-group distances.
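library(vegan)  # provides adonis2()

# PERMANOVA: does treatment predict community composition?
adon <- adonis2(dist_obj ~ Treatment, data = meta, permutations = 999)
A significant PERMANOVA (p < 0.05) indicates that community composition differs between treatments. R² tells you what fraction of the community variation is explained by treatment. Because PERMANOVA also responds to differences in within-group spread, check group dispersions (for example with vegan's betadisper()) before reading a significant result as a shift in composition.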
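If PERMANOVA feels like a black box, it helps to see how little is inside. The Python sketch below is a from-scratch illustration of a one-way test, not vegan's implementation. It partitions squared distances into within-group and among-group parts, forms a pseudo-F ratio, and gets a p-value by shuffling the group labels. The distance matrix is artificial: 0.2 within treatments and 0.8 between them.

```python
import random

def pseudo_f(dist, groups):
    """Return (pseudo-F, R2) for a one-way design on a square distance matrix."""
    n = len(groups)
    labels = sorted(set(groups))
    # Total sum of squares: squared distances over all pairs, divided by N.
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares: the same calculation inside each group.
    ss_within = 0.0
    for label in labels:
        idx = [i for i in range(n) if groups[i] == label]
        pairs = sum(dist[i][j] ** 2 for k, i in enumerate(idx) for j in idx[k + 1:])
        ss_within += pairs / len(idx)
    ss_among = ss_total - ss_within
    f = (ss_among / (len(labels) - 1)) / (ss_within / (n - len(labels)))
    return f, ss_among / ss_total

def permanova(dist, groups, permutations=999, seed=0):
    """Permutation p-value: how often do shuffled labels reach the observed F?"""
    rng = random.Random(seed)
    f_obs, r2 = pseudo_f(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled)[0] >= f_obs - 1e-12:
            hits += 1
    return f_obs, r2, (hits + 1) / (permutations + 1)

# Artificial data: 3 treatments x 3 replicates, tight within, far apart between.
groups = ["NC"] * 3 + ["NPK"] * 3 + ["nOB9"] * 3
dist = [[0.0 if i == j else (0.2 if groups[i] == groups[j] else 0.8)
         for j in range(9)] for i in range(9)]

f_obs, r2, p_value = permanova(dist, groups)
print(round(f_obs, 2), round(r2, 3), p_value)
```

The group labels come from the metadata, and shuffling them is the whole test. That is why a wrong Treatment column produces a wrong p-value without any error message.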
Real environmental metadata: what it looks like in practice
The experiment above is clean and controlled — same location, same time, three treatments. Environmental studies are messier and richer. Here is what a metadata file looks like for an estuarine microbiome study across two bays, three seasons, and two size fractions:
#SampleID Bay Season Fraction Salinity Temperature pH Depth_m Month
CP_Sp_FL_01 Chesapeake Spring Free_Living 8.2 14.3 7.8 0.5 March
CP_Sp_PA_01 Chesapeake Spring Particle_Attached 8.2 14.3 7.8 0.5 March
CP_Su_FL_01 Chesapeake Summer Free_Living 12.7 26.8 7.6 0.5 July
CP_Su_PA_01 Chesapeake Summer Particle_Attached 12.7 26.8 7.6 0.5 July
CP_Fa_FL_01 Chesapeake Fall Free_Living 15.1 18.2 7.7 0.5 October
DE_Sp_FL_01 Delaware Spring Free_Living 24.3 13.1 7.9 0.5 March
DE_Sp_PA_01 Delaware Spring Particle_Attached 24.3 13.1 7.9 0.5 March
DE_Su_FL_01 Delaware Summer Free_Living 28.6 25.4 7.5 0.5 July
DE_Su_PA_01 Delaware Summer Particle_Attached 28.6 25.4 7.5 0.5 July
DE_Fa_FL_01 Delaware Fall Free_Living 30.2 17.8 7.8 0.5 October
Notice several things:
Categorical columns (Bay, Season, Fraction, Month) define the groupings for boxplots, PERMANOVA, and faceted barplots.
Numeric columns (Salinity, Temperature, pH, Depth_m) allow correlation analyses — you can ask “does salinity predict Shannon diversity?” or “which environmental variable best explains community turnover?”
Sample IDs encode information — CP_Sp_FL_01 tells you immediately that this is Chesapeake Bay, Spring, Free-Living, replicate 1. This is good practice: meaningful IDs make it easier to spot errors when samples appear in unexpected places.
The same sample location contributes multiple rows — the Spring Chesapeake station has both a FL (free-living, 0.2–0.8 µm filter) and a PA (particle-attached, >0.8 µm filter) sample. These are separate samples with separate DNA extractions and separate rows in the metadata.
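Structured IDs also let you cross-check the metadata against itself. The sketch below decodes an ID such as CP_Sp_FL_01 and compares the result with the Bay, Season, and Fraction columns. The code tables are inferred from the example rows above, `find_conflicts` is a hypothetical helper, and the second test row contains a deliberate error.

```python
BAY = {"CP": "Chesapeake", "DE": "Delaware"}
SEASON = {"Sp": "Spring", "Su": "Summer", "Fa": "Fall"}
FRACTION = {"FL": "Free_Living", "PA": "Particle_Attached"}

def decode_sample_id(sample_id):
    """Split an ID like CP_Sp_FL_01 into the fields it encodes."""
    bay, season, fraction, replicate = sample_id.split("_")
    return {
        "Bay": BAY[bay],
        "Season": SEASON[season],
        "Fraction": FRACTION[fraction],
        "Replicate": int(replicate),
    }

def find_conflicts(rows):
    """Return (sample ID, column, value in table, value implied by ID) tuples."""
    conflicts = []
    for row in rows:
        decoded = decode_sample_id(row["SampleID"])
        for column in ("Bay", "Season", "Fraction"):
            if row[column] != decoded[column]:
                conflicts.append(
                    (row["SampleID"], column, row[column], decoded[column])
                )
    return conflicts

rows = [
    {"SampleID": "CP_Sp_FL_01", "Bay": "Chesapeake",
     "Season": "Spring", "Fraction": "Free_Living"},
    # Deliberate error: the ID says Delaware, the Bay column says Chesapeake.
    {"SampleID": "DE_Su_PA_01", "Bay": "Chesapeake",
     "Season": "Summer", "Fraction": "Particle_Attached"},
]

for conflict in find_conflicts(rows):
    print(conflict)
```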
How this richer metadata enables richer analysis
With this metadata you can ask questions that the simple soil experiment could not:
- Does the free-living community have higher Shannon diversity than the particle-attached community? (alpha diversity, grouped by Fraction)
- Are Chesapeake and Delaware Bay communities more similar within a season or within a bay? (beta diversity PERMANOVA with both Bay and Season)
- Does salinity predict community composition better than temperature? (distance-based redundancy analysis, dbrda(dist_obj ~ Salinity + Temperature, data = meta))
- Which ASVs are specifically associated with high-salinity samples? (indicator species analysis on salinity groups, or a correlation test with Salinity as a continuous predictor)
- Does the free-living to particle-attached transition look different in the two bays? (two-way PERMANOVA: adonis2(dist_obj ~ Fraction * Bay, data = meta))
None of these analyses is possible without the metadata file. The sequences contain the raw biological signal; the metadata tells you what that signal means.
The difference between minimal and environmental metadata
Here is a side-by-side summary:
| Feature | Lab experiment metadata | Environmental study metadata |
|---|---|---|
| Sample IDs | Short, simple (S1a) | Encode location/time/fraction |
| Key grouping variables | Treatment (2–5 levels) | Bay, Season, Fraction, Site |
| Continuous variables | Rarely needed | Salinity, Temperature, pH, dissolved oxygen (DO) |
| Replication | Biological replicates | Spatial and temporal replicates |
| Typical rows | 9–30 | 30–300+ |
| Main comparisons | Treatment vs. control | Multi-factor, gradient analyses |
| PERMANOVA structure | ~ Treatment | ~ Bay * Season + Fraction |
Common beginner mistakes with metadata
Wrong column separator. QIIME 2 needs TSV, not CSV. If you save from Excel as “Comma Separated Values” instead of “Tab Delimited Text,” QIIME 2 finds no tabs, reads each row as a single field, and will typically reject the file.
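If you have already saved a CSV, you do not need to go back to Excel. The sketch below converts CSV text to TSV with Python's csv module, which handles quoted cells that contain commas. `csv_to_tsv` and the Notes column are hypothetical, added only for illustration.

```python
import csv
import io

def csv_to_tsv(csv_text):
    """Convert comma-separated text to tab-separated text."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()

# The quoted cell contains a comma, which naive splitting on "," would break.
csv_text = '#SampleID,Treatment,Notes\nS1a,NC,"sandy, dry"\n'
tsv_text = csv_to_tsv(csv_text)
print(tsv_text)
```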
Spaces in column names. A column named Sample ID causes problems in R, because meta$Sample ID is not valid syntax. Use underscores instead: Sample_ID.
Sample IDs that do not match the feature table. The most common silent failure. If your feature table has S1a but your metadata has S-1a or s1a, an R join fills every diversity metric for that sample with NA. Always check both directions: setdiff(meta$SampleID, colnames(feature_table)) lists IDs found only in the metadata, and setdiff(colnames(feature_table), meta$SampleID) lists IDs found only in the feature table.
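The same check is easy to write in Python, and it can go one step further by pointing out near misses. In the sketch below, `compare_ids` is a hypothetical helper. It treats two IDs as a likely typo pair when they become identical after lowercasing and stripping punctuation.

```python
def compare_ids(metadata_ids, table_ids):
    """Return (only in metadata, only in table, likely typo pairs)."""
    meta, table = set(metadata_ids), set(table_ids)
    only_meta = sorted(meta - table)
    only_table = sorted(table - meta)

    def normalise(sample_id):
        # Lowercase and keep only letters and digits: "S-1b" -> "s1b".
        return "".join(ch for ch in sample_id.lower() if ch.isalnum())

    table_by_norm = {normalise(t): t for t in only_table}
    near_misses = [
        (m, table_by_norm[normalise(m)])
        for m in only_meta
        if normalise(m) in table_by_norm
    ]
    return only_meta, only_table, near_misses

metadata_ids = ["S1a", "S-1b", "s1c"]
table_ids = ["S1a", "S1b", "S1c"]

only_meta, only_table, near_misses = compare_ids(metadata_ids, table_ids)
print(only_meta)     # ['S-1b', 's1c']
print(only_table)    # ['S1b', 'S1c']
print(near_misses)   # [('S-1b', 'S1b'), ('s1c', 'S1c')]
```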
Numeric codes in a categorical column. If your Treatment column has values 1, 2, 3, QIIME 2 will infer it as numeric and treat it as a continuous variable. Either rename the values (Control, Low, High) or use the #q2:types row to declare it categorical.
Forgetting to update metadata after removing samples. If you drop a sample that failed QC from your feature table, remove it from the metadata too. Stale rows in the metadata will usually not break the analysis, but after a join they carry NA values, which can show up as dropped or empty groups in boxplots and cause confusion.
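You can spot columns at risk before running anything. The sketch below flags any column whose values all parse as numbers. It only approximates type inference and is not QIIME 2's actual parser: Python's float() accepts some strings QIIME 2 may treat differently. `looks_numeric` and the Treatment_coded column are hypothetical.

```python
import math

def looks_numeric(values):
    """True if every non-empty value parses as a finite number."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return False
    for value in non_empty:
        try:
            number = float(value)
        except ValueError:
            return False
        if not math.isfinite(number):
            return False
    return True

columns = {
    "Treatment_coded": ["1", "2", "3"],    # categories stored as numbers
    "Treatment": ["NC", "NPK", "nOB9"],
    "SoilpH": ["6.8", "7.1", "6.9"],
}

for name, values in columns.items():
    kind = "numeric" if looks_numeric(values) else "categorical"
    print(f"{name}: would likely be read as {kind}")
```

A genuinely continuous column such as SoilpH should be flagged as numeric. The warning sign is a grouping variable that is flagged the same way.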
Summary
The metadata file is a simple TSV table with one row per sample and one column per variable. But every statistical comparison, every diversity figure, every plot in your paper ultimately flows through this file. The sequences tell you who is there. The metadata tells you what it means.
For a basic lab experiment, the metadata file might be just two columns beyond the sample IDs. For a multi-site, multi-season environmental study, it can have twenty columns of physical, chemical, and biological variables — and each one opens a new analytical door.
When in doubt, collect more metadata in the field than you think you will need. You cannot go back and remeasure salinity on a sample from two years ago, but you can always choose not to use a column you collected.
QIIME 2 metadata documentation: docs.qiime2.org. Keemei validator for Google Sheets: keemei.qiime2.org.