Pangenome Analysis with Anvi’o: From HPC to Publication-Ready Figures

🧬 Day 64 of Daily Bioinformatics from Jojy’s Desk

This post walks through a complete pangenome analysis of 15 bacterial genomes using Anvi’o — from raw FASTA files to publication-ready figures. Heavy computation runs on HPC; visualisation and figure generation happen on your laptop. No prior pangenomics experience needed.

What Is a Pangenome — And Why Should You Care?

Imagine you have 15 strains of the same bacterial species isolated from different environments. They’re all the same species, but are they really the same? Do they carry the same genes? Are some strains missing key metabolic pathways? Does your environmental isolate have unique genes not found anywhere else?

A pangenome answers all of these questions at once.

The word “pan” comes from the Greek for “all.” A pangenome represents the complete set of genes across all the genomes in your dataset, organised into three categories:

Category	Definition	Example
Core genome	Genes present in all genomes	Housekeeping genes, ribosomal proteins
Accessory genome	Genes present in some but not all genomes	Metabolic flexibility, niche adaptation
Singleton / unique	Genes present in only one genome	Novel functions, horizontal gene transfer

💡 In plain terms: The core genome is what makes all your strains the same species. The accessory genome is what makes each strain different. And singletons are what make one strain truly unique.
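To make the three categories concrete, here is a minimal shell sketch that labels a gene cluster from the number of genomes it occurs in. The function name classify_cluster and the toy counts are illustrative assumptions and are not part of Anvi’o.

```shell
# classify_cluster N_WITH_CLUSTER N_TOTAL
# Prints CORE, ACCESSORY or SINGLETON for a single gene cluster.
classify_cluster() {
    n_with=$1
    n_total=$2
    if [ "$n_with" -eq "$n_total" ]; then
        echo "CORE"            # present in every genome
    elif [ "$n_with" -eq 1 ]; then
        echo "SINGLETON"       # present in exactly one genome
    else
        echo "ACCESSORY"       # present in some, but not all, genomes
    fi
}

# Toy example for a 15-genome dataset (hypothetical counts)
for n in 15 7 1; do
    printf '%s of 15 genomes -> %s\n' "$n" "$(classify_cluster "$n" 15)"
done
```

The same rule (all genomes, exactly one genome, anything in between) is what the R code later in this post relies on when it groups clusters by category.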

In this workflow, we analyse 15 genomes from Winogradskyella, a genus of marine bacteria. Our goal: understand what genes are shared, what’s variable, and what’s unique to our environmental isolate PC_D3_3.

The Big Picture: Hybrid Workflow

One of the challenges of pangenome analysis is that the computational steps are heavy (hours of CPU time, gigabytes of RAM) while the visualisation steps require an interactive graphical interface — something most HPCs don’t support well.

The solution is a hybrid strategy:

HPC Cluster                          Your Laptop
─────────────────────────────────    ──────────────────────────────
1. Reformat FASTA headers            7. Interactive visualisation
2. Create contigs databases    ──▶   8. Export SVG from Anvi'o
3. Build genome storage              9. R figures (ANI heatmap,
4. Run pan-genome analysis              gene cluster bar chart,
5. Compute ANI similarity               presence/absence heatmap)
6. Export summary tables       ──▶   10. Assemble final figure

This is not a limitation — it’s actually a cleaner workflow. Heavy computation belongs on HPC; interactive design work belongs on your laptop.

What You’ll Need

On HPC:

An active HPC account with at least 16 CPU cores and ~32 GB RAM available
Conda or Miniconda installed (see Day 2 of our HPC series)
Internet access to download Anvi’o

On your laptop:

Conda or Miniconda installed
R (version ≥ 4.0)
R packages: ggplot2, dplyr, tidyr, pheatmap
Enough disk space for the output files (~1–5 GB depending on dataset size)

Your data:

FASTA files for each genome (one file per genome, .fa or .fasta extension)
All genome files in one directory

Step 1: Install Anvi’o

Anvi’o needs to be installed on both your HPC and your laptop. The HPC installation handles computation; the laptop installation handles visualisation.

# Create a dedicated conda environment — don't install into your base environment
conda create -n anvio-9 python=3.10 -y
conda activate anvio-9

# Install Anvi'o from conda channels
conda install -c conda-forge -c bioconda anvio -y

This takes 5–15 minutes. Grab a coffee.

Once it finishes, verify the installation works:

# --suite mini runs a quick subset of the tests
anvi-self-test --suite mini

You should see a series of tests run and pass. If you see errors, check the official Anvi’o installation guide — it has excellent troubleshooting steps.

💡 What version are we using? Anvi’o 9. If a newer version is available, it will generally work the same way, since command names and core concepts are largely stable between versions.

Step 2: Organise Your Genome Files

Create a clean project directory and put all your genome FASTA files in one place:

mkdir -p anvio_project/genomes
cd anvio_project/

# Copy your FASTA files in
# Your structure should look like this:
ls genomes/
# PC_D3_3.fa
# GCF_003386165.fa
# GCF_013404085.fa
# GCF_002025905.fa
# ... (15 genomes total)

Naming tip: Use simple, descriptive names. Avoid spaces, special characters, and very long names. PC_D3_3.fa is good. My Genome (environmental isolate, 2024) — FINAL v3.fasta is not.
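A quick way to catch awkward names before they cause trouble is to scan the genomes directory. The sketch below assumes that only letters, digits and underscores are safe in a genome name (Anvi’o is strict about names, but check its documentation for the exact rules); list_bad_names is a hypothetical helper, not an Anvi’o command.

```shell
# list_bad_names DIR
# Prints the stem of every .fa file in DIR whose name contains anything
# other than letters, digits and underscores.
list_bad_names() {
    for path in "$1"/*.fa; do
        [ -e "$path" ] || continue      # nothing matched the pattern
        stem=$(basename "$path" .fa)
        case "$stem" in
            *[!A-Za-z0-9_]*) printf '%s\n' "$stem" ;;
        esac
    done
}
```

Running list_bad_names genomes/ should print nothing for a clean directory; any name it prints is worth renaming before Step 3.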

Step 3: Reformat FASTA Headers

This is a step beginners often skip — and then spend an hour debugging mysterious errors.

Anvi’o requires that FASTA headers are simple and contain no special characters (no spaces, pipes, colons, or parentheses). Reference genomes downloaded from NCBI often have complex headers like:

>NZ_CP023570.1 Winogradskyella sp. PC-D3-3 chromosome, complete genome

Anvi’o will reject this. The anvi-script-reformat-fasta command fixes it automatically:

mkdir -p genomes_reformatted/

# Note: this loop only matches .fa files; rename .fasta files or adjust the pattern
for genome in genomes/*.fa
do
    # Extract the filename without extension for naming
    name=$(basename "$genome" .fa)
    echo "Reformatting: $name"

    anvi-script-reformat-fasta \
        "$genome" \
        -o "genomes_reformatted/${name}.fa" \
        --simplify-names \
        --report-file "genomes_reformatted/${name}_reformat_report.txt"
done

echo "All genomes reformatted."

What --simplify-names does: it renames each contig/sequence to a short, clean identifier such as c_000000000001, c_000000000002, etc. The --report-file option writes a mapping between the new and the original names, so you can always trace back.
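Tracing a simplified name back to its original header is then a one-line lookup. This sketch assumes the report is a two-column, tab-separated file with the new name first and the original header second; original_name is a hypothetical helper, and the second toy header is made up for illustration.

```shell
# original_name REPORT NEW_NAME
# Prints the original FASTA header recorded for NEW_NAME in a reformat report
# (assumed layout: new_name <TAB> original_header).
original_name() {
    awk -F '\t' -v id="$2" '$1 == id { print $2 }' "$1"
}

# Toy report with two sequences
report=$(mktemp)
printf 'c_000000000001\tNZ_CP023570.1 Winogradskyella sp. PC-D3-3 chromosome, complete genome\n' > "$report"
printf 'c_000000000002\texample_contig_2 hypothetical plasmid\n' >> "$report"

original_name "$report" c_000000000001
rm -f "$report"
```

If your report turns out to have a different column order, swap $1 and $2 in the awk program.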

After this step:

ls genomes_reformatted/
# PC_D3_3.fa
# GCF_003386165.fa
# ...
# PC_D3_3_reformat_report.txt
# GCF_003386165_reformat_report.txt
# ...

Step 4: Create Contigs Databases

This is Anvi’o’s foundation. Every genome needs its own contigs database: a file that stores the genome sequence, the predicted genes and, once you run annotation programs, functional annotations.

mkdir -p contigs_db/

for genome in genomes_reformatted/*.fa
do
    name=$(basename "$genome" .fa)
    echo "Creating contigs database for: $name"

    anvi-gen-contigs-database \
        -f "$genome" \
        -o "contigs_db/${name}.db" \
        -n "$name" \
        --num-threads 4
done

echo "All contigs databases created."

What happens under the hood:

Prodigal identifies and predicts open reading frames (ORFs) — the genes
Gene sequences are stored in the database
k-mer frequencies are calculated for each contig
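The k-mer step is simply a sliding-window count. Here is a minimal sketch of the idea for a toy sequence, using k = 4 (tetranucleotides, which we understand to be Anvi’o’s default k-mer size); kmer_counts is a hypothetical helper and not how Anvi’o implements it internally.

```shell
# kmer_counts SEQUENCE K
# Prints each k-mer found in SEQUENCE with its count, sorted alphabetically.
kmer_counts() {
    awk -v s="$1" -v k="$2" 'BEGIN {
        n = length(s) - k + 1                 # number of windows of width k
        for (i = 1; i <= n; i++) count[substr(s, i, k)]++
        for (kmer in count) print kmer, count[kmer]
    }' | sort
}

kmer_counts ACGTACGT 4
```

For the toy sequence ACGTACGT the five windows are ACGT, CGTA, GTAC, TACG and ACGT again, so ACGT is counted twice and the others once.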

This step takes a few minutes per genome. For 15 genomes, expect 20–45 minutes depending on genome size.

Verify your databases were created:

ls contigs_db/
# PC_D3_3.db
# GCF_003386165.db
# ...

# Check one database
anvi-db-info contigs_db/PC_D3_3.db

Step 5: Run HMM Profiles (Optional but Recommended)

Before building the pangenome, it’s good practice to annotate your genomes with hidden Markov model (HMM) profiles. These identify universal single-copy genes — useful for checking genome completeness and for phylogenetics.

for db in contigs_db/*.db
do
    echo "Running HMMs for: $db"
    anvi-run-hmms -c "$db" --num-threads 4
done

This step uses HMMER to search for bacterial single-copy genes. It takes a few minutes per genome but is worth running: the hits feed the completeness and redundancy estimates reported by anvi-estimate-genome-completeness, which help you identify low-quality genomes before they distort your pangenome.

Step 6: Create the Genome Storage Database

Now we bring all 15 contigs databases together into a single genome storage database. This is the master file that tells Anvi’o about all your genomes.

First, create a tab-separated file listing your genomes. This is called the external genomes file:

# Create the external genomes file
echo -e "name\tcontigs_db_path" > external-genomes.txt

for db in contigs_db/*.db
do
    name=$(basename "$db" .db)
    echo -e "${name}\tcontigs_db/${name}.db" >> external-genomes.txt
done

# Check it looks right
cat external-genomes.txt

It should look like:

name              contigs_db_path
PC_D3_3           contigs_db/PC_D3_3.db
GCF_003386165     contigs_db/GCF_003386165.db
GCF_013404085     contigs_db/GCF_013404085.db
...
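Because a wrong header or a mistyped path in this file only surfaces later as a confusing error, it can help to validate it first. The sketch below is a hypothetical helper (check_external_genomes is not an Anvi’o command); it assumes the two-column layout shown above and that paths are relative to the directory you run it from.

```shell
# check_external_genomes FILE
# Returns non-zero if the header is wrong or a listed database file is missing.
check_external_genomes() {
    file=$1
    status=0
    tab=$(printf '\t')

    header=$(head -n 1 "$file")
    if [ "$header" != "name${tab}contigs_db_path" ]; then
        echo "bad header: $header"
        status=1
    fi

    # Check every data line (everything after the header)
    while IFS=$tab read -r name path; do
        [ -n "$name" ] || continue           # skip empty lines
        if [ ! -f "$path" ]; then
            echo "missing database for $name: $path"
            status=1
        fi
    done <<EOF
$(tail -n +2 "$file")
EOF
    return $status
}
```

A typical use is check_external_genomes external-genomes.txt && echo "looks fine" from inside anvio_project/.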

Now create the genome storage:

anvi-gen-genomes-storage \
    -e external-genomes.txt \
    -o GENOMES.db

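This creates GENOMES.db, a single database describing all 15 genomes. You’ll pass this file to most downstream pangenome commands. If Anvi’o complains about the file name, that is because it expects a genomes storage file to end in -GENOMES.db: use a name such as Winogradskyella-GENOMES.db here and substitute it wherever GENOMES.db appears below.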

Step 7: Run the Pangenome Analysis

This is the computationally heavy step — run this on HPC, not your laptop.

anvi-pan-genome \
    -g GENOMES.db \
    -n Winogradskyella_pan \
    -o PAN \
    --num-threads 16 \
    --minbit 0.5 \
    --mcl-inflation 10 \
    --use-ncbi-blast

Let’s break down what each flag does:

Flag	What it does
`-g GENOMES.db`	The genome storage we just created
`-n Winogradskyella_pan`	Name prefix for all output files
`-o PAN`	Output directory
`--num-threads 16`	Use 16 CPU threads (adjust to what your HPC allocation gives you)
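`--minbit 0.5`	Minimum minbit score for keeping a similarity hit between two proteins (0.5 is the Anvi’o default)
`--mcl-inflation 10`	Controls how finely MCL splits genes into clusters (higher = smaller, tighter clusters; the Anvi’o default is 2, and 10 is suggested for closely related genomes)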
`--use-ncbi-blast`	Use BLAST for protein similarity (more sensitive than the default DIAMOND)
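The minbit filter is easier to reason about with numbers. As we understand the heuristic, the bit score of a hit between proteins A and B is divided by the smaller of the two self-hit bit scores, and hits scoring below the threshold are dropped before clustering. The bit scores below are made up for illustration, and minbit is a hypothetical helper.

```shell
# minbit BITSCORE_AB SELFSCORE_A SELFSCORE_B
# Prints bitscore(A,B) / min(bitscore(A,A), bitscore(B,B)) to two decimals.
minbit() {
    awk -v ab="$1" -v aa="$2" -v bb="$3" 'BEGIN {
        m = (aa < bb) ? aa : bb
        printf "%.2f\n", ab / m
    }'
}

minbit 120 400 300   # weak hit: falls below a 0.5 threshold
minbit 250 400 300   # strong hit: passes a 0.5 threshold
```

With these toy values the first pair scores 0.40 and would be discarded at --minbit 0.5, while the second scores 0.83 and would be kept.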

For a SLURM job script:

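#!/bin/bash
#SBATCH --job-name=pangenome
#SBATCH --output=logs/pangenome_%j.out
#SBATCH --error=logs/pangenome_%j.err
#SBATCH --time=08:00:00
#SBATCH --mem=32G
#SBATCH --cpus-per-task=16

# SLURM opens the log files before this script starts, so create the
# logs/ directory yourself (mkdir -p logs) before running sbatch.

# Make "conda activate" available in a non-interactive batch shell
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate anvio-9

anvi-pan-genome \
    -g GENOMES.db \
    -n Winogradskyella_pan \
    -o PAN \
    --num-threads "$SLURM_CPUS_PER_TASK" \
    --minbit 0.5 \
    --mcl-inflation 10 \
    --use-ncbi-blast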

The output is a pan-genome database:

PAN/
├── Winogradskyella_pan-PAN.db      ← the main pan-genome database
└── Winogradskyella_pan-GENOMES.db  ← a copy of genome info

Step 8: Compute Genome Similarity (ANI)

ANI (Average Nucleotide Identity) measures how similar two genomes are at the DNA sequence level. It’s a widely used metric for delineating bacterial species (roughly ≥ 95% ANI = same species).

anvi-compute-genome-similarity \
    -e external-genomes.txt \
    -o ANI \
    --program pyANI \
    --num-threads 16 \
    --pan-db PAN/Winogradskyella_pan-PAN.db

In a SLURM script:

#!/bin/bash
#SBATCH --job-name=ani
#SBATCH --output=logs/ani_%j.out
#SBATCH --time=04:00:00
#SBATCH --mem=16G
#SBATCH --cpus-per-task=16

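# Make "conda activate" available in a non-interactive batch shell
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate anvio-9

anvi-compute-genome-similarity \
    -e external-genomes.txt \
    -o ANI \
    --program pyANI \
    --num-threads "$SLURM_CPUS_PER_TASK" \
    --pan-db PAN/Winogradskyella_pan-PAN.db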

The key output file you’ll use for figures:

ANI/ANIb_percentage_identity.txt

This is a matrix where each cell contains the identity between a pair of genomes, stored as a fraction between 0 and 1 (so 0.97 means 97%).
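Before plotting anything, you can scan the matrix for genome pairs that fall below the species threshold. This sketch assumes a tab-separated matrix with a header row, genome names in the first column, rows and columns in the same order, and values stored as fractions between 0 and 1 (as the R code later in this post assumes); pairs_below is a hypothetical helper.

```shell
# pairs_below MATRIX THRESHOLD
# Prints "genomeA genomeB value" for every unordered pair below THRESHOLD.
pairs_below() {
    awk -F '\t' -v t="$2" '
        NR == 1 { for (i = 2; i <= NF; i++) col[i] = $i; next }
        {
            # Fields after position NR form the upper triangle of the matrix
            for (i = NR + 1; i <= NF; i++)
                if ($i + 0 < t + 0) print $1, col[i], $i
        }
    ' "$1"
}
```

For example, pairs_below ANI/ANIb_percentage_identity.txt 0.95 lists the pairs that would not count as the same species under the 95% rule of thumb. Only the upper triangle is read, so each pair is reported once.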

Step 9: Export Summary Tables

Before downloading to your laptop, export the gene cluster summary — the table that describes every gene cluster, which category it belongs to, and which genomes contain it.

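# Export gene cluster summary
# -C names a collection stored in the pan database. A new pan database has
# none, so create one first (for example with anvi-script-add-default-collection
# or in the interactive viewer) and pass its exact name here.
anvi-summarize \
    -p PAN/Winogradskyella_pan-PAN.db \
    -g GENOMES.db \
    -o PAN_SUMMARY \
    -C default

# The key file (Anvi'o typically writes it gzipped as .txt.gz; if so, gunzip it first):
# PAN_SUMMARY/Winogradskyella_pan_gene_clusters_summary.txt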

Also export the gene cluster presence/absence table for downstream analysis:

# Create output directory for tables
mkdir -p supplementary_tables/

# Export the presence/absence table with anvi-export-table
anvi-export-table \
    PAN/Winogradskyella_pan-PAN.db \
    --table gene_cluster_presence_absence \
    -o supplementary_tables/gene_cluster_presence_absence.tsv

Step 10: The HPC Shortcut — Why Not Visualise There?

Anvi’o has a beautiful interactive viewer accessed with:

anvi-display-pan \
    -p PAN/Winogradskyella_pan-PAN.db \
    -g GENOMES.db

This opens a local web server that you view in a browser. On your laptop, this works perfectly. On HPC, you hit several walls:

Login nodes don’t allow long-running interactive processes (see Day 3)
Compute nodes are usually behind firewalls that block browser connections
Port forwarding is possible but complex to set up and often blocked by institutional IT policies

Rather than fighting the infrastructure, copy the results to your laptop and continue there. This is faster, easier, and means your figure-generation code (R scripts) lives alongside your analysis in a reproducible project.

Step 11: Download Results to Your Laptop

# On your laptop — download the key results
rsync -avz --progress \
    username@hpc.university.edu:~/anvio_project/PAN/ \
    ./anvio_project/PAN/

rsync -avz --progress \
    username@hpc.university.edu:~/anvio_project/ANI/ \
    ./anvio_project/ANI/

rsync -avz --progress \
    username@hpc.university.edu:~/anvio_project/PAN_SUMMARY/ \
    ./anvio_project/PAN_SUMMARY/

rsync -avz --progress \
    username@hpc.university.edu:~/anvio_project/GENOMES.db \
    ./anvio_project/
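Database files that were cut short during transfer can produce puzzling errors later, so a quick integrity check is cheap insurance. The sketch below compares two files using the POSIX cksum utility; same_content is a hypothetical helper, and the toy files stand in for a remote and a local copy.

```shell
# same_content FILE_A FILE_B
# Succeeds when both files have the same CRC checksum and byte count.
same_content() {
    [ "$(cksum < "$1")" = "$(cksum < "$2")" ]
}

# Toy demonstration with two temporary files
a=$(mktemp)
b=$(mktemp)
printf 'pangenome\n' > "$a"
printf 'pangenome\n' > "$b"
same_content "$a" "$b" && echo "files match"
rm -f "$a" "$b"
```

For the real transfer, run cksum on the HPC copy over ssh and on the local copy, then compare the two lines of output by eye.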

Your local directory should now look like:

anvio_project/
├── PAN/
│   ├── Winogradskyella_pan-PAN.db
│   └── Winogradskyella_pan-GENOMES.db
├── ANI/
│   └── ANIb_percentage_identity.txt
├── GENOMES.db
└── PAN_SUMMARY/
    └── Winogradskyella_pan_gene_clusters_summary.txt

Step 12: Interactive Visualisation on Your Laptop

With Anvi’o installed locally (same conda install from Step 1), launch the interactive viewer:

conda activate anvio-9

anvi-display-pan \
    -p PAN/Winogradskyella_pan-PAN.db \
    -g GENOMES.db

Anvi’o will open a browser window showing a circular pangenome diagram. Each ring represents one genome; each radial segment represents a gene cluster. Core gene clusters are filled in across every ring; singletons appear in only one ring.

From the viewer you can:

Colour gene clusters by category (core / accessory / singleton)
View ANI similarity between genomes
Inspect specific gene clusters
Export the figure as SVG: Settings → Export SVG

Save as pangenome_circle.svg — this becomes Panel C of your final figure.

Step 13: Generate Supplementary Tables in R

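Switch to R for the quantitative analysis. The summary file from anvi-summarize has one row per gene, so first collapse it into one row per gene cluster with a 0/1 column per genome. The code below assumes the file contains the standard gene_cluster_id and genome_name columns.

library(dplyr)
library(tidyr)

# Load the per-gene summary table
genes <- read.delim(
    "PAN_SUMMARY/Winogradskyella_pan_gene_clusters_summary.txt",
    check.names = FALSE
)

# One row per gene cluster, one 0/1 presence column per genome
pan <- genes %>%
    distinct(gene_cluster_id, genome_name) %>%
    mutate(present = 1L) %>%
    pivot_wider(names_from = genome_name, values_from = present,
                values_fill = 0L) %>%
    rename(gene_cluster_name = gene_cluster_id) %>%
    as.data.frame()

genome_cols <- setdiff(colnames(pan), "gene_cluster_name")

# Derive the genome count and the category of each cluster
pan$genome_count <- rowSums(pan[, genome_cols])
pan$category <- ifelse(
    pan$genome_count == length(genome_cols), "CORE",
    ifelse(pan$genome_count == 1, "SINGLETON", "ACCESSORY")
)

# Preview the structure
head(pan)
dim(pan)    # rows = gene clusters, columns = genome columns + metadata

The resulting table has one column per genome (containing 0 or 1 for absent/present) plus metadata columns (cluster name, genome count, category).

cat("Number of gene clusters:", nrow(pan), "\n")
cat("Number of genomes:", length(genome_cols), "\n")
cat("Categories:", unique(pan$category), "\n")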

Reshape to long format for counting:

# Convert to long format (one row per genome-cluster pair)
long <- pan %>%
    pivot_longer(
        cols      = all_of(genome_cols),
        names_to  = "genome",
        values_to = "present"
    ) %>%
    filter(present == 1)   # keep only gene clusters present in each genome

# Count gene clusters per genome per category
per_genome <- long %>%
    group_by(genome, category) %>%
    summarise(n = n(), .groups = "drop") %>%
    pivot_wider(
        names_from  = category,
        values_from = n,
        values_fill = 0
    )

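# Save as supplementary table (create the folder first if it does not exist)
dir.create("supplementary_tables", showWarnings = FALSE)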
write.table(
    per_genome,
    "supplementary_tables/per_genome_gene_cluster_counts.tsv",
    sep       = "\t",
    quote     = FALSE,
    row.names = FALSE
)

print(per_genome)

Also export per-category tables that become part of your supplementary data:

# Core gene clusters
core <- pan %>%
    filter(category == "CORE") %>%
    select(gene_cluster_name, genome_count, all_of(genome_cols))

write.table(core, "supplementary_tables/core_gene_clusters.tsv",
            sep = "\t", quote = FALSE, row.names = FALSE)

# Accessory gene clusters
accessory <- pan %>%
    filter(category == "ACCESSORY") %>%
    select(gene_cluster_name, genome_count, all_of(genome_cols))

write.table(accessory, "supplementary_tables/accessory_gene_clusters.tsv",
            sep = "\t", quote = FALSE, row.names = FALSE)

# Singleton gene clusters
singletons <- pan %>%
    filter(category == "SINGLETON") %>%
    select(gene_cluster_name, genome_count, all_of(genome_cols))

write.table(singletons, "supplementary_tables/singleton_gene_clusters.tsv",
            sep = "\t", quote = FALSE, row.names = FALSE)

# Unique to PC_D3_3
pc_d3_unique <- pan %>%
    filter(category == "SINGLETON", PC_D3_3 == 1) %>%
    select(gene_cluster_name, genome_count, all_of(genome_cols))

write.table(pc_d3_unique, "supplementary_tables/PC_D3_3_unique_gene_clusters.tsv",
            sep = "\t", quote = FALSE, row.names = FALSE)

cat("PC_D3_3 unique gene clusters:", nrow(pc_d3_unique), "\n")

Step 14: Panel A — ANI Heatmap

library(pheatmap)

# Load ANI matrix
ani <- read.delim(
    "ANI/ANIb_percentage_identity.txt",
    row.names = 1,
    check.names = FALSE
)

# ANI values are 0–1; multiply by 100 for percentage
ani_pct <- as.matrix(ani) * 100

# Create heatmap (make sure the output folder exists first)
dir.create("figures", showWarnings = FALSE)
pdf("figures/Panel_A_ANI_heatmap.pdf", width = 8, height = 7)

pheatmap(
    ani_pct,
    cluster_rows   = TRUE,
    cluster_cols   = TRUE,
    border_color   = NA,
    color          = colorRampPalette(c("#f0f9ff", "#0ea5e9", "#1e3a5f"))(100),
    main           = "Genome Similarity (ANI %)",
    fontsize        = 10,
    fontsize_row    = 9,
    fontsize_col    = 9,
    display_numbers = TRUE,     # show values in cells
    number_format   = "%.1f",   # one decimal place
    number_color    = "white"
)

dev.off()

💡 Interpreting ANI: Values above roughly 95% are generally taken to indicate the same species, while lower values point to different species. Below about 75–80%, ANI estimates become unreliable. The heatmap clustering will group your most similar genomes together automatically.

Step 15: Panel B — Gene Cluster Category Counts

library(ggplot2)

# Count clusters per category
counts <- pan %>%
    count(category) %>%
    mutate(category = factor(category, levels = c("CORE", "ACCESSORY", "SINGLETON")))

# Define one colour per category (the same colours are reused in Panel D)
category_colours <- c(
    "CORE"      = "#2a9d8f",
    "ACCESSORY" = "#e9c46a",
    "SINGLETON" = "#e76f51"
)

# Create bar chart
panel_b <- ggplot(counts, aes(x = category, y = n, fill = category)) +
    geom_col(width = 0.6, show.legend = FALSE) +
    geom_text(aes(label = n), vjust = -0.4, fontface = "bold", size = 4) +
    scale_fill_manual(values = category_colours) +
    scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
    theme_bw(base_size = 12) +
    theme(
        panel.grid.major.x = element_blank(),
        panel.grid.minor   = element_blank(),
        axis.title.x       = element_blank()
    ) +
    labs(
        y     = "Number of gene clusters",
        title = "Pangenome Composition"
    )

dir.create("figures", showWarnings = FALSE)   # make sure the output folder exists
ggsave("figures/Panel_B_gene_cluster_categories.pdf",
       panel_b, width = 5, height = 5)

print(panel_b)

Step 16: Panel D — Gene Presence/Absence Heatmap

# Build presence/absence matrix
pa <- pan[, genome_cols]
rownames(pa) <- pan$gene_cluster_name

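# Order rows: core first, then accessory, then singletons
# (a factor fixes the category order; plain strings would sort alphabetically)
cat_levels <- c("CORE", "ACCESSORY", "SINGLETON")
row_order  <- order(factor(pan$category, levels = cat_levels), -pan$genome_count)
pa_ordered <- pa[row_order, ]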

# Create annotation for rows (category)
row_annotation <- data.frame(
    Category = pan$category[row_order]
)
rownames(row_annotation) <- rownames(pa_ordered)

# Annotation colours
ann_colours <- list(
    Category = c(
        "CORE"      = "#2a9d8f",
        "ACCESSORY" = "#e9c46a",
        "SINGLETON" = "#e76f51"
    )
)

dir.create("figures", showWarnings = FALSE)   # make sure the output folder exists
pdf("figures/Panel_D_presence_absence_heatmap.pdf", width = 10, height = 8)

pheatmap(
    as.matrix(pa_ordered),
    cluster_rows        = FALSE,      # keep category order
    cluster_cols        = TRUE,       # cluster genomes by similarity
    show_rownames       = FALSE,      # too many clusters to label
    show_colnames       = TRUE,
    annotation_row      = row_annotation,
    annotation_colors   = ann_colours,
    color               = c("#f8fafc", "#0f172a"),   # white = absent, dark = present
    border_color        = NA,
    main                = "Gene Cluster Presence / Absence",
    fontsize_col        = 9,
    gaps_row            = c(
        sum(pan$category[row_order] == "CORE"),
        sum(pan$category[row_order] %in% c("CORE", "ACCESSORY"))
    )
)

dev.off()

Step 17: Assemble the Final Figure

Your four panels are:

Panel	File	Content
A	`Panel_A_ANI_heatmap.pdf`	Genome similarity matrix
B	`Panel_B_gene_cluster_categories.pdf`	Core / accessory / singleton counts
C	`pangenome_circle.svg`	Circular pangenome (exported from Anvi’o viewer)
D	`Panel_D_presence_absence_heatmap.pdf`	Gene presence/absence across all genomes

Combine them in Inkscape (free), Adobe Illustrator, or Affinity Designer:

Open a new document at your journal’s required figure width (usually 180 mm for double-column)
Import each panel individually
Arrange in a 2×2 grid
Add panel labels (A, B, C, D) in the top-left corner of each panel — bold, 10pt
Add a unified legend if needed
Export as Figure1.pdf (vector) and Figure1.tiff at 300 DPI (for submission)
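When exporting the TIFF, it helps to know the pixel dimensions that a given physical width implies. The conversion is millimetres divided by 25.4 (mm per inch) times the DPI; mm_to_px below is a hypothetical helper for that arithmetic.

```shell
# mm_to_px WIDTH_MM DPI
# Prints the pixel width, rounded to the nearest whole pixel.
mm_to_px() {
    awk -v mm="$1" -v dpi="$2" 'BEGIN { printf "%d\n", mm / 25.4 * dpi + 0.5 }'
}

mm_to_px 180 300
```

A 180 mm figure at 300 DPI therefore needs to be about 2126 pixels wide.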

┌─────────────────┬─────────────────┐
│  A: ANI Heatmap │  B: Bar Chart   │
│                 │   Core/Acc/Sin  │
├─────────────────┼─────────────────┤
│  C: Circular    │  D: Presence/   │
│  Pangenome      │  Absence Heatmap│
└─────────────────┴─────────────────┘

💡 SVG tip: Panel C (the Anvi’o circle) is exported as SVG, which means it’s a vector graphic — you can resize it to any size without losing quality, and you can edit individual elements in Inkscape.

Complete Project Directory Structure

Here is what your final project should look like:

anvio_project/
│
├── genomes/                    ← original FASTA files
├── genomes_reformatted/        ← clean FASTA files (Anvi'o-compatible)
├── contigs_db/                 ← one .db file per genome
│
├── external-genomes.txt        ← genome list for Anvi'o
├── GENOMES.db                  ← genome storage database
│
├── PAN/                        ← pangenome results
│   ├── Winogradskyella_pan-PAN.db
│   └── Winogradskyella_pan-GENOMES.db
│
├── ANI/                        ← genome similarity results
│   └── ANIb_percentage_identity.txt
│
├── PAN_SUMMARY/                ← exported summary tables
│   └── Winogradskyella_pan_gene_clusters_summary.txt
│
├── supplementary_tables/       ← tables for paper submission
│   ├── per_genome_gene_cluster_counts.tsv
│   ├── core_gene_clusters.tsv
│   ├── accessory_gene_clusters.tsv
│   ├── singleton_gene_clusters.tsv
│   └── PC_D3_3_unique_gene_clusters.tsv
│
├── figures/                    ← publication figures
│   ├── Panel_A_ANI_heatmap.pdf
│   ├── Panel_B_gene_cluster_categories.pdf
│   ├── pangenome_circle.svg
│   ├── Panel_D_presence_absence_heatmap.pdf
│   └── Figure1.pdf             ← final assembled figure
│
└── scripts/
    ├── 01_reformat_fasta.sh
    ├── 02_create_contigs_db.sh
    ├── 03_run_pangenome.sh
    ├── 04_compute_ani.sh
    └── 05_figures.R

Key Results to Report in Your Paper

After running this analysis, you’ll have the numbers to fill in your methods and results sections:

# Quick summary statistics — run this in R
cat("=== Pangenome Summary ===\n")
cat("Total gene clusters:    ", nrow(pan), "\n")
cat("Core gene clusters:     ", sum(pan$category == "CORE"), "\n")
cat("Accessory gene clusters:", sum(pan$category == "ACCESSORY"), "\n")
cat("Singleton gene clusters:", sum(pan$category == "SINGLETON"), "\n")
cat("PC_D3_3 unique clusters:", nrow(pc_d3_unique), "\n")
cat("\n=== Genomes ===\n")
cat("Number of genomes:", length(genome_cols), "\n")

A typical results paragraph looks like:

“The pangenome of 15 Winogradskyella genomes comprised X gene clusters in total. Of these, Y (Z%) constituted the core genome, A (B%) were accessory gene clusters, and C (D%) were singletons unique to individual genomes. The environmental isolate PC_D3_3 harboured E unique gene clusters not found in any other genome in the dataset (Supplementary Table X). Average nucleotide identity (ANI) analysis confirmed that all strains shared >95% identity, consistent with membership in the same species (Figure 1A).”
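The percentages in that paragraph are simple ratios of each category count to the total number of gene clusters. A minimal sketch, using made-up counts purely for illustration (pct is a hypothetical helper; plug in the numbers printed by the R summary above):

```shell
# pct PART TOTAL
# Prints PART as a percentage of TOTAL with one decimal place.
pct() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * part / total }'
}

# Hypothetical example: 1200 core clusters out of 4800 in total
pct 1200 4800
```

With these toy numbers the core genome would be reported as 25.0% of all gene clusters.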

Common Errors and How to Fix Them

Error: “Sequence names in FASTA have characters Anvi’o doesn’t like”

# Fix: rerun Step 3 with --simplify-names
anvi-script-reformat-fasta genome.fa -o genome_clean.fa --simplify-names

Error: “Some genome names in your external genomes file do not match”

# Check that names in external-genomes.txt exactly match database names
head external-genomes.txt
anvi-db-info contigs_db/PC_D3_3.db | grep "project_name"
# Both should show exactly the same string

Error: anvi-pan-genome runs out of memory

# In your SLURM script, increase memory and reduce threads
#SBATCH --mem=64G
#SBATCH --cpus-per-task=8
# (Fewer threads can sometimes reduce peak memory)

Error: “MCL is not installed”

# Install MCL in your conda environment
conda install -c bioconda mcl -y

R error: “object ‘pan’ not found”

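This means the pan object was never created: the loading code in Step 13 did not run, or it failed because the file path or working directory is wrong. Check with:

getwd()              # are you inside anvio_project/?
file.exists("PAN_SUMMARY/Winogradskyella_pan_gene_clusters_summary.txt")
exists("pan")        # TRUE once the table has been loaded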

Summary

This workflow demonstrates a clean hybrid approach to pangenome analysis:

Phase	Where	Tools
Genome preparation	HPC or laptop	Anvi’o
Contigs databases	HPC	Anvi’o + Prodigal
Pangenome construction	HPC	Anvi’o + BLAST/DIAMOND + MCL
ANI computation	HPC	Anvi’o + pyANI
Interactive visualisation	Laptop	Anvi’o browser viewer
Figure generation	Laptop	R, ggplot2, pheatmap
Final figure assembly	Laptop	Inkscape / Illustrator

The key insight: don’t fight your HPC’s GUI limitations. Move the results to your laptop and leverage the full power of R for publication-quality figures. The compute belongs on HPC; the creativity belongs on your machine.

References

Anvi’o: Eren AM et al. (2015) Anvi’o: an advanced analysis and visualization platform for ‘omics data. PeerJ. merenlab.org/software/anvio
pyANI: Pritchard L et al. (2016) Genomics and taxonomy in diagnostics for food security. Anal. Methods.
MCL: Enright AJ et al. (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res.
pheatmap: Kolde R (2019). pheatmap: Pretty Heatmaps. R package.

Questions about this workflow? Drop a comment below. The complete R script and SLURM submission scripts are available in the companion GitHub repository.

PanGenomics-Anvio