Exploring the Pan-Genome with panX: A Practical Workflow for DARPA Isolates

🧬 Exploring the Pan-Genome with panX

Today’s post walks through the complete setup and execution of panX, a powerful tool for analyzing bacterial pan-genomes.

One of my Master’s students is working on a project involving DARPA isolate genomes, focusing on:

Abiotic stress adaptation
Niche classification (generalist vs specialist)
Functional redundancy

As part of the analysis, she selected 20 isolates and classified them using:

Genome size
Total gene count
CAZyme profiles
Transporters
Growth on carbon sources

But after carbon utilization assays, we realized there was still a missing piece:
👉 The pangenome perspective
So we turned to panX.

panX (Ding et al., 2018) integrates:

Ortholog clustering
Gene presence/absence matrix
Phylogenetic inference
Interactive web visualization
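
To make the presence/absence idea concrete, here is a minimal sketch on toy data. It assumes gene IDs of the form strain|locus_tag; the strain names, the toy clusters and the presence_absence helper are made up for illustration and are not part of panX.

```python
from typing import Dict, List


def presence_absence(clusters: List[List[str]]) -> Dict[str, List[int]]:
    """Map each strain to a 0/1 vector with one entry per gene cluster."""
    # Assumption: gene IDs look like "strain|locus_tag".
    strains = sorted({gene.split("|")[0] for cluster in clusters for gene in cluster})
    matrix = {strain: [0] * len(clusters) for strain in strains}
    for column, cluster in enumerate(clusters):
        for gene in cluster:
            matrix[gene.split("|")[0]][column] = 1
    return matrix


# Toy data: three hypothetical strains and three clusters.
toy_clusters = [
    ["A|g1", "B|g1", "C|g1"],  # present in every strain (core)
    ["A|g2", "B|g2"],          # present in some strains (accessory)
    ["C|g3", "C|g4"],          # strain-specific cluster with a paralog
]

if __name__ == "__main__":
    for strain, row in presence_absence(toy_clusters).items():
        print(strain, row)
```

A column of all ones is a core cluster; anything else is accessory. Paralogs within one strain still count as a single "present".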

I often use Anvi’o, but because of time constraints, panX was the best choice:
✔ Fast
✔ Lightweight
✔ Clear cluster outputs
✔ Better suited for quick exploratory analysis

⚙️ Step 1 — Cloning and Setup

Clone the official repository:

git clone https://github.com/neherlab/pan-genome-analysis.git
cd pan-genome-analysis

🧱 Step 2 — Install panX Dependencies (Conda)

The documentation suggests Miniconda2, but Miniconda3 worked fine for me, since conda builds the environment from panX-environment.yml either way.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/miniconda3/etc/profile.d/conda.sh

Create the panX environment:

conda env create -f panX-environment.yml
conda activate panX

If you see dependency conflicts, retry with the libmamba solver:

conda env create -f panX-environment.yml --solver=libmamba

Test your installation:

sh run-TestSet.sh
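
If the test set fails, a quick first check is whether the external tools are visible inside the activated environment. The sketch below only looks up executables on PATH. The tool names in ASSUMED_TOOLS are my assumption about what the pipeline calls, so adjust them to whatever your environment file installs.

```python
import shutil
from typing import Dict, Iterable, Optional

# Assumed executable names (not taken from the panX docs); edit as needed.
ASSUMED_TOOLS = ("diamond", "mcl", "mafft", "FastTree", "raxmlHPC")


def find_tools(names: Iterable[str] = ASSUMED_TOOLS) -> Dict[str, Optional[str]]:
    """Return the resolved path for each executable, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name:10s} {path or 'NOT FOUND'}")
```

Run it after conda activate panX so that PATH reflects the environment you actually plan to use.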

🧬 Step 3 — Preparing Genomes with Prokka

panX requires GenBank (.gbk) files, so I annotated all genomes using Prokka.

Install Prokka:

conda create -n prokka_env -c bioconda -c conda-forge prokka
conda activate prokka_env

Batch annotation:

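for f in *.fa; do
  base=$(basename "$f" .fa)
  # RNAmmer is off by default, so no --rnammer flag is needed
  # (--rnammer is a switch; "--rnammer off" would turn it on).
  prokka "$f" \
    --outdir "../prokka_out/${base}" \
    --prefix "${base}" \
    --cpus 8 \
    --force
done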

Collect all .gbk files:

mkdir -p ../panx_input
find ../prokka_out -name "*.gbk" -exec cp {} ../panx_input/ \;
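
Before handing the files to panX, it is worth confirming that every copied .gbk file actually contains annotated genes. This is a rough sketch that counts CDS feature lines with plain text scanning (no Biopython). The helper names are hypothetical, and the ../panx_input path just mirrors the directory created above.

```python
from pathlib import Path
from typing import Dict


def count_cds(gbk_path: Path) -> int:
    """Count CDS features in a GenBank file by scanning feature-table lines."""
    count = 0
    with open(gbk_path) as handle:
        for line in handle:
            # In the GenBank flat-file format, feature keys follow five spaces.
            if line.startswith("     CDS "):
                count += 1
    return count


def summarize_inputs(input_dir: str) -> Dict[str, int]:
    """Map each .gbk file name in input_dir to its CDS count."""
    return {p.name: count_cds(p) for p in sorted(Path(input_dir).glob("*.gbk"))}


if __name__ == "__main__":
    for name, n in summarize_inputs("../panx_input").items():
        flag = "" if n > 0 else "  <-- no CDS features found"
        print(f"{name}\t{n}{flag}")
```

A file with zero CDS features usually means the annotation step failed for that genome, and it is easier to catch that here than halfway through clustering.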

🚀 Step 4 — Running panX

Move into the panX project directory:

cd ~/pan-genome-analysis/projects
mkdir DARPA_20
cp /project/bcampb7/camplab/Alisha/pangenome_analysis/panx_input/*.gbk DARPA_20/
cd DARPA_20

Run the full pipeline:

python ../../scripts/run-pipeline.py \
    -fn . \
    -sl all \
    -st 13456 \
    --threads 32 \
    --diamond

You will get output folders like:

1_parsed_input/
2_clusters/
3_alignments/
4_trees/
5_statistics/
summary/
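
On a cluster, a run can die partway through without much noise. A small sketch like this reports which of the folders listed above are missing or empty. The folder names are copied from this post and may differ in your panX version; missing_outputs is a hypothetical helper.

```python
from pathlib import Path
from typing import List

# Folder names as listed in this post; treat them as assumptions for your version.
EXPECTED = [
    "1_parsed_input",
    "2_clusters",
    "3_alignments",
    "4_trees",
    "5_statistics",
    "summary",
]


def missing_outputs(project_dir: str, expected: List[str] = EXPECTED) -> List[str]:
    """Return the expected folders that are absent or empty under project_dir."""
    root = Path(project_dir)
    missing = []
    for name in expected:
        folder = root / name
        # A folder counts as present only if it exists and contains something.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Missing or empty:", missing_outputs(".") or "none")
```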

🌐 Step 5 — Interactive Visualization (Optional)

panX includes a full browser-based viewer using Node.js.

Clone the visualization module:

git clone https://github.com/neherlab/pan-genome-visualization
cd pan-genome-visualization
npm install --legacy-peer-deps
git submodule update --init --recursive

Build and launch:

export DATA_ROOT_URL=/dataset/
npm run build
npm start

Open: 👉 http://localhost:8000

📂 Linking Your Dataset to the Viewer

Edit public/dataset/all_downloads_table.json:

{
  "panX link": "gbk",
  "species name": "GBK dataset",
  "source": "Local",
  "gene cluster json": "dataset/gbk/geneCluster.json",
  "metadata table": "dataset/gbk/metainfo.tsv",
  "strain/species tree": "dataset/gbk/strain_tree.nwk"
}

Refresh the page → your dataset appears in the menu.
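
Hand-editing JSON is easy to get wrong, especially if curly quotes sneak in from a copy-paste. Here is a small sketch that builds the same entry with Python's json module, which always emits straight quotes. The dataset_entry helper is hypothetical, and I have not confirmed whether the viewer expects a single object or a list of them in that file, so the sketch only prints one entry for you to paste in.

```python
import json
from typing import Dict


def dataset_entry(name: str, label: str) -> Dict[str, str]:
    """Build one dataset entry using the keys shown in this post."""
    base = f"dataset/{name}"
    return {
        "panX link": name,
        "species name": label,
        "source": "Local",
        "gene cluster json": f"{base}/geneCluster.json",
        "metadata table": f"{base}/metainfo.tsv",
        "strain/species tree": f"{base}/strain_tree.nwk",
    }


if __name__ == "__main__":
    # Prints valid JSON with straight double quotes.
    print(json.dumps(dataset_entry("gbk", "GBK dataset"), indent=2))
```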

🧬 Step 6 — Extracting Core and Pan Genes Without Visualization

Often on HPC (like Palmetto) you only need:

Core gene count
Pan gene count
Accessory genes per genome

panX provides allclusters_final.tsv, which we can parse directly.

👉 Here is the Python script:

#!/usr/bin/env python3
"""Summarize core and accessory gene clusters per genome from panX output."""
from collections import Counter

import pandas as pd

infile = "allclusters_final.tsv"
outfile = "core_pan_gene_counts_per_genome.tsv"

# Each non-empty line is one gene cluster: tab-separated gene IDs.
clusters = []
with open(infile) as f:
    for line in f:
        genes = [x.strip() for x in line.strip().split("\t") if x.strip()]
        if genes:
            clusters.append(genes)


def genome_name(gene_id):
    """Return the genome part of a 'genome|locus_tag' gene ID."""
    return gene_id.split("|")[0]


# Genomes represented in each cluster (paralogs collapse to one entry).
cluster_genomes = [list(set(genome_name(g) for g in cl)) for cl in clusters]
all_genomes = sorted({g for cl in cluster_genomes for g in cl})

# Core clusters are present in every genome; copy number is not checked.
core_clusters = [cl for cl in cluster_genomes if len(cl) == len(all_genomes)]
# Number of clusters each genome appears in.
genome_counter = Counter(g for cl in cluster_genomes for g in cl)

summary = pd.DataFrame({
    "Genome": all_genomes,
    "Pan_genes": [genome_counter[g] for g in all_genomes],
})
summary["Core_genes"] = len(core_clusters)
summary["Accessory_genes"] = summary["Pan_genes"] - summary["Core_genes"]
summary["Core_fraction(%)"] = (summary["Core_genes"] / summary["Pan_genes"] * 100).round(2)

summary.to_csv(outfile, sep="\t", index=False)
print("✅ Saved:", outfile)

Run with:

python core_pan_from_list.py

Output file: core_pan_gene_counts_per_genome.tsv

Contains:

Core genes
Pan genes
Accessory genes
Core gene % per genome

Perfect for downstream comparative genomics.
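
As one example of a downstream step, this sketch ranks genomes by accessory gene count, which is a natural first look when comparing candidate generalists and specialists. It uses only the standard library and assumes the Genome and Accessory_genes column names written by the script above; rank_by_accessory is a hypothetical helper.

```python
import csv
from typing import List, Tuple


def rank_by_accessory(summary_tsv: str) -> List[Tuple[str, int]]:
    """Return (genome, accessory gene count) pairs, largest count first."""
    with open(summary_tsv, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    ranked = [(row["Genome"], int(row["Accessory_genes"])) for row in rows]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for genome, n in rank_by_accessory("core_pan_gene_counts_per_genome.tsv"):
        print(f"{genome}\t{n}")
```

Accessory counts alone do not classify a niche, but they make it easy to spot which isolates to cross-check against the carbon utilization results.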

For more details on panX, visit: https://pangenome.org/
