Whole Genome Sequencing – Day 5: Comparative & Downstream Genomic Analyses

🧬 Whole Genome Sequencing – Day 5

Comparative & Downstream Genomic Analyses

👉 On Day 1, we cleaned raw reads.
👉 On Day 2, we assembled genomes and evaluated quality.
👉 On Day 3, we placed genomes in taxonomic and phylogenetic context.
👉 On Day 4, we annotated genomes and identified functional potential.

Today, in Day 5, we shift from individual genomes to comparative insights: How do genomes differ? What unique functions do they encode? What risks or opportunities do they present? This step transforms genome collections into ecological and biotechnological insights by comparing gene content, identifying biosynthetic potential, and screening for clinically relevant features.

🎯 Goal of Day 5

Move from single genomes to comparative and ecological insights.

Specifically, we aim to:
✔ Define core and accessory genomes across collections
✔ Identify secondary metabolite biosynthetic gene clusters
✔ Screen for antimicrobial resistance and virulence factors
✔ Detect and characterize plasmids
✔ Validate genome topology and circularization

This workflow scales from small genome sets to hundreds of isolates or MAGs.

🧪 Comparative Genomics Strategy

Comparative analyses work best when structured around specific biological questions:

| Analysis | Question |
|---|---|
| Pangenome analysis | What genes are shared vs. strain-specific? |
| Secondary metabolite screening | What biosynthetic potential exists? |
| Pathogen screening | Are there AMR genes or virulence factors? |
| Plasmid detection | What mobile genetic elements are present? |
| Topology validation | Are genomes complete and circularized? |

🧠 Why Comparative Genomics Matters

Individual genomes tell us what one organism can do.

Comparative genomics reveals:
● Functional diversity within populations
● Evolutionary adaptation strategies
● Horizontal gene transfer and mobile elements
● Biotechnological potential (enzymes, metabolites)
● Public health risks (AMR, virulence)

Comparative genomics is essential for:
● Understanding strain-level variation
● Identifying biomarkers or diagnostic targets
● Prioritizing genomes for downstream work
● Linking genomic diversity to ecological roles

🧰 Tools Used in Day 5

🔹 PanX – Pangenome analysis and visualization
🔹 antiSMASH – Secondary metabolite biosynthetic gene cluster prediction
🔹 BiG-SCAPE – Biosynthetic gene cluster similarity networks
🔹 Abricate – Rapid screening for AMR genes, virulence factors, and plasmids
🔹 PLSDB / PlasmidFinder – Plasmid detection and characterization
🔹 Plasmer – Machine learning-based plasmid sequence identification
🔹 SPAdes (plasmidSPAdes mode) – Targeted plasmid assembly
🔹 SnapGene / Geneious – Plasmid visualization and annotation

Each tool addresses a different dimension of comparative and applied genomics.

🧬 Step 1: Pangenome Analysis with PanX

Pangenome analysis partitions genes into:
● Core genome – genes present in all strains
● Accessory genome – genes present in some strains
● Unique genes – strain-specific genes

This reveals functional diversity, niche adaptation, and horizontal gene transfer.
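
The three categories can be illustrated with a few lines of Python. This is a minimal sketch, not PanX's algorithm: it assumes gene clustering has already been done and that each strain is represented by a set of gene-cluster IDs. The strain and gene names are hypothetical toy data, and "accessory" is taken here to mean present in at least two strains but not all.

```python
from collections import Counter


def partition_pangenome(gene_sets):
    """Split gene clusters into core, accessory, and unique categories.

    gene_sets maps a strain name to the set of gene-cluster IDs it carries.
    """
    n_strains = len(gene_sets)
    # Count in how many strains each gene cluster occurs.
    counts = Counter(gene for genes in gene_sets.values() for gene in genes)
    core = {g for g, c in counts.items() if c == n_strains}
    # With a single strain, everything is core and nothing is "unique".
    unique = {g for g, c in counts.items() if c == 1 and n_strains > 1}
    accessory = set(counts) - core - unique
    return {"core": core, "accessory": accessory, "unique": unique}


# Hypothetical toy input: three strains and five gene clusters.
example = {
    "strainA": {"gyrB", "recA", "blaTEM", "nifH"},
    "strainB": {"gyrB", "recA", "blaTEM"},
    "strainC": {"gyrB", "recA", "luxI"},
}
result = partition_pangenome(example)
for category in ("core", "accessory", "unique"):
    print(category, sorted(result[category]))
```

Real pangenome tools often relax the core definition (for example, presence in nearly all strains) to tolerate draft assemblies; the strict "all strains" rule above simply follows the definition given in this section.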

For more details about PanX, from installation to running it, please visit: https://jojyjohn28.github.io/blog/panx-pangenome-analysis/

You can also extract core and pan genes without visualization using a custom Python script. Please see core_pan_from_list.py in https://github.com/jojyjohn28/whole-genome-sequencing-analysis/day5

What PanX Tells Us
✔ Which genes are universally conserved (core metabolism)
✔ Which genes are variable (niche-specific functions)
✔ Gene gain/loss patterns across phylogeny
✔ Functional stratification within populations

Pangenomes are especially powerful when combined with Day 3 phylogenies and Day 4 functional annotations.

🧬 Step 2: Secondary Metabolite Screening with antiSMASH

Secondary metabolites include antibiotics, toxins, pigments, and signaling molecules. Many are encoded by biosynthetic gene clusters (BGCs). antiSMASH predicts BGCs and compares them to known compounds.

# Install antiSMASH via conda
conda install -c bioconda antismash

Running antiSMASH

antismash genome.gbk \
  --output-dir antismash_out \
  --genefinding-tool prodigal \
  --cpus 16

If you have only one or a few genomes and are not comfortable running the stand-alone version, antiSMASH is also available as a web-based platform. Please visit: https://antismash.secondarymetabolites.org/#!/start

Key Outputs

● HTML report with interactive BGC visualization

● GenBank files with annotated BGCs

● Predicted compound classes (NRPS, PKS, terpene, etc.)

What antiSMASH Reveals βœ” Biosynthetic potential for novel metabolites βœ” Known BGCs with compound matches βœ” Evolutionary conservation of BGCs across strains
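
A quick way to compare biosynthetic potential across strains is to count predicted regions per genome. The sketch below assumes each genome was run into its own sub-directory of a common results folder, and that antiSMASH wrote one `<record>.region001.gbk`-style file per predicted region. The directory layout and the genome names are assumptions made for illustration.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_bgc_regions(results_root):
    """Count antiSMASH region GenBank files in each genome's output directory."""
    counts = Counter()
    # Assumed layout: <results_root>/<genome>/<record>.region001.gbk
    for gbk in Path(results_root).glob("*/*.region*.gbk"):
        counts[gbk.parent.name] += 1
    return dict(counts)


# Demonstration on a throwaway directory with empty placeholder files
# (hypothetical genome and record names).
with tempfile.TemporaryDirectory() as tmp:
    for genome, n_regions in {"isolate1": 3, "isolate2": 1}.items():
        outdir = Path(tmp) / genome
        outdir.mkdir()
        (outdir / f"{genome}.gbk").touch()  # full-genome file, not counted
        for i in range(1, n_regions + 1):
            (outdir / f"contig_1.region{i:03d}.gbk").touch()
    demo_counts = count_bgc_regions(tmp)

print(demo_counts)
```

On real output, point `count_bgc_regions` at the folder that holds the per-genome antiSMASH directories. Genomes with no predicted regions will be missing from the result rather than listed with a zero.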

🧬 Step 3: BGC Similarity Networks with BiG-SCAPE

When you have many genomes, BiG-SCAPE clusters similar BGCs into gene cluster families (GCFs). This enables:
● Cross-genome BGC comparison
● Identification of conserved vs. unique BGCs
● Prioritization of novel clusters

Installation

# Install BiG-SCAPE
conda install -c bioconda bigscape

Running BiG-SCAPE

bigscape -i antismash_out/ \
  -o bigscape_out \
  --pfam_dir /path/to/pfam \
  --mode auto \
  --cutoffs 0.3 0.5 0.7
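
The `--cutoffs` values are distance thresholds, so running several of them shows how BGC groupings merge as the threshold is relaxed. The sketch below illustrates only that thresholding idea, by grouping BGCs into connected components of a toy distance network. BiG-SCAPE's own family assignment is more involved than this, and the BGC names and distances are hypothetical.

```python
def group_by_cutoff(distances, cutoff):
    """Group BGCs into connected components using edges with distance <= cutoff."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), dist in distances.items():
        root_a, root_b = find(a), find(b)  # also registers both nodes
        if dist <= cutoff and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical pairwise distances between four BGCs (0 = identical, 1 = unrelated).
toy_distances = {
    ("bgc1", "bgc2"): 0.20,
    ("bgc1", "bgc3"): 0.60,
    ("bgc2", "bgc3"): 0.45,
    ("bgc3", "bgc4"): 0.90,
}
for cutoff in (0.3, 0.5, 0.7):
    print(cutoff, group_by_cutoff(toy_distances, cutoff))
```

In this toy network the strict cutoff keeps bgc3 separate, while the looser cutoffs pull it into the same group as bgc1 and bgc2. Comparing groupings across cutoffs in this way can help decide which threshold best matches the biology of your dataset.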

Key Outputs

● Network diagrams of BGC families

● GCF assignments for each genome

● Distance matrices for comparative analysis

BiG-SCAPE + antiSMASH together provide a genome-scale view of biosynthetic diversity. bigscape

This figure shows a new starin of Streptomyces SOSIST-3 isolated from Southern Oceans, the manuscript is accepted in Molecular biology Reports. More details see : https://jojyjohn28.github.io/collaborations/ and https://jojyjohn28.github.io/publications/

🧬 Step 4: Pathogen Screening with Abricate

Whole genomes are typically screened against two types of databases to assess their pathogenic potential: (i) antibiotic resistance genes (ARGs) and (ii) virulence genes.

Tool: Abricate

Abricate (https://github.com/tseemann/abricate) is a mass screening tool that rapidly identifies antimicrobial resistance and virulence genes by BLAST against multiple curated databases.

Available Databases in Abricate

Abricate supports the following databases:

● NCBI – NCBI Bacterial Antimicrobial Resistance Reference Gene Database

● CARD – Comprehensive Antibiotic Resistance Database

● ARG-ANNOT – Antibiotic Resistance Gene-ANNOTation

● ResFinder – Acquired antimicrobial resistance genes

● MEGARES – MEGARes resistance database

● EcOH – E. coli virulence genes

● PlasmidFinder – Plasmid replicons

● Ecoli_VF – E. coli virulence factors

● VFDB – Virulence Factor Database

Database Selection Strategy

For comprehensive screening, use the following databases:

For ARGs (Antibiotic Resistance Genes):
● ARG-ANNOT
● CARD
● ResFinder

For Virulence Genes:
● VFDB (Virulence Factor Database)

Installation

conda install -c bioconda abricate

Running Abricate

Screen for ARGs

# Using CARD database
abricate --db card genome.fasta > genome_card.tab

# Using ResFinder database
abricate --db resfinder genome.fasta > genome_resfinder.tab

# Using ARG-ANNOT database
abricate --db argannot genome.fasta > genome_argannot.tab

Screen for Virulence Factors

# Using VFDB database
abricate --db vfdb genome.fasta > genome_vfdb.tab

Screen for Plasmid Replicons

# Using PlasmidFinder database
abricate --db plasmidfinder genome.fasta > genome_plasmids.tab

Scaling to Multiple Genomes

# Batch screening for ARGs (CARD example)
# Variables are quoted so file names containing spaces are handled safely
for genome in *.fasta; do
  abricate --db card "$genome" > "${genome%.fasta}_card.tab"
done

# Batch screening for virulence factors
for genome in *.fasta; do
  abricate --db vfdb "$genome" > "${genome%.fasta}_vfdb.tab"
done

# Summarize results across all genomes
abricate --summary *_card.tab > card_summary.tab
abricate --summary *_vfdb.tab > vfdb_summary.tab
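
When you need custom identity or coverage thresholds, the per-genome `.tab` files can also be parsed directly. The sketch below assumes Abricate's standard tab-separated report with the `#FILE`, `GENE`, `%COVERAGE`, and `%IDENTITY` columns. The sample rows, accession placeholders, and threshold values are hypothetical and only illustrate the filtering logic.

```python
import csv
import io

HEADER = ["#FILE", "SEQUENCE", "START", "END", "STRAND", "GENE", "COVERAGE",
          "COVERAGE_MAP", "GAPS", "%COVERAGE", "%IDENTITY", "DATABASE",
          "ACCESSION", "PRODUCT", "RESISTANCE"]

# Hypothetical hits: one full-length hit and one partial hit in isolate1,
# and one full-length hit in isolate2.
ROWS = [
    ["isolate1.fasta", "contig_1", "100", "960", "+", "blaTEM-1", "1-861/861",
     "===============", "0/0", "100.00", "99.80", "card", "PLACEHOLDER_1",
     "beta-lactamase", "penam"],
    ["isolate1.fasta", "contig_3", "50", "400", "-", "tetA", "1-351/1200",
     "=====..........", "0/0", "29.25", "98.00", "card", "PLACEHOLDER_2",
     "efflux pump", "tetracycline"],
    ["isolate2.fasta", "contig_2", "10", "1210", "+", "tetA", "1-1200/1200",
     "===============", "0/0", "100.00", "95.50", "card", "PLACEHOLDER_2",
     "efflux pump", "tetracycline"],
]
SAMPLE_TAB = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS) + "\n"


def genes_per_genome(tab_text, min_identity=90.0, min_coverage=80.0):
    """Return {genome_file: sorted gene names} for hits passing both thresholds."""
    hits = {}
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    for row in reader:
        if float(row["%IDENTITY"]) < min_identity:
            continue
        if float(row["%COVERAGE"]) < min_coverage:
            continue
        hits.setdefault(row["#FILE"], set()).add(row["GENE"])
    return {genome: sorted(genes) for genome, genes in hits.items()}


filtered = genes_per_genome(SAMPLE_TAB)
print(filtered)
```

To use it on real output, read a report with `open("genome_card.tab").read()` and pass the text in. The partial tetA hit in isolate1 is dropped here because its coverage falls below the chosen threshold, which is the kind of decision that is worth making explicit when reporting AMR genes.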

Comprehensive Screening Workflow

For complete pathogen characterization, screen each genome against all relevant databases. An example bash script (comp_screening.sh) is included in https://github.com/jojyjohn28/whole-genome-sequencing-analysis/day5

What Abricate Provides
✔ Presence/absence of resistance genes across multiple databases
✔ Virulence factor distribution and identification
✔ Plasmid-associated gene detection
✔ Gene coverage and identity statistics
✔ Rapid, scalable screening for large datasets

Using multiple databases (CARD, ResFinder, ARG-ANNOT) for ARG screening ensures comprehensive detection, as different databases may capture different resistance mechanisms or gene variants.

🧬 Step 5: Plasmid Detection & Characterization

Plasmids carry mobile genes involved in:
● Antibiotic resistance
● Virulence
● Metabolic versatility
● Horizontal gene transfer

Detecting and characterizing plasmids helps us understand genome plasticity and ecological adaptation.

Tools for Plasmid Detection
🔹 Plasmer – Machine learning-based plasmid identification
🔹 PlasmidFinder – Database-based replicon detection
🔹 Prokka – Annotates plasmid sequences
🔹 SPAdes (plasmidSPAdes mode) – Assembles plasmids from raw reads
🔹 SnapGene / Geneious – Visualizes and annotates plasmid maps

a. Plasmid Identification with Plasmer

Plasmer uses machine learning to classify sequences as chromosomal or plasmid-derived. GitHub: https://github.com/nekokoe/Plasmer

Installation

Option 1: Using Conda (Recommended)

conda install -c iskoldt -c bioconda -c conda-forge -c defaults plasmer

Running Plasmer

Plasmer -g input.fasta \
  -p output_prefix \
  -d /path/to/plasmer_db \
  -t 16 \
  -m 500 \
  -l 500000 \
  -o output_directory

Key Parameters

● -g, --genome – Input FASTA file [required]
● -p, --prefix – Prefix for output files [Default: output]
● -d, --db – Path to Plasmer databases [required]
● -t, --threads – Number of threads [Default: 8]
● -m, --minimum_length – Minimum sequence length (bp) [Default: 500]
● -l, --length – Chromosome length threshold (bp) [Default: 500000]
● -o, --outpath – Output directory [required]

What Plasmer Provides
✔ Plasmid vs. chromosome classification
✔ Machine learning-based prediction
✔ Suitable for draft assemblies with multiple contigs

b. Plasmid Replicon Detection with PlasmidFinder

PlasmidFinder identifies plasmid replicons via BLAST against a curated database.

Installation

conda install -c bioconda plasmidfinder

Running PlasmidFinder

plasmidfinder.py -i genome.fasta -o plasmidfinder_out -p /path/to/plasmidfinder_db

c. Plasmid Assembly with SPAdes

If plasmids are present in the raw sequencing data, plasmidSPAdes can assemble them separately.

Running plasmidSPAdes

spades.py --plasmid \
  -1 reads_R1.fastq.gz \
  -2 reads_R2.fastq.gz \
  -o plasmid_assembly \
  -t 16

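Key Output

● scaffolds.fasta (and contigs.fasta) → assembled putative plasmid sequences; plasmidSPAdes uses the standard SPAdes output file names, with a component label appended to each sequence name

d. Plasmid Annotation with Prokka

Once plasmids are assembled or extracted, annotate them with Prokka.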

prokka plasmid.fasta \
  --outdir plasmid_annotation \
  --prefix plasmid \
  --compliant

e. Plasmid Visualization with SnapGene

For circular plasmid maps and publication-quality figures:

● Import the annotated GenBank file into SnapGene
● Add features (AMR genes, replicons, origins)
● Export the circular map as PNG or SVG

Alternative: use Geneious or custom scripts with Biopython and matplotlib.

[Figure: plasmid]

The image shows a multidrug-resistance plasmid annotated from a whole genome of Salmonella sp.

🧬 Step 6: Genome Circularization & Topology Validation

Complete genomes should be:
● Circularized – chromosomes form closed loops
● Topology-validated – no misassemblies or chimeric joins

This is especially important for:
● Plasmid characterization
● Complete reference genomes
● Comparing assembly quality

Tools for Topology Validation
🔹 Unicycler – Includes circularization detection
🔹 CheckM / CheckM2 – Flags genome completeness issues
🔹 Bandage – Visualizes assembly graphs
🔹 Custom scripts – Check for overlapping contig termini
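
As an example of the custom-script option, the sketch below checks whether the end of a contig repeats its start, which is one common signature of a circular molecule that was assembled as a linear sequence. It is a simplified illustration that looks for exact matches only; real assemblies may need an alignment-based check that tolerates mismatches. The sequences and the overlap limits are hypothetical.

```python
def terminal_overlap(sequence, min_overlap=20, max_overlap=1000):
    """Return the length of the longest exact end-to-start overlap, or 0 if none."""
    seq = sequence.upper()
    # An overlap cannot sensibly exceed half of the contig.
    longest = min(max_overlap, len(seq) // 2)
    for k in range(longest, min_overlap - 1, -1):
        if seq[-k:] == seq[:k]:
            return k
    return 0


# Hypothetical contigs: the first repeats its 20 bp start at its end.
repeat = "ACGTTGCAAGCTTGGCCATA"
circular_like = repeat + "CCCCCCCCCCGGGGGGGGGG" + repeat
linear_like = repeat + "CCCCCCCCCCGGGGGGGGGG" + "TTTTTTTTTT"

print(terminal_overlap(circular_like))
print(terminal_overlap(linear_like))
```

A non-zero result suggests the contig may be circular and that the duplicated end should be trimmed before annotation. Some assemblers already trim this overlap and report circularity themselves, so an overlap of zero does not by itself mean the replicon is linear.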

For more details, please see: https://jojyjohn28.github.io/blog/genome-topology-and-genome-report/

📊 Scaling Comparative Analyses to 100+ Genomes

For large genome collections:
✔ Use PanX for genome-wide gene presence/absence
✔ Run antiSMASH + BiG-SCAPE for biosynthetic diversity
✔ Screen all genomes with Abricate for AMR/virulence
✔ Detect plasmids with Plasmer + PlasmidFinder
✔ Aggregate results into summary tables or heatmaps

This enables:
● Functional trait mapping onto phylogenies
● Identification of high-risk or high-value strains
● Comparative metabolic and biosynthetic potential
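
The aggregation step can be as simple as merging per-analysis counts into one table with a row per genome. The sketch below assumes each analysis has already been reduced to a `{genome: value}` dictionary. The column names and numbers are hypothetical, and missing values are written as NA so that "not detected" and "not run" stay distinguishable.

```python
import csv
import io


def build_trait_table(trait_sources):
    """Merge per-analysis {genome: value} dicts into one TSV string."""
    genomes = sorted({g for values in trait_sources.values() for g in values})
    columns = list(trait_sources)
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["genome"] + columns)
    for genome in genomes:
        # NA marks genomes for which an analysis produced no entry.
        writer.writerow([genome] + [trait_sources[c].get(genome, "NA") for c in columns])
    return buffer.getvalue()


# Hypothetical per-genome counts from three of the analyses in this workflow.
traits = {
    "amr_genes": {"isolate1": 3, "isolate2": 0},
    "bgc_regions": {"isolate1": 12, "isolate2": 9, "isolate3": 15},
    "plasmid_replicons": {"isolate1": 1},
}
table = build_trait_table(traits)
print(table)
```

The resulting TSV can be loaded into a spreadsheet, R, or a plotting library to draw heatmaps or to annotate a phylogeny with trait columns.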

🧠 Key Takeaways from Day 5

● Pangenome analysis reveals core vs. variable gene content
● antiSMASH identifies biosynthetic potential for novel compounds
● Abricate rapidly screens for AMR, virulence, and plasmids
● Plasmid detection and characterization reveal mobile genetic elements
● Topology validation ensures genome completeness
● Comparative genomics bridges sequence data and biological insight

🔜 What's Next?

With comparative and functional analyses complete, you now have:
● Annotated, taxonomically placed genomes
● Functional trait distributions
● AMR and biosynthetic potential
● Pangenome structure

These datasets can be integrated with:
● Metagenomics (for community-level insights)
● Transcriptomics (for gene expression validation)
● Metabolomics (for linking genes to metabolites)
● Ecological metadata (for trait-environment correlations)

All code for Day 5 is available in the day5 directory of: https://github.com/jojyjohn28/whole-genome-sequencing-analysis