Whole Genome Sequencing - Day 5: Comparative & Downstream Genomic Analyses

- In Day 1, we cleaned raw reads.
- In Day 2, we assembled genomes and evaluated quality.
- In Day 3, we placed genomes in taxonomic and phylogenetic context.
- In Day 4, we annotated genomes and identified functional potential.
Today, in Day 5, we shift from individual genomes to comparative insights: How do genomes differ? What unique functions do they encode? What risks or opportunities do they present? This step transforms genome collections into ecological and biotechnological insights by comparing gene content, identifying biosynthetic potential, and screening for clinically relevant features.
🎯 Goal of Day 5
Move from single genomes to comparative and ecological insights. Specifically, we aim to:
- Define core and accessory genomes across collections
- Identify secondary metabolite biosynthetic gene clusters
- Screen for antimicrobial resistance and virulence factors
- Detect and characterize plasmids
- Validate genome topology and circularization

This workflow scales from small genome sets to hundreds of isolates or MAGs.
🧪 Comparative Genomics Strategy
Comparative analyses work best when structured around specific biological questions:
| Analysis | Question |
|---|---|
| Pangenome analysis | What genes are shared vs. strain-specific? |
| Secondary metabolite screening | What biosynthetic potential exists? |
| Pathogen screening | Are there AMR genes or virulence factors? |
| Plasmid detection | What mobile genetic elements are present? |
| Topology validation | Are genomes complete and circularized? |
🧠 Why Comparative Genomics Matters
Individual genomes tell us what one organism can do. Comparative genomics reveals:
- Functional diversity within populations
- Evolutionary adaptation strategies
- Horizontal gene transfer and mobile elements
- Biotechnological potential (enzymes, metabolites)
- Public health risks (AMR, virulence)

Comparative genomics is essential for:
- Understanding strain-level variation
- Identifying biomarkers or diagnostic targets
- Prioritizing genomes for downstream work
- Linking genomic diversity to ecological roles
🧰 Tools Used in Day 5
🔹 PanX: pangenome analysis and visualization
🔹 antiSMASH: secondary metabolite biosynthetic gene cluster prediction
🔹 BiG-SCAPE: biosynthetic gene cluster similarity networks
🔹 Abricate: rapid screening for AMR genes, virulence factors, and plasmids
🔹 PLSDB / PlasmidFinder: plasmid detection and characterization
🔹 Plasmer: machine learning-based plasmid sequence identification
🔹 SPAdes (plasmidSPAdes mode): targeted plasmid assembly
🔹 SnapGene / Geneious: plasmid visualization and annotation

Each tool addresses a different dimension of comparative and applied genomics.
🧬 Step 1: Pangenome Analysis with PanX
Pangenome analysis partitions genes into:
- Core genome: genes present in all strains
- Accessory genome: genes present in some strains
- Unique genes: strain-specific genes

This reveals functional diversity, niche adaptation, and horizontal gene transfer.
For more details about PanX, from installation to running it, please visit: https://jojyjohn28.github.io/blog/panx-pangenome-analysis/
You can also extract core and pan genes without visualization using a custom Python script. See core_pan_from_list.py in https://github.com/jojyjohn28/whole-genome-sequencing-analysis/day5
What PanX Tells Us
- Which genes are universally conserved (core metabolism)
- Which genes are variable (niche-specific functions)
- Gene gain/loss patterns across phylogeny
- Functional stratification within populations

Pangenomes are especially powerful when combined with Day 3 phylogenies and Day 4 functional annotations.
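As a rough illustration of the core / accessory / unique split described above, here is a minimal Python sketch. It is not PanX's algorithm and not the repository's core_pan_from_list.py; it simply assumes you already have, for each strain, the set of gene-family IDs it carries. The function name, strain names, and gene names are made up for the example.

```python
from collections import Counter


def partition_pangenome(gene_sets):
    """Split gene families into core, accessory, and unique sets.

    gene_sets maps a strain name to the set of gene-family IDs found in it
    (hypothetical input; clustering genes into families is a separate step).
    """
    n_strains = len(gene_sets)
    # Count in how many strains each gene family occurs.
    counts = Counter(g for genes in gene_sets.values() for g in set(genes))
    core = {g for g, c in counts.items() if c == n_strains}
    # "Unique" only makes sense when there is more than one strain.
    unique = {g for g, c in counts.items() if c == 1 and n_strains > 1}
    accessory = set(counts) - core - unique
    return core, accessory, unique


# Toy example with invented gene names.
strains = {
    "strainA": {"dnaA", "gyrB", "blaTEM"},
    "strainB": {"dnaA", "gyrB", "nrps1"},
    "strainC": {"dnaA", "gyrB", "nrps1"},
}
core, accessory, unique = partition_pangenome(strains)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
print("unique:", sorted(unique))
```

Real pangenome tools often relax the "present in all strains" rule (for example, to tolerate incomplete draft assemblies), so treat the strict definition here as a simplification.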
🧬 Step 2: Secondary Metabolite Screening with antiSMASH
Secondary metabolites include antibiotics, toxins, pigments, and signaling molecules. Many are encoded by biosynthetic gene clusters (BGCs). antiSMASH predicts BGCs and compares them to known compounds.
# Install antiSMASH via conda
conda install -c bioconda antismash
Running antiSMASH
antismash genome.gbk \
--output-dir antismash_out \
--genefinding-tool prodigal \
--cpus 16
If you have only one or a few genomes and are not comfortable running the stand-alone version, antiSMASH is also available as a web-based platform: https://antismash.secondarymetabolites.org/#!/start
Key Outputs
- HTML report with interactive BGC visualization
- GenBank files with annotated BGCs
- Predicted compound classes (NRPS, PKS, terpene, etc.)

What antiSMASH Reveals
- Biosynthetic potential for novel metabolites
- Known BGCs with compound matches
- Evolutionary conservation of BGCs across strains
🧬 Step 3: BGC Similarity Networks with BiG-SCAPE
When you have many genomes, BiG-SCAPE clusters similar BGCs into gene cluster families (GCFs). This enables:
- Cross-genome BGC comparison
- Identification of conserved vs. unique BGCs
- Prioritization of novel clusters
Installation
# Install BiG-SCAPE
conda install -c bioconda bigscape
Running BiG-SCAPE
bigscape -i antismash_out/ \
-o bigscape_out \
--pfam_dir /path/to/pfam \
--mode auto \
--cutoffs 0.3 0.5 0.7
Key Outputs
- Network diagrams of BGC families
- GCF assignments for each genome
- Distance matrices for comparative analysis
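To build intuition for what a distance cutoff (such as the values passed to --cutoffs) does, the sketch below groups BGCs whose pairwise distance falls at or below a threshold. This is a deliberately simplified stand-in: it uses plain connected components (union-find), whereas BiG-SCAPE's own family calling is more involved. The function name, BGC names, and distances are invented for the example.

```python
def group_by_cutoff(distances, cutoff):
    """Group BGCs into families using connected components.

    distances maps (bgc_a, bgc_b) pairs to a distance in [0, 1];
    two BGCs are linked when their distance is <= cutoff.
    """
    parent = {}

    def find(x):
        # Register unseen nodes and follow parent links to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), d in distances.items():
        root_a, root_b = find(a), find(b)
        if d <= cutoff and root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    families = {}
    for node in list(parent):
        families.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in families.values())


# Invented pairwise distances between four BGCs.
example = {
    ("bgc1", "bgc2"): 0.20,
    ("bgc2", "bgc3"): 0.45,
    ("bgc3", "bgc4"): 0.90,
}
for cutoff in (0.3, 0.5, 0.7):
    print(cutoff, group_by_cutoff(example, cutoff))
```

A stricter (lower) cutoff splits BGCs into more, smaller families; a looser one merges them, which is why running several cutoffs side by side is useful.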
BiG-SCAPE + antiSMASH together provide a genome-scale view of biosynthetic diversity. 
This figure shows a new strain, Streptomyces SOSIST-3, isolated from the Southern Ocean; the manuscript has been accepted in Molecular Biology Reports. For more details, see: https://jojyjohn28.github.io/collaborations/ and https://jojyjohn28.github.io/publications/
🧬 Step 4: Pathogen Screening with Abricate
Whole genomes are typically screened against two types of databases to assess their pathogenic potential: (i) antibiotic resistance genes (ARGs) and (ii) virulence genes.
Tool: Abricate
Abricate (https://github.com/tseemann/abricate) is a mass screening tool that rapidly identifies antimicrobial resistance and virulence genes by running BLAST searches of your contigs against multiple curated databases.
Available Databases in Abricate
Abricate supports the following databases:
- NCBI: NCBI Bacterial Antimicrobial Resistance Reference Gene Database
- CARD: Comprehensive Antibiotic Resistance Database
- ARG-ANNOT: Antibiotic Resistance Gene-ANNOTation
- ResFinder: acquired antimicrobial resistance genes
- MEGARES: MEGARes resistance database
- EcOH: E. coli O- and H-antigen (serotyping) genes
- PlasmidFinder: plasmid replicons
- Ecoli_VF: E. coli virulence factors
- VFDB: Virulence Factor Database
Database Selection Strategy
For comprehensive screening, use the following databases:
- For ARGs (antibiotic resistance genes): ARG-ANNOT, CARD, ResFinder
- For virulence genes: VFDB (Virulence Factor Database)
Installation
conda install -c bioconda abricate
Running Abricate
Screen for ARGs
# Using CARD database
abricate --db card genome.fasta > genome_card.tab
# Using ResFinder database
abricate --db resfinder genome.fasta > genome_resfinder.tab
# Using ARG-ANNOT database
abricate --db argannot genome.fasta > genome_argannot.tab
Screen for Virulence Factors
# Using VFDB database
abricate --db vfdb genome.fasta > genome_vfdb.tab
Screen for Plasmid Replicons
abricate --db plasmidfinder genome.fasta > genome_plasmids.tab
Scaling to Multiple Genomes
# Batch screening for ARGs (CARD example)
# Variables are quoted so file names containing spaces do not break the loop
for genome in *.fasta; do
    abricate --db card "$genome" > "${genome%.fasta}_card.tab"
done
# Batch screening for virulence factors
for genome in *.fasta; do
    abricate --db vfdb "$genome" > "${genome%.fasta}_vfdb.tab"
done
# Summarize results across all genomes
abricate --summary *_card.tab > card_summary.tab
abricate --summary *_vfdb.tab > vfdb_summary.tab
Comprehensive Screening Workflow
For complete pathogen characterization, screen each genome against all relevant databases. An example bash script (comp_screening.sh) is included in https://github.com/jojyjohn28/whole-genome-sequencing-analysis/day5
What Abricate Provides
- Presence/absence of resistance genes across multiple databases
- Virulence factor distribution and identification
- Plasmid-associated gene detection
- Gene coverage and identity statistics
- Rapid, scalable screening for large datasets

Using multiple databases (CARD, ResFinder, ARG-ANNOT) for ARG screening gives more comprehensive detection, because different databases may capture different resistance mechanisms or gene variants.
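The coverage and identity statistics in each Abricate table can be used to keep only confident hits before comparing databases. The sketch below is one way to do that in Python. It assumes the tab-separated report uses the column names GENE, DATABASE, %COVERAGE, and %IDENTITY (as in recent Abricate releases; check the header of your own files). The thresholds, the function name, and the two sample rows are illustrative choices, not recommendations or real results.

```python
import csv
import io


def confident_hits(tab_text, min_identity=90.0, min_coverage=80.0):
    """Return (gene, database, %coverage, %identity) for hits passing both
    thresholds. The default thresholds are arbitrary example values."""
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    kept = []
    for row in reader:
        identity = float(row["%IDENTITY"])
        coverage = float(row["%COVERAGE"])
        if identity >= min_identity and coverage >= min_coverage:
            kept.append((row["GENE"], row["DATABASE"], coverage, identity))
    return kept


# Made-up report with one full-length hit and one partial hit.
columns = ["#FILE", "SEQUENCE", "START", "END", "STRAND", "GENE", "COVERAGE",
           "COVERAGE_MAP", "GAPS", "%COVERAGE", "%IDENTITY", "DATABASE",
           "ACCESSION", "PRODUCT", "RESISTANCE"]
rows = [
    ["genome.fasta", "contig_1", "100", "960", "+", "blaTEM-1", "1-861/861",
     "===============", "0/0", "100.00", "99.88", "card", "X", "beta-lactamase", "penam"],
    ["genome.fasta", "contig_4", "10", "500", "-", "tet(A)", "1-491/1200",
     "======.........", "0/0", "40.92", "97.10", "card", "Y", "efflux pump", "tetracycline"],
]
sample = "\n".join("\t".join(r) for r in [columns] + rows) + "\n"

for hit in confident_hits(sample):
    print(hit)
```

In practice you would read each *_card.tab, *_resfinder.tab, and *_argannot.tab file with open(path).read() and compare the gene sets that survive the filter.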
🧬 Step 5: Plasmid Detection & Characterization
Plasmids carry mobile genes involved in:
- Antibiotic resistance
- Virulence
- Metabolic versatility
- Horizontal gene transfer

Detecting and characterizing plasmids helps us understand genome plasticity and ecological adaptation.
Tools for Plasmid Detection
🔹 Plasmer: plasmid identification
🔹 PlasmidFinder: database-based replicon detection
🔹 Prokka: annotates plasmid sequences
🔹 SPAdes (plasmidSPAdes mode): assembles plasmids from raw reads
🔹 SnapGene / Geneious: visualizes and annotates plasmid maps

a. Plasmid Identification with Plasmer
Plasmer uses machine learning to classify sequences as chromosomal or plasmid-derived. GitHub: https://github.com/nekokoe/Plasmer
Installation
Option 1: Using Conda (Recommended)
conda install -c iskoldt -c bioconda -c conda-forge -c defaults plasmer
Running Plasmer
Plasmer -g input.fasta \
-p output_prefix \
-d /path/to/plasmer_db \
-t 16 \
-m 500 \
-l 500000 \
-o output_directory
Key Parameters
| Flag | Description | Default |
|---|---|---|
| -g, --genome | Input FASTA file | required |
| -p, --prefix | Prefix for output files | output |
| -d, --db | Path to Plasmer databases | required |
| -t, --threads | Number of threads | 8 |
| -m, --minimum_length | Minimum sequence length (bp) | 500 |
| -l, --length | Chromosome length threshold (bp) | 500000 |
| -o, --outpath | Output directory | required |

What Plasmer Provides
- Plasmid vs. chromosome classification
- Machine learning-based prediction
- Suitable for draft assemblies with multiple contigs
b. Plasmid Replicon Detection with PlasmidFinder
PlasmidFinder identifies plasmid replicons by running BLAST against a curated database.
Installation
conda install -c bioconda plasmidfinder
Running PlasmidFinder
plasmidfinder.py -i genome.fasta -o plasmidfinder_out -p /path/to/plasmidfinder_db
c. Plasmid Assembly with SPAdes
If plasmids are present in raw sequencing data, plasmidSPAdes can assemble them separately.
Running plasmidSPAdes
spades.py --plasmid \
-1 reads_R1.fastq.gz \
-2 reads_R2.fastq.gz \
-o plasmid_assembly \
-t 16
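Key Output: the assembled putative plasmid sequences, written to contigs.fasta and scaffolds.fasta inside plasmid_assembly/. In plasmid mode, SPAdes appends a component label (component_N) to sequence names, grouping contigs that likely belong to the same plasmid.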
d. Plasmid Annotation with Prokka
Once plasmids are assembled or extracted, annotate them with Prokka.
prokka plasmid.fasta \
--outdir plasmid_annotation \
--prefix plasmid \
--compliant
e. Plasmid Visualization with SnapGene
For circular plasmid maps and publication-quality figures:
- Import the annotated GenBank file into SnapGene
- Add features (AMR genes, replicons, origins)
- Export the circular map as PNG or SVG

Alternative: Use Geneious or custom scripts with Biopython and matplotlib.

The image shows a multidrug-resistance plasmid annotated from a whole genome of Salmonella sp.
🧬 Step 6: Genome Circularization & Topology Validation
Complete genomes should be:
- Circularized: chromosomes form closed loops
- Topology-validated: no misassemblies or chimeric joins

This is especially important for:
- Plasmid characterization
- Complete reference genomes
- Comparing assembly quality

Tools for Topology Validation
🔹 Unicycler: includes circularization detection
🔹 CheckM / CheckM2: flags genome completeness issues
🔹 Bandage: visualizes assembly graphs
🔹 Custom scripts: check for overlapping contig termini

For more details, please see: https://jojyjohn28.github.io/blog/genome-topology-and-genome-report/
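The "overlapping contig termini" check mentioned above can be sketched in a few lines of Python. The idea is that an assembler sometimes emits a circular molecule as a linear contig whose end repeats its start. This minimal version only looks for an exact match between the two ends; the function name and the default overlap limits are assumptions for illustration, and real data may need mismatch-tolerant alignment instead.

```python
def terminal_overlap(seq, min_overlap=20, max_overlap=2000):
    """Return the length of the longest exact match between the start and
    the end of seq, or 0 if no match of at least min_overlap exists."""
    seq = seq.upper()
    # An overlap cannot sensibly exceed half of the contig.
    longest = min(max_overlap, len(seq) // 2)
    for k in range(longest, min_overlap - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


# Toy contigs (invented sequences): the first repeats its 20-bp start at its end.
repeat = "ACGGATCCGATTACAGGCCA"
circular_like = repeat + "T" * 50 + repeat
linear_like = repeat + "T" * 50 + "GGGGGCCCCCAAAAAGGGGG"

print(terminal_overlap(circular_like))
print(terminal_overlap(linear_like))
```

A non-zero result is only a hint of circularity; the assembly graph (for example, viewed in Bandage) and read mapping across the junction are still worth checking.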
Scaling Comparative Analyses to 100+ Genomes
For large genome collections:
- Use PanX for genome-wide gene presence/absence
- Run antiSMASH + BiG-SCAPE for biosynthetic diversity
- Screen all genomes with Abricate for AMR/virulence
- Detect plasmids with Plasmer + PlasmidFinder
- Aggregate results into summary tables or heatmaps

This enables:
- Functional trait mapping onto phylogenies
- Identification of high-risk or high-value strains
- Comparative metabolic and biosynthetic potential
🧠 Key Takeaways from Day 5
- Pangenome analysis reveals core vs. variable gene content
- antiSMASH identifies biosynthetic potential for novel compounds
- Abricate rapidly screens for AMR, virulence, and plasmids
- Plasmid detection and characterization reveal mobile genetic elements
- Topology validation ensures genome completeness
- Comparative genomics bridges sequence data and biological insight
What's Next?
With comparative and functional analyses complete, you now have:
- Annotated, taxonomically placed genomes
- Functional trait distributions
- AMR and biosynthetic potential
- Pangenome structure

These datasets can be integrated with:
- Metagenomics (for community-level insights)
- Transcriptomics (for gene expression validation)
- Metabolomics (for linking genes to metabolites)
- Ecological metadata (for trait-environment correlations)

All code is available in the day5 directory of: https://github.com/jojyjohn28/whole-genome-sequencing-analysis