Whole Genome Sequencing – Day 4: Genome Annotation & Functional Potential
- In Day 1, we cleaned raw reads.
- In Day 2, we assembled genomes and evaluated quality and topology.
- In Day 3, we placed genomes in a taxonomic and phylogenetic context.
Today, in Day 4, we answer the most biologically meaningful question:
What can this genome actually do?
This step transforms assembled genomes into functional blueprints by identifying genes, protein domains, pathways, and metabolic capabilities.
🎯 Goal of Day 4
Identify genes, pathways, and metabolic potential encoded in genomes.
Specifically, we aim to:
- Predict protein-coding genes
- Assign functional annotations
- Reconstruct metabolic pathways
- Compare functional potential across many genomes
This workflow is designed for isolate genomes and MAG collections, and scales cleanly to hundreds of genomes.
🧪 Annotation Strategy: A Layered Approach
Genome annotation works best when done in layers, each answering a different question:

| Layer | Question |
| --- | --- |
| Gene prediction | Where are the genes? |
| Rapid annotation | What are they likely doing? |
| Domain annotation | What conserved functions do they encode? |
| Pathway annotation | How do genes connect into metabolism? |
| Trait screening | What ecological traits are present? |
🧠 Why Genome Annotation Matters
A genome sequence without annotation is just a long string of nucleotides.
Genome annotation enables us to:
- Interpret ecological strategies
- Predict metabolic roles
- Identify functional redundancy or specialization
- Link taxonomy to ecosystem function

Good annotation is the foundation for:
- Comparative genomics
- Metabolic modeling
- Transcriptomic integration
- Functional redundancy analyses
🧪 Annotation Strategy Overview
Genome annotation is best approached in layers, not with a single tool.
Layered annotation strategy used here:
1️⃣ Gene prediction → Where are the genes?
2️⃣ Rapid annotation → What are they likely doing?
3️⃣ Domain-based annotation → What functional motifs do proteins contain?
4️⃣ Pathway-level reconstruction → How do genes connect into metabolism?
🧰 Tools Used in Day 4
🔹 Prokka
Rapid, genome-scale annotation pipeline for bacterial and archaeal genomes.
🔹 Prodigal
Accurate gene prediction engine used internally by many annotation tools.
🔹 InterProScan
Protein domain and motif identification across multiple databases.
🔹 DRAM
High-level metabolic and pathway-centric genome annotation.
Each tool complements the others rather than replacing them.
🧬 Step 1: Gene Prediction with Prodigal
Gene prediction identifies open reading frames (ORFs) and defines the basic gene catalog.
prodigal \
-i genome.fasta \
-a genome.faa \
-d genome.fna \
-o genome.genes.gbk \
-p single
Key outputs
- faa → predicted proteins
- fna → nucleotide CDS
- gbk → gene coordinates

Prodigal is:
- Fast
- Accurate
- Well-suited for isolates and high-quality MAGs
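To make the `.faa` output more concrete, here is a minimal Python sketch that reads Prodigal-style protein records. It assumes the usual Prodigal 2.x header layout (`>id # start # end # strand # attributes`) and strips the trailing `*` stop symbol, which some downstream tools (InterProScan among them) tend to reject. The function names and the toy records are illustrative assumptions, not part of the original workflow.

```python
import io


def _record(header, chunks):
    # Assumed header layout: "contig_1_1 # 3 # 98 # 1 # ID=1_1;partial=00;..."
    gene_id, start, end, strand = [f.strip() for f in header.split("#")[:4]]
    # Prodigal marks the stop codon with '*'; drop it for downstream tools.
    protein = "".join(chunks).rstrip("*")
    return gene_id, int(start), int(end), int(strand), protein


def parse_prodigal_faa(handle):
    """Yield (gene_id, start, end, strand, protein) for each FASTA record."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)


# Toy input standing in for genome.faa (sequences are made up).
example = io.StringIO(
    ">contig_1_1 # 3 # 98 # 1 # ID=1_1;partial=00\n"
    "MKTAYIAKQR*\n"
    ">contig_1_2 # 150 # 300 # -1 # ID=1_2;partial=00\n"
    "MSLNE\n"
    "LLKR*\n"
)
genes = list(parse_prodigal_faa(example))
for gene in genes:
    print(gene)
```

For a real genome, replace the `io.StringIO` object with `open("genome.faa")`.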
🧬 Step 2: Rapid Genome Annotation with Prokka
Prokka wraps gene prediction and functional assignment into a single, fast pipeline.
prokka genome.fasta \
--outdir prokka_out \
--prefix genome \
--cpus 16
What Prokka provides
- Gene names and product descriptions
- rRNA, tRNA, tmRNA prediction
- GFF, GenBank, and protein FASTA files

Prokka is ideal for:
- First-pass annotation
- NCBI submissions
- Comparative gene counts
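For the comparative gene counts mentioned above, a small sketch like the following can tally feature types from a Prokka GFF file. It assumes standard nine-column GFF3 lines and that the contig sequences follow a `##FASTA` marker, as Prokka's GFF output normally does. The toy lines and the function name are illustrative.

```python
import io
from collections import Counter


def count_feature_types(gff_handle):
    """Count GFF3 feature types (column 3), skipping comments and the FASTA tail."""
    counts = Counter()
    for line in gff_handle:
        if line.startswith("##FASTA"):
            break  # everything after this marker is sequence, not annotation
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9:
            counts[fields[2]] += 1
    return counts


# Toy GFF content standing in for prokka_out/genome.gff (values are made up).
toy_gff = io.StringIO(
    "##gff-version 3\n"
    "contig_1\tprokka\tCDS\t3\t98\t.\t+\t0\tID=G_00001;product=hypothetical protein\n"
    "contig_1\tprokka\tCDS\t150\t300\t.\t-\t0\tID=G_00002;product=DNA gyrase subunit A\n"
    "contig_1\tprokka\ttRNA\t400\t475\t.\t+\t.\tID=G_00003;product=tRNA-Ala\n"
    "contig_1\tprokka\trRNA\t600\t2100\t.\t+\t.\tID=G_00004;product=16S ribosomal RNA\n"
    "##FASTA\n"
    ">contig_1\n"
    "ATGCATGC\n"
)
feature_counts = count_feature_types(toy_gff)
print(dict(feature_counts))
```

Running the same function over every genome's GFF gives a quick table of CDS, tRNA, and rRNA counts for comparison.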
🧬 Step 3: Protein Domain Annotation with InterProScan
While Prokka assigns gene names, InterProScan identifies conserved protein domains, which is critical for:
- Novel proteins
- Hypothetical genes
- Functional inference beyond BLAST hits
interproscan.sh \
-i genome.faa \
-o genome.interpro.tsv \
-f TSV \
-dp \
--cpu 32
Why domain-based annotation matters
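- Domains are more conserved than full-length proteins
- Enables functional inference for unknown genes
- Links proteins to GO terms and enzyme classes (GO terms appear in the output only when GO lookup is enabled, for example by adding -goterms to the command)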
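The InterProScan TSV has no header row, so it helps to name the columns before working with it. The sketch below assumes the column order documented for InterProScan 5 (the GO and pathway columns are present only when those lookups are enabled) and groups Pfam accessions by protein. The toy rows and helper names are illustrative.

```python
import csv
import io
from collections import defaultdict

# Assumed InterProScan 5 TSV column order; the last two columns are optional.
COLUMNS = [
    "protein", "md5", "length", "analysis", "signature_acc",
    "signature_desc", "start", "stop", "score", "status", "date",
    "interpro_acc", "interpro_desc", "go_terms", "pathways",
]


def read_interproscan_tsv(handle):
    """Return one dict per match, padding missing optional columns with ''."""
    records = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue
        padded = fields + [""] * (len(COLUMNS) - len(fields))
        records.append(dict(zip(COLUMNS, padded)))
    return records


# Toy rows standing in for genome.interpro.tsv (values are made up).
toy_tsv = io.StringIO(
    "gene_1\tabc\t250\tPfam\tPF00535\tGlycosyl transferase family 2\t10\t180\t1.2E-30\tT\t01-01-2024\tIPR001173\tGlycosyltransferase 2-like\n"
    "gene_1\tabc\t250\tGene3D\tG3DSA:3.90.550.10\t-\t5\t200\t3.0E-40\tT\t01-01-2024\t-\t-\n"
    "gene_2\tdef\t310\tPfam\tPF00005\tABC transporter\t20\t170\t4.5E-25\tT\t01-01-2024\tIPR003439\tABC transporter-like\n"
)
records = read_interproscan_tsv(toy_tsv)

# Collect the Pfam accessions found on each protein.
domains_by_protein = defaultdict(set)
for rec in records:
    if rec["analysis"] == "Pfam":
        domains_by_protein[rec["protein"]].add(rec["signature_acc"])

for protein, domains in sorted(domains_by_protein.items()):
    print(protein, sorted(domains))
```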
🧬 Step 4: Metabolic & Pathway Annotation with DRAM
DRAM integrates multiple databases to reconstruct genome-scale metabolic potential.
DRAM annotation
DRAM.py annotate \
-i genome.fasta \
-o dram_out \
--threads 32
DRAM distillation (summaries)
DRAM.py distill \
-i dram_out/annotations.tsv \
-o dram_out/genome_summaries
DRAM excels at:
- Central carbon metabolism
- Nitrogen, sulfur, hydrogen pathways
- CAZymes and polysaccharide utilization
- Transporters and redox systems

This moves annotation from gene lists → biological interpretation.
🧠 Focus Areas: Core Metabolic Pathways
DRAM distillation gives details on core metabolic potential, including:
🔹 Energy metabolism
Glycolysis
TCA cycle
Oxidative phosphorylation
Fermentation pathways
🔹 Carbon utilization
Sugar transporters
Polysaccharide degradation
Organic acid metabolism
🔹 Nitrogen & sulfur metabolism
Nitrate/nitrite reduction
Amino acid biosynthesis
Sulfur assimilation
These pathways define how organisms:
- Acquire energy
- Use resources
- Interact with their environment
Scaling Annotation to 100+ Genomes
For large genome sets:
- Run Prokka and Prodigal in batch mode
- Use DRAM summaries for comparison tables
- Aggregate pathway presence/absence matrices
- Combine with Day 3 taxonomy and phylogeny
This enables:
- Functional clustering
- Ecological stratification
- Trait-based comparisons
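The presence/absence aggregation can be sketched with pandas. The example assumes the per-genome annotations have already been collected into a long table with one row per (genome, KO) pair. The column names `genome` and `ko_id` and the KO values are placeholders, not DRAM's actual column names.

```python
import pandas as pd

# Toy long-format table: one row per annotated gene (placeholder values).
long_table = pd.DataFrame(
    {
        "genome": ["G1", "G1", "G2", "G3", "G3", "G3"],
        "ko_id": ["K00001", "K00002", "K00001", "K00001", "K00003", "K00003"],
    }
)

# Count KO occurrences per genome, then collapse counts to presence (1) / absence (0).
counts = pd.crosstab(long_table["genome"], long_table["ko_id"])
presence = (counts > 0).astype(int)
print(presence)

# Functions found in every genome (a simple "core" set).
shared_by_all = presence.columns[presence.all(axis=0)].tolist()
print("Shared by all genomes:", shared_by_all)
```

The resulting matrix (genomes as rows, functions as columns) is a convenient input for clustering or for joining with the Day 3 taxonomy table.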

Trait-Based Screening: From Annotation to Ecology
Once annotation is complete, we can screen genomes for specific ecological traits, such as:
- Biofilm formation
- EPS (exopolysaccharide) production
- Motility
- Stress response
- Host or particle association
This is especially powerful for comparative genomics and functional redundancy analyses.
🧬 Example 1: Screening for Biofilm-Related Genes (InterProScan)
Step 1: Create a reference keyword list
Create a simple table (CSV or Excel) with keywords or domain IDs:
Keyword
biofilm
adhesion
fimbrial
pilus
curli
polysaccharide-binding
This can also include:
- Pfam IDs
- GO terms
- InterPro IDs
Step 2: Filter InterProScan results using pandas
Filtering the annotation file against the reference file with filter.py produces a genome-specific list of candidate biofilm genes. All code is available at: https://github.com/jojyjohn28/whole-genome-sequencing-analysis
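To show the general idea (this is a minimal sketch, not the repository's filter.py), the snippet below filters a table of domain descriptions with the keyword list from Step 1. The DataFrame, its column names, and the toy descriptions are assumptions for illustration. In practice the table would come from the InterProScan TSV.

```python
import re

import pandas as pd

keywords = ["biofilm", "adhesion", "fimbrial", "pilus", "curli", "polysaccharide-binding"]
# Escape each keyword so characters like '-' are matched literally.
pattern = "|".join(re.escape(k) for k in keywords)

# Toy annotation table (made-up rows; None mimics a missing description).
annotations = pd.DataFrame(
    {
        "protein": ["gene_1", "gene_2", "gene_3", "gene_4"],
        "signature_desc": [
            "Fimbrial protein",
            "ABC transporter",
            "Curli production assembly/transport component CsgG",
            None,
        ],
    }
)

# Case-insensitive substring match; rows without a description never match.
mask = annotations["signature_desc"].str.contains(pattern, case=False, na=False)
hits = annotations[mask].assign(
    matched_keyword=lambda df: df["signature_desc"]
    .str.extract(f"({pattern})", flags=re.IGNORECASE, expand=False)
    .str.lower()
)
print(hits)
```

Note that plain substring matching is literal: the keyword "adhesion" would not match a description containing "adhesin", so keyword lists usually need a few spelling variants.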
🧬 Example 2: Screening for EPS-Related Genes (DRAM / KEGG)
DRAM outputs include KEGG IDs, CAZymes, and functional descriptions, making them ideal for EPS screening.
Step 1: Prepare a reference list
Keyword
exopolysaccharide
capsule
glycosyltransferase
alginate
cellulose
levan
🧬 Merging Functional Traits with Quantitative Data
Trait screening becomes even more powerful when merged with abundance or expression data.
A generalized version of this workflow is provided as merge.py in /scripts.
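As a rough illustration of the merge step (a sketch under assumed column names, not the repository's merge.py), trait counts per genome can be joined to an abundance table with pandas. A left join keeps every screened genome, and the indicator column flags genomes that lack abundance data.

```python
import pandas as pd

# Toy inputs (made-up values; column names are assumptions).
traits = pd.DataFrame(
    {"genome": ["G1", "G2", "G3"], "biofilm_genes": [4, 0, 2]}
)
abundance = pd.DataFrame(
    {"genome": ["G1", "G2", "G4"], "relative_abundance": [0.12, 0.30, 0.05]}
)

# Keep all screened genomes; genomes missing from the abundance table get NaN.
merged = traits.merge(abundance, on="genome", how="left", indicator=True)
print(merged)

# Genomes with trait data but no abundance measurement.
missing_abundance = merged.loc[merged["_merge"] == "left_only", "genome"].tolist()
print("No abundance data for:", missing_abundance)
```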
How Keywords Are Used for Functional Screening
Many ecological traits (e.g., biofilm formation, EPS production, adhesion, motility) are not represented by a single gene, but by groups of genes, domains, or pathways. Keyword-based screening provides a flexible, transparent way to identify such traits across genomes.
Below are two common and complementary approaches.
🧠 Option 1: Build Keyword-Based References from Databases
Keywords can be translated into reference identifiers using established databases.
🔹 InterPro / Pfam (protein domains)
Use keywords on the InterPro or Pfam websites to identify relevant domains:
Examples:
- biofilm → adhesion domains, pili, fimbriae
- EPS → glycosyltransferases, polysaccharide biosynthesis proteins

You can then screen InterProScan outputs using:
- InterPro IDs
- Pfam accessions
- Domain descriptions
This approach is robust for trait-level inference, even when gene names are ambiguous.
🔹 KEGG (pathways & enzymes)
KEGG keywords can be used to identify:
- Pathway modules (e.g., polysaccharide biosynthesis)
- Enzyme classes (e.g., glycosyltransferases)
- KO identifiers linked to EPS or biofilm pathways
DRAM outputs already integrate KEGG annotations, making KEGG-based screening especially convenient.
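Once keywords have been translated into identifiers, screening becomes an exact-match lookup rather than a text search, which avoids accidental substring hits. A minimal sketch follows. The reference set and the gene-to-ID mapping are illustrative placeholders, not a curated EPS list.

```python
def screen_by_ids(gene_to_ids, reference_ids):
    """Return {gene: matched IDs} for genes carrying any reference identifier."""
    hits = {}
    for gene, ids in gene_to_ids.items():
        matched = sorted(set(ids) & reference_ids)
        if matched:
            hits[gene] = matched
    return hits


# Placeholder reference list; a real one should be curated from InterPro, Pfam, or KEGG.
eps_reference_ids = {"PF00535", "PF00534"}

# Toy mapping of genes to the identifiers found on them (made-up assignments).
gene_to_ids = {
    "gene_1": ["PF00535", "PF00005"],
    "gene_2": ["PF00005"],
    "gene_3": ["PF00534"],
}

id_hits = screen_by_ids(gene_to_ids, eps_reference_ids)
print(id_hits)
```

The same function works unchanged for KO identifiers taken from DRAM annotations.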
Option 2: Create a Keyword Filter List and Screen Locally
For large genome sets, a practical approach is to create a simple keyword reference file and use it to filter annotation outputs.
Example keyword list (trait_keywords.txt):
biofilm
adhesion
pilus
fimbrial
exopolysaccharide
capsule
glycosyltransferase
alginate
cellulose
This file can be reused across:
- InterProScan outputs
- DRAM annotations
- Prokka product descriptions
Example: Keyword-Based Screening in R
The script screening_traits.R provides a minimal R workflow for screening annotation tables with keywords.
What this does
- Searches annotation descriptions for any keyword match
- Returns a genome-specific list of candidate trait genes
- Scales easily to hundreds of genomes
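For readers who prefer Python, the same three steps can be sketched with the standard library alone. This is an independent illustration, not a translation of screening_traits.R. It assumes tab-separated tables with `locus_tag` and `product` columns (as in Prokka's .tsv output), and the toy tables are made up.

```python
import csv
import io


def load_keywords(handle):
    """Read one keyword per line, skipping blanks; matching is case-insensitive."""
    return [line.strip().lower() for line in handle if line.strip()]


def screen_table(handle, keywords, id_column="locus_tag", text_column="product"):
    """Return (gene id, description, matched keywords) for rows matching any keyword."""
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        text = (row.get(text_column) or "").lower()
        matched = [k for k in keywords if k in text]
        if matched:
            hits.append((row[id_column], row[text_column], matched))
    return hits


# Stand-in for trait_keywords.txt.
keywords = load_keywords(io.StringIO("biofilm\npilus\nglycosyltransferase\n\n"))

# Stand-ins for one annotation table per genome.
toy_tables = {
    "genome_A": (
        "locus_tag\tproduct\n"
        "A_001\tType IV pilus assembly protein\n"
        "A_002\tDNA gyrase subunit A\n"
        "A_003\tPutative glycosyltransferase\n"
    ),
    "genome_B": "locus_tag\tproduct\nB_001\tRibosomal protein L2\n",
}

hits_per_genome = {
    genome: screen_table(io.StringIO(table), keywords)
    for genome, table in toy_tables.items()
}
for genome, hits in hits_per_genome.items():
    print(genome, len(hits))
```

With real data, the dictionary of toy tables would be replaced by a loop over annotation files, one per genome.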
🧠 Best Practices
- Use keywords + domain IDs when possible
- Keep keyword lists transparent and version-controlled
- Treat results as hypothesis-generating, not definitive proof
- Combine with pathway context (DRAM) and expression data when available
This keyword-driven approach bridges the gap between raw annotation tables and ecological interpretation, making it especially powerful for comparative genomics and other downstream analyses.
🧠 Key Takeaways from Day 4
- Annotation works best in layers
- Prodigal ensures accurate gene prediction
- Prokka provides rapid functional context
- InterProScan enables trait-level screening
- DRAM connects genes into metabolic pathways
- Simple keyword-based filtering can reveal powerful ecological insights
Coming Up: Day 5
Day 5 – Comparative Genomics & Functional Comparisons
We'll integrate taxonomy (Day 3) and function (Day 4) to compare genomes across environments, clades, and ecological strategies.
All code is available at: https://github.com/jojyjohn28/whole-genome-sequencing-analysis
These functional screening results can be combined with genome taxonomy and phylogeny and used as annotation layers in iTOL. For example, the figure below illustrates biofilm-related gene screening across 200+ genomes, where the presence or absence of biofilm-associated traits is overlaid onto the phylogenomic tree as metadata strips or heatmaps. This integration allows rapid visualization of how functional traits are distributed across taxonomic lineages and evolutionary clades, enabling direct comparison of functional potential within and between groups.
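One way to prepare such an overlay is to write the presence/absence calls as an iTOL binary dataset file. The sketch below follows our understanding of iTOL's DATASET_BINARY template (header keywords, comma separator, then a DATA section with one `leaf,value` line per genome). Please check it against the current iTOL documentation before use. The function name, label, and color are illustrative, and leaf names must match the tree's tip labels.

```python
def make_itol_binary(presence, label, color="#1f77b4"):
    """Build the text of an iTOL binary dataset from {leaf_name: bool}.

    Assumes neither the label nor the leaf names contain commas,
    because a comma is used as the field separator.
    """
    lines = [
        "DATASET_BINARY",
        "SEPARATOR COMMA",
        f"DATASET_LABEL,{label}",
        f"COLOR,{color}",
        "FIELD_SHAPES,1",  # one field, drawn with shape type 1
        f"FIELD_LABELS,{label}",
        "DATA",
    ]
    for leaf, present in sorted(presence.items()):
        lines.append(f"{leaf},{1 if present else 0}")
    return "\n".join(lines) + "\n"


# Toy presence/absence calls for a biofilm trait (made-up genomes).
biofilm_presence = {"G1": True, "G2": False, "G3": True}
itol_text = make_itol_binary(biofilm_presence, "biofilm")
print(itol_text)
```

Writing `itol_text` to a file and dragging it onto the tree in iTOL should add the trait as a dataset next to the phylogeny.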
