Genome Annotation Tools — From RAST's 19% to a Complete Picture with PGAP, eggNOG, DRAM, and InterProScan

Genome annotation tools — from RAST’s 19% to a complete picture

🧬 𝐷𝑎𝑦 74 𝑜𝑓 𝐷𝑎𝑖𝑙𝑦 𝐵𝑖𝑜𝑖𝑛𝑓𝑜𝑟𝑚𝑎𝑡𝑖𝑐𝑠 𝑓𝑟𝑜𝑚 𝐽𝑜𝑗𝑦’𝑠 desk

You assembled your genome. You uploaded it to RAST, waited a few minutes, and got your annotation back. You opened the results and saw this number staring at you:

19% of genes assigned to subsystems.

That means 81% of your genes are sitting in the “not assigned” category. You start wondering — did the assembly fail? Is the genome bad? Is the organism too unusual for RAST to know about?

The answer is almost certainly none of those things. RAST is working exactly as intended. The problem is that RAST only tells you one part of the story, and you need several tools together to see the full picture.

This post explains what each tool does, why the numbers differ so dramatically between tools, and how to combine PGAP, eggNOG-mapper, DRAM, and InterProScan to produce annotation figures that will satisfy a reviewer. At the end, there is a complete R visualization script for your manuscript.

Why RAST only annotated 19% of your genes

RAST uses the SEED subsystem framework. A subsystem is a hand-curated set of genes that together perform a specific metabolic or cellular function — things like “arginine biosynthesis” or “sporulation.” RAST assigns a gene to a subsystem only if the gene is a known, well-characterised member of one of these curated pathways.

The key word is curated. SEED subsystems exist for pathways that have been studied in detail in well-known organisms (mostly E. coli, Bacillus subtilis, and other model bacteria). If your organism is a novel isolate, an environmental bacterium from an unusual habitat, or contains genes that have not been formally placed into a SEED subsystem, RAST will not annotate them — even if the gene has a clear homolog in another database.

So 19% assigned to subsystems is not a failure. It is the normal result for a novel environmental isolate. RAST’s 19% tells you about the conserved, well-understood metabolic core. The other 81% needs a different kind of tool.

The annotation toolkit: what each tool does

Think of genome annotation like describing a person to someone who has never met them. RAST tells you their job title if it is one of ten well-known professions. PGAP gives you a full CV. eggNOG-mapper groups them by family resemblance to other people. InterProScan describes their structural features and protein domains. DRAM links them to specific metabolic pathways relevant to ecology.

RAST / BV-BRC

What it does: Web-based genome annotation using the SEED subsystem framework. Fast (minutes), no installation, good for a first look and for conservative pathway interpretation.

Best for: Ecological discussion, subsystem category counts, nitrogen/carbon/stress metabolic interpretation.

Limitation: Only annotates genes that fit into curated SEED subsystems. Everything else is “not assigned.”

Use it for: Getting a quick first annotation, the “conservative subsystem” column in your Table 1, and ecological narrative.

Web server: rast.nmpdr.org or bv-brc.org

PGAP (NCBI Prokaryotic Genome Annotation Pipeline)

What it does: NCBI’s official prokaryotic annotation pipeline. Identifies CDS, RNA genes, pseudogenes, regulatory elements. Uses NCBI’s protein databases plus HMM-based searches. Produces GenBank-ready output that NCBI actually accepts for genome submission.

Best for: Primary annotation for your manuscript, reviewer compliance, GenBank submission.

The honest truth about PGAP: You will likely still see 30–50% “hypothetical proteins.” This is not bad — it means those genes have no characterised homolog in NCBI’s databases. Hypothetical proteins in novel genomes are normal and expected. PGAP annotates them correctly as “hypothetical protein” rather than making up a function.

Installation: Docker-based on a local Linux/Mac machine. Avoid HPC Singularity/Apptainer — too many cgroups and SSL issues.

eggNOG-mapper

What it does: Maps your proteins against the eggNOG database of orthologous groups using DIAMOND as the search engine. Assigns COG categories, KEGG pathways and KO numbers, GO terms, and EC numbers.

Best for: Functional depth, orthology-based annotation, COG category plots, KEGG pathway analysis.

Why the numbers jump: eggNOG uses sequence similarity to orthologous groups across all of the sequenced microbial world — not just curated subsystems. A gene that RAST never heard of can still hit an eggNOG ortholog and get a COG and KEGG assignment. This is why eggNOG typically annotates 60–80% of proteins compared to RAST’s 10–20%.

DRAM (Distilled and Refined Annotation of Metabolism)

What it does: Specifically designed for environmental metagenomes and isolate genomes. Integrates annotations from KEGG, MEROPS, dbCAN (for CAZymes), PFAM, and TIGRfam into a unified distillation framework. Produces metabolism summary tables and heat maps of key metabolic pathways.

Best for: Ecological interpretation of metabolic potential, especially for environmental genomes. Excellent for showing nitrogen cycling, carbon degradation, sulfur metabolism, and energy conservation genes.

Installation: Conda.

mamba create -n DRAM -c conda-forge -c bioconda dram
conda activate DRAM
DRAM-setup.py prepare_databases --output_dir /path/to/DRAM_databases

Run DRAM:

DRAM.py annotate \
  -i genome.fna \
  -o dram_output \
  --threads 16

DRAM.py distill \
  -i dram_output/annotations.tsv \
  -o dram_distill \
  --trna_path dram_output/trnas.tsv \
  --rrna_path dram_output/rrnas.tsv

Key output: dram_distill/product.html — an interactive heat map of metabolic modules. This single file is often worth more for ecological discussion than any other annotation output.

InterProScan

What it does: Scans protein sequences against 13+ integrated protein signature databases: Pfam, TIGRFAM, PRINTS, ProSite, SUPERFAMILY, Gene3D, and others. Assigns protein domains, structural families, and functional sites.

Best for: Understanding what protein families your genes belong to, especially for hypothetical proteins. An “unknown protein” that hits a PFAM domain for a histidine kinase receiver domain is no longer truly unknown.

Installation and run:

# Download from https://www.ebi.ac.uk/interpro/download/
# Or via conda:
conda install -c bioconda interproscan

interproscan.sh \
  -i annot_translated_cds.faa \
  -f TSV,GFF3 \
  -o interproscan_results \
  --cpu 16 \
  --goterms \
  --iprlookup \
  --pathways

The key output is a TSV with protein domain assignments. Even if a protein has no known function, a Pfam domain hit tells you something about its structure and likely molecular function.

The recommended strategy

Here is the order and logic that worked well for a novel environmental isolate (Radiobacillus sp.):

Step	Tool	Purpose
1	RAST	Quick first look, subsystem counts for table
2	PGAP (Docker)	Primary manuscript annotation, GenBank submission
3	eggNOG-mapper	COG/KEGG/GO annotation using PGAP proteins
4	DRAM	Metabolic pathway distillation for ecology
5	InterProScan	Domain annotation for hypothetical proteins
6	R + ggplot2	Manuscript figures and supplementary tables

The key insight: use PGAP proteins as the input for eggNOG-mapper and InterProScan. This keeps all tools using the same gene calls and avoids the inconsistency of running Prodigal separately for each tool.

Part 1: PGAP — primary annotation

Install Docker and PGAP

# Install Docker (Ubuntu)
sudo apt update && sudo apt install docker.io
sudo systemctl enable docker && sudo systemctl start docker
sudo usermod -aG docker $USER
newgrp docker
docker run hello-world   # should print hello-world message

# Download PGAP
mkdir -p ~/pgap && cd ~/pgap
curl -LO https://github.com/ncbi/pgap/raw/prod/scripts/pgap.py
chmod +x pgap.py
./pgap.py --update   # downloads container + databases (~60 GB)

Run PGAP

./pgap.py \
  -g Radiobacillus_sp._PE_A8_2.fna \
  -s "Radiobacillus sp." \
  -o pgap_out \
  -n

Check the output

cd pgap_out

# Total CDS
grep -c "CDS" annot.gff

# Hypothetical proteins
grep -c "product=hypothetical protein" annot.gff

# Percent functionally annotated
TOTAL=$(grep -c "CDS" annot.gff)
HYP=$(grep -c "product=hypothetical protein" annot.gff)
echo "scale=2; ($TOTAL - $HYP)/$TOTAL * 100" | bc

Export a simple annotation table

awk -F'\t' '$3=="CDS" {
    split($9,a,";");
    id=""; product="";
    for(i in a){
        if(a[i] ~ /ID=/){split(a[i],b,"="); id=b[2]}
        if(a[i] ~ /product=/){split(a[i],b,"="); product=b[2]}
    }
    print id"\t"product
}' annot.gff > pgap_annotations.tsv

Key PGAP output files

File	What it contains
`annot.gff`	Main annotation in GFF3 format
`annot.gbk`	GenBank format (for genome browsers)
`annot_translated_cds.faa`	Protein FASTA — use this for all downstream tools
`annot.sqn`	Submission-ready file for NCBI

Part 2: eggNOG-mapper — functional depth

Install

conda create -n eggnog -c bioconda -c conda-forge eggnog-mapper diamond
conda activate eggnog
download_eggnog_data.py -y   # downloads ~50 GB

Run

Use the PGAP protein file as input:

emapper.py \
  -i pgap_out/annot_translated_cds.faa \
  -o Radiobacillus_eggnog \
  --cpu 16 \
  --data_dir /path/to/eggnog_database \
  --itype proteins \
  -m diamond \
  --override

Use --override not --resume. If a previous run was interrupted and output files exist but are empty or partial, --resume will silently keep the broken intermediates and report zero processed hits. --override starts clean.

A successful run ends with:

Total hits processed: XXXX

If that number is zero, something went wrong — check that the input file is not empty and the database path is correct.

Part 3: DRAM — metabolic distillation

conda activate DRAM

DRAM.py annotate \
  -i genome.fna \
  -o dram_output \
  --threads 16

DRAM.py distill \
  -i dram_output/annotations.tsv \
  -o dram_distill \
  --trna_path dram_output/trnas.tsv \
  --rrna_path dram_output/rrnas.tsv

Open dram_distill/product.html in a browser. You will see a heat map of metabolic modules — rows are pathways, columns are completeness. This is the figure that goes into the ecological discussion section.

Part 4: InterProScan — domain annotation

interproscan.sh \
  -i pgap_out/annot_translated_cds.faa \
  -f TSV,GFF3 \
  -o interproscan_results \
  --cpu 16 \
  --goterms \
  --iprlookup \
  --pathways

The TSV output has one row per domain hit. A protein that PGAP called “hypothetical protein” may hit a Pfam domain like PF00069 (protein kinase domain) — now you know it is a kinase, even if you do not know which substrate it phosphorylates.

Part 5: Comparing the tools — the numbers

Here is what a typical novel environmental bacterial genome looks like across tools:

Tool	Total features	Annotated	% annotated
RAST/SEED	~5,000	~950	19%
PGAP	~4,800	~3,000	~63%
eggNOG-mapper	~4,800	~3,600	~75%
InterProScan	~4,800	~3,200	~67%

These numbers are consistent across novel isolates. RAST’s low number is not a bug — it is a feature of a conservative, curated database. PGAP and eggNOG are not more “correct,” they are just using broader sequence similarity approaches.

For your manuscript: report all four numbers. A sentence like “Of the 4,812 predicted protein-coding genes, 63% were assigned a function by PGAP, 75% had eggNOG ortholog assignments, and 19% were placed into SEED subsystems” is both accurate and demonstrates analytical depth.

Part 6: R visualization script

This script produces a four-panel figure: Panel A (annotation comparison across methods), Panel B (PGAP functional groups), Panel C (eggNOG COG categories), Panel D (top SEED subsystems). It also writes all data to a supplementary Excel file.

Libraries

library(tidyverse)
library(cowplot)
library(writexl)
library(scales)
library(forcats)
library(stringr)

Read input files

eggnog_file <- "Radiobacillus_eggnog.emapper.annotations"
pgap_file   <- "pgap_annotations.tsv"

# Read PGAP annotation table
pgap <- read_tsv(
  pgap_file,
  col_names = c("Gene_ID", "Product"),
  show_col_types = FALSE
) %>%
  mutate(
    Product     = replace_na(Product, "NA"),
    hypothetical = str_to_lower(Product) == "hypothetical protein"
  )

# Read eggNOG — skip comment lines, use the #query header
egg_lines   <- readLines(eggnog_file)
header_line <- egg_lines[str_detect(egg_lines, "^#query")][1]
header_vec  <- str_split(str_remove(header_line, "^#"), "\t")[[1]]

eggnog <- read_tsv(
  eggnog_file,
  comment = "#",
  col_names = header_vec,
  show_col_types = FALSE
)

Calculate summary statistics

pgap_total <- nrow(pgap)
pgap_hyp   <- sum(pgap$hypothetical, na.rm = TRUE)
pgap_annot <- pgap_total - pgap_hyp
pgap_pct   <- 100 * pgap_annot / pgap_total

eggnog_total <- nrow(eggnog)
eggnog_annot <- eggnog %>%
  mutate(Description = replace_na(Description, "-")) %>%
  summarise(n = sum(Description != "-" & Description != "")) %>%
  pull(n)
eggnog_unannot <- eggnog_total - eggnog_annot
eggnog_pct     <- 100 * eggnog_annot / eggnog_total

# SEED values from RAST output (fill these in from your RAST report)
seed_total   <- 5077
seed_annot   <- round(seed_total * 0.19)
seed_unannot <- seed_total - seed_annot
seed_pct     <- 100 * seed_annot / seed_total

# Summary table
summary_tbl <- tibble(
  Method               = c("eggNOG-mapper", "PGAP", "SEED subsystem"),
  Total_features       = c(eggnog_total, pgap_total, seed_total),
  Annotated_features   = c(eggnog_annot, pgap_annot, seed_annot),
  Unannotated_features = c(eggnog_unannot, pgap_hyp, seed_unannot),
  Percent_annotated    = c(eggnog_pct, pgap_pct, seed_pct)
)

PGAP functional groups

pgap_groups <- tibble(
  Category = c(
    "Transporters", "Dehydrogenases", "Kinases",
    "Ribosomal proteins", "Regulators", "Hypothetical proteins"
  ),
  Pattern = c(
    "transporter|permease|ABC transporter|MFS transporter",
    "dehydrogenase",
    "kinase",
    "ribosomal",
    "transcription factor|response regulator|helix-turn-helix",
    "^hypothetical protein$"
  )
) %>%
  rowwise() %>%
  mutate(Count = sum(str_detect(pgap$Product,
                                regex(Pattern, ignore_case = TRUE)), na.rm = TRUE)) %>%
  ungroup() %>%
  arrange(desc(Count))

COG categories from eggNOG

cog_map <- c(
  C = "Energy production and conversion",
  E = "Amino acid transport and metabolism",
  G = "Carbohydrate transport and metabolism",
  J = "Translation, ribosomal structure",
  K = "Transcription",
  L = "Replication, recombination and repair",
  M = "Cell wall/membrane biogenesis",
  O = "Protein modification and turnover",
  P = "Inorganic ion transport and metabolism",
  R = "General function prediction only",
  S = "Function unknown",
  T = "Signal transduction",
  V = "Defense mechanisms"
)

cog_long <- eggnog %>%
  select(query, COG_category) %>%
  mutate(COG_category = replace_na(COG_category, "-")) %>%
  filter(COG_category != "-", COG_category != "") %>%
  mutate(COG_category = str_split(COG_category, "")) %>%
  unnest(COG_category) %>%
  filter(str_detect(COG_category, "[A-Z]")) %>%
  mutate(COG_description = recode(COG_category, !!!cog_map, .default = "Other"))

cog_summary <- cog_long %>%
  count(COG_category, COG_description, sort = TRUE, name = "Count")

SEED subsystem summary (from RAST report)

seed_summary <- tribble(
  ~SEED_category,                                          ~Count,
  "Amino Acids and Derivatives",                           253,
  "Protein Metabolism",                                    180,
  "Carbohydrates",                                         166,
  "Cofactors, Vitamins, Prosthetic Groups, Pigments",      115,
  "Nucleosides and Nucleotides",                           101,
  "DNA Metabolism",                                         57,
  "RNA Metabolism",                                         49,
  "Respiration",                                            45,
  "Fatty Acids, Lipids, and Isoprenoids",                   45,
  "Membrane Transport",                                     41,
  "Virulence, Disease and Defense",                         41,
  "Cell Wall and Capsule",                                  38,
  "Stress Response",                                        32,
  "Dormancy and Sporulation",                               31
) %>%
  arrange(desc(Count))

Shared plot theme

theme_pub <- theme_bw(base_size = 12) +
  theme(
    panel.grid   = element_blank(),
    axis.text    = element_text(color = "black"),
    axis.title   = element_text(face = "bold"),
    legend.title = element_text(face = "bold"),
    plot.margin  = margin(10, 5, 5, 5)
  )

Panel A: annotation comparison (stacked bar)

comp_long <- summary_tbl %>%
  select(Method, Annotated_features, Unannotated_features) %>%
  pivot_longer(cols = c(Annotated_features, Unannotated_features),
               names_to = "Status", values_to = "Count") %>%
  mutate(
    Status = recode(Status,
                    Annotated_features   = "Annotated",
                    Unannotated_features = "Unannotated / not assigned"),
    Method = factor(Method, levels = c("eggNOG-mapper", "PGAP", "SEED subsystem"))
  )

figA <- comp_long %>%
  group_by(Method) %>%
  mutate(percent = Count / sum(Count) * 100) %>%
  ungroup() %>%
  mutate(Status = factor(Status,
                         levels = c("Annotated", "Unannotated / not assigned"))) %>%
  ggplot(aes(x = Method, y = Count, fill = Status)) +
  geom_col(width = 0.72) +
  geom_text(aes(label = paste0(round(percent, 1), "%")),
            position = position_stack(vjust = 0.5),
            color = "white", size = 4) +
  scale_fill_manual(values = c("Annotated"                    = "#1b9e77",
                                "Unannotated / not assigned"   = "#d95f02")) +
  labs(x = NULL, y = "Number of genes / proteins", fill = NULL) +
  theme_pub +
  theme(axis.text.x = element_text(angle = 20, hjust = 1),
        legend.position = "top")

Panel B: PGAP functional groups

figB <- pgap_groups %>%
  mutate(Category = str_wrap(Category, width = 22),
         Category = fct_reorder(Category, Count)) %>%
  ggplot(aes(x = Count, y = Category)) +
  geom_col(width = 0.72, fill = "#1b9e77") +
  labs(x = "Gene count", y = NULL) +
  theme_pub +
  theme(axis.text.y = element_text(size = 9))

Panel C: COG categories (eggNOG top 8)

figC <- cog_summary %>%
  slice_head(n = 8) %>%
  mutate(Label = paste(COG_category, COG_description, sep = " — "),
         Label = str_wrap(Label, width = 34),
         Label = fct_reorder(Label, Count)) %>%
  ggplot(aes(x = Count, y = Label)) +
  geom_col(width = 0.72, fill = "#d95f02") +
  labs(x = "Protein count", y = NULL) +
  theme_pub +
  theme(axis.text.y = element_text(size = 8.5))

Panel D: SEED subsystems (top 8)

figD <- seed_summary %>%
  slice_head(n = 8) %>%
  mutate(SEED_category = str_wrap(SEED_category, width = 30),
         SEED_category = fct_reorder(SEED_category, Count)) %>%
  ggplot(aes(x = Count, y = SEED_category)) +
  geom_col(width = 0.72, fill = "#7570b3") +
  labs(x = "Feature count", y = NULL) +
  theme_pub +
  theme(axis.text.y = element_text(size = 8.5))

Assemble and save

panelA <- ggdraw(figA) +
  draw_label("A", x=0.01, y=0.99, hjust=0, vjust=1, fontface="bold", size=16)
panelB <- ggdraw(figB) +
  draw_label("B", x=0.01, y=0.99, hjust=0, vjust=1, fontface="bold", size=16)
panelC <- ggdraw(figC) +
  draw_label("C", x=0.01, y=0.99, hjust=0, vjust=1, fontface="bold", size=16)
panelD <- ggdraw(figD) +
  draw_label("D", x=0.01, y=0.99, hjust=0, vjust=1, fontface="bold", size=16)

right_stack <- plot_grid(panelB, panelC, panelD,
                         ncol = 1, align = "v", rel_heights = c(1, 1, 1))

combined_fig <- plot_grid(panelA, right_stack,
                          ncol = 2, rel_widths = c(1.15, 1.35), align = "h")

ggsave("Figure_annotation_composite.png", combined_fig,
       width = 12, height = 10, dpi = 600)
ggsave("Figure_annotation_composite.pdf", combined_fig,
       width = 12, height = 10)

Supplementary tables

supp_list <- list(
  "S1_method_summary"   = summary_tbl,
  "S1_pgap_annotations" = pgap,
  "S1_pgap_groups"      = pgap_groups,
  "S1_eggnog_full"      = eggnog,
  "S1_COG_summary"      = cog_summary,
  "S1_SEED_summary"     = seed_summary
)

write_xlsx(supp_list, "Supplementary_Table_S1_Radiobacillus.xlsx")

Quick comparison: all five tools side by side

Tool	Approach	Input	Main output	Best for
RAST	Curated subsystems	FASTA	Subsystem categories	Quick first look, ecological table
PGAP	NCBI HMM + BLAST	FASTA	GenBank, GFF3, FAA	Primary manuscript annotation
eggNOG-mapper	Ortholog groups (DIAMOND)	Protein FAA	COG, KEGG, GO, EC	Functional depth, COG plots
DRAM	Multi-DB metabolic distillation	FASTA	Metabolism heat map	Ecological interpretation
InterProScan	13 domain databases	Protein FAA	Pfam, TIGRFAM, GO	Domain annotation, hypothetical proteins

What to put in your manuscript

Table 1 (genome statistics): Include a row for each tool — total features, annotated, percentage.

Main figure: The four-panel figure produced by the R script above — annotation comparison (A), functional groups (B), COG categories (C), SEED subsystems (D).

Supplementary: The full eggNOG annotation table, the full PGAP product list, and the COG/SEED summaries.

Methods section:

Genome annotation was performed using NCBI PGAP (version X.X) run via Docker. Functional annotation was extended using eggNOG-mapper v2 (version X.X) with DIAMOND search against the eggNOG 5.0 database. Metabolic potential was assessed using DRAM (version X.X). Protein domain assignments were obtained using InterProScan (version X.X) against Pfam, TIGRFAM, and SUPERFAMILY databases. Subsystem-based interpretation was performed using the RAST/SEED annotation service.

A note on “hypothetical proteins”

One more thing beginners often worry about: you might have 35% hypothetical proteins after PGAP, 25% after eggNOG, and you wonder if that makes the genome look bad.

It does not. In most environmental bacterial genomes published today, 20–40% hypothetical proteins is completely normal. These are genes that are real (they pass gene-calling filters, they have Shine-Dalgarno sequences, they are conserved across strains) but simply have no characterised homolog yet. As more environmental bacteria are sequenced and characterised, these numbers will improve.

In your manuscript, frame it positively: “Our analysis revealed X% hypothetical proteins, consistent with the novel ecological niche of this organism and suggesting a reservoir of uncharacterised functional diversity.”

This workflow was developed during annotation of Radiobacillus sp. PE-A8-2. The same pipeline applies to any novel prokaryotic genome — adapt species names and file names accordingly. PGAP requires ~60–80 GB disk space; eggNOG databases require ~50 GB.

Tags: PGAP eggNOG-mapper RAST DRAM InterProScan genome-annotation COG KEGG ggplot2 hypothetical-proteins bioinformatics beginner

see_your_plot