Genome Topology and Genome Announcement Reports: A Practical, Reviewer-Safe Workflow
🧬 Genome Topology and Genome Announcement Reports
Today, I'm stepping slightly back from the Size-Fractionated Microbiome Analysis series. With the winter break approaching, I wanted to wrap up a few pending tasks, including finalizing genome submissions to NCBI. We recently received the final queries and accession numbers for most of our genomes, so today I am documenting how to answer the last few NCBI submission questions and how to prepare a complete genome summary table suitable for a Genome Resource Announcement manuscript.
How to answer NCBI questions without over-claiming
When submitting genomes to NCBI (GenBank / Genome / BioProject), two questions almost always appear and often cause confusion:
[A] Is this a draft assembly or does one sequence represent the chromosome? [B] If this is a circular chromosome, are the ends of the sequence contiguous around the circle with no overlap and no gap?
These questions are not asking you to prove perfection; they are asking you to state clearly what you are claiming and what you are not.
This post documents a practical, conservative, and reviewer-safe workflow that I use to:
- Answer NCBI topology questions confidently
- Avoid over-claiming circularity
- Generate a final genome summary table suitable for Genome Resource Announcements
Core principle (very important)
Only assert circularity when the assembly method and validation steps truly support it.
- Draft genomes and linear submissions are fully acceptable to NCBI.
- Over-claiming circularity is not.
Step 1: Determine whether a genome is draft or single-contig
For each genome FASTA:
grep -c "^>" genome.fa
Because I have many genomes, I copied them all into one folder and used a short batch loop:
for f in *.fa; do
    echo "=== $f ==="
    grep -c "^>" "$f"
done
Interpretation
| Result | Meaning | NCBI Answer |
| --- | --- | --- |
| >1 contig | Draft assembly | [A] Draft assembly |
| 1 contig | Candidate chromosome | Requires validation |

If a genome has multiple contigs, do not discuss circularity.
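For readers who prefer to keep the whole workflow in Python, the contig count and its interpretation can be sketched as below. This is only an illustration of the grep loop above, not a replacement for it; the function names and the wording of the returned labels are my own assumptions.

```python
from pathlib import Path


def count_contigs(fasta_path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def ncbi_answer(n_contigs):
    """Map a contig count to the conservative answer used in this post."""
    if n_contigs > 1:
        return "Draft assembly"
    if n_contigs == 1:
        return "Candidate chromosome (requires validation)"
    return "No sequences found"


def summarize_folder(folder, pattern="*.fa"):
    """Return {file name: (contig count, answer)} for every FASTA in a folder."""
    results = {}
    for path in sorted(Path(folder).glob(pattern)):
        n = count_contigs(path)
        results[path.name] = (n, ncbi_answer(n))
    return results
```

Calling `summarize_folder(".")` in the genome folder gives one entry per `.fa` file, which is convenient to paste into a tracking sheet.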
Step 2 β Validate circularity for single-contig genomes
This step applies only to genomes with exactly one contig.
2a. Check assembler evidence (Flye / Unicycler)
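- Flye: inspect assembly_info.txt; the circ. column reports Y for contigs assembled as circular
- Unicycler: look for circular=true in the contig header of the assembly FASTA
- If circularity is not explicitly stated, do not assume it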
Illumina-only assemblies cannot reliably establish circularity. Because I used long-read and hybrid assemblies, I wrote a custom script to extract this information either from assembly_info.txt or from the assembly FASTA itself. I also confirmed the results with step 2b.
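The custom script itself is not shown here, so the following is only a minimal sketch of how the assembler evidence could be read. It assumes Flye's tab-separated assembly_info.txt with a `#seq_name` header and a `circ.` column holding Y or N, and Unicycler FASTA headers that carry `circular=true` for circular contigs. The function names are hypothetical.

```python
import re


def flye_circular_flags(assembly_info_path):
    """Read Flye's assembly_info.txt and return {contig name: True/False}.

    Assumes a tab-separated file whose header starts with '#seq_name' and
    includes a 'circ.' column with Y or N values.
    """
    flags = {}
    with open(assembly_info_path) as handle:
        header = handle.readline().lstrip("#").rstrip("\n").split("\t")
        name_idx = header.index("seq_name")
        circ_idx = header.index("circ.")
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > max(name_idx, circ_idx):
                flags[fields[name_idx]] = fields[circ_idx] == "Y"
    return flags


def unicycler_circular_flags(fasta_path):
    """Read Unicycler FASTA headers and return {contig name: True/False}.

    Assumes circular contigs are marked with 'circular=true' in the header.
    """
    flags = {}
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                parts = line[1:].split()
                name = parts[0] if parts else ""
                flags[name] = re.search(r"\bcircular=true\b", line) is not None
    return flags
```

A contig that is missing from the result, or flagged False, should not be claimed as circular on assembler evidence.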
2b. Check for overlap between contig ends
NCBI requires no overlap and no gap at the contig ends. Extract the first and last 1 kb:
seqkit subseq -r 1:1000 genome.fa > first1k.fa
seqkit subseq -r -1000:-1 genome.fa > last1k.fa
seqkit is a fast, lightweight toolkit for manipulating FASTA/FASTQ files. In this workflow, it is used to extract the first and last regions of contigs for circularity checks. It can be installed as follows:
conda install -c bioconda seqkit
Align the ends:
nucmer --maxmatch -p ends first1k.fa last1k.fa
show-coords -rcl ends.delta
This step requires nucmer, which is part of MUMmer, a whole-genome alignment package. Here, nucmer aligns the start and end of each contig to detect overlapping duplicated regions, which is critical before claiming circularity. MUMmer can be installed as follows:
conda install -c bioconda mummer
This installs nucmer and show-coords, which are used for end-to-end contig validation.
Interpretation
| Result | Meaning | Action |
| --- | --- | --- |
| Strong alignment | Overlap present | Must trim before claiming circular |
| No alignment | Not circularized | Treat as linear |
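If no alignments are shown (only headers), no overlap was detected between the contig ends, so this check gives no evidence that the genome was circularized:
- Do not answer YES to question [B] on the basis of this check alone; the assembler must also report the contig as circular (step 2a)
- Do not infer circularity for Illumina-only assemblies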
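As a quick cross-check on the nucmer result, an exact overlap between the two ends of a single-contig genome can be looked for directly in Python. This is a simplified sketch, not a substitute for nucmer: it only finds exact matches where the end of the contig repeats its start, it tolerates no mismatches, and the `min_len` and `max_len` defaults are arbitrary assumptions.

```python
def read_single_contig(fasta_path):
    """Return the sequence of a FASTA file that holds exactly one record."""
    names, chunks = [], []
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                names.append(line[1:])
            elif line:
                chunks.append(line.upper())
    if len(names) != 1:
        raise ValueError(f"expected 1 contig, found {len(names)}")
    return "".join(chunks)


def exact_end_overlap(seq, min_len=20, max_len=1000):
    """Length of the longest exact overlap where the contig end repeats its start.

    Returns 0 when no suffix of at least min_len bases equals the prefix of
    the same length.
    """
    longest = min(max_len, len(seq) // 2)
    for k in range(longest, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0
```

A non-zero result means the duplicated bases would need to be trimmed from one end before the sequence could be described as contiguous around the circle with no overlap.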
Genome Resource Announcements
Although all of the analyses described below had been completed previously, the final genome FASTA files were subsequently renamed, filtered, and curated, and a subset of genomes was reassembled. Therefore, to generate a consistent and accurate summary table for the Genome Resource Announcement, I decided to define a clear work plan and rerun the relevant steps in a streamlined, reproducible workflow.
Workflow overview
| Tool | Purpose | Fields |
| --- | --- | --- |
| QUAST | Assembly statistics | Size, GC, contigs, N50 |
| CheckM | Quality metrics | Completeness, contamination |
| Prokka | Annotation | Protein-coding genes |
| Flye / Unicycler | Topology evidence | Circular / Linear |
| NUCmer | End validation | Overlap detection |

Step 1: Assembly statistics with QUAST
mkdir -p quast_out
quast.py *.fa -o quast_out --threads 16
On the cluster, I ran it as a batch script:
#!/bin/bash
#SBATCH --job-name=quast_all
#SBATCH --partition=camplab
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=12:00:00
#SBATCH --output=quast_%j.out
#SBATCH --error=quast_%j.err
module load quast
GENOME_DIR="/project/dkarig/ecocoat/NCBI_oct22/fixed_genomes_dec11"
OUTDIR="/project/dkarig/ecocoat/NCBI_oct22/quast_dec11"
mkdir -p "$OUTDIR"
# Run QUAST on all genomes together
quast.py \
--threads 16 \
--min-contig 500 \
-o "$OUTDIR" \
"$GENOME_DIR"/*.fa
echo "QUAST completed at $(date)"
Save this file as run_quast_all.sh and submit it as follows:
chmod +x run_quast_all.sh
sbatch run_quast_all.sh
The key output file is report.tsv, which provides:
- Genome size (reported by QUAST as Total length)
- GC (%)
- Number of contigs
- N50
Step 2: Completeness and contamination with CheckM
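QUAST's report.tsv lists metrics as rows and assemblies as columns, which is the opposite of what the final summary table needs. The sketch below flips it into one record per genome using only the standard library. The row labels ("Total length", "GC (%)", "# contigs", "N50") are what I expect QUAST to write and should be checked against your own report; the output column names are my own choice.

```python
import csv

# Row labels expected in QUAST's report.tsv (assumption: verify in your file),
# mapped to the column names used in the final summary table.
QUAST_ROWS = {
    "Total length": "Genome size (bp)",
    "GC (%)": "GC (%)",
    "# contigs": "Contigs",
    "N50": "N50 (bp)",
}


def quast_report_to_rows(report_path):
    """Turn report.tsv (metrics as rows) into a list with one dict per genome."""
    with open(report_path, newline="") as handle:
        table = [record for record in csv.reader(handle, delimiter="\t") if record]
    genomes = table[0][1:]  # first row: "Assembly" followed by genome names
    rows = [{"Genome": genome} for genome in genomes]
    for record in table[1:]:
        label = record[0]
        if label in QUAST_ROWS:
            for row, value in zip(rows, record[1:]):
                row[QUAST_ROWS[label]] = value
    return rows
```

Values are kept as strings so that nothing is silently rounded before the tables are merged.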
CheckM was used to assess genome quality by estimating completeness and contamination based on lineage-specific single-copy marker genes. The lineage workflow automatically assigns each genome to an appropriate taxonomic lineage and reports standardized quality metrics, which were exported as a tab-delimited summary table and incorporated directly into the final genome summary table for the Genome Resource Announcement.
#!/bin/bash
#SBATCH --job-name=checkm_all
#SBATCH --partition=camplab
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --mem=96G
#SBATCH --time=24:00:00
#SBATCH --output=checkm_%j.out
#SBATCH --error=checkm_%j.err
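# Activate the conda environment that provides CheckM.
# Sourcing conda.sh makes `conda activate` usable in a non-interactive batch job
# (assumes conda is on PATH; otherwise load your cluster's CheckM module instead).
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate checkm_env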
GENOME_DIR="/project/dkarig/ecocoat/NCBI_oct22/all_genomes_dec15"
OUTDIR="/project/dkarig/ecocoat/NCBI_oct22/checkm_dec15"
mkdir -p "$OUTDIR"
# Run CheckM lineage workflow
checkm lineage_wf \
-x fa \
-t 24 \
--reduced_tree \
"$GENOME_DIR" \
"$OUTDIR"
# Generate a clean summary table; --tab_table is needed for tab-delimited output
checkm qa \
-o 2 \
--tab_table \
-f "$OUTDIR/checkm_summary.tsv" \
"$OUTDIR"/lineage.ms \
"$OUTDIR"
echo "CheckM completed at $(date)"
Save this as run_checkm_all.sh and submit it as follows:
chmod +x run_checkm_all.sh
sbatch run_checkm_all.sh
The key output file is checkm_summary.tsv. For each genome, it includes:
- Completeness (%)
- Contamination (%)
- Strain heterogeneity
- Marker lineage
Step 3: Gene prediction with Prokka
This step gives the number of predicted genes per genome. My strategy is as follows:
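Only a few of the CheckM columns are needed for the announcement table. The sketch below pulls them out, assuming checkm_summary.tsv is tab-delimited (as written by `checkm qa` with `--tab_table`) and uses CheckM's usual headers, with "Bin Id" as the genome identifier. The function name and the renaming to "Genome" are my own.

```python
import csv

# CheckM column headers kept for the summary table (assumed header names).
CHECKM_COLUMNS = [
    "Completeness",
    "Contamination",
    "Strain heterogeneity",
    "Marker lineage",
]


def read_checkm_summary(tsv_path):
    """Return one dict per genome with the quality columns listed above."""
    rows = []
    with open(tsv_path, newline="") as handle:
        for record in csv.DictReader(handle, delimiter="\t"):
            row = {"Genome": record["Bin Id"]}  # rename to the shared ID column
            for column in CHECKM_COLUMNS:
                row[column] = record[column]
            rows.append(row)
    return rows
```

A KeyError here usually means the file is not tab-delimited or the headers differ from the assumed ones, which is worth knowing before the merge step.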
- Run Prokka once per genome (recommended; Prokka is per-genome by design)
- Store outputs in per-genome subfolders
- Parse Prokka outputs to create one summary TSV with:
  - number of CDS (protein-coding genes)
  - optionally rRNA, tRNA counts
#!/bin/bash
#SBATCH --job-name=prokka_all
#SBATCH --partition=camplab
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=24:00:00
#SBATCH --output=prokka_%j.out
#SBATCH --error=prokka_%j.err
module load prokka
GENOME_DIR="/project/dkarig/ecocoat/NCBI_oct22/all_genomes_dec15"
OUTDIR="/project/dkarig/ecocoat/NCBI_oct22/prokka_dec15"
mkdir -p "$OUTDIR"
for fa in "$GENOME_DIR"/*.fa; do
    base=$(basename "$fa" .fa)
    echo "Running Prokka on $base"
    prokka \
        --outdir "$OUTDIR/$base" \
        --prefix "$base" \
        --cpus 16 \
        --force \
        "$fa"
done
echo "Prokka completed at $(date)"
Save this as run_prokka_all.sh and submit it as follows:
chmod +x run_prokka_all.sh
sbatch run_prokka_all.sh
This generates the following files for each genome (example: ABV4_150):
prokka_dec15/ABV4_150/
├── ABV4_150.gff
├── ABV4_150.tbl
├── ABV4_150.txt   ← summary stats (VERY IMPORTANT)
├── ABV4_150.faa
├── ABV4_150.ffn
└── ABV4_150.log
The file *.txt contains lines like:
CDS: 4123
rRNA: 9
tRNA: 67
Step 3.1: Generate a summary TSV for predicted genes
Script: prokka_summary.sh
This extracts CDS counts (protein-coding genes), along with rRNA and tRNA counts, for all genomes.
#!/bin/bash
PROKKA_DIR="/project/dkarig/ecocoat/NCBI_oct22/prokka_dec15"
OUT_TSV="prokka_gene_summary.tsv"
echo -e "Genome\tCDS\trRNA\ttRNA" > "$OUT_TSV"
for d in "$PROKKA_DIR"/*; do
    genome=$(basename "$d")
    txt="$d/$genome.txt"
    if [[ -f "$txt" ]]; then
        # The Prokka summary file has one "feature: count" line per feature type
        cds=$(awk '/^CDS:/ {print $2}' "$txt")
        rrna=$(awk '/^rRNA:/ {print $2}' "$txt")
        trna=$(awk '/^tRNA:/ {print $2}' "$txt")
        # A feature type with no hits may be absent from the file, so default to 0
        echo -e "$genome\t${cds:-0}\t${rrna:-0}\t${trna:-0}" >> "$OUT_TSV"
    fi
done
echo "Summary written to $OUT_TSV"
Run it:
chmod +x prokka_summary.sh
./prokka_summary.sh
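Step 4: Genome topology
Follow the topology steps from the first section (contig count, assembler evidence, and end-overlap check) and record the result for each genome in topology.tsv.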
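The checks from the first section can be condensed into a single topology label per genome and written to a small table for the merge step. The decision rule below is an illustrative assumption on my part, deliberately conservative, and should be adjusted to your own criteria: multi-contig drafts are left Undetermined, a single contig is called Circular only when the assembler reports it as circular and no end overlap remains, and everything else is Linear.

```python
import csv


def assign_topology(n_contigs, assembler_circular, end_overlap_bp):
    """Return a conservative topology label for one genome.

    n_contigs: number of contigs in the FASTA
    assembler_circular: True/False from the assembler, or None if not reported
    end_overlap_bp: length of any overlap detected between the contig ends
    """
    if n_contigs != 1:
        return "Undetermined"  # draft assembly: circularity is not discussed
    if assembler_circular and end_overlap_bp == 0:
        return "Circular"
    if assembler_circular:
        return "Undetermined"  # overlap must be trimmed before claiming circular
    return "Linear"


def write_topology_table(records, out_path):
    """Write Genome/Topology rows.

    records: iterable of (genome, n_contigs, assembler_circular, end_overlap_bp)
    """
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["Genome", "Topology"])
        for genome, n_contigs, circular, overlap in records:
            writer.writerow([genome, assign_topology(n_contigs, circular, overlap)])
```

Keeping the rule in one function makes it easy to state the exact criteria in the methods section of the announcement.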
Step 5: Merge individual outputs into a final genome summary table
At this stage, all analyses have been completed independently, and the results are available as separate tabular files:
- prokka_out.tsv: predicted gene counts
- topology.tsv: genome topology (Circular / Linear / Undetermined)
- checkm_summary.tsv: completeness and contamination
- quast_report.tsv: genome size, GC content, contigs, N50
- NCBI_list_of_genomes.tsv: genome identifiers and accession numbers
Each file contains a shared identifier column (Genome), which allows them to be merged programmatically into a single, consistent summary table. If a table still carries the tool's native identifier header (for example Assembly in QUAST's transposed_report.tsv or Bin Id in CheckM output), rename it to Genome first.
To avoid manual errors and ensure reproducibility, I used pandas in Python to merge all tables based on this common column.
Example: merging genome-level tables using pandas
import pandas as pd
# Load individual result tables
prokka = pd.read_csv("prokka_out.tsv", sep="\t")
topology = pd.read_csv("topology.tsv", sep="\t")
checkm = pd.read_csv("checkm_summary.tsv", sep="\t")
quast = pd.read_csv("quast_report.tsv", sep="\t")
ncbi = pd.read_csv("NCBI_list_of_genomes.tsv", sep="\t")
# Ensure consistent column names
for df in [prokka, topology, checkm, quast, ncbi]:
    df.columns = df.columns.str.strip()
# Sequentially merge tables using the common 'Genome' column
final_table = (
    quast
    .merge(prokka, on="Genome", how="left")
    .merge(checkm, on="Genome", how="left")
    .merge(topology, on="Genome", how="left")
    .merge(ncbi, on="Genome", how="left")
)
# Save final table
final_table.to_csv("final_genome_summary_table.tsv", sep="\t", index=False)
print("Final genome summary table successfully created.")
This approach ensures that:
- All genome-level metadata are synchronized
- Missing values are handled transparently
- The workflow is fully reproducible
Final genome summary table format
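Left joins leave blanks rather than raising errors when an identifier is missing or misspelled in one table, so it helps to compare the Genome columns before merging. The sketch below is a hypothetical helper using only the standard library; it assumes every table is tab-delimited and has a Genome column.

```python
import csv


def genome_ids(tsv_path, column="Genome"):
    """Return the identifiers in one table, in file order."""
    with open(tsv_path, newline="") as handle:
        return [row[column].strip() for row in csv.DictReader(handle, delimiter="\t")]


def check_ids(tables, column="Genome"):
    """Report duplicated IDs and IDs missing from each table.

    tables: {label: path to TSV}
    Returns {label: {"duplicates": [...], "missing": [...]}}, where "missing"
    lists IDs that appear in at least one other table but not in this one.
    """
    ids = {label: genome_ids(path, column) for label, path in tables.items()}
    all_ids = set().union(*ids.values())
    report = {}
    for label, values in ids.items():
        seen, duplicates = set(), set()
        for value in values:
            if value in seen:
                duplicates.add(value)
            seen.add(value)
        report[label] = {
            "duplicates": sorted(duplicates),
            "missing": sorted(all_ids - seen),
        }
    return report
```

For example, `check_ids({"quast": "quast_report.tsv", "checkm": "checkm_summary.tsv"})` returns empty lists for both tables when their identifiers agree; anything else is worth fixing before the merged table goes into the manuscript.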
The merged output is used directly as Table 1 in the Genome Resource Announcement manuscript.
Table 1. Genome assembly and annotation statistics
| Genome | Assembly type | Genome size (bp) | GC (%) | Coverage (×) | Contigs | N50 (bp) | Predicted genes | Completeness (%) | Contamination (%) | Topology | Accession |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |

- Assembly type: Illumina / Hybrid / Long-read
- Topology: Circular / Linear / Undetermined
- Accession: "pending" during submission, updated post-acceptance
- Genome statistics: N50, GC content, completeness, contamination, genome size, and number of contigs
Take-home messages
- Do not over-claim genome circularity. Only assert circular topology when supported by long-read or hybrid assemblies and explicit validation.
- Draft and linear genomes are fully acceptable for NCBI submission and Genome Resource Announcements.
- Topology should be assigned conservatively using assembly evidence, not biological expectation.
- A single, reproducible summary table simplifies NCBI submissions and serves directly as a manuscript-ready resource.
- Clear documentation of methods and assumptions makes genome submissions reviewer-safe and future-proof.
This workflow has worked smoothly for large-scale genome submissions and Genome Resource Announcements.
The image shows the GC content of all the genomes.
