Cleaning and Preparing Genomes for NCBI Submission: A Complete Workflow

🧬 Cleaning and Preparing Genomes for NCBI Submission

Today's post documents the complete workflow I used to prepare 12 bacterial genomes for NCBI submission.
Out of 205 genomes submitted, 12 failed the NCBI contamination screen:

  • 2 genomes failed due to unexpected genome size
  • 10 genomes failed due to adapter sequences

I systematically fixed all issues using seqkit, bedtools, awk, and shell scripting on the Palmetto HPC cluster.


⚙️ Step 1: Generate Adapter and Contamination Reports

Each genome was screened using NCBI's contamination check.
The report includes two sections:

🔹 Trim Section (mask these regions)

Example (columns: sequence name | length | span | source):
contig02 6271 6238..6271 adaptor:NGB02000.1

🔹 Exclude Section (remove these contigs completely)

Example: contig05 3946 prok:g-proteobacteria

Goal:
✔ Mask adapter regions
✔ Remove contaminant contigs


🧱 Step 2: Create BED Files for Adapter Spans

I extracted adapter coordinates directly from reports:

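for rep in reports/*.csv; do
  base=$(basename "$rep" _report.csv)
  # Column 1 = contig name, column 3 = adapter span (1-based, inclusive "start..end").
  # The first BED column (genome_contig) must match the FASTA sequence IDs exactly.
  awk -F',' -v OFS="\t" -v g="$base" '
    NR > 1 && $3 ~ /[0-9]+\.\.[0-9]+/ {
      # Split on a literal ".."; a bare ".." separator would be read as a regex
      split($3, a, "[.][.]")
      # BED is 0-based and half-open, so only the start shifts by one
      print g "_" $1, a[1] - 1, a[2]
    }' "$rep" > "beds/${base}_adapters.bed"
done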

These .bed files define adapter segments for masking.
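The coordinate shift is the easiest part of this step to get wrong, so here is a small worked sketch of it. The helper name span_to_bed is hypothetical (it is not part of the workflow above); it only illustrates how a 1-based inclusive span from the report maps onto a 0-based, half-open BED interval.

```shell
# Hypothetical helper: convert a 1-based inclusive span such as "6238..6271"
# into a 0-based, half-open BED line (chrom, start, end).
span_to_bed() {
  local contig="$1" span="$2"
  awk -v c="$contig" -v s="$span" 'BEGIN {
    OFS = "\t"
    n = split(s, a, "[.][.]")   # split on the literal ".." separator
    if (n != 2) exit 1          # not a start..end span
    print c, a[1] - 1, a[2]     # BED start = start - 1; BED end = end
  }'
}

# The Trim example from Step 1: 34 bases, positions 6238 through 6271.
span_to_bed contig02 6238..6271
```

The interval length is preserved: 6271 - 6237 = 34 bases, the same as 6238 through 6271 counted inclusively.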

🧬 Step 3: Mask Adapter Regions with Ns

for fa in assemblies/*.fa; do
  base=$(basename "$fa" .fa)
  bed="beds/${base}_adapters.bed"
  # Hard-mask every BED interval; -mc sets the masking character
  bedtools maskfasta -fi "$fa" -bed "$bed" \
    -fo "cleaned/${base}_masked.fa" -mc N
done

This replaces each adapter span with Ns; the spans were usually 10 to 40 bases long.
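One way to sanity-check the masking is to compare the number of bases the BED file covers with the number of Ns that appeared in the masked FASTA. The two helpers below are a hypothetical sketch, not part of the original workflow. The comparison only holds under two assumptions: the BED intervals do not overlap, and the masked regions contained no Ns beforehand.

```shell
# Hypothetical check: total bases covered by a BED file (end - start per line).
bed_total_bp() {
  awk -F'\t' '{ total += $3 - $2 } END { print total + 0 }' "$1"
}

# Count N/n bases in a FASTA file, ignoring header lines.
count_n() {
  grep -v '^>' "$1" | tr -cd 'Nn' | wc -c | tr -d ' '
}
```

For one genome, the value of `bed_total_bp beds/NAME_adapters.bed` should then equal `count_n cleaned/NAME_masked.fa` minus `count_n assemblies/NAME.fa`, where NAME is a placeholder for the genome's base name.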

🚫 Step 4: Remove Contaminant Contigs

From the Exclude section, I created lists of contigs to delete.

for fa in cleaned/*_masked.fa; do
  base=$(basename "$fa" _masked.fa)
  excl="exclude_lists/${base}_exclude.txt"
  # -f reads one sequence ID per line; -v keeps everything NOT in the list
  seqkit grep -v -f "$excl" "$fa" > "final_ncbi_ready/${base}_clean.fa"
done

Result: ✔ All contaminated contigs removed safely.
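The exclude lists themselves are not shown above, so here is a minimal sketch of how one could be built and checked. It assumes the Exclude section has been saved to its own whitespace-separated text file with no header row, in the shape of the Step 1 example (contig name first). Both function names are hypothetical.

```shell
# Hypothetical helper: print unique contig IDs (column 1) from an Exclude
# section saved as whitespace-separated text, skipping blank lines.
make_exclude_list() {
  awk 'NF > 0 && !seen[$1]++ { print $1 }' "$1"
}

# Count sequences in a FASTA file by counting header lines.
count_contigs() {
  grep -c '^>' "$1"
}
```

After filtering, the contig count of the masked file minus the contig count of the cleaned file should equal the number of lines in the exclude list, provided every listed ID actually exists in the assembly.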

🧾 Step 5: Restore NCBI Metadata Headers

NCBI FASTA headers include metadata such as strain and organism. Example of a correct header:

contig01 [organism=Staphylococcus ureilyticus] [strain=ABD2_031]

I restored them using mapping files:

for fa in final_ncbi_ready/*_clean.fa; do
  base=$(basename "$fa" _clean.fa)
  seqkit replace -p "^(contig[0-9]+)" -r "{kv}" \
    --kv-file "rename_maps_fixed/${base}_map.txt" --keep-key \
    -o "final_ncbi_ready_renamed/${base}_clean_renamed.fa" "$fa"
done
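
seqkit reads the --kv-file as tab-delimited key/value pairs, so each mapping file needs one line per contig: the contig ID, a tab, then the full replacement header. The sketch below shows one way such a file could be generated from a FASTA. The function name, and the idea of passing organism and strain as arguments, are assumptions for illustration rather than how the original mapping files were made.

```shell
# Hypothetical helper: write "key<TAB>value" lines for seqkit --kv-file.
# key = contig ID; value = contig ID plus NCBI source modifiers.
make_header_map() {
  local fasta="$1" organism="$2" strain="$3"
  awk -v org="$organism" -v strain="$strain" '
    /^>/ {
      id = substr($1, 2)   # first word of the header, without ">"
      printf "%s\t%s [organism=%s] [strain=%s]\n", id, id, org, strain
    }' "$fasta"
}
```

For example, `make_header_map some.fa "Staphylococcus ureilyticus" "ABD2_031" > some_map.txt` would write one mapping line per contig in some.fa (a placeholder file name).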

🔢 Step 6: Renumber Contigs Sequentially

Because filtering removes contigs and leaves gaps in the numbering, I renumbered the remaining contigs sequentially from 1:

for fa in final_ncbi_ready_renamed/*_clean_renamed.fa; do
  base=$(basename "$fa" _clean_renamed.fa)
  awk '
    # Replace the old contig number with a running count; keep the rest of the header
    /^>/ { count++; sub(/^>contig[0-9]+/, ">contig" count); print; next }
    { print }
  ' "$fa" > "final_ncbi_ready_renumbered/${base}_renumbered.fa"
done
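
Note that the loop above writes unpadded names (contig1, contig2, ...), while the example header in Step 5 is zero-padded (contig01). If padded names are wanted, a variant along these lines could be used. The function name and the default width of 2 digits are assumptions for illustration.

```shell
# Hypothetical variant: renumber headers as contig01, contig02, ...
# The width (2 digits by default) is an assumption; pass another as $2.
renumber_padded() {
  local fasta="$1" width="${2:-2}"
  awk -v w="$width" '
    BEGIN { fmt = ">contig%0" w "d" }   # e.g. ">contig%02d"
    /^>/ {
      count++
      sub(/^>contig[0-9]+/, sprintf(fmt, count))   # rest of header is kept
    }
    { print }' "$fasta"
}
```

Headers that do not start with ">contig" followed by digits are left unchanged but still advance the counter, the same as in the loop above.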

🧮 Step 7: QC for Ns and Percent N

I counted Ns and genome length:

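for fa in final_ncbi_ready_renumbered/*.fa; do
  base=$(basename "$fa")
  # Count N/n bases and total bases, skipping header lines
  totalN=$(grep -v '^>' "$fa" | tr -cd 'Nn' | wc -c)
  totalLen=$(grep -v '^>' "$fa" | tr -d '\n' | wc -c)
  # Guard against an empty file to avoid dividing by zero
  perc=$(awk -v n="$totalN" -v l="$totalLen" 'BEGIN { printf "%.3f", (l > 0) ? (n / l) * 100 : 0 }')
  printf '%s\t%s\t%s\t%s%%\n' "$base" "$totalN" "$totalLen" "$perc"
done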

Example:

AWM4_233_renumbered.fa 5807 5927574 0.098%

✔ All genomes were well within NCBI's requirement of ≤5% Ns (mine were ≤0.1%).

📊 Step 8: Gap Summary (Runs of N ≥10)

Columns: file | number of runs | total Ns in those runs
ABD2_031_renumbered.fa 10 391
ABF4_148_renumbered.fa 2 63
AWM4_233_renumbered.fa 182 5807

This confirms masking was minimal and acceptable.
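The command behind this summary is not shown, so the following is only a sketch of how such a table could be produced with awk. It assumes the three columns are file name, number of N runs of at least 10 bases, and total Ns in those runs. The function name gap_summary is hypothetical.

```shell
# Sketch: print <file> <number of N runs >= min> <total Ns in those runs>.
gap_summary() {
  local fasta="$1" min="${2:-10}"
  awk -v min="$min" -v name="$(basename "$fasta")" '
    function scan(seq,    len) {
      # Walk through each run of N/n and keep those of at least min bases
      while (match(seq, /[Nn]+/)) {
        len = RLENGTH
        if (len >= min) { runs++; total += len }
        seq = substr(seq, RSTART + RLENGTH)
      }
    }
    /^>/ { scan(seq); seq = ""; next }   # new record: process the previous one
    { seq = seq $0 }                     # join wrapped sequence lines
    END { scan(seq); printf "%s\t%d\t%d\n", name, runs, total }
  ' "$fasta"
}
```

Sequence lines are joined per contig before scanning, so a run of Ns that wraps across two FASTA lines is counted once. Looping `gap_summary` over final_ncbi_ready_renumbered/*.fa would give one row per genome.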

🧭 Final Thoughts

This workflow converts problematic assemblies into clean, polished FASTA files ready for NCBI:

✔ Adapter regions masked
✔ Contaminant contigs removed
✔ Metadata restored
✔ Contigs renumbered
✔ QC verified

✨ Result: 12 high-quality genomes successfully resubmitted to NCBI.
