Genome Coverage — What It Is, Why It Matters, and How to Calculate It for NCBI Submission

Genome coverage — what it is, why it matters, and how to calculate it

🧬 𝐷𝑎𝑦 76 𝑜𝑓 𝐷𝑎𝑖𝑙𝑦 𝐵𝑖𝑜𝑖𝑛𝑓𝑜𝑟𝑚𝑎𝑡𝑖𝑐𝑠 𝑓𝑟𝑜𝑚 𝐽𝑜𝑗𝑦’𝑠

You assembled your genome and you are getting ready to submit it to NCBI. The submission form asks for sequencing depth. Your supervisor mentions you need at least 30×. A reviewer comes back asking for coverage statistics.

What exactly is coverage? Where does the number come from? And what do you do when you have both Illumina and ONT reads?

This post answers all of that from the beginning — no assumed knowledge — and ends with a Python script that calculates coverage for all three assembly types and produces a table ready for NCBI submission.

What is genome coverage?

Imagine your genome is a road that is 5 million base pairs long. You sequence it by taking millions of short snapshots of random pieces of that road. Coverage tells you, on average, how many times each position on that road was photographed.

The formula is dead simple:

Coverage (X) = Total sequencing bases / Genome assembly size

That is it. No magic. If you sequenced 500 million bases and your assembled genome is 5 million base pairs, your coverage is 100×.

The × symbol (pronounced “times” or “x”) is the unit. 30× means every base in your genome was covered by 30 reads on average. 100× means 100 reads on average.

A concrete example

Genome size:       4,800,000 bp  (4.8 Mb — typical small bacterium)
Total Illumina bases: 480,000,000 bp  (480 Mb — 2 × 150 bp × 1.6M read pairs)

Coverage = 480,000,000 / 4,800,000 = 100×

Why does coverage matter?

Coverage is not just a number for NCBI. It directly determines the quality and reliability of your assembly.

Low coverage (< 10×): Large gaps. The assembler does not have enough reads to build confident contigs. Many positions will have only one or two reads supporting them — any sequencing error there becomes a permanent mistake in your assembly.

Adequate coverage (20–30×): Sufficient for most Illumina assemblies of simple genomes. This is the practical minimum for a genome you would publish.

Good coverage (50–100×): Most positions covered many times. Random sequencing errors (which occur at a rate of ~0.1% per base in Illumina) are effectively corrected by consensus across many reads. Assembly is confident and complete.

Very high coverage (> 200×): Usually not necessary for assembly quality. Can even cause problems — some assemblers collapse repeated regions when coverage is too high. Useful for detecting heterozygosity or mixed populations.

Why Illumina and ONT are measured the same way but contribute differently

Illumina reads are short (150–300 bp) but extremely accurate (~99.9%). High Illumina coverage gives you precise, error-corrected sequences of small regions.

ONT (Oxford Nanopore) reads are long (thousands to hundreds of thousands of bp) but less accurate individually (~95–98%). Lower ONT coverage can still give good assemblies because long reads span repeat regions and resolve structural complexity that short reads cannot.

A hybrid assembly uses both — Illumina reads for base-level accuracy, ONT reads for long-range structural scaffolding. When calculating hybrid coverage, you add the bases from both technologies.

Assembly type	Minimum coverage (typical)	Why
Illumina-only	40–100×	Short reads need depth to correct errors
ONT-only	30–50×	Long reads resolve structure but need depth for polishing
Hybrid	30× Illumina + 20× ONT	Each technology contributes what the other lacks

What files do you need?

To calculate coverage you need three pieces of information for each sample:

1. Genome assembly size — the total number of base pairs in your final assembled genome. This comes from your assembler statistics file (e.g., from seqkit stats, assembly-stats, or the assembler’s own report). The relevant column is usually sum_len or total_length.

sample              sum_len
Eco_001.fna         4821340
Bac_002.fna         3987200
Rhizo_003.fna       6123450

2. Total sequencing bases per sample — the sum of all bases in your FASTQ reads. Not the number of reads — the total base count. You get this from tools like seqkit stats or FastQC.

# Generate read statistics with seqkit
seqkit stats -T sample_R1.fastq.gz sample_R2.fastq.gz > illumina_read_stats.tsv

The column you want is sum_len (total bases) not num_seqs (read count).

3. Sequencing technology — which strategy was used for each sample (Illumina-only, ONT-only, or hybrid). This is a simple lookup table you create:

sample    tech
Eco_001   illumina
Bac_002   long-read/ONT
Rhizo_003 hybrid

Calculating coverage: step by step manually

Let’s walk through one example of each type before looking at the Python script.

Illumina-only

Sample: Eco_001
Genome size:     4,821,340 bp
Illumina bases:  587,000,000 bp

Coverage = 587,000,000 / 4,821,340 = 121.8×

ONT-only

Sample: Bac_002
Genome size:     3,987,200 bp
ONT bases:       198,600,000 bp

Coverage = 198,600,000 / 3,987,200 = 49.8×

Hybrid assembly

Sample: Rhizo_003
Genome size:      6,123,450 bp
Illumina bases:   420,000,000 bp
ONT bases:        180,000,000 bp

Coverage = (420,000,000 + 180,000,000) / 6,123,450 = 97.9×

For hybrid assemblies you add the two base counts because both technologies contributed reads to the assembly. You are asking: across all the reads we fed into the assembler, how many times does that cover the genome?

Python script for batch coverage calculation

When you have dozens of genomes across all three assembly types, doing this by hand is error-prone. Here is a complete Python script that reads your four input files, matches samples by name, applies the correct coverage formula per technology, and outputs a clean TSV ready for NCBI submission.

Input files expected

genome_stats.tsv                 — assembly sizes (columns: file, sum_len)
illumina_read_stats_per_sample.tsv  — Illumina base counts (columns: sample, total_bases)
ont_read_stats_per_sample.tsv       — ONT base counts (columns: sample, total_bases)
sequencing_tech.txt                 — technology per sample (columns: sample, tech)

The script

import os
import pandas as pd
import numpy as np

# ── Input file paths ──
BASE     = "/project/dkarig/ecocoat/NCBI_oct22"
GENOMES  = f"{BASE}/genome_stats.tsv"
ILLUMINA = f"{BASE}/illumina_read_stats_per_sample.tsv"
ONT      = f"{BASE}/ont_read_stats_per_sample.tsv"
TECH     = f"{BASE}/sequencing_tech.txt"
OUT      = f"{BASE}/Genome_Coverage/genome_coverage_summary.tsv"

os.makedirs(f"{BASE}/Genome_Coverage", exist_ok=True)

# ── Load all four files ──
genomes  = pd.read_csv(GENOMES,  sep="\t", dtype=str)
illumina = pd.read_csv(ILLUMINA, sep="\t", dtype=str)
ont      = pd.read_csv(ONT,      sep="\t", dtype=str)
tech     = pd.read_csv(TECH,     sep="\t", dtype=str)

# Normalize column names (lowercase, strip whitespace)
for df in [genomes, illumina, ont, tech]:
    df.columns = [c.strip().lower() for c in df.columns]

# Convert numeric columns
genomes["sum_len"] = pd.to_numeric(genomes["sum_len"], errors="coerce").fillna(0)

illumina["total_bases"] = pd.to_numeric(
    illumina.get("total_bases", illumina.get("sum_len", 0)),
    errors="coerce"
).fillna(0)

ont["total_bases"] = pd.to_numeric(
    ont.get("total_bases", ont.get("sum_len", 0)),
    errors="coerce"
).fillna(0)

# ── Extract clean sample names from genome file paths ──
def extract_sample(s):
    """
    Strips path, extension, and returns the first two
    underscore-separated parts of the filename as the sample ID.
    Example: /path/to/Eco_001.fna  →  Eco_001
    """
    s = str(s)
    s = os.path.basename(s)
    for ext in [".fa", ".fna", ".fasta", ".fastq.gz", ".fq.gz"]:
        s = s.replace(ext, "")
    return "_".join(s.split("_")[:2]) if "_" in s else s

genomes["sample"]  = genomes["file"].apply(extract_sample)
illumina["sample"] = illumina["sample"].apply(lambda x: str(x).split("/")[-1])
ont["sample"]      = ont["sample"].apply(lambda x: str(x).split("/")[-1])
tech["sample"]     = tech["sample"].astype(str)

# ── Aggregate base counts per sample (in case of multiple FASTQ lanes) ──
illumina_sum = illumina.groupby("sample", as_index=False)["total_bases"].sum()
illumina_sum.rename(columns={"total_bases": "IlluminaBases_bp"}, inplace=True)

ont_sum = ont.groupby("sample", as_index=False)["total_bases"].sum()
ont_sum.rename(columns={"total_bases": "ONTBases_bp"}, inplace=True)

# ── Merge all data on sample ID ──
df = genomes[["sample", "sum_len"]].rename(columns={"sum_len": "GenomeSize_bp"})
df = df.merge(illumina_sum, on="sample", how="left")
df = df.merge(ont_sum,      on="sample", how="left")
df = df.merge(tech,         on="sample", how="left")

df = df.fillna(0)
df["IlluminaBases_bp"] = df["IlluminaBases_bp"].astype(float)
df["ONTBases_bp"]      = df["ONTBases_bp"].astype(float)
df["GenomeSize_bp"]    = df["GenomeSize_bp"].astype(float)

# ── Calculate coverage based on technology ──
def calculate_coverage(row):
    g = row["GenomeSize_bp"]
    if g == 0:
        return np.nan                         # no assembly size → skip
    t = str(row["tech"]).lower()
    if t.startswith("illumina"):
        return row["IlluminaBases_bp"] / g    # Illumina only
    elif t.startswith("hybrid"):
        return (row["IlluminaBases_bp"] + row["ONTBases_bp"]) / g  # both
    elif t.startswith("long") or t.startswith("ont"):
        return row["ONTBases_bp"] / g         # ONT only
    else:
        return np.nan                         # unknown tech

df["CoverageX"] = df.apply(calculate_coverage, axis=1)

# ── Flag samples with missing coverage data ──
missing = df[
    (df["CoverageX"].isna()) |
    ((df["IlluminaBases_bp"] == 0) & (df["ONTBases_bp"] == 0))
]
if not missing.empty:
    missing.to_csv(
        f"{BASE}/Genome_Coverage/missing_samples.tsv",
        sep="\t", index=False
    )
    print(f"⚠️  {len(missing)} samples missing coverage data → missing_samples.tsv")

# ── Save final table ──
df.to_csv(OUT, sep="\t", index=False)
print(f"✅ Coverage summary saved: {OUT}")
print("\nFirst 10 rows:")
print(df[["sample","GenomeSize_bp","IlluminaBases_bp","ONTBases_bp","tech","CoverageX"]].head(10).to_string(index=False))

What each section does

Column normalization: QIIME 2 and genomics tools produce headers with inconsistent capitalisation and spacing. Lowercasing everything and stripping whitespace prevents silent merge failures where Sample ≠ sample.

Sample name extraction: Genome files are usually named like /path/to/Eco_001.fna. The extract_sample function strips the path, removes the file extension, and returns Eco_001. This must match exactly what is in your read statistics files — if your FASTQ files use a different naming convention, edit this function.

Left merge: All merges use how="left", keeping every genome in the assembly file as a row even if read stats are missing. Missing samples go to missing_samples.tsv rather than silently disappearing.

Coverage logic: Three branches based on the tech column. The match is case-insensitive and checks the start of the string, so Illumina, illumina, and ILLUMINA-only all match.

Output table

sample     GenomeSize_bp  IlluminaBases_bp  ONTBases_bp  tech      CoverageX
Eco_001     4821340        587000000         0            illumina  121.77
Bac_002     3987200        0                 198600000    long-read  49.81
Rhizo_003   6123450        420000000         180000000   hybrid     97.94

Common problems and how to fix them

Sample names do not match across files

This is the most common failure. Your genome file might be Eco_001.fna but your read stats file might have eco_001 or Eco-001. The script normalises paths and extensions but not dashes vs underscores or case differences in the sample name itself. Check your naming convention and make it consistent before running.

# Quick check: print unique sample names from each file
cut -f1 illumina_read_stats_per_sample.tsv | sort | uniq
cut -f1 sequencing_tech.txt | sort | uniq

CoverageX comes out as NaN for some samples

Check missing_samples.tsv. The most common causes are: the tech column value does not match any of the three expected strings (illumina, hybrid, long-read), or the base count is genuinely zero (the FASTQ file was empty or the stats tool failed on that sample).

Coverage seems unrealistically low (e.g. 2×)

Usually means the base count is being read from the wrong column. The script looks for total_bases — if your seqkit stats output uses a different column name like sum_len, update the column name in the illumina["total_bases"] line.

Hybrid samples show coverage from only one technology

Means one of the read stats files has zero for that sample. Check whether the sample appears in both illumina_read_stats_per_sample.tsv and ont_read_stats_per_sample.tsv.

Generating the read statistics input files

If you do not already have illumina_read_stats_per_sample.tsv and ont_read_stats_per_sample.tsv, here is how to generate them with seqkit:

# For Illumina paired-end reads (run for all samples)
for sample_dir in /path/to/illumina/reads/*/; do
    sample=$(basename "$sample_dir")
    seqkit stats -T "${sample_dir}"*R1*.fastq.gz \
                    "${sample_dir}"*R2*.fastq.gz \
    | tail -n +2 \
    | awk -v s="$sample" '{print s"\t"$5}' >> illumina_read_stats_per_sample.tsv
done

# For ONT reads
for fq in /path/to/ont/reads/*.fastq.gz; do
    sample=$(basename "$fq" .fastq.gz)
    seqkit stats -T "$fq" \
    | tail -n +2 \
    | awk -v s="$sample" '{print s"\t"$5}' >> ont_read_stats_per_sample.tsv
done

Column 5 ($5) in seqkit stats -T output is sum_len — the total base count. Add a header manually:

sed -i '1s/^/sample\ttotal_bases\n/' illumina_read_stats_per_sample.tsv
sed -i '1s/^/sample\ttotal_bases\n/' ont_read_stats_per_sample.tsv

What to write in your methods section and NCBI submission

NCBI asks for coverage during genome submission as part of the biosample/genome metadata. You can fill it in directly from the CoverageX column — round to one decimal place.

Methods sentence you can copy:

Genome coverage was estimated as the ratio of total sequencing bases to final assembly size. For Illumina-only assemblies, coverage was calculated using total paired-end Illumina bases. For ONT-only assemblies, coverage was calculated using total nanopore bases. For hybrid assemblies, coverage was calculated using the combined total of Illumina and ONT bases. Read statistics were generated with seqkit stats.

Summary

Coverage is a single number — total sequencing bases divided by genome size — but it tells you immediately whether your assembly has enough read support to be reliable. 30× is the practical floor for publication. 50–100× is comfortable. The formula changes slightly depending on whether you used Illumina only, ONT only, or both.

The Python script above handles all three cases, flags missing data, and produces a clean TSV you can copy straight into your NCBI submission form.

Tools used: seqkit (read statistics), pandas + numpy (Python table manipulation). Assembly statistics can also be generated with assembly-stats or seqkit stats on the assembled FASTA.

see_your_plot