Genome Assembly Day: Shovill for Illumina & Flye for Nanopore Reads
As I mentioned in an earlier post about NCBI submission (https://jojyjohn28.github.io/blog/NCBI-submission-cleaning/), I had missed a few genome assemblies. So today I am focusing on those missed assemblies.
Whole-Genome Assembly Using Shovill & Flye
Today's work focused on reconstructing complete bacterial genomes using two complementary assembly strategies:
- Shovill for Illumina short reads
- Flye for Nanopore long reads
These assemblies are part of a set of 200+ genomes from a DARPA project that focuses primarily on synthetic biofilm-producing communities.
Why Shovill (Instead of Running SPAdes Directly)
Shovill is a high-performance wrapper around SPAdes specifically optimized for bacterial WGS data.
I chose it over raw SPAdes because Shovill provides:
- Optional adapter trimming (enabled with the --trim flag)
- Read error correction and depth subsampling (reduces runtime)
- SPAdes parameters tuned for bacterial isolates
- Cleaner, smaller output folders
- Faster runs and lower memory use than a default SPAdes run
For five bacterial genomes, Shovill saved hours of compute time and produced polished assemblies with minimal tweaking.
Step 1: Installing Shovill (Conda)
conda create -n shovill_env -c bioconda -c conda-forge shovill
conda activate shovill_env
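Before submitting a long job, it can be worth confirming that the tool is actually on the PATH in the active environment. Here is a minimal sketch; `require_tool` is a hypothetical helper name of my own, not something Shovill provides:

```shell
#!/bin/bash
# require_tool (hypothetical helper): succeed and print "found: NAME" when
# the program is on PATH; otherwise print a message on stderr and return 1.
require_tool() {
    local tool="$1"
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found: $tool"
    else
        echo "missing: $tool" >&2
        return 1
    fi
}
```

With the environment active, something like `require_tool shovill && shovill --check` should then confirm that Shovill and its dependencies are installed (`--check` is Shovill's own dependency check).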
Step 2: Running Shovill on HPC (Batch Script)
Below is the script I used for assembling 5 isolates:
#!/bin/bash
# Assemble every paired-end sample found in READS_DIR with Shovill.
READS_DIR="/project/dkarig/ecocoat/NCBI_oct22/need_biosamples/missing_assemblies"
OUTPUT_DIR="/project/dkarig/ecocoat/NCBI_oct22/need_biosamples/missing_assemblies/Shovil_assembly"
THREADS=32

mkdir -p "$OUTPUT_DIR"

# Loop over the forward reads and look for the matching reverse read.
for FORWARD_READ in "$READS_DIR"/*_R1.fastq.gz; do
    SAMPLE=$(basename "$FORWARD_READ" _R1.fastq.gz)
    REVERSE_READ="$READS_DIR/${SAMPLE}_R2.fastq.gz"

    if [[ -f "$REVERSE_READ" ]]; then
        echo "Processing $SAMPLE..."
        # One output folder per sample.
        shovill \
            --outdir "$OUTPUT_DIR/$SAMPLE" \
            --R1 "$FORWARD_READ" \
            --R2 "$REVERSE_READ" \
            --cpus "$THREADS"
        echo "Assembly for $SAMPLE completed."
    else
        echo "Reverse read file missing for $SAMPLE. Skipping..."
    fi
done
Save it as shovil_new.sh in the working directory (where the raw reads are stored), change to that directory in the terminal, then make the script executable and run it:
chmod +x shovil_new.sh
./shovil_new.sh
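Before launching the full loop, a quick dry run can show which samples actually have both read files. The sketch below assumes the same `_R1.fastq.gz` / `_R2.fastq.gz` naming used in the script above; `list_paired_samples` is a hypothetical helper name.

```shell
# list_paired_samples (hypothetical helper): print the name of every sample
# in a directory that has both an _R1 and an _R2 FASTQ file. Samples with
# only a forward read are reported on stderr.
list_paired_samples() {
    local reads_dir="$1" r1 sample
    for r1 in "$reads_dir"/*_R1.fastq.gz; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$r1" ] || continue
        sample=$(basename "$r1" _R1.fastq.gz)
        if [ -f "$reads_dir/${sample}_R2.fastq.gz" ]; then
            echo "$sample"
        else
            echo "unpaired: $sample" >&2
        fi
    done
}
```

Calling `list_paired_samples "$READS_DIR"` prints one sample name per line, which makes it easy to compare against the number of isolates you expect to assemble.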
Shovill outputs:
- contigs.fa
- SPAdes log files
- assembly statistics
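For a quick sanity check on each contigs.fa, basic numbers such as contig count, total length, and N50 can be computed with standard tools. This is only a rough sketch using awk and sort; `assembly_stats` is a hypothetical helper, and a dedicated QC tool would give a fuller picture.

```shell
# assembly_stats (hypothetical helper): print contig count, total length
# and N50 for a FASTA file.
# Stage 1 (awk): emit one line per contig containing its length.
# Stage 2 (sort): order the lengths from largest to smallest.
# Stage 3 (awk): add up lengths until half the total is reached; the
#                contig length at that point is the N50.
assembly_stats() {
    local fasta="$1"
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$fasta" |
    sort -rn |
    awk '{ lens[NR] = $1; total += $1 }
         END {
             half = total / 2; run = 0; n50 = 0
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run >= half) { n50 = lens[i]; break }
             }
             printf "contigs=%d total_bp=%d N50=%d\n", NR, total, n50
         }'
}
```

For example, `assembly_stats "$OUTPUT_DIR/$SAMPLE/contigs.fa"` prints a single summary line per assembly, so it can be dropped into the same loop as the assembly itself.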
Long-Read Assembly Using Flye
For three Nanopore samples, I used Flye, a robust long-read assembler that works beautifully on HPC.
Why Flye (Instead of Epi2me Labs)
I prefer Flye because:
- Works entirely on HPC (no GUI required)
- No need for ONT's desktop software
- Excellent for bacterial genomes
- Transparent logs & customizable parameters
- Supports polishing and repeat resolution
Epi2me is great for small standalone analyses, but Flye is far better for reproducible, scriptable, HPC-scale assembly.
Step 1: Installing Flye
git clone https://github.com/fenderglass/Flye.git
cd Flye
python setup.py install --user
Step 2: Running Flye on an Interactive Node
I assembled each genome individually on my interactive node, since there were only a few genomes:
cd /home/jojyj/Flye
python /home/jojyj/Flye/bin/flye --nano-raw <path_to_raw_reads> --threads 32 --out-dir <path_to_output_folder>
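If more Nanopore samples come in later, the same command could be generated per sample instead of typed by hand. A sketch, assuming one `<sample>.fastq.gz` file per isolate and `flye` on the PATH (both are assumptions, as is the `flye_cmd` helper name). It only prints the command so it can be checked before anything runs:

```shell
# flye_cmd (hypothetical helper): print the Flye command for one read file.
# Usage: flye_cmd READS OUT_ROOT [THREADS]
# The sample name is taken from the file name (assumes a .fastq.gz suffix).
# Note: paths containing spaces would need extra quoting before execution.
flye_cmd() {
    local reads="$1" out_root="$2" threads="${3:-32}" sample
    sample=$(basename "$reads" .fastq.gz)
    echo "flye --nano-raw $reads --threads $threads --out-dir $out_root/$sample"
}
```

Looping over the read files with `for f in <reads_dir>/*.fastq.gz; do flye_cmd "$f" <out_root>; done` would then list one command per isolate for review.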
Flye outputs include:
- assembly.fasta
- assembly_graph.gfa
- assembly_info.txt
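The assembly_info.txt table is handy for checking whether a chromosome or plasmid came out circular. The sketch below assumes the tab-delimited layout described in the Flye documentation (name, length, coverage, then a Y/N circularity flag in column 4); older Flye versions may mark circularity differently, so check the header line first. `circular_contigs` is a hypothetical helper name.

```shell
# circular_contigs (hypothetical helper): list contigs that Flye marked as
# circular, printing name, length and coverage (tab-separated).
# Assumed columns: 1=name, 2=length, 3=coverage, 4=circular (Y/N).
circular_contigs() {
    local info="$1"
    # Skip the header line (starts with '#') and keep rows flagged "Y".
    awk -F'\t' '!/^#/ && $4 == "Y" { print $1 "\t" $2 "\t" $3 }' "$info"
}
```

Running `circular_contigs <path_to_output_folder>/assembly_info.txt` gives a short list that can be compared against the expected replicons for each isolate.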
Summary
Today's assembly workflow produced:
- Five high-quality Illumina assemblies (Shovill)
- Three Nanopore long-read assemblies (Flye)
- Clean contigs ready for downstream steps (QC, annotation, pangenome, etc.)
