Whole Genome Analysis - Day 1: From Raw Reads to Clean, Assembly-Ready Data
Every genome project starts at the same place: raw sequencing files.
Before assembly, annotation, or comparative genomics, it is critical to confirm that the data you received is complete, uncorrupted, and of sufficient quality.
In this post, I walk through Day 1 of a whole-genome sequencing (WGS) workflow, covering both Illumina (short reads) and Oxford Nanopore (long reads).
Input Data Formats
Typical WGS inputs include:
Illumina (paired-end)
sample_R1.fastq.gz
sample_R2.fastq.gz
Oxford Nanopore
sample_nanopore.fastq.gz
These files often arrive from sequencing facilities or collaborators and should never be trusted blindly.
Step 0: File Integrity Check (md5sum)
Before opening or processing FASTQ files, verify they were transferred correctly.
md5sum *.fastq.gz
Compare the output against checksums provided by the sequencing facility.
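As a minimal sketch, the comparison can be automated with md5sum -c, assuming the facility sent its checksums as a file in the standard md5sum layout. The file name md5sums.txt and the helper name verify_md5 are illustrative assumptions, not something your facility will necessarily provide.

```shell
# verify_md5 CHECKSUM_FILE
# Assumes CHECKSUM_FILE uses the standard md5sum layout (hash, two spaces,
# file name) and that the listed files are in the current directory.
verify_md5() {
    checksum_file="$1"
    if md5sum -c "$checksum_file"; then
        echo "All files match their checksums."
    else
        echo "Checksum mismatch: re-transfer the failed files." >&2
        return 1
    fi
}
```

Run it from the directory that holds the FASTQ files, for example: verify_md5 md5sums.txt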
Why this matters
• Detects silent file corruption
• Saves hours of debugging downstream
• Essential for reproducible genomics
Step 1: Quality Control of Illumina Reads (FastQC)
What is FastQC?
FastQC provides a quick visual summary of sequencing quality for FASTQ files. It works on raw or trimmed Illumina data and produces an HTML report for each input file.
Installation
conda install -c bioconda fastqc
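On HPC systems, FastQC is often available as an environment module instead (for example, module load fastqc; the exact module name depends on your cluster).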
Run FastQC
mkdir fastqc_raw
fastqc -o fastqc_raw *.fastq.gz
Scaling FastQC for 100+ Genomes
Running FastQC on one or two samples is easy. But real projects often involve hundreds of genomes or metagenomes, each with paired-end reads.
Instead of running FastQC manually for each sample, you can automate it using either:
• a simple Bash loop (local or login node)
• a SLURM batch script (recommended for HPC environments)
Option 1: FastQC Using a Bash Loop
This approach works well for:
• local machines
• small to medium datasets
• quick exploratory QC
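A minimal sketch of such a loop is shown here. The function name run_fastqc_loop and the directory names in the usage line are illustrative assumptions; the scripts in the project repository may be organized differently.

```shell
# run_fastqc_loop INPUT_DIR OUTPUT_DIR
# Runs FastQC on every *.fastq.gz file in INPUT_DIR, one file at a time.
run_fastqc_loop() {
    in_dir="$1"
    out_dir="$2"
    mkdir -p "$out_dir"
    for fq in "$in_dir"/*.fastq.gz; do
        # If nothing matched, the pattern stays literal; skip it.
        [ -e "$fq" ] || continue
        echo "Running FastQC on $fq"
        fastqc -o "$out_dir" "$fq"
    done
}
```

Example call, with hypothetical directory names: run_fastqc_loop raw_reads fastqc_raw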
Advantages
• Simple
• Transparent
• Easy to debug
Limitations
• Runs sequentially
• Not ideal for very large datasets
Option 2: FastQC as a SLURM Batch Job (HPC-friendly)
For large projects (100+ samples), always prefer batch execution.
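One common pattern is a SLURM job array with one task per sample. The sketch below assumes a hypothetical samples.txt with one sample prefix per line, reads named PREFIX_R1.fastq.gz and PREFIX_R2.fastq.gz, an existing logs directory, and resource requests that you will need to adapt to your cluster. It is not the repository's script.

```shell
#!/bin/bash
#SBATCH --job-name=fastqc
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --time=02:00:00
#SBATCH --array=1-100
#SBATCH --output=logs/fastqc_%A_%a.out
# The array range above is an assumption: match it to the number of
# lines in samples.txt.

# nth_sample LIST N: print line N of a sample list.
nth_sample() {
    sed -n "${2}p" "$1"
}

# fastqc_one_sample PREFIX OUT_DIR: run FastQC on both mates of a sample.
fastqc_one_sample() {
    sample="$1"
    out_dir="$2"
    mkdir -p "$out_dir"
    fastqc -t "${SLURM_CPUS_PER_TASK:-1}" -o "$out_dir" \
        "${sample}_R1.fastq.gz" "${sample}_R2.fastq.gz"
}

# Only run when SLURM has given this task an array index.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    fastqc_one_sample "$(nth_sample samples.txt "$SLURM_ARRAY_TASK_ID")" fastqc_raw
fi
```

Saved under a name of your choice (say fastqc_array.sh), the script would be submitted with sbatch fastqc_array.sh. Each array task then handles one sample independently, so a failed sample can be rerun without touching the rest.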
Why Batch Processing Matters
When working with large datasets, batch processing:
• avoids overloading login nodes
• scales efficiently across cores
• ensures reproducibility
• allows unattended execution
Tip: Combine FastQC with MultiQC later to summarize all reports in one dashboard.
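As a rough sketch of that step, MultiQC can be pointed at the directory of FastQC outputs. The helper name summarize_fastqc and the directory names are illustrative assumptions, and MultiQC must be installed separately (for example through bioconda).

```shell
# summarize_fastqc FASTQC_DIR REPORT_DIR
# Builds one MultiQC report from the FastQC outputs in FASTQC_DIR.
summarize_fastqc() {
    fastqc_dir="$1"
    report_dir="$2"
    # FastQC writes one NAME_fastqc.zip archive per input file.
    n_reports=$(find "$fastqc_dir" -name '*_fastqc.zip' | wc -l | tr -d ' ')
    if [ "$n_reports" -eq 0 ]; then
        echo "No FastQC reports found in $fastqc_dir" >&2
        return 1
    fi
    echo "Summarizing $n_reports FastQC reports"
    multiqc "$fastqc_dir" -o "$report_dir"
}
```

Example call, with hypothetical directory names: summarize_fastqc fastqc_raw multiqc_raw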
All example scripts are available in the project repository: https://github.com/jojyjohn28/whole-genome-sequencing-analysis
Interpreting FastQC Reports
Key modules to inspect:
• Basic Statistics: total reads, GC %, read length
• Per Base Sequence Quality: look for quality drops at read ends
• Per Sequence Quality Scores: distribution of read quality
• Adapter Content: presence of sequencing adapters
• Overrepresented Sequences: contamination or adapters
• Sequence Duplication Levels: PCR artifacts or bias
Understanding the Per Base Sequence Quality plot
• Red line: median quality
• Yellow box: interquartile range (25-75%)
• Whiskers: 10th and 90th percentiles
• Blue line: mean quality
• Green / orange / red background: good / warning / poor quality
Raw data often looks messy; that's normal.
The plots below are a useful example (source: Google).

Left: Raw reads with a severe quality drop toward the read ends. Right: Trimmed reads with stabilized base quality. These plots show why trimming is required before assembly.
If you have many genomes, a standalone Python script can parse the FastQC outputs and generate a summary table for reporting, plotting, or downstream filtering. For details, see Day 1 in the project repository (https://github.com/jojyjohn28/whole-genome-sequencing-analysis), in particular fastqc_to_table.py and fastqc_to_table.md.
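For a quick look without opening every HTML report, the PASS/WARN/FAIL flags can also be pulled from the summary.txt file inside each FastQC zip archive. This is a rough shell sketch, not the repository's fastqc_to_table.py; the helper names flagged_modules and flagged_in_dir are made up for illustration, and the second one assumes unzip is installed.

```shell
# flagged_modules [SUMMARY_FILE]
# Reads a FastQC summary.txt (tab-separated: status, module, file name)
# from a file or stdin and prints every module that did not PASS.
flagged_modules() {
    awk -F '\t' '$1 != "PASS" { print $3 "\t" $1 "\t" $2 }' "$@"
}

# flagged_in_dir FASTQC_DIR
# Applies flagged_modules to every FastQC zip archive in FASTQC_DIR.
flagged_in_dir() {
    for zip in "$1"/*_fastqc.zip; do
        [ -e "$zip" ] || continue
        unzip -p "$zip" '*/summary.txt' | flagged_modules
    done
}
```

Example call, with a hypothetical directory name: flagged_in_dir fastqc_raw. The output is a tab-separated table (file, status, module) that can be sorted or filtered further.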
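Step 2: Quality Control of Nanopore Reads (NanoPlot)
QC tools designed for short Illumina reads are not well suited to Nanopore data, which has much longer, more variable read lengths and a different error profile.
What is NanoPlot?
NanoPlot visualizes:
• Read length distribution
• Quality vs read length
• Yield over time (only when timestamps are available, for example from a sequencing summary file)
• N50 / N90 statistics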
Installation
conda install -c bioconda nanoplot
Run NanoPlot
NanoPlot \
--fastq sample_nanopore.fastq.gz \
--outdir nanoplot_results \
--threads 8
Loop and batch scripts for NanoPlot are available in the project repository (Day 1): https://github.com/jojyjohn28/whole-genome-sequencing-analysis
What to look for
• Wide read length distribution (expected)
• No extreme quality collapse
• Reasonable N50 for your library type
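To make the N50 check concrete: N50 is the read length at which reads of that length or longer contain at least half of all sequenced bases. NanoPlot reports it for you, but a small sketch shows how it is derived. The helper names read_lengths and n50 are illustrative, and the code assumes standard 4-line FASTQ records.

```shell
# read_lengths FILE: print one read length per line from a gzipped FASTQ.
# Assumes 4-line records, so the sequence is line 2 of every 4.
read_lengths() {
    gzip -dc "$1" | awk 'NR % 4 == 2 { print length($0) }'
}

# n50: read lengths on stdin, print the N50.
n50() {
    sort -nr | awk '
        { len[NR] = $1; total += $1 }
        END {
            running = 0
            # Walk from longest to shortest until half the bases are covered.
            for (i = 1; i <= NR; i++) {
                running += len[i]
                if (running >= total / 2) { print len[i]; exit }
            }
        }'
}
```

Example call: read_lengths sample_nanopore.fastq.gz | n50. For reads of length 10, 8, 6, 4, and 2 (30 bases in total), the two longest reads already cover 18 bases, so the N50 is 8.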
Step 3: Read Trimming (Illumina)
Raw Illumina reads almost always contain:
• Adapter contamination
• Low-quality bases at ends
Two commonly used tools are Trimmomatic and Cutadapt.
Trimmomatic Installation
conda install -c bioconda trimmomatic
General Default Command
trimmomatic PE -threads 16 \
sample_R1.fastq.gz sample_R2.fastq.gz \
sample_R1.paired.fastq.gz sample_R1.unpaired.fastq.gz \
sample_R2.paired.fastq.gz sample_R2.unpaired.fastq.gz \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
LEADING:3 \
TRAILING:3 \
SLIDINGWINDOW:4:15 \
MINLEN:50
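Key Trimming Options
• ILLUMINACLIP: adapter removal (adapter FASTA file, then seed mismatches, palindrome clip threshold, and simple clip threshold; here 2:30:10)
• SLIDINGWINDOW: dynamic quality trimming (here a 4-base window, cut when its average quality falls below 15)
• LEADING / TRAILING: end trimming (here bases below quality 3 are removed from the start and end of each read)
• MINLEN: drop short reads (here reads shorter than 50 bases)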
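For many samples, the same Trimmomatic settings can be wrapped in a loop. This is a minimal sketch that assumes reads are named PREFIX_R1.fastq.gz and PREFIX_R2.fastq.gz; the helper names trim_sample and trim_all, the thread count, and the adapter file path are illustrative assumptions.

```shell
# trim_sample PREFIX ADAPTERS
# Runs Trimmomatic in paired-end mode with the settings shown above.
trim_sample() {
    prefix="$1"
    adapters="$2"
    trimmomatic PE -threads 4 \
        "${prefix}_R1.fastq.gz" "${prefix}_R2.fastq.gz" \
        "${prefix}_R1.paired.fastq.gz" "${prefix}_R1.unpaired.fastq.gz" \
        "${prefix}_R2.paired.fastq.gz" "${prefix}_R2.unpaired.fastq.gz" \
        "ILLUMINACLIP:${adapters}:2:30:10" \
        LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50
}

# trim_all INPUT_DIR ADAPTERS
# Trims every sample in INPUT_DIR that has both an R1 and an R2 file.
trim_all() {
    in_dir="$1"
    adapters="$2"
    for r1 in "$in_dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue
        prefix="${r1%_R1.fastq.gz}"
        if [ ! -e "${prefix}_R2.fastq.gz" ]; then
            echo "Skipping $prefix: missing R2 file" >&2
            continue
        fi
        trim_sample "$prefix" "$adapters"
    done
}
```

Example call, with a hypothetical directory name: trim_all raw_reads TruSeq3-PE.fa. Samples without a matching R2 file are reported and skipped rather than trimmed in single-end mode.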
Cutadapt (Alternative)
Cutadapt provides more flexible adapter handling and is increasingly popular.
General Default Command
cutadapt \
-a AGATCGGAAGAGC \
-A AGATCGGAAGAGC \
-q 20,20 \
-m 50 \
-o sample_R1.trimmed.fastq.gz \
-p sample_R2.trimmed.fastq.gz \
sample_R1.fastq.gz sample_R2.fastq.gz
Step 4: Post-Trimming QC
Always rerun FastQC on the trimmed reads.
You should observe:
• Cleaner base quality profiles
• Reduced adapter content
• More uniform read lengths
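A minimal sketch of this step follows. It reruns FastQC into a separate output directory and adds a small read counter, so you can see how many reads survived trimming. The helper names rerun_fastqc and count_reads and the directory name are illustrative assumptions, and count_reads assumes standard 4-line FASTQ records.

```shell
# rerun_fastqc OUT_DIR FILE...
# Runs FastQC on the given trimmed files, writing reports to OUT_DIR.
rerun_fastqc() {
    out_dir="$1"
    shift
    mkdir -p "$out_dir"
    fastqc -o "$out_dir" "$@"
}

# count_reads FILE: number of reads in a gzipped 4-line FASTQ file.
count_reads() {
    gzip -dc "$1" | awk 'END { printf "%d\n", NR / 4 }'
}
```

Example call: rerun_fastqc fastqc_trimmed sample_R1.trimmed.fastq.gz sample_R2.trimmed.fastq.gz. Comparing count_reads sample_R1.fastq.gz with count_reads sample_R1.trimmed.fastq.gz gives a quick sense of how aggressive the trimming was.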
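Output of Day 1
By the end of Day 1, you should have:
Illumina:
sample_R1.trimmed.fastq.gz
sample_R2.trimmed.fastq.gz
(or sample_R1.paired.fastq.gz and sample_R2.paired.fastq.gz if you trimmed with Trimmomatic as shown above)
Nanopore:
sample_nanopore.fastq.gz (QC-validated)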
Why Day 1 Matters
All downstream steps assume:
• Accurate base calls
• Minimal sequencing artifacts
• Reliable read lengths
Skipping or rushing QC can:
• Break assemblies
• Inflate errors
• Bias comparative genomics
Clean input = trustworthy genomes