GTDB-Tk Complete Workflow

When I began in bioinformatics, those beautifully annotated phylogenetic trees felt impossible to create. Later, when I started binning and whole-genome analysis, visualizing my own genomes with that same level of detail became one of my top goals.

That’s when I finally ended up using GTDB-Tk for tree building and iTOL for annotation, and it opened a whole new world of possibilities.

Introduction

GTDB-Tk (Genome Taxonomy Database Toolkit) has become one of the most widely used tools for classifying bacterial and archaeal genomes using the GTDB standard taxonomy.

In this blog post, I walk through a complete, reproducible workflow:

✔ Installing GTDB-Tk

✔ Setting up a conda environment

✔ Downloading the GTDB reference database

✔ Running identify, align, and classify

✔ Handling common errors like MASH DB

✔ Running the full de_novo_wf

✔ Exporting the final tree to iTOL

✔ A short intro to iTOL annotations

All commands are based on my workflows on the Palmetto HPC Cluster.

1. Creating & Activating GTDB-Tk Conda Environment

If you don’t already have GTDB-Tk installed, create a fresh environment:

conda create -n gtdbtk_env -c bioconda -c conda-forge gtdbtk

Or install it in a custom project folder: You may need little more storage to set up the updated database

conda create -p path/to/desired/folder -c bioconda -c conda-forge gtdbtk

Activate and Verify installation:

gtdbtk --version

2. Downloading the GTDB Reference Database

GTDB-Tk requires the full reference database (e.g., release R207, R214, etc.). Download from: 👉 https://data.gtdb.ecogenomic.org/ Example (Bacteria + Archaea): Keep in mind this is almost more than 100 GB;

wget https://data.gtdb.ecogenomic.org/releases/release214/214.1/gtdbtk/gtdbtk_r214_data.tar.gz

Extract and set environment variable

tar -xzf gtdbtk_r214_data.tar.gz -C /project/bcampb7/camplab/databases/
export GTDBTK_DATA_PATH=/project/bcampb7/camplab/databases/gtdbtk_r214_data #or run line below
echo "export GTDBTK_DATA_PATH=/project/bcampb7/camplab/databases/gtdbtk_r214_data" >> ~/.bashrc #to make it permanant
gtdbtk check_install

3. GTDB-Tk Workflow (Identify → Align → Classify)

Now we run GTDB-Tk on your genomes. 3.1 Identify Marker Genes

gtdbtk identify \
  --genome_dir path/to/directory/with/genomes \
  --out_dir path/to/output/gtdbtk/identify \
  --extension fa \
  --cpus 32 #change accordingly

#change according to your genome.extension; fa or fasta

3.2 Align Marker Genes

gtdbtk align \
  --identify_dir path_to/gtdbtk/identify \
  --out_dir path_for/gtdbtk/align \
  --cpus 32

3.3 Classify Genomes

gtdbtk classify \
  --genome_dir path \
  --align_dir /path/align \
  --out_dir /path/classify \
  -x fasta \
  --cpus 32 \
  --mash_db path_for_saving_db

4. Running the Full de_novo_wf (Automatic Tree Building)

This workflow performs everything: identify → alignment → tree → taxonomy.

gtdbtk de_novo_wf \
  --genome_dir /project/dkarig/ecocoat/NCBI_oct22/need_biosamples/missing_assemblies_all \
  --out_dir /project/dkarig/ecocoat/NCBI_oct22/need_biosamples/missing_assemblies_all/gtdbtk \
  --extension fa \
  --bacteria \
  --cpus 32 \
  --outgroup_taxon p__Chloroflexota \
  --skip_gtdb_refs \
  --custom_taxonomy_file /scratch/jojyj/UK/Genomes/gtdbtk/custom_taxonomy_file.tsv

Notes:

● Outgroup: Choose a phylum not present in your samples based on the sample type. I used Chloroflexota for my esturine MAGs. But for soil it is not correct.

● –skip_gtdb_refs: reduces tree size by removing thousands of reference genomes.

● Custom taxonomy file: generated in classify output → you can edit to customize clade names.

Output tree (important file):

gtdbtk.bac120.decorated.tree

5. Converting GTDB-Tk Tree For iTOL

gtdbtk convert_to_itol \
  --input_tree /scratch/jojyj/UK/Genomes/gtdbtk/de_novo/gtdbtk.bac120.decorated.tree \
  --output_tree /scratch/jojyj/UK/Genomes/gtdbtk/de_novo/OUTPUT_TREE

This creates:

OUTPUT_TREE.tree
OUTPUT_TREE.metadata.tsv

Upload OUTPUT_TREE.tree to iTOL. 👉 https://itol.embl.de/

6. Quick Start Guide to iTOL

iTOL (Interactive Tree of Life) is one of the best online tree-visualization tools.

What you can do in iTOL

● Upload trees (Newick format)

● Add colors for taxonomy (phylum, genus, species)

● Add heatmaps (genome size, CAZymes, AMGs, completeness)

● Add data strips (free-living vs particle-attached, salinity, depth)

● Collapse clades

● Export high-resolution images (SVG, PNG, PDF)

Quick annotation starter files

● colors.txt → Color by clades

● metadata_strip.txt → Add sample categories (Bay, Season, Salinity)

● metadata_heatmap.txt → Add numeric values (gene counts, FRed)

7. Summary

By combining GTDB-Tk + iTOL, you can go from raw genomes to:

✔ GTDB-consistent taxonomy ✔ High-quality phylogenomic alignment ✔ Beautiful publication-ready trees ✔ Metadata-rich visualization

This pipeline is essential for MAG studies, isolate genomes, and comparative genomics.

tree