Tree building with expam
Part One: Building a tree
Note
Instructions for installing all software mentioned in this section can be found in Dependencies tutorial.
There are three stages to building a distance tree:
Sketching the sequences (a compressed form that still contains a lot of the important details),
Using these sketches to estimate sequence distance,
Apply Neighbour Joining (NJ) to the distance matrix to compute a distance tree.
This process can be done manually, or automated through the
mashtree
application.
Basic tree building
We’ll build a tree from the small collection of reference genomes supplied with the expam source code.
First create a new database and add these sequences. * I will refer to the location of these sequences as
../expam/test/data/sequence/
, but this may differ for you.
$ expam create -db tree
$ expam add -db tree -d ../expam/test/data/sequence/
We’ll set parameters that have been used before to build the database, but setting them here will be relevant for tree building.
$ expam set -db tree -k 31 -n 4 -s 100
This means we will be creating sequence sketches with a k-mers size
k=31
and using a sketch sizes=100
.We will also let the tree building software use up to
n=4
threads at a time.
The easy way
From here,
mashtree
can automate the process for us (you must have it installed).
$ expam mashtree -db tree
$ expam tree -db tree
The less easy way
We’ll manually go through the three steps outlined above.
First sketch the sequences.
$ expam sketch -db tree
$ ls tree/phylogeny/sketch
default.k31.s100.msh
Note
By default, this will use mash
to create the sketches. Alternatively,
provide the --sourmash
flag to use sourmash
for sketching.
Now we create the distance matrix from these sketches.
$ expam distance -db tree
$ ls tree/phylogeny/distance
default.k31.s100.tab
$ head -n 3 tree/phylogeny/distance/default.k31.s100.tab
6
GCF_000006765.1_ASM676v1_genomic.fna.gz 0 1 1 1 1 1
GCF_000005845.2_ASM584v2_genomic.fna.gz 1 0 1 1 0.0158863 1
This shows the tab-delimited computed distance matrix.
Warning
If you used --sourmash
to create the sketches, you must also supply
--sourmash
when computing the distance matrix.
Finally, we’ll use a NJ tool to compute the tree from this matrix.
$ expam nj -db tree
Note
By default, expam relies on RapidNJ to do NJ. However, it can call a local installation
of QuickTree using --quicktree
(if you have that installed).
$ expam nj -db tree --quicktree
The tree can now be finalised and attached to the database using the
tree
command.
$ expam tree -db tree
Note
Running expam sketch
, expam distance
and expam nj
is therefore
roughly equivalent to expam mashtree
, at least from the perspective of outcome.
Part Two: Building a tree in parts
You may wish to build a tree containing both bacterial and viral genomes, or even some human sequences for contamination detection. These genomes are very different sizes, and so using the same
k
ands
parameters for each of these types of genomes may not produce very accurate trees. It may be more prudent to build trees for each of these organism types separately, and then join these subtrees afterwards.expam implements a set of routines that enable you to construct a tree in this way - i.e. in parts.
Say we have two groups of sequences,
a
andb
, that we want to construct trees for separately.We can separate these sequences in the database by adding then to separate
groups
.
$ expam add -db tree --group a -d ~/Documents/Sequences/genomes/a/
$ expam add -db tree --group b -d ~/Documents/Sequences/genomes/b/
Note
Run expam print -db tree
and notice how expam lists multiple groups.
Note
Even though these sequences are added to separate groups, they are all part of the reference collection - it won’t affect the later database build behaviour.
We will use
k=31, s=1000
for groupa
, andk=21, s=100
for groupb
.
$ expam set -db tree --group a -k 31 -s 1000
$ expam set -db tree --group b -k 21 -s 100
Note
By specifying parameters alongside a --group
flag, expam recognises that these parameters
are specifically for tree building, not for database construction. Those would still need to be set via
$ expam set -db tree -k 31 -n 4
If database parameters have been set (those without --group
flags) but specific group flags have
not been set, expam will automatically use the database build parameters for tree building.
We also need to tell expam how these trees will be joined at the end. There are two rules for this template:
It is a Newick format tree, where group names appear in double braces.
The template must be placed at
database_name/phylogeny/tree/database_name.nwk
(thephylogeny
subdirectory in the database folder). Replacedatabase_name
with your database name.
The template we will use is
({{a}},{{b}});
Build in parts - mashtree
We can supply the
mashtree
command to build these two trees separately - expam takes care of this behind the scenes.
$ expam mashtree -db tree
Now finalise with the
tree
command to apply the template and cobine these tree.
$ expam tree -db tree
Note
If you wanted to, you could run mashtree
on these groups separetely.
$ expam mashtree --group a
$ expam mashtree --group b
Build in parts - manual
Despite having split the sequences into groups, running the same chain of
sketch
,distance
andnj
commands will result in expam running these commands on each group consecutively.Sketch the sequences
$ expam sketch -db tree
which is equivalent to
$ expam sketch -db tree --group a
$ expam sketch -db tree --group b
You can confirm these two groups have sketch files.
$ ls tree/phylogeny/sketch
a.k31.s1000.msh
b.k21.s100.msh
Get pairwise distances
$ expam distance -db tree
which is equivalent to
$ expam distance -db tree --group a
$ expam distance -db tree --group b
Distances can be found in the
tree/phylogeny/distance/
folder.Finally, apply NJ
$ expam nj -db tree
which is equivalent to
$ expam nj -db tree --group a
$ expam nj -db tree --group b
Finalise the tree using the template.
$ expam tree -db tree
Note
As in Part One, the --sourmash
and --quicktree
flags can be supplied to use
those alternative softwares.