Database Build
Creating a database
We create a database called test.
$ expam create -db test
Successfully created directories /Users/seansolari/Documents/Databases/test/phylogeny!
Fresh database configuration generated!
Logs path /Users/seansolari/Documents/Databases/test/logs created!
Made path /Users/seansolari/Documents/Databases/test/database.
Made path /Users/seansolari/Documents/Databases/test/results.
This database will be located in the current directory, in a folder called test.
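As a quick sanity check, the folder layout reported in the log above can be verified with a few lines of Python. This is only a sketch: the four subdirectory names are copied from the `expam create` output shown here, and the helper `missing_subdirs` is a hypothetical name, not part of expam.

```python
from pathlib import Path

# Subdirectory names as they appear in the `expam create` log above.
EXPECTED_SUBDIRS = ("phylogeny", "logs", "database", "results")


def missing_subdirs(db_path):
    """Return the expected subdirectories that are absent from db_path."""
    root = Path(db_path)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumes the database folder `test` sits in the current directory.
    missing = missing_subdirs("test")
    print("layout looks complete" if not missing else f"missing: {missing}")
```

An empty list means all four folders from the log are present.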
Add reference sequences
We now add sequences to our new database.
Warning
When using custom reference sequences or assembled genomes, ensure as far as possible that these genomes are uncontaminated. Contaminated genomes compromise the distance estimates used to build the reference tree, and may subsequently degrade classification performance.
We have supplied sequences with the source:
Reference genomes: ../expam/test/data/sequence/
Reads (for later): ../expam/test/data/reads/
Add these sequences to our database, test.
$ expam add -db test -d ~/Documents/expam/test/data/sequences
Added 6 files from /Users/seansolari/Documents/expam/test/data/sequences/.
Note
expam can handle both compressed and uncompressed sequence files, and automatically detects how to open most regular sequence files.
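One common way to detect compression automatically is to inspect the first two bytes of a file, since every gzip stream begins with `0x1f 0x8b`. The sketch below illustrates that general idea. It is not expam's actual implementation, and the function names are hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_sequence_file(path):
    """Open a sequence file as text, using gzip only if the magic bytes match."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == GZIP_MAGIC
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def count_records(path):
    """Count FASTA records, i.e. header lines starting with '>'."""
    with open_sequence_file(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Checking the file content rather than the extension means a mislabelled `.fna` file that is really gzipped would still be read correctly.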
Specify build parameters
These are bacterial genomes, so k=31 should work fine. We’ll use 4 processes to speed up the build.
$ expam set -db test -k 31 -n 4
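To make the k parameter concrete: a k-mer is a length-k substring of a sequence, and a sequence of length L contains L - k + 1 overlapping k-mers. The following minimal sketch enumerates them. Skipping k-mers that contain ambiguous bases such as N is an assumption made for illustration and may not match how expam treats them.

```python
def kmers(sequence, k=31):
    """Yield every overlapping k-mer of `sequence` made up only of A, C, G, T."""
    sequence = sequence.upper()
    valid = set("ACGT")
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Assumption: k-mers with ambiguous bases (e.g. N) are skipped.
        if set(kmer) <= valid:
            yield kmer


if __name__ == "__main__":
    # A 40-base sequence holds 40 - 31 + 1 = 10 k-mers when k=31.
    print(len(list(kmers("ACGT" * 10, k=31))))
```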
Build a tree for the reference database
expam needs a tree describing the phylogenetic relationships between your sequences.
We’ll make a tree out of mash distances. sourmash is a portable Python version of mash (you can install it with pip).
We’ll use a sketch size of s=1000 to represent each genome, and then compare them.
$ expam set -db test -s 1000
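The idea behind the sketch size can be illustrated with a bottom-s MinHash: each genome is reduced to the s smallest hash values of its k-mers, and two sketches are compared to estimate the Jaccard similarity of the underlying k-mer sets. The code below is a simplified sketch of that technique, not sourmash's implementation. The hash function (BLAKE2b truncated to 64 bits) and all function names are assumptions chosen for illustration. The last function applies the Mash distance formula, D = -(1/k) ln(2j / (1 + j)).

```python
import hashlib
import math


def kmer_set(sequence, k):
    """Return the set of distinct k-mers in `sequence`."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def hash64(kmer):
    """Hash a k-mer to a 64-bit integer (illustrative choice of hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(sequence, k=31, s=1000):
    """Bottom-s MinHash sketch: the s smallest distinct k-mer hashes."""
    return sorted({hash64(km) for km in kmer_set(sequence, k)})[:s]


def jaccard_estimate(sketch_a, sketch_b, s=1000):
    """Estimate Jaccard similarity from the s smallest hashes of the union."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(jaccard, k=31):
    """Mash distance for a Jaccard estimate; 1.0 when nothing is shared."""
    if jaccard == 0:
        return 1.0
    return max(0.0, -math.log(2 * jaccard / (1 + jaccard)) / k)
```

A larger s gives a more precise estimate at the cost of bigger sketches and slower comparisons.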
We’ll use RapidNJ to make a tree from the sourmash distances (see the RapidNJ installation instructions).
We’ll first ensure that sourmash is installed, before running the tree command to build the tree.
$ python3 -m pip install sourmash
$ expam tree -db test --sourmash
...
Run the print command and check that the configuration matches the following output:
$ expam print -db test
<<< expam configuration file: test >>>
phylogeny --> /Users/seansolari/Documents/Databases/test/phylogeny/tree/test.nwk
k --> 31
n --> 4
sketch --> 1000
pile --> None
----------------
group name: default
k --> None
sketch --> None
sequences --> 6
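If you want to check these values in a script rather than by eye, the `key --> value` lines can be parsed into a dictionary. This is a sketch that assumes the output format is exactly as in the sample above. The function `parse_config` is hypothetical, and it reads only the top-level parameters, stopping at the dashed separator that precedes the group section.

```python
def parse_config(text):
    """Parse top-level 'key --> value' pairs, stopping at the dashed separator."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if line and set(line) == {"-"}:
            break  # group-specific settings follow the dashed line
        if " --> " in line:
            key, value = line.split(" --> ", 1)
            params[key.strip()] = value.strip()
    return params
```

For the output above, this yields string values such as `"31"` for `k` and `"1000"` for `sketch`.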
Build the database
Now that we have a distance-tree for our added reference sequences, we can build the database.
$ expam build -db test
Clearing old log files...
Importing phylogeny...
* Initialising node pool...
* Checking for polytomies...
Polytomy (degree=3) detected! Resolving...
* Finalising index...
Creating LCA matrix...
Extracting sequences from /Users/ssol0002/Documents/Projects/pam/test/data/sequences/GCF_000008725.1_ASM872v1_genomic.fna.gz...
Extracting sequences from /Users/ssol0002/Documents/Projects/pam/test/data/sequences/GCF_000007765.2_ASM776v2_genomic.fna.gz...
Extracting sequences from /Users/ssol0002/Documents/Projects/pam/test/data/sequences/GCF_000005845.2_ASM584v2_genomic.fna.gz...
Extracting sequences from /Users/ssol0002/Documents/Projects/pam/test/data/sequences/GCF_000006925.2_ASM692v2_genomic.fna.gz...
Extracting sequences from /Users/ssol0002/Documents/Projects/pam/test/data/sequences/GCF_000006945.2_ASM694v2_genomic.fna.gz...
Extracting sequences from /Users/ssol0002/Documents/Projects/pam/test/data/sequences/GCF_000006765.1_ASM676v1_genomic.fna.gz...
expam: 42.359643852s
PID - 65856 dying...
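The "Creating LCA matrix..." step in the log refers to lowest common ancestors (LCAs) in the reference tree. As an illustration of the concept only, the sketch below computes the LCA of every pair of leaves in a small tree stored as a child-to-parent mapping. The data structure and function names are assumptions and do not reflect expam's internal representation.

```python
def ancestors(node, parent):
    """Return the path from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca_matrix(leaves, parent):
    """Map each ordered pair of leaves to its lowest common ancestor."""
    matrix = {}
    for a in leaves:
        above_a = set(ancestors(a, parent))
        for b in leaves:
            # First node on b's path to the root that also lies above a.
            matrix[a, b] = next(n for n in ancestors(b, parent) if n in above_a)
    return matrix


if __name__ == "__main__":
    # Toy tree: root has children n1 and C; n1 has children A and B.
    parent = {"A": "n1", "B": "n1", "n1": "root", "C": "root"}
    print(lca_matrix(["A", "B", "C"], parent)["A", "B"])
```

Precomputing all pairs trades memory for constant-time lookups later on.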