Available commands
Setting the default database
The EXPAM_DEFAULT_DB environment variable defines expam’s default database. The default_db command prints the export statement needed to set it:
$ expam default_db -db DB_NAME
export EXPAM_DEFAULT_DB=PATH_TO_DB
As a shortcut,
$ expam default_db -db DB_NAME >> ~/.bash_profile
sets the environment variable each time you open a bash console.
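To confirm that the variable is visible from a Python session (for instance when scripting around expam), a minimal sketch follows. The helper name `default_db_from_env` is hypothetical and not part of expam; it only reads the environment variable named above.

```python
import os
from typing import Mapping, Optional


def default_db_from_env(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the value of EXPAM_DEFAULT_DB, or None if unset or empty."""
    value = env.get("EXPAM_DEFAULT_DB", "").strip()
    return value or None


if __name__ == "__main__":
    db = default_db_from_env()
    print(db if db else "EXPAM_DEFAULT_DB is not set")
```

If this reports that the variable is not set in a new console, check that the export line was appended to the profile file your shell actually reads.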
Create a new database
$ expam create -db DB_NAME
Set build parameters
Set the build parameters for a database.
$ expam set -db DB_NAME [args...]
- -k <int>, --kmer <int>
K-mer size for building database.
Note
A k-mer length of 31 is often used, and is probably good for most bacterial genomes.
- -n <int>, --n_processes <int>
Number of processes to distribute building work amongst.
Note
Using more processes decreases the build time but increases the amount of computer resources used during build.
The following short Python excerpt reports the number of CPU cores on your machine, which is a sensible upper bound on the number of processes.
>>> import multiprocessing
>>> multiprocessing.cpu_count()
8
- -p <file path>, --phylogeny <file path>
Path to Newick file detailing tree of reference sequences.
Example
$ expam set -db /path/to/db -k 31 -n 12 -p /home/seansolari/tree.nwk
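If you would rather leave some headroom for other work on the machine, the value passed to -n can be derived from the core count. This is only a sketch: the helper name and its `reserve` parameter are illustrative assumptions, not expam settings.

```python
import os


def suggested_processes(reserve: int = 1) -> int:
    """Core count minus `reserve`, never less than 1."""
    total = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, total - reserve)


if __name__ == "__main__":
    # Print a value that could be passed to `expam set -n`.
    print(suggested_processes())
```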
Add/Remove sequences
Add reference sequences to, or remove them from, the database.
$ expam add -db DB_NAME [args...]
$ expam remove -db DB_NAME [args...]
- -d <file path>, --directory <file path>
Add sequence at file path to the database.
Note
File path can be a file, or a folder.
If a folder, expam will add all sequences within this folder.
- --first_n <int>
(optional)
Add only the first n sequences from the directory to the database, in the same order as they are listed by ls.
Examples
$ expam add -db /path/to/db -d /path/to/sequence.fna
Added 1 sequence from /path/to/sequence.fna!
$ expam add -db /path/to/db -d /path/to/folder/
Added 19 sequences from /path/to/folder!
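To preview which files a --first_n value would be likely to select, the directory can be listed in sorted order. The sketch below is an approximation under stated assumptions: `first_n_files` is a hypothetical helper, and Python's `sorted` compares names by code point, whereas ls uses locale collation, so the two orders can differ for mixed-case or non-ASCII file names.

```python
from pathlib import Path
from typing import List


def first_n_files(directory: str, n: int) -> List[Path]:
    """Return the first n regular files in `directory`, sorted by name."""
    files = sorted(p for p in Path(directory).iterdir() if p.is_file())
    return files[:n]


if __name__ == "__main__":
    # /path/to/folder/ is the placeholder path used in the examples above.
    folder = Path("/path/to/folder/")
    if folder.is_dir():
        for path in first_n_files(str(folder), 5):
            print(path.name)
```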
Printing database parameters
Print current database configuration.
$ expam print -db DB_NAME
<<< expam configuration file: DB_NAME >>>
phylogeny --> None
k --> 31
n --> 12
sketch --> None
pile --> 512
----------------
group name: default
k --> None
sketch --> None
sequences --> 9
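When scripting around expam it can be handy to read this printout back into a data structure. The following sketch assumes the layout is exactly as in the example above (`key --> value` lines, a row of dashes between sections, and a `group name:` line); `parse_print_output` is a hypothetical helper, and the real output may contain lines this parser ignores.

```python
from typing import Dict, List, Optional


def parse_print_output(text: str) -> List[Dict[str, Optional[str]]]:
    """Split the printout into sections of key/value pairs.

    A new section starts at each line of dashes; the literal string
    "None" is converted to Python's None. Values stay as strings.
    """
    sections: List[Dict[str, Optional[str]]] = [{}]
    for line in text.splitlines():
        line = line.strip()
        if line and set(line) == {"-"}:
            sections.append({})
        elif "-->" in line:
            key, value = (part.strip() for part in line.split("-->", 1))
            sections[-1][key] = None if value == "None" else value
        elif line.startswith("group name:"):
            sections[-1]["group name"] = line.split(":", 1)[1].strip()
    return sections
```

The first section holds the database-wide parameters and each later section holds one group.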
Build a database
When all parameters have been set, a database can be built.
$ expam build -db DB_NAME
Classification
Classify
Run metagenomic reads against a successfully built database. See Tutorial 2 for more details.
$ expam classify -db DB_NAME [args...]
- -d <file path>, --directory <file path>
Path containing fasta/fastq files to be classified.
Note
Path here can either be a single file, or a folder in which case expam will process all sequence files in this folder.
- --paired
To be supplied when sample files contain paired-end reads.
- -o <str>, --out <str>
Path to save classification results and output in.
- --taxonomy
Convert phylogenetic results to taxonomic results.
Note
This requires taxonomic information for all reference sequences, see the download_taxonomy command.
- --cpm <int>, --cutoff <int>
Apply cutoff to read count. cpm is short for counts-per-million, and takes priority if both cpm and cutoff are supplied.
- --phyla
Colour phylotree results by phyla.
- --keep_zeros
Keep nodes in output where no reads have been assigned.
- --ignore_names
Don’t plot names of reference genomes in output phylotree.
- --colour_list <hex string> <hex string> ...
List of colours to use when plotting groups in phylotree.
- --group <sample name> <sample name> ...
Space-separated list of sample files to be treated as a single group in phylotree. Groups are explained in this tutorial.
Note
You may also supply a hex colour code directly after the --group flag to assign that colour to this group of samples. Quote the colour code, since an unquoted # starts a comment in bash.
$ expam classify ... --group '#FF0000' sample_one sample_two
- --alpha <float>
Percentage requirement for classification subtrees (see Tutorial 2).
- --itol
Rather than use ete3 for plotting the phylogenetic tree, expam will output files that can be used with iTOL for plotting. See the classification tutorial for details.
- --log-scale
Compute a log-transform on the counts at each node in the phylogenetic tree before depiction on the phylotree.
Note
For a given sample \(S\), with minimum and maximum counts \(\underline{c}\) and \(\overline{c}\) respectively (\(\underline{c} > 0\) i.e. the smallest non-zero score), the log-transform \(f\) of some count \(x\) is defined by
\[f(x) = \frac{ \log\left(x / \underline{c}\right) }{ \log\left(\overline{c} / \underline{c}\right) },\]
so that \(f(x)\in[0,1]\). Then \(f(x)\) is treated as an opacity score for plotting purposes.
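The log-transform defined for --log-scale can be reproduced in a few lines, which may help when checking opacities by hand. This is a sketch of the formula only, not expam's own code. Two behaviours are assumptions, because the formula does not cover them: zero counts are given opacity 0, and a sample with a single distinct non-zero count is drawn fully opaque.

```python
import math
from typing import Dict


def log_opacity(counts: Dict[str, float]) -> Dict[str, float]:
    """Map each node's count to an opacity in [0, 1] using f(x)."""
    nonzero = [c for c in counts.values() if c > 0]
    if not nonzero:
        return {node: 0.0 for node in counts}
    c_min, c_max = min(nonzero), max(nonzero)
    if c_min == c_max:
        # Assumption: avoid dividing by log(1) = 0; draw non-zero nodes opaque.
        return {node: (1.0 if c > 0 else 0.0) for node, c in counts.items()}
    scale = math.log(c_max / c_min)
    # Assumption: zero counts lie outside the domain of f and get opacity 0.
    return {
        node: (math.log(c / c_min) / scale if c > 0 else 0.0)
        for node, c in counts.items()
    }
```

For counts of 1, 10 and 100, the smallest count maps to 0, the largest to 1, and the middle one to roughly 0.5, since it sits halfway between them on a log scale.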
Example
$ expam classify -db DB_NAME -d /path/to/paired/reads --paired --out ~/paired_reads_analysis --taxonomy
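To choose a value for --cpm, it helps to see what a counts-per-million threshold means in absolute reads for a sample of a given size. The sketch below shows only that arithmetic. Both helper names are hypothetical, and whether expam compares with `>=`, rounds the threshold, or computes the total per sample is not stated here and is assumed.

```python
def cpm_to_reads(cpm: float, total_reads: int) -> float:
    """Absolute read count equivalent to `cpm` counts-per-million."""
    return cpm * total_reads / 1_000_000


def passes_cutoff(count: int, total_reads: int, cpm: float) -> bool:
    """True if `count` reaches the cpm threshold for this sample (assumed >=)."""
    return count >= cpm_to_reads(cpm, total_reads)
```

For example, in a sample of 2,000,000 reads a threshold of 100 cpm corresponds to 200 reads.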
Download taxonomic data
Download taxonomic metadata associated with all sequences used to build the database.
$ expam download_taxonomy -db mydb
Note
This command can only be run after the database has been built. This is because expam first finds NCBI accession IDs or explicit taxon IDs in the header of each reference sequence, and uses these to search against the NCBI Entrez database.
Note
The NCBI taxonomic ID of a reference sequence can be explicitly stated in the format
> accessionid or metadata|taxid|TAXID|
For instance,
>NZ_RJQC00000000.1|taxid|2486576|
If both accession ID and taxon ID are supplied, taxon ID takes precedence.
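Before building, you may want to check which reference headers already carry an explicit taxon ID in this format. The following is a minimal sketch that looks for the `|taxid|TAXID|` pattern described above; `taxid_from_header` is a hypothetical helper, and it assumes the taxon ID consists only of digits.

```python
import re
from typing import Optional

# Matches the '|taxid|TAXID|' suffix, capturing the numeric TAXID.
_TAXID = re.compile(r"\|taxid\|(\d+)\|")


def taxid_from_header(header: str) -> Optional[str]:
    """Return the TAXID in '...|taxid|TAXID|', or None if absent."""
    match = _TAXID.search(header)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(taxid_from_header(">NZ_RJQC00000000.1|taxid|2486576|"))
```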
Convert results to taxonomy
Translate phylogenetic classification output to NCBI taxonomy.
$ expam to_taxonomy -db DB_NAME --out PATH_TO_RESULTS
- -o <str>, --out <str>
Path to retrieve classification results (same as was passed to expam classify).
Plotting results on phylotree
Results are automatically visualised on a phylogenetic tree during the expam classify command, but this can also be done after classification using the phylotree command.
$ expam phylotree -db DB_NAME --out /path/to/classification/output [args...]
- -o <str>, --out <str>
Path to retrieve classification results for plotting.
- --phyla
Colour phylotree results by phyla.
- --keep_zeros
Keep nodes in output where no reads have been assigned.
- --ignore_names
Don’t plot names of reference genomes in output phylotree.
- --colour_list <hex string> <hex string> ...
List of colours to use when plotting groups in phylotree.
- --group <sample name> <sample name> ...
Space-separated list of sample files to be treated as a single group in phylotree. Groups are explained above, and in this tutorial.
- --itol
Rather than use ete3 for plotting the phylogenetic tree, expam will output files that can be used with iTOL for plotting. See the classification tutorial for details.
- --log-scale
Compute a log-transform on the counts at each node in the phylogenetic tree before depiction on the phylotree.
Limiting resource usage
expam allows you to provide an expam_limit context before the expam call to limit how much RAM is used. Note that this doesn’t change any underlying algorithms; it simply prepares a graceful exit of the program if it exceeds the supplied limit. See examples for example usage.
- -m <int>, --memory <int>
Memory limit in bytes.
- -x <float>, --x <float>
Fraction of total available memory to limit to (e.g. 0.5 for half).
- -t <float>, --interval <float>
Interval, in seconds, at which program memory usage is written to the log file.
- -o <str>, --out <str>
Log file to write to. By default, logs are written to console.
Example
The following will perform a database build while restricting expam’s total memory usage to half of the machine’s available RAM, writing logs at 1 second intervals to a build.log file.
$ expam_limit -x 0.5 -t 1.0 -o build.log expam build ...
Warning
It is important that the expam_limit command comes before the expam command.
Note
The expam_limit context works the same for any command. expam build can be replaced with expam classify, or any other command.
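When launching expam from a script, the ordering requirement is easy to get wrong. The sketch below assembles the argument list so that expam_limit and its options always precede the expam command. `limited_command` is a hypothetical helper; only the flags documented in this section are used.

```python
import shlex
from typing import List, Optional


def limited_command(expam_args: List[str], fraction: Optional[float] = None,
                    memory_bytes: Optional[int] = None,
                    interval: Optional[float] = None,
                    log_file: Optional[str] = None) -> List[str]:
    """Prefix an expam command with expam_limit and its options."""
    argv = ["expam_limit"]
    if memory_bytes is not None:
        argv += ["-m", str(memory_bytes)]
    if fraction is not None:
        argv += ["-x", str(fraction)]
    if interval is not None:
        argv += ["-t", str(interval)]
    if log_file is not None:
        argv += ["-o", log_file]
    # expam_limit and its options must precede the expam command itself.
    return argv + ["expam"] + expam_args


if __name__ == "__main__":
    cmd = limited_command(["build", "-db", "DB_NAME"], fraction=0.5,
                          interval=1.0, log_file="build.log")
    print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run`, which avoids shell quoting issues.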
The following is an example of the (tab-separated) log file output:
2022-03-11 02:25:05,888 ... total used free shared buff/cache available
2022-03-11 02:25:05,903 ... Mem: 944Gi 1.6Gi 427Gi 0.0Ki 515Gi 938Gi
2022-03-11 02:25:06,915 ... Mem: 944Gi 1.6Gi 427Gi 0.0Ki 515Gi 938Gi
2022-03-11 02:25:07,928 ... Mem: 944Gi 2.2Gi 427Gi 38Mi 515Gi 937Gi
2022-03-11 02:25:08,940 ... Mem: 944Gi 2.2Gi 426Gi 195Mi 515Gi 937Gi
2022-03-11 02:25:09,953 ... Mem: 944Gi 2.2Gi 426Gi 353Mi 515Gi 937Gi
2022-03-11 02:25:10,966 ... Mem: 944Gi 2.2Gi 426Gi 516Mi 516Gi 937Gi
2022-03-11 02:25:11,980 ... Mem: 944Gi 2.2Gi 426Gi 682Mi 516Gi 936Gi
2022-03-11 02:25:12,992 ... Mem: 944Gi 2.2Gi 426Gi 848Mi 516Gi 936Gi
2022-03-11 02:25:14,005 ... Mem: 944Gi 2.2Gi 425Gi 1.0Gi 516Gi 936Gi
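A log like this can be summarised programmatically, for example to find peak usage. The following sketch parses one `Mem:` line into bytes. It assumes the six values follow the column order of the header line (total, used, free, shared, buff/cache, available) and that the Ki/Mi/Gi suffixes denote binary multiples; both helper names are hypothetical.

```python
from typing import Dict, Optional

_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_FIELDS = ("total", "used", "free", "shared", "buff/cache", "available")


def to_bytes(value: str) -> float:
    """Convert a value such as '1.6Gi' to bytes."""
    for suffix, factor in _UNITS.items():
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * factor
    return float(value.rstrip("B"))  # assumption: plain bytes otherwise


def parse_mem_line(line: str) -> Optional[Dict[str, float]]:
    """Return the six memory columns of a 'Mem:' log line, in bytes."""
    _, found, rest = line.partition("Mem:")
    values = rest.split()
    if not found or len(values) < len(_FIELDS):
        return None  # header line or unrelated log entry
    return {name: to_bytes(v) for name, v in zip(_FIELDS, values)}
```

Applying `parse_mem_line` to every line of the log and taking the maximum of the `used` field gives the peak memory recorded during the run.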