Available commands

Setting the default database

The EXPAM_DEFAULT_DB environment variable defines expam’s default database.

$ expam default_db -db DB_NAME
export EXPAM_DEFAULT_DB=PATH_TO_DB

As a shortcut,

$ expam default_db -db DB_NAME >> ~/.bash_profile

sets the environment variable each time you open a bash console.
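
A script that wraps expam may want to check whether a default database is configured before calling it. The following is a minimal Python sketch, not part of expam: the helper name default_db is hypothetical, and only the EXPAM_DEFAULT_DB variable name comes from the text above.

```python
import os
from typing import Mapping, Optional


def default_db(environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the path stored in EXPAM_DEFAULT_DB, or None if unset or empty."""
    # An empty string is treated the same as an unset variable.
    return environ.get("EXPAM_DEFAULT_DB") or None
```

Calling default_db() with no arguments reads the current process environment; passing a dictionary makes the helper easy to test.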

Create a new database

$ expam create -db DB_NAME

Set build parameters

Set the build parameters for some database.

$ expam set -db DB_NAME [args...]

-k <int>, --kmer <int>: K-mer size for building database.

Note

A k-mer length of 31 is often used, and is probably good for most bacterial genomes.

-n <int>, --n_processes <int>

Number of processes to distribute building work amongst.

Note

Using more processes decreases the build time but increases the amount of computer resources used during build.

The following short Python excerpt reports the number of CPUs on your machine, which is a sensible upper bound for the number of processes.

>>> import multiprocessing
>>> multiprocessing.cpu_count()
8
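
To choose a value for -n in a script, one option is to leave a core free for the rest of the system. This is a minimal sketch rather than an expam recommendation: the helper name suggested_processes and the reserve-one-core policy are assumptions made for illustration.

```python
import multiprocessing


def suggested_processes(reserve: int = 1) -> int:
    """Return a process count that leaves `reserve` cores free (never below 1)."""
    return max(1, multiprocessing.cpu_count() - reserve)
```

The returned value can then be passed to expam set through the -n flag.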

-p <file path>, --phylogeny <file path>: Path to Newick file detailing tree of reference sequences.

Example

$ expam set -db /path/to/db -k 31 -n 12 -p /home/seansolari/tree.nwk

Add/Remove sequences

Add reference sequences to, or remove them from, the database.

$ expam add -db DB_NAME [args...]
$ expam remove -db DB_NAME [args...]

-d <file path>, --directory <file path>: Add sequence at file path to the database.

Note

File path can be a file, or a folder.

If a folder, expam will add all sequences within this folder.

--first_n <int>

(optional)

Add only the first n sequences from the directory (in the order listed by ls) to the database.

--group <str>

(optional)

Add sequences to particular sequence group.

See here for details.

Examples

$ expam add -db /path/to/db -d /path/to/sequence.fna
Added 1 sequence from /path/to/sequence.fna!

$ expam add -db /path/to/db -d /path/to/folder/
Added 19 sequences from /path/to/folder!

Printing database parameters

Print current database configuration.

$ expam print -db DB_NAME
<<< expam configuration file: DB_NAME >>>

phylogeny       -->     None
k               -->     31
n               -->     12
sketch          -->     None
pile            -->     512

----------------
group name: default
        k               -->     None
        sketch          -->     None
        sequences       -->     9

Build a database

When all parameters have been set, a database can be built.

$ expam build -db DB_NAME

Classification

Classify

Run metagenomic reads against a successfully built database. See Tutorial 2 for more details.

$ expam classify -db DB_NAME [args...]

-d <file path>, --directory <file path>: Path containing fasta/fastq files to be classified.

Note

Path here can either be a single file, or a folder in which case expam will process all sequence files in this folder.

--paired: To be supplied when sample files contained paired-end reads.

-o <str>, --out <str>: Path to save classification results and output in.

--taxonomy: Convert phylogenetic results to taxonomic results.

Note

This requires taxonomic information for all reference sequences, see the download_taxonomy command.

--cpm <int>, --cutoff <int>: Apply cutoff to read count. cpm is short for counts-per-million, and takes priority if both cpm and cutoff are supplied.

--phyla: Colour phylotree results by phyla.

--keep_zeros: Keep nodes in output where no reads have been assigned.

--ignore_names: Don’t plot names of reference genomes in output phylotree.

--colour_list <hex string> <hex string> ...: List of colours to use when plotting groups in phylotree.

--group <sample name> <sample name> ...

Space-separated list of sample files to be treated as a single group in phylotree. Groups are explained in this tutorial.

Note

You may also supply a hex colour code directly after the --group flag to assign that colour to this group of samples. Quote the colour code so that the shell does not treat # as the start of a comment.

$ expam classify ... --group '#FF0000' sample_one sample_two
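
The convention of an optional leading colour followed by sample names can be illustrated with a short Python sketch. This is not expam's own argument handling: split_group is a hypothetical helper, and the assumption that a colour is always a six-digit hex code is made here for illustration only.

```python
import re
from typing import List, Optional, Sequence, Tuple

# Assumed colour format: '#' followed by exactly six hexadecimal digits.
HEX_COLOUR = re.compile(r"^#[0-9A-Fa-f]{6}$")


def split_group(values: Sequence[str]) -> Tuple[Optional[str], List[str]]:
    """Split the values given to --group into (colour or None, sample names)."""
    if values and HEX_COLOUR.match(values[0]):
        return values[0], list(values[1:])
    return None, list(values)
```

With the values from the command above, split_group(["#FF0000", "sample_one", "sample_two"]) separates the colour from the two sample names.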

--alpha <float>: Percentage requirement for classification subtrees (see Tutorial 2).

--itol: Rather than use ete3 for plotting the phylogenetic tree, expam will output files that can be used with iTOL for plotting. See the classification tutorial for details.

--log-scale: Compute a log-transform on the counts at each node in the phylogenetic tree before depiction on the phylotree.

Note

For a given sample $S$, with minimum and maximum counts $\underline{c}$ and $\overline{c}$ respectively ($\underline{c} > 0$ i.e. the smallest non-zero score), the log-transform $f$ of some count $x$ is defined by

\[f(x) = \frac{ \log\left(x / \underline{c}\right) }{ \log\left(\overline{c} / \underline{c}\right) },\]

so that $f(x)\in[0,1]$. Then $f(x)$ is treated as an opacity score for plotting purposes.
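
The transform above can be written directly in Python. This is a sketch of the formula as stated, not expam's internal code: the function name log_opacity is hypothetical, and returning 1.0 when the minimum and maximum counts coincide is an assumption, since the formula is undefined in that case.

```python
import math


def log_opacity(x: float, c_min: float, c_max: float) -> float:
    """Map a count x to [0, 1] using the sample's smallest non-zero count
    c_min and its largest count c_max."""
    if not 0 < c_min <= x <= c_max:
        raise ValueError("expected 0 < c_min <= x <= c_max")
    if c_min == c_max:
        # The denominator would be log(1) = 0; full opacity is an assumption.
        return 1.0
    return math.log(x / c_min) / math.log(c_max / c_min)
```

The smallest non-zero count maps to 0 and the largest count maps to 1, with intermediate counts spaced logarithmically between them.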

Example

$ expam classify -db DB_NAME -d /path/to/paired/reads --paired --out ~/paired_reads_analysis --taxonomy
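
For the --cpm and --cutoff options described above, the following sketch shows how a counts-per-million threshold relates to an absolute read count, and how the stated priority rule could be applied. It is illustrative only: effective_cutoff is a hypothetical helper, and scaling by the total number of reads in a sample is an assumption about how the cpm value is interpreted.

```python
from typing import Optional


def effective_cutoff(total_reads: int,
                     cpm: Optional[float] = None,
                     cutoff: Optional[float] = None) -> float:
    """Return the read-count threshold implied by cpm and/or cutoff."""
    if cpm is not None:
        # cpm takes priority: scale counts-per-million to this sample's size.
        return cpm * total_reads / 1_000_000
    if cutoff is not None:
        return float(cutoff)
    return 0.0  # no threshold supplied
```

Under these assumptions, a cpm of 100 on a sample of two million reads corresponds to a threshold of 200 reads, regardless of any cutoff value supplied alongside it.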

Download taxonomic data

Download taxonomic metadata associated with all sequences used to build the database.

$ expam download_taxonomy -db mydb

Note

This command can only be run after the database has been built. This is because expam first finds NCBI accession IDs or explicit taxon IDs in the header of each reference sequence, and uses these to search against the NCBI Entrez database.

Note

The NCBI taxonomic ID of a reference sequence can be explicitly stated in the format

> accessionid or metadata|taxid|TAXID|

For instance,

>NZ_RJQC00000000.1|taxid|2486576|

If both accession ID and taxon ID are supplied, taxon ID takes precedence.
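
To check that your reference headers follow this format before building, a short Python sketch such as the following may help. It is not expam's parser: the pattern and the helper name explicit_taxid are assumptions based only on the format shown above.

```python
import re
from typing import Optional

# Matches the "|taxid|TAXID|" suffix described above, with a numeric TAXID.
TAXID_PATTERN = re.compile(r"\|taxid\|(\d+)\|")


def explicit_taxid(header: str) -> Optional[str]:
    """Return the taxon ID embedded in a FASTA header, or None if absent."""
    match = TAXID_PATTERN.search(header)
    return match.group(1) if match else None
```

For the header shown above, explicit_taxid returns the string "2486576"; a header without the taxid field returns None.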

Convert results to taxonomy

Translate phylogenetic classification output to NCBI taxonomy.

$ expam to_taxonomy -db DB_NAME --out PATH_TO_RESULTS

-o <str>, --out <str>: Path to retrieve classification results (same as was passed to expam classify).

Plotting results on phylotree

Results are automatically visualised on top of a phylogenetic tree during the expam classify command, but this can also be done after classification using the phylotree command.

$ expam phylotree -db DB_NAME --out /path/to/classification/output [args...]

-o <str>, --out <str>: Path to retrieve classification results for plotting.

--phyla: Colour phylotree results by phyla.

--keep_zeros: Keep nodes in output where no reads have been assigned.

--ignore_names: Don’t plot names of reference genomes in output phylotree.

--colour_list <hex string> <hex string> ...: List of colours to use when plotting groups in phylotree.

--group <sample name> <sample name> ...: Space-separated list of sample files to be treated as a single group in phylotree. Groups are explained above, and in this tutorial.

--itol: Rather than use ete3 for plotting the phylogenetic tree, expam will output files that can be used with iTOL for plotting. See the classification tutorial for details.

--log-scale: Compute a log-transform on the counts at each node in the phylogenetic tree before depiction on the phylotree.

Limiting resource usage

expam allows you to provide an expam_limit context before the expam call to limit how much RAM is used. Note that this doesn’t change any underlying algorithms; it simply prepares a graceful exit if the program exceeds the supplied limit. See examples for example usage.

-m <int>, --memory <int>: Memory limit in bytes.

-x <float>, --x <float>: Fraction of total available memory to limit to (for example, 0.5 for half).

-t <float>, --interval <float>: Interval, in seconds, at which program memory usage is written to the log file.

-o <str>, --out <str>: Log file to write to. By default, logs are written to console.

Example

The following will perform a database build while restricting expam’s total memory usage to half of the machine’s available RAM, writing logs at 1-second intervals to a build.log file.

$ expam_limit -x 0.5 -t 1.0 -o build.log expam build ...
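
To express a similar limit as an absolute value for -m, the fraction can be converted to bytes. This is a minimal Python sketch under the assumption that the fraction is taken of a known total amount of RAM; memory_limit_bytes is a hypothetical helper, not an expam function.

```python
def memory_limit_bytes(total_bytes: int, fraction: float) -> int:
    """Return the byte limit corresponding to a fraction of total_bytes."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    return int(total_bytes * fraction)
```

For a machine with 16 GiB of RAM, memory_limit_bytes(16 * 1024**3, 0.5) gives the number of bytes in 8 GiB.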

Warning

It is important that the expam_limit command comes before the expam command.

Note

The expam_limit context works the same for any command. expam build can be replaced with expam classify, or any other command.

The following is an example of the (tab-separated) log file output:

2022-03-11 02:25:05,888 ...         total   used    free    shared  buff/cache      available
2022-03-11 02:25:05,903 ... Mem:    944Gi   1.6Gi   427Gi   0.0Ki   515Gi   938Gi
2022-03-11 02:25:06,915 ... Mem:    944Gi   1.6Gi   427Gi   0.0Ki   515Gi   938Gi
2022-03-11 02:25:07,928 ... Mem:    944Gi   2.2Gi   427Gi   38Mi    515Gi   937Gi
2022-03-11 02:25:08,940 ... Mem:    944Gi   2.2Gi   426Gi   195Mi   515Gi   937Gi
2022-03-11 02:25:09,953 ... Mem:    944Gi   2.2Gi   426Gi   353Mi   515Gi   937Gi
2022-03-11 02:25:10,966 ... Mem:    944Gi   2.2Gi   426Gi   516Mi   516Gi   937Gi
2022-03-11 02:25:11,980 ... Mem:    944Gi   2.2Gi   426Gi   682Mi   516Gi   936Gi
2022-03-11 02:25:12,992 ... Mem:    944Gi   2.2Gi   426Gi   848Mi   516Gi   936Gi
2022-03-11 02:25:14,005 ... Mem:    944Gi   2.2Gi   425Gi   1.0Gi   516Gi   936Gi
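
To track memory usage from such a log, the Mem: lines can be parsed with a few lines of Python. This is a sketch, not an expam utility: parse_mem_line and size_to_bytes are hypothetical helpers, the column names are taken from the header line of the log above, and the Ki/Mi/Gi suffixes are assumed to be binary units (powers of 1024).

```python
from typing import Dict, Optional

# Assumed binary unit suffixes, as seen in the log excerpt above.
UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}
COLUMNS = ("total", "used", "free", "shared", "buff/cache", "available")


def size_to_bytes(text: str) -> float:
    """Convert a size such as '1.6Gi' or '38Mi' to bytes."""
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * factor
    return float(text)  # assumption: a bare number is already in bytes


def parse_mem_line(line: str) -> Optional[Dict[str, float]]:
    """Return the memory columns of a 'Mem:' log line, or None for other lines."""
    _, found, rest = line.partition("Mem:")
    if not found:
        return None  # e.g. the header line, which has no 'Mem:' marker
    values = (size_to_bytes(field) for field in rest.split())
    return dict(zip(COLUMNS, values))
```

Applying parse_mem_line to each line of the log file and taking the maximum of the "used" entries gives the peak usage recorded during the run.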