Quickstart Tutorial
We will be using a pre-built database to classify some simulated metagenomic reads and obtain both phylogenetic and taxonomic output.
Get the database
Download one of the following compressed databases (see Tutorial 1 for instructions on building a database).
Unzip the compressed file. For instance, if you downloaded
Test Database
(test.tar.gz
) into your home directory~
, run
$ tar -xvzf test.tar.gz
The database will now be located at ~/test
, which is the directory you should pass to any -db
flag as input to expam.
Download these metagenomic reads (simulated reads from a genome used to build
Test Database
).
Unzip these metagenomic reads into a new folder, which we will call
reads
. Assuming you have downloaded and moved the above reads into your home directory~
, run
$ mkdir reads
$ mv reads.tar.gz reads
$ cd reads
$ tar -xvzf reads.tar.gz
There should be two files:
~/reads/GCF_000005845.2_ASM584v2_genomic.fna.gz_1.fa
,
~/reads/GCF_000005845.2_ASM584v2_genomic.fna.gz_2.fa
.
These are paired read, fasta files.
Phylogenetic classification
We will now classify these reads using the database you downloaded. We will save the results to an output folder located at
~/my_run/
.Although expam has the capacity to convert phylogenetic classifications into taxonomy, the default behaviour is only to provide phylogenetic results. The next section details how to obtain a taxonomic summary.
expam classify
will create the following folder structure, for a classification run using the flagexpam classify ... --out my_run
:
Folder/File
Name
Description
Columns
my_run/phy/
Phylogenetic output
Phylogenetic output parent folder
N/A
my_run/phy/classified.csv
Classifications summary file
Raw classification counts at phylogenetic nodes for each sample.
One column for each input sample.
my_run/phy/split.csv
Splits summary file
Split classification counts at each phylogenetic node
One column for each input sample.
my_run/phy/[SAMPLE].csv
Sample summary file
Overview of phylogenetic classification results for a particular sample
Cumulative Classified Percentage, Cumulative Classified Count, Raw Classified Count, Cumulative Split Percentage, Cumulative Split Count, Raw Split Count
my_run/phy/raw
Phylogenetic read-wise output
Phylogenetic read-wise output parent folder
N/A
my_run/phy/raw/[SAMPLE].csv
Phylogenetic read-wise output
Phylogenetic read-wise output for a particular sample
Classification Code, Read ID, Assigned Phylogenetic Node, Total Read Length, K-mer Distribution.
When
expam classify
is run, thephy
folders are populated with results.Run the
expam classify
command as follows (replacing~/test
with where you decompressed the database from Step 1 if necessary):
$ expam classify -db ~/test -d ~/reads --paired --out ~/my_run
Clearing old log files...
Loading the map and phylogeny.
Preparing shared memory allocations...
Loading database keys...
Loading database values...
* Initialising node pool...
* Checking for polytomies...
Polytomy (degree=3) detected! Resolving...
* Finalising index...
Loading reads from reads/GCF_000005845.2_ASM584v2_genomic.fna.gz_2.fa, reads/GCF_000005845.2_ASM584v2_genomic.fna.gz_1.fa...
The ETE3 package is not installed. Either install this module, or use the
--itol flag if you wish to use our native iTOL integration.
Skipping plotting of my_run/phylotree_classified.pdf...
Skipping plotting of my_run/phy/split.csv - no samples with counts in this matrix.
Note
Note that expam tried to plot the results on a phylotree, but since we did not have the ete3 module installed, it simply skipped plotting the results. This is the expected behaviour to let you know expam was not able to produce a graphical picture for your results.
It is also possible to visualise the results with iTOL instead of the default ete3 module. To use this
capacity, simply supply the --itol
flag with the expam classify
command and expam will
output files that serve as input to the iTOL web portal. See iTOL integration.
- The phylogenetic classifications will be located at
~/my_run/phy
, and will contain four files: ~/my_run/phy/GCF_000005845.2_ASM584v2_genomic.gz_1.csv
- sample summary file,
Node
Cumulative Classified Percentage
Cumulative Classified Count
Raw Classified Count
Cumulative Split Percentage
Cumulative Split Count
Raw Split Count
unclassified
0.0%
0
0
0.0%
0
0
p1
100.0%
1000
3
0.0%
0
0
p2
99.7%
997
232
0.0%
0
0
p5
76.5%
765
0
0.0%
0
0
GCF_000005845.2_ASM584v2_genomic
76.5%
765
765
0.0%
0
0
~/my_run/phy/classified.csv
- classified summary file,
Node
GCF_000005845.2_ASM584v2_genomic.gz_1
unclassified
0
p1
3
p2
232
GCF_000005845.2_ASM584v2_genomic
765
~/my_run/phy/split.csv
- split summary file,
Node
GCF_000005845.2_ASM584v2_genomic.gz_1
p1
0
p2
0
GCF_000005845.2_ASM584v2_genomic
0
~/my_run/phy/raw
- raw read-wise classifications. There will be a single raw read-wise output file,~/my_run/phy/raw/GCF_000005845.2_ASM584v2_genomic.gz_1.csv
.
C R6024166953890296505 p2 302 p2:240 C R5637238631728068726 p10 302 p10:33 p2:174 p10:33 C R4776396741200842676 p10 302 p10:240 C R5978962780799406918 p10 302 p10:240 C R5054328572910484091 p2 302 p2:240 C R3752077777745312170 p2 302 p2:240
- The phylogenetic classifications will be located at
The sample summary file is a comma-separated document where the first element of each row is a phylogenetic node/clade, and the corresponding values contain details of the raw and cumulative classifications and splits at this particular node.
The classified summary file is a comma-separated matrix where each row is a phylogenetic clade, each column is an input sample, and the cell value is the raw counts at this clade. The split summary file is an analogous file that contains the raw split count at any given clade. These two files are formatted such that they will always have the same column and row indices, and in the same order.
The raw read-wise output is a sub-directory containing one output file for each input sample, outlining read-wise output in kraken format.
A more comprehensive overview is given this tutorial.
Convert to taxonomy
First run
expam download_taxonomy
to download the taxonomy for all sequences in the database. This will require an internet connection.
$ expam download_taxonomy -db ~/test
Posting 6 UIDs to NCBI Entrez nuccore.
Received 6 response(s) for ESummary TaxID request!
Posting 6 UIDs to NCBI Entrez taxonomy.
Received 6 response(s) for EFetch Taxon request!
Taxonomic lineages written to ~/test/phylogeny/taxid_lineage.csv!
Taxonomic ranks written to ~/test/phylogeny/taxa_rank.csv!
The taxonomic results folder has a similar format to the corresponding phylogenetic output. Again using the example with output named
my_run
, expam will create the following folder structure.
Folder/File |
Name |
Description |
Columns |
---|---|---|---|
my_run/tax/ |
Taxonomic output |
Taxonomic output parent folder |
N/A |
my_run/tax/[SAMPLE].csv |
Taxonomic sample summary |
Overview of taxonomic classification results for a particular sample |
Cumulative Classified Percentage, Cumulative Classified Count, Raw Classified Count, Cumulative Split Percentage, Cumulative Split Count, Raw Split Count, Rank, Scientific Name |
my_run/tax/raw |
Taxonomy read-wise output |
Taxonomic read-wise output parent folder |
N/A |
my_run/tax/raw/[SAMPLE].csv |
Taxonomic read-wise output |
Taxonomic read-wise output for a particular sample |
Classification Code, Read ID, Assigned Taxonomic ID, Read Length |
We saved our previous classification results to
~/my_run
. This is the directory we pass toexpam to_taxonomy
to convert phylogenetic classifications to taxonomy.
$ expam to_taxonomy -db ~/test --out ~/my_run
* Initialising node pool...
* Checking for polytomies...
Polytomy (degree=3) detected! Resolving...
* Finalising index...
There will now be taxonomic output files located in
~/my_run/tax/
, analogous to each of the files present in the phylogenetic output, with the exception ofclassified.tsv
andsplit.tsv
- only the sample summaries and raw read-wise output are converted.~/my_run/tax/GCF_000005845.2_ASM584v2_genomic.gz_1.csv
- taxonomic summary file
Node |
Cumulative Classified Percentage |
Cumulative Classified Count |
Raw Classified Count |
Cumulative Split Percentage |
Cumulative Split Count |
Raw Split Count |
Rank |
Scientific Name |
---|---|---|---|---|---|---|---|---|
unclassified |
0.0% |
0 |
0 |
0.0% |
0 |
0 |
0 |
0 |
1 |
100.0% |
1000 |
0 |
0.0% |
0 |
0 |
root |
|
131567 |
100.0% |
1000 |
0 |
0.0% |
0 |
0 |
top |
cellular organisms |
2 |
100.0% |
1000 |
235 |
0.0% |
0 |
0 |
superkingdom |
cellular organisms Bacteria |
1224 |
76.5% |
765 |
0 |
0.0% |
0 |
0 |
phylum |
cellular organisms Bacteria Proteobacteria |
1236 |
76.5% |
765 |
0 |
0.0% |
0 |
0 |
class |
cellular organisms Bacteria Proteobacteria Gammaproteobacteria |
~/my_run/tax/raw/GCF_000005845.2_ASM584v2_genomic.gz_1.csv
- taxonomic read-wise output. The second column is the read header, the third is the assigned taxid, and the fourth is the length of the read. Observe length of 302 for paired-end 150bp reads (reads are concatenated with ‘N’s).
C R6024166953890296505 2 302
C R5637238631728068726 511145 302
C R4776396741200842676 511145 302
C R5978962780799406918 511145 302
C R5054328572910484091 2 302
C R3752077777745312170 2 302
C R3409307205017145485 511145 302
C R6600509248827337399 2 302
C R9030130456493509712 511145 302
The complete comprehensive overview is given this tutorial.