Command-line interface

The command-line interface (CLI) of Woltka provides several commands:

Main workflow:

  • classify: Complete classification workflow that analyzes sequence alignments based on a classification system and generates profiles.

Profile utilities:

  • collapse: Collapse a profile by feature mapping and/or hierarchy.
  • normalize: Normalize a profile to fractions and/or by feature sizes.
  • filter: Filter a profile by per-sample abundance.
  • merge: Merge multiple profiles into one profile.
  • coverage: Calculate per-sample coverage of feature groups.

Classify

Basic

Option Description
--input, -i (required) Path to input alignment file or directory of alignment files. Enter "-" for stdin.
--output, -o (required) Path to output profile file or directory of profile files.

Input files

Option Description
--format, -f Format of read alignments. If not specified, the program will automatically infer the format from file content.
--filext, -e Input filename extension following sample ID.
--samples, -s Sample IDs to include in the analysis. Can be a comma-separated string or path to a list file. Also defines the order of samples in the output.
--demux/--no-demux Demultiplex alignment by first underscore in query identifier.
--exclude, -x Subject IDs to exclude while parsing alignments. Can be a comma-separated string or path to a list file. If an alignment is hit, the entire query (and its paired mate, if any) will be dropped.
--trim-sub Trim subject IDs at the last occurrence of the given delimiter. Accepts the default delimiter "_" or a custom value.
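
To illustrate the two ID-handling options above (--demux and --trim-sub), here is a minimal Python sketch of splitting a query ID at the first underscore and trimming a subject ID at the last delimiter. The function names `demux_query` and `trim_subject` are hypothetical and chosen for this example; this is not Woltka's own code.

```python
def demux_query(query_id):
    """Split a query ID at the first underscore into (sample, read)."""
    sample, sep, read = query_id.partition('_')
    # Without an underscore there is no sample prefix to extract.
    return (sample, read) if sep else (None, query_id)


def trim_subject(subject_id, delim='_'):
    """Drop the last delimiter and everything after it."""
    head, sep, _ = subject_id.rpartition(delim)
    # IDs that do not contain the delimiter are returned unchanged.
    return head if sep else subject_id
```

Under these assumptions, `demux_query('S01_read_42')` gives `('S01', 'read_42')`, and `trim_subject('G000123_5')` gives `'G000123'`.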

Hierarchies

Option Description
--nodes Hierarchies defined by NCBI nodes.dmp or compatible formats.
--newick Hierarchies defined by a tree in Newick format.
--lineage Lineage strings. Can accept Greengenes-style rank prefix.
--columns Table of classification units per rank (column).
--map, -m Mapping of lower classification units to higher ones.
--map-as-rank/--map-no-rank Extract rank name from mapping filename. On by default when classifying with only mapping files.
--names, -n Names of classification units as defined by NCBI names.dmp or a plain map.

Assignment

Option Description
--rank, -r Classify sequences at this rank. Enter "none" to directly report subjects; enter "free" for free-rank classification. Can specify multiple comma-delimited ranks, and one profile will be generated for each rank. If omitted, the program will do "free" if a classification system is provided or "none" if not.
--uniq One sequence can only be assigned to one classification unit, or remain unassigned if there is ambiguity. Otherwise, all candidate units are reported and their counts are normalized.
--major In given-rank classification, use majority rule at this percentage threshold to determine assignment when there are multiple candidates. Range: [51, 99]. Overrides "above" and "uniq".
--above In given-rank classification, allow assigning a sequence to a higher rank if it cannot be assigned to the current rank. Overrides "uniq".
--subok In free-rank classification, allow assigning a sequence to its direct subject, if applicable, before going up in hierarchy.
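
The interplay of --uniq and --major for a sequence with several candidate units can be sketched as follows. This is an illustration under stated assumptions, not Woltka's implementation: the function `assign` is hypothetical, candidates are given as one unit per alignment hit, the majority share is computed over hits, and "normalized" is taken to mean that one count is shared among units in proportion to their hits.

```python
from collections import Counter


def assign(candidates, uniq=False, major=None):
    """Resolve one sequence's candidate units into {unit: count}.

    candidates: one unit per alignment hit (repeats allowed).
    major: percentage threshold; takes precedence over uniq.
    """
    if not candidates:
        return {}
    tally = Counter(candidates)
    if len(tally) == 1:
        return {candidates[0]: 1.0}  # no ambiguity
    if major is not None:
        unit, hits = tally.most_common(1)[0]
        # Assumption: the top unit wins if its share of hits meets the threshold.
        return {unit: 1.0} if hits * 100 >= major * len(candidates) else {}
    if uniq:
        return {}  # ambiguous, so the sequence stays unassigned
    # Otherwise report every candidate, sharing one count proportionally.
    return {unit: hits / len(candidates) for unit, hits in tally.items()}
```

For example, with hits `['A', 'A', 'B', 'A']`, a threshold of 70 assigns the sequence to `A` (75% of hits), while a threshold of 80 leaves it unassigned.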

Gene matching

Option Description
--coords, -c Reference gene coordinates on genomes.
--overlap Read/gene overlapping percentage threshold. Default: 80.
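
As a rough picture of what an overlap threshold means, the sketch below computes the percentage of a read that falls inside a gene and compares it with the threshold. It assumes 1-based, inclusive coordinates and that the percentage is relative to the read length; both are assumptions made for this example, and the function names are hypothetical.

```python
def overlap_percent(read_start, read_end, gene_start, gene_end):
    """Percentage of a read covered by a gene (1-based, inclusive)."""
    shared = min(read_end, gene_end) - max(read_start, gene_start) + 1
    read_len = read_end - read_start + 1
    # A negative value means the two intervals do not overlap at all.
    return max(shared, 0) * 100 / read_len


def matches(read, gene, threshold=80):
    """True if the read overlaps the gene by at least `threshold` percent."""
    return overlap_percent(*read, *gene) >= threshold
```

A 150 bp read at positions 101 to 250 shares 120 bp with a gene at positions 131 to 900, which is exactly 80% and therefore passes the default threshold in this sketch.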

Stratification

Option Description
--stratify, -t Directory of read-to-feature maps for stratification. One file per sample.

Normalization

Option Description
--sizes, -z Divide counts by subject sizes. Can provide a subject-to-size mapping file, or type "." to calculate sizes from gene coordinates (which are provided by --coords).
--frac Divide counts by total count of each sample (i.e., fractions).
--scale Scale counts by this factor. Accepts "k", "M" suffixes.
--digits Round counts to this number of digits after the decimal point.
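
The --scale and --digits options can be pictured with a small sketch that parses a factor with a "k" or "M" suffix, multiplies values by it, and rounds the result. The helper names `parse_scale` and `scale_and_round` are hypothetical; the actual parsing rules in Woltka may differ.

```python
def parse_scale(text):
    """Turn a factor such as "1000", "10k" or "1M" into a number."""
    multipliers = {'k': 1_000, 'M': 1_000_000}
    if text and text[-1] in multipliers:
        return float(text[:-1]) * multipliers[text[-1]]
    return float(text)


def scale_and_round(values, scale='1', digits=None):
    """Multiply values by the scale factor, then optionally round them."""
    factor = parse_scale(scale)
    scaled = [value * factor for value in values]
    if digits is None:
        return scaled
    return [round(value, digits) for value in scaled]
```

Combining fractions with a scale of "1M" in this way would give values comparable to counts per million.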

Output table

Option Description
--to-biom/--to-tsv Force output profile format (BIOM or TSV). If omitted, format defaults to BIOM if there are multiple ranks, or based on output filename extension (.biom for BIOM, otherwise TSV) if there is only one rank.
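
The format rule described for --to-biom/--to-tsv can be written out as a short decision function. This is a sketch of the rule as stated above; the function `choose_format` and its parameters are hypothetical names.

```python
def choose_format(output_path, n_ranks, force=None):
    """Return 'biom' or 'tsv' for the output profile.

    force: 'biom' or 'tsv' when the format is forced, otherwise None.
    """
    if force is not None:
        return force
    if n_ranks > 1:
        return 'biom'  # multiple ranks default to BIOM
    # Single rank: decide by the output filename extension.
    return 'biom' if output_path.endswith('.biom') else 'tsv'
```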
--unassigned Report unassigned sequences (will be marked as "Unassigned").
--name-as-id Replace feature IDs with names. Otherwise append names to table as a metadata column.
--add-rank Append feature ranks to table as a metadata column.
--add-lineage Append lineage strings to table as a metadata column.

Output mapping

Option Description
--outmap, -u Write read-to-feature maps to this directory.
--zipmap Compress read-to-feature maps using this algorithm. Options: none, gz (default), bz2, xz.

Output coverage

Option Description
--outcov Write subject coverage maps to this directory.
--cov-fmt Format of subject coverage coordinates. Options: bed (BED-like, 0-based, exclusive end, equivalent to 0e) (default), gff (GFF-like, 1-based, inclusive end, equivalent to 1i), 0e, 1e, 0i, 1i.
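
The six --cov-fmt codes differ only in the base of the start coordinate (0 or 1) and in whether the end is exclusive ("e") or inclusive ("i"). The sketch below converts an interval between these conventions. It assumes the input interval is 0-based with an exclusive end; the function `convert_interval` is a hypothetical helper.

```python
def convert_interval(start, end, fmt='bed'):
    """Convert a 0-based, exclusive-end interval to another convention."""
    fmt = {'bed': '0e', 'gff': '1i'}.get(fmt, fmt)
    if fmt not in ('0e', '1e', '0i', '1i'):
        raise ValueError(f'Unknown format: {fmt}')
    base = int(fmt[0])           # 0-based or 1-based positions
    inclusive = fmt[1] == 'i'    # inclusive or exclusive end
    return start + base, end + base - (1 if inclusive else 0)
```

For instance, the first ten positions of a subject are `(0, 10)` in "bed" and `(1, 10)` in "gff".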

Performance

Option Description
--chunk Number of unique queries to read and parse in each chunk of an alignment file. Default: 1,024 for plain or range mapping, or 2 ** 20 = 1,048,576 for ordinal mapping. The latter cannot exceed 2 ** 22.
--cache Number of recent classification results to cache for faster subsequent classifications. Default: 1,024.
--no-exe Disable calling external programs (gzip, bzip2 and xz) for decompression. Otherwise, Woltka will use them if available for faster processing, or fall back to Python if not.
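
To show what reading an alignment "in chunks of unique queries" can look like, the following sketch groups query-subject pairs by query and yields them in batches of a fixed number of queries. It assumes that all rows of one query are adjacent in the input; `read_chunks` is a hypothetical helper, not Woltka's parser.

```python
from itertools import groupby


def read_chunks(pairs, chunk=1024):
    """Yield lists of (query, subjects) with at most `chunk` queries each.

    pairs: iterable of (query, subject), with each query's rows adjacent.
    """
    batch = []
    for query, rows in groupby(pairs, key=lambda pair: pair[0]):
        batch.append((query, [subject for _, subject in rows]))
        if len(batch) == chunk:
            yield batch
            batch = []
    if batch:
        yield batch  # remaining queries that did not fill a full chunk
```

A larger chunk size means fewer batches at the cost of more memory per batch.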

Collapse

Collapse a profile based on feature mapping (supports many-to-many mapping) and/or hierarchy.

Option Description
--input, -i (required) Path to input profile.
--output, -o (required) Path to output profile.
--map, -m Path to mapping of source features to target features.
--divide, -d Count each target feature as 1 / k (k is the number of targets mapped to a source). Otherwise, count as one.
--field, -f Features are stratified (strata delimited by "|"). For example, if features are like "species|gene", one can use -f 1 to collapse "species" or -f 2 to collapse "gene".
--nested, -e Features are nested (each field is a child of the previous field). For example, "A_1" represents "1" of "A", and the entire feature is equivalent to stratified feature "A|A_1". This parameter overrides the "|"-delimited strata.
--sep, -s Field separator for stratified features (default: "|") or nested features (default: "_").
--names, -n Path to mapping of target features to names. The names will be appended to the collapsed profile as a metadata column.
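
The effect of --divide on a many-to-many mapping can be illustrated with a small sketch for a single sample. It assumes that source features absent from the mapping are dropped; the function `collapse` and its data layout are hypothetical and only meant to show the arithmetic.

```python
from collections import defaultdict


def collapse(profile, mapping, divide=False):
    """Collapse {source: count} into {target: count} for one sample.

    mapping: {source: [targets]}; unmapped sources are dropped (assumption).
    """
    result = defaultdict(float)
    for source, count in profile.items():
        targets = mapping.get(source, [])
        for target in targets:
            # With divide, each of the k targets receives count / k.
            result[target] += count / len(targets) if divide else count
    return dict(result)
```

With a source of count 6 mapped to two targets, each target receives 6 without --divide and 3 with it.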

Normalize

Normalize a profile to fractions and/or by feature sizes.

Option Description
--input, -i (required) Path to input profile.
--output, -o (required) Path to output profile.
--sizes, -z Path to mapping of feature sizes, by which values will be divided. If omitted, will divide values by sum per sample.
--scale, -s Scale values by this factor. Accepts "k", "M" suffixes.
--digits, -d Round values to this number of digits after the decimal point. If omitted, will keep decimal precision of input profile.
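
The two modes of normalization (by feature sizes when --sizes is given, by the per-sample sum otherwise) can be sketched for one sample as follows. The function `normalize` and its dict-based layout are hypothetical.

```python
def normalize(counts, sizes=None):
    """Normalize one sample's {feature: value} dict.

    sizes: {feature: size}; if omitted, divide by the sample's total.
    """
    if sizes is None:
        total = sum(counts.values())
        if not total:
            return dict(counts)  # nothing to divide by in an empty sample
        return {feature: value / total for feature, value in counts.items()}
    return {feature: value / sizes[feature] for feature, value in counts.items()}
```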

Filter

Filter a profile by per-sample abundance.

Option Description
--input, -i (required) Path to input profile.
--output, -o (required) Path to output profile.
--min-count, -c Per-sample minimum count threshold (>=1).
--min-percent, -p Per-sample minimum percentage threshold (<100).
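
A per-sample abundance filter along the lines of --min-count and --min-percent might look like the sketch below. It assumes that a feature falling below the threshold is removed from that sample only; `filter_sample` is a hypothetical name.

```python
def filter_sample(counts, min_count=None, min_percent=None):
    """Keep features of one sample that reach the given threshold(s)."""
    total = sum(counts.values())
    kept = {}
    for feature, count in counts.items():
        if min_count is not None and count < min_count:
            continue
        # Compare count / total with min_percent / 100 without dividing.
        if min_percent is not None and count * 100 < min_percent * total:
            continue
        kept[feature] = count
    return kept
```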

Merge

Merge multiple profiles into one profile.

Option Description
--input, -i (required) Path to input profiles or directories containing profiles. Can accept multiple paths.
--output, -o (required) Path to output profile.

Coverage

Calculate per-sample coverage of feature groups in a profile.

Option Description
--input, -i (required) Path to input profile.
--map, -m (required) Path to mapping of source features to target features.
--output, -o (required) Path to output profile.
--threshold, -t Convert coverage to presence (1) / absence (0) data by this percentage threshold.
--count, -c Record numbers of covered features instead of percentages (overrides threshold).
--names, -n Path to mapping of feature groups to names. The names will be appended to the coverage table as a metadata column.
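
The relationship between percentage coverage, --threshold and --count can be sketched for one sample as follows. The sketch assumes that each feature group lists its member features and that no group is empty; the function `coverage` and its arguments are hypothetical.

```python
def coverage(present, groups, threshold=None, count=False):
    """Coverage of each feature group in one sample.

    present: set of features observed in the sample.
    groups: {group: [member features]}, each with at least one member.
    """
    result = {}
    for group, members in groups.items():
        covered = sum(1 for member in members if member in present)
        if count:
            result[group] = covered  # count overrides threshold
            continue
        percent = covered * 100 / len(members)
        if threshold is None:
            result[group] = percent
        else:
            result[group] = 1 if percent >= threshold else 0
    return result
```

A group with two of its four members present has 50% coverage, which becomes 0 under a threshold of 80, or 2 when counts are requested.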

Tools

A submenu containing all commands except classify. It exists only for backward compatibility; it is deprecated and will be removed in the next release.