Many docs updates
Updated known-issues, removed references to 'hash' as appropriate, and
clarified script options.
mr-c committed Apr 1, 2014
1 parent b2d30e9 commit 686880d
Showing 21 changed files with 100 additions and 73 deletions.
18 changes: 9 additions & 9 deletions doc/choosing-hash-sizes.txt → doc/choosing-table-sizes.txt
@@ -1,11 +1,11 @@
=============================
Choosing hash sizes for khmer
=============================
==============================
Choosing table sizes for khmer
==============================

If you look at the documentation for the scripts (:doc:`scripts`) you'll
see two mysterious parameters -- ``-N`` and ``-x``, or, more verbosely,
``-n_hashes`` and ``--hashsize``. What are these, and how do you
specify them?
see two mysterious parameters -- :option:`-N` and :option:`-x`, or, more
verbosely, :option:`-n_tables` and :option:`--tablesize`. What are these, and
how do you specify them?

The really short version
========================
@@ -27,7 +27,7 @@ structure in khmer, which is basically N big hash tables of size x.
The **product** of the number of hash tables and the size of the hash
tables specifies the total amount of memory used.

This hash table is used to track k-mers. If it is too small, khmer
This table is used to track k-mers. If it is too small, khmer
will fail in various ways (and should complain), but there is no harm
in making it too large. So, **the absolute safest thing to do is to
specify as much memory as is available**. Most scripts will inform
@@ -48,8 +48,8 @@ which multiplies out to 128 Gbits of RAM, or 16 Gbytes.

Life is a bit more complicated than this, however, because some scripts --
load-into-counting and load-graph -- keep ancillary information that will
consume memory beyond this hash data structure. So if you run out of
memory, decrease the hash table size.
consume memory beyond this table data structure. So if you run out of
memory, decrease the table size.

Also see the rules of thumb, below.

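The RAM figure above is just the product of the two parameters. A quick
check of the arithmetic (a minimal sketch; the values ``-N 4`` and ``-x
32e9`` are assumed here only because their product matches the 128 Gbit
figure, and one bit per entry is assumed, as in a presence table)::

    n_tables = 4                        # -N, number of tables (assumed value)
    table_size = 32e9                   # -x, entries per table (assumed value)
    total_bits = n_tables * table_size  # one bit per entry (presence table)
    print(total_bits)                   # 128000000000.0 bits == 128 Gbits
    print(total_bits / 8 / 1e9)         # 16.0 == 16 Gbytes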
4 changes: 2 additions & 2 deletions doc/galaxy.txt
@@ -48,7 +48,7 @@ If not then you may need to `set their datatype manually <https://wiki.galaxypro
#. After selecting the input files, specify whether they are paired-interleaved
   or not.

#. Specify the sample type or show the advanced parameters to set the hashsize
yourself. Consult :doc:`choosing-hash-sizes` for assistance.
#. Specify the sample type or show the advanced parameters to set the tablesize
yourself. Consult :doc:`choosing-table-sizes` for assistance.


16 changes: 8 additions & 8 deletions doc/guide.txt
@@ -74,10 +74,10 @@ Most scripts *output* fasta, and some mangle headers. Sorry. We're
working on outputting FASTQ for FASTQ input, and removing any header
mangling.

Picking hash table sizes and k parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Picking k-mer table sizes and k parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For hash table sizes, read :doc:`choosing-hash-sizes`
For k-mer table sizes, read :doc:`choosing-table-sizes`

For k-mer sizes, we recommend k=20 for digital normalization and k=32
for partitioning; then assemble with a variety of k parameters.
@@ -197,16 +197,16 @@ you should load in paired end reads, or longer reads, first.
Iterative and independent normalization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can use --loadhash and --savehash to do iterative normalizations on
multiple files in multiple steps. For example, break ::
You can use :option:`--loadtable` and :option:`--savetable` to do iterative
normalizations on multiple files in multiple steps. For example, break ::

normalize-by-median.py [ ... ] file1.fa file2.fa file3.fa

into multiple steps like so::

normalize-by-median.py [ ... ] --savehash file1.kh file1.fa
normalize-by-median.py [ ... ] --loadhash file1.kh --savehash file2.kh file2.fa
normalize-by-median.py [ ... ] --loadhash file2.kh --savehash file3.kh file3.fa
normalize-by-median.py [ ... ] --savetable file1.kh file1.fa
normalize-by-median.py [ ... ] --loadtable file1.kh --savetable file2.kh file2.fa
normalize-by-median.py [ ... ] --loadtable file2.kh --savetable file3.kh file3.fa

The results should be identical!
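One way to spot-check that claim (a sketch only: it assumes the one-shot and
iterative runs were made in separate ``oneshot/`` and ``iterative/``
directories on copies of the same inputs, and relies on the default
``<input>.keep`` output naming)::

    import filecmp

    # byte-for-byte comparison of the final outputs of the two runs
    assert filecmp.cmp('oneshot/file3.fa.keep', 'iterative/file3.fa.keep',
                       shallow=False)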

10 changes: 5 additions & 5 deletions doc/index.txt
@@ -6,9 +6,10 @@
khmer -- k-mer counting & filtering FTW
=======================================

:Authors: Michael R. Crusoe, Greg Edvenson, Jordan Fish, Adina Howe,
Eric McDonald, Joshua Nahum, Kaben Nanlohy, Jason Pell, Jared Simpson,
Camille Scott, Qingpeng Zhang, and C. Titus Brown
:Authors: Michael R. Crusoe, Greg Edvenson, Jordan Fish, Adina Howe,
Luiz Irber, Eric McDonald, Joshua Nahum, Kaben Nanlohy, Humberto
Ortiz-Zuazaga, Jason Pell, Jared Simpson, Camille Scott, Ramakrishnan
Rajaram Srinivasan, Qingpeng Zhang, and C. Titus Brown

:Contact: [email protected]
:License: BSD
@@ -47,14 +48,13 @@ Contents:
guide
scripts
blog-posts
choosing-hash-sizes
choosing-table-sizes
partitioning-big-data
design
details
development
galaxy
known-issues
sequence-access-patterns
release
crazy-ideas
contributors
2 changes: 1 addition & 1 deletion doc/introduction.txt
@@ -77,7 +77,7 @@ data sets.
The second most important consideration is memory usage. The
effectiveness of all of the Bloom filter-based functions (which is
everything interesting in khmer!) depends critically on having enough
memory to do a good job. See :doc:`choosing-hash-sizes` for more
memory to do a good job. See :doc:`choosing-table-sizes` for more
information.

Copyright and license
13 changes: 9 additions & 4 deletions doc/known-issues.txt
@@ -14,13 +14,18 @@ https://github.com/ged-lab/khmer/issues/249

If your hashfile gets truncated, perhaps from a full filesystem, then our
tools currently will get stuck. This is being tracked in
https://github.com/ged-lab/khmer/issues/247
https://github.com/ged-lab/khmer/issues/247 and
https://github.com/ged-lab/khmer/issues/96 and
https://github.com/ged-lab/khmer/issues/246

Paired-end reads from Casava 1.8 currently require renaming for use in
normalize-by-median and filter-abund when used in paired mode. The
integration of a fix for this is being tracked in
https://github.com/ged-lab/khmer/issues/23

A user has reported a floating point exception when running
count-overlap.py. There is no workaround at this time. This issue is being
tracked in https://github.com/ged-lab/khmer/issues/282

annotate-partitions.py only outputs FASTA even if given a FASTQ file. This
issue is being tracked in https://github.com/ged-lab/khmer/issues/46

A user reported that abundance-dist-single.py fails with small files and many
threads. This issue is being tracked in
https://github.com/ged-lab/khmer/issues/75
6 changes: 3 additions & 3 deletions doc/scripts.txt
@@ -7,11 +7,11 @@ khmer's command-line interface
The simplest way to use khmer's functionality is through the command
line scripts, located in the scripts/ directory of the khmer
distribution. Below is our documentation for these scripts. Note
that all scripts can be given '-h' as an option, which will print out
that all scripts can be given :option:`-h`, which will print out
a list of arguments taken by that script.

Many scripts take '-x' and '-N' parameters, which drive khmer's memory usage.
These parameters depend on details of your data set; for more information
Many scripts take :option:`-x` and :option:`-N` parameters, which drive khmer's
memory usage. These parameters depend on details of your data set; for more information
on how to choose them, see :doc:`choosing-table-sizes`.

You can also override the default values of :option:`--ksize`/:option:`-k`,
2 changes: 1 addition & 1 deletion scripts/abundance-dist.py
@@ -30,7 +30,7 @@ def get_parser():
formatter_class=argparse.ArgumentDefaultsHelpFormatter)

parser.add_argument('input_counting_table_filename', help='The name of the'
' counting table file.')
' input k-mer counting table file.')
parser.add_argument('input_sequence_filename', help='The name of the input'
' FAST[AQ] sequence file.')
parser.add_argument('output_histogram_filename', help='The columns are: '
7 changes: 5 additions & 2 deletions scripts/annotate-partitions.py
@@ -47,8 +47,11 @@ def get_parser():

parser.add_argument('--ksize', '-k', type=int, default=DEFAULT_K,
help="k-mer size (default: %d)" % DEFAULT_K)
parser.add_argument('graphbase')
parser.add_argument('input_filenames', nargs='+')
parser.add_argument('graphbase', help='basename for input and output '
'files')
parser.add_argument('input_filenames', metavar='input_sequence_filename',
nargs='+', help='input FAST[AQ] sequences to '
'annotate.')
parser.add_argument('--version', action='version', version='%(prog)s '
+ khmer.__version__)
return parser
9 changes: 6 additions & 3 deletions scripts/count-median.py
@@ -42,9 +42,12 @@ def get_parser():
description='Count k-mers summary stats for sequences',
epilog=textwrap.dedent(epilog))

parser.add_argument('ctfile', help='input k-mer count table filename')
parser.add_argument('input', help='input FAST[AQ] sequence filename')
parser.add_argument('output', help='output summary filename')
parser.add_argument('ctfile', metavar='input_counting_table_filename',
help='input k-mer count table filename')
parser.add_argument('input', metavar='input_sequence_filename',
help='input FAST[AQ] sequence filename')
parser.add_argument('output', metavar='output_summary_filename',
help='output summary filename')
parser.add_argument('--version', action='version', version='%(prog)s '
+ khmer.__version__)
return parser
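Many of the argparse hunks in this commit follow the pattern above: a bare
positional argument gains a ``metavar`` and a ``help`` string so that the
script's ``--help`` output becomes self-describing. A standalone sketch of
the effect (plain argparse; not khmer code, and the prog name is only
illustrative)::

    import argparse

    parser = argparse.ArgumentParser(prog='count-median.py')
    parser.add_argument('ctfile', metavar='input_counting_table_filename',
                        help='input k-mer count table filename')
    parser.print_help()
    # usage: count-median.py [-h] input_counting_table_filename
    #
    # positional arguments:
    #   input_counting_table_filename
    #                         input k-mer count table filename

Without the ``metavar``, both the usage line and the arguments list would
show the bare destination name ``ctfile`` instead.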
15 changes: 9 additions & 6 deletions scripts/count-overlap.py
@@ -30,16 +30,19 @@

def get_parser():
epilog = """
An additional report will be written to ${report_filename}.curve containing
the increase of overlap k-mers as the number of sequences in the second
database increases.
An additional report will be written to ${output_report_filename}.curve
containing the increase of overlap k-mers as the number of sequences in the
second database increases.
"""
parser = build_hashbits_args(
descr='Count the overlap k-mers which are the k-mers appearing in two '
'sequence datasets.', epilog=textwrap.dedent(epilog))
parser.add_argument('ptfile', help="input k-mer presence table filename")
parser.add_argument('fafile', help="input sequence filename")
parser.add_argument('report_filename', help='output report filename')
parser.add_argument('ptfile', metavar='input_presence_table_filename',
help="input k-mer presence table filename")
parser.add_argument('fafile', metavar='input_sequence_filename',
help="input sequence filename")
parser.add_argument('report_filename', metavar='output_report_filename',
help='output report filename')

return parser

3 changes: 2 additions & 1 deletion scripts/do-partition.py
@@ -96,7 +96,8 @@ def get_parser():
default=True, action='store_false',
help='Keep individual subsets (default: False)')
parser.add_argument('graphbase', help="base name for output files")
parser.add_argument('input_filenames', nargs='+')
parser.add_argument('input_filenames', metavar='input_sequence_filename',
nargs='+', help='input FAST[AQ] sequence filenames')
return parser


5 changes: 3 additions & 2 deletions scripts/extract-partitions.py
@@ -59,8 +59,9 @@ def get_parser():
description="Separate sequences that are annotated with partitions "
"into grouped files.", epilog=textwrap.dedent(epilog),
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('prefix')
parser.add_argument('part_filenames', nargs='+')
parser.add_argument('prefix', metavar='output_filename_prefix')
parser.add_argument('part_filenames', metavar='input_partition_filename',
nargs='+')
parser.add_argument('--max-size', '-X', dest='max_size',
default=DEFAULT_MAX_SIZE, type=int,
help='Max group size (n sequences)')
7 changes: 4 additions & 3 deletions scripts/filter-abund-single.py
@@ -30,12 +30,12 @@

def get_parser():
epilog = """
Trimmed sequences will be placed in ${datafile}.abundfilt.
Trimmed sequences will be placed in ${input_sequence_filename}.abundfilt.
This script is constant memory.
To trim reads based on k-mer abundance across multiple files, use
:program:`load-into-counting` and :program:`filter-abund`.
:program:`load-into-counting.py` and :program:`filter-abund.py`.
Example::
@@ -51,7 +51,8 @@ def get_parser():
parser.add_argument('--savetable', metavar="filename", default='',
help="If present, the name of the file to save the "
"k-mer counting table to")
parser.add_argument('datafile', help="FAST[AQ] sequence file to trim")
parser.add_argument('datafile', metavar='input_sequence_filename',
help="FAST[AQ] sequence file to trim")

return parser

17 changes: 9 additions & 8 deletions scripts/filter-abund.py
@@ -28,9 +28,9 @@

def get_parser():
epilog = """
Trimmed sequences will be placed in ${input_filename}.abundfilt for each
input sequence file. If the input sequences are from RNAseq or metagenome
sequencing then :option:`--variable-coverage` should be used.
Trimmed sequences will be placed in ${input_sequence_filename}.abundfilt
for each input sequence file. If the input sequences are from RNAseq or
metagenome sequencing then :option:`--variable-coverage` should be used.
Example::
@@ -40,14 +40,15 @@ def get_parser():
parser = build_counting_args(
descr='Trim sequences at a minimum k-mer abundance.',
epilog=textwrap.dedent(epilog))
parser.add_argument('input_table')
parser.add_argument('input_filename', nargs='+')
parser.add_argument('input_table', metavar='input_presence_table_filename',
help='The input k-mer presence table filename')
parser.add_argument('input_filename', metavar='input_sequence_filename',
help='Input FAST[AQ] sequence filename', nargs='+')
add_threading_args(parser)
parser.add_argument('--cutoff', '-C', dest='cutoff',
default=DEFAULT_CUTOFF, type=int,
help="Trim at k-mers below this abundance.")

parser.add_argument('-V', '--variable-coverage', action='store_true',
parser.add_argument('--variable-coverage', '-V', action='store_true',
dest='variable_coverage', default=False,
help='Only trim low-abundance k-mers from sequences '
'that have high coverage.')
@@ -56,7 +57,7 @@ def get_parser():
' k-mer abundance.',
default=DEFAULT_NORMALIZE_LIMIT)
parser.add_argument('-o', '--out', dest='single_output_filename',
default='', metavar="optional output filename",
default='', metavar="optional_output_filename",
help='Output the trimmed sequences into a single file '
'with the given filename instead of creating a new '
'file for each input file.')
9 changes: 5 additions & 4 deletions scripts/filter-stoptags.py
@@ -37,10 +37,11 @@ def get_parser():
description="Trim sequences at stoptags.",
epilog=textwrap.dedent(epilog),
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('-k', default=DEFAULT_K, type=int, help='k-mer size',
dest='ksize')
parser.add_argument('stoptags_file')
parser.add_argument('input_filenames', nargs='+')
parser.add_argument('--ksize', '-k', default=DEFAULT_K, type=int,
help='k-mer size')
parser.add_argument('stoptags_file', metavar='input_stoptags_filename')
parser.add_argument('input_filenames', metavar='input_sequence_filename',
nargs='+')
parser.add_argument('--version', action='version', version='%(prog)s '
+ khmer.__version__)
return parser
9 changes: 5 additions & 4 deletions scripts/find-knots.py
@@ -47,9 +47,9 @@ def get_parser():
Load a k-mer presence table/tagset pair created by load-graph, and a set
of pmap files created by partition-graph. Go through each pmap file,
select the largest partition in each, and do the same kind of traversal as
in make-initial-stoptags from each of the waypoints in that partition; this
should identify all of the HCKs in that partition. These HCKs are output to
<graphbase>.stoptags after each pmap file.
in :program:`make-initial-stoptags.py` from each of the waypoints in that
partition; this should identify all of the HCKs in that partition. These
HCKs are output to <graphbase>.stoptags after each pmap file.
Parameter choice is reasonably important. See the pipeline in
:doc:`partitioning-big-data` for an example run.
@@ -70,7 +70,8 @@ def get_parser():
parser.add_argument('--min-tablesize', '-x', type=float,
default=DEFAULT_COUNTING_HT_SIZE, help='lower bound on'
' the size of the k-mer counting table(s)')
parser.add_argument('graphbase')
parser.add_argument('graphbase', help='Basename for the input and output '
'files.')
parser.add_argument('--version', action='version', version='%(prog)s '
+ khmer.__version__)
return parser
7 changes: 5 additions & 2 deletions scripts/load-graph.py
@@ -32,8 +32,11 @@ def get_parser():
parser.add_argument('--no-build-tagset', '-n', default=False,
action='store_true', dest='no_build_tagset',
help='Do NOT construct tagset while loading sequences')
parser.add_argument('output_filename')
parser.add_argument('input_filenames', nargs='+')
parser.add_argument('output_filename',
metavar='output_presence_table_filename', help='output'
' k-mer presence table filename.')
parser.add_argument('input_filenames', metavar='input_sequence_filename',
nargs='+', help='input FAST[AQ] sequence filename')
return parser


3 changes: 2 additions & 1 deletion scripts/make-initial-stoptags.py
@@ -59,7 +59,8 @@ def get_parser():
help='Set subset size (default 1e4 is prob ok)')
parser.add_argument('--stoptags', '-S', metavar='filename', default='',
help="Use stoptags in this file during partitioning")
parser.add_argument('graphbase')
parser.add_argument('graphbase', help='basename for input and output '
'filenames')
return parser


8 changes: 5 additions & 3 deletions scripts/merge-partitions.py
@@ -31,14 +31,16 @@ def get_parser():
Take the ${graphbase}.subset.#.pmap files and merge them all into a single
${graphbase}.pmap.merged file for :program:`annotate-partitions.py` to use.
"""
parser = argparse.ArgumentParser(description="Merge pmap files.",
epilog=textwrap.dedent(epilog))
parser = argparse.ArgumentParser(
description="Merge partition map '.pmap' files.",
epilog=textwrap.dedent(epilog))
parser.add_argument('--ksize', '-k', type=int, default=DEFAULT_K,
help="k-mer size (default: %d)" % DEFAULT_K)
parser.add_argument('--keep-subsets', dest='remove_subsets',
default=True, action='store_false',
help='Keep individual subsets (default: False)')
parser.add_argument('graphbase')
parser.add_argument('graphbase', help='basename for input and output '
'files')
parser.add_argument('--version', action='version', version='%(prog)s '
+ khmer.__version__)
return parser
