Many docs updates
Updated known-issues, removed references to 'hash' as appropriate, and
clarified script options.
mr-c committed Apr 1, 2014
1 parent b2d30e9 commit 686880d
Showing 21 changed files with 100 additions and 73 deletions.
18 changes: 9 additions & 9 deletions doc/choosing-hash-sizes.txt → doc/choosing-table-sizes.txt
@@ -1,11 +1,11 @@
=============================
Choosing hash sizes for khmer
=============================
==============================
Choosing table sizes for khmer
==============================

If you look at the documentation for the scripts (:doc:`scripts`) you'll
see two mysterious parameters -- ``-N`` and ``-x``, or, more verbosely,
``-n_hashes`` and ``--hashsize``. What are these, and how do you
specify them?
see two mysterious parameters -- :option:`-N` and :option:`-x`, or, more
verbosely, :option:`-n_tables` and :option:`--tablesize`. What are these, and
how do you specify them?

The really short version
========================
@@ -27,7 +27,7 @@ structure in khmer, which is basically N big hash tables of size x.
The **product** of the number of hash tables and the size of the hash
tables specifies the total amount of memory used.

This hash table is used to track k-mers. If it is too small, khmer
This table is used to track k-mers. If it is too small, khmer
will fail in various ways (and should complain), but there is no harm
in making it too large. So, **the absolute safest thing to do is to
specify as much memory as is available**. Most scripts will inform
@@ -48,8 +48,8 @@ which multiplies out to 128 Gbits of RAM, or 16 Gbytes.

Life is a bit more complicated than this, however, because some scripts --
load-into-counting and load-graph -- keep ancillary information that will
consume memory beyond this hash data structure. So if you run out of
memory, decrease the hash table size.
consume memory beyond this table data structure. So if you run out of
memory, decrease the table size.

Also see the rules of thumb, below.

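The RAM figure above is just the product of the two parameters. A quick
check of the arithmetic (a minimal sketch; the values ``-N 4`` and ``-x
32e9`` are assumed here only because their product matches the 128 Gbit
figure, and one bit per entry is assumed, as in a presence table)::

    n_tables = 4                        # -N, number of tables (assumed value)
    table_size = 32e9                   # -x, entries per table (assumed value)
    total_bits = n_tables * table_size  # one bit per entry (presence table)
    print(total_bits)                   # 128000000000.0 bits == 128 Gbits
    print(total_bits / 8 / 1e9)         # 16.0 == 16 Gbytes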
4 changes: 2 additions & 2 deletions doc/galaxy.txt
@@ -48,7 +48,7 @@ If not then you may need to `set their datatype manually <https://wiki.galaxypro
#. After selecting the input files, specify whether they are paired-interleaved
   or not.

#. Specify the sample type or show the advanced parameters to set the hashsize
yourself. Consult :doc:`choosing-hash-sizes` for assistance.
#. Specify the sample type or show the advanced parameters to set the tablesize
yourself. Consult :doc:`choosing-table-sizes` for assistance.


16 changes: 8 additions & 8 deletions doc/guide.txt
@@ -74,10 +74,10 @@ Most scripts *output* fasta, and some mangle headers. Sorry. We're
working on outputting FASTQ for FASTQ input, and removing any header
mangling.

Picking hash table sizes and k parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Picking k-mer table sizes and k parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For hash table sizes, read :doc:`choosing-hash-sizes`
For k-mer table sizes, read :doc:`choosing-table-sizes`

For k-mer sizes, we recommend k=20 for digital normalization and k=32
for partitioning; then assemble with a variety of k parameters.
@@ -197,16 +197,16 @@ you should load in paired end reads, or longer reads, first.
Iterative and independent normalization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can use --loadhash and --savehash to do iterative normalizations on
multiple files in multiple steps. For example, break ::
You can use :option:`--loadtable` and :option:`--savetable` to do iterative
normalizations on multiple files in multiple steps. For example, break ::

normalize-by-median.py [ ... ] file1.fa file2.fa file3.fa

into multiple steps like so::

normalize-by-median.py [ ... ] --savehash file1.kh file1.fa
normalize-by-median.py [ ... ] --loadhash file1.kh --savehash file2.kh file2.fa
normalize-by-median.py [ ... ] --loadhash file2.kh --savehash file3.kh file3.fa
normalize-by-median.py [ ... ] --savetable file1.kh file1.fa
normalize-by-median.py [ ... ] --loadtable file1.kh --savetable file2.kh file2.fa
normalize-by-median.py [ ... ] --loadtable file2.kh --savetable file3.kh file3.fa

The results should be identical!
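One way to spot-check that claim (a sketch only: it assumes the one-shot and
iterative runs were made in separate ``oneshot/`` and ``iterative/``
directories on copies of the same inputs, and relies on the default
``<input>.keep`` output naming)::

    import filecmp

    # byte-for-byte comparison of the final outputs of the two runs
    assert filecmp.cmp('oneshot/file3.fa.keep', 'iterative/file3.fa.keep',
                       shallow=False)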

10 changes: 5 additions & 5 deletions doc/index.txt
@@ -6,9 +6,10 @@
khmer -- k-mer counting & filtering FTW
=======================================

:Authors: Michael R. Crusoe, Greg Edvenson, Jordan Fish, Adina Howe,
Eric McDonald, Joshua Nahum, Kaben Nanlohy, Jason Pell, Jared Simpson,
Camille Scott, Qingpeng Zhang, and C. Titus Brown
:Authors: Michael R. Crusoe, Greg Edvenson, Jordan Fish, Adina Howe,
Luiz Irber, Eric McDonald, Joshua Nahum, Kaben Nanlohy, Humberto
Ortiz-Zuazaga, Jason Pell, Jared Simpson, Camille Scott, Ramakrishnan
Rajaram Srinivasan, Qingpeng Zhang, and C. Titus Brown

:Contact: [email protected]
:License: BSD
@@ -47,14 +48,13 @@ Contents:
guide
scripts
blog-posts
choosing-hash-sizes
choosing-table-sizes
partitioning-big-data
design
details
development
galaxy
known-issues
sequence-access-patterns
release
crazy-ideas
contributors
2 changes: 1 addition & 1 deletion doc/introduction.txt
@@ -77,7 +77,7 @@ data sets.
The second most important consideration is memory usage. The
effectiveness of all of the Bloom filter-based functions (which is
everything interesting in khmer!) depends critically on having enough
memory to do a good job. See :doc:`choosing-hash-sizes` for more
memory to do a good job. See :doc:`choosing-table-sizes` for more
information.

Copyright and license
13 changes: 9 additions & 4 deletions doc/known-issues.txt
@@ -14,13 +14,18 @@ https://github.com/ged-lab/khmer/issues/249

If your hashfile gets truncated, perhaps from a full filesystem, then our
tools currently will get stuck. This is being tracked in
https://github.com/ged-lab/khmer/issues/247
https://github.com/ged-lab/khmer/issues/247 and
https://github.com/ged-lab/khmer/issues/96 and
https://github.com/ged-lab/khmer/issues/246

Paired-end reads from Casava 1.8 currently require renaming for use in
normalize-by-median and filter-abund when used in paired mode. The
integration of a fix for this is being tracked in
https://github.com/ged-lab/khmer/issues/23

A user has reported a floating point exception when running
count-overlap.py. There is no workaround at this time. This issue is being
tracked in https://github.com/ged-lab/khmer/issues/282

annotate-partitions.py only outputs FASTA even if given a FASTQ file. This
issue is being tracked in https://github.com/ged-lab/khmer/issues/46

A user reported that abundance-dist-single.py fails with small files and many
threads. This issue is being tracked in
https://github.com/ged-lab/khmer/issues/75
6 changes: 3 additions & 3 deletions doc/scripts.txt
@@ -7,11 +7,11 @@ khmer's command-line interface
The simplest way to use khmer's functionality is through the command
line scripts, located in the scripts/ directory of the khmer
distribution. Below is our documentation for these scripts. Note
that all scripts can be given '-h' as an option, which will print out
that all scripts can be given :option:`-h`, which will print out
a list of arguments taken by that script.

Many scripts take '-x' and '-N' parameters, which drive khmer's memory usage.
These parameters depend on details of your data set; for more information
Many scripts take :option:`-x` and :option:`-N` parameters, which drive khmer's
memory usage. These parameters depend on details of your data set; for more information
on how to choose them, see :doc:`choosing-table-sizes`.

You can also override the default values of :option:`--ksize`/:option:`-k`,
2 changes: 1 addition & 1 deletion scripts/abundance-dist.py
@@ -30,7 +30,7 @@ def get_parser():
formatter_class=argparse.ArgumentDefaultsHelpFormatter)

parser.add_argument('input_counting_table_filename', help='The name of the'
' counting table file.')
' input k-mer counting table file.')
parser.add_argument('input_sequence_filename', help='The name of the input'
' FAST[AQ] sequence file.')
parser.add_argument('output_histogram_filename', help='The columns are: '
7 changes: 5 additions & 2 deletions scripts/annotate-partitions.py
@@ -47,8 +47,11 @@ def get_parser():

parser.add_argument('--ksize', '-k', type=int, default=DEFAULT_K,
help="k-mer size (default: %d)" % DEFAULT_K)
parser.add_argument('graphbase')
parser.add_argument('input_filenames', nargs='+')
parser.add_argument('graphbase', help='basename for input and output '
'files')
parser.add_argument('input_filenames', metavar='input_sequence_filename',
nargs='+', help='input FAST[AQ] sequences to '
'annotate.')
parser.add_argument('--version', action='version', version='%(prog)s '
+ khmer.__version__)
return parser
9 changes: 6 additions & 3 deletions scripts/count-median.py
@@ -42,9 +42,12 @@ def get_parser():
description='Count k-mers summary stats for sequences',
epilog=textwrap.dedent(epilog))

parser.add_argument('ctfile', help='input k-mer count table filename')
parser.add_argument('input', help='input FAST[AQ] sequence filename')
parser.add_argument('output', help='output summary filename')
parser.add_argument('ctfile', metavar='input_counting_table_filename',
help='input k-mer count table filename')
parser.add_argument('input', metavar='input_sequence_filename',
help='input FAST[AQ] sequence filename')
parser.add_argument('output', metavar='output_summary_filename',
help='output summary filename')
parser.add_argument('--version', action='version', version='%(prog)s '
+ khmer.__version__)
return parser
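Many of the argparse hunks in this commit follow the pattern above: a bare
positional argument gains a ``metavar`` and a ``help`` string so that the
script's ``--help`` output becomes self-describing. A standalone sketch of
the effect (plain argparse; not khmer code, and the prog name is only
illustrative)::

    import argparse

    parser = argparse.ArgumentParser(prog='count-median.py')
    parser.add_argument('ctfile', metavar='input_counting_table_filename',
                        help='input k-mer count table filename')
    parser.print_help()
    # usage: count-median.py [-h] input_counting_table_filename
    #
    # positional arguments:
    #   input_counting_table_filename
    #                         input k-mer count table filename

Without the ``metavar``, both the usage line and the arguments list would
show the bare destination name ``ctfile`` instead.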
15 changes: 9 additions & 6 deletions scripts/count-overlap.py
@@ -30,16 +30,19 @@

def get_parser():
epilog = """
An additional report will be written to ${report_filename}.curve containing
the increase of overlap k-mers as the number of sequences in the second
database increases.
An additional report will be written to ${output_report_filename}.curve
containing the increase of overlap k-mers as the number of sequences in the
second database increases.
"""
parser = build_hashbits_args(
descr='Count the overlap k-mers which are the k-mers appearing in two '
'sequence datasets.', epilog=textwrap.dedent(epilog))
parser.add_argument('ptfile', help="input k-mer presence table filename")
parser.add_argument('fafile', help="input sequence filename")
parser.add_argument('report_filename', help='output report filename')
parser.add_argument('ptfile', metavar='input_presence_table_filename',
help="input k-mer presence table filename")
parser.add_argument('fafile', metavar='input_sequence_filename',
help="input sequence filename")
parser.add_argument('report_filename', metavar='output_report_filename',
help='output report filename')

return parser

3 changes: 2 additions & 1 deletion scripts/do-partition.py
@@ -96,7 +96,8 @@ def get_parser():
default=True, action='store_false',
help='Keep individual subsets (default: False)')
parser.add_argument('graphbase', help="base name for output files")
parser.add_argument('input_filenames', nargs='+')
parser.add_argument('input_filenames', metavar='input_sequence_filename',
nargs='+', help='input FAST[AQ] sequence filenames')
return parser


5 changes: 3 additions & 2 deletions scripts/extract-partitions.py
@@ -59,8 +59,9 @@ def get_parser():
description="Separate sequences that are annotated with partitions "
"into grouped files.", epilog=textwrap.dedent(epilog),
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('prefix')
parser.add_argument('part_filenames', nargs='+')
parser.add_argument('prefix', metavar='output_filename_prefix')
parser.add_argument('part_filenames', metavar='input_partition_filename',
nargs='+')
parser.add_argument('--max-size', '-X', dest='max_size',
default=DEFAULT_MAX_SIZE, type=int,
help='Max group size (n sequences)')
7 changes: 4 additions & 3 deletions scripts/filter-abund-single.py
@@ -30,12 +30,12 @@

def get_parser():
epilog = """
Trimmed sequences will be placed in ${datafile}.abundfilt.
Trimmed sequences will be placed in ${input_sequence_filename}.abundfilt.
This script is constant memory.
To trim reads based on k-mer abundance across multiple files, use
:program:`load-into-counting` and :program:`filter-abund`.
:program:`load-into-counting.py` and :program:`filter-abund.py`.
Example::
@@ -51,7 +51,8 @@ def get_parser():
parser.add_argument('--savetable', metavar="filename", default='',
help="If present, the name of the file to save the "
"k-mer counting table to")
parser.add_argument('datafile', help="FAST[AQ] sequence file to trim")
parser.add_argument('datafile', metavar='input_sequence_filename',
help="FAST[AQ] sequence file to trim")

return parser

17 changes: 9 additions & 8 deletions scripts/filter-abund.py
@@ -28,9 +28,9 @@

def get_parser():
epilog = """
Trimmed sequences will be placed in ${input_filename}.abundfilt for each
input sequence file. If the input sequences are from RNAseq or metagenome
sequencing then :option:`--variable-coverage` should be used.
Trimmed sequences will be placed in ${input_sequence_filename}.abundfilt
for each input sequence file. If the input sequences are from RNAseq or
metagenome sequencing then :option:`--variable-coverage` should be used.
Example::
@@ -40,14 +40,15 @@ def get_parser():
parser = build_counting_args(
descr='Trim sequences at a minimum k-mer abundance.',
epilog=textwrap.dedent(epilog))
parser.add_argument('input_table')
parser.add_argument('input_filename', nargs='+')
parser.add_argument('input_table', metavar='input_presence_table_filename',
help='The input k-mer presence table filename')
parser.add_argument('input_filename', metavar='input_sequence_filename',
help='Input FAST[AQ] sequence filename', nargs='+')
add_threading_args(parser)
parser.add_argument('--cutoff', '-C', dest='cutoff',
default=DEFAULT_CUTOFF, type=int,
help="Trim at k-mers below this abundance.")

parser.add_argument('-V', '--variable-coverage', action='store_true',
parser.add_argument('--variable-coverage', '-V', action='store_true',
dest='variable_coverage', default=False,
help='Only trim low-abundance k-mers from sequences '
'that have high coverage.')
@@ -56,7 +57,7 @@ def get_parser():
' k-mer abundance.',
default=DEFAULT_NORMALIZE_LIMIT)
parser.add_argument('-o', '--out', dest='single_output_filename',
default='', metavar="optional output filename",
default='', metavar="optional_output_filename",
help='Output the trimmed sequences into a single file '
'with the given filename instead of creating a new '
'file for each input file.')
9 changes: 5 additions & 4 deletions scripts/filter-stoptags.py
@@ -37,10 +37,11 @@ def get_parser():
description="Trim sequences at stoptags.",
epilog=textwrap.dedent(epilog),
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('-k', default=DEFAULT_K, type=int, help='k-mer size',
dest='ksize')
parser.add_argument('stoptags_file')
parser.add_argument('input_filenames', nargs='+')
parser.add_argument('--ksize', '-k', default=DEFAULT_K, type=int,
help='k-mer size')
parser.add_argument('stoptags_file', metavar='input_stoptags_filename')
parser.add_argument('input_filenames', metavar='input_sequence_filename',
nargs='+')
parser.add_argument('--version', action='version', version='%(prog)s '
+ khmer.__version__)
return parser
9 changes: 5 additions & 4 deletions scripts/find-knots.py
@@ -47,9 +47,9 @@ def get_parser():
Load a k-mer presence table/tagset pair created by load-graph, and a set
of pmap files created by partition-graph. Go through each pmap file,
select the largest partition in each, and do the same kind of traversal as
in make-initial-stoptags from each of the waypoints in that partition; this
should identify all of the HCKs in that partition. These HCKs are output to
<graphbase>.stoptags after each pmap file.
in :program:`make-initial-stoptags.py` from each of the waypoints in that
partition; this should identify all of the HCKs in that partition. These
HCKs are output to <graphbase>.stoptags after each pmap file.
Parameter choice is reasonably important. See the pipeline in
:doc:`partitioning-big-data` for an example run.
@@ -70,7 +70,8 @@ def get_parser():
parser.add_argument('--min-tablesize', '-x', type=float,
default=DEFAULT_COUNTING_HT_SIZE, help='lower bound on'
' the size of the k-mer counting table(s)')
parser.add_argument('graphbase')
parser.add_argument('graphbase', help='Basename for the input and output '
'files.')
parser.add_argument('--version', action='version', version='%(prog)s '
+ khmer.__version__)
return parser
7 changes: 5 additions & 2 deletions scripts/load-graph.py
@@ -32,8 +32,11 @@ def get_parser():
parser.add_argument('--no-build-tagset', '-n', default=False,
action='store_true', dest='no_build_tagset',
help='Do NOT construct tagset while loading sequences')
parser.add_argument('output_filename')
parser.add_argument('input_filenames', nargs='+')
parser.add_argument('output_filename',
metavar='output_presence_table_filename', help='output'
' k-mer presence table filename.')
parser.add_argument('input_filenames', metavar='input_sequence_filename',
nargs='+', help='input FAST[AQ] sequence filename')
return parser


3 changes: 2 additions & 1 deletion scripts/make-initial-stoptags.py
@@ -59,7 +59,8 @@ def get_parser():
help='Set subset size (default 1e4 is prob ok)')
parser.add_argument('--stoptags', '-S', metavar='filename', default='',
help="Use stoptags in this file during partitioning")
parser.add_argument('graphbase')
parser.add_argument('graphbase', help='basename for input and output '
'filenames')
return parser


8 changes: 5 additions & 3 deletions scripts/merge-partitions.py
@@ -31,14 +31,16 @@ def get_parser():
Take the ${graphbase}.subset.#.pmap files and merge them all into a single
${graphbase}.pmap.merged file for :program:`annotate-partitions.py` to use.
"""
parser = argparse.ArgumentParser(description="Merge pmap files.",
epilog=textwrap.dedent(epilog))
parser = argparse.ArgumentParser(
description="Merge partition map '.pmap' files.",
epilog=textwrap.dedent(epilog))
parser.add_argument('--ksize', '-k', type=int, default=DEFAULT_K,
help="k-mer size (default: %d)" % DEFAULT_K)
parser.add_argument('--keep-subsets', dest='remove_subsets',
default=True, action='store_false',
help='Keep individual subsets (default: False)')
parser.add_argument('graphbase')
parser.add_argument('graphbase', help='basename for input and output '
'files')
parser.add_argument('--version', action='version', version='%(prog)s '
+ khmer.__version__)
return parser
