
<!--
SPDX-FileCopyrightText: 2014 Alex Shinn

SPDX-License-Identifier: MIT
-->

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"
  'http://www.w3.org/TR/REC-html40/strict.dtd'>
<html lang=en-US>
  <head>
<!-- This commented out text is for the brittle SRFI tools -->
<!--
</head>
<body>
<H1>Title</H1>

Scheme Regular Expressions

<H1>Author</H1>

Alex Shinn

<H1>Status</H1>

This SRFI is currently in ``draft'' status.
-->
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="keywords" content="Scheme, regular expressions, programming language, SRFI">
    <title>SRFI 115: Scheme Regular Expressions</title>
<style type="text/css">
body { 
   width: 7in;
   margin: 30pt;
}
thead {
   font-variant: small-caps;
}
td {
   padding-right: 20px;
}
code.proc-def {
   font-weight: bold;
   color: rgb(120,0,120);
}
.code-example {
   background-color: beige;
}
var {
   font-weight: bold;
   color: rgb(20,20,120);
}
</style>

  </head>

<body>
<h1><a name="Title">Title</a></h1>
<div class="title-text">

<p>
  Scheme Regular Expressions

<p>

<p>
</div>
<h1><a name="Author">Author</a></h1>

<p>
  Alex Shinn

<p>
This SRFI is currently in "draft" status.

To see an explanation of
each status that a SRFI can hold, see <a
href="http://srfi.schemers.org/srfi-process.html">here</a>.

To provide input on this SRFI, please
<a href="mailto:srfi minus 115 at srfi dot schemers dot org">mail to
<code>&lt;srfi minus 115 at srfi dot schemers dot org&gt;</code></a>.  See
<a href="../../srfi-list-subscribe.html">instructions here</a> to
subscribe to the list.  You can access previous messages via
<a href="mail-archive/maillist.html">the archive of the mailing list</a>.
</p>

<ul>
      <li>Received: <a href="http://srfi.schemers.org/srfi-115/srfi-115-1.1.html">2013/10/08</a></li>
      <li>Revised: <a href="http://srfi.schemers.org/srfi-115/srfi-115-1.2.html">2013/11/17</a></li>
      <li>Draft: 2013/10/12-2013/12/12</li>
    </ul>

<p>
<h1>Table of Contents</h1>
<ul id="toc-table">
<li><a href="#Abstract">Abstract</a></li>
<li><a href="#Issues">Issues</a></li>
<li><a href="#Rationale">Rationale</a></li>
<li><a href="#Types-and-Naming-Conventions">Types and Naming Conventions</a></li>
<li><a href="#Compatibility-Levels-and-Features">Compatibility Levels and Features</a></li>
<li><a href="#Library-Procedures-and-Syntax">Library Procedures and Syntax</a></li>
<li><a href="#SRE-Syntax">SRE Syntax</a></li>
<ul>
<ul>
    <li><a href="#SRE_2dSyntax_Basic-Patterns">Basic Patterns</a></li>
    <li><a href="#SRE_2dSyntax_Repeating-patterns">Repeating patterns</a></li>
    <li><a href="#SRE_2dSyntax_Submatch-Patterns">Submatch Patterns</a></li>
    <li><a href="#SRE_2dSyntax_Character-Sets">Character Sets</a></li>
    <li><a href="#SRE_2dSyntax_Named-Character-Sets">Named Character Sets</a></li>
    <li><a href="#SRE_2dSyntax_Boundary-Assertions">Boundary Assertions</a></li>
    <li><a href="#SRE_2dSyntax_Non-Greedy-Patterns">Non-Greedy Patterns</a></li>
    <li><a href="#SRE_2dSyntax_Look-Around-Patterns">Look Around Patterns</a></li>
</ul>
</ul>
<li><a href="#Implementation">Implementation</a></li>
<li><a href="#References">References</a></li>
</ul>

<h1><a name="Abstract">Abstract</a></h1>

<p>
  This SRFI provides a library for matching strings with regular
  expressions described using the SRE "Scheme Regular Expression"
  notation first introduced by <a href="#ref-SCSH">SCSH</a>, and
  extended heavily by <a href="#ref-IrRegex">IrRegex</a>.

<p>

<p>
<h1><a name="Issues">Issues</a></h1>

<p>
  How to integrate with the PCRE regular expression library?  The
  intention is to make this the primitive notation, and for POSIX
  require a separate wrapper such as <code>(pcre-&gt;sre &lt;str&gt;)</code>.
  Alternately we could allow both in the same API, as in IrRegex,
  though this introduces an ambiguity.  Finally, we could make this
  entirely separate from the PCRE API.

<p>
  From SCSH's SREs I've left out the <code>dsm</code> notation which doesn't
  seem as though it need be exposed to the user, the
  <code>posix-string</code> notation because it's better accomplished with
  <code>pcre-&gt;sre</code>, and <code>uncase</code> whose exact semantics and
  motivation I never quite understood.  I also left out the
  <code>blank</code> character class since it's a GNU extension without an
  accepted Unicode definition.

<p>
  | and &amp; are allowed, but the former must be escaped, which looks
  fairly ugly.  For aesthetics they can also be written <code>or</code> and
  <code>and</code>, respectively.

<p>
  I've kept most IrRegex extensions, but made many of the non-POSIX
  ones optional, designated by the <code>regexp-extended</code> feature, and
  <code>backref</code> specifically gets its own feature
  <code>regexp-backrefs</code>.  I left out the common utility patterns
  <code>integer</code>, <code>domain</code>, <code>url</code>, etc., which can easily
  enough be included in libraries and unquoted into SREs.

<p>
  The =&gt; shorthand for named matches used by IrRegex would perhaps
  have better been named &lt;-, the more common choice to represent
  binding in parsers, leaving =&gt; open for the send-to-procedure idiom
  used in cond.

<p>
  The API uses string indices for start, end and match positions,
  which is slow for a UTF8 implementation.  However, the reference
  implementation uses string cursors for efficient iteration,
  minimizing offset conversions, and suffers no penalty if submatch
  strings are directly extracted instead of bounds.

<p>
  Unicode properties and grapheme handling have no precedent in SRE
  implementations, though they have much precedent in other regexp
  libraries.  Making Unicode the default feels right, but the vast
  majority of regexps are likely to want ASCII.

<p> Many Unicode properties as well as Unicode script names that are
  available in PCRE are not provided as char-sets here.

<p> SREs with embedded SRFI 14 char-sets can't be written and read
  back in portably.  R7RS WG2 is considering external syntax
  representations, and may include them for SRFI 14 char-sets as well,
  making this a non-issue.  On the other hand SREs with embedded
  compiled regexps, as allowed in SCSH, are not supported, largely to
  preserve writeability.  Instead you should embed other SREs.

<p> <code>regexp->sre</code> is frequently requested in IrRegex.  It
  is useful and the only argument against it is that it would require
  more memory for compiled regexps (linearly more for most
  implementations), but I'll wait to see if it's requested in the
  discussion.

<p> Library-level features aren't supported in R7RS.

<p>
  There aren't enough examples.

<p>

<p>
<h1><a name="Rationale">Rationale</a></h1>

<p> Regular expressions, coming from a long history of formal language
  theory, are today the lingua franca of simple string matching.  A
  regular expression is an expression describing a regular language,
  the simplest level in the Chomsky hierarchy.  They have the nice
  property that they can match in linear time, whereas parsers for the
  next level in the hierarchy require cubic time.  This combined with
  their conciseness led them to be a popular choice for searching in
  editors, tools and search interfaces.  Other tools may be better
  suited to specific purposes, but it is assumed that any modern
  language will provide regular expression support.

<p> SREs were first introduced in SCSH as an s-expression based
  alternative to the more common string based description.  This
  format offers many advantages, including being easier to read and
  write (notably with structured editors), easier to compose (with no
  escaping issues), and faster and simpler to compile.  An efficient
  reference implementation of this SRFI can be written in under 1000
  lines of code, whereas in IrRegex the full PCRE parser alone
  requires over 500 lines.

<p>

<p>
<h1>Procedure Index</h1>
<table>
<tr>
<td><a href="#proc-regexp">regexp</a></td>
<td><a href="#proc-rx">rx</a></td>
<td><a href="#proc-char-set-sre">char-set-&gt;sre</a></td>
<td><a href="#proc-valid-sre_3f">valid-sre?</a></td>
</tr>
<tr>
<td><a href="#proc-regexp_3f">regexp?</a></td>
<td><a href="#proc-regexp-matches">regexp-matches</a></td>
<td><a href="#proc-regexp-matches_3f">regexp-matches?</a></td>
<td><a href="#proc-regexp-search">regexp-search</a></td>
</tr>
<tr>
<td><a href="#proc-regexp-fold">regexp-fold</a></td>
<td><a href="#proc-regexp-extract">regexp-extract</a></td>
<td><a href="#proc-regexp-split">regexp-split</a></td>
<td><a href="#proc-regexp-partition">regexp-partition</a></td>
</tr>
<tr>
<td><a href="#proc-regexp-replace">regexp-replace</a></td>
<td><a href="#proc-regexp-replace-all">regexp-replace-all</a></td>
<td><a href="#proc-regexp-match_3f">regexp-match?</a></td>
<td><a href="#proc-regexp-match-count">regexp-match-count</a></td>
</tr>
<tr>
<td><a href="#proc-regexp-match-submatch">regexp-match-submatch</a></td>
<td><a href="#proc-regexp-match-submatch-start">regexp-match-submatch-start</a></td>
<td><a href="#proc-regexp-match-submatch-end">regexp-match-submatch-end</a></td>
<td><a href="#proc-regexp-match-_3elist">regexp-match-&gt;list</a></td>
</tr>
</table>

<h1>SRE Syntax Index</h1>
<table><tr>
<td><a href="#proc-_3cstring_3e">&lt;string&gt;</a></td>
<td><a href="#proc-seq">seq</a></td>
<td><a href="#proc-:">:</a></td>
<td><a href="#proc-or">or</a></td>
</tr><tr><td><a href="#proc-_7c_5c_7c_7c">|</a></td>
<td><a href="#proc-w_2fnocase">w/nocase</a></td>
<td><a href="#proc-w_2fcase">w/case</a></td>
<td><a href="#proc-w_2fascii">w/ascii</a></td>
</tr><tr><td><a href="#proc-w_2funicode">w/unicode</a></td>
<td><a href="#proc-_3f">?</a></td>
<td><a href="#proc-_2a">*</a></td>
<td><a href="#proc-_2b">+</a></td>
</tr><tr><td><a href="#proc-_3e_3d">&gt;=</a></td>
<td><a href="#proc-_3d">=</a></td>
<td><a href="#proc-_2a_2a">**</a></td>
<td><a href="#proc-submatch">submatch</a></td>
</tr><tr><td><a href="#proc-_24">$</a></td>
<td><a href="#proc-submatch-named">submatch-named</a></td>
<td><a href="#proc-_3d_3e">=&gt;</a></td>
<td><a href="#proc-backref">backref</a></td>
</tr><tr><td><a href="#proc-_3cchar_3e">&lt;char&gt;</a></td>
<td><a href="#proc-_3cstring_3e_29">(&lt;string&gt;)</a></td>
<td><a href="#proc-_2f">/</a></td>
<td><a href="#proc-or">or</a></td>
</tr><tr><td><a href="#proc-_7e">~</a></td>
<td><a href="#proc--">-</a></td>
<td><a href="#proc-and">and</a></td>
<td><a href="#proc-_26">&amp;</a></td>
</tr><tr><td><a href="#proc-any">any</a></td>
<td><a href="#proc-nonl">nonl</a></td>
<td><a href="#proc-ascii">ascii</a></td>
<td><a href="#proc-lower-case">lower-case</a></td>
</tr><tr><td><a href="#proc-lower">lower</a></td>
<td><a href="#proc-upper-case">upper-case</a></td>
<td><a href="#proc-upper">upper</a></td>
<td><a href="#proc-alphabetic">alphabetic</a></td>
</tr><tr><td><a href="#proc-alpha">alpha</a></td>
<td><a href="#proc-numeric">numeric</a></td>
<td><a href="#proc-num">num</a></td>
<td><a href="#proc-alphanumeric">alphanumeric</a></td>
</tr><tr><td><a href="#proc-alphanum">alphanum</a></td>
<td><a href="#proc-alnum">alnum</a></td>
<td><a href="#proc-punctuation">punctuation</a></td>
<td><a href="#proc-punct">punct</a></td>
</tr><tr><td><a href="#proc-symbol">symbol</a></td>
<td><a href="#proc-graphic">graphic</a></td>
<td><a href="#proc-graph">graph</a></td>
<td><a href="#proc-whitespace">whitespace</a></td>
</tr><tr><td><a href="#proc-white">white</a></td>
<td><a href="#proc-space">space</a></td>
<td><a href="#proc-printing">printing</a></td>
<td><a href="#proc-print">print</a></td>
</tr><tr><td><a href="#proc-control">control</a></td>
<td><a href="#proc-cntrl">cntrl</a></td>
<td><a href="#proc-hex-digit">hex-digit</a></td>
<td><a href="#proc-xdigit">xdigit</a></td>
</tr><tr><td><a href="#proc-bos">bos</a></td>
<td><a href="#proc-eos">eos</a></td>
<td><a href="#proc-bol">bol</a></td>
<td><a href="#proc-eol">eol</a></td>
</tr><tr><td><a href="#proc-bow">bow</a></td>
<td><a href="#proc-eow">eow</a></td>
<td><a href="#proc-nwb">nwb</a></td>
<td><a href="#proc-word">word</a></td>
</tr><tr><td><a href="#proc-word_2b">word+</a></td>
<td><a href="#proc-word">word</a></td>
<td><a href="#proc-bog">bog</a></td>
<td><a href="#proc-eog">eog</a></td>
</tr><tr><td><a href="#proc-grapheme">grapheme</a></td>
<td><a href="#proc-_3f_3f">??</a></td>
<td><a href="#proc-_2a_3f">*?</a></td>
<td><a href="#proc-_2a_2a_3f">**?</a></td>
</tr><tr><td><a href="#proc-look-ahead">look-ahead</a></td>
<td><a href="#proc-look-behind">look-behind</a></td>
<td><a href="#proc-neg-look-ahead">neg-look-ahead</a></td>
<td><a href="#proc-neg-look-behind">neg-look-behind</a></td>
</tr>
</table>

<h1><a name="Types-and-Naming-Conventions">Types and Naming Conventions</a></h1>

<p>
  We introduce two new types, <code>regexp</code> and
  <code>regexp-match</code>, which are disjoint from all other types.  We
  also introduce the concept of an "SRE," which is not a disjoint type
  but is a Scheme object following the specification described below.

<p>
  SRFI 14 defines the <code>char-set</code> type, which can be used as
  part of an SRE.

<p>
  In the prototypes below the following naming conventions imply type
  restrictions:

<p>
<ul>
<li><var>char-set</var>: a SRFI 14 character set
<li><var>cset-sre</var>: an SRE which corresponds to matching a single character out of a set of characters
<li><var>end</var>: an exact, non-negative integer, defaulting to <code>(string-length str)</code>
<li><var>finish</var>: a procedure <code>(lambda (i regexp-match str obj) ...)</code>
<li><var>obj</var>: any object
<li><var>knil</var>: any object
<li><var>kons</var>: a procedure <code>(lambda (i regexp-match str obj) ...)</code>
<li><var>re</var>: an SRE or pre-compiled regexp object
<li><var>regexp-match</var>: a regexp-match object from a successful match
<li><var>sre</var>: an SRE as described below
<li><var>start</var>: an exact, non-negative integer, defaulting to 0
<li><var>str</var>: a string
<li><var>subst</var>: an sexp describing a substitution template
<li><var>X-or-false</var>: either an object of type X or the false value

<p>

<p>
</ul>
<h1><a name="Compatibility-Levels-and-Features">Compatibility Levels and Features</a></h1>

<p>
  We specify a thorough, though not exhaustive, syntax with many
  extensions popular in modern regular expression libraries such as
  <a href="#ref-PCRE">PCRE</a>.  This is because it is assumed in many
  cases said libraries will be used as the underlying implementation,
  the features will be desirable, and if left unspecified people will
  provide their own, often incompatible, extensions.

<p> On the other hand it is acknowledged that not all implementations
  will be able to support all extensions.  Some are difficult to
  implement for DFA implementations, and some, like
  <code>backref</code>, are prohibitively expensive for any
  implementation.  Furthermore, even if an implementation has Unicode
  support, its regexp library may not.

<p>
  To resolve these differences we divide the syntax into a minimal
  core which all implementations are required to support, and
  additional extensions.  In <a href="#ref-R7RS">R7RS</a> or other
  implementations which support <a href="#ref-SRFI-0">SRFI 0</a>
  <code>cond-expand</code>, the availability can be tested with the
  following <code>cond-expand</code> features:

<p>
<ul>
<li><code>regexp-non-greedy</code> - the non-greedy repetition patterns <code>??</code>, <code>*?</code>, and <code>**?</code> are supported
<li><code>regexp-look-around</code> - the <code>[neg]-look-ahead</code> and <code>[neg]-look-behind</code> zero-width assertions are supported
<li><code>regexp-backrefs</code> - the <code>backref</code> pattern is supported
<li><code>regexp-unicode</code> - regexp character sets support Unicode
</ul>

<p>
  The first three simply refer to support for certain SRE patterns.

<p>
  <code>regexp-unicode</code> indicates support for Unicode contexts.
  Toggling between Unicode and ASCII can be done with the
  <code>w/unicode</code> and <code>w/ascii</code> patterns.  In a
  Unicode context, the named character sets have their full Unicode
  definition as described below, grapheme boundaries are "extended
  grapheme clusters," and word boundaries are "default word
  boundaries" as defined in <a href="#ref-UAX29">UAX #29</a> (Unicode
  Text Segmentation).  Thus Unicode contexts are equivalent to Level 2
  support for regular expressions as defined in Unicode TR-18.
  Implementations which provide this feature may still support
  non-Unicode characters.
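
<p> As a minimal sketch of the feature tests listed above, a program
  might choose between a non-greedy pattern and a portable equivalent
  built only from core syntax.  The name <code>quoted-sre</code> is
  hypothetical and used only for illustration:

<pre class="code-example">
   ;; quoted-sre is an illustrative name, not part of this SRFI.
   (cond-expand
     (regexp-non-greedy
      ;; Non-greedy repetition is supported: stop at the first closing quote.
      (define quoted-sre '(: #\" (*? any) #\")))
     (else
      ;; Fallback: any run of characters other than a double quote.
      (define quoted-sre '(: #\" (* (~ #\")) #\"))))
</pre>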

<p>

<p>
<h1><a name="Library-Procedures-and-Syntax">Library Procedures and Syntax</a></h1>

<p>
<dt>(<a name="proc-regexp"><code class="proc-def">regexp</code></a> <var>re</var>) => regexp
<dd class="proc-def"></dd>

<p> Compile a regexp if given an object whose structure matches the
  SRE syntax.  This may be written as a literal or partial literal
  with <code>quote</code> or <code>quasiquote</code>, or may be
  generated entirely programmatically.  Returns <var>re</var>
  unmodified if it is already a regexp.  Raises an error if
  <var>re</var> is neither a regexp nor a valid representation of an
  SRE.

<p> Mutating <var>re</var> may invalidate the resulting regexp,
  causing unspecified results if subsequently used for matching.
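
<p> The following sketch illustrates the three ways of supplying an
  SRE described above.  The names <code>digits-re</code> and
  <code>make-list-re</code> are hypothetical:

<pre class="code-example">
   ;; A fully quoted SRE literal.
   (define digits-re (regexp '(+ numeric)))

   ;; A partial literal: the separator is inserted with quasiquote.
   (define (make-list-re sep)
     (regexp `(: (+ numeric) (* ,sep (+ numeric)))))

   (regexp? digits-re)                               ; =&gt; #t
   (eq? digits-re (regexp digits-re))                ; =&gt; #t, returned unmodified
   (regexp-matches? (make-list-re ",") "1,22,333")   ; =&gt; #t
</pre>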

<p>
<dt>(<a name="proc-rx"><code class="proc-def">rx</code></a> <var>sre</var> <var>...</var>) => regexp
<dd class="proc-def"></dd>

<p> Macro shorthand for <code>(regexp `(: <var>sre</var> ...))</code>.
  May be able to perform some or all computation at compile time if
  <var>sre</var> is not unquoted.  Note because of this equivalence
  with the procedural constructor <code>regexp</code>, the semantics
  of <code>unquote</code> differs from the original SCSH
  implementation in that unquoted expressions can expand into any
  object matching the SRE syntax, rather than a compiled regexp
  object.  Further, <code>unquote</code> and
  <code>unquote-splicing</code> both expand all matches.

<blockquote style="background:lightgray"> Rationale: Providing a
  procedural interface provides for greater flexibility, and without
  loss of potential compile-time optimizations by preserving the
  syntactic shorthand.  The alternative is to rely on eval to
  dynamically generate regular expressions.  However regexps in many
  cases come from untrusted sources, such as search parameters to a
  server, or from serialized sources such as config files or
  command-line arguments.  Moreover many applications may want to keep
  many thousands of regexps in memory at once.  Given the relatively
  heavy cost and insecurity of eval, and the frequency with which SREs
  are read and written as text, we prefer the procedural interface.
  </blockquote>
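
<p> As an illustration of the equivalence stated above, the two
  definitions in this sketch are intended to produce regexps that
  behave identically.  The variable <code>sep</code> and the names
  <code>re1</code> and <code>re2</code> are hypothetical:

<pre class="code-example">
   (define sep ",")

   ;; Macro shorthand; sep is unquoted inside the pattern.
   (define re1 (rx (+ numeric) (* ,sep (+ numeric))))

   ;; The procedural form it stands for.
   (define re2 (regexp `(: (+ numeric) (* ,sep (+ numeric)))))

   (regexp-matches? re1 "10,20,30")   ; =&gt; #t
   (regexp-matches? re2 "10,20,30")   ; =&gt; #t
</pre>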

<p>
<dt>(<a name="proc-char-set-sre"><code class="proc-def">char-set-&gt;sre</code></a> <var>char-set</var>) => sre
<dd class="proc-def"></dd>

<p>
  Returns an SRE corresponding to the given SRFI 14 character set.
  The resulting SRE expands the character set into notation which does
  not make use of embedded SRFI 14 character sets, and so is suitable
  for writing portably.
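
<p> A short sketch of the conversion, assuming the SRFI 14 procedure
  <code>string-&gt;char-set</code> is available.  The names
  <code>vowels</code> and <code>vowels-sre</code> are hypothetical, and
  the exact shape of the returned SRE is left to the implementation,
  so no printed form is shown:

<pre class="code-example">
   (define vowels (string-&gt;char-set "aeiou"))    ; a SRFI 14 char-set
   (define vowels-sre (char-set-&gt;sre vowels))    ; no embedded char-set objects

   ;; The converted SRE can be used like any other cset-sre.
   (regexp-matches? `(+ ,vowels-sre) "aeiou")   ; =&gt; #t
   (regexp-matches? `(+ ,vowels-sre) "xyz")     ; =&gt; #f
</pre>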

<p>
<dt>(<a name="proc-valid-sre_3f"><code class="proc-def">valid-sre?</code></a> <var>obj</var>) => boolean
<dd class="proc-def"></dd>

<p>
  Returns true iff <var>obj</var> can be safely passed to <var>regexp</var>.

<p>
<dt>(<a name="proc-regexp_3f"><code class="proc-def">regexp?</code></a> <var>obj</var>) => boolean
<dd class="proc-def"></dd>

<p>
  Returns true iff <var>obj</var> is a regexp.

<p>
<dt>(<a name="proc-regexp-matches"><code class="proc-def">regexp-matches</code></a> <var>re</var> <var>str</var> <var>[start</var> <var>[end]]</var>) => regexp-match-or-false
<dd class="proc-def"></dd>

<p>
  Returns a regexp-match object if <var>re</var> successfully matches the entire
  string <var>str</var> from <var>start</var> (inclusive) to <var>end</var> (exclusive), or <code>#f</code> if the
  match fails.  The regexp-match object will contain information needed to
  extract any submatches.

<p>
<dt>(<a name="proc-regexp-matches_3f"><code class="proc-def">regexp-matches?</code></a> <var>re</var> <var>str</var> <var>[start</var> <var>[end]]</var>) => boolean
<dd class="proc-def"></dd>

<p>
  Returns <code>#t</code> if <var>re</var> matches <var>str</var> as in regexp-matches, or
  <code>#f</code> otherwise.  May be faster than regexp-matches since it
  doesn't need to return submatch data.
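
<p> The sketch below illustrates whole-string matching and the effect
  of the optional <var>start</var> and <var>end</var> arguments, using
  only the procedures described above:

<pre class="code-example">
   (regexp-matches '(+ numeric) "12345")          ; =&gt; #&lt;regexp-match&gt;
   (regexp-matches '(+ numeric) "12345abc")       ; =&gt; #f, the whole string must match
   (regexp-matches '(+ numeric) "12345abc" 0 5)   ; =&gt; #&lt;regexp-match&gt;
   (regexp-matches? '(+ numeric) "abc12345" 3)    ; =&gt; #t
</pre>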

<p>
<dt>(<a name="proc-regexp-search"><code class="proc-def">regexp-search</code></a> <var>re</var> <var>str</var> <var>[start</var> <var>[end]]</var>) => regexp-match-or-false
<dd class="proc-def"></dd>

<p>
  Returns a regexp-match object if <var>re</var> successfully matches a substring
  of <var>str</var> between <var>start</var> (inclusive) and <var>end</var> (exclusive), or
  <code>#f</code> if the match fails.  The regexp-match object will contain
  information needed to extract any submatches.
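
<p> For contrast with <code>regexp-matches</code>, this sketch shows a
  search succeeding on a substring, and failing when the optional
  bounds exclude it:

<pre class="code-example">
   (regexp-search '(+ numeric) "abc123def")       ; =&gt; #&lt;regexp-match&gt;
   (regexp-match-submatch
    (regexp-search '(+ numeric) "abc123def") 0)   ; =&gt; "123"
   (regexp-search '(+ numeric) "abc123def" 0 3)   ; =&gt; #f, no digits in "abc"
</pre>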

<p>
<dt>(<a name="proc-regexp-fold"><code class="proc-def">regexp-fold</code></a> <var>re</var> <var>kons</var> <var>knil</var> <var>str</var> <var>[finish</var> <var>[start</var> <var>[end]]]</var>) => obj
<dd class="proc-def"></dd>

<p>
  The fundamental regexp matching iterator.  Repeatedly searches <var>str</var>
  for the regexp <var>re</var> so long as a match can be found.  On each
  successful match, applies
<pre class="code-example">
   (<var>kons</var> <i>i</i> <i>regexp-match</i> <i>str</i> <i>acc</i>)
</pre>
  where <i>i</i> is the index since the last match (beginning with <var>start</var>),
  <i>regexp-match</i> is the resulting match, and <i>acc</i> is the result of the
  previous <var>kons</var> application, beginning with <var>knil</var>.  When no more
  matches can be found, calls <var>finish</var> with the same arguments, except
  that <i>regexp-match</i> is #f.

<p>
  By default <var>finish</var> just returns <i>acc</i>.
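
<p> As a sketch of how the iterator might be used, the hypothetical
  helper <code>all-matches</code> below collects every match of
  <var>re</var> in order.  It supplies an explicit <var>finish</var>
  to reverse the accumulated list:

<pre class="code-example">
   ;; all-matches is an illustrative name, not part of this SRFI.
   (define (all-matches re str)
     (regexp-fold re
                  ;; kons: push the text of each match onto the accumulator.
                  (lambda (i m str acc)
                    (cons (regexp-match-submatch m 0) acc))
                  '()
                  str
                  ;; finish: m is #f here; restore left-to-right order.
                  (lambda (i m str acc) (reverse acc))))

   (all-matches '(+ numeric) "192.168.0.1")
   ; =&gt; ("192" "168" "0" "1")
</pre>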

<p>
<dt>(<a name="proc-regexp-extract"><code class="proc-def">regexp-extract</code></a> <var>re</var> <var>str</var> <var>[start</var> <var>[end]]</var>) => list
<dd class="proc-def"></dd>

<p>
  Extract all non-empty substrings of <var>str</var> which match <var>re</var> between
  <var>start</var> and <var>end</var> as a list of strings.

<p>
<pre class="code-example">
   (regexp-extract '(+ numeric) "192.168.0.1")
   =&gt; ("192" "168" "0" "1")
</pre>


<dt>(<a name="proc-regexp-split"><code class="proc-def">regexp-split</code></a> <var>re</var> <var>str</var> <var>[start</var> <var>[end]]</var>) => list
<dd class="proc-def"></dd>

<p>
  Split <var>str</var> into a list of strings separated by matches of <var>re</var>.

<p>
<pre class="code-example">
   (regexp-split '(+ space) " fee fi  fo\tfum\n")
   =&gt; ("fee" "fi" "fo" "fum")
</pre>


<dt>(<a name="proc-regexp-partition"><code class="proc-def">regexp-partition</code></a> <var>re</var> <var>str</var> <var>[start</var> <var>[end]]</var>) => list
<dd class="proc-def"></dd>

<p>
  Partition <var>str</var> into a list of non-empty strings matching <var>re</var>,
  interspersed with the unmatched portions of the string.  The first
  and every odd element is an unmatched substring, which will be the
  empty string if <var>re</var> matches at the beginning of the string or end
  of the previous match.  The second and every even element will be a
  substring matching <var>re</var>.  If the final match ends at the end of the
  string, no trailing empty string will be included.  Thus, in the
  degenerate case where <var>str</var> is the empty string, the result is
  <code>("")</code>.

<p>
<pre class="code-example">
   (regexp-partition '(+ (or space punct)) "")
   =&gt; ("")
   (regexp-partition '(+ (or space punct)) "Hello, world!\n")
   =&gt; ("Hello" ", " "world" "!\n")
   (regexp-partition '(+ (or space punct)) "¿Dónde Estás?")
   =&gt; ("" "¿" "Dónde" " " "Estás" "?")
</pre>


<dt>(<a name="proc-regexp-replace"><code class="proc-def">regexp-replace</code></a> <var>re</var> <var>str</var> <var>subst</var> <var>[start</var> <var>[end]]</var>) => string
<dd class="proc-def"></dd>

<p>
  Returns a new string replacing the first match of <var>re</var> in <var>str</var> with
  <var>subst</var>.  <var>subst</var> can be a string, an integer or symbol
  indicating the contents of a numbered or named submatch of <var>re</var>,
  <var>'pre</var> for the substring to the left of the match, or <var>'post</var> for
  the substring to the right of the match.

<p>
<pre class="code-example">
   (regexp-replace '(+ space) "one two three" "_")
   =&gt; "one_two three"
</pre>
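
<p> A further sketch, showing an integer <var>subst</var> that refers
  to a numbered submatch as described above.  Here the bracketed text
  is replaced by the digits it encloses:

<pre class="code-example">
   (regexp-replace '(: "[" ($ (+ numeric)) "]") "a[42]b" 1)
   ; =&gt; "a42b"
</pre>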


<dt>(<a name="proc-regexp-replace-all"><code class="proc-def">regexp-replace-all</code></a> <var>re</var> <var>str</var> <var>subst</var> <var>[start</var> <var>[end]]</var>) => string
<dd class="proc-def"></dd>

<p>
  Equivalent to <var>regexp-replace</var>, but replaces all occurrences of <var>re</var>
  in <var>str</var>.

<p>
<pre class="code-example">
   (regexp-replace-all '(+ space) "one two three" "_")
   =&gt; "one_two_three"
</pre>


<dt>(<a name="proc-regexp-match_3f"><code class="proc-def">regexp-match?</code></a> <var>obj</var>) => boolean
<dd class="proc-def"></dd>

<p>
  Returns true iff <var>obj</var> is a successful match from <var>regexp-matches</var> or
  <var>regexp-search</var>.

<p>
<dt>(<a name="proc-regexp-match-count"><code class="proc-def">regexp-match-count</code></a> <var>regexp-match</var>) => integer
<dd class="proc-def"></dd>

<p>
  Returns the number of submatches of regexp-match, regardless of whether
  they matched or not.

<p>
<dt>(<a name="proc-regexp-match-submatch"><code class="proc-def">regexp-match-submatch</code></a> <var>regexp-match</var> <var>field</var>) => string-or-false
<dd class="proc-def"></dd>

<p> Returns the substring matched in <var>regexp-match</var>
  corresponding to <var>field</var>, either an integer or a symbol for
  a named submatch.  Index 0 refers to the entire match, index 1 to
  the first lexicographic submatch, and so on.  If passed an integer
  outside the range of matches, or a symbol which does not correspond
  to a named submatch of the pattern, it is an error.  If the
  corresponding submatch did not match, returns false.

<p> The result of extracting a submatch after the original matched
string has been mutated is unspecified.
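
<p> The sketch below extracts numbered and named submatches from a
  single match, and shows a submatch that did not participate.  The
  names <code>m</code>, <code>m2</code> and <code>value</code> are
  hypothetical:

<pre class="code-example">
   (define m
     (regexp-search '(: ($ (+ alpha)) "=" (=&gt; value (+ numeric))) "key=42;"))

   (regexp-match-submatch m 0)        ; =&gt; "key=42", the entire match
   (regexp-match-submatch m 1)        ; =&gt; "key"
   (regexp-match-submatch m 'value)   ; =&gt; "42"

   ;; Only the second alternative matches, so submatch 1 is false.
   (define m2 (regexp-search '(or ($ "a") ($ "b")) "b"))
   (regexp-match-submatch m2 1)       ; =&gt; #f
   (regexp-match-submatch m2 2)       ; =&gt; "b"
</pre>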

<p>
<dt>(<a name="proc-regexp-match-submatch-start"><code class="proc-def">regexp-match-submatch-start</code></a> <var>regexp-match</var> <var>field</var>) => integer-or-false
<dd class="proc-def"></dd>

<p>
  Returns the start index in <var>regexp-match</var> corresponding to
  <var>field</var>, as in <var>regexp-match-submatch</var>.

<p>
<dt>(<a name="proc-regexp-match-submatch-end"><code class="proc-def">regexp-match-submatch-end</code></a> <var>regexp-match</var> <var>field</var>) => integer-or-false
<dd class="proc-def"></dd>

<p>
  Returns the end index in <var>regexp-match</var> corresponding to
  <var>field</var>, as in <var>regexp-match-submatch</var>.
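
<p> A sketch relating the two index accessors to
  <code>substring</code>, assuming the end index is exclusive in the
  same way as the <var>end</var> argument of the matching procedures.
  The names <code>str</code> and <code>m</code> are hypothetical:

<pre class="code-example">
   (define str "key=42;")
   (define m (regexp-search '(: "=" ($ (+ numeric))) str))

   (regexp-match-submatch-start m 1)   ; =&gt; 4
   (regexp-match-submatch-end m 1)     ; =&gt; 6

   ;; Assumed equivalence with regexp-match-submatch:
   (substring str
              (regexp-match-submatch-start m 1)
              (regexp-match-submatch-end m 1))   ; =&gt; "42"
</pre>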

<p>
<dt>(<a name="proc-regexp-match-_3elist"><code class="proc-def">regexp-match-&gt;list</code></a> <var>regexp-match</var>) => list
<dd class="proc-def"></dd>

<p>
  Returns a list of all submatches in <var>regexp-match</var> as strings or false,
  beginning with the entire match 0.
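
<p> For example, a match with two numbered submatches would be
  expected to convert as follows:

<pre class="code-example">
   (regexp-match-&gt;list
    (regexp-search '(: ($ (+ alpha)) "=" ($ (+ numeric))) "key=42;"))
   ; =&gt; ("key=42" "key" "42")
</pre>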

<p>

<p>
<h1><a name="SRE-Syntax">SRE Syntax</a></h1>

<p> The grammar for SREs is summarized below.  Note that an SRE is a
  first-class object consisting of nested lists of strings, chars,
  char-sets, symbols and numbers.  Where the syntax is described as
  <code>(foo bar)</code>, this can be constructed equivalently as
  <code>'(foo bar)</code> or <code>(list 'foo 'bar)</code>, etc.

  The following sections explain the semantics in greater detail.

<p>
<pre class="code-example">
    &lt;sre&gt; ::=
     | &lt;string&gt;                    ; A literal string match.
     | &lt;cset-sre&gt;                  ; A character set match.
     | (* &lt;sre&gt; ...)               ; 0 or more matches.
     | (+ &lt;sre&gt; ...)               ; 1 or more matches.
     | (? &lt;sre&gt; ...)               ; 0 or 1 matches.
     | (= &lt;n&gt; &lt;sre&gt; ...)           ; &lt;n&gt; matches.
     | (&gt;= &lt;n&gt; &lt;sre&gt; ...)          ; &lt;n&gt; or more matches.
     | (** &lt;n&gt; &lt;m&gt; &lt;sre&gt; ...)      ; &lt;n&gt; to &lt;m&gt; matches.

     | (|  &lt;sre&gt; ...)              ; Alternation.
     | (or &lt;sre&gt; ...)

     | (:   &lt;sre&gt; ...)             ; Sequence.
     | (seq &lt;sre&gt; ...)     
     | ($ &lt;sre&gt; ...)               ; Numbered submatch.
     | (submatch &lt;sre&gt; ...)
     | (=&gt; &lt;name&gt; &lt;sre&gt; ...)               ;  Named submatch.  &lt;name&gt; is
     | (submatch-named &lt;name&gt; &lt;sre&gt; ...)   ;  a symbol.

     | (w/case   &lt;sre&gt; ...)        ; Introduce a case-sensitive context.
     | (w/nocase &lt;sre&gt; ...)        ; Introduce a case-insensitive context.

     | (w/unicode   &lt;sre&gt; ...)     ; Introduce a unicode context.
     | (w/ascii &lt;sre&gt; ...)         ; Introduce an ascii context.

     | bos                         ; Beginning of string.
     | eos                         ; End of string.

     | bol                         ; Beginning of line.
     | eol                         ; End of line.

     | bog                         ; Beginning of grapheme cluster.
     | eog                         ; End of grapheme cluster.
     | grapheme                    ; A single grapheme cluster.

     | bow                         ; Beginning of word.
     | eow                         ; End of word.
     | nwb                         ; A non-word boundary.
     | (word &lt;sre&gt; ...)            ; A sre wrapped in word boundaries.
     | (word+ &lt;cset-sre&gt; ...)      ; A single word restricted to a cset.
     | word                        ; A single word.

     | (?? &lt;sre&gt; ...)              ; A non-greedy pattern, 0 or 1 match.
     | (*? &lt;sre&gt; ...)              ; Non-greedy 0 or more matches.
     | (**? &lt;n&gt; &lt;m&gt; &lt;sre&gt; ...)     ; Non-greedy &lt;n&gt; to &lt;m&gt; matches.
     | (look-ahead &lt;sre&gt; ...)      ; Zero-width look-ahead assertion.
     | (look-behind &lt;sre&gt; ...)     ; Zero-width look-behind assertion.
     | (neg-look-ahead &lt;sre&gt; ...)  ; Zero-width negative look-ahead assertion.
     | (neg-look-behind &lt;sre&gt; ...) ; Zero-width negative look-behind assertion.
</pre>


  The grammar for <code>cset-sre</code> is as follows.

<p>
<pre class="code-example">
    &lt;cset-sre&gt; ::=
     | &lt;char&gt;                      ; literal char
     | "&lt;char&gt;"                    ; string of one char
     | &lt;char-set&gt;                  ; embedded SRFI 14 char set
     | (&lt;string&gt;)                  ; literal char set
     | (/ &lt;range-spec&gt; ...)        ; ranges
     | (or &lt;cset-sre&gt; ...)         ; union
     | (and &lt;cset-sre&gt; ...)        ; intersection
     | (- &lt;cset-sre&gt; ...)          ; difference
     | (~ &lt;cset-sre&gt; ...)          ; complement of union
     | (w/case &lt;cset-sre&gt; ...)     ; case and unicode toggling
     | (w/nocase &lt;cset-sre&gt; ...)
     | (w/ascii &lt;cset-sre&gt; ...)
     | (w/unicode &lt;cset-sre&gt; ...)
     | any | nonl | ascii | lower-case | lower
     | upper-case | upper | alphabetic | alpha
     | numeric | num | alphanumeric | alphanum | alnum
     | punctuation | punct | symbol | graphic | graph
     | whitespace | white | space | printing | print
     | control | cntrl | hex-digit | xdigit
</pre>


<pre class="code-example">
    &lt;range-spec&gt; ::= &lt;string&gt; | &lt;char&gt;
</pre>
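
<p> As noted at the start of this section, an SRE is an ordinary
  Scheme object, so a quoted literal and a programmatically built list
  are interchangeable.  A minimal sketch, with the hypothetical names
  <code>literal-sre</code> and <code>built-sre</code>:

<pre class="code-example">
   (define literal-sre '(: "id" (+ numeric)))
   (define built-sre (list ': "id" (list '+ 'numeric)))

   (equal? literal-sre built-sre)        ; =&gt; #t
   (regexp-matches? built-sre "id42")    ; =&gt; #t
   (regexp-matches? built-sre "id")      ; =&gt; #f, at least one digit is required
</pre>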


<h3><a name="SRE_2dSyntax_Basic-Patterns">Basic Patterns</a></h3>

<p>
<dt><a name="proc-_3cstring_3e"><code class="proc-def">&lt;string&gt;</code></a> 
<dd class="proc-def"></dd>

<p>
  A literal string.

<p>
<pre class="code-example">
   (regexp-search "needle" "hayneedlehay") =&gt; #&lt;regexp-match&gt;
   (regexp-search "needle" "haynEEdlehay") =&gt; #f
</pre>


<dt>(<a name="proc-seq"><code class="proc-def">seq</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>
<dt>(<a name="proc-:"><code class="proc-def">:</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  Sequencing.

<p>
<pre class="code-example">
   (regexp-search '(: "one" space "two" space "three") "one two three") =&gt; #&lt;regexp-match&gt;
</pre>


<dt>(<a name="proc-or"><code class="proc-def">or</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>
<dt>(<a name="proc-_7c_5c_7c_7c"><code class="proc-def">|\||</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  Alternation.

<p>
<pre class="code-example">
   (regexp-search '(or "eeney" "meeney" "miney") "meeney") =&gt; #&lt;regexp-match&gt;
   (regexp-search '(or "eeney" "meeney" "miney") "moe") =&gt; #f
</pre>


<dt>(<a name="proc-w_2fnocase"><code class="proc-def">w/nocase</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>

<p> Enclosed <var>sres</var> are case-insensitive.  In a Unicode
  context character and string literals match with the default simple
  Unicode case-insensitive matching, and character sets match if any
  character in the set matches case-insensitively.  Implementations
  may, but are not required to, handle variable length case
  conversions, such as #\x00DF "ß" matching the two characters "SS".

  In an ASCII context only the 52 ASCII letters "a-zA-Z" match
  case-insensitively to each other.

<p>
<pre class="code-example">
   (regexp-search '(w/nocase "needle") "haynEEdlehay") =&gt; #&lt;regexp-match&gt;
</pre>


<dt>(<a name="proc-w_2fcase"><code class="proc-def">w/case</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  Enclosed <var>sres</var> are case-sensitive.  This is the default.

<p>
<pre class="code-example">
   (regexp-search '(w/nocase "SMALL" (w/case "BIG")) "smallBIGsmall") =&gt; #&lt;regexp-match&gt;
</pre>


<dt>(<a name="proc-w_2fascii"><code class="proc-def">w/ascii</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  Enclosed <var>sres</var> are interpreted in an ASCII context.  In practice
  many regular expressions are used for simple parsing and only ASCII
  characters are relevant.  Switching to ASCII mode can improve
  performance in some implementations.

<p>
<pre class="code-example">
   (regexp-search '(w/ascii bos (* alpha) eos) "English") =&gt; #&lt;regexp-match&gt;
   (regexp-search '(w/ascii bos (* alpha) eos) "Ελληνική") =&gt; #f
</pre>


<dt>(<a name="proc-w_2funicode"><code class="proc-def">w/unicode</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  Enclosed <var>sres</var> are interpreted in a Unicode context - character
  sets with both an ASCII and Unicode definition take the latter.  Has
  no effect if the <code>regexp-unicode</code> feature is not provided.  This
  is the default.

<p>
<pre class="code-example">
   (regexp-search '(w/unicode bos (* alpha) eos) "English") =&gt; #&lt;regexp-match&gt;
   (regexp-search '(w/unicode bos (* alpha) eos) "Ελληνική") =&gt; #&lt;regexp-match&gt;
</pre>


<h3><a name="SRE_2dSyntax_Repeating-patterns">Repeating patterns</a></h3>

<p>
<dt>(<a name="proc-_3f"><code class="proc-def">?</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  An optional pattern - matches 1 or 0 times.

<p>
<pre class="code-example">
   (regexp-search '(: "match" (? "es") "!") "matches!") =&gt; #&lt;regexp-match&gt;
   (regexp-search '(: "match" (? "es") "!") "match!") =&gt; #&lt;regexp-match&gt;
   (regexp-search '(: "match" (? "es") "!") "matche!") =&gt; #f
</pre>


<dt>(<a name="proc-_2a"><code class="proc-def">*</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  Kleene star, matches 0 or more times.

<p>
<pre class="code-example">
   (regexp-search '(: "&lt;" (* (~ #\&gt;)) "&gt;") "&lt;html&gt;") =&gt; #&lt;regexp-match&gt;
   (regexp-search '(: "&lt;" (* (~ #\&gt;)) "&gt;") "&lt;&gt;") =&gt; #&lt;regexp-match&gt;
   (regexp-search '(: "&lt;" (* (~ #\&gt;)) "&gt;") "&lt;html") =&gt; #f
</pre>


<dt>(<a name="proc-_2b"><code class="proc-def">+</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  1 or more matches.  Like <code>*</code> but requires at least a single match.

<p>
<pre class="code-example">
   (regexp-search '(: "&lt;" (+ (~ #\&gt;)) "&gt;") "&lt;html&gt;") =&gt; #&lt;regexp-match&gt;
   (regexp-search '(: "&lt;" (+ (~ #\&gt;)) "&gt;") "&lt;a&gt;") =&gt; #&lt;regexp-match&gt;
   (regexp-search '(: "&lt;" (+ (~ #\&gt;)) "&gt;") "&lt;&gt;") =&gt; #f
</pre>


<dt>(<a name="proc-_3e_3d"><code class="proc-def">&gt;=</code></a> <var>n</var> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  More generally, <var>n</var> or more matches.

<p>
<pre class="code-example">
   (regexp-search '(: "&lt;" (&gt;= 3 (~ #\&gt;)) "&gt;") "&lt;table&gt;") =&gt; #&lt;regexp-match&gt;
   (regexp-search '(: "&lt;" (&gt;= 3 (~ #\&gt;)) "&gt;") "&lt;pre&gt;") =&gt; #&lt;regexp-match&gt;
   (regexp-search '(: "&lt;" (&gt;= 3 (~ #\&gt;)) "&gt;") "&lt;tr&gt;") =&gt; #f
</pre>


<dt>(<a name="proc-_3d"><code class="proc-def">=</code></a> <var>n</var> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  Exactly <var>n</var> matches.

<p>
<pre class="code-example">
   (regexp-search '(: "&lt;" (= 4 (~ #\&gt;)) "&gt;") "&lt;html&gt;") =&gt; #&lt;regexp-match&gt;
   (regexp-search '(: "&lt;" (= 4 (~ #\&gt;)) "&gt;") "&lt;table&gt;") =&gt; #f
</pre>


<dt>(<a name="proc-_2a_2a"><code class="proc-def">**</code></a> <var>from</var> <var>to</var> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  The most general form, from <var>from</var> to <var>to</var> matches, inclusive.

<p>
<pre class="code-example">
   (regexp-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.168.1.10") =&gt; #&lt;regexp-match&gt;
   (regexp-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.0168.1.10") =&gt; #f
</pre>


<h3><a name="SRE_2dSyntax_Submatch-Patterns">Submatch Patterns</a></h3>

<p>
<dt>(<a name="proc-submatch"><code class="proc-def">submatch</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>
<dt>(<a name="proc-_24"><code class="proc-def">$</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  A numbered submatch.  The contents matching the pattern
  will be available in the resulting regexp-match.

<p>
<dt>(<a name="proc-submatch-named"><code class="proc-def">submatch-named</code></a> <var>name</var> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>
<dt>(<a name="proc-_3d_3e"><code class="proc-def">=&gt;</code></a> <var>name</var> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  A named submatch.  Behaves just like <code>submatch</code>, but the field may
  also be referred to by <var>name</var>.
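
<p>
  As an illustration, the following sketch extracts submatches by
  index and by name.  It assumes the accessor
  <code>regexp-match-submatch</code> described elsewhere in this SRFI
  accepts either an integer or a symbol; the names <code>year</code>
  and <code>month</code> are chosen only for the example.

<p>
<pre class="code-example">
   ;; Numbered submatches: index 0 is the whole match, 1 and 2 the groups.
   (let ((m (regexp-search '(: ($ (+ numeric)) "-" ($ (+ numeric)))
                           "tel: 555-1212")))
     (list (regexp-match-submatch m 1) (regexp-match-submatch m 2)))
   =&gt; ("555" "1212")

   ;; Named submatches: the same fields, looked up by (hypothetical) name.
   (let ((m (regexp-search '(: (=&gt; year (= 4 numeric)) "-" (=&gt; month (= 2 numeric)))
                           "released 2013-11")))
     (list (regexp-match-submatch m 'year) (regexp-match-submatch m 'month)))
   =&gt; ("2013" "11")
</pre>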

<p>
<dt>(<a name="proc-backref"><code class="proc-def">backref</code></a> <var>n-or-name</var>)
<dd class="proc-def"></dd>

<p>
  Optional: Match a previously matched submatch.  The feature
  <code>regexp-backrefs</code> will be provided if this pattern is supported.
  Matching with backreferences is expensive (the general problem can
  trivially be shown to be NP-hard), so their use is best avoided
  even in implementations that support them.
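
<p>
  A minimal sketch, meaningful only where the
  <code>regexp-backrefs</code> feature is provided: the pattern below
  looks for a doubled letter by capturing one character and then
  requiring the same text again.

<p>
<pre class="code-example">
   (regexp-search '(: ($ alpha) (backref 1)) "bookkeeper") =&gt; #&lt;regexp-match&gt;
   (regexp-search '(: ($ alpha) (backref 1)) "regular") =&gt; #f
</pre>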

<p>
<h3><a name="SRE_2dSyntax_Character-Sets">Character Sets</a></h3>

<p>
  A character set pattern matches a single character.

<p>
<dt><a name="proc-_3cchar_3e"><code class="proc-def">&lt;char&gt;</code></a> 
<dd class="proc-def"></dd>

<p>
  A singleton char set.

<p>
<pre class="code-example">
   (regexp-matches '(* #\-) "---") =&gt; #&lt;regexp-match&gt;
   (regexp-matches '(* #\-) "-_-") =&gt; #f
</pre>

<p>
<dt><a name="proc-_22_3cchar_3e_22"><code class="proc-def">"&lt;char&gt;"</code></a> 
<dd class="proc-def"></dd>

<p>
  A singleton char set written as a string of length one rather than a
  character.  Equivalent to its interpretation as a literal string
  match, but included to clarify it can be composed in
  <code>cset-sre</code>s.

<p>
<dt><a name="proc-_3cchar-set_3e"><code class="proc-def">&lt;char-set&gt;</code></a> 
<dd class="proc-def"></dd>

<p>
  A SRFI 14 character set, which matches any character in the set.
  Note that currently there is no portable written representation
  of SRFI 14 character sets, which means that this pattern is
  typically generated programmatically, such as with a quasiquoted
  expression.

<p>
<pre class="code-example">
   (regexp-partition `(+ ,char-set:vowels) "vowels")
   =&gt; ("v" "o" "w" "e" "ls")
</pre>

<blockquote style="background:lightgray"> Rationale: Many useful
  character sets are likely to be available as SRFI 14
  <code>char-set</code>s, so it is desirable to reuse them in regular
  expressions.  Since many Unicode character sets are extremely large,
  converting back and forth between an internal and external
  representation can be prohibitively expensive, so the option of
  direct embedding is necessary.  When a readable external
  representation is needed, <code>char-set-&gt;sre</code> can be used.
  </blockquote>

<p>
<dt>(<a name="proc-_3cstring_3e)"><code class="proc-def">&lt;string&gt;</code></a>)
<dd class="proc-def"></dd>

<p>
  The set of chars as formed by <code>(string-&gt;char-set <var>&lt;string&gt;</var>)</code>.

<p>
<pre class="code-example">
   (regexp-matches '(* ("aeiou")) "oui") =&gt; #&lt;regexp-match&gt;
   (regexp-matches '(* ("aeiou")) "ouais") =&gt; #f
</pre>


<dt>(<a name="proc-_2f"><code class="proc-def">/</code></a> <var>&lt;range-spec&gt;</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  Ranged char set.  The <var>&lt;range-spec&gt;</var> is a list of strings and
  characters.  These are flattened and grouped into pairs of
  characters, and all ranges formed by the pairs are included in the
  char set.

<p>
<pre class="code-example">
   (regexp-matches '(* (/ "AZ09")) "R2D2") =&gt; #&lt;regexp-match&gt;
   (regexp-matches '(* (/ "AZ09")) "C-3PO") =&gt; #f
</pre>


<dt>(<a name="proc-or"><code class="proc-def">or</code></a> <var>&lt;cset-sre&gt;</var> <var>...</var>)
<dd class="proc-def"></dd>
<dt>(<a name="proc-&amp;"><code class="proc-def">|\||</code></a> <var>&lt;cset-sre&gt;</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  Char set union.

<p>
<dt>(<a name="proc-_7e"><code class="proc-def">~</code></a> <var>&lt;cset-sre&gt;</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  Char set complement of the union of the given char sets (i.e. [^...] in PCRE).
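
<p>
  A brief sketch of char set union and complement, in the style of the
  other examples in this section:

<p>
<pre class="code-example">
   ;; Union: lower-case ASCII letters or ASCII digits.
   (regexp-matches '(* (or (/ "az") (/ "09"))) "abc123") =&gt; #&lt;regexp-match&gt;
   (regexp-matches '(* (or (/ "az") (/ "09"))) "ABC123") =&gt; #f

   ;; Complement: any character that is not one of "aeiou".
   (regexp-matches '(* (~ ("aeiou"))) "rhythm") =&gt; #&lt;regexp-match&gt;
   (regexp-matches '(* (~ ("aeiou"))) "rhyme") =&gt; #f
</pre>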

<p>
<dt>(<a name="proc--"><code class="proc-def">-</code></a> <var>&lt;cset-sre&gt;</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  Char set difference.

<p>
<pre class="code-example">
   (regexp-matches '(* (- (/ "az") ("aeiou"))) "xyzzy") =&gt; #&lt;regexp-match&gt;
   (regexp-matches '(* (- (/ "az") ("aeiou"))) "vowels") =&gt; #f
</pre>


<dt>(<a name="proc-and"><code class="proc-def">and</code></a> <var>&lt;cset-sre&gt;</var> <var>...</var>)
<dd class="proc-def"></dd>
<dt>(<a name="proc-&amp;"><code class="proc-def">&amp;</code></a> <var>&lt;cset-sre&gt;</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  Char set intersection.

<p>
<pre class="code-example">
   (regexp-matches '(* (&amp; (/ "az") (~ ("aeiou")))) "xyzzy") =&gt; #&lt;regexp-match&gt;
   (regexp-matches '(* (&amp; (/ "az") (~ ("aeiou")))) "vowels") =&gt; #f
</pre>


<h3><a name="SRE_2dSyntax_Named-Character-Sets">Named Character Sets</a></h3>

<p>
<dt><a name="proc-any"><code class="proc-def">any</code></a> 
<dd class="proc-def"></dd>

<p>
  Match any character, even Unicode characters when in an ASCII context.

<p>
<dt><a name="proc-nonl"><code class="proc-def">nonl</code></a> 
<dd class="proc-def"></dd>

<p> Match any character other than <code>#\return</code> or
  <code>#\newline</code>.

<p>
<dt><a name="proc-ascii"><code class="proc-def">ascii</code></a> 
<dd class="proc-def"></dd>

<p>
  Match any ASCII character [0..127].

<p>
<dt><a name="proc-lower-case"><code class="proc-def">lower-case</code></a> 
<dd class="proc-def"></dd>
<dt><a name="proc-lower"><code class="proc-def">lower</code></a> 
<dd class="proc-def"></dd>

<p> Matches any character for which <code>char-lower-case?</code>
  returns true.  In a Unicode context this corresponds to the
  Lowercase (Ll + Other_Lowercase) property.  In an ASCII context
  this corresponds to <code>(/ "az")</code>.

<p>
<dt><a name="proc-upper-case"><code class="proc-def">upper-case</code></a> 
<dd class="proc-def"></dd>
<dt><a name="proc-upper"><code class="proc-def">upper</code></a> 
<dd class="proc-def"></dd>

<p> Matches any character for which <code>char-upper-case?</code>
  returns true.  In a Unicode context this corresponds to the
  Uppercase (Lu + Other_Uppercase) property.  In an ASCII context
  this corresponds to <code>(/ "AZ")</code>.

<p>
<dt><a name="proc-alphabetic"><code class="proc-def">alphabetic</code></a> 
<dd class="proc-def"></dd>
<dt><a name="proc-alpha"><code class="proc-def">alpha</code></a> 
<dd class="proc-def"></dd>

<p> Matches any character for which <code>char-alphabetic?</code>
  returns true.  In a Unicode context this corresponds to the
  Alphabetic (L + Nl + Other_Alphabetic) property.  In an ASCII
  context this corresponds to <code>(w/nocase (/ "az"))</code>.

<p>
<dt><a name="proc-numeric"><code class="proc-def">numeric</code></a> 
<dd class="proc-def"></dd>
<dt><a name="proc-num"><code class="proc-def">num</code></a> 
<dd class="proc-def"></dd>

<p>
  Matches any character for which <code>char-numeric?</code> returns true.
  In a Unicode context this corresponds to the Decimal_Number (Nd)
  property.  In an ASCII context this corresponds to <code>(/ "09")</code>.

<p>
<dt><a name="proc-alphanumeric"><code class="proc-def">alphanumeric</code></a> 
<dd class="proc-def"></dd>
<dt><a name="proc-alphanum"><code class="proc-def">alphanum</code></a> 
<dd class="proc-def"></dd>
<dt><a name="proc-alnum"><code class="proc-def">alnum</code></a> 
<dd class="proc-def"></dd>

<p>
  Matches any character which is either a letter or number.
  Equivalent to:

<p>
<pre class="code-example">
   (or alphabetic numeric)
</pre>


<dt><a name="proc-punctuation"><code class="proc-def">punctuation</code></a> 
<dd class="proc-def"></dd>
<dt><a name="proc-punct"><code class="proc-def">punct</code></a> 
<dd class="proc-def"></dd>

<p>
  Matches any punctuation character.  In a Unicode context this
  corresponds to the Punctuation property (P).  In an ASCII context
  this corresponds to <code>"!\"#%&amp;'()*,-./:;?@[\\]_{}"</code>.

<p>
<dt><a name="proc-symbol"><code class="proc-def">symbol</code></a> 
<dd class="proc-def"></dd>

<p>
  Matches any symbol character.  In a Unicode context this corresponds
  to the Symbol property (Sm, Sc, Sk, or So).  In an ASCII context this
  corresponds to <code>"$+&lt;=&gt;^`|~"</code>.

<p>
<dt><a name="proc-graphic"><code class="proc-def">graphic</code></a> 
<dd class="proc-def"></dd>
<dt><a name="proc-graph"><code class="proc-def">graph</code></a> 
<dd class="proc-def"></dd>

<p>
  Matches any graphic character.  Equivalent to:

<p>
<pre class="code-example">
   (or alphanumeric punctuation symbol)
</pre>


<dt><a name="proc-whitespace"><code class="proc-def">whitespace</code></a> 
<dd class="proc-def"></dd>
<dt><a name="proc-white"><code class="proc-def">white</code></a> 
<dd class="proc-def"></dd>
<dt><a name="proc-space"><code class="proc-def">space</code></a> 
<dd class="proc-def"></dd>

<p>
  Matches any whitespace character.  In a Unicode context this
  corresponds to the Separator property (Zs, Zl or Zp).  In an ASCII
  context this corresponds to space, tab, line feed, form feed, and
  carriage return.

<p>
<dt><a name="proc-printing"><code class="proc-def">printing</code></a> 
<dd class="proc-def"></dd>
<dt><a name="proc-print"><code class="proc-def">print</code></a> 
<dd class="proc-def"></dd>

<p>
  Matches any printing character.  Equivalent to:

<p>
<pre class="code-example">
   (or graphic whitespace)
</pre>


<dt><a name="proc-control"><code class="proc-def">control</code></a> 
<dd class="proc-def"></dd>
<dt><a name="proc-cntrl"><code class="proc-def">cntrl</code></a> 
<dd class="proc-def"></dd>

<p>
  Matches any control or other character.  In a Unicode context this
  corresponds to the Other property (Cc, Cf, Co, Cs or Cn).  In an
  ASCII context this corresponds to:

<p>
<pre class="code-example">
   `(/ ,(integer-&gt;char 0) ,(integer-&gt;char 31))
</pre>


<dt><a name="proc-hex-digit"><code class="proc-def">hex-digit</code></a> 
<dd class="proc-def"></dd>
<dt><a name="proc-xdigit"><code class="proc-def">xdigit</code></a> 
<dd class="proc-def"></dd>

<p>
  Matches any valid digit in hexadecimal notation.  Always ASCII-only.
  Equivalent to:

<p>
<pre class="code-example">
   (w/ascii (w/nocase (or numeric "abcdef")))
</pre>
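
<p>
  The named character sets can be used anywhere a
  <code>cset-sre</code> is accepted.  A short sketch using a few of
  the names defined above:

<p>
<pre class="code-example">
   (regexp-matches '(+ hex-digit) "DeadBeef") =&gt; #&lt;regexp-match&gt;
   (regexp-matches '(+ hex-digit) "0xFF") =&gt; #f
   (regexp-matches '(: (+ alpha) (+ space) (+ num)) "SRFI 115") =&gt; #&lt;regexp-match&gt;
   (regexp-matches '(+ alnum) "SRFI-115") =&gt; #f
</pre>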


<h3><a name="SRE_2dSyntax_Boundary-Assertions">Boundary Assertions</a></h3>

<p>
<dt><a name="proc-bos"><code class="proc-def">bos</code></a> 
<dd class="proc-def"></dd>
<dt><a name="proc-eos"><code class="proc-def">eos</code></a> 
<dd class="proc-def"></dd>

<p>
  Matches at the beginning/end of string without consuming any
  characters (a zero-width assertion).  If the search was initiated
  with start/end parameters, these are considered the end points,
  rather than the full string.
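
<p>
  A small sketch of string anchoring.  The last line assumes the
  optional <var>start</var> argument to <code>regexp-search</code>, so
  that the search begins at index 3 and <code>bos</code> is taken
  relative to that position.

<p>
<pre class="code-example">
   (regexp-search '(: bos "hay") "hayneedlehay") =&gt; #&lt;regexp-match&gt;
   (regexp-search '(: bos "needle") "hayneedlehay") =&gt; #f
   (regexp-search '(: "hay" eos) "hayneedlehay") =&gt; #&lt;regexp-match&gt;
   (regexp-search '(: bos "needle") "hayneedlehay" 3) =&gt; #&lt;regexp-match&gt;
</pre>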

<p>
<dt><a name="proc-bol"><code class="proc-def">bol</code></a> 
<dd class="proc-def"></dd>
<dt><a name="proc-eol"><code class="proc-def">eol</code></a> 
<dd class="proc-def"></dd>

<p>
  Matches at the beginning/end of a line without consuming any
  characters (a zero-width assertion).  A line is a possibly empty
  sequence of characters followed by an end of line sequence as
  understood by the R7RS <code>read-line</code> procedure,
  specifically any of a linefeed character, carriage return character,
  or a carriage return followed by a linefeed character.
  The string is assumed to contain end of line sequences before the
  start and after the end of the string, even if the search was made
  on a substring and the actual surrounding characters differ.
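
<p>
  For instance, with a three-line string the line anchors behave as
  sketched below; the first line matches because the start of the
  string counts as the start of a line.

<p>
<pre class="code-example">
   (regexp-search '(: bol "one" eol) "one\ntwo\nthree") =&gt; #&lt;regexp-match&gt;
   (regexp-search '(: bol "two" eol) "one\ntwo\nthree") =&gt; #&lt;regexp-match&gt;
   (regexp-search '(: bol "wo" eol) "one\ntwo\nthree") =&gt; #f
</pre>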

<p>
<dt><a name="proc-bow"><code class="proc-def">bow</code></a> 
<dd class="proc-def"></dd>
<dt><a name="proc-eow"><code class="proc-def">eow</code></a> 
<dd class="proc-def"></dd>

<p>
  Matches at the beginning/end of a word without consuming any
  characters (a zero-width assertion).  In a Unicode context follows
  the default word boundary specification from TR29.  In an ASCII
  context a word is a sequence of one or more characters that are
  either alphanumeric or the underscore character.  The string is
  assumed to contain non-word characters immediately before the start
  and after the end, even if the search was made on a substring and
  word constituent characters appear immediately before the beginning
  or after the end.
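
<p>
  A brief sketch of the word anchors on plain ASCII text:

<p>
<pre class="code-example">
   (regexp-search '(: bow "cat" eow) "the cat sat") =&gt; #&lt;regexp-match&gt;
   (regexp-search '(: bow "cat" eow) "concatenate") =&gt; #f
   (regexp-search '(: bow "cat") "catalog") =&gt; #&lt;regexp-match&gt;
</pre>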

<p>
<dt><a name="proc-nwb"><code class="proc-def">nwb</code></a> 
<dd class="proc-def"></dd>

<p>
  Matches a non-word-boundary (i.e. \B in PCRE).  Equivalent to
  <code>(neg-look-ahead (or bow eow))</code>.

<p>
<dt>(<a name="proc-word"><code class="proc-def">word</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  Anchor a sequence to word boundaries.  Equivalent to:

<p>
<pre class="code-example">
   (: bow <var>sre</var> ... eow)
</pre>


<dt>(<a name="proc-word_2b"><code class="proc-def">word+</code></a> <var>cset-sre</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  Matches a single word composed of characters in the intersection of
  the given <var>cset-sre</var> and the word constituent characters.
  Equivalent to:

<p>
<pre class="code-example">
   (word (+ (and (or alphanumeric "_") (or <var>cset-sre</var> ...))))
</pre>


<dt><a name="proc-word"><code class="proc-def">word</code></a> 
<dd class="proc-def"></dd>

<p>
  A shorthand for (word+ any).
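
<p>
  As a sketch, <code>word</code> and <code>word+</code> combine
  naturally with <code>regexp-extract</code>, assuming it returns the
  list of matching substrings:

<p>
<pre class="code-example">
   (regexp-extract 'word "hello, world") =&gt; ("hello" "world")
   (regexp-extract '(word+ numeric) "room 101, floor 3") =&gt; ("101" "3")
</pre>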

<p>
<dt><a name="proc-bog"><code class="proc-def">bog</code></a> 
<dd class="proc-def"></dd>
<dt><a name="proc-eog"><code class="proc-def">eog</code></a> 
<dd class="proc-def"></dd>

<p> Matches at the beginning/end of a single extended grapheme cluster
  without consuming any characters (a zero-width assertion).  Grapheme
  cluster boundaries are defined in Unicode <a
  href="#ref-UAX29">TR29</a>.  The string is assumed to contain
  non-combining codepoints immediately before the start and after the
  end.  These always succeed in an ASCII context.

<p>
<dt><a name="proc-grapheme"><code class="proc-def">grapheme</code></a> 
<dd class="proc-def"></dd>

<p>
  Matches a single grapheme cluster (i.e. \X in PCRE).  This is what
  the end-user typically thinks of as a single character, consisting of
  a base non-combining codepoint followed by zero or more combining
  marks.  In an ASCII context this is equivalent to <code>any</code>.

<p> Assuming <code>char-set:mark</code> contains all characters with
  the Extend or SpacingMark properties defined in TR29, and
  <code>char-set:control</code>,
  <code>char-set:regional-indicator</code> and
  <code>char-set:hangul-*</code> are defined similarly, then the
  following SRE can be used with <code>regexp-extract</code> to
  extract all graphemes from a string:

<p>
<pre class="code-example">
   `(or (: (* ,char-set:hangul-l)
           (or ,char-set:hangul-lvt
               (: (? ,char-set:hangul-lv) (* ,char-set:hangul-v)))
           (* ,char-set:hangul-t))
        (+ ,char-set:regional-indicator)
        (: "\r\n")
        (: (~ control ("\r\n"))
           (* ,char-set:mark))
        control)
</pre>

<p> To correctly match just an individual grapheme, however, the
  hangul matching needs to be expanded into alternate cases such that
  at least one codepoint is consumed.

<h3><a name="SRE_2dSyntax_Non-Greedy-Patterns">Non-Greedy Patterns</a></h3>

<p>
  The following patterns are only supported if the feature
  <code>regexp-non-greedy</code> is provided.

<p>
<dt>(<a name="proc-_3f_3f"><code class="proc-def">??</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  Non-greedy pattern, matches 0 or 1 times, preferring the shorter
  match.

<p>
<dt>(<a name="proc-_2a_3f"><code class="proc-def">*?</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  Non-greedy Kleene star, matches 0 or more times, preferring the
  shorter match.

<p>
<dt>(<a name="proc-_2a_2a_3f"><code class="proc-def">**?</code></a> <var>m</var> <var>n</var> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  Non-greedy pattern, matches <var>m</var> to <var>n</var> times, preferring the
  shorter match.
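
<p>
  The contrast with the greedy forms can be sketched as follows,
  assuming the <code>regexp-non-greedy</code> feature is provided and
  that <code>regexp-match-submatch</code> with index 0 returns the
  whole matched text.  The results shown are those expected under the
  preference for the shorter match described above.

<p>
<pre class="code-example">
   ;; Greedy: (* any) consumes as much as possible before the final "&gt;".
   (regexp-match-submatch
    (regexp-search '(: "&lt;" (* any) "&gt;") "&lt;a&gt;&lt;b&gt;") 0)
   =&gt; "&lt;a&gt;&lt;b&gt;"

   ;; Non-greedy: (*? any) stops at the first "&gt;" that completes a match.
   (regexp-match-submatch
    (regexp-search '(: "&lt;" (*? any) "&gt;") "&lt;a&gt;&lt;b&gt;") 0)
   =&gt; "&lt;a&gt;"
</pre>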

<h3><a name="SRE_2dSyntax_Look-Around-Patterns">Look Around Patterns</a></h3>

<p>
  The following patterns are only supported if the feature
  <code>regexp-look-around</code> is provided.

<p>
<dt>(<a name="proc-look-ahead"><code class="proc-def">look-ahead</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  Zero-width look-ahead assertion.  Assert the sequence matches from
  the current position, without advancing the position.

<p>
<dt>(<a name="proc-look-behind"><code class="proc-def">look-behind</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  Zero-width look-behind assertion.  Assert the sequence matches
  behind the current position, without advancing the position.  It is
  an error if the sequence does not have a fixed length.

<p>
<dt>(<a name="proc-neg-look-ahead"><code class="proc-def">neg-look-ahead</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  Zero-width negative look-ahead assertion.

<p>
<dt>(<a name="proc-neg-look-behind"><code class="proc-def">neg-look-behind</code></a> <var>sre</var> <var>...</var>)
<dd class="proc-def"></dd>

<p>
  Zero-width negative look-behind assertion.
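
<p>
  A short sketch of the four assertions, meaningful only where
  <code>regexp-look-around</code> is provided.  In the first line the
  match covers only "foo", since the look-ahead consumes nothing.

<p>
<pre class="code-example">
   (regexp-search '(: "foo" (look-ahead "bar")) "foobar") =&gt; #&lt;regexp-match&gt;
   (regexp-search '(: "foo" (neg-look-ahead "bar")) "foobar") =&gt; #f
   (regexp-search '(: (look-behind "foo") "bar") "foobar") =&gt; #&lt;regexp-match&gt;
   (regexp-search '(: (neg-look-behind "foo") "bar") "foobar") =&gt; #f
</pre>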

<p>

<p>
<h1><a name="Implementation">Implementation</a></h1>

<p>
  A reference implementation in portable R7RS is available at
<pre>
    <a href="https://code.google.com/p/chibi-scheme/source/browse/lib/chibi/regexp.sld">https://code.google.com/p/chibi-scheme/source/browse/lib/chibi/regexp.sld</a>
    <a href="https://code.google.com/p/chibi-scheme/source/browse/lib/chibi/regexp.scm">https://code.google.com/p/chibi-scheme/source/browse/lib/chibi/regexp.scm</a>
</pre>
  depending only on
  <a href="http://srfi.schemers.org/srfi-14/srfi-14.html">SRFI 14</a>,
  <a href="http://srfi.schemers.org/srfi-33/srfi-33.html">SRFI 33</a> and
  <a href="http://srfi.schemers.org/srfi-69/srfi-69.html">SRFI 69</a>.
  This is implemented as a Thompson-style non-backtracking NFA, a
  discussion of which can be found at Russ Cox's
  <a href="#ref-ImplementingRegexps">Implementing Regexps</a>.  Note the
  reference implementation may not be up to date with the latest draft
  prior to finalization.

<p>

<p>
<h1><a name="References">References</a></h1>

<p>

<p>
<dl>
<dt class="biblio"><a name="ref-R7RS"><strong>R7RS</strong></a>
<dd>
<pre class="biblio">
      Alex Shinn, John Cowan, Arthur Gleckler, Revised<sup>7</sup> Report on the Algorithmic Language Scheme
      <a href="http://trac.sacrideo.us/wg/raw-attachment/wiki/WikiStart/r7rs.pdf">http://trac.sacrideo.us/wg/raw-attachment/wiki/WikiStart/r7rs.pdf</a>
</pre>

<dt class="biblio"><a name="ref-SCSH"><strong>SCSH</strong></a>
<dd>
<pre class="biblio">
      Olin Shivers, A Scheme Shell
      Massachusetts Institute of Technology Cambridge, MA, USA, 1994
      <a href="http://www.scsh.net/docu/scsh-paper/scsh-paper.html">http://www.scsh.net/docu/scsh-paper/scsh-paper.html</a>
</pre>


<dt class="biblio"><a name="ref-IrRegex"><strong>IrRegex</strong></a>
<dd>
<pre class="biblio">
      Alex Shinn, IrRegex - IrRegular Expressions
      <a href="http://synthcode.com/scheme/irregex/">http://synthcode.com/scheme/irregex/</a>
</pre>


<dt class="biblio"><a name="ref-TR18"><strong>TR18</strong></a>
<dd>
<pre class="biblio">
      Mark Davis, Andy Heninger, UTR #18: Unicode Regular Expressions
      <a href="http://www.unicode.org/reports/tr18/">http://www.unicode.org/reports/tr18/</a>
</pre>


<dt class="biblio"><a name="ref-UAX29"><strong>UAX29</strong></a>
<dd>
<pre class="biblio">
      Mark Davis, UAX #29: Unicode Text Segmentation
      <a href="http://www.unicode.org/reports/tr29/">http://www.unicode.org/reports/tr29/</a>
</pre>

<dt class="biblio"><a name="ref-SRFI-0"><strong>SRFI 0</strong></a>
<dd>
<pre class="biblio">
      Marc Feeley, Feature-based conditional expansion construct
      <a href="http://srfi.schemers.org/srfi-0/srfi-0.html">http://srfi.schemers.org/srfi-0/srfi-0.html</a>
</pre>

<dt class="biblio"><a name="ref-SRFI-14"><strong>SRFI 14</strong></a>
<dd>
<pre class="biblio">
      Olin Shivers, Character-set Library
      <a href="http://srfi.schemers.org/srfi-14/srfi-14.html">http://srfi.schemers.org/srfi-14/srfi-14.html</a>
</pre>


<dt class="biblio"><a name="ref-ImplementingRegexps"><strong>ImplementingRegexps</strong></a>
<dd>
<pre class="biblio">
      Russ Cox, Implementing Regular Expressions
      <a href="http://swtch.com/~rsc/regexp/">http://swtch.com/~rsc/regexp/</a>
</pre>


<dt class="biblio"><a name="ref-Tcl"><strong>Tcl</strong></a>
<dd>
<pre class="biblio">
      Russ Cox, Henry Spencer's Tcl Regex Library
      <a href="http://compilers.iecc.com/comparch/article/07-10-026">http://compilers.iecc.com/comparch/article/07-10-026</a>
</pre>


<dt class="biblio"><a name="ref-Gauche"><strong>Gauche</strong></a>
<dd>
<pre class="biblio">
      Shiro Kawai, Gauche Scheme - Regular Expressions
      <a href="http://practical-scheme.net/gauche/man/?p=Regular+expressions">http://practical-scheme.net/gauche/man/?p=Regular+expressions</a>
</pre>


<dt class="biblio"><a name="ref-Perl6"><strong>Perl6</strong></a>
<dd>
<pre class="biblio">
      Damian Conway, Perl6 Exegesis 5 - Regular Expressions
      <a href="http://www.perl.com/pub/a/2002/08/22/exegesis5.html">http://www.perl.com/pub/a/2002/08/22/exegesis5.html</a>
</pre>


<dt class="biblio"><a name="ref-PCRE"><strong>PCRE</strong></a>
<dd>
<pre class="biblio">
      Philip Hazel, PCRE - Perl Compatible Regular Expressions
      <a href="http://www.pcre.org/">http://www.pcre.org/</a>
</pre>
</dl>

<h1>Copyright</h1>

<p>Copyright (C) Alex Shinn 2013. All Rights Reserved.

<p>Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:</p>

<p>The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.</p>

<p>THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.</p>

    <hr />
<address>Editor: <a href="mailto:srfi-editors at srfi dot schemers dot org">
             Mike Sperber</a></address>

</body>
</html>