This is a port of the Haskell package megaparsec to Lean 4.
Ask your questions to the authors on Lurk Lab Zulip, and send PRs!
Let's write a parser that bites a digit off a string, but only if it's padded to the right with a space or a pipe symbol at the end of line or input:
def s := string_simple_pure
def c := char_simple_pure
def laLexer : Parsec Char String Unit String :=
s.lookAhead ((s.stringP "|" <* (void c.eol <|> c.eof)) <|> s.stringP " ")
def numP : Parsec Char String Unit Nat :=
(c.oneOf "0123456789".data) >>= fun x => pure $ x.val.toNat - '0'.val.toNat
def takeOneDigitP : Parsec Char String Unit (List Nat) :=
many' (numP <* laLexer)
Parsec
has several type arguments: β ℘ E γ : Type u
.
β
is the atomic type. In our example, it'sChar
. That's the type of tokens parsed out of source℘
.℘
is the source type. In our example, it'sString
. Tokens can be read out of the source viaStraume
facilities.Straume
allows for transparently reading from files. If you want to read from a file, this type would be(String × IO.FS.Handle)
. It would mean that finite chunks of typeString
are emitted from a, perhaps-larger-than-RAM file handled with somethign of typeIO.FS.Handle
. SeeTests/IO.lean
for a usage example! It's really simple!E
is the custom error type. For quick and dirty parsers, it'sUnit
, but if you want to get fancy, you can create custom errors and construct those based on your parser's fails.γ
is the type of the thing we're parsing out of the source. In this case, we're parsing aString
(out of aString
).
You may wonder where did the type α
go? In Straume
, we use letter alpha to denote the composite type. Composite types are finite counterparts of streams. Every stream ℘
has a composite buffer, consisting of atomic tokens.
Now let's see how our parser works.
To check for the padding we used lookAhead
primitive parser because it parses without consuming.
To check for the |
terminator, we used "alternative parsing" between c.eof
, which parses only the literal end of stream and c.eol
.
Due to (<|>)
signature (or an imperfect Alternative
implementation?!), we can't simply write c.eol <|> c.eof
.
It's because c.eof
returns Unit
, whereas c.eol
returns either "\n"
or "\r\n"
.
To work around it we used a Common
combinator void
, which voids anything, that's at least a Functor
.
Note that you could have used the same trick, but via Seq
to substitute the return type of any parser too.
For example, you could have just as well written (c.eol <|> c.eof *> pure "FIN")
.
We wouldn't care, since the parsed value gets discarded anyway due to <*
(left sparrow) operator between stringP
and the sub-parser in question.
There are also some examples in the Exambles folder.
To simply manually test the parser you wrote, you can use runParseTestP
:
def testMyP : IO Unit := do
-- Succeeds
IO.println "Successful parsers."
let (true1, three) ← parseTestP myP "3 2 1"
let (true2, five) ← parseTestP myP "5|"
-- Fails
IO.println "Failing parsers."
let (false1, _notFive) ← parseTestP myP "5"
let (false2, _notFiftyFive) ← parseTestP myP "55|"
pure $ Unit.unit
It will also print some errors and results on the screen.
Analogous function to simply run the parser and get the result of a parse is parseP
.
Just give it a parser, the stream (file) name from which you're parsing and a source which can either be a pure String
or a String
buffer (most likely, you want it to be empty) plus a file handle.
A concrete example of parsing straight from a file is:
let sIO := string_parsecT IO (String × IO.FS.Handle)
let cIO := char_parsecT IO (String × IO.FS.Handle)
let file := System.mkFilePath ["./Tests", "abcd.txt"]
let h ← IO.FS.Handle.mk file IO.FS.Mode.read false
let buffed := ("", h)
let res ← parseTP
(sIO.stringP "ab" *> sIO.stringP "cd" <* cIO.eol <* cIO.eof)
"abcd.txt"
buffed
The parseTP
function shall return either ParseErrorBundle
or the parse result, which in this case is String
.
Also, file-based parsing allows you to parse from files that are larger-than-RAM thanks to Straume library. (We didn't test this extensively yet!)
Check out stuff that MonadParsec
exports, look around char_simple
and string_simple
APIs in Megaparsec/Char.lean
and Megaparsec/String.lean
respectively, and patch the example code so that it would tolerate, but not require padding with spaces and the "end-of-line bar" while parsing a digit out of the input.
Send your snippets to @cognivore in Lurk Lab Zulip!
The big idea of Megaparsec goes like this:
MonadParsec
is a typeclass.
If a type is MonadParsec
, it means that there are:
- Basic parsers defined on it.
- Basic parser combinators defined on it.
- Ways to retrieve and update parser state.
- Track errors with ParseError.
- Thread the values through a particular monad.
ParsecT
is a structure designed to be a MonadParsec
and a trasformer that adds its capabilities to the underlying monad.
ParsecT
, and any other thing that implements MonadParsec
should be Monad
and Alternative
.
We're not prioritising applicative parsers at the moment, but you can use Seq
to swap types in our parsers as well as use the usual between
combinator.
For example, suppose we have a Char
parser newline := char '\n'
and we have a String
parser crlf := s.stringP "\r\n"
.
To combine both without losing information, we need to transform the first parser into a String
parser.
Instead of writing a new parser, we write (newline *> pure "\n") <|> crlf
.
So far, this is the main use of Seq
in this library.
There is a generic polymorphic machinery, MonadParsec
.
There is also a concrete default implementation of monad parsec, which is a monad transformer called ParsecT
.
As the user of Megaparsec, you'll likely be using ParsecT
or even just Parsec
, which is ParsecT
over Id
.
TODO: Matej et al. made a significant UX improvement facilitating @[defaultIntance]
, and some day we'll describe it here with code samples, but since I don't understand it completely, I can't quite write about it.
The original megaparsec
repository is included as a reference in reference/megaparsec
.
To download it, include --recursive
when cloning this repository.
If you have this repository cloned already, but need to fetch the reference implementation, please run git submodule update --init --recursive
I tried both nix and GHCup, both seem to be currently broken, so I disabled HLS VSCode extension for the time being.
You can't port Haskell code to Lean one to one!