Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/massung/parsec
Parsec-style parsing for Erlang
https://github.com/massung/parsec
Last synced: 3 months ago
JSON representation
Parsec-style parsing for Erlang
- Host: GitHub
- URL: https://github.com/massung/parsec
- Owner: massung
- Created: 2011-08-31T03:08:05.000Z (over 13 years ago)
- Default Branch: master
- Last Pushed: 2011-09-03T04:06:53.000Z (over 13 years ago)
- Last Synced: 2024-07-31T21:54:42.859Z (6 months ago)
- Language: Erlang
- Homepage:
- Size: 174 KB
- Stars: 32
- Watchers: 3
- Forks: 10
- Open Issues: 2
-
Metadata Files:
- Readme: README.org
Awesome Lists containing this project
- awesome-combinator-parsers - parsec
README
* Summary
This is a Parsec-style parsing module and token lexer for Erlang. It's not quite as powerful as Haskell's Parsec library (see http://legacy.cs.uu.nl/daan/parsec.html), but it's pretty good and easy to use.
It is extremely helpful to have a solid understanding of Monads. If you don't, that's fine, and you will still be able to use this module, but you may have a slightly more difficult time creating some of the more complex combinators that are possible.
* Usage Basics (Quickstart)
At its core, you can think of the parser as a state that's constantly being updated as the next function in the parser is called. Along with the updated state is the previous value that was just parsed. You parse a string by binding functions (parser combinators) together. When the final combinator finishes, the final result is returned.
** A simple example
Let's say we wanted to parse a simple addition expression like "1+2". There are several ways to go about doing this. The simplest would be to use the parsec:do/1 function in order to just verify that what we expect to be there is there:
: >> parsec:parse("1+2",parsec:do([parsec:char($1),parsec:char($+),parsec:char($2)])).
: {ok,50,[]}Examining the return value, we see it was successul, the result of the parsec:do/1 parser combinator was 50 (the character $2), and an empty list - or empty string - meaning that the entire input source was parsed.
** Extending it a little
We parsed the string, but didn't really get to do anything with it. In order to start actually accomplishing something, let's make a parser combinator that parses an integer:
: >> Digits = parsec:many1(parsec:one_of("0123456789")).
: >> Int = parsec:bind(Digits,fun (X) -> parsec:return(list_to_integer(X)) end).Taking a closer look at what's defined...
Digits is a combinator that expects to parse one or more of another combinator, which is one of a list of characters (digits). When completed, it will return a list of all the combinator results parsed. Let's test it...
: >> parsec:parse("384",Digits).
: {ok,"384",[]}Int is a bit more complex. First, it's going to bind the results of one combinator and hand it off to a function. That function will do something with the results (or ignore them) and return a new parser combinator to continue parsing with. In this case, it takes the results of whatever was parsed by Digits (a string of numbers) and passes it through list_to_integer/1, and uses the parsec:return/1 combinator to simply return the answer. Let's try it...
: >> parsec:parse("1038340", Int).
: {ok,103840,[]}Now that we have a combinator that can parse an integer and return it, let's use it to improve our initial math parser:
: >> parsec:parse("1+2",parsec:do([Int,parsec:char($+),Int])).
: {ok,2,[]}We essentially got the same value returned, only this time converted to an integer. More importantly, we no longer require $1 or $2, we can parse any numbers in our equation.
** Computing a result
Just like we bound a combinator to parse digits to a function that could return those digits converted to an integer, we can also chain many values together. So let's create a combinator that does just that.
: >> Add=parsec:bind(Int,fun (A) ->
: parsec:bind(parsec:do([parsec:char($+),Int]),
: fun (B) ->
: parsec:return(A+B)
: end)
: end).Looking closer, we first parse an Int and store it to a A by binding it to another function. We then bind a parsec:do/1 combinator (which returns the last item parsed) to B, finally returning A+B. Let's test it:
: >> parsec:parse("10+20",Add).
: {ok,30,[]}* One-liners and Common Parse Combinators
Here are some common parser combinators that you can use.
** The fail combinator
When a parse combinator fails, the return value is always the atom 'pzero'. The fail combinator is useful for saying "not possible." For example, when you learn about the lexer module below, if you wanted to say that a language definition didn't support single-line comments, you would set the comment_line combinator to pzero/0.
: >> parsec:parse("test",parsec:pzero()).
: pzero** Parsing many and skipping many
The combinators many/1 and many1/1 allow you to parse another combinator zero or more or one or more times, returning a list of what was parsed. The skip/1 combinator will parse another combinator zero or more times, and throw away the result, returning 'undefined' instead.
: >> parsec:parse("aaabbb",parsec:many1(parsec:char($a))).
: {ok,"aaa","bbb"}: >> parsec:parse("aaabbb",parsec:skip(parsec:char($a))).
: {ok,undefined,"bbb"}** Capturing whatever text was parsed
If you want to just capture whatever string was parsed, without having to create a long bind-chain of anonymous functions together, then you can use the capture/1 combinator. It simply marks the start of the parse state when it began and returns a list of all the characters parsed by the combinator passed to it.
: >> parsec:parse("test",parsec:capture(parsec:one_of("stuv"))).
: {ok,"t","est"}** Checking for the end of the stream
If you want to guarantee that all the characters of the source string were parsed, then use the eof/0 combinator. It will fail with 'pzero' if there are still characters left to be parsed.
: >> parsec:parse("",parsec:eof()).
: {ok,eof,[]}** Matching any character
Just like the function name suggested, this combinator will only fail if it's at the end of the source string.
: >> parsec:parse("2",parsec:any_char()).
: {ok,50,[]}** Attempt to parse with a default result
The option/2 combinator is simply a shortcut for creating a return/1 combinator with a known value if another combinator fails.
: >> parsec:parse("aaaa",parsec:option(not_found,parsec:char($b))).
: {ok,not_found,"aaaa"}** Maybe parsing
Sometimes you want to parse something if it's there, but not fail if it isn't. The maybe/1 combinator does this. What's important to remember is that regardless of whether or not the combinator works, the end result is 'undefined'. So use it only when you are just skipping characters or in conjuction with the capture/1 combinator.
: >> parsec:parse("hello",parsec:maybe(parsec:char($\n))).
: {ok,undefined,"hello"}** Parsing combinators separated by a combinator
Useful for parsing CSV lines, function arguments, and many other cases where a list of things you want extracted are separated by something you don't care about.
: >> parsec:parse("a,a,a",parsec:sep_by1(parsec:char($a),parsec:char($,))).
: {ok,"aaa",[]}** Parsing until a combinator is found
This let's you keep parsing a repeat combinator until a terminating combinator occurs. A way to think about this might be comments: once"" is parsed.
: >> parsec:parse("abc.",parsec:many_till(parsec:any_char(),parsec:char($.))).
: {ok,"abc",[]}** Parsing one of a list of combinators
If you have 2 or more possible combinator paths you'd like to go down, using the choice/1 combinator will attempt combinators - in order - until one succeeds. Once a combinator is successful it will cease attempting to parse any that are left. If all the choices fail, then 'pzero' is returned.
: >> parsec:parse("b",parsec:choice([parsec:char($a),parsec:char($b)])).
: {ok,98,[]}** And more...
There are many other parser combinators included with the parsec module that weren't covered here. I recommend taking a look at parsec.erl and looking at the exported functions. Each is commented to describe how it can be used. There are also several combinators defined as macros in include/parsec.hrl. These are combinators for parsing letters, digits, whitespace, newlines, etc.
* The Lexer Module
While the parsing module is quite effective on its own, the lexer allows you to easily define programming language definitions and then tokenize entire strings quickly. The lexer module takes the hassle out of having to code (sometimes painful) parser combinators for things like comments, strings, floating-point values, identifiers, etc.
** Defining a language lexer
In include/lexer.hrl is the record definition for a lexer. This is where you define the basics of your language: comment styles, identifiers, reserved identifiers, operators, etc. An example language definition for a simple Lisp parser can be found in src/lisp_parser.erl.
** Lexemes
At the heart of the lexer module are two functions: lexer:whitespace/1 and lexer:lexeme/2. The whitespace function skips over all whitespace (and newlines) as well as comments as defined by your lexer. The lexeme function is a slight more complex. It parses a combinator, then skips over any whitespace/comments, and returns what was parsed by the combinator.
It's important to remember that the lexer:lexeme/2 function is available to you for use. If you don't like the way the lexer module parses (for example) hexadecimal values, it's very easy for you to write your own combinator to do so and pass it through the lexeme combinator in order to gain all the benefits of the lexer module.
** Using a lexer to tokenize a string
The best example is that given in the src/lisp_parser.erl example module. It's recommend you load this module at least once in the REPL and play with the lisp_parser:parse/1 function a little:
: >> lisp_parser:parse("10 ; this is a comment").
: {ok,{num,10},[]}: >> lisp_parser:parse("(a #| b |# c)").
: {ok,{list,[{id,a},{id,c}]},[]}* Good Luck!
Hopefully this was a decent introduction to using the parsec module for Erlang. Feel free to email me if you are having problems or have some suggestions.