In the third installment of combinator parsing, we will write the actual parser and use it to parse our original problem from part 1. Parsers have a specific type, namely
type ('a, 'b) parser = 'a -> 'b * 'a
so a parser takes a stream of type 'a, breaks off an initial part of it, and returns a value of type 'b together with the rest of the stream. Our parser will have type (token list, exp) parser, as it will process a token list and produce an exp. We begin by defining two exceptions
exception Punt
exception ParserError
These serve different purposes. The Punt exception is raised whenever the parser wants to backtrack and attempt another parse. ParserError is raised when the parser definitively fails to parse the input stream.
Wrapping up the parser system is a driver function we use to run a parse. It checks that no excess input remains when the parse finishes, and it coerces top-level punts into general parser errors.
fun parse_list parser stream =
(case parser stream of
(result, []) => result
| _ => raise ParserError) handle Punt => raise ParserError
Let us now build a parser for symbols:
fun symbol x ph =
case ph of
(Symb y :: rest) => if x = y then (x, rest) else raise Punt
| _ => raise Punt
Given a symbol, x, and a phrase, ph, to attempt to parse, we test whether the head of the stream is the desired symbol. If it is, we return the pair of the symbol and the rest of the phrase. If not, we punt; the parse failed here, so we backtrack to try something else.
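As a quick sanity check, assuming the token datatype from the earlier parts (with constructors Symb, Number and Id) and the definitions above are in scope, we can run symbol directly on a small token list:

val (s, rest) = symbol #"(" [Symb #"(", Number 3, Symb #")"]
(* s = #"(" and rest = [Number 3, Symb #")"] *)
(* symbol #"(" [Number 3] would raise Punt instead *)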
It is easy to build more simple parsers like the one above. Here are numbers and identifiers:
fun number ph =
case ph of
(Number n :: rest) => (n, rest)
| _ => raise Punt
fun identifier ph =
case ph of
(Id id :: rest) => (id, rest)
| _ => raise Punt
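These behave just as you would expect; for instance (again assuming the token datatype from the earlier parts):

val (n, _) = number [Number 42, Symb #"+"]      (* n = 42 *)
val (v, _) = identifier [Id "x", Symb #"*"]     (* v = "x" *)
(* number [Id "x"] would raise Punt *)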
But what makes this kind of parser special are the parser combinators. A parser combinator is a function which takes several parsers (of the ('a, 'b) parser type) and combines them into new parsers. We declare a couple of new operators and fill in their definitions later
infix 0 |||
infix 3 ---
infix 5 >>>
Normally, I am not that keen on declaring new operators, but when there is a certain sense of closure among the operators and they work on a specific domain, code can be easier to read with them. As an aside, it does not always help to have operators infix. For instance, 3 + 4 + 7 + 13 in ML can be written more succinctly in Lisp systems as (+ 3 4 7 13). Nor do I find the operator noise of most Haskell programs very appealing. So one must take great care before blindly declaring another operator: it should be warranted and justified. One justification is when you have a set of operators with closure, operating on the same domain. Binary arithmetic modulo N, for instance. Or combinators for parsers.
Back to our combinators. The first one, |||, is the 'or' combinator. ph1 ||| ph2 will first try to parse with ph1. If ph1 succeeds, its result is the result of the whole parse. If it fails, ph2 is tried instead. Observant readers will recognize the short-circuiting nature. The code uses backtracking, via the Punt exception, to achieve this informal semantics:
fun ph1 ||| ph2 =
(fn stream =>
ph1 stream handle Punt => ph2 stream)
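For example, a parser for an additive operator can be put together from two symbol parsers (add_op is a throwaway name of mine, not from the article):

val add_op = symbol #"+" ||| symbol #"-"
val (plus, _)  = add_op [Symb #"+", Number 1]   (* plus = #"+" *)
val (minus, _) = add_op [Symb #"-", Number 1]   (* minus = #"-" *)
(* add_op [Symb #"*", Number 1] raises Punt: both branches fail *)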
Similarly, there is ---, which is the equivalent of a logical 'and'. The parser ph1 --- ph2 first parses with ph1 and then runs ph2 on the stream remaining after ph1. It returns the pair of the two results. Perhaps the code explains this better than I do:
fun ph1 --- ph2 =
(fn stream =>
let
val (a, stream') = ph1 stream
val (b, stream'') = ph2 stream'
in
((a, b), stream'')
end)
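A small illustration: parsing a number followed by a plus symbol yields the two results as a pair, with the stream advanced past both tokens:

val ((n, s), rest) = (number --- symbol #"+") [Number 3, Symb #"+", Id "x"]
(* n = 3, s = #"+", rest = [Id "x"] *)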
Next up is >>>, the application operator. If one writes ph >>> f, then ph is used to parse the stream into (x, rest), but the final result will be (f x, rest). The code is straightforward:
fun ph >>> f = (fn stream =>
let val (x, rest) = ph stream
in (f x, rest) end)
The >>> operator lets us shove a translating post-processor into the parser at any point, which will come in handy.
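For instance, we can bolt a post-processor onto number to negate whatever it parses (a toy example of my own):

val neg_number = number >>> ~
val (m, _) = neg_number [Number 5, Symb #"+"]   (* m = ~5 *)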
With these in place, it is easy to define derived parsers. Here is one that lets you surround any parse with parentheses:
fun parens ph = ((symbol #"(") --- ph --- (symbol #")"))
(* Cut the relevant part *)
>>> (fn ((_, x), _) => x)
Or even better, generalize that one:
fun surround left right ph = ((symbol left) --- ph --- (symbol right))
(* Cut the relevant part *)
>>> (fn ((_, x), _) => x)
val parens = surround #"(" #")"
One might look at the function that cuts out the relevant part and
deem that it is an ugly function. It is, indeed. But since we are
writing a combinator-parser, we can do a lot better than that. We can
simply define functions to get rid of it:
infix 3 --$
infix 3 $--
fun ph1 --$ ph2 = (ph1 --- ph2) >>> (fn (x, _) => x)
fun ph1 $-- ph2 = (ph1 --- ph2) >>> (fn (_, y) => y)
fun surround left right ph = (symbol left) $-- ph --$ (symbol right)
fun parens token = surround #"(" #")" token
The functions $-- and --$ can now be used to discard parts of the parse result without pattern deconstructions all over the place; the deconstruction has been confined to $-- and --$ once and for all.
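The generalized surround works for any symbol pair the lexer can produce; here is a hypothetical bracketed-number parser (assuming the lexer knows about #"[" and #"]"):

val bracketed = surround #"[" #"]" number
val (k, _) = bracketed [Symb #"[", Number 7, Symb #"]"]   (* k = 7 *)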
The next obvious step is to use our combinator parser for parsing the
small expression language from part one. First, we handle the two
primitive, or atomic, rules:
fun exp_number token = (number >>> (fn n => Const n)) token
fun exp_id token = (identifier >>> (fn id => Var id)) token
These functions are straightforward: If we parse a number, we just
have to add the right constructor onto the result. The same is true
for identifiers.
There are several ways to parse the rest of the expression language, but one of the cleverer ways is to use a function I first saw Lawrence C. Paulson use. Here is the function, which may look a bit daunting at first:
fun infixes (ph, prec_of, apply) =
let
fun over k toks = next k (ph toks)
and next k (x, Symb a :: rest) =
if prec_of a < k then (x, Symb a :: rest)
else next k ((over (prec_of a) >>> apply a x) rest)
| next k (x, toks) = (x, toks)
in
over 0
end
The function infixes takes a primitive expression parser, ph, a precedence discriminator function, prec_of, and an applicator function, apply. It then produces a parser honoring the given precedence designations.
Basically, two functions are mutually defined, over and next. The over function parses expressions above a given precedence level. The next function handles the next symbol token we see. So suppose we meet a symbol. We look up its precedence. If the precedence is lower than the level we are currently operating at, we simply return the expression parsed so far to the lower level. If it is at least as high, we start a recursive run at that higher level via a call to over, and we arrange it such that when it returns, the symbol 'a' is applied to the returned value and 'x' (it is nice to see currying at work). Finally, we proceed to the next token at the current level via a recursive call to next.
In our case, the precedence table is easily given:
fun precOf x =
case x of
#"+" => 1
| #"-" => 1
| #"*" => 2
| _ => ~1
and the same for the apply table:
fun apply a x y =
case a of
#"+" => Add (x, y)
| #"-" => Sub (x, y)
| #"*" => Mul (x, y)
| _ => raise InternalError
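With precOf and apply in hand, we can already exercise infixes over a bare-number parser, before wiring in the full expression grammar (num_exp is a throwaway name of mine):

val num_exp = infixes (number >>> Const, precOf, apply)
val (t, _) = num_exp [Number 1, Symb #"+", Number 2, Symb #"*", Number 3]
(* t = Add (Const 1, Mul (Const 2, Const 3)):
   the multiplication binds tighter, as desired *)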
To use our infixes function, we build a mutual definition of primitive
expressions, parenthesis structure and a call to infixes:
fun prim_exp toks =
let fun p_exp toks = parens exp_parser toks
in
(exp_number ||| exp_id ||| p_exp) toks
end
and exp_parser toks = infixes (prim_exp, precOf, apply) toks
The function exp_parser is our desired expression parser. A simple test:
fun p stream =
let
val lexed = lexer stream
in
parse_list exp_parser lexed
end
val testlex_str = "x + 33 * (y - 5)" : string
val testparser = p testlex_str
val out = eval_exp test_env testparser
This returns the expected value, so we have some indication that the parser works as desired:
...
val p = fn : string -> exp
val testparser = Add (Var "x",Mul (Const #,Sub #)) : exp
val out = 68 : int
There are still several things we can do with the
parser combinators however, and there are some things left to discuss.
There is a big disadvantage to having a parser that backtracks by default. At each possible junction it must store all of its state, since it might return to that point later when backtracking. Of course, this can be exploited to find all possible parses with some alteration of the parser, but mostly it is a bad idea. The Haskell Parsec library takes a cleverer route: it does not backtrack by default, and its choice operator only tries the second parser if the first one failed without consuming any input. Thus it can avoid keeping state around for backtracking.
Combinator parsers also have certain limitations. If your grammar has left recursion, you will find yourself in an infinite loop very quickly. For anything that needs speed and efficiency, one is probably better off with a parser generator from the LR family. But for smaller problems where speed is not paramount, combinator parsers tend to be a fine choice.