In the third installment of combinator parsing, we will write the actual parser and use it to parse our original problem from part 1. Parsers have a specific type, namely
    type ('a, 'b) parser = 'a -> 'b * 'a
    
    so a parser takes a stream of type 'a, breaks off part of it, and returns a result of type 'b together with the rest of the stream. Our parser will have type (token list, exp) parser, as it processes a token list and produces an exp. We begin by defining two exceptions:
    exception Punt
    exception ParserError
    
    These are used a bit differently. The Punt exception is raised whenever the parser wants to backtrack and attempt another parse. ParserError is raised when the parser fails to parse the input stream altogether. Wrapping up the parser system is a function we use for running a parse. It checks that no excess data remains when the parse finishes, and it also coerces top-level punts into general parser errors.
    fun parse_list parser stream =
    (case parser stream of
      (result, []) => result
    | _ => raise ParserError) handle Punt => raise ParserError
    
    Let us now build a parser for symbols:
    fun symbol x ph =
    case ph of
      (Symb y :: rest) => if x = y then (x, rest) else raise Punt
    | _ => raise Punt
    
    Given a symbol, x, and a phrase, ph, to attempt to parse, we test whether the head of the stream is the desired symbol. If yes, we return the pair of the symbol and the rest of the phrase. If not, we punt; the parse failed here, so we backtrack to try something else. It is easy to build more simple parsers like the one above. Here are numbers and identifiers:
    fun number ph =
    case ph of
      (Number n :: rest) => (n, rest)
    | _ => raise Punt
    
    fun identifier ph =
    case ph of
      (Id id :: rest) => (id, rest)
    | _ => raise Punt
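
    These parsers assume the token datatype produced by the lexer in part one. Reconstructed from how the tokens are used here (the authoritative declaration lives in part one), it would look something like:
    datatype token =
      Symb of char
    | Number of int
    | Id of string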
    
    But what makes this kind of parser special is the parser combinators. A parser combinator is a function which takes several parsers (of the ('a, 'b) parser type) and combines them into a new parser. We declare a couple of new operators and fill in their definitions later:
    infix 0 |||
    infix 3 ---
    infix 5 >>>
    
    Normally, I am not that keen on declaring new operators, but when there is a certain sense of closure among the operators and they work on a specific domain, code with operators can be easier to read. As an aside, it does not always help to have operators infix. For instance, 3 + 4 + 7 + 13 in ML can be written more succinctly in Lisp systems as (+ 3 4 7 13). Nor do I find the operator noise of most Haskell programs very appealing. So one must take great care before blindly declaring another operator; it should be warranted and justified. One justification is when you have a set of operators with closure, operating on the same domain: binary arithmetic modulo N, for instance, or combinators for parsers.
    
    Back to our combinators. The first one, |||, is the 'or' combinator. ph1 ||| ph2 will first try to parse with ph1. Upon success of ph1, that is the result of the whole parse. Upon failure, it tries ph2. Observant readers will recognize the short-circuiting nature. The code uses backtracking to achieve the described informal semantics:
    fun ph1 ||| ph2 =
    (fn stream =>
    ph1 stream handle Punt => ph2 stream)
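
    As a small usage sketch (the name plus_or_minus is mine, not from the original code), combining two symbol parsers with ||| yields a parser that accepts either symbol:
    fun plus_or_minus toks = (symbol #"+" ||| symbol #"-") toks
    
    (* plus_or_minus [Symb #"-", Number 1] evaluates to (#"-", [Number 1]) *)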
    
    Similarly, there is --- which is the equivalent of a logical 'and'. The parser ph1 --- ph2 first parses with ph1 and then with ph2 on the stream remaining from ph1. It returns the pair of the two results. Perhaps the code explains this better than I do:
    fun ph1 --- ph2 =
    (fn stream =>
    let
      val (a, stream') = ph1 stream
      val (b, stream'') = ph2 stream'
    in
      ((a, b), stream'')
    end)
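
    As an illustration of mine (not part of the original grammar), an assignment-like phrase such as x = 5 can be recognized by chaining parsers with ---:
    fun assign toks = (identifier --- symbol #"=" --- number) toks
    
    (* assign [Id "x", Symb #"=", Number 5] evaluates to ((("x", #"="), 5), []) *)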
    
    Next up is >>> which is the application operator. If one writes ph >>> f, then ph is used to parse the stream into (x, rest) but the final result will be (f x, rest). The code is straightforward:
    fun ph >>> f = (fn stream =>
        let val (x, rest) = ph stream
        in (f x, rest) end)
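
    A quick usage sketch of mine (the name doubled is hypothetical): since >>> applies a function to the parse result, we can transform values as they are parsed:
    fun doubled toks = (number >>> (fn n => 2 * n)) toks
    
    (* doubled [Number 21, Id "x"] evaluates to (42, [Id "x"]) *)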
    
    The >>> operator lets us shove any translating post-processor into the parser at any point, which will come in handy. With these down, it is easy to define some derived parsers. Here is one that lets you surround any parse with parentheses:
    fun parens ph = ((symbol #"(") --- ph --- (symbol #")"))
       (* Cut the relevant part *)
       >>> (fn ((_, x), _) => x)
    
    Or even better, generalize that one:
    fun surround left right ph = ((symbol left) --- ph --- (symbol right))
             (* Cut the relevant part *)
             >>> (fn ((_, x), _) => x)
    
    val parens = surround #"(" #")"
    
    One might look at the function that cuts out the relevant part and deem it an ugly function. It is, indeed. But since we are writing a combinator parser, we can do a lot better than that. We can simply define functions to get rid of it:
    infix 3 --$
    infix 3 $--
    fun ph1 --$ ph2 = (ph1 --- ph2) >>> (fn (x, _) => x)
    fun ph1 $-- ph2 = (ph1 --- ph2) >>> (fn (_, y) => y)
    
    fun surround left right ph = (symbol left) $-- ph --$ (symbol right)
    
    fun parens token = surround #"(" #")" token
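
    To see this in action on a concrete token list (an example of mine, assuming the token constructors from part one):
    val (n, rest) = parens number [Symb #"(", Number 3, Symb #")"]
    
    (* n = 3 and rest = [] -- the parentheses are consumed and thrown away *)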
    
    The functions $-- and --$ can now be used to remove parts of the parse result without having to use a pattern deconstruction all over the place. It has been confined to the functions $-- and --$ once and for all. The next obvious step is to use our combinator parser for parsing the small expression language from part one. First, we handle the two primitive, or atomic, rules:
    fun exp_number token = (number >>> (fn n => Const n)) token
    fun exp_id token    = (identifier >>> (fn id => Var id)) token
    
    These functions are straightforward: if we parse a number, we just have to add the right constructor onto the result. The same is true for identifiers. There are several ways to parse the rest of the expression language, but one of the more clever ones is to use a function I first saw Lawrence C. Paulson use. Here is the function, which may look a bit daunting at first:
    fun infixes (ph, prec_of, apply) =
    let
      fun over k toks = next k (ph toks)
      and next k (x, Symb a :: rest) =
          if prec_of a < k then (x, Symb a :: rest)
          else next k ((over (prec_of a) >>> apply a x) rest)
        | next k (x, toks) = (x, toks)
    in
      over 0
    end
    
    The function infixes takes a primitive expression parser, ph, a precedence discriminator function, prec_of, and an applicator function, apply. It then produces a parser with the given precedence designations. Basically, two functions are mutually defined, over and next. The over function parses expressions over a given precedence. The next function operates on the next symbol token we see. So, suppose we get a symbol. Then we look up its precedence. If the precedence is lower than the current level we are operating at, we simply return it back to the lower level. If it is higher, we start a recursive run for the higher level via a call to over -- and we arrange it such that when it returns, the symbol 'a' is applied to 'x' and the return value (it is nice to see currying at work). Finally, we proceed to the next token at the current level via a recursive call to next. In our case, the precedence table is easily given:
    fun precOf x =
    case x of
      #"+" => 1
    | #"-" => 1
    | #"*" => 2
    | _ => ~1
    
    and the same goes for the apply table:
    fun apply a x y =
    case a of
      #"+" => Add (x, y)
    | #"-" => Sub (x, y)
    | #"*" => Mul (x, y)
    | _ => raise InternalError
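
    To see the precedence machinery at work, here is a hand trace of mine showing how infixes resolves the token stream for 1 + 2 * 3:
    (* over 0 parses Const 1; next 0 sees #"+" (precedence 1), and
       1 < 0 is false, so it recurses at level 1. over 1 parses
       Const 2; next 1 sees #"*" (precedence 2), and 2 < 1 is false,
       so it recurses at level 2. over 2 parses Const 3, the stream
       ends, and the "*" is applied, giving Mul (Const 2, Const 3).
       Back at level 0 the "+" is applied:
       Add (Const 1, Mul (Const 2, Const 3)). *)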
    
    To use our infixes function, we build a mutual definition of primitive expressions, parenthesis structure and a call to infixes:
    fun prim_exp toks =
    let fun p_exp toks = parens exp_parser toks
    in
      (exp_number ||| exp_id ||| p_exp) toks
    end
    and exp_parser toks = infixes (prim_exp, precOf, apply) toks
    
    and exp_parser is our desired expression parser. A simple test:
    fun p stream =
    let
      val lexed = lexer stream
    in
      parse_list exp_parser lexed
    end
    
    val testlex_str = "x + 33 * (y - 5)" : string
    val testparser = p testlex_str
    val out = eval_exp test_env testparser
    
    returns the expected value, so we have some indication that it may work as desired:
    ...
    val p = fn : string -> exp
    val testparser = Add (Var "x",Mul (Const #,Sub #)) : exp
    val out = 68 : int
    
    There are still several things we can do with the parser combinators, however, and some things are left to discuss. There is a big disadvantage to having a parser that backtracks by default: at each possible junction it must store all of its state, since it might return to that point later on when backtracking. Of course, this can be used for finding all possible parses with some alteration of the parser, but mostly it is a bad idea. The Haskell Parsec library takes a more clever route: it has no backtracking by default, and its choice operator only tries the second parser if the first one failed without consuming any input. Thus it can avoid keeping state around for backtracking.
    
    Combinator parsers also have certain limitations. If your grammar has left recursion, you will find yourself in an infinite loop very quickly. For anything that needs speed and efficiency, one is probably better off with a parser generator for the LR family. But for smaller problems where speed is not paramount, combinator parsers tend to be a fine choice.
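
    To make the left-recursion caveat concrete, here is a sketch of mine of a naively written left-recursive rule; it loops forever because the parser calls itself before consuming any input:
    (* E ::= E "+" number | number -- left recursive, do not do this *)
    fun bad_exp toks =
        ((bad_exp --- symbol #"+" --- number)
           >>> (fn ((x, _), n) => Add (x, Const n))
         ||| (number >>> Const)) toks
    
    (* bad_exp immediately calls bad_exp on the very same token list,
       so any input sends it into infinite recursion. *)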