Saturday, January 31, 2009

Combinator parsing, part 2 (Lexing)

In the first part, we set up the scene for combinator parsing. We introduced a small toy language and an interpreter for it. The thing we left out was the meat of the problem. Namely how to parse an expression string into a Standard ML datatype value. We will not be ready for this entirely yet, however. Most parsing endeavours begin with the concept of lexical analysis or lexing, in which we break up a stream of characters into a stream of tokens. It can best be described by grouping characters that belong together into a single token. There exist lexer generators able to produce quite fast lexers automatically. These usually work by identifying tokens via regular expressions and then compiling these regular expressions into a state machine (DFA). The best of them do not even interpret this DFA but compile the DFA into code which when executed is the DFA is question. For a particularly cool version of this, see Shriram Krishnamurthi [1] For our purpose here it will be overkill to use a lexer generator. So we will resort to the much simpler solution of just hand-write it. But we will be clever enough to write it in a way which is pretty fast. The main disadvantage compared to most of the lever generators is that they work on the input stream in chunks. In other words they keep a small amount of work in memory at all times. Our solution will just read everything in. Haskell fares better here, as the default is ts lazy-read streams. Anyway, we begin by defining the basics:
exception Internal_Error
datatype token = Symb of char | Id of string | Number of int
val symbols = String.explode "+-*()"
The exception is for catching internal errors in the lexer. I tend to have such an exception in each module and raise it whenever something is non-expected. This way I eliminate an unexhaustive pattern match and make it clear to the reader what the intention of the code was. The datatype defines what tokens we have in our language. There are symbols, identifiers and numbers. Finally, we declare a character list of symbols our language has.
fun numTok ss =
 case Int.fromString (Substring.string ss) of
   NONE => raise Internal_Error
 | SOME i => Number i

fun idTok ss = Id (Substring.string ss)
The numTok function handles a number. It expects a substring and then converts it to an integer. The pre-condition of this function is that there is something to convert at all, or we face an internal error. The idTok expects the substring to be an identifier and converting that to a string is trivial. Substrings warrants an explanation. In Standard ML, strings are immutable. Strings cannot be modified once they have been created. In fact, strings are vectors of chars. And vectors are immutable. A substring is then 3 values (s, i, n) where s is the underlying substring, i is the index into it and n is the length. So substrings are slices of strings. Our main lexer function uses substrings to maintain a window over the original string. While it processes the string, this window is then systematically moved forward and the token stream is generated. Here is the main lexer function:
fun scan (toks, ss) =
 case Substring.getc ss of
   NONE => List.rev toks
 | SOME (c, r) =>
   if List.exists (fn x => x = c) symbols
   then (* Symbol *)
     scan (Symb c::toks, r)
   else if Char.isAlpha c
   then (* Identifier *)
     let
       val (cs, rest) = Substring.splitl Char.isAlpha ss
       val tok = idTok cs
     in
       scan (tok :: toks, rest)
     end
   else if Char.isDigit c
   then (* Number *)
     let
       val (cs, rest) = Substring.splitl Char.isDigit ss
       val tok = numTok cs
     in
       scan (tok :: toks, rest)
     end
   else (* Skip whitespace, etc *)
     scan (toks, Substring.dropl (not o Char.isGraph) ss)

fun lexer str =
 scan ([], Substring.full str)
The beast is easily explained: If we call lexer on a string, the string is made into a substring, and scan is called with an empty token list. The scan function carries out the hard work. If there are no more substring chars, then the tokens are just reversed and we are done. It might look odd to build the token list in reverse but SML mandates that list can only be built from the front, so this is the typical way to handle the last part of a tail-calling function. If there is a character, then we discriminate the type of character. For each possible discrimination, we break off some of the string, create the right token type, and proceed with the rest of the string. Finally, we have a skip of white-space in case nothing else matches. As a rather crude test
val testlex_str = "x + 33 * (y - 5)"
val testlex_toks = lexer testlex_str
results in
exception Internal_Error
datatype token = Id of string | Number of int | Symb of char
val symbols = [#"+",#"-",#"*",#"(",#")"] : char list
val numTok = fn : substring -> token
val idTok = fn : substring -> token
val scan = fn : token list * substring -> token list
val lexer = fn : string -> token list
val testlex_str = "x + 33 * (y - 5)" : string
val testlex_toks =
[Id "x",Symb #"+",Number 33,Symb #"*",Symb #"(",Id "y",Symb #"-",Number 5,
Symb #")"] : token list
Which looks right. In the next part we will finish this up and write the combinator-parser for the token list into the abstract syntax tree we aim for. [1] Swine Before Perl
Post a Comment

About Me

My Photo
Lambda-loving CS Geek. Likes metal music. Likes dogs. Likes cats. Does not like pictures of dogs and cats (unless they are lambdacats!)

Has an unhealthy coffee addiction. Calls himself the coffee zombie in the morning (BEEEEANS!)

Has a neverending curiosity gene. Likes intelligence.