exception Internal_Error datatype token = Symb of char | Id of string | Number of int val symbols = String.explode "+-*()"The exception is for catching internal errors in the lexer. I tend to have such an exception in each module and raise it whenever something is non-expected. This way I eliminate an unexhaustive pattern match and make it clear to the reader what the intention of the code was. The datatype defines what tokens we have in our language. There are symbols, identifiers and numbers. Finally, we declare a character list of symbols our language has.
fun numTok ss = case Int.fromString (Substring.string ss) of NONE => raise Internal_Error | SOME i => Number i fun idTok ss = Id (Substring.string ss)The numTok function handles a number. It expects a substring and then converts it to an integer. The pre-condition of this function is that there is something to convert at all, or we face an internal error. The idTok expects the substring to be an identifier and converting that to a string is trivial. Substrings warrants an explanation. In Standard ML, strings are immutable. Strings cannot be modified once they have been created. In fact, strings are vectors of chars. And vectors are immutable. A substring is then 3 values (s, i, n) where s is the underlying substring, i is the index into it and n is the length. So substrings are slices of strings. Our main lexer function uses substrings to maintain a window over the original string. While it processes the string, this window is then systematically moved forward and the token stream is generated. Here is the main lexer function:
fun scan (toks, ss) = case Substring.getc ss of NONE => List.rev toks | SOME (c, r) => if List.exists (fn x => x = c) symbols then (* Symbol *) scan (Symb c::toks, r) else if Char.isAlpha c then (* Identifier *) let val (cs, rest) = Substring.splitl Char.isAlpha ss val tok = idTok cs in scan (tok :: toks, rest) end else if Char.isDigit c then (* Number *) let val (cs, rest) = Substring.splitl Char.isDigit ss val tok = numTok cs in scan (tok :: toks, rest) end else (* Skip whitespace, etc *) scan (toks, Substring.dropl (not o Char.isGraph) ss) fun lexer str = scan (, Substring.full str)The beast is easily explained: If we call lexer on a string, the string is made into a substring, and scan is called with an empty token list. The scan function carries out the hard work. If there are no more substring chars, then the tokens are just reversed and we are done. It might look odd to build the token list in reverse but SML mandates that list can only be built from the front, so this is the typical way to handle the last part of a tail-calling function. If there is a character, then we discriminate the type of character. For each possible discrimination, we break off some of the string, create the right token type, and proceed with the rest of the string. Finally, we have a skip of white-space in case nothing else matches. As a rather crude test
val testlex_str = "x + 33 * (y - 5)" val testlex_toks = lexer testlex_strresults in
exception Internal_Error datatype token = Id of string | Number of int | Symb of char val symbols = [#"+",#"-",#"*",#"(",#")"] : char list val numTok = fn : substring -> token val idTok = fn : substring -> token val scan = fn : token list * substring -> token list val lexer = fn : string -> token list val testlex_str = "x + 33 * (y - 5)" : string val testlex_toks = [Id "x",Symb #"+",Number 33,Symb #"*",Symb #"(",Id "y",Symb #"-",Number 5, Symb #")"] : token listWhich looks right. In the next part we will finish this up and write the combinator-parser for the token list into the abstract syntax tree we aim for.  Swine Before Perl