Saturday, January 31, 2009

Combinator parsing, part 2 (Lexing)

In the first part, we set up the scene for combinator parsing. We introduced a small toy language and an interpreter for it. The thing we left out was the meat of the problem. Namely how to parse an expression string into a Standard ML datatype value. We will not be ready for this entirely yet, however. Most parsing endeavours begin with the concept of lexical analysis or lexing, in which we break up a stream of characters into a stream of tokens. It can best be described by grouping characters that belong together into a single token. There exist lexer generators able to produce quite fast lexers automatically. These usually work by identifying tokens via regular expressions and then compiling these regular expressions into a state machine (DFA). The best of them do not even interpret this DFA but compile the DFA into code which when executed is the DFA is question. For a particularly cool version of this, see Shriram Krishnamurthi [1] For our purpose here it will be overkill to use a lexer generator. So we will resort to the much simpler solution of just hand-write it. But we will be clever enough to write it in a way which is pretty fast. The main disadvantage compared to most of the lever generators is that they work on the input stream in chunks. In other words they keep a small amount of work in memory at all times. Our solution will just read everything in. Haskell fares better here, as the default is ts lazy-read streams. Anyway, we begin by defining the basics:
exception Internal_Error
datatype token = Symb of char | Id of string | Number of int
val symbols = String.explode "+-*()"
The exception is for catching internal errors in the lexer. I tend to have such an exception in each module and raise it whenever something is non-expected. This way I eliminate an unexhaustive pattern match and make it clear to the reader what the intention of the code was. The datatype defines what tokens we have in our language. There are symbols, identifiers and numbers. Finally, we declare a character list of symbols our language has.
fun numTok ss =
 case Int.fromString (Substring.string ss) of
   NONE => raise Internal_Error
 | SOME i => Number i

fun idTok ss = Id (Substring.string ss)
The numTok function handles a number. It expects a substring and then converts it to an integer. The pre-condition of this function is that there is something to convert at all, or we face an internal error. The idTok expects the substring to be an identifier and converting that to a string is trivial. Substrings warrants an explanation. In Standard ML, strings are immutable. Strings cannot be modified once they have been created. In fact, strings are vectors of chars. And vectors are immutable. A substring is then 3 values (s, i, n) where s is the underlying substring, i is the index into it and n is the length. So substrings are slices of strings. Our main lexer function uses substrings to maintain a window over the original string. While it processes the string, this window is then systematically moved forward and the token stream is generated. Here is the main lexer function:
fun scan (toks, ss) =
 case Substring.getc ss of
   NONE => List.rev toks
 | SOME (c, r) =>
   if List.exists (fn x => x = c) symbols
   then (* Symbol *)
     scan (Symb c::toks, r)
   else if Char.isAlpha c
   then (* Identifier *)
     let
       val (cs, rest) = Substring.splitl Char.isAlpha ss
       val tok = idTok cs
     in
       scan (tok :: toks, rest)
     end
   else if Char.isDigit c
   then (* Number *)
     let
       val (cs, rest) = Substring.splitl Char.isDigit ss
       val tok = numTok cs
     in
       scan (tok :: toks, rest)
     end
   else (* Skip whitespace, etc *)
     scan (toks, Substring.dropl (not o Char.isGraph) ss)

fun lexer str =
 scan ([], Substring.full str)
The beast is easily explained: If we call lexer on a string, the string is made into a substring, and scan is called with an empty token list. The scan function carries out the hard work. If there are no more substring chars, then the tokens are just reversed and we are done. It might look odd to build the token list in reverse but SML mandates that list can only be built from the front, so this is the typical way to handle the last part of a tail-calling function. If there is a character, then we discriminate the type of character. For each possible discrimination, we break off some of the string, create the right token type, and proceed with the rest of the string. Finally, we have a skip of white-space in case nothing else matches. As a rather crude test
val testlex_str = "x + 33 * (y - 5)"
val testlex_toks = lexer testlex_str
results in
exception Internal_Error
datatype token = Id of string | Number of int | Symb of char
val symbols = [#"+",#"-",#"*",#"(",#")"] : char list
val numTok = fn : substring -> token
val idTok = fn : substring -> token
val scan = fn : token list * substring -> token list
val lexer = fn : string -> token list
val testlex_str = "x + 33 * (y - 5)" : string
val testlex_toks =
[Id "x",Symb #"+",Number 33,Symb #"*",Symb #"(",Id "y",Symb #"-",Number 5,
Symb #")"] : token list
Which looks right. In the next part we will finish this up and write the combinator-parser for the token list into the abstract syntax tree we aim for. [1] Swine Before Perl

Tuesday, January 20, 2009

Ripping the backend from Moscow ML

Suppose you have two functions:
(* Type: int * int -> int *)
fun f (x,y) = x + y

(* Type: int -> int -> int *)
fun g x y = x + y
These functions do the same thing, namely add two numbers 'x' and 'y'. They have different types however, and they are called in different ways. 'f' is the usual way in most languages. We take a pair, and return the result. 'g' is curried: it is a function that takes an 'x' and then produces a new function taking a 'y'. The produced function returns the result when given an 'y'. Interestingly, 'g' is more powerful than 'f'. If we write
(* Type: int -> int *)
g 3
we get a so called partial application of 'g'. This means we only supply the first parameter to the function. It allows us to optmize with respect to the first parameter and it is a good way to define very general functions and then specialize them into what we need later on. Here is an example:
(* Type: int -> int -> int *)
fun h x =
 let val z = expensive_operation x
 in (fn y => y + z)
 end
We have defined a function which carries out some expensive operation between getting its first and second parameter. An application, 'let val foo = h 37' say, will only compute the expensive operation once and each time we use 'foo' we reap the benefits. Moscow ML is a quasi-implementation of the Standard ML programming language. It does not follow the standard completely, but it follows it enough to be usable and to let simple code flow between SML implementations without too much problems. What does not flow that easily is the module system which is treated somewhat differently between Moscow ML and e.g. SML/NJ or MLton. Moscow ML works by taking SML source code through a lexer and a parser into an abstract syntax tree. Then the types of all expressions are inferred in an 'elaboration' phase. Next, modules are stripped and pattern matches are simplified via a pattern match compiler. Now, the code is translated into an internal languauge, Lambda. Lambda is a lambda calculus + stuff. Finally, Lambda is translated into bytecode of a virtual machine. The Moscow ML runtime is written in C and can interpret the bytecode output from the compiler. The engine is neat, but it suffers somewhat from age. Several speedups are readily possible on this machine with relative little effort. The naive implementation of the 'g' function from before follows the semantics directly. Take the 'x', produce the intermediate function as a closure and then invoke the closure on 'y'. It is inefficient to produce the closure however, and we don't always have to! in 'g x y' we know all the arguments, so we can avoid building the intermediate function. Moscow ML relies on a derivative of the ZINC abstract machine as its VM. ZINC was an experiment by Xavier Leroy for writing a virtual machine for the Caml language. It is interesting because curried arguments do not build intermediate closures unless they have to. In other words, using the curried, more powerful, variant of a function has no extra cost in this machine. I want to compile Moscow ML into LLVM bytecode. The idea is to compile from Lambda down to LLVM, which is not that different to what the Ocaml compiler is currently doing. In the process, I would like carry over the ZINC machines good handling of curried arguments. I am taking some preliminary steps to get this done, but don't expect anything wild the next couple of months. This takes a lot of time getting right since I am writing it as a hobby project for the time being. I might be posting some updates later when the project has progressed furhter. References -- if you know any reading which might be interesting, do not hesitate to write me: The ZINC Experiment How to make a fast curry (I love Simon Peyton Jones paper names) The Mosml-llvm code lives here

Friday, January 09, 2009

Combinator parsing, part 1 (Setup)

In this series of posts, we want to parse expressions. In order to make it worthwhile and fun, we will begin by placing some constraints on this.

  • We will be using the language Standard ML for describing the approach. Though note that any functional language of a certain power can do this. Converting it to Ocaml, Haskell, Erlang or a lisp should be pretty straightforward once you get the hang of the idea. In some sense, Standard ML is the simplest of the (statically typed) functional languages in that it has almost no fancy tricks you can pull.
  • We will not be using regular expressions. Not that regular expressions are bad, but I want to show that this can be done easily without the use of RE's. The goal here is to show people that there are different approaches to parsing data than just throwing a regular expression at the problem.
The language we will study has the following simple syntax structure, given in BNF form:
  e ::= x | n | e1 + e2 | e1 - e2 | e1 * e2
The 'x' is a variable. Its value can range over integers (we won't be concerned with the actual range of these integers, altough it is interesting to muse over). The 'n' is some integer constant. The term 'e1 + e2' is an addition of two numbers. Likewise for subtraction and multiplication. Here is a typical expression:
  x + 3 * (y - 5)
Note that the parenthesis are not given in the syntax. We just assume their existence and they work as you would expect: grouping together terms to force evaluation order.

Writing down the syntax in Standard ML is straightforward:
datatype exp = Var of string | Const of int | Add of exp * exp
           | Sub of exp * exp | Mul of exp * exp
Hopefully it is clear what is going on here. We have translated the syntax into an ML datatype. To discriminate each construct, we have added constructors 'Var','Const','Add','Sub', and 'Mul'.

If we can somehow bring strings of the form 'x + 3 * (y - 5)' into this datatype (correctly!) -- and we know what 'x' and 'y' is, then we are able to evaluate the expression. Here is how:

fun eval_exp env exp =
  case exp of
    Var x        => env x
  | Const n      => n
  | Add (e1, e2) => (eval_exp env e1) + (eval_exp env e2)
  | Sub (e1, e2) => (eval_exp env e1) - (eval_exp env e2)
  | Mul (e1, e2) => (eval_exp env e1) * (eval_exp env e2)
So we destruct the form of 'exp'. If it is a 'Var', we look up the variable in the environment 'env', which we will define shortly. If it is a 'Const', we just return the integer value. Finally, if it is a binary operator, we proceed to evaluate the two sub-expressions and apply the right ML-operator to it.

We still need to give a meaning to the environment 'env'. We would like to represent it as a map from strings to integers (ie, a function of type 'string -> int'), as our variables are strings. This is straightforward in Standard ML fortunately. Suppose we want to represent the mapping 'x' maps to 2 and 'y' maps to 7:

exception env_notfound of string
fun test_env v =
  case v of
    "x" => 2
  | "y" => 7
  | _   => raise (env_notfound v)
We define an exception for the case where we don't know about the variable and raise it if the variable is unknown. In Standard ML '_' stands as a wild-card for "any other unnamed value."

Let us round up by playing with what we have. If we manually parse the term 'x + 3 * (y - 5)' it becomes:
val term = Add (Var "x", Mul (Const 3, Sub (Var "y", Const 5)))
So if we try to run the term with our environment from above, we expect the result to be '2 + 3 * (7 - 5) = 2 + 3 * 2 = 2 + 6 = 8'. Here is what our Standard ML system says:
- eval_exp test_env term;
val it = 8 : int
In the next part, we will write the parser that allows us to automate the parsing of the term into the datatype of 'val term'

Tuesday, January 06, 2009

Common Erlang misconceptions

There is a common saying that Erlangs I/O layer is slow. Bullshit. The etorrent program can easily sustain 70 megabit with less than 5% of CPU usage on my old machines. The reason is that etorrent is shuffling 16k binary blocks around mostly. But that means you have a pointer to each 16k block and you are only a writev(2) call away from the disk. This means I/O is not itself a problem. The problem that people often hit is Erlangs string handling. In Erlang a string can either by represented as an immutable array of 8-bit characters, called a binary; or it can be represented as a list of integers (ie, a linked list of integers). The latter is mutable in the sense that you can generate new integers and update the linked list from the front only. Whenever you mutate the list of integers, garbage is generated. Allocation will happen and this directly translates to more garbage collection pressure later on. Had you had destructible arrays in Erlang it would have been very easy to circumvent the worst performance problems, but you will have to do with a limited zoo of types. In better languages, like Ocaml, Standard ML or Haskell, your type zoo is much richer here. You can choose between several immutable and mutable variants so the leverage is much greater. First trick: Erlang has the concept of an IO list. These are lists of binary/string chunks. They are fast because they lend themselves to a writev(2) in the bottom of the system. Thus you don't have to concatenate strings in the Erlang language but you can hand a quasi-concatenated list to the VM and let C do the heavy-lifting for you. Second trick: Mutating data as strings is slow. But if you have a abstract tree of tuples then it can be manipulated fast. If you think about the command [Bin | IOList] where we add a Binary to the front of an IOList, you'll know it happens in O(1) (constant) time. If we build a tuple {atom, D1, D2, ...}, we can build it in constant time. So the trick is to maintain your data in a non-string form for as long as possible and then only change it to a string at the very last point in time. Third trick: Correct data structure. Ropes. Finger Trees. M-way trees. Don't brute force yourself over a string. Process the structure by pattern matching in a clever way so you minimize the amount of change needed. A good tree will have around logarithmic change whereas a list will have linear change. This leads to the next point: allocation. Another crap problem in Erlang is that you have few (traditional) ways to minimize allocation. Any kind of operation will allocate data if you are not wary of it. Hence you should try to build programs that do not gloss over data and allocate all the time. Mapping of lists via list comprehensions is a good example. It is easy to understand but it will build a new list and garbage collection will cost you. If you can do multiple things in one list comprehension, you win traversals and allocations. Whenever I hear about "Slow string processing" in Erlang I immediately think: "Another Ruby programmer trying to cram his language into Erlang". If you hit slowness the trick in Erlang is to redefine the problem! Don't try to string process yourself into oblivion. If the sole thing your problem concerns is string-processing, go use perl. Please. Or be clever and use Ocaml. Pre-process the data in other languages and make them into erlang terms which can be read fast by erlang. Erlangs power is the concurrency at which it is really good. There were no reason for giving it fast string processing. Also, don't get bitten by the mnesia bug, please. It is almost always the case that mnesia and ETS are the wrong way to go. I got bitten by it multiple times in etorrent and now I am contemplating ripping all of it out again. Plain old data structures baby! There are some nasty caveats to mnesia and ETS and they are not related to the size limit of the database. For mnesia, be aware that a failed transaction will be re-run! Now take 100 transactions tripping over each other. They will never finish. Etorrent did that when peers should update their state. Instant snail-speed application. Mnesia is crap at doing joins. Just forget it. ETS is also problematic: Any store or get from ETS means a data copy. Don't do as etorrent and add large rows (128k+) to ETS. Your app will come to a grinding halt. Don't message big data structures around. Any message incurs a copy (in the default VM). If you need to work with a big data structure, shove it into a process and then send the Pid of that process around or register him globally. You can think of a Pid as "Passing a pointer around" in this case. And finally: Measure measure measure! Be scientific about your performance problems. You need to know exactly what causes it before jumping to conclusions. And don't worry too much about being parallel at the beginning. You can always split things up in finer granularity later.

Monday, January 05, 2009

A little set of tools

This is a little set of tools that I like to use now and then. I name them by their Ubuntu/Debian package names although I think one can find them easily for any package system out there. apg - The "automated password generator". It generates automated pronounceable passwords. It takes a lot of options as to how the passwords should look and what they must contain etc. It is a handy tool for generating passwords which are hard to guess but still easy to memorize. mp3gain - Mastered music today has different loudness levels. You will search for your volume button all the time because some tracks are much louder than others. This tool can analyze mp3 files and hint about the loudness level so programs can adjust accordingly. Or it can adjust the volume of the mp3 file in a lossless way (using a nice property of the MDCT in mp3s). password-gorilla - A Password safe application written on top of "pwsafe" in TCL. Main advantage of using this is that there are Windows and OSX programs able to read the format as well. stow - Manager for the hell that is /usr/local. You create /usr/local/stow and proceed to compile programs with "--prefix=/usr/local" but run "make install prefix=/usr/local/stow/foo-1.2.3". Now you can cd into /usr/local/stow and call the "stow" command on "foo-1.2.3". This creates a symlink farm in /usr/local. Want to get rid of the program? "stow -D" it followed by deletion. This is nice for handling seamless upgrades/downgrades of non-packaged software as wel. vorbisgain - Like mp3gain, just for vorbis files. graphviz - Hate drawing diagrams? Now you can program diagrams instead! Graphviz takes a description of a graph (not a plot!) and proceeds to build the graph for you. It uses some really good layout algorithms so you often get extremely neat graphs if you know a bit about the tool. Most kinds of graphs can be drawn easily. There are also tools for doing post-processing of drawn graphs if you want to do things a bit different than what the tool thought - but you rarely need that. It got a Cairo rendering backend and can output to PDF, PS, PNG, SVG, etc. The real power of graphviz is when your graph changes a lot. You just add vertices and edges and the system figures out layout itself. It is way easier than editing the graph in tool most of the time.

Sunday, January 04, 2009

Unison - A file sync application

Most system admins know about rsync. It is a really good tool to synchronize between different machines because it only moves data that happen to be changed. It lacks some things however. One of the things it is lacking is interactivity. Unison is a tool written by Benjamin C. Pierce and a couple of other guys. Pierce is a type theory researcher and he likes writing ocaml programs. But you should not let that put you off Unison. It uses the rsync protocol down below ;) When it sync's 2 sites it will ask you which way to propagate changes. It is also much more clever when it comes to shortcutting work. It can, for instance, detect that a file was moved so it has a local copy of that file it can use in the sync process. My preferred usage is to sync. directory contents between machines I work on. It often happens that I have worked on the machines in a way such a 2-way sync is needed. And Unison can give you that right away while it will be pain in rsync. Pierce now works on the Boomerang language which is really interesting. But I'll save that for a later post.

Friday, January 02, 2009

Avoiding memory fragmentation for fun and profit

One problem that might occur in long-running programs is that of memory fragmentation in the VM-heap. The problem is pretty simple: when you allocate and deallocate memory in the program you might end up with small "holes" all over your heap which are too small for the new data you need. In a page-allocated VM-heap world, this has a serious cost in memory usage. There are several ways to avoid the problem. The first one is to be aware of the problem and be smart when allocating memory. With the right amount of thought you can often get around the memory fragmentation, or at least minimize it. The second trick is to restart the application periodically. It is not as bad as it sounds in a UNIX system. You set up the application and fork() off a child to do the hard work. When the child has been running for some time, you kill it and fork() off another from the (non-fragmented) parent. Apache 1.x used this. The third trick is to allocate a big region of memory, keep a pointer to the first unallocated word and then reset the pointer when you are done with your request and don't need the memory anymore. Interestingly there has been considerable research in the area to attempt to automate this idea. It is known as "Region Inference" in the automated setting, but here it is used in a manual way in e.g. C to achieve the same thing. The Subversion project used a region handler some years ago when I looked at their code. I don't know if they still do however. Poul Henning Kamp uses the reset-trick in Varnish as well. The fourth trick is to use a good garbage collector. Garbage collectors have the advantage that they are allowed to move data around, so the good ones compact live data to one end of the heap periodically and clears up fragmentation in the process. Note that most garbage collectors used in "scripting languages" like Python or Ruby are pretty weak. They usually avoid doing "real" garbage collection and opts for some simple poor-mans solutions. It is unfortunate because this puts GC in a bad light. That and Java-enterprise-behemoths abusing the VM-heap :)

About Me

My Photo
Lambda-loving CS Geek. Likes metal music. Likes dogs. Likes cats. Does not like pictures of dogs and cats (unless they are lambdacats!)

Has an unhealthy coffee addiction. Calls himself the coffee zombie in the morning (BEEEEANS!)

Has a neverending curiosity gene. Likes intelligence.