Monday, January 24, 2011

Agda musings on huttons razor

Agda musings on Huttons razor

So Ralf Lämmel, also known as “Professor Fish”, defined a small language of which he took full copyright. The language was used as an assignment for students hacking Haskell and they were to carry out two alterations on the language in question. First, they were to give it a CPS style semantics and next, they were to define an analysis determining when expressions in the language were pure.

So far so good, but I am kind of tongue-in-cheek when I see tasks like these. And I wanted to play a bit with the idea. The first decision was to throw out Haskell as a tool. No, if we will have types, then let us get some real types. Dependent types. And let us require that all our programs terminate while we are at it. In other words, employ Agda2 to solve the task.

We begin by aptly naming the module as fish

module fish where

in honor of the grand professor who devised this magnificent language we are to present shortly. To define the language, we will need to pull in some Agda libraries for our use. If there is anything I’ve missed in my Twelf-hacking, it is the ability to use libraries defined by others when I am writing code.

open import Function hiding (const)
open import Data.Bool
open import Data.Nat
open import Data.Product
open import Data.Sum
open import Relation.Binary.PropositionalEquality

At this point, we get to define the language in question. We begin with a simple definition of a local module in which to define the language:

module HRazor where
  data Exp : Set where
    Const : ℕ -> Exp
    Plus  : Exp -> Exp -> Exp

What is this? This language is in a HRazor module. While it seems Professor Fish took copyright on the language, we know that Graham Hutton already took a patent on this language and dubbed it Hutton’s razor. We will shortly see that the language is different from the one which Ralf had in mind eventually - but the Razor is at the heart of it.

In Hutton’s Razor, we have expressions defined by Exp and these have exactly two constructors. The first constructor, Const takes a natural number and returns an expression. Note how Agda is a Unicode-aware language. This is an idea that will slowly creep upon you until you begin liking it. It signifies constants in the language. The other constructor, Plus adds together two expressions. This is the full language.

Is the language silly? Most decidedly not! Suppose you are researching some new idea in language theory. The Razor described above is about the smallest language you can define that does something meaningful. So your idea better work for the razor. Otherwise, you can as well ignore getting the idea working for harder problems.

Here is the function for (denotational) evaluation of the language:

  eval : Exp -> ℕ
  eval (Const y) = y
  eval (Plus y y') = (eval y) + (eval y')

which for most programmers of functional languages should be pretty straightforward. Essentially, we turn the expression tree into an evaluation in Agda.

Ralf adds a construct

Ralf’s assignment does not revolve around Huttons Razor. After all, Hutton has a patent (which he graciously grants every language researcher free use of). Rather, Ralf augments the razor with a Exit construction:

data Exp : Set where
  Const : ℕ -> Exp
  Plus  : Exp -> Exp -> Exp
  Exit  : Exp -> Exp

The informal meaning of Exit is this: Exit e will evaluate the computation e and then abort the rest of the program, returning e as its value. To reflect the idea that the program may be in two possible states now, either computing, or exiting, we need a way to tag what state we are in. Haskell people have the Either a b type, but in Agda, this is a disjoint sum:

R = ℕ ⊎ ℕ

Ralf generously provided some helper functions defining an algebra for how to compute with the disjoint sum so we translate these to Agda. Constant expressions are injected to the left in the sum:

const : ℕ -> R
const i = inj₁ i

Addition compares the two sides. It propagates right injections up, but computes on left injections. This is as Ralf does in his post.

add : R -> R -> R
add (inj₁ x) (inj₁ x') = inj₁ (x + x')
add (inj₁ x) (inj₂ y) = inj₂ y
add (inj₂ y) y' = inj₂ y

Finally, if we want to exit, we convert a left injection to a right one, and just pass on the right injection.

exit : R -> R
exit (inj₁ x) = inj₂ x
exit (inj₂ y) = inj₂ y

The neat thing about defining an algebra like this for R-computations is that now the evaluation function is almost as simple as in the Razor:

eval : Exp -> R
eval (Const y) = const y
eval (Plus y y') = add (eval y) (eval y')
eval (Exit e) = exit (eval e)

The only new thing is the Exit case which formalizes our informal definition above: compute eval e and exit with it.

So what is the point?

Well, this is Agda, so we can analyze the language we have defined for some very simple properties easily. Remember that I said that Agda programs are required to terminate. This comes in very handy when we want to prove that our language does, because we have a definition of it in Agda.

If Agda’s termination checker says that the Agda-program terminates, so does our Ralf augmented-Razor. In other words, we immediately have the property. Neat. In the same vein, our language must be deterministic. These properties can be formalized in Agda and proven as well, but they are immediate.

Task 1, continuations

Next, we CPS-transform our semantics. We could naively transform the language, above into CPS and then provide that, but that is a boring solution. Nay, the cool thing about CPS is that it makes the control flow of the program explicit. We can exploit this explicit control-flow to optimize the case of Exit expressions in the program.

module NotNaive where

Ok, so we throw everything into a submodule, NotNaive, to limit what is in our scope at any point in time so naming is not going to trip us up later.

  constK : ℕ -> (ℕ -> ℕ) -> ℕ
  constK i k = k i

This is part of our algebra as before. If we have a constant, how do we handle it? We add another parameter, the k above which is the kontinuation. In the case of the Razor-extension, we can regard k as “the rest of the program”. It has type (ℕ -> ℕ) which sort-of makes it a transformer on natural numbers. So if we are to deliver the natural number i to the kontinuation, we apply k i.

Additions take two values and delivers the addition to the kontinuation (which still is, intuitively, the rest of the program):

  addK : ℕ -> ℕ -> (ℕ -> ℕ) -> ℕ
  addK v1 v2 k = k (v1 + v2)

In the Exit case comes the trick. We do the following

  exitK : ℕ -> (ℕ -> ℕ) -> ℕ
  exitK i k = i

which is to not deliver the value to k but just return it straight away. This means that we will abort the evaluation of “the rest of the program” and just return the exit value straight away.

We now turn our attention in this language towards the evaluation function in the language. Constants are handled in the obvious way by using the algebra directly, but the two other cases are handled differently:

  evalK : Exp -> (ℕ -> ℕ) -> ℕ
  evalK (Const y) k = constK y k
  evalK (Plus y y') k = evalK y (\v1 -> evalK y' (\v2 -> addK v1 v2 k))
  evalK (Exit y) k = evalK y (\v1 -> exitK v1 k)

Take the Exit-case first. We evaluate the y like in the direct case. But then, we need to supply a continuation which says what the rest of the program does. Our continuation assumes it gets passed a value v1 which is the result of the y computation. It then exits with this value. The Plus case is similar, but there are two continuations to juggle. We evaluate y and deliver it to a continuation that first evaluates y' and then delivers to the final continuation, using our addK function from above.

To run the program:

run2 : Exp -> ℕ
run2 e = NotNaive.evalK e id

note we pass in the id identity function to “start up” the evaluation. This is in balance with what Ralf did. Again, Agda’s termination checker immediately provides termination guarantees for program.

Task 2: A Taste of dependent types

The main reason I set out to do this however, was to utilize the dependent types Agda has built-in. There are something eternally pleasing in defining programs which do not need to compute to answer questions, and here we will see an example of it.

The task is to create a function pure taking an expression and returning true if the expression is pure and false otherwise. Purity is defined as absence of an Exit expression. New module:

module DepAnalysis where

Now, dependent types are more than what the following claims it to be, I am aware. But as a first foray, it is quite mind-boggling in itself and serves as a great simple introduction. If we define

  data Exp' : Bool -> Set where

we have declared a dependent datatype Exp' which has a Bool value attached to it. This means we have two types of expressions: Exp' true and Exp' false. These are different types from the view of the type checker. You can read Bool -> Set as “if we supply a boolean value to an Exp', then it will yield a Set - Set being the elements of the Agda”world" we are working in. So we can just assume in our world, that if an expression is pure, it is Exp' true and otherwise, it is Exp' false.

    Const : ℕ -> Exp' true

This constructor tells Agda that if we supply the Const constructor with a natural number, it will hand us a pure expression. Seems in order, sir! Next constructor:

    Exit  : ∀ {c} -> Exp' c -> Exp' false

This can be read as “if you give my any expression, pure or not - I don’t care - then I will give you a non-pure expresssion”. This makes sense. There is an Exit in the expression, so it better not be pure. It doesn’t matter what the internal expression is.

    Plus  : ∀ {c1 c2} -> Exp' c1 -> Exp' c2 -> Exp' (c1 ∧ c2)

In the final case, Plus we get to supply two expressions. These two expressions have c1 and c2 as their purity-values. Agda will hand us back an expression which is pure if and only if both subexpressions are - due to the conjunction (logical and) in the returned expression.

What have we done here? We have created a dependent data type which will itself track if an expression is pure or not. As we build up the abstract syntax tree, the system will make the calculations necessary. Notice, that if the whole AST is known at compile time, then so is the purity of the expression - right away.

For completeness, we supply the required function pure. Here is the type:

  pure : {b : Bool} -> Exp' b -> Bool

It introduces a new concept, namely implicit parameters. The type {b : Bool} names a parameter which is implicit. We can normally omit implicit parameters when writing functions because it can be extracted from another type in the specification (in this case from Exp' b which contains the b). However, if we need access to the implicit parameter, we can positionally match it by wrapping it in brackets like this:

  pure {x} e = x

So the purity of the expression is already there, we just need to return it! I personally think this is awfully neat.

I want to play more

The Agda wiki contains tutorials, the Agda implementation, and all the other cool information on how to play with Agda. The development here is in a gist on github should you want to play with. Note that the development I made hardly qualifies to show off the really cool features of Agda but only scratches (barely) the surface of what is possible. My intuition is, however, that we will see much more dependently typed programming in the future, when everyone is programming functionally :)

Commentary is as always welcome.

Friday, January 07, 2011

Errlxperimentation: Clustering of Erlang Error Reports.

Errlxperimentation: Clustering of Erlang Error Reports.

If anyone wonders what I hacked over the holidays, this is it :)


In the Erlang/OTP system, there is a registered process with the name of error_logger. Whenever an error happens in the Erlang system by a correctly configured OTP process, it will send reports to the error_logger. The logger is then responsible for taking the report and in a pub-sub manner send it to any error handler currently configured.

One of these error handlers, log_mf_h, is a multi file log handler. It writes reports in binary term format to a set of files. Whenever a file reaches a certain size, it is rotated and a new file is written. The result is that we essentially have ring-buffer of all errors that occur in a running Erlang system.

A tool named rb the “Report Browser” allows us to read in the binary logs and process them. We can list certain errors, “grep” among them, apply filters to the reports and so on. It gives us a system for looking into a running Erlang system for errors — and handling them if we deem them to be problematic. This tool is very neat when you have a running system to inspect. You can basically leave it on for some time and then go look into the log-mf files for errors and crashes. Since the file is rotated on size the only worry is that important errors get rotated out of the log. But if you have enough log space, this won’t happen. In a modern world, you can easily leave 100 Megabytes of error logs. I have taken the leisure of having a modified log_mf_h which can compress term that goes to disk on the fly. This of course improves the size of error report sizes even more.


The experiment

An experiment that has been on my mind for some time was can I possibly do automatic clustering of similar error reports so like reports are clustered together? If this is possible, I could then just pick a representative for each error cluster and use that as a basis for handling like errors. Even if you have thousands of reports, it may be that there are only a few of them which are really different and a lot of them which are redundant.

I know nothing about clustering. But a question later on StackOverflow and after a kind soul on the #erlang IRC channel hinted me in the direction of the paper Ephemeral Document Clustering for Web Applications (2000) DOI: 10.1.1.59.131 I was set for an experiment with this. The algorithm implemented here is a variant though, mostly due to getting something fast. So it does not implement the full paper. We implement the ideas from there and the paper can be consulted if anything below is unclear (or comment, I'll try to explain).

I altered rb to rerb, short for the riak_err report browser. The Basho guys have this excellent alternative to the sasl application built into OTP, see riak_err and I used that as a starting point. I am not even sure riak_err is the right place for this, but at least this is what I did for now.


Processing

The first part is now to convert an error report. They usually look something along the lines of [{badarg, MFA1}, MFa2, ..., MFaN] where each MFA is a tuple {Module, Function, [Arg1, ..., ArgN]} — and each MFa is a tuple {Module, Function, Arity} (arity being the number of parameters of the function). Note that Erlang errors are Erlang terms, so they are easily manipulated by our code. It does make them harder to read for newcomers however.

First, we linearize and canonicalize the input. We change the tuple to [badarg, {Module, Function, length([Arg1, ..., ArgN])}, MFa2, ..., MFaN]. In other words we flatten the error term and we change the occurrence of MFA to MFa. As an example we may have an MFA [{badarg, {ets, insert, [table_foo, {123, something}]}}] and this is changed into [badarg, {ets, insert, 2}]. This ensures that we don’t care about the difference of arguments, but only care about the arity of the function and the call stack.

Next, suppose an error report has the structure [A, B, C] for arbitrary terms A, B and C. We now form bigrams ['#', A], [A, B], [B, C] and [C, '#']. The '#' is just an atom we use as padding for the boundary terms. For each of these we now run the erlang:phash2/2 function to obtain a hash of each bigram: [{phash2(Bg), 1} || Bg <- Bigrams]. Note we also smacked the hash into a tuple with a value of 1. This list of {Bigram, 1} pairs forms a vector in an (euclidian) vector space where the bigram part names a dimension in the space. If we form bigrams like this for all error reports, and take the union over all bigrams we have a set of bigrams defining the basis of the space.

Now we define a metric called similarity on two vectors, A and B, of the space by (A . B) / (||A|| * ||B||) = cos(t) where . is the dot-product and ||x|| is the norm of the vector x. This is equivalent to the cosine of the angle t (theta) between the two vectors A and B. Now, if two vectors point in the exact same direction, the angle is 0 and the cosine is 1, i.e., perfect similarity. If the two vectors are orthogonal to each other, the similarity will be 0.

The similarity of a set of vectors is had by taking the minimum of any two vectors from the set. It to a certain extent measures how “different” the cluster is because the minimum suggest the most dissimilar result is chosen.


Clustering

We form a forest of singleton clusters, and place each vector/error-report in its own cluster. Then we consider similarity of each pair of clusters. The two clusters with the highest similarity are joined. The joining also forms a binary-tree, a dendogram, of the two clusters. Whenever we join two clusters, the original clusters are removed. So whenever a join is executed, we are left with one less clusters. Recursively repeat this until we have a binary tree. This forms the cluster dendogram.

Listing the clusters is done by choosing a similarity, 0.5 say and then walking the tree until a similarity of 0.5 is reached. The subtree is then flattened into a cluster of like error reports.


Looking forward

The code is in an experimental branch of my github repo, jlouis/riak_err/n-gram-analyzer. Note I currently make no guarantees of its stability currently. I may rebase it still, so poke me if you want to base something on it. Currently you can only configure the riak_err config tree like:

{riak_err,
 [{log_mf, true},
  {error_logger_mf_dir, "log/mf"},
  {error_logger_mf_maxbytes, 10240000},
  {error_logger_mf_maxfiles, 10}
 ]},

and then run rerb:start(). followed by rerb:analyze() to analyze the reports. Listing the clusters are done by rerb:list({clusters, Similarity}) where the similarity is between 0.0 and 1.0.
The code not seen many wars yet, so it is probably quite unstable.

Another important point is that the algorithm used is a simplification of what the above linked paper does. In particular the algorithm is at least O(n^3) though I think one can pretty easily lower that to O(n^2 lg n). Also, there are some constants that can be shaved down to improve the clustering speed (It is not that fast at the moment since it walks lists all the way). Finally, it is not parallel though some parts of it lends itself to a fork/join style speedup.

I’d also want to provide the clustering tree in a raw way, so you can run arbitrary functions over it. I’ll probably implement a variant that for each cluster tells the size of it and picks three random representatives for it.


Final words

The clustering used here seems to work wonderfully well for the error reports I tried it on. It happily clusters up like error reports and different clusters contain different error reports. The Bigrams ensure that we grab the call-path rather than just a presence in a function. We only operate on the stack trace, but we could be operating on more of the error-report.

Also, I am not too sure riak_err is the right place for this. I did some alterations to log_mf_h and rb which
  • Enables error reports above 16Kb in size. No limit-function is employed yet, though it would be preferable. I could imagine a report with the state of a large heap being pushed.
  • Use compression on terms. Disk IO is slow compared to in-memory compression.
  • Fixes some redundancy in the rb grep code by making the code into a match compiler. This patch can probably flow into OTP after some more testing.
  • Use binary notation. The rb and log_mf_h code is probably so old Erlang did not have binaries at that time.
Some of the ideas here are inspired by analysis of stack traces in other languages, Java in particular. But perhaps some of these ideas are applicable to log files in general.

Finally, any comments, positive/negative are welcome. Suggestions for improving the ideas here are also much appreciated.

About Me

My Photo
Lambda-loving CS Geek. Likes metal music. Likes dogs. Likes cats. Does not like pictures of dogs and cats (unless they are lambdacats!)

Has an unhealthy coffee addiction. Calls himself the coffee zombie in the morning (BEEEEANS!)

Has a neverending curiosity gene. Likes intelligence.