Friday, January 07, 2011

Errlxperimentation: Clustering of Erlang Error Reports.

Errlxperimentation: Clustering of Erlang Error Reports.

If anyone wonders what I hacked over the holidays, this is it :)


In the Erlang/OTP system, there is a registered process with the name of error_logger. Whenever an error happens in the Erlang system by a correctly configured OTP process, it will send reports to the error_logger. The logger is then responsible for taking the report and in a pub-sub manner send it to any error handler currently configured.

One of these error handlers, log_mf_h, is a multi file log handler. It writes reports in binary term format to a set of files. Whenever a file reaches a certain size, it is rotated and a new file is written. The result is that we essentially have ring-buffer of all errors that occur in a running Erlang system.

A tool named rb the “Report Browser” allows us to read in the binary logs and process them. We can list certain errors, “grep” among them, apply filters to the reports and so on. It gives us a system for looking into a running Erlang system for errors — and handling them if we deem them to be problematic. This tool is very neat when you have a running system to inspect. You can basically leave it on for some time and then go look into the log-mf files for errors and crashes. Since the file is rotated on size the only worry is that important errors get rotated out of the log. But if you have enough log space, this won’t happen. In a modern world, you can easily leave 100 Megabytes of error logs. I have taken the leisure of having a modified log_mf_h which can compress term that goes to disk on the fly. This of course improves the size of error report sizes even more.


The experiment

An experiment that has been on my mind for some time was can I possibly do automatic clustering of similar error reports so like reports are clustered together? If this is possible, I could then just pick a representative for each error cluster and use that as a basis for handling like errors. Even if you have thousands of reports, it may be that there are only a few of them which are really different and a lot of them which are redundant.

I know nothing about clustering. But a question later on StackOverflow and after a kind soul on the #erlang IRC channel hinted me in the direction of the paper Ephemeral Document Clustering for Web Applications (2000) DOI: 10.1.1.59.131 I was set for an experiment with this. The algorithm implemented here is a variant though, mostly due to getting something fast. So it does not implement the full paper. We implement the ideas from there and the paper can be consulted if anything below is unclear (or comment, I'll try to explain).

I altered rb to rerb, short for the riak_err report browser. The Basho guys have this excellent alternative to the sasl application built into OTP, see riak_err and I used that as a starting point. I am not even sure riak_err is the right place for this, but at least this is what I did for now.


Processing

The first part is now to convert an error report. They usually look something along the lines of [{badarg, MFA1}, MFa2, ..., MFaN] where each MFA is a tuple {Module, Function, [Arg1, ..., ArgN]} — and each MFa is a tuple {Module, Function, Arity} (arity being the number of parameters of the function). Note that Erlang errors are Erlang terms, so they are easily manipulated by our code. It does make them harder to read for newcomers however.

First, we linearize and canonicalize the input. We change the tuple to [badarg, {Module, Function, length([Arg1, ..., ArgN])}, MFa2, ..., MFaN]. In other words we flatten the error term and we change the occurrence of MFA to MFa. As an example we may have an MFA [{badarg, {ets, insert, [table_foo, {123, something}]}}] and this is changed into [badarg, {ets, insert, 2}]. This ensures that we don’t care about the difference of arguments, but only care about the arity of the function and the call stack.

Next, suppose an error report has the structure [A, B, C] for arbitrary terms A, B and C. We now form bigrams ['#', A], [A, B], [B, C] and [C, '#']. The '#' is just an atom we use as padding for the boundary terms. For each of these we now run the erlang:phash2/2 function to obtain a hash of each bigram: [{phash2(Bg), 1} || Bg <- Bigrams]. Note we also smacked the hash into a tuple with a value of 1. This list of {Bigram, 1} pairs forms a vector in an (euclidian) vector space where the bigram part names a dimension in the space. If we form bigrams like this for all error reports, and take the union over all bigrams we have a set of bigrams defining the basis of the space.

Now we define a metric called similarity on two vectors, A and B, of the space by (A . B) / (||A|| * ||B||) = cos(t) where . is the dot-product and ||x|| is the norm of the vector x. This is equivalent to the cosine of the angle t (theta) between the two vectors A and B. Now, if two vectors point in the exact same direction, the angle is 0 and the cosine is 1, i.e., perfect similarity. If the two vectors are orthogonal to each other, the similarity will be 0.

The similarity of a set of vectors is had by taking the minimum of any two vectors from the set. It to a certain extent measures how “different” the cluster is because the minimum suggest the most dissimilar result is chosen.


Clustering

We form a forest of singleton clusters, and place each vector/error-report in its own cluster. Then we consider similarity of each pair of clusters. The two clusters with the highest similarity are joined. The joining also forms a binary-tree, a dendogram, of the two clusters. Whenever we join two clusters, the original clusters are removed. So whenever a join is executed, we are left with one less clusters. Recursively repeat this until we have a binary tree. This forms the cluster dendogram.

Listing the clusters is done by choosing a similarity, 0.5 say and then walking the tree until a similarity of 0.5 is reached. The subtree is then flattened into a cluster of like error reports.


Looking forward

The code is in an experimental branch of my github repo, jlouis/riak_err/n-gram-analyzer. Note I currently make no guarantees of its stability currently. I may rebase it still, so poke me if you want to base something on it. Currently you can only configure the riak_err config tree like:

{riak_err,
 [{log_mf, true},
  {error_logger_mf_dir, "log/mf"},
  {error_logger_mf_maxbytes, 10240000},
  {error_logger_mf_maxfiles, 10}
 ]},

and then run rerb:start(). followed by rerb:analyze() to analyze the reports. Listing the clusters are done by rerb:list({clusters, Similarity}) where the similarity is between 0.0 and 1.0.
The code not seen many wars yet, so it is probably quite unstable.

Another important point is that the algorithm used is a simplification of what the above linked paper does. In particular the algorithm is at least O(n^3) though I think one can pretty easily lower that to O(n^2 lg n). Also, there are some constants that can be shaved down to improve the clustering speed (It is not that fast at the moment since it walks lists all the way). Finally, it is not parallel though some parts of it lends itself to a fork/join style speedup.

I’d also want to provide the clustering tree in a raw way, so you can run arbitrary functions over it. I’ll probably implement a variant that for each cluster tells the size of it and picks three random representatives for it.


Final words

The clustering used here seems to work wonderfully well for the error reports I tried it on. It happily clusters up like error reports and different clusters contain different error reports. The Bigrams ensure that we grab the call-path rather than just a presence in a function. We only operate on the stack trace, but we could be operating on more of the error-report.

Also, I am not too sure riak_err is the right place for this. I did some alterations to log_mf_h and rb which
  • Enables error reports above 16Kb in size. No limit-function is employed yet, though it would be preferable. I could imagine a report with the state of a large heap being pushed.
  • Use compression on terms. Disk IO is slow compared to in-memory compression.
  • Fixes some redundancy in the rb grep code by making the code into a match compiler. This patch can probably flow into OTP after some more testing.
  • Use binary notation. The rb and log_mf_h code is probably so old Erlang did not have binaries at that time.
Some of the ideas here are inspired by analysis of stack traces in other languages, Java in particular. But perhaps some of these ideas are applicable to log files in general.

Finally, any comments, positive/negative are welcome. Suggestions for improving the ideas here are also much appreciated.
Post a Comment

About Me

My Photo
Lambda-loving CS Geek. Likes metal music. Likes dogs, cats. Does not like pictures of dogs and cats (unless they are lambdacats!)

Has an unhealthy coffee addiction. Calls himself the coffee zombie in the morning (BEEEEANS!)

Has a neverending curiosity gene, loves intelligence and passion.