Friday, January 07, 2011

Errlxperimentation: Clustering of Erlang Error Reports.

Errlxperimentation: Clustering of Erlang Error Reports.

If anyone wonders what I hacked over the holidays, this is it :)


In the Erlang/OTP system, there is a registered process with the name of error_logger. Whenever an error happens in the Erlang system by a correctly configured OTP process, it will send reports to the error_logger. The logger is then responsible for taking the report and in a pub-sub manner send it to any error handler currently configured.

One of these error handlers, log_mf_h, is a multi file log handler. It writes reports in binary term format to a set of files. Whenever a file reaches a certain size, it is rotated and a new file is written. The result is that we essentially have ring-buffer of all errors that occur in a running Erlang system.

A tool named rb the “Report Browser” allows us to read in the binary logs and process them. We can list certain errors, “grep” among them, apply filters to the reports and so on. It gives us a system for looking into a running Erlang system for errors — and handling them if we deem them to be problematic. This tool is very neat when you have a running system to inspect. You can basically leave it on for some time and then go look into the log-mf files for errors and crashes. Since the file is rotated on size the only worry is that important errors get rotated out of the log. But if you have enough log space, this won’t happen. In a modern world, you can easily leave 100 Megabytes of error logs. I have taken the leisure of having a modified log_mf_h which can compress term that goes to disk on the fly. This of course improves the size of error report sizes even more.


The experiment

An experiment that has been on my mind for some time was can I possibly do automatic clustering of similar error reports so like reports are clustered together? If this is possible, I could then just pick a representative for each error cluster and use that as a basis for handling like errors. Even if you have thousands of reports, it may be that there are only a few of them which are really different and a lot of them which are redundant.

I know nothing about clustering. But a question later on StackOverflow and after a kind soul on the #erlang IRC channel hinted me in the direction of the paper Ephemeral Document Clustering for Web Applications (2000) DOI: 10.1.1.59.131 I was set for an experiment with this. The algorithm implemented here is a variant though, mostly due to getting something fast. So it does not implement the full paper. We implement the ideas from there and the paper can be consulted if anything below is unclear (or comment, I'll try to explain).

I altered rb to rerb, short for the riak_err report browser. The Basho guys have this excellent alternative to the sasl application built into OTP, see riak_err and I used that as a starting point. I am not even sure riak_err is the right place for this, but at least this is what I did for now.


Processing

The first part is now to convert an error report. They usually look something along the lines of [{badarg, MFA1}, MFa2, ..., MFaN] where each MFA is a tuple {Module, Function, [Arg1, ..., ArgN]} — and each MFa is a tuple {Module, Function, Arity} (arity being the number of parameters of the function). Note that Erlang errors are Erlang terms, so they are easily manipulated by our code. It does make them harder to read for newcomers however.

First, we linearize and canonicalize the input. We change the tuple to [badarg, {Module, Function, length([Arg1, ..., ArgN])}, MFa2, ..., MFaN]. In other words we flatten the error term and we change the occurrence of MFA to MFa. As an example we may have an MFA [{badarg, {ets, insert, [table_foo, {123, something}]}}] and this is changed into [badarg, {ets, insert, 2}]. This ensures that we don’t care about the difference of arguments, but only care about the arity of the function and the call stack.

Next, suppose an error report has the structure [A, B, C] for arbitrary terms A, B and C. We now form bigrams ['#', A], [A, B], [B, C] and [C, '#']. The '#' is just an atom we use as padding for the boundary terms. For each of these we now run the erlang:phash2/2 function to obtain a hash of each bigram: [{phash2(Bg), 1} || Bg <- Bigrams]. Note we also smacked the hash into a tuple with a value of 1. This list of {Bigram, 1} pairs forms a vector in an (euclidian) vector space where the bigram part names a dimension in the space. If we form bigrams like this for all error reports, and take the union over all bigrams we have a set of bigrams defining the basis of the space.

Now we define a metric called similarity on two vectors, A and B, of the space by (A . B) / (||A|| * ||B||) = cos(t) where . is the dot-product and ||x|| is the norm of the vector x. This is equivalent to the cosine of the angle t (theta) between the two vectors A and B. Now, if two vectors point in the exact same direction, the angle is 0 and the cosine is 1, i.e., perfect similarity. If the two vectors are orthogonal to each other, the similarity will be 0.

The similarity of a set of vectors is had by taking the minimum of any two vectors from the set. It to a certain extent measures how “different” the cluster is because the minimum suggest the most dissimilar result is chosen.


Clustering

We form a forest of singleton clusters, and place each vector/error-report in its own cluster. Then we consider similarity of each pair of clusters. The two clusters with the highest similarity are joined. The joining also forms a binary-tree, a dendogram, of the two clusters. Whenever we join two clusters, the original clusters are removed. So whenever a join is executed, we are left with one less clusters. Recursively repeat this until we have a binary tree. This forms the cluster dendogram.

Listing the clusters is done by choosing a similarity, 0.5 say and then walking the tree until a similarity of 0.5 is reached. The subtree is then flattened into a cluster of like error reports.


Looking forward

The code is in an experimental branch of my github repo, jlouis/riak_err/n-gram-analyzer. Note I currently make no guarantees of its stability currently. I may rebase it still, so poke me if you want to base something on it. Currently you can only configure the riak_err config tree like:

{riak_err,
 [{log_mf, true},
  {error_logger_mf_dir, "log/mf"},
  {error_logger_mf_maxbytes, 10240000},
  {error_logger_mf_maxfiles, 10}
 ]},

and then run rerb:start(). followed by rerb:analyze() to analyze the reports. Listing the clusters are done by rerb:list({clusters, Similarity}) where the similarity is between 0.0 and 1.0.
The code not seen many wars yet, so it is probably quite unstable.

Another important point is that the algorithm used is a simplification of what the above linked paper does. In particular the algorithm is at least O(n^3) though I think one can pretty easily lower that to O(n^2 lg n). Also, there are some constants that can be shaved down to improve the clustering speed (It is not that fast at the moment since it walks lists all the way). Finally, it is not parallel though some parts of it lends itself to a fork/join style speedup.

I’d also want to provide the clustering tree in a raw way, so you can run arbitrary functions over it. I’ll probably implement a variant that for each cluster tells the size of it and picks three random representatives for it.


Final words

The clustering used here seems to work wonderfully well for the error reports I tried it on. It happily clusters up like error reports and different clusters contain different error reports. The Bigrams ensure that we grab the call-path rather than just a presence in a function. We only operate on the stack trace, but we could be operating on more of the error-report.

Also, I am not too sure riak_err is the right place for this. I did some alterations to log_mf_h and rb which
  • Enables error reports above 16Kb in size. No limit-function is employed yet, though it would be preferable. I could imagine a report with the state of a large heap being pushed.
  • Use compression on terms. Disk IO is slow compared to in-memory compression.
  • Fixes some redundancy in the rb grep code by making the code into a match compiler. This patch can probably flow into OTP after some more testing.
  • Use binary notation. The rb and log_mf_h code is probably so old Erlang did not have binaries at that time.
Some of the ideas here are inspired by analysis of stack traces in other languages, Java in particular. But perhaps some of these ideas are applicable to log files in general.

Finally, any comments, positive/negative are welcome. Suggestions for improving the ideas here are also much appreciated.

2 comments:

Thomas said...

Interesting blog post.

If I understand this correctly (yes, I should read the article), you characterize a distance between Erlang stack traces. The closer they are, the more it points to the same error.

Clearly, one error can expose itself as many faults and hence this is useful to find the error when you observe the faults.

But, I fail to understand the intuition behind your measure. The bigrams form the vector, as far as I can see, i.e., a vector is the collection of bigrams in a stack trace....? But then, wouldn't any type error or division by zero error have very many ways to get there, hence many different stack traces? In that case, the most important information in the stack trace is the final failure... which programmers do look at first, the error message of the last call.
There is no weight in your vector as far as I see that compensates for this?

Thus, if you can demonstrate that the similarity notion you use is a useful one, then I would surely encourage you to wrap this up in an article for the upcoming Erlang workshop. It seems like a cool thing to present there!

Jlouis said...

@Thomas:

The similarity notion may still need some tuning indeed.

Another way to look at it is this, however:

Suppose we have G = (V, E) be a directed call-graph. Every vertex v in V are {Module, Function, Arity} entries and (v,u) is in E if the function v calls u. Augment this graph with a source node, s, and a sink node, t. Every possible "start function", v, of a spawned function is connected to the source by (s,v). And every possible point of error is connected to the sink by (u,t).

A bigram [u, v] in a stack trace corresponds to an edge in E. Thus, a stack trace is a path in the call graph just mentioned from the source to the sink (the source and sink acts as "padding" for the first and last entry in the stack trace respectively).

The similarity notion for two traces/paths is thus: How many edges do the two paths have in common? If they are completely equivalent the similarity is 1.0 and if they have nothing in common the similarity is 0.0. For each edge they do have in common, the similarity is 1/n where n is is length of the longest trace in consideration.

Currently it looks as if this works quite well experimentally, but as you say, it needs a good rationale for why this is so, theoretically. Also, my experiments currently are quite non-extensive so I don't know if this works on a larger scale.

There are a couple of knobs we can twist: The weight of the bigrams are all 1. We could give the type errors or div-by-zero errors at the bottom of the trace more weight for instance, making them more important. The second knob is similarity in case of not two, but a cluster of traces. We currently take the minimum over any pair from the cluster but there are several alternative metrics to try here. Finally, the third knob is the dendogram tree formed, which can be changed from a binary tree into an n-way tree or something else entirely.

Blog Archive

About Me

My Photo
Lambda-loving CS Geek. Likes metal music. Likes dogs. Likes cats. Does not like pictures of dogs and cats (unless they are lambdacats!)

Has an unhealthy coffee addiction. Calls himself the coffee zombie in the morning (BEEEEANS!)

Has a neverending curiosity gene. Likes intelligence.