1. Tracing Erlang programs for fun and profit

    One of the neat things about Erlang is its instrumentation capability. You can instrument running programs to tell you interesting things about what is happening inside them. This blog post is about a tool by Mats Cronqvist, redbug.

    Redbug can be downloaded from github and is part of Mats’ eper suite of tools. Installing the tool is easy. I recommend setting the $ERL_LIBS environment variable. Mine is set to:

       jlouis@illithid:~$ env | grep ERL_LIBS
      ERL_LIBS=:/home/jlouis/lib/erlang/lib
    

    so I can just drop Erlang libraries I use on-and-off into that directory and they will be picked up by any Erlang node I run. It is not a good solution when you are building software with dependencies, but for smaller tools you use yourself, like eper or Erlang QuickCheck, this mechanism works really well.

    Installing eper should be fairly simple.

    Redbug invocation

    Redbug can be called either from the command line, via the redbug shell script, or from the Erlang shell. The main invocation is like this,

       redbug:start(TimeOut, MessageCount, MS)
    

    where TimeOut is a timeout in milliseconds after which redbug ceases to operate, MessageCount sets a limit on how many reports redbug is going to make, and MS is a match-spec matching trace points in the program. There are several possible ways to write MS and I am only going to give some simple examples to get you started. The tool is self-documenting and you can call

       redbug:help().
    

    from a shell and get the whole story.

    The timeout and message count limitations are very useful. Erlang has a built-in tracer on which redbug is built, but unlike the built-in tracer, redbug protects the running system through these limitations. You can’t hose the system by accidentally setting a specification which hoards all of the resources on the Erlang node.

    A typical MS is written like {erlang,now,[return,stack]}. This states that we are tracing calls to erlang:now of any arity. When a call matches, we want the current stack printed and we want the return value of the call:

      22> redbug:start(100000, 2, {erlang,now,[return,stack]}).
     ok
     23> erlang:now().
    
     22:40:12 <{erlang,apply,2}> {erlang,now,[]}
     {1290,289212,474520}
      shell:eval_loop/3
      shell:eval_exprs/7
      shell:exprs/7
    
     22:40:12 <{erlang,apply,2}> {erlang,now,0} -> {1290,289212,474520}
     quitting: msg_count
     24> erlang:now().
     {1290,289219,57040}
    

    Notice the quitting: msg_count, which states that after two messages from redbug, it will cease to do any further tracing. In general, the MS can also be written like, e.g., {module,function,[return,{'_', 42}]}, stating that we accept any call matching module:function(_, 42) and want its return value.
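
    As a minimal sketch, assuming a module my_mod with a function f/2 (both names invented for illustration), such an argument-matching trace could be started like this:

       %% Trace for 10 seconds or at most 10 reports, matching only calls
       %% of the form my_mod:f(_, 42) and printing their return values.
       redbug:start(10000, 10, {my_mod, f, [return, {'_', 42}]}).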

    A real-world bug hunt

    With redbug you generally don’t have to add a lot of debug printing to your Erlang code. Rather, it is easier to probe a running system systematically with redbug. I was wondering why a recent patch in etorrent seemed to work incorrectly, so we go hunting:

    (etorrent@127.0.0.1)26> redbug:start(10000, 2, {etorrent_choker, split_preferred, [return]}).
    ok
    22:47:32  {etorrent_choker,split_preferred,[[]]}
    22:47:32  {etorrent_choker,split_preferred,1} -> {[],[]}
    quitting: msg_count
    

    This choker call should not be passed the empty list, so we look into the code and find that the rechoke_info builder right before it looks odd:

    (etorrent@127.0.0.1)29> redbug:start(10000, 30, {etorrent_choker, build_rechoke_info, [return]}).
    [..]
    22:49:22  {etorrent_choker,build_rechoke_info,2} -> []
    22:49:22  {etorrent_choker,build_rechoke_info,1} -> []
    

    So - both build_rechoke_info/1 and build_rechoke_info/2 return the empty list. Something is wrong inside those functions. Since they look up data in other modules, we trace each of the lookups:

    (etorrent@127.0.0.1)29> redbug:start(10000, 10, {etorrent_table, get_peer_info, [return]}).
    [..]
    22:51:42  {etorrent_table,get_peer_info,[<0.5759.0>]}
     22:51:42  {etorrent_table,get_peer_info,1} -> {peer_info,leeching,17}
    

    Nope, that looks right, on to the next:

    (etorrent@127.0.0.1)30> redbug:start(10000, 15, {etorrent_rate_mgr, fetch_send_rate, [return]}).
    ok
    22:53:12  {etorrent_rate_mgr,fetch_send_rate,[4,<0.3919.0>]}
    22:53:12  {etorrent_rate_mgr,fetch_send_rate,2} -> none
    

    Oh, a return of none is wrong here! Why does it return none? The call looks fine, but we are looking up data in an ETS table…

    At this point, we can use another nice little Erlang tool, tv or the Table Viewer. We run:

    tv:start().
    

    find the problematic table, and inspect an element, which turns out to contain the wrong information. Thus, the hunt is now about figuring out why the wrong information is entered into the table in the first place.

    (etorrent@127.0.0.1)35> redbug:start(10000, 5, {ets,insert,[stack,{etorrent_send_state, {rate_mgr, {'_', undefined}, '_', '_'}}]}).
    

    Basically, we have now constrained the output to exactly the wrong kind of calls. And the culprit function is easily found in the caller's stack.

    Further digging shows the problem to be a race on the gproc process table, which can be fixed by asking gproc to await the appearance of a given key.
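
    A sketch of that kind of fix, using gproc:await/2 with an invented key {peer_rate, Id}, could look like this:

       %% Instead of looking the registered process up immediately and racing
       %% with its registration, wait until the key appears in gproc, here
       %% with a 5 second timeout. The key name is invented for this sketch.
       await_rate_mgr(Id) ->
           Key = {n, l, {peer_rate, Id}},
           {Pid, _Value} = gproc:await(Key, 5000),
           Pid.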


  2. On Erlang, State and Crashes

    There are two things which are ubiquitous in Erlang:

    • A Process has an internal state.
    • When the process crashes, that internal state is gone.

    These two facts pose some problems for new Erlang programmers. If my state is gone, then what should I do? The short answer to the question is that some other process must have the state and provide the backup, but this is hardly a fulfilling answer: it is turtles all the way down. Now, that other process might die, and then another beast of a process must have the state. And this observation continues ad infinitum. So what is the Erlang programmer to do? This is my attempt at answering the question.

    State Classification

    The internal state of an Erlang process can naturally be classified. First, different kinds of state have different value. State related to the current computation residing on the stack may not be important at all after a process crash. It crashed for a reason and chances are that the exact same state will bring down the process again with the same error. The same observation might apply to some internal state: it is like a scratchpad or a blackboard: when the next lecture starts, it can be erased because it has served its purpose.

    Next is static state. If a process is governing a TCP/IP connection, that process should probably connect to the same TCP/IP Address/Port pair if it crashes and is restarted. We call that kind of data configuration or static data. It is there, but it is not meant to change over the lifetime of the application, or only to change rarely.

    Finally, our crude classification of state has dynamic data. This class is the data we generate over the course of the running program, get from user input, create because other programs communicate with us, and so on. The class can be split into two major components: state we can compute from other data and state we cannot compute. The computable state is somewhat less of a problem: we can basically just recompute it after a crash, so the real problem is the other kind, the user- or program-supplied information.

    In other words, we have three major kinds of state: scratchpad, static and dynamic.

    The Error Kernel

    Erlang programs have a concept called the error kernel. The kernel is the part of the program which must be correct for the program to operate correctly. Good Erlang design begins with identifying the error kernel of the system: which part must not fail, or it will bring down the whole system? Once you have the kernel identified, you seek to make it minimal. Whenever the kernel is about to do an operation which is dangerous and might crash, you "outsource" that computation to another process, a dumb worker. If that worker crashes and is killed, nothing really bad has happened, since the kernel keeps going.
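
    A minimal sketch of that outsourcing pattern, with invented function names, could look like this:

       %% Delegate a risky computation to a short-lived, monitored worker.
       %% If the worker crashes, the kernel only receives a 'DOWN' message;
       %% its own state is untouched.
       outsource(Fun) ->
           Self = self(),
           {Pid, Ref} = spawn_monitor(fun() -> Self ! {self(), Fun()} end),
           receive
               {Pid, Result} ->
                   erlang:demonitor(Ref, [flush]),
                   {ok, Result};
               {'DOWN', Ref, process, Pid, Reason} ->
                   {error, Reason}
           end.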

    Identifying the kernel plugs the "turtles all the way down" hole. As soon as the kernel is hit, we assume correctness. But since the kernel is small, the trusted computing base of our program is likewise small. We only need to trust a small part of the program, and that part is also fairly simple.

    A visualization is this: A program is a patchwork of small squares. Some of the squares are red, and these are the "error kernel". Most (naively implemented) imperative programs are mostly red, save for a few squares. These are the squares where exceptions are handled explicitly and the error is correctly mitigated. The kernel is thus fairly large. In contrast, robustness-aware Erlang programs have few red squares - most of the patchwork is white. It is a design-goal to get as few red squares as possible. It is achieved by delegating dangerous work to the white areas so a crash does not affect the kernel.

    Handling the state classes

    Each class must be handled differently. First there is the scratchpad/blackboard class. If a process crashes, this class is interesting because it contains the stack trace and usually the data which tells a story - namely how and why the process crashed. We usually export this data via SASL's error logger, so we can look at a crash report and understand what went wrong. After all, the internal state is gone after the crash report is done and logged.

    Next, there is the static class. The simplest thing is to have another process feed in the static data. This can be done by, among other things, the supervisor, by asking an ETS table, by asking GProc (if you use gproc in your system), by asking another process, or by discovery through the call application:get_env/2. It is important to note just how static the data is: you have a few options with differing advantages and disadvantages, and which one to choose depends on how much the data is going to change.
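
    As a sketch of the application:get_env/2 variant, a process holding TCP/IP configuration might simply re-read it in init/1 every time it is (re)started; the application and key names below are invented:

       -record(state, {host, port}).

       %% Only init/1 of the gen_server is shown. The static part of the
       %% state is re-read from the application environment on every
       %% (re)start instead of being preserved across the crash.
       init([]) ->
           {ok, Host} = application:get_env(my_app, tracker_host),
           {ok, Port} = application:get_env(my_app, tracker_port),
           {ok, #state{host = Host, port = Port}}.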

    Finally, the fully dynamic data is the nasty culprit. If you can recompute the data, you are lucky. As an example from my etorrent application, each peer has a dynamic table of what parts of a torrent file the given peer has, so the controlling process keeps an internal table of this information. But if we crash and reconnect to the peer, the bittorrent protocol will, by design, send us this information again, so that information is hardly worth keeping around. Other times, you can simply recalculate the information when your process restarts, and that is almost never a problem either.

    So what about the user-supplied data? This is where the error kernel comes in. You need to protect data which you cannot reconstruct. You protect it by shoving it into the error kernel and keeping some simple state-maintenance processes there to handle it. A word of warning though: if your state is corrupted, it means that processes basing their work on that state will do something wrong. To mitigate this, it is important to do some general sanity checking of your data. Make it a priority to check your data against invariants where you can find them, and don't blindly trust non-error-kernel parts of the system.
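
    A sketch of such a sanity check inside a state-keeping process, with an invented invariant (a non-negative byte count), might be:

       %% Refuse to store obviously corrupt data in the error kernel.
       %% The message shape and the invariant are invented for this sketch.
       handle_call({store, Key, Bytes}, _From, Table)
         when is_integer(Bytes), Bytes >= 0 ->
           {reply, ok, dict:store(Key, Bytes, Table)};
       handle_call({store, _Key, _Bytes}, _From, Table) ->
           {reply, {error, invariant_violation}, Table}.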

    If a process crashes, you should definitely think about how much of its internal state you want to recycle. If you recycle everything, you risk hitting the exact same bug again and crashing. Rather, there may be a benefit to only recycling parts of the internal state.

    The next step: Onion-layered Error kernels

    The next logical step up is to recognize that the error kernel is not discrete. You want to regard the error kernel as an onion. Whenever you peel off a layer, you get a step closer to the trusted computing base of the application. Your system design then becomes a matter of pushing state maintenance down to the outermost layer of the onion where it still makes sense. This in effect protects one part of the application from the others. In Etorrent, we can download multiple torrent files at the same time. If one such torrent download fails, there is no reason it should affect the other torrent downloads. So we add a layer to the onion: state which is local to a torrent is kept in a separate supervisor tree, to contain the error if that part fails.
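
    In supervisor terms, such an onion layer is just another supervisor in the tree. A sketch of a per-torrent supervisor, with module and child names invented for this illustration, could be:

       -module(torrent_sup).
       -behaviour(supervisor).
       -export([start_link/1, init/1]).

       %% One supervisor per torrent: if anything below it fails repeatedly,
       %% only this torrent's subtree is taken down, not its siblings.
       start_link(TorrentId) ->
           supervisor:start_link(?MODULE, [TorrentId]).

       init([TorrentId]) ->
           Control  = {control, {torrent_control, start_link, [TorrentId]},
                       permanent, 5000, worker, [torrent_control]},
           PeerPool = {peer_pool, {peer_pool_sup, start_link, [TorrentId]},
                       permanent, infinity, supervisor, [peer_pool_sup]},
           {ok, {{one_for_all, 5, 60}, [Control, PeerPool]}}.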

    The net effect is program robustness: a bug in the program will suddenly need perseverance. It has to penetrate several layers of the onion before it can take the full program down. And if the Erlang system is well designed, even the gravest bugs can only penetrate so far before the stopping power of the onion layers brings them to a halt.

    Furthermore, it underpins a mantra of Erlang programs: Small bugs have small impact. They won't even penetrate the first layer. And they will hardly be a scratch in the fabric of computing.

    (Aside: Good computer security engineering uses the same onion-layered model. There are strong similarities between protecting a computer system against a well-armed intruder and protecting a program against an aggressive, persistent, dangerous and maiming bug. End of Aside)

    EDIT: smaller language changes where my first post was a bit drafty.

