Tracing Erlang programs for fun and profit
One of the neat things about Erlang is its instrumentation capability. You can instrument programs to tell you interesting things about what is happening in the program. This blog post is about a tool by Mats Cronqvist, redbug.
Redbug can be downloaded from github and is part of Mats’ eper suite of tools. Installing the tool is easy. I recommend to set the $ERL_LIBS environment to something. Mine is set at:
jlouis@illithid:~$ env | grep ERL_LIBS
ERL_LIBS=:/home/jlouis/lib/erlang/lib
so I can just drop Erlang libraries I use on-and-off into that directory and they will be picked up by any Erlang I run. It is not a good solution when you are building software with dependencies, but for smaller tools used by you, like eper or Erlang QuickCheck, this mechanism is really good.
Installing eper should be fairly simple.
Redbug invocation
Redbug can either be called from the command line. There is a redbug
shell script you can call or you can call it from the Erlang shell. The main invocation is like this,
redbug:start(TimeOut, MessageCount, MS)
where TimeOut
is a timeout in milliseconds at which redbug ceases to operate, MessageCount
sets a limit to how many reports redbug is going to make, and MS
is a match-spec matching trace points in the program. There are several possible ways to write MS
and I am only going to give some simple examples to get you started. The tool is self-documenting and you can call
redbug:help().
from a shell and get the whole story.
The timeout and messagecount limitation is very useful. Erlang has a built-in tracer on which redbug is built. But contrary to the built-in tracer, redbug protects the running system by the limitations. You can’t hose the system by accidentally set a specification which hoards all of the resources on the Erlang node.
A typical MS is written like {erlang,now,[return,stack]}
. This states we are tracing for the erlang:now
call of any arity. When we match, we want the current stack printed and we want the return value of the call:
22> redbug:start(100000, 2, {erlang,now,[return,stack]}).
ok
23> erlang:now().
22:40:12 <{erlang,apply,2}> {erlang,now,[]}
{1290,289212,474520}
shell:eval_loop/3
shell:eval_exprs/7
shell:exprs/7
22:40:12 <{erlang,apply,2}> {erlang,now,0} -> {1290,289212,474520}
quitting: msg_count
24> erlang:now().
{1290,289219,57040}
Notice the quitting: msg_count
which states after two messages from redbug, it will cease to do any further tracing. In general, the MS can also be written like, e.g., {module,function,[return,{'_', 42}]}
stating that we accept any call matching module:function(_, 42)
and gets its return stack.
A real-world bug hunt
Redbug means you don’t in general have to add a lot of debug-printing to your Erlang code. Rather, it is easier to probe systematically with redbug on a running system. I was wondering why a recent patch in etorrent seemed to work incorrectly, so we go hunting:
(etorrent@127.0.0.1)26> redbug:start(10000, 2, {etorrent_choker, split_preferred, [return]}).
ok
22:47:32 {etorrent_choker,split_preferred,[[]]}
22:47:32 {etorrent_choker,split_preferred,1} -> {[],[]}
quitting: msg_count
This choker call should not be passed the empty list, so we look into the code and find that the rechoke_info builder right before it is odd:
(etorrent@127.0.0.1)29> redbug:start(10000, 30, {etorrent_choker, build_rechoke_info, [return]}).
[..]
22:49:22 {etorrent_choker,build_rechoke_info,2} -> []
22:49:22 {etorrent_choker,build_rechoke_info,1} -> []
So - both build_rechoke_info/1
and build_rechoke_info/2
return the empty list. Something is wrong inside that function. Since the function is looking up data in other modules, we trace each of the module lookups:
(etorrent@127.0.0.1)29> redbug:start(10000, 10, {etorrent_table, get_peer_info, [return]}).
[..]
22:51:42 {etorrent_table,get_peer_info,[<0.5759.0>]}
22:51:42 {etorrent_table,get_peer_info,1} -> {peer_info,
leeching,17}
Nope, that looks right, on to the next:
(etorrent@127.0.0.1)30> redbug:start(10000, 15, {etorrent_rate_mgr, fetch_send_rate, [return]}).
ok
22:53:12 {etorrent_rate_mgr,fetch_send_rate,[4,<0.3919.0>]}
22:53:12 {etorrent_rate_mgr,fetch_send_rate,2} -> none
Oh, a return of none
is wrong here! Why does it return none
? The call looks fine, but we are looking up data in an ETS table…
At this point, we can use another nice little Erlang tool, tv
or the Table Viewer. We run:
tv:start().
find the problematic table and inspect an element, which turned out to contain the wrong information. Thus, the hunt is all about figuring out why the wrong information is entered into the table in the first place.
(etorrent@127.0.0.1)35> redbug:start(10000, 5, {ets,insert,[stack,{etorrent_send_state, {rate_mgr, {'_', undefined}, '_', '_'}}]}).
Basically, we have now constricted the output to exactly the wrong types of calls. And the culprit function is easily found in the callers stack.
Further digging shows the problem to be a race at the gproc process table which can be fixed by asking gproc to await the appearance of a given key.
View comments