On Erlang, State and Crashes
There are two things which are ubiquitous in Erlang:
- A process has an internal state.
- When the process crashes, that internal state is gone.
These two facts pose some problems for new Erlang programmers. If my state is gone, what should I do then? The short answer is that some other process must hold the state and provide the backup, but this is hardly a fulfilling answer: It is turtles all the way down. That other process might die too, and then yet another beast of a process must hold the state. This observation continues ad infinitum. So what is the Erlang programmer to do? This is my attempt at answering the question.
The internal state of an Erlang process can naturally be classified. First, not all state has the same value. State related to the current computation, residing on the stack, may not be important at all after a process crash. The process crashed for a reason, and chances are that the exact same state will bring it down again with the same error. The same observation might apply to some of the internal state: It is like a scratchpad or a blackboard: when the next lecture starts, it can be erased because it has served its purpose.
Next is static state. If a process is governing a TCP/IP connection, that process should probably connect to the same address/port pair when it crashes and is restarted. We call that kind of data configuration or static data. It is there, but it is not meant to change over the course of the application, or it changes only rarely.
Finally, our crude classification has dynamic state. This class is the data we generate over the course of the running program, get from user input, create because other programs communicate with us and so on. The class can be split into two major components: State we can compute from other data and state we cannot compute. The computable state is somewhat less of a problem. We can basically just recompute it after a crash, so the real problem is the other kind: user/program-supplied information.
In other words, we have three major kinds of state: scratchpad, static and dynamic.
The Error Kernel
Erlang programs have a concept called the error kernel. The kernel is the part of the program which must be correct for the program as a whole to operate correctly. Good Erlang design begins with identifying the error kernel of the system: What part must not fail, because its failure will bring down the whole system? Once you have the kernel identified, you seek to make it minimal. Whenever the kernel is about to do an operation which is dangerous and might crash, you "outsource" that computation to another process, a dumb worker. If it crashes or is killed, nothing really bad has happened - since the kernel keeps going.
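As a minimal sketch of this outsourcing, assume the kernel process wants to run a risky fun without being exposed to its crashes. The module name kernel_sketch and the function run_dangerous/2 are made up for illustration; only spawn_monitor/1, exit/2 and erlang:demonitor/2 are standard calls.

```erlang
-module(kernel_sketch).
-export([run_dangerous/2]).

%% Run Fun in a separate, monitored worker so that a crash in Fun
%% cannot take down the caller (the "kernel" process).
%% Returns {ok, Result}, {error, Reason} or {error, timeout}.
run_dangerous(Fun, Timeout) when is_function(Fun, 0) ->
    Parent = self(),
    Ref = make_ref(),
    {Pid, MRef} = spawn_monitor(fun() -> Parent ! {Ref, Fun()} end),
    receive
        {Ref, Result} ->
            %% The worker finished; drop its pending 'DOWN' message.
            erlang:demonitor(MRef, [flush]),
            {ok, Result};
        {'DOWN', MRef, process, Pid, Reason} ->
            %% The worker crashed; the kernel only sees a value.
            {error, Reason}
    after Timeout ->
        exit(Pid, kill),
        %% Wait until the worker is gone, then drop a late reply if any.
        receive {'DOWN', MRef, process, Pid, _} -> ok end,
        receive {Ref, _Late} -> ok after 0 -> ok end,
        {error, timeout}
    end.
```

The worker is monitored rather than linked, so its death reaches the kernel as an ordinary message instead of an exit signal. Something like kernel_sketch:run_dangerous(fun() -> exit(boom) end, 1000) would hand back {error, boom} while the caller keeps running.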
Identifying the kernel plugs the "turtles all the way down" hole. As soon as the kernel is hit, we assume correctness. But since the kernel is small, the trusted computing base of our program is likewise small. We only need to trust a small part of the program, and that part is also fairly simple.
A visualization is this: A program is a patchwork of small squares. Some of the squares are red, and these are the "error kernel". Most (naively implemented) imperative programs are mostly red, save for a few white squares. The white squares are the places where exceptions are handled explicitly and the error is correctly mitigated. The kernel is thus fairly large. In contrast, robustness-aware Erlang programs have few red squares - most of the patchwork is white. It is a design goal to get as few red squares as possible. It is achieved by delegating dangerous work to the white areas so a crash does not affect the kernel.
Handling the state classes
Each class must be handled differently. First there is the scratchpad/blackboard class. If a process crashes, this class is interesting because it contains the stack trace and usually the data which tells a story - namely how and why the process crashed. We usually export this data via SASL's error logger, so we can look at a crash report and understand what went wrong. After all, the internal state is gone once the crash report is done and logged.
Next, there is the static class. The simplest thing is to have another process feed in the static data. This can be done by, among others, the supervisor, by asking an ETS table, by asking gproc (if you use gproc in your system), by asking another process or by discovery through the call application:get_env/2. It is important to note just how static the data is - you have a few options with differing advantages and disadvantages. Which one to choose depends on how much the data is going to change.
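As a small sketch of the application:get_env/2 route, assume a process needs the address/port pair it should connect to. The module name static_config_sketch and the keys peer_host and peer_port are hypothetical; application:get_env/2 itself returns {ok, Value} or undefined.

```erlang
-module(static_config_sketch).
-export([connect_target/1]).

%% Look up the static connection target in the environment of
%% application App. The keys peer_host and peer_port are made up
%% for this sketch.
connect_target(App) ->
    case {application:get_env(App, peer_host),
          application:get_env(App, peer_port)} of
        {{ok, Host}, {ok, Port}} when is_integer(Port) ->
            {ok, {Host, Port}};
        _ ->
            %% A key is missing or the port is malformed.
            {error, missing_config}
    end.
```

A restarted process can call connect_target/1 from its initialization code, so the static data survives the crash without the process having to remember it.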
Finally, the fully dynamic data is the nasty culprit. If you can recompute the data, you are lucky. As an example from my etorrent application, each peer has a dynamic table of what parts of a torrent file the given peer has. So the controlling process has an internal table of this information. But if we crash and reconnect to the peer, the peer will, by virtue of the BitTorrent protocol, send us this information again. So that information is hardly worth keeping around. Other times, you can simply recalculate the information when your process restarts, and that is almost never a problem either.
So what about the user-supplied data? This is where the error kernel comes in. You need to protect data which you cannot reconstruct. You protect it by shoving it into the error kernel and keeping some simple state maintenance processes there to handle the state. A word of warning though: If your state is corrupted, it means that processes basing their work on the state will do something wrong. To mitigate this, it is important to do some general sanity checking of your data. Make it a priority to check your data for invariants if you find them. And don't blindly trust non-error-kernel parts of the system.
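A state maintenance process of this kind can be sketched as a very small gen_server which refuses updates that break an invariant. The module name state_keeper_sketch, the functions store/3 and fetch/2, and the invariant itself (values are non-negative integers, say an uploaded byte count) are assumptions made for the example.

```erlang
-module(state_keeper_sketch).
-behaviour(gen_server).
-export([start_link/0, store/3, fetch/2]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Ask the keeper to store Value under Key. Returns ok or {error, invalid}.
store(Keeper, Key, Value) ->
    gen_server:call(Keeper, {store, Key, Value}).

%% Returns {ok, Value} or error, like maps:find/2.
fetch(Keeper, Key) ->
    gen_server:call(Keeper, {fetch, Key}).

init([]) ->
    {ok, #{}}.

%% The keeper does no dangerous work: it checks the invariant and
%% updates a map, nothing else.
handle_call({store, Key, Value}, _From, State) ->
    case valid(Value) of
        true  -> {reply, ok, State#{Key => Value}};
        false -> {reply, {error, invalid}, State}
    end;
handle_call({fetch, Key}, _From, State) ->
    {reply, maps:find(Key, State), State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Hypothetical invariant: a value is a non-negative integer.
valid(V) ->
    is_integer(V) andalso V >= 0.
```

Bad input from a less trusted part of the system is rejected with {error, invalid} rather than stored, so the keeper neither crashes nor hands corrupted state on to other processes.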
If a process crashes, you should definitely think about how much of its internal state you want to recycle. If you recycle everything, you risk hitting the exact same bug again and crashing. Rather, there may be a benefit to only recycling parts of the internal state.
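Partial recycling can be as simple as a function which builds the fresh state from the old one. In this sketch the state is assumed to be a map; the module name recycle_sketch and the field names host, port, buffer and retries are all hypothetical.

```erlang
-module(recycle_sketch).
-export([restart_state/1]).

%% Build the state for a restarted process from the last known state
%% of its predecessor: keep the static fields, reset the scratchpad.
restart_state(Old) when is_map(Old) ->
    Kept = maps:with([host, port], Old),
    Kept#{buffer => <<>>, retries => 0}.
```

Whatever was sitting in the buffer when the process died is deliberately thrown away, since it is the most likely thing to trigger the same crash again.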
The next step: Onion-layered Error kernels
The next logical step is to recognize that the error kernel is not discrete. You want to regard the error kernel as an onion. Whenever you peel off a layer, you get a step closer to the trusted computing base of the application. Then your system design is to push down state maintenance to the outermost layer in the onion where it still makes sense. This in effect protects one part of the application from others. In etorrent, we can download multiple torrent files at the same time. If one such torrent download fails, there is no reason it should affect the other torrent downloads. We can add a layer to the onion: Some state which is local to the torrent is kept in a separate supervisor tree - to mitigate the error if that part fails.
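The isolation between torrents can be sketched with a supervisor that starts one child per torrent. This is a simplification and not how etorrent is actually laid out: each torrent here is a single stand-in process rather than a supervisor tree of its own, and the names torrent_sup_sketch, start_torrent/2 and start_worker/1 are invented for the example.

```erlang
-module(torrent_sup_sketch).
-behaviour(supervisor).
-export([start_link/0, start_torrent/2, start_worker/1]).
-export([init/1]).

start_link() ->
    supervisor:start_link(?MODULE, []).

%% Start an isolated child for the torrent identified by Id.
start_torrent(Sup, Id) ->
    supervisor:start_child(Sup, [Id]).

init([]) ->
    SupFlags = #{strategy => simple_one_for_one,
                 intensity => 5,
                 period => 10},
    %% transient: restart a torrent only if it exits abnormally.
    Child = #{id => torrent,
              start => {?MODULE, start_worker, []},
              restart => transient},
    {ok, {SupFlags, [Child]}}.

%% Hypothetical stand-in for a real torrent process.
start_worker(Id) ->
    {ok, spawn_link(fun() -> worker_loop(Id) end)}.

worker_loop(Id) ->
    receive
        crash ->
            exit({torrent_failed, Id});
        {whoami, From} ->
            From ! {torrent, Id},
            worker_loop(Id)
    end.
```

If one torrent process receives crash and dies, the supervisor restarts only that child. The other torrents keep their pids and their state, and the failure stops at this layer unless it repeats more than five times within ten seconds.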
The net effect is program robustness: A bug in the program will suddenly need perseverance. It has to penetrate several layers in the onion before it can take the full program down. And if the Erlang system is well designed, even the most grave bugs can only penetrate so far before the stopping power of the onion layers brings it to a halt.
Furthermore, it underpins a mantra of Erlang programs: Small bugs have small impact. They won't even penetrate the first layer. And they will hardly be a scratch in the fabric of computing.
(Aside: Good computer security engineering uses the same onion-layered model. There are strong similarities between protecting a computer system against a well-armed intruder and protecting a program against an aggressive, persistent, dangerous and maiming bug. End of Aside)
EDIT: smaller language changes where my first post was a bit drafty.