In this post, I describe why Erlang is different from most other language runtimes, and why it often forgoes throughput for lower latency.
TL;DR - Erlang is different from most other language runtimes in that it targets different values. This explains why it often seems to perform worse when you have few processes, but well when you have many.
From time to time the question of Erlang scheduling gets asked by different people. While this is an abridged version of the real thing, it can act as a way to describe how Erlang operates its processes. Do note that I am taking Erlang R15 as the base point here. If you are a reader from the future, things might have changed quite a lot—though it is usually fair to assume things only got better, in Erlang and other systems.
At the operating-system level, Erlang usually runs one thread per core in the machine. Each of these threads runs what is known as a scheduler. This is to make sure all cores of the machine can potentially do work for the Erlang system. The cores may be bound to schedulers through the +sbt flag, which means the schedulers will not "jump around" between cores. It only works on modern operating systems, so OSX can't do it, naturally. Binding means that the Erlang system knows about the processor layout and the associated affinities, which is important due to caches, migration times and so on. Often the +sbt flag can speed up your system, and at times by quite a lot.
The +A flag defines the number of async threads in the async thread pool. This pool can be used by drivers to block on an operation, such that the schedulers can still do useful work while one of the pool threads is blocked. Most notably, the thread pool is used by the file driver to speed up file I/O - but not network I/O.
While the above describes a rough layout towards the OS kernel, we still need to address the concept of an Erlang (userland) process. When you call spawn(fun worker/0), a new process is constructed by allocating its process control block in userland. This usually amounts to some 600+ bytes, varying between 32- and 64-bit architectures. Runnable processes are placed in the run-queue of a scheduler and will thus be run later when they get a time-slice.
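Because the process control block lives in userland and is small, spawning is cheap. A minimal sketch (module and function names are mine, not from the post):

```erlang
-module(spawn_demo).
-export([start/1, worker/0]).

%% Spawn N processes; each becomes runnable, is placed in some
%% scheduler's run-queue, and runs when it gets a time-slice.
start(N) ->
    [spawn(fun worker/0) || _ <- lists:seq(1, N)],
    erlang:system_info(process_count).

worker() ->
    receive stop -> ok end.  % block cheaply until told to stop
```

Spawning tens of thousands of such processes is routine in an Erlang system; each only costs the few hundred bytes described above plus a small initial heap.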
Before diving into a single scheduler, I want to describe a little bit about how migration works. Every once in a while, processes are migrated between schedulers according to a quite intricate heuristic. The aim of the heuristic is to balance load over multiple schedulers so all cores get utilized fully. But the algorithm also considers whether there is enough work to warrant starting up new schedulers. If not, it is better to keep a scheduler turned off, which means its thread has nothing to do, and in turn the core can enter power-save mode and get turned off. Yes, Erlang conserves power if possible. Schedulers can also work-steal from each other if they are out of work. For the details of this, see [1].
IMPORTANT: In R15, schedulers are started and stopped in a "lagged" fashion. What this means is that Erlang/OTP recognizes that starting or stopping a scheduler is rather expensive, so it only does this if really needed. Suppose there is no work for a scheduler. Rather than immediately putting it to sleep, it will spin for a little while in the hope that work arrives soon. If work arrives, it can be handled immediately with low latency. On the other hand, this means you cannot use tools like top(1) or the OS kernel to measure how efficiently your system is executing. You must use the internal calls in the Erlang system. Many people were incorrectly assuming that R15 was worse than R14 for exactly this reason.
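One such internal call is scheduler_wall_time, available from R15B01. A sketch of measuring real utilization from inside the VM rather than with top(1):

```erlang
%% Busy-spinning schedulers make top(1) misleading, but
%% scheduler_wall_time only counts time spent doing actual work.
erlang:system_flag(scheduler_wall_time, true),
Ts0 = lists:sort(erlang:statistics(scheduler_wall_time)),
timer:sleep(1000),
Ts1 = lists:sort(erlang:statistics(scheduler_wall_time)),
%% Per-scheduler fraction of wall time spent on real work:
Util = [{Id, (A1 - A0) / (T1 - T0)}
        || {{Id, A0, T0}, {Id, A1, T1}} <- lists:zip(Ts0, Ts1)].
```

Each tuple from statistics/1 is {SchedulerId, ActiveTime, TotalTime}; the ratio of deltas over an interval gives honest utilization per scheduler.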
Each scheduler runs two types of jobs: process jobs and port jobs. These are run with priorities like in an operating system kernel and are subject to the same worries and heuristics. You can flag processes as high-priority, low-priority and so on. A process job executes a process for a little while. A port job considers ports. To the uninformed, a "port" in Erlang is a mechanism for communicating with the outside world. Files, network sockets and pipes to other programs are all ports. Programmers can add "port drivers" to the Erlang system in order to support new types of ports, but that does require writing C code. One scheduler will also run polling on network sockets to read in new data from those.
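Ports can be opened directly from Erlang; a minimal sketch, assuming a Unix system with cat(1) on the path:

```erlang
%% Open a port running an external program and exchange data with
%% it. The port is owned by the calling process, and its I/O is
%% scheduled as port jobs like any other port in the system.
Port = open_port({spawn, "cat"}, [binary, stream]),
port_command(Port, <<"hello\n">>),
receive
    {Port, {data, Data}} -> Data   % cat echoes our bytes back
end.
```

The same message-based interface covers files, sockets and pipes, which is why the scheduler can treat all outside-world communication uniformly.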
Both processes and ports have a "reduction budget" of 2000 reductions. Any operation in the system costs reductions: function calls in loops, calling built-in functions (BIFs), garbage collecting the heap of that process[n1], storing to or reading from ETS, and sending messages (the size of the recipient's mailbox counts; large mailboxes are more expensive to send to). This is quite pervasive, by the way. The Erlang regular expression library has been modified and instrumented even though it is written in C code. So when you run a long-running regular expression, reductions are counted against your process and it will be preempted several times while the expression runs. Ports as well! Doing I/O on a port costs reductions, sending distributed messages has a cost, and so on. Much time has been spent ensuring that any kind of progress in the system has a reduction cost[n2].
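You can watch the reduction counter tick for your own process; a small sketch (the exact numbers vary between releases and machines, so none are shown):

```erlang
%% Reductions accumulate for any work a process does, including
%% plain function calls inside a fold over a list.
{reductions, R0} = erlang:process_info(self(), reductions),
lists:foldl(fun(X, Acc) -> X + Acc end, 0, lists:seq(1, 100000)),
{reductions, R1} = erlang:process_info(self(), reductions),
R1 - R0.  % some positive count charged for the fold above
```

Every 2000 of these, the scheduler is free to preempt the process and run someone else.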
In effect, this is what makes me say that Erlang is one of a few languages that actually does preemptive multitasking and gets soft-realtime right. Also it values low latency over raw throughput, which is not common in programming language runtimes.
To be precise, preemption[2] means that the scheduler can force a task off execution. Everything based on cooperation cannot do this: Python Twisted, Node.js, LWT (OCaml) and so on. But more interestingly, neither Go (golang.org) nor Haskell (GHC) is fully preemptive. Go only switches context on communication, so a tight loop can hog a core. GHC switches upon memory allocation (which admittedly is a very common occurrence in Haskell programs). The problem in these systems is that hogging a core for a while (imagine doing an array operation in either language) will affect the latency of the system.
This leads to soft-realtime[3] which means that the system will degrade if we fail to meet a timing deadline. Say we have 100 processes on our run-queue. The first one is doing an array-operation which takes 50ms. Now, in Go or Haskell/GHC[n3] this means that tasks 2-100 will take at least 50ms. In Erlang, on the other hand, task 1 would get 2000 reductions, which is sub 1ms. Then it would be put in the back of the queue and tasks 2-100 would be allowed to run. Naturally this means that all tasks are given a fair share.
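A small sketch of this in Erlang (module name and numbers are mine): even with a pure-Erlang busy loop running, a timer still fires roughly on time, because the loop burns reductions and is preempted:

```erlang
-module(preempt_demo).
-export([run/0]).

%% A tight, allocation-free loop. It still costs reductions per
%% call, so the scheduler preempts it about every 2000 reductions.
spin() -> spin().

run() ->
    spawn(fun spin/0),           % demo only: the spinner is left running
    T0 = erlang:now(),
    receive after 50 -> ok end,  % ask to be woken after 50 ms
    %% Elapsed milliseconds; close to 50 despite the busy loop,
    %% where a cooperative runtime on one core could stall here.
    timer:now_diff(erlang:now(), T0) div 1000.
```

The same loop written as a goroutine in Go (pre-preemption) or as a non-allocating loop in GHC could hold its core indefinitely.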
Erlang is meticulously built around ensuring low-latency soft-realtime properties. The reduction count of 2000 is quite low and forces many small context switches. It is quite expensive to break up long-running BIFs so they can be preempted mid-computation. But all this ensures that an Erlang system tends to degrade gracefully when loaded with more work. It also means that for a company like Ericsson, where low latency matters, there is no real alternative out there. You can't magically take another throughput-oriented language and obtain low latency. You will have to work for it. And if low latency matters to you, then frankly not picking Erlang is in many cases an odd choice.
[1] "Characterizing the Scalability of Erlang VM on Many-core Processors" http://kth.diva-portal.org/smash/record.jsf?searchId=2&pid=diva2:392243
[2] http://en.wikipedia.org/wiki/Preemption_(computing)
[3] http://en.wikipedia.org/wiki/Real-time_computing
[n1] Process heaps are per-process so one process can't affect the GC time of other processes too much.
[n2] This section is also why one must beware of long-running NIFs. By default they do not preempt, nor do they bump the reduction counter. So they can introduce latency in your system.
[n3] Imagine a single core here, multicore sort of "absorbs" this problem up to core-count, but the problem still persists.
(Smaller edits made to the document at Mon 14th Jan 2013)
Go is cooperative by default. This is useful as it removes a lot of the locking overhead (it's also an artefact of a simple scheduler that still needs more work). Go can be made preemptive by setting GOMAXPROCS > 1, which you would be crazy not to do on a server.
http://golang.org/doc/go_faq.html#Concurrency
Go is not preemptive with GOMAXPROCS > 1. I have created a small toy program to show that this is the case: https://gist.github.com/4523487
The idea of the program is to start off 2 goroutines and then wait a bit to make sure they get on the CPUs. If you set GOMAXPROCS=2, this program infinite-loops and never returns. In a fully preemptive environment it would return in 1 second since the main thread would print and then exit.
You could crank up GOMAXPROCS, but this only pushes the problem out further until I increase the 'loop' constant by the same amount.
Why isn't Mac OSX "modern", in your opinion?
Can't it run multicore or have a foreign run-time do farming?
/aclassifier, Trondheim
By the way, assuming that all CSP-based concurrency (although based on cooperative scheduling) would have the same run-time characteristics would not be correct. Also, the art of programming in any paradigm needs a thorough understanding of it. Doing a large loop without a synchronization point could be a problem for latency, as you mention. But it would also depend on what you do in the interrupts.
/aclassifier
I fully agree with the CSP remarks. CSP, defined informally, means different things to different people, as OOP does. As an example, Go is often called CSP(-like), but there are parts of the model which are closer to the π-calculus.
Then there is CSP in a formal verification setting. Say you encode your CSP in a logical framework and give it operational semantics. There are probably several such valid encodings which people would claim to be CSP, yet their starvation semantics would be quite different.
Finally, there is the faithfulness of the implementation. You may wish to ignore certain implementation details or handle undefined behaviour in different ways.
When I am saying Go here I mean, in particular, the golang.org standard implementation in its current (today, Jan 13th 2013) implementation.
As for OSX, it is a tongue-in-cheek remark. Processor affinity goes back to at least the 90's in IRIX. But there is little reason for Apple to implement it since it does not help their core business. Furthermore, multi-core Apple systems are fairly new. Another point is that Apple has built a fairly broken kqueue()/kevent() implementation. Again because it does not suit their core customer, who doesn't need this.
Thanks!
In the sense where "preemptive" has the same semantics as "interrupt", then occam (like on the transputers) also did interrupts (of course). But it got rid of "interrupt" as a term; a traditional interrupt function was "just another process" hanging on a hardware (flag) input channel.
In addition, another CSP-based system could have used real operating-system preemption this way. It's only when there is message (channel) based synchronization that the scheduler would have to let the messages drive the scheduling. This could of course give low latency along one path of communication while the others are waiting.
However, when one starts to factor out unneeded elements here, then I assume this kind of preemption would go. There would be little need for it, I assume. Parallel slackness ("enough" concurrency) does take care of burning processor cycles for the code, with busy polling never needed. Since we are comparing soft realtime with soft realtime here, isn't it like comparing rubber bands?
Something completely different: Erlang seems to have "save queues". Those puzzle me, see http://www.teigfam.net/oyvind/home/technology/056-some-questions-about-sdl/. I have tried to write more about "WYSIWYG semantics" at http://oyvteig.blogspot.no/2011/12/036-wysiwyg-semantics.html. Any comments could be posted there, not meaning to distract from this thread.
/aclassifier
Great post!
Could you maybe clarify the +K flag (kernel poll)?
From what I have gathered, this will use OS epoll/kqueue feature for I/O. Does this mean that I/O will not go through async thread pool?
Network I/O via epoll(), kqueue()/kevent() or /dev/poll is enabled through the +K true parameter. If you set it, the system will do its best to avoid a select()-oriented loop. Works on most architectures, except for OSX.
The way it works is that one of the schedulers runs it under a lock once in a while and then sort of distributes the work to the other cores. There is ongoing work to split this up so that each scheduler keeps its own set of file descriptors to poll, based on what ports it currently owns, and polls per scheduler. There is a preliminary patch which gives good results, especially on heavily loaded systems, so one day we might see that in the Erlang system.
The implementation essentially never blocks on the call, so there is currently no need to use a thread pool for it.
There are three interesting comments to your writings, aclassifier:
First, if you have thousands of cores, perhaps a million or so, then you don't need any scheduling at all. Each concurrent activity is just assigned a new core to run on.
The second one pertains to "save queues". If I understand SDL correctly, then Erlang doesn't really have them per se, but the concept is probably not necessary either.
In Erlang, there is a single unbounded mailbox and senders trying to deliver messages to that box will never block. A rudimentary flow control happens because sending to a large mailbox has higher reduction-cost than a small one. But that won't protect you against overflowing a slow process.
This solves your first couple of questions since they would not apply. As for matching on the mailbox, this is often used in place of eating messages and then saving them for later. In fact, rather than using save-queue semantics, you can often get away with never pulling messages out of the mailbox - other than the message you are interested in.
The idiom is that you generate a unique tag and then send a message with that tag. You now match on the arrival of the tag back to you, skipping all other messages. In effect, this acts like having such a queue.
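The idiom can be sketched like this (function and variable names are mine):

```erlang
%% Ask a server to do some work and wait only for the reply that
%% carries our unique tag; unrelated mailbox messages are skipped
%% by the selective receive and stay queued for later.
call(Server, Request) ->
    Tag = make_ref(),                 % globally unique tag
    Server ! {self(), Tag, Request},
    receive
        {Tag, Reply} -> Reply         % match only on our tag
    after 5000 ->
        erlang:error(timeout)
    end.
```

This is essentially what gen_server:call/2 does under the hood: the save queue is implicit, because unmatched messages simply remain in the mailbox.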
As for flow control, there is no support in Erlang by default and you have to implement it yourself. The reason is that doing it correctly is very often application specific. It depends on message idempotence, whether you may throw away messages, who may block, and so on.
Finally, as for your definition of WYSIWYG semantics, which I would define as being able to determine the semantics of a process from local context: in well-written Erlang programs, this is certainly the standard to aim for. You decouple programs and focus on the protocol contracts between them. Correctness of the program is to a large extent given by being able to focus on one component at a time without having to worry about the other components. As such, Erlang processes act like dynamic modules of state, splitting up the state space. Often, Erlang processes tend to have few lines of code, which helps with understanding.
It's always great to see a problem from another perspective. When I am at CSP conferences, Erlang is also mentioned sometimes, so I have wondered. Sometimes one brings a known problem into another context, and.. which problem? And sometimes we have a solution in one paradigm that one tries to bring into a new context, and.. solution to what?
In the hope that my questions would not be of the sorts, I go on:
> unbounded mailbox
I fail to understand that unboundedness is something that I could relate to in a safety-critical system. How do I?
While I like the matching in Erlang, what about messages that are never (or too late) wanted? Won't they age, creep up like fragmentation, and finally fill the "unbounded" buffer?
> But that won't protect you against overflowing a slow process.
Do you mean buffer overflow (and a crash), or the slow process being burdened with too many messages in the end?
> As for flow control,.. paragraph
When you say that you throw away messages, are these messages that you match on and get (explicit), or messages that you throw away because they wouldn't match (implicit)?
> Finally, as for your definition of WYSIWYG semantics,
I didn't invent the term; it's from an IEEE Computer letter. The protocol contracts, yes. But what if you match on something that fits the regular expression, but still isn't wanted - is that according to the protocol or will it break it? Will it depend?
If that unwanted match yields an acceptable message for another internal state, then you would have to put it into the save queue I assume. Will this be WYSIWYG semantics?
/aclassifier
Let me ask a question on [n2] - doesn't a NIF execution count up the reduction counter by one? It surely does not "bump" the reduction counter, but I believe it surely increases it by one. Please correct me if I'm wrong. Thanks.
> Unbounded mailboxes...
In Erlang, you would normally have some state which reads and throws away messages not understood, often logging a warning in the process or crashing the process entirely so you can clean up. This is the normal way to handle mailboxes and keep them from filling up over time.
> Overflow...
This is when senders generate more messages than the receiver can handle. Over time, messages will queue up, and this affects latency and memory usage. To avoid it, you need to write flow control yourself, as Erlang only provides you with tools to handle this - not solutions.
Erlang systems thus provide no way by default to ensure safety-critical operation. You must provide those mechanisms yourself. A telephony switch always needs room for emergency calls, for instance. And this is achieved by tracking the load of the system in general and by kicking out normal connections if necessary.
> Flow control,...
Depending on the system, you have to decide how to either avoid or recover from an overload situation. Explicit matching is rarely used to clean up a mailbox because it tends to be rather expensive. Usually, it is better to head-drop the queue or simply flush all of it away. Naturally, what to do is application specific.
In general, unknown messages are regarded as a failure to implement the protocol correctly. They can happen by accident, or due to dynamically upgrading processes in the wrong order. This is why it is mandatory for idiomatic Erlang programs to have some state in which they process the queue and remove the messages they don't know about from its front.
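A minimal sketch of such a state (names are mine):

```erlang
%% A server loop that handles its protocol and throws away
%% anything it does not understand, logging a warning so the
%% mailbox cannot silently fill up over time.
loop(State) ->
    receive
        {From, Tag, get} ->
            From ! {Tag, State},
            loop(State);
        {From, Tag, {put, New}} ->
            From ! {Tag, ok},
            loop(New);
        Unknown ->
            error_logger:warning_msg("unexpected message: ~p~n",
                                     [Unknown]),
            loop(State)
    end.
```

The final catch-all clause is the important part: every loop iteration drains unrecognized messages from the front of the queue instead of letting selective receives skip over them forever.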
@Kenji,
You are right, but there are two problems here which must be understood:
First, a single reduction is often far too little for more expensive operations. This means that if a process calls many NIFs in short succession, it ends up getting more time on the CPU than other processes, and this of course changes the expected timing of the system.
Second, and worse, NIFs prevent preemption while they are executing. If they are implemented naively, not broken up, and are long-running (rule of thumb: more than 1ms), then they will almost certainly affect the timing and thus the soft-realtime constraints of the system.
Jesper, thanks for taking me through this crash course in Erlang (no pun)!
Erlang is a tool, of course - but as with any language, solutions seem to unfold more in one direction than another?
If there is a safety-critical direction there, send me a note, as it may be well worth studying more thoroughly.
In any overflowed system one must know what to discard, should that situation arise - or design it such that it would have 100% capacity for all situations, or be provably correct. How telephone switches or fire detection systems (which I work with) compare here I don't know. I assume there's a lot of critical data handled by Erlang. Also I don't know which protocol level that's at. Probably several. The higher the OSI level, the less acceptable it is to throw away anything, I would assume. After the first fire alarm the next is not so interesting, if it's not in a nuclear power station.
Selecting the right solution is harder than selecting the right tool. But these decisions are connected, no matter what Alan Turing would say.
(I know this is off the original subject here, but it's very interesting!)
/aclassifier
A "properly written NIF" should bump the reduction counter inside the NIF for every "operation" that matters, so that it can be mixed with other Erlang code and not block the scheduler.
The CPython VM actually does limit the runtime of Python threads via the GIL. It's rather subtle.
ReplyDeleteThe idea behind it is that if a Python thread goes into a loop, there should still be a chance for other threads to execute. If some number of VM ticks - which is basically the number of opcodes executed - have passed since the last switch, the global interpreter lock, or GIL, is released for that thread and the Python runtime tells the OS to schedule some other thread.
In short it's a mess and that's why Python programmers generally stick to cooperative solutions like Twisted, eventlet or likewise.
You can define affinity on OSX: http://developer.apple.com/library/mac/#releasenotes/Performance/RN-AffinityAPI/_index.html
You cannot explicitly define processor affinity but you can make all threads run on different processors (which is what you usually really want).
Thank you for posting. Erlang's scheduling reminds me of Prime Computer's scheduler. The Process Control Block contained a counter whose overflow caused a process fault. The tick process incremented the counter; when it overflowed, Boom! The process got kicked off the ready list and had to wait its next turn. I/O operations and IPC also removed a process from the ready list. Best of all, the instruction set supported all this via native semaphore and queuing instructions. It was sweet.
ReplyDeleteJLOUIS said: "As for OSX, .. Another point is that Apple has built a fairly broken kqueue()/kevent() implementation. Again because it does not suit their core customer who doesn't need this."
The Apple Manpage [1] for kqueue states in the BUGS section that "Not all filesystem types support kqueue-style notifications. And even some that do, like some remote filesystems, may only support a subset of the notification semantics described here."
Does OSX only support a subset, or are there plain bugs in their code, as "fairly broken" seems to imply? Is it open or proprietary code? Are you saying that Apple deliberately doesn't fix an error?
(also an aside in this thread, albeit started by JLOUIS:-)
/aclassifier
[1] - https://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man2/kqueue.2.html
You can call runtime.Gosched() in Go to cooperatively hand off to the scheduler. But it's fair to say that Go's scheduler is targeting different behaviour. Goroutines are more about synchronization and simplifying control flow, less about having a myriad of small, restartable run-loops with good uptime, Erlang's speciality.
@aclassifier, that's all very easy to answer without even looking further.
What filesystem types don't support kqueue-style notifications? The link you gave shows a manpage from October 2008, and things still hold like that in the latest OSX, so Apple still hasn't fixed it, and it seems they won't fix it voluntarily.
As for the open or closed code, that doesn't matter at all. What matters is whether it's their code or third-party code. Anyway, we don't have a reliable way of knowing whether a filesystem supports kqueue-style notifications, and which semantics it fully supports, without trying to use kqueue on it. Such an API would make it much easier to determine if a filesystem module is legacy and treat it differently.
Otherwise, just hardcode your application with a table stating which filesystems are well behaved, which are more-or-less and which don't support kqueue at all. Or put it in the application's Info.plist.
Doesn't OSX use kqueue?
Is it just (more or less) available in the API?
/classifier
@Julian Morrison: yielding cooperatively is also possible in Erlang (call erlang:yield()), but doing so is recommended against, because the scheduler knows better than you when to interrupt you.
Synchronization can just be a matter of sending messages. A and B want to synchronize. A sends a unique token 't' to B, along with a description of a task to do. A waits for B's response, which includes token 't' (checked through pattern matching), and is automatically descheduled until a message comes back. When a message is received, it's checked against A's pattern; A will only unblock when B tells it it can.
This can be done without impacting the rest of the system, unrelated to A and B, though.
I'm not sure what Go does that is a lot better on that (I don't use Go), and I would like to know.
@ MononcQc: A yield would be a one-sided action, not a cooperative one, I assume.
In Go, if zero-buffered, synchronization and communication are two sides of the same coin.
To do a session in Go (this would be your "synchronization", I think) the sender would pass over a session channel (bidirectional) with the original request, which would then be used privately during the session. This would also simulate an input boolean expression, which Go's guards do not have (by design). See my "Priority select in Go" [1]
[1] http://www.teigfam.net/oyvind/home/technology/047-priority-select-in-go/
/aclassifier
Great write up. I wonder why +sbt is not on by default?
Actually, Erlang has some flow control built in, but it is usually not enough. I found out when trying to run a naive Sieve of Eratosthenes on Erjang.
It works by making a message send take a variable reduction hit, proportional to the length of the receiver's message queue. For distributed messages the hit is proportional to the length of the wire encoding of the message. This will slow down the sending end a little, but in high-load situations you'll have to do your own flow control / back-pressure mechanism.
Hey krab,
I think I mentioned that in the post, but for some reason it was not clear enough. Thanks for adding that blurb :)
@MononcQc Go's channels have 2 modes: buffered and unbuffered. Unbuffered channel sends can't proceed until there's a corresponding channel receive, and vice versa. Whichever goroutine reaches the point of rendezvous first will block. As such, that gives you the ability to reason easily about "before" and "after" across goroutines. (Unbuffered channels are more like Erlang's mailboxes, fire-and-forget and pick up later, except that they have maximum lengths defined at creation time, and when they fill, they block like an unbuffered channel.)
Oops, typo, I mean "Buffered channels are more like Erlang's mailboxes".
Great article - thanks!