1. This post is all about parallel computation from a very high-level view. In particular, I claim that Erlang is not a parallel language. It was not created with the primary goal of speeding up computation by harnessing multiple cores, your GPU and so on. If that is your goal, there are other languages which will probably fare much better for your needs (Haskell, for instance).

    Note, however, that while Erlang is not a parallel language, its runtime is rather excellent at extracting parallelism from existing concurrent programs. So when we say Erlang is parallel, we say that Erlang is parallel in a specific way! Recent years have seen much work in Erlang/OTP on making the runtime execute concurrent programs in parallel, and we are reaping the benefits. The reason can be found in the simple observation that an Erlang program has thousands of processes, which gives thousands of executable threads of control. Since you have more than one thread of control, and communication between them is largely asynchronous, you have every opportunity for a parallel speedup.

    Parallelism vs Concurrency
    A fine point, easily missed, is that parallel computation and concurrency are different entities. Simon Marlow has a blog post in which he describes the details.

    The gist of it, however, is rather simple: parallel computing is hunting for a program speedup when adding more execution cores to the system running the program. It is a property of the machine. Concurrent computing is writing a program with multiple threads of control such that it will non-deterministically execute each thread. It is a property of the language.

    If you come with a view from the hardware and upwards toward the computing logic, then surely you may fall into the trap of equating the two terms. The reason is that to implement parallel computation on current hardware and software, you use a concurrent semantics. Using the pthread library, for instance, means writing a concurrent program which then takes advantage of multiple processors if they are available. To program for the shared-memory architecture, you use mutexes and locks. Most other methods build on top of these primitives. Deep down at the hardware level, it is the presence of a Compare-and-swap operation that is at the heart of the idea.

    One may then ask: are we always going to use these primitives deep down when we implement parallel programs? Are we always going to go parallel from the concurrency primitives? If your opinion is largely yes, then you may hold the view from below, from the hardware and up.

    The view from the logic and down is that the language has certain constructs available to it, of which some are parallel. How we are going to implement those features is postponed until later. That is, we squint our eyes in a way such that we cannot see the deeper details, but can concentrate on the general ideas. One could imagine different hardware on which another implementation would be more beneficial, for instance by running on special-purpose hardware for the task, an FPGA or a GPU.

    The main point is that parallelism is a declarative notion in the logic. If you want an array processed in parallel, you are not going to explain how the underlying implementation should squeeze more computational work onto multiple cores. You are just stating that the semantics of your program is such that it is allowed to do so. Most importantly, in parallel array processing you must be very specific about whether the order in which elements are processed matters. For some computations it doesn't. For other computations, there is a notion of order and you can't screw it up. The difference in semantics here w.r.t. order is among what concerns a language designer with the view from logic.

    So there are essentially two views: from the hardware or from the logic. The perspective of logic, if you have toyed with formal semantics, is that we formally define concurrency primitives in a different way than parallelism primitives. That is, they are different entities. The perspective of hardware, on the other hand, conflates the ideas, because the only way to achieve parallelism is to use concurrency, especially on UNIX.

    Erlang is not a parallel programming language!
    Erlang is not even a concurrent functional programming language, in the sense that concurrency was never the point in itself: Erlang was built for robustness and fault tolerance. To implement robustness and fault tolerance, it was decided that concurrency was a good vehicle. Likewise for the decision to make the language functional. These choices were merely made in order to achieve the greater goal, and had to be chosen along the way.

    Erlang is not a parallel language either! The goal is to provide robustness and fault tolerance. Going parallel does not directly facilitate this goal, and hence, from the perspective of language logic, there are very few primitives that have to do with the concept of parallelism. Contrast this with concurrency, for which several central primitives are built into the language.
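    As a reminder of what those concurrency primitives look like, here is a minimal sketch (module, function and message names are made up for illustration): spawn a process, send it a message with `!`, and pick up the reply with `receive`.

    ```erlang
    %% Minimal sketch of Erlang's central concurrency primitives.
    %% Module, function and message names are made up for illustration.
    -module(prim_sketch).
    -export([start/0]).

    start() ->
        Echo = spawn(fun loop/0),   % spawn a new, concurrently running process
        Echo ! {self(), hello},     % ! sends a message asynchronously
        receive                     % receive picks a matching message from the mailbox
            {Echo, Reply} -> Reply
        after 1000 ->
            timeout
        end.

    loop() ->
        receive
            {From, Msg} ->
                From ! {self(), Msg},
                loop()
        end.
    ```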

    From the view of the hardware and the actual implementation of Erlang/OTP, much attention has been paid in recent years to making the Erlang VM able to execute processes in parallel. This has mainly been done because you can then increase hardware utilization and hope to gain a speedup in your programs. In other words, it is not a necessity, but a nice-to-have feature.

    The main point to take away from this observation is that in Erlang, parallel computation is implicit. If you write your program in an idiomatic way, you will often find that there are automatic speedup gains from going to multiple cores. But you have little direct control of how things are going to be parallel. Rather, you design your program such that it can implicitly take advantage of multiple cores.

    There is an inherent desire in many newly minted Erlang programmers to harness the full potential of their new K-core machine (with K > 1). The idea is simply that by using Erlang, the program will run faster because you will stand a chance of getting the utilization of the cores up.

    But this is not, and has never been, the primary focus of Erlang! And hence it is not a given that your program will run faster. To make it go fast, the program has to be written in a style that makes it possible for multiple cores to work on it at the same time, implicitly.

    That said, Erlang was designed in a way that makes automatic use of multiple cores much simpler than in some other languages out there. In fact, a common case is that you can speed up an unaltered program just by adding more cores. Given the focus on robustness and fault tolerance, this decision sounds logical. If you have a system that works and is relatively void of bugs, then you should not have to go out and change the internals of the program to make it run faster. The danger is that your new, altered program is incorrect, and then it does not matter at all whether it runs faster or not.

    One might state the goal of Erlang's parallel endeavours as: we hope to achieve a relatively good amount of speedup when adding multiple cores, without you having to change the program. That is, if you add more cores, we expect some speedup, but not necessarily the best possible speedup. It is an attempt at a best-effort solution rather than a perfect one. Herein lies the gist of why Erlang is not a parallel programming language. It is not about maximizing the throughput and usage of the many cores, but about using them to automatically enhance an already written program.

    Finally, if one looks at Erlang's history, parallelism only came to the language fairly late. While some believe that the language is excellent in a multi-core era because it is (relatively) easy to make the runtime parallel, this was not the goal initially. By sheer luck, some of the design decisions made for Erlang make the language brilliant for running on multiple cores.

    A word also has to be said on the granularity of processes: what is the smallest piece of computation that can be made to work on multiple cores? In Erlang, the grain is finer than most other languages' concept of "threads", and processes are so cheap that it is easy to spawn a new one for a separate computation. But it is important to note that the grain is coarser than the Haskell concept of a "spark" (see Runtime support for multicore Haskell (PDF) for instance). And the spark concept is spot-on for squeezing out parallelism at a much finer grain than what Erlang has. The essential trick for sparks in Haskell is that they share the Thread State Object (context), so you don't have to start new threads to service sparks, making them much cheaper to execute even if there are millions of them. As an explicit method of gaining parallelism, sparks are a very interesting concept.

    In conclusion: Erlang can do parallelism in the sense of adding it on top of concurrency. This means that parallelism is implicit rather than explicit in the program, and hence you want to program in a certain style to make sure your program goes fast. The rest of this post presents ideas for doing exactly that.

    Utilization vs Speedup
    It is very easy to get full utilization of a 128 core machine: Let 1 core do the actual work, and let the remaining 127 cores run an infinite loop.

    All cores will now show 100% utilization, yet the program is not running any faster than before. We would love to have the program run 128 times faster, now that we have 128 times the computational power available. The actual *speedup* is what we are after.

    As a programmer, you have two goals. The first goal is to get the utilization up. You can't hope for speedup if cores are sitting there, idling. *The utilization is an upper bound for speedup*. Next, the goal is to make the additional cores do something better than the endless loop above. That is, decrease the execution time of the program.

    It is worth looking at Amdahl's law in its parallel variant. The central aspect of the law is that *serial parts of a program will slow your program down*. Such a part can only be run on a single core, and hence it will force the other cores into idling while they wait. And when we consider a parallel part, there is a critical path, called the span of the computation (see Work & Span for instance). Making the span smaller will also yield a faster program, but when optimizing for the span, the critical path may jump to another part of the computation.
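    For reference, the usual statement of the law, with $p$ the parallelizable fraction of the program and $N$ the number of cores, is

    $$ S(N) = \frac{1}{(1 - p) + \frac{p}{N}} $$

    so even a program that is 95% parallelizable ($p = 0.95$) tops out at roughly $1 / (0.05 + 0.95/128) \approx 17$ on 128 cores, no matter how many more you add.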

    Also of importance is Gustafson's law. Amdahl assumes a fixed workload size. If we can complete that workload in half the time, we have a speedup of two. But as the number of cores increases, so does the overhead of orchestrating their communication. Gustafson (et al.) proposed that we instead make the workload size the scaling variable. That is, suppose we solve a workload of size 1 in 1 minute on 1 core. Now, if we have 8 cores, perfect speedup would be that we are able to solve a workload of size 8 in 1 minute on 8 cores. Contrast with Amdahl: workload 1, 1/8 minute, 8 cores.
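    In the same notation, Gustafson's scaled speedup is

    $$ S(N) = (1 - p) + pN $$

    so for the example above, with a (near) fully parallel workload and $N = 8$, the scaled speedup is (close to) 8: eight times the work in the same minute.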

    Erlang and speedup
    While Erlang is not an explicitly parallel language, it does not hurt to write your Erlang programs in a style which invites multiple cores to do work. In this section, we will try to give some general advice on how to do exactly that. To make Erlang able to execute your program in parallel, you must obviously structure the program accordingly.

    First, your process run-queue must have enough work that it can be divided among several cores. If there is only ever a single process that can run, that will limit the parallelism of your program.

    Second, be aware of serializing bottlenecks. The unit of execution granularity is the process. So if a large number of workers all go to the same single process for their data, that single process becomes a contention point which is serial, and for that part of the code there will not be any speedup. If queries to that single process dominate your program, you are in bad shape. The extension of this is rather simple: if you have 4 workers processing a queue, then your maximal speedup is 4. This may not be enough on a 16-core machine.

    Third, synchronous calls block more than asynchronous casts. If possible, prefer asynchronously communicating processes over synchronous ones. Long chains of synchronous calls following each other are serialization in disguise: they block out other callers while they are processing, and they are also prone to creating deadlocks. Asynchronous calls will, in addition, tend to fill up mailboxes with work. It is much more efficient to handle 8 messages in one context-switch time slot than it is to handle 1 message and then switch right away on an empty queue. In some cases you can go pseudo-async, by making 1 out of every 100 calls synchronous for instance. This is an effective way to ensure that you don't overflow a slow receiver, as it forces in some amount of flow control.
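    As a sketch of that pseudo-async idea on top of `gen_server` (the module name and the 1-in-100 threshold are made up for illustration): send casts most of the time, but turn every 100th request into a synchronous call so a slow receiver gets a chance to push back.

    ```erlang
    %% Sketch of pseudo-async flow control on top of gen_server.
    %% The module name and the threshold are made up for illustration.
    -module(pseudo_async).
    -export([send/3]).

    %% N is a running counter kept by the caller.
    send(Server, Msg, N) when N rem 100 =:= 0 ->
        gen_server:call(Server, Msg),   % synchronous: blocks until the receiver has caught up
        N + 1;
    send(Server, Msg, N) ->
        gen_server:cast(Server, Msg),   % asynchronous: fire and forget
        N + 1.
    ```

    The occasional blocking call is crude flow control: the sender can never run more than about 100 messages ahead of the receiver.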

    In Erlang a process is very cheap. It is far cheaper than a `pthread` thread in C. This leads to a programming methodology where you are not afraid of creating processes, much like you would not be afraid of creating objects in OOP. A million processes is not a limiting factor. A good design choice in Erlang is simply to create a process per incoming request or event. Take an HTTP server for instance: each GET request will spawn a new process to handle that GET. Had processes been heavyweight, this approach would not have been viable at all. The idea is that the data you are going to be operating on is morphed into a thread of control, so the data itself can go ask other parties for information, process itself, and so on. There is a strong similarity to OOP here, as each process plays the role of a living object (objects are "dead" in OOP, only temporarily revived from their zombie state when the current executioner of control comes by).
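    A sketch of the process-per-request idea for a small TCP accept loop (the `handle_request/1` function is hypothetical and stands in for the real work; real HTTP servers such as Yaws or Cowboy provide this structure for you):

    ```erlang
    %% Sketch: one process per incoming connection/request.
    %% handle_request/1 is hypothetical; it stands for the real work.
    accept_loop(ListenSocket) ->
        {ok, Socket} = gen_tcp:accept(ListenSocket),
        Pid = spawn(fun() -> worker(Socket) end),
        ok = gen_tcp:controlling_process(Socket, Pid),  % hand the socket to the new process
        accept_loop(ListenSocket).

    worker(Socket) ->
        {ok, Request} = gen_tcp:recv(Socket, 0),        % assumes the socket is {active, false}
        Reply = handle_request(Request),
        ok = gen_tcp:send(Socket, Reply),
        gen_tcp:close(Socket).
    ```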

    This design effectively ensures that our run-queue is always well filled with work to do. Had we opted for a queue-pipeline approach, as is common in many other languages, the number of workers working on each queue would have to be tuned to the number of cores. Furthermore, you have to watch out for overloading a specific queue. In our morph-data-into-process design it is the run-queue we have to watch out for, but that is simpler and there are tools for doing it (see for instance Erlang Solutions' esl/jobs application).

    Note, however, that if these many small processes all request data from the same server-like process, then performance will dwindle and suffer. The server-like process is a serializer which we warned about earlier.

    One idiom I commonly use is to have a protected ETS table with the option {read_concurrency, true} set. When all the spawned processes want to read data, they go directly into the ETS table and fetch what they need. Writes are serialized through a central process. Of course this scheme can be altered if you need a large amount of writes, or general reader-writer access.

    In the newer Erlang/OTP releases (R14 and upwards, I believe) the access to the ETS table is highly parallel in the runtime. Hence, your million processes will not have to wait on each other to read data out of the table.
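    A sketch of that idiom (table and function names are invented for illustration): the owning process creates a protected table and serializes the writes, while any process may read directly.

    ```erlang
    %% Sketch of the "protected ETS table, central writer" idiom.
    %% Table and function names are invented for illustration.

    %% Run once, in the process that will own the table and do the writes.
    init_table() ->
        ets:new(my_cache, [named_table, protected, set,
                           {read_concurrency, true}]).

    %% Only the owning process calls this; a protected table rejects
    %% writes from anybody else.
    store(Key, Value) ->
        true = ets:insert(my_cache, {Key, Value}),
        ok.

    %% Any process can call this and read the table directly,
    %% without sending a message to the owner.
    lookup(Key) ->
        case ets:lookup(my_cache, Key) of
            [{Key, Value}] -> {ok, Value};
            []             -> not_found
        end.
    ```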

    Another common trick is to increase the independence of computations. Whenever you can eliminate some sharing, perhaps by recalculating data or caching it somewhere, you increase the independence of the computation. In Erlang's semantics, processes have exclusive access to their own data, so the independence of data is already really good. What you want to avoid is having to partner with other processes to solve a task, especially if the partner is heavily contended.

    The central point is to be aware of where serialization might occur in your code and then eliminate it. Otherwise, you may see bad utilization and no speedup. There is a nice tool, Percept, which can profile the runnability of a process, i.e., when a process or port is able to run on a core.

    Also, you should be aware of the following interesting gotchas:

    I disabled SMP and now my code runs faster!
    Yes. This is a common thing. When you benchmark your code, a very grave but all too common mistake is to use, as your baseline, the code optimized for parallel execution running on a single core. Parallel code often has some overhead due to communication, and often it also impedes execution speed if you run it on a single core only.

    You must benchmark the fastest possible non-parallel version of your program and count the speedup against that. Anything else is simply not fair; it is one way to make your benchmark look better by cheating.

    The overhead is also present when you take your program, written in Erlang, and run it on the non-SMP variant of the Erlang VM. It can often go faster for some workloads. The reason is that your program may not exploit parallelism that well, and thus the overhead of going truly parallel dominates the actual speedup gain. I have seen this with some of my Haskell programs as well.

    There is a fairly high price for suddenly having multiple cores inside the VM doing things. You will have to synchronize in the runtime and even though your synchronization primitives are heavily optimized, they have overhead. Now, if the problem you are looking at is not even saturating a single core, then surely, it would have been better to run in a single-threaded manner. It is only when you could benefit from a second core that it will help you.

    If, on the other hand, your problem maxes out the core, then adding a new core to the mix has some overhead, but it will make the computation scale further and you may experience a better throughput in the end.

    So if you are well below a single core in utilization, this is often a way to lower the amount of work your system has to do, and make the code run faster.

    My code has faster than perfect speedup!
    Another common thing that can happen is super-linear speedup. That is, you may have 8 cores and expect a speedup close to 8 -- but all of a sudden you have a speedup of 12. This happens. The reason it happens is usually that when you add cores to your computer, you also alter other resources. Most notably, you often increase the amount of available cache. If you distribute the computation over multiple physical machines, you may also increase the amount of available RAM.

    These factors can improve your computational speed by quite a lot if you can manage computations such that they become independent of each other. Independence directly reaps the benefit of the added caches, as there will be no migration of data from one cache to another. If the computation can suddenly fit into the caches, the speed can increase by a lot. Likewise, the same pattern can emerge if you add more RAM to your cluster. Suddenly the central data can be kept mostly in memory and you don't have to load data off the disk anymore.

    Hiding latency - the secret weapon for fast parallel computation

    If there is a single trick resulting in fast parallel speedups, it has to be latency hiding. Note that problems come in three varieties: the simple, the impossible and the interesting.

    The simple problems are called the embarrassingly parallel problems. They are simple to solve: split them into equal-sized chunks and solve each part, then reintegrate the individual solutions into a final result. The idea of Map-Reduce lends itself directly to a solution method; if your problem is simple, Map-Reduce solves it (the small pmap sketch below illustrates the idea). The impossible problems are the class of problems for which you have little to no hope of speeding up the computation by having more cores. A compiler, for instance, can compile multiple modules at once, but the steps of lexing, parsing, code generation, optimization and so on have to happen in that order, and have to happen in a serial fashion: a large span is dominating. Some numerical iterative computations require a finished step N before they can tackle step N+1. These problems are so data-dependent that you have no hope of making those parts parallel. Essentially, the span of each iteration is chained together in a way that makes it impossible to speed up.
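    For the simple class, a naive parallel map in Erlang is only a few lines (illustrative only; libraries such as `rpc:pmap/3` or `plists` do this more carefully): spawn one process per element and gather the results back in order.

    ```erlang
    %% Naive parallel map: one process per element, results gathered in order.
    %% Illustrative only; see e.g. rpc:pmap/3 or the plists library for real use.
    pmap(F, List) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn(fun() -> Parent ! {Ref, F(X)} end),
                    Ref
                end || X <- List],
        [receive {Ref, Result} -> Result end || Ref <- Refs].
    ```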

    This leaves us with the interesting class of problems. In order to compute these, you need to have some amount of sharing/dependence in between the processes. And this is where the latency hiding trick comes into play.

    When one process sends data to another process, there is latency involved. We must copy data into another mailbox, or perhaps to another computer. The bandwidth of the network and the size of the data we have to copy put us in a situation where there is a minimum latency we can expect when we copy. Now, the naive solution is to repeat a two-step process: compute this iteration, then exchange the results with the other processes. But every time we exchange data, the processes are waiting and idle. If communication time dominates, our speedup will suffer.

    The latency hiding idea almost doesn't even warrant a name: "do something else while the data transfers". That's it. Like an operating system is doing something else while a process waits for disk IO to complete, a process can do something else while waiting for data to transfer.

    To Erlang programmers this means that the order in which you send and receive matters. If you reorder the receives such that the communication overhead is hidden, then your program will find the relevant message already in its mailbox when it is needed. The rule is: send early and receive late. As soon as you have data another process will need, you must make it available to it. And when you need data, make sure you ask for it as late as possible, just before you need it. That way, it may already be there in the mailbox, ready for you to use.
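    A small sketch of that reordering (the neighbour protocol and the helper functions are invented placeholders): push your own data out before starting the local work, and only then block on the neighbour's data.

    ```erlang
    %% Sketch of latency hiding: send early, receive late.
    %% compute_local/1 and combine/2 are invented placeholders.

    %% Naive: wait for the neighbour's data before doing anything.
    step_naive(Neighbour, MyChunk) ->
        Neighbour ! {self(), MyChunk},
        Theirs = receive {Neighbour, Data} -> Data end,   % idle while the data is in flight
        combine(compute_local(MyChunk), Theirs).

    %% Latency hiding: send first, do useful local work while the
    %% message is in transit, and receive as late as possible.
    step_overlapped(Neighbour, MyChunk) ->
        Neighbour ! {self(), MyChunk},                    % send early
        Mine = compute_local(MyChunk),                    % this work hides the latency
        Theirs = receive {Neighbour, Data} -> Data end,   % receive late
        combine(Mine, Theirs).
    ```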

    The conclusion is that one must take into consideration the impact of latencies. In high-performance computing it is of primary importance. A worthy bet is that on modern computers, moving data around is often the bottleneck, not the raw speed of the CPU core. In a distributed system, latencies may be much higher due to network copying, with a much higher impact on speed. The CPU cache also plays a crucial role in limiting the speed at which your computations will run.

    Conclusion
    We have given a brief overview of some of the basics of parallel computation. We have argued that concurrency and parallelism are different beasts. We have argued that Erlang is not particularly parallel. We have suggested what one should do to increase parallelism in Erlang programs. We have also hit on a small number of interesting gotchas when computing in parallel. And we have discussed the concept of latency hiding.

    There is much more to be said about Erlang and parallel computation. I particularly like that Erlang does not assume a cache-coherent architecture for instance. But that will be a post for another time since I am all out of blog space and this is already too large.

  2. Google released Google+. Everyone has now blogged about whether it will "rank or tank", but, to my knowledge, nobody has really looked into the concept of circles from a more technical viewpoint. Let us remedy that.

    Online communication services can all be classified according to different aspects. A chat/IM message, for example, is usually one-to-one realtime communication. There is an assumption that I will answer a chat message rather quickly, even when I don't have to. IRC (Internet Relay Chat) is different: communication is topical, as I join a channel discussing a certain topic. Communication is many-to-many and there is no assumption that I answer a message right away. Likewise for email, but there the exchange of messages happens at a slower pace, with larger bodies of text. The reason you use more than one service is essentially that different communication services cater to different kinds of communication. The collaborative benefits you gain when developing source code over IRC are not paralleled anywhere else currently, for instance, at least not as a free, open project.

    Which kinds of communication services disrupt your flow of thought by alerting you and requiring you to answer (phones, I am looking at you!)? How realtime/instant is a message? Can you post additional content like photos, images, URLs? Can you split communication into topics of interest? Are topics invite-only or free-to-join by default? How is moderation done on topics? Is it easy to form private groups to discuss something? Is the communication one-to-one-focused, or is it one-to-many or even perhaps many-to-many? Can you build robots to automate tasks for the service? And so, in the light of Google+:

    Google+'s major point is to work like Facebook does at its core. This means that I am not the primary customer, as I don't use Facebook much. Hence, I am probably the wrong guy to judge whether it will work out or not. By the core of Facebook, I mean that the primary goals of G+ and FB overlap very much. In the long run, the goal is to form a relation graph between people and give Google access to it.

    The major, essential difference between Google+ and Facebook is the formation of connections. In Facebook, A adds B and then B confirms the relationship. The opposite of this is Twitter: you just follow people you are interested in. Facebook runs a 2-way handshake whereas Twitter's is one-way. Since connections in Facebook are 2-way, we trivially have: if A is connected to B, then B is also connected to A. In other words, the connection graph in Facebook is undirected. Contrast with Twitter: graphs in Twitter are directed. There are numerous one-way connections in Twitter; most notable are celebrity figures with an extremely high number of followers. The communication from such an account is essentially one-to-many and public.

    So what does Google+ do? It chooses a variant of Twitter. In G+, the graph is directed like in Twitter, but you also label the edges with a name -- your circle designation for the connection. You can have multiple edges to the same person/account, so in reality the graph is a multigraph where edges are discriminated by their circle-labels. Another representation is to deem that a graph edge is labeled with a set of circle-labels.

    When A adds B to his circles, A is adding an edge in the graph from A to B for each circle. So A builds a one-way path to the other end. If B later decides to add A to his circles, then there will be edges in the opposite direction. They need not be labeled the same at all. This means that A can have B in the circle of close friends while B has A in the circle ObnoxiousCatPhotoSubmitters. This is an interesting skew of connections which comes into play. When you post something to a circle, the recipients cannot see which circle the message originated from; in fact, circles are exclusive to an account and can't be seen by others. This in effect makes a message a low-level value which is exchanged from a sharer to a list of people the sharer decides on via circles. Upon receipt, the message is then cast into the light of the recipient's circles.

    So forming connections is like in Twitter, only edges are labeled by circles given by the source. This is a rather genius move. When A shares, he chooses the circles he wants to share the message with. The system then searches all edges labeled with those circles, and posts along those edges only. If A wants to post tech-related stuff, A can do so in the Tech circle so that only people who might care are hit by the sharing.
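    A small sketch of that model (this is my own modelling for illustration, not Google's implementation): each directed edge is labeled with the set of circles the source has put the target in, and sharing to a circle selects exactly the outgoing edges carrying that label.

    ```erlang
    %% Sketch of the circle model: a directed graph where the edge {From, To}
    %% carries the set of circles From has put To in.
    %% My own modelling for illustration, not Google's implementation.

    new_graph() ->
        dict:new().

    add_to_circle(Graph, From, To, Circle) ->
        dict:update({From, To},
                    fun(Circles) -> sets:add_element(Circle, Circles) end,
                    sets:from_list([Circle]),
                    Graph).

    %% Sharing to a circle: find every account From has labeled with that circle.
    share_targets(Graph, From, Circle) ->
        [To || {{F, To}, Circles} <- dict:to_list(Graph),
               F =:= From,
               sets:is_element(Circle, Circles)].
    ```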

    • The sender, A, has complete control of who sees the message. This is what Google refers to as the ability of G+ to accommodate different kinds of privacy levels on messages. Also, it is the sender, A, that defines his or her own levels of privacy with Circles as they are local to an account. It also blurs the definition of "follower", "friend", "connection", "add" and so on. Some edges are labeled as "close friends" and are thus strong edges from a privacy perspective. Other edges are weak in the sense that you have a connection, but there are limits to what you want to share with the other person.
    • When data arrives at B, messages from A gets sorted according to the circles of B. So when B looks at streams from circles in which A is in, he sees the shared information from A.
    • Consequence: A needs to moderate what he sends along the Tech circle to B. If A begins sending cat photos to his tech friends, he will end up in the ObnoxiousCatPosters circle at B, and only that circle. B is free to move him at any time if he wants. The assumption currently is that relations between people are formed on one primary ground: "He is family", "She and I are always discussing cool graphics design". This may largely be true.
    • One thing I have not figured out yet: suppose I want to follow B for more than one interest, say cats and dogs. I have a circle for Cats and one for Dogs. But when B posts in his public circle for all to see, his posts will end up at my place in both Cats and Dogs. To the best of my knowledge there is no way I can currently split this. My guess is that Google will have to solve it in the longer run. They have several options, among which is to use statistical analysis of the posts through machine learning, or to allow #tags as on Twitter.
    • Another thing I have not figured out completely yet: I don't think G+ can substitute for the other communication tools I am already using. Since I am not much of a Facebook user I am not the right demographic, but I don't see G+ replacing email, IRC or Twitter. Twitter is the thing it has the best chance of replacing, in my opinion. Email and IRC need a more Wave-esque approach, I am afraid. I would really like to see an "IRC for the masses" and I am not convinced that Huddle is it. Further, the locality of circles to an account makes them distinctly different from IRC channels. On the other hand, if you have a topical connection graph, you can use machine learning to suggest IRC-like topics later on, and it is not that hard to add proper IRC-like group chat features to Google Chat, which is already integrated.
    I find the technical difference mentioned here quite interesting. It is definitely a generalization of the existing models of connection-forming, and it will be interesting to see if it takes off, and what people will get out of using it.

    So in conclusion: G+ is a funny hybrid between the style of Facebook and Twitter. The circle concept is central to how the connection-graph is formed. And thus, central to how G+ works. It will be interesting to see if Facebook will have a counter-solution. Currently, FB can create groups of friends and handle these in many ways like G+ can handle circles. We can only hope that this competition from Google will force Facebook to improve their system as well. But they probably can't get rid of the two-way connections easily.