1. Programs of a certain size are complex. As long as the program is written by a single programmer and is fairly small, say under 1000 lines of code, everything is easy. The programmer can keep the whole program in their head and work with it effortlessly. If, on the other hand, the program grows in size or we add more programmers, then we can no longer rely on the singular knowledge of one programmer.

    The only way to solve this problem is to build abstractions into your programs. We will review two such methods in Erlang. The idea of abstraction, informally, is that we hide certain details and only provide a clean interface through which to manipulate the data. Erlang is a "Mutually Consenting Adult Language" (read: dynamically typed with full term introspection - or more violently - unityped crap with everything in one big union type). So this abstraction cannot be enforced by the language itself. On the other hand, the dialyzer can provide us with much of the necessary tooling for abstraction.

    As an example of so-called modular abstraction, let us consider a small toy module:
     -module(myq).
    
     -export([push/2,
              empty/0,
              pop/1]).
     -type t() :: {fifo, [integer()], [integer()]}.
     -export_type([t/0]).
    
     -spec empty() -> t().
     -spec push(integer(), t()) -> t().
     -spec pop(t())     -> 'empty' | {'value', integer(), t()}.
    These are the definitions and specs of the module we are implementing. We are writing a simple queue module for a FIFO queue, based on two lists that are kept back-to-back. I am using a Standard ML / OCaml trick here by naming the canonical type the module operates on 't'. The operations push/2 and pop/1 push and pop elements to and from the queue respectively. Note that we prefix queues with the atom 'fifo' to discriminate them from other tuples. The implementation of the queue is equally simple:
    
    
     empty() -> {fifo, [], []}.
    
     push(E, {fifo, Front, Back}) -> {fifo, Front, [E | Back]}.
    
     pop({fifo, [E | N], Back}) -> {value, E, {fifo, N, Back}};
     pop({fifo, [], []})        -> empty;
     pop({fifo, [], Back})      -> pop({fifo, lists:reverse(Back), []}).
    

    We always push to the back list and always pop from the front list. If the front list ever becomes empty, we reverse the back list onto the front. When not used persistently, this queue has amortized O(1) operations and is as such pretty fast.
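    To see the interface in action, here is what a small shell session would look like (a sketch of the expected output, assuming the module above is compiled and loaded):

     1> Q0 = myq:empty().
     {fifo,[],[]}
     2> Q1 = myq:push(1, myq:push(2, Q0)).
     {fifo,[],[1,2]}
     3> myq:pop(Q1).
     {value,2,{fifo,[1],[]}}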

    The neat thing is that all operations on queues are local to the myq module. This abstracts away the details of queues when you use them through this module. There can be much code inside such a module which is never exposed to the outside, and thus we have an easier time managing the program.

    There is a problem with this though, which is that the implementation of the queue is transparent. A user of the myq module, when handed a queue Q of type myq:t(), can discriminate on it like this:
     -module(u).
    
     -compile(export_all).
    
     -spec f(myq:t()) -> myq:t().
     f(Q) ->
         case Q of
             {fifo, [], []} ->
                 myq:push(7, Q);
             _Otherwise ->
                 Q
         end.
    Note how we match on the queue and manipulate it. This is bad practice! If the myq module defines the representation of the queue, it ought to be the only module that manipulates the internal representation of a queue. Otherwise we may lose the modularity, since the representation has bled all over the place. Now, since Erlang is for mutually consenting adults, you need to make sure yourself that this leak of the data structure's representation doesn't happen. This is especially important with records. If you want modular code, avoid putting records in included header files unless you are dead sure the representation won't change all of a sudden. Otherwise the record will bleed all over your code and make it harder to change things later on; changes are then not module-local but spread across several modules. This hurts the reusability of the code.
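    The remedy is to extend the interface of myq rather than match on its internals. A sketch of how that might look (is_empty/1 is a hypothetical addition, not part of the module above):

     %% In myq.erl: export an accessor instead of leaking the representation.
     %% (Remember to add is_empty/1 to the export list.)
     -spec is_empty(t()) -> boolean().
     is_empty({fifo, [], []}) -> true;
     is_empty({fifo, _, _})   -> false.

     %% The user module u can then stay ignorant of the representation:
     f(Q) ->
         case myq:is_empty(Q) of
             true  -> myq:push(7, Q);
             false -> Q
         end.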

    However, the dialyzer has a neat trick! Suppose that instead of
     -type t() :: {fifo, [integer()], [integer()]}.
    we had defined the type as opaque:
     -opaque t() :: {fifo, [integer()], [integer()]}.
    Then the dialyzer will report the following when run on the code:
     u.erl:9: The call myq:push(7,Q::{'fifo',[],[]}) does not have an
          opaque term of type myq:t() as 2nd argument
    which is a warning that we are breaking the opaqueness abstraction of the myq:t() type.

    The other kind of abstraction in Erlang
    Languages like Haskell or ML have this kind of trick up their sleeve in the type system. You can enforce a type to be opaque and get type errors if a user tries to dig into the structure of the representation. Since the dialyzer came late to Erlang, one might wonder how people could write programs larger than a million lines of code in Erlang and get away with it when there was no enforcement of opaqueness. The answer is subtle and peculiar. Part of the answer is naturally the functional heritage of Erlang. Functional languages tend to have excellent reusability properties because the task of handling state is diminished. Also, functional code tends to be easier to maintain since it is much more data-flow oriented than control-flow oriented. But Erlang has another kind of abstraction which is pretty unique to it, namely that of a process:

    If I create a process, then its internal state is not observable from the outside. The only thing I can do is communicate with the process by protocol: I can send it a message and I can await messages from it. This makes the process abstract when viewed from the outside. The internal representation is not visible and you could completely substitute the process for another one without the caller knowing. In Erlang this principle of process isolation is key to the abstraction facilities.
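    To make this concrete, here is a minimal sketch (made up for this post, not taken from any library) of a process that owns one of the queues from above and exposes it only through a message protocol; callers never see the representation, only the messages:

     start() ->
         spawn(fun() -> loop(myq:empty()) end).

     loop(Q) ->
         receive
             {push, E} ->
                 loop(myq:push(E, Q));
             {pop, From} ->
                 case myq:pop(Q) of
                     empty ->
                         From ! empty,
                         loop(Q);
                     {value, E, Q2} ->
                         From ! {value, E},
                         loop(Q2)
                 end
         end.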

    What does this mean really?
    Erlang has not one but two ways to handle large applications: You can use modules, exported types and opaqueness constraints to hide representations. While you can break the abstraction, the dialyzer will warn you when you are doing so. This is a compile-time, program-code abstraction facility. Orthogonal to this, a process is a runtime isolation abstraction. It enforces a given protocol at run time which you must abide by. It hides the internal representation of a process, so it provides an abstraction facility as well. It is also the basis of fault tolerance: if a process dies, only its internal state is directly affected. Other processes not logically bound to it can still run. It is my hunch that these two tools together are invaluable when it comes to building large Erlang programs, several hundred thousand lines of code - and getting away with it!

    So in conclusion: To create modular, code-level functional abstractions, rely on opaque types and the dialyzer to enforce them, as in the queue example above. To create a modular runtime, split your program into processes, where each process handles a concurrent task.


  2. Introduction

     

    The unit test is the lowest common denominator. We can use unit tests, but they are cumbersome and they only peek once into the fabric of our work. A better solution for many tasks exists for Erlang, namely Property Based Testing. This post is an example of how to use the statem type of test, mainly because there are so few such examples out there. The outset is this: We will randomly generate operations on a priority queue and then validate that these operations are correct according to a simpler model.

     

    The setup was that Michael Truog had a nice library of (min) priority queues which he had written (see okeuday/pqueue). He was mostly interested in how they benchmarked, but before benchmarking, you need some kind of assurance that your priority queue is indeed correct. Otherwise, you may just return the value 42 on each output and claim that this is the "right" thing to do. So the system under test (the SUT) is the various priority queue implementations by Michael. In order to test them, we need a model of a priority queue though.

     

    The model of a SUT is some code we write which can reflect the observations of the SUT. In the case of a priority queue, there is a very simple model we can choose: we can simply base it on a naive priority queue. The idea is that our model is easy to write but need not be particularly fast. If an error shows up, the SUT and our model will (hopefully) disagree and we can figure out whether the problem is with our model or with the SUT.

     

    We are going to use PropEr. The choice could equally well have been Erlang QuickCheck, with minor changes. Both tools expose the same interface for this kind of work, even though their inner workings differ slightly in practice.

     

    Note: All code is in Michael's repository linked above. If you want to digest the code yourself, go read it there - after having read this blog post. I've pulled out the important parts, but the real repo has a bit more code in it.

    Building the model

     

    The model in this case is fairly simple. When constructing the model you need to have an idea of what you are going to test for. The goal is to get the model as small as possible while still being able to handle everything you may need. On the other end of the spectrum, you build up a model that is a complete specification - which can in some cases actually mean your model will be larger than the SUT. But in this case the model is way smaller and simpler than the queue it tests.

     

    For this particular example, though, we are very lucky. Essentially our example is a set of complex implementations of a simple idea, so our model goes for reflecting the simple idea instead.

     

    Let us write a priority queue to serve as the model. Our representation is [{priority(), [element()]}], that is, a list of pairs where the first element is a priority and the second element is the list of values we are storing under that priority. We require that the list is sorted by priority. This makes lookup a bit faster.

     

    Insertion is awfully simple. Just skip until we hit the right spot and inject the value:

    %% A listq is a sorted list of priorities
    listq_insert({P, V}, []) -> [{P, [V]}];
    listq_insert({P, V}, [{P1, _} | _] = LQ) when P < P1 ->
        [{P, [V]} | LQ];
    listq_insert({P, V}, [{P1, Vs} | Next]) when P == P1 ->
        [{P, Vs ++ [V]} | Next];
    listq_insert({P, V}, [{P1, Vs} | Next]) when P > P1 ->
         [{P1, Vs} | listq_insert({P, V}, Next)].
    
    

    Note that if we get a hit on the priority, we just add the new value at the back. This is slow, but we don't care since we only need the model to carry out tests. Converting a priority queue to a list is easy, and so is taking the length of a priority queue in the model:

    listq_to_list(L) ->
        lists:concat(
          [ Vals || {_Prio, Vals} <- L]).
    
    listq_length(L) ->
        lists:sum(
          [ length(Vs) || {_Prio, Vs} <- L]).
    
    

    There! Michael has two ways to remove elements from a priority queue. The first is the common solution where you remove the next element. Note that if the queue is empty then the result is an empty queue. Also note that we don't return the removed element. These semantics seem a little odd to me, but on the other hand, it is better to make the model reflect the expected behavior here:

    listq_rem([]) -> [];
    listq_rem([{_P, [_V]} | Next]) -> Next;
    listq_rem([{P, [_V1 | Vs]} | Next]) -> [{P, Vs} | Next].
    
    

    Michael also has a variant where we remove an element of a given priority. We simply dig through the list until we find the desired priority and then remove an element from that entry. If there are no more elements left at that priority, we kill the pair:

    listq_rem([], _P) -> [];
    listq_rem([{P, [_]} | Next], P) -> Next;
    listq_rem([{P, [_ | Vs]} | Next], P) -> [{P, Vs} | Next];
    listq_rem([{P1, Vs} | Next], P) -> [{P1, Vs} | listq_rem(Next, P)].
    
    

    Finally, we need to be able to peek into the queue. There are three variants. We can peek for the minimal element, we can peek for the element with a given priority, or we can peek for an element and also obtain its priority:

    listq_peek([]) -> empty;
    listq_peek([{_P, [V | _]} | _]) -> {value, V}.
    
    listq_prio_peek([{P, [V | _]} | _], P) -> {value, V};
    listq_prio_peek([{_P1, _} | Next], P) -> listq_prio_peek(Next, P);
    listq_prio_peek([], _P) -> empty.
    
    listq_ppeek([]) -> empty;
    listq_ppeek([{P, [V | _]} | _]) -> {value, V, P}.
    
    

    Making a gen_server to drive the SUT.

     

    Strictly speaking, this example does not need a proper_statem test. You could just build up the priority queue by using valid operations and then run tests. But to illustrate the use of statem we have built a simple driver for a priority queue. This code is in okeuday/pqueue/src/queue_srv.erl and it is not that interesting. We just have a separate process which keeps track of the internal state of the priority queue for us. This means that we can only observe the queue_srv by means of what we prod it to do and what it answers with.

     

    You can have a look at the code, but it is not that interesting. It merely reflects what operations are allowed on a priority queue.
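    For orientation, a driver along these lines might look like the following sketch. This is not the actual queue_srv.erl from the repository, and the callback names on the implementation module M (M:new/0, M:in/2, M:in/3, M:out/1, M:len/1) are assumptions based on how the server is called from the test code below:

    -module(queue_srv).
    -behaviour(gen_server).

    -export([start_link/1, stop/0, in/1, in/2, out/0, len/0]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
             terminate/2, code_change/3]).

    start_link(M) -> gen_server:start_link({local, ?MODULE}, ?MODULE, M, []).
    stop()        -> gen_server:call(?MODULE, stop).

    in(V)    -> gen_server:call(?MODULE, {in, V}).
    in(V, P) -> gen_server:call(?MODULE, {in, V, P}).
    out()    -> gen_server:call(?MODULE, out).
    len()    -> gen_server:call(?MODULE, len).
    %% The remaining operations (is_empty, is_queue, pout, out/1, to_list)
    %% follow the same pattern and are omitted here.

    %% The state is the implementation module M together with its queue value.
    init(M) -> {ok, {M, M:new()}}.

    handle_call({in, V}, _From, {M, Q})    -> {reply, ok, {M, M:in(V, Q)}};
    handle_call({in, V, P}, _From, {M, Q}) -> {reply, ok, {M, M:in(V, P, Q)}};
    handle_call(len, _From, {M, Q})        -> {reply, M:len(Q), {M, Q}};
    handle_call(out, _From, {M, Q}) ->
        %% Assumed to return {Result, NewQ} where Result is 'empty' or
        %% {value, V}; that is what the postconditions below compare against.
        {Res, Q2} = M:out(Q),
        {reply, Res, {M, Q2}};
    handle_call(stop, _From, State) -> {stop, normal, ok, State}.

    handle_cast(_Msg, State)            -> {noreply, State}.
    handle_info(_Info, State)           -> {noreply, State}.
    terminate(_Reason, _State)          -> ok.
    code_change(_OldVsn, State, _Extra) -> {ok, State}.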

    Introducing the StateM behaviour

     

    We need a bit of work in order to support the proper_statem behaviour:

    -module(pqueue_proper).
    -include_lib("proper/include/proper.hrl").
    -behaviour(proper_statem).
    
    -export([command/1, initial_state/0, next_state/3, postcondition/3,
        precondition/2]).
    
    

    The next part is somewhat wrong from a type perspective; I should probably change it. The state is typed far too broadly here, but the code will still function properly.

    -type value() :: integer().
    -record(state, { in_queue :: [{value(), term()}] }).
    -define(SERVER, queue_srv).
    
    

    Now we are at the point where we need to handle the callbacks required by proper_statem. The first one we will attack is command/1, which states what commands are eligible for firing. But before doing that, we need some helpers:

    priority() -> integer(-20, 20).
    value() -> integer().
    
    

    These are generators: one of priorities and one of values. Michael decided that priorities should be in the range -20 to 20, so we reflect that in our generator. The values are always integers, but they are unbounded.

    priority(InQ) -> elements([P || {P, _} <- InQ]).
    
    

    Given a queue, we extract all the priorities in that queue and use elements to select one at random. This is a generator of priorities that already exist in the queue. Then there is the initial state of the system, which is the empty priority queue:

    initial_state() -> #state { in_queue = [] }.
    
    

    Commands

     

    Now we are ready to define the command function. It is keyed by the current state, so we can take that into consideration as well. It may be that the current state limits what commands are eligible for firing.

    command(#state { in_queue = InQ }) ->
        oneof([{call, ?SERVER, in, [value()]},
               {call, ?SERVER, in, [value(), priority()]},
               {call, ?SERVER, is_empty, []},
               {call, ?SERVER, is_queue, []},
               {call, ?SERVER, len, []},
               {call, ?SERVER, out, []}] ++
              [{call, ?SERVER, out, [priority(InQ)]} || InQ =/= []] ++
              [{call, ?SERVER, pout, []},
              {call, ?SERVER, to_list, []}]).
    
    

    We generate calls to ?SERVER, which is our queue_srv gen_server driver. We use the generators given above to produce random values and priorities where applicable. Note the trick [{call, ?SERVER, out, [priority(InQ)]} || InQ =/= []], which is a degenerate list comprehension. The list is [] if InQ is empty. Otherwise it uses the priority/1 function from above to pick, among those already in the queue, a random priority to remove.

    Updating our model

     

    Now, assume we have run an operation on the SUT via the command/1 callback. Then our model needs to be updated with the same operation as well; otherwise we could not check for correctness. The function next_state(State, Ret, Call) reflects this change. The three parameters are the current State of the model, the return value Ret of the operation we invoked, and Call, which describes the command we executed.

     

    Note: There is one very important thing about the next_state/3 function. It is used twice internally in PropEr/QuickCheck. First it is used in an abstract mode where all values are symbolic in nature. That is, you cannot rely on a return value being a "real" value; rather, you must only pass values around without discriminating on them. Second, the function is used concretely when you are running the test cases. The terms used are symbolic and dynamic, respectively. Just keep this in mind when you write your own tests: since you may be executing this with symbolic values, it is limited how much you can discriminate on them.
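    For a feel of what "symbolic" means here, these are roughly the shapes of the terms involved (the concrete numbers are only illustrative):

    %% Symbolic phase: results are placeholders and calls are plain data.
    %% {var, 3}                                - the not-yet-known result of command 3
    %% {call, queue_srv, in, [{var, 3}, 7]}    - a call whose argument is symbolic
    %% Only in the dynamic phase are these replaced by real values.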

     

    Here is the first clause:

    next_state(#state { in_queue = InQ } = S, _V, {call, _, out, []}) ->
        S#state { in_queue = listq_rem(InQ) };
    
    

    This means we had a call to out/0 on the queue and we should remove an element from the queue. We then call our own listq_rem/1 function to track this operation in our model state. Note also that some of the clauses do not update the state, since the generated calls do not alter it. And note that we sometimes look inside the call to figure out what values were generated as input, so we can transition our model state accordingly:

    next_state(#state { in_queue = InQ } = S, _V, {call, _, out, [Prio]}) ->
        S#state { in_queue = listq_rem(InQ, Prio) };
    next_state(#state { in_queue = InQ } = S, _V, {call, _, pout, _}) ->
        S#state { in_queue = listq_rem(InQ) };
    next_state(S, _V, {call, _, to_list, _}) -> S;
    next_state(S, _V, {call, _, is_queue, _}) -> S;
    next_state(S, _V, {call, _, is_empty, _}) -> S;
    next_state(S, _V, {call, _, len, _}) -> S;
    next_state(#state { in_queue = InQ } = S, _V,
                   {call, _, in, [Value, Prio]}) ->
        S#state { in_queue = listq_insert({Prio, Value}, InQ) };
    next_state(#state { in_queue = InQ } = S, _V,
                  {call, _, in, [Value]}) ->
        S#state { in_queue = listq_insert({0, Value}, InQ) }.
    
    

    Preconditions

     

    This example does not use preconditions:

    precondition(_S, _Call) -> true.

    Normally you use preconditions to limit what calls can be done in what state. That is, if the precondition fails, then the given transition is not allowed. One can use this as a constraining measure in some tests, as sketched below. But for this example, everything can always fire, so there is no reason to limit the calls in any way.
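    As an illustration only (this example does not need it), a precondition that forbids removing at a given priority from an empty model queue could look like this sketch:

    precondition(#state{ in_queue = [] }, {call, _, out, [_Prio]}) -> false;
    precondition(_S, _Call) -> true.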

    Postconditions

     

    The postcondition is where we check that the model and the SUT agree on the observations. This is not executed symbolically but is entirely dynamic execution. Here is the first clause:

    postcondition(#state { in_queue = InQ }, {call, _, out, [Prio]}, R) ->
        R == listq_prio_peek(InQ, Prio);
    
    

    It states: if we have a current model state InQ, a call to out with a specific priority Prio, and the SUT returned the value R, then peeking in InQ for the first element at that priority should yield the same element. After this check, the state transition with next_state/3 will happen. Equally, the clauses

    postcondition(#state { in_queue = InQ }, {call, _, pout, _}, R) ->
        R == listq_ppeek(InQ);
    postcondition(#state { in_queue = InQ }, {call, _, out, _}, R) ->
        R == listq_peek(InQ);
    
    

    handle the remaining checks for removing elements from the priority queue. Converting to a list or taking the length can also be checked by calling our model variants and testing for equality:

    postcondition(S, {call, _, to_list, _}, R) ->
        R == listq_to_list(S#state.in_queue);
    postcondition(S, {call, _, len, _}, L) ->
        L == listq_length(S#state.in_queue);
    
    

    Finally, one can ask if we have a queue, which should always be true, and we can ask if the queue is currently empty, which we determine by discriminating on our model state. In this case the two should also agree. In addition, inserting elements into a queue always succeeds:

    postcondition(_S, {call, _, is_queue, _}, true) -> true;
    postcondition(S, {call, _, is_empty, _}, Res) ->
        Res == (S#state.in_queue == []);
    postcondition(_S, {call, _, in, _}, _) ->
        true;
    
    

    All other outcomes are errors:

    postcondition(_, _, _) ->
        false.
    
    

    Running the test

     

    Running the test amounts to checking the following property:

    correct(M) ->
        ?FORALL(Cmds, commands(?MODULE),
           ?TRAPEXIT(
              begin
                  ?SERVER:start_link(M),
                  {History, State, Result} = run_commands(?MODULE, Cmds),
                  ?SERVER:stop(),
                  ?WHENFAIL(
                     io:format("History: ~w\nState: ~w\nResult: ~w\n",
                               [History, State, Result]),
                     aggregate(command_names(Cmds), Result =:= ok))
              end)).
    
    

    Here M is the module we wish to test. Michael wrote several implementations and we can test them all with the same code! We start a queue_srv with M as the module and then run a series of commands. The server is then stopped. Upon failure, we output the necessary information to figure out what went wrong. We also aggregate how often each command is hit, so we are sure we have decent coverage.
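    Running the property from the shell might look like the following sketch, where pqueue is one of the modules under test and 1500 is the number of test runs passed to proper:quickcheck/2:

    1> proper:quickcheck(pqueue_proper:correct(pqueue), 1500).
    %% ... one dot per test, the aggregated command distribution, and finally:
    true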

     

    The test uncovered some errors straight away. Here is one such case, which was originally much, much bigger but got minimized by PropEr:

    [{set,{var,1},{call,queue_srv,in,[-18]}},
     {set,{var,5},{call,queue_srv,in,[9]}},
     {set,{var,6},{call,queue_srv,in,[-10,-4]}},
     {set,{var,18},{call,queue_srv,in,[-29]}},
     {set,{var,22},{call,queue_srv,in,[11]}},
     {set,{var,26},{call,queue_srv,len,[]}}]
    
    

    So, if you insert the values -18, 9, -10 (at priority -4), -29 and 11, and then ask for the length of the priority queue, something failed (the pqueue code crashed).

     

    The point here is that normal unit tests usually won't come up with examples this complex. But starting out by asking "What can be done here?" by means of generators, and then carrying out random operations, uncovers nasty errors quickly. Michael's code now routinely survives 1500 PropEr test runs on all his priority queue modules. Given that it took 1-2 hours to write the test, it seems like it was worth it.


  3. A common problem in concurrent systems is the following: events fire from all places all the time and you have no direct control over when and why events fire. Thus, to figure out problems one can often make use of a sequence diagram like this one (thanks Wikipedia):

    The purpose of such a diagram is to show the interactions between different concurrent parties. In this case, Fred, Bob, Hank and Renee - in a restaurant. One can easily draw such diagrams on paper. But the problem with such a drawing is that it is disconnected from the machine. What the drawing shows may or may not be what the program actually does.

    Wouldn't it be neat if you could make such diagrams by tracing what the code does and then build up the diagram in a programmatic way? Well, with Erlang, you can. I'll use some code I wrote as an example for how to do it.

    Step 1: Build a tracer function
    The first step is to construct a tracer function. Mine looks like the following and resides in a module called "utp".

    report_event(DetailLevel, FromTo, Label, Contents) ->
        %% N.B External call
        ?MODULE:report_event(DetailLevel, FromTo, FromTo, Label, Contents).
    
    report_event(_DetailLevel, _From, _To, _Label, _Contents) ->
        hopefully_traced.
    
    
    
    The basic idea is found in the latter of these two definitions. It defines a dummy function taking 5 arguments which are promptly thrown away. The return value "hopefully_traced" is arbitrary, but it signifies what we want out of this function. The first variant is used when the same party does both things (From and To are the same).

    The arguments are as follows:

    • DetailLevel: A number between 0 and 100. It signifies the level of detail of this event. We can use it to give major events a low level and more fine-grained, detailed events higher levels. The event tracer can cut off events based on this value so we get the desired granularity.
    • From: The party from which the event originates.
    • To: The party to which the event is sent.
    • Label: The notational label to put on the message.
    • Contents: If you click on the event, we will show this term to the user. It can be used to pack up more detailed information to present, while avoiding cluttering up the diagram.
    Step 2: Trace the events in the program
    Whenever something important happens in the program, you add a call to "utp:report_event/5", or whatever the designation for your tracing function is. This will by default do nothing, but it allows us to hook that function later with Erlang's tracing facilities. As an example, here is the interaction from above:
    trace_test() ->
        Events = [{fred, bob, order_food},
                  {bob, hank, order_food},
                  {bob, fred, serve_wine},
                  {hank, bob, pickup},
                  {bob, fred, serve_feed},
                  {fred, renee, pay}],
        [utp:report_event(50, F, T, L, [])
         || {F,T,L} <- Events].
    
    
    
    But note that anything can be reported, really, by spreading the event reporting function out over your code base. Rather than running scores of debug statements, you can add a reporter at the relevant point in the code. Also note I have condensed the example a bit with a list comprehension, because I don't care about the detail level and the contents here.

    Step 3: Introduce code to invoke the et application
    Next, we need to invoke Erlang's et application in the right way. I have a module called "utp_filter" which does it, even though it still doesn't contain any filter function. Here is the code of interest:
    start(ExtraOptions) ->
        Options =
            [{event_order, event_ts},
             {scale, 2},
             {max_actors, 10},
             {detail_level, 90},
             {actors, [fred, bob, hank, renee]},
             {trace_pattern, {utp, max}},
             {trace_global, true},
             {title, "uTP tracer"} | ExtraOptions],
        et_viewer:start(Options).
    
    
    
    What this code does is initialize the et_viewer in a way that works with our example. The event order is event_ts, which means we timestamp events when they fire, not when they are received. The actors option defines the order in which the actors appear. The trace_pattern is pretty important: it is what hooks the "utp:report_event/5" call. You can give it many options, or even produce such patterns yourself, but just naming the module in which your trace function is placed will work.

    Step 4: Test
    If we invoke "utp_filter:start([])" which starts up the et viewer by executing the code in step 3, we can then invoke the trace test code from step 2 to obtain:

    This should reflect the example from Wikipedia, but now obtained by executing a program in Erlang. I have used this idea to capture and find bugs in TCP-stack variants, in particular uTP:



    (Even if these diagrams are small, the basic idea should be in there - it is not about the diagrams, but about the steps to obtain the glory yourself.) For a TCP-stack-like protocol it is very powerful: you have tcpdump(1) output, strace(1) output, as well as the internal protocol state in the same diagram. I found quite a few bugs in my code with these tools, simply by inspecting the interaction of the programs.

    And that's it. With the above code in your application, you can enable a graphical trace viewer on your code at any time. If you don't enable it, the overhead of the trace functions is negligible - unless you have a very hot critical path indeed. If you do enable it, you can see exactly what your code does in the given situation.
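    Putting the steps together, a tiny helper like the following sketch ties it all up (demo/0 is a made-up name, assumed to live in the same module as trace_test/0 from step 2):

    demo() ->
        utp_filter:start([]),   %% step 3: start the et viewer and hook utp:report_event/5
        trace_test().           %% step 2: fire the events; they now show up in the viewer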


  4. Now that the Dart language is out at http://dartlang.org, we can begin looking at the language and see what we think of it. Everything here is musings from reading the spec of the Dart language. Note that my points are quite subjective at times. Don't expect this to be an objective run-down of the language. Rather, I wanted to write down what I thought about the language just by looking at the specification alone.

    Starting out, we generally won't focus on the syntax of the language. The syntax is boring, tedious and just there. It is Java/Javascript-like, mostly to interest the right kinds of people in the language. We note that the syntax is considerably shorter and leaner than the equivalent Java code and that there is notational machinery to make certain things easier to write down in programs.

    The major goal of Dart is to clean up Javascript and to make design choices in the language which do not hamper raw execution speed. Javascript is not built for compilation into fast programs. There are simply too many nasty warts and design choices in the language which make it very hard to compile. V8 is astoundingly good at it, but by altering the source language a little bit, you can squeeze out much more power with very little effort.

    I think this goal is central to Dart: don't design yourself into a corner with language features.

    Dart is single threaded but supports concurrency through an Actor-like concept of isolates. The choice of making the runtime single threaded is understandable. It is vastly simpler to write a compiler for a single-threaded runtime than for a multi-threaded one. Also, the focus is perhaps on writing programs for web browsers, and the amount of extra processing power to be had there from multiple processors is perhaps limited. Note that while the language specification says it is single threaded, the Isolate concept in the library talks about heavy isolates which are separate threads. One could imagine that execution inside an isolated region is always single-threaded (like in Erlang, for instance), but that the system as a whole might have several isolates working simultaneously.

    On the other hand, even your modern mobile device goes multi-core. Thus, the choice is as much a gamble as opting for a multi-core solution from day one. This choice is simply the conservative one.

    Dart makes it explicit that it has several kinds of errors and that these may come "lazily", on demand. The idea is to facilitate JIT compilation, so an implementation is not required to report all errors up front. An understandable and interesting choice. For a programmer with a proactive view (Haskell, Java, OCaml) on error reporting this sounds rather bad, but for a programmer with a reactive view (Erlang) it sounds quite familiar.

    Naturally, the language makes the almost ubiquitous billion dollar mistake of C.A.R. Hoare: the language includes null as a default. Almost all non-functional languages do.

    An interesting choice is to enforce simplicity on static/final values. This is to ensure fast startup times, so a program can't hide a large computation in a global variable at startup.

    Amusingly, the language must make sure that every function returns 'null' in the case where the function has no return statement. This is funny because in functional languages this kind of problem doesn't exist. A 'void' function can't be used in expression context in an imperative language, but since Dart's type system is not enforced, there are no guarantees. Hence a kludge is introduced.

    Functions are first class values with a simple notation for the equivalent of a lambda construction.

    The language has operator overloading, but as is common in imperative languages, you cannot define your own operators. Rather, you must use the operators that come with the language.

    Stealing ideas from other languages, there is convenience notation for getters and setters. On the other hand, there is no mention of lenses. The language is pretty much a standard OO language with no mention of prototypal OO as in Javascript. This choice is probably smart, since prototypal OO is irritating to compile.

    There is a built-in notion of factory construction, much like the notation of getters and setters. The idea is to allow a constructor to act as a factory under the hood so the callers don't need to get exposed to the fact that there is a factory. One could imagine this concept extended to other classic design patterns over time.

    The language is single-inheritance + interfaces + generics. Unlike Go, this language does support generics, with upper bounds on the class hierarchy. In other words, the language has bounded quantification. It even supports F-bounded quantification on generics (also called recursively bounded quantification - you allow the class parameter T to occur in the bound as well: "T extends I(T)"; see Wikipedia on bounded quantification).

    The language is purely object-oriented: everything is an object. You see that choice in the fact that null is the sole instance of the class Null, for example.

    Dart has true integers of unbounded size. There is no cramming everything into double values. There are no pesky size bounds to battle. In other words, Dart picked the innovative choice.

    I wonder... if a bool is an object, can I have a null bool?

    Strings are Unicode. We get lists and maps in the language. We have a compact functional expression notation for writing small, simple functions. It is perhaps the easiest way to get some functional ideas into a language without making it stop looking like Java.

    Note that a simple expression like '2 + 1' is internally compiled into an invocation of the 2 object's + method with 1 as the parameter.

    In a very controversial move the type system is unsound. Having an unsound type system means that the type system may allow for certain bugs to happen in production code. Time will show if this is a problem in practice. It may be you almost never hit the problem in practice and then this choice might end up being pretty sane. But if you can hit the unsoundness easily, this choice will end up being pretty bad.

    As in Java, constructing new types is done by making classes. To construct compound types, you need to use generics and parametrize the types as holes in the generic.



    The most interesting part of the language, which is otherwise pretty conservative, is the choice of baking both dynamic and static types into the same language. While the static type system is unsound, time will show if this unsoundness is troubling. C is also pretty bad type-wise, but that has not kept people from writing software in it and using the type system to weed out bugs.

    The other interesting part is hidden in the library as a library extension and is called Isolates. An Isolate is, technically, an actor which runs by itself. There is a concept of message-passing ports between actors so one can send and receive information. Naturally, futures also form part of this definition.



    My primary critique of the language is that Dart brings nothing new to the table. It looks like a Java-for-the-web clone without providing anything new. Granted, it may sway some Java programmers to move toward the Web in the long run. And granted, this is in the interest of Google, because the reliance on Java is dangerous when Oracle controls it. As a language, Dart is considerably less innovative than Google Go. Rather, it buys into existing ideas and adds a gradual type system as its most innovative feature. I am not too worried about the unsoundness currently, but it is an inherent danger lurking until the static checker can warn about it.

    The secondary critique concerns the Isolate construction. First, there is a difference between a heavy and a light isolate. One provides a new thread of execution while the other provides a light-weight process in the current thread. There is no concept of migration, so the programmer has to know beforehand how the program will run and execute in order to use isolates. It does provide pure isolation though, something not present in Go.

    There are problems with this though. Can I create 100,000 heavy isolates? Can I create 100,000 light isolates? Suppose one of the light isolates ties up the CPU for a long time. Will isolates then be preemptively scheduled out so others get a chance, or will the system keep running the single isolate while all others wait in the queue? I am asking because it hurts latency.

    Second, the notation is not built into the language but provided as a library. This usually leads to heavier-weight code.

    Third, there is no way to scrutinize the mailbox. You get a callback whenever a message is delivered, but it is up to you to build the search infrastructure for the mailbox. This is in strong contrast with Erlang and it usually leads to more complex actors in the long run.

    Fourth, there is no concept of robustness. I can't link together Isolates so I can be told when other Isolates die and take care of it. Without links and monitors, it is cumbersome to build anything resembling OTP for Dart.

    Fifth, where is my selective receive? We can easily imagine a single isolate having access to multiple ReceivePorts.

    In conclusion, the current Isolate implementation leaves much to be desired. To be honest, I'd much rather hack Google Go in the browser than Dart. Go is a neater language, is more innovative, is farther ahead and has built-in concurrency primitives. You just need to add a jQuery-like interface to Go and then it can substitute for Dart as an imperative language.

    What Dart DOES provide, however, is the ability to run faster than Javascript. It is clear that some of the choices enable the language to be compiled to much faster-executing code. Also, if all you know is Javascript, the language is probably pretty neat. To the Java programmer, proper closures will be a warm welcome.

    But for a guy who has seen things you wouldn't believe. Haskell ships off the shoulder of MultiCore CPUs. Watched OCaml glitter in the dark near TAPL. All those moments will be lost in time, like tears in rain. Time to die.

    Finally, a disclaimer: I may be wrong on several occasions. I've had some 1-2 hours to study the language, and you can't really grasp a programming language in less than 1-2 months. So I may be wrong. Wrong on all points. Like fine wine, languages need a bit of time to mature and bloom into what they are. Dart is pretty young and drafty in its nature. A lot of things can happen.

  5. When a dog owner wants to train his dog, the procedure is well-known and quite simple. The owner runs two loops: one of positive feedback and one of negative ditto. Whenever the dog does something right, the positive feedback loop is invoked and the dog is treated with a snack. Whenever the dog does something wrong, the dog is scolded and the negative feedback loop is used.

    The result is positive and negative reinforcement of the dog's behavior. The dog will over time automatically behave as the owner wants and never even think of misbehaving.

    When a programming language trains its leashed programmer, it likewise uses positive and negative feedback. Whenever a problem is easily solvable with the constructs and features of said language, it reinforces the use of those features and constructs. In the same vein, if something is hard to do in the language, the programmer will shy away from even thinking the idea, since it may be too hard to express. Another negative feedback loop is when the resource usage of a program is bad: it uses too much memory or too many CPU resources to carry out its work. This discourages the programmer from using that solution again.

    The important point is that while all practical general purpose languages are Turing complete, the way they train programmers to behave as they want is quite different. In an Object Oriented language for instance, the programmer is trained to reframe most - if not all - questions as objects of bundled fields and methods. A functional programmer is trained to envision programs as transformations of data from the form X into the form Y through the application of a function. And so on.

    Which brings us to how ZeroMQ and Erlang will train its programmers into doing vastly different things:

    In ZeroMQ, the programmer will configure a messaging network. That network will probably be fairly static. It is not necessarily the case that one configuration, set up when the system starts, stays fixed for the system's entire execution, but chances are that the configuration is still fairly rigid and not subject to much change.

    Around this messaging network, we now add processes which operate on the network. We send and receive messages on the network, processing them in various different programming languages. A process is also fairly static in the sense that it is started and then tends to use the messaging network as a queue: it will pick off some work, look at the data, process the data and push it back into the messaging network again. Taken to the extreme, ZeroMQ applications are often queue processors like this.

    Incoming requests are transformed into messages and sent into the ZeroMQ messaging network. A message is then processed by some vacant process at the other end and moved from queue to queue until it finally completes. Note that ZeroMQ is more complex than a simple point-to-point messaging framework, but the gist of the idea here is that many applications will have the above-mentioned system design.

    In Erlang, the programmer will take the request and turn the request into a process. Erlang programmers are trained by the language to like having a large number of small processes like this, because the language encourages that behaviour. ZeroMQ solutions, on the other hand, will be discouraged from dequeuing a message and creating a process out of it - if the processing language uses traditional heavyweight threads for its processes.

    Now, given that the request is a process, an Erlang programmer will do the dual of what the ZeroMQ programmer does. Since the process is now the request, whenever the request needs to do some processing, it will do so itself. If it needs some external information, the request/process will ask for it. The preemptive scheduler will ensure that no request can stall the pipeline. Also, since the scheduler in Erlang is parallel, it is relatively easy to get all cores to work, simply by having enough requests/processes in the run-queue.
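    As a minimal sketch of this request-becomes-a-process style (the function and module names here are made up for illustration, not taken from any real code base):

    handle_connection(Socket) ->
        %% Each incoming request gets its own process; the request data is now
        %% a thread of control which can go ask other parties for what it needs.
        spawn(fun() ->
                  Request  = read_request(Socket),      %% hypothetical helper
                  UserData = user_db:lookup(Request),   %% point-to-point ask, on demand
                  reply(Socket, render(Request, UserData))
              end).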

    Note, also, how the messaging network in this solution is fairly dynamic. Point-to-point exchanges are formed on demand when a request needs some information.

    To the Erlang programmer, the representation of the request is the process. There is no "message" living in a queue which is the representation of the request at any point. There is less need to tune fan-in/fan-out at the ends of queues to make parallel work go through, since the schedulers will automatically parallelize.

    Both styles have their advantages and disadvantages. But that is not the interesting point. The interesting point is how the ZeroMQ idea encourages one system design, whereas the Erlang solution encourages another. In Erlang a process is cheap, so it is totally feasible to turn each request into one. Furthermore, since there is no concept of a message channel, sending a message to a target is all about knowing or finding the name of the target (see for instance Ulf Wiger / Erlang Solutions' gproc framework for an efficient name registry). On the contrary, ZeroMQ has the concept of a generalized socket as a channel endpoint, and assuming the programming language we use for processing does not have cheap processes, the only feasible solution is to build a static channel network with processors.

    Another interesting point is that most languages with static type systems have latched onto the channel solution. It is easy because a channel can be annotated with the type traversing the channel. But it also trains programmers to think in that model. It would be fun to see a language where process ids - that is, names of processes - carry the type of what that process receives.

    The concluding remark: ZeroMQ/Erlang above is just an example. You can find this kind of reinforcement in basically any language design choice. It is rare that a feature is universally good or bad. Rather, most choices are a trade-off. Haskell's lazy evaluation enables some optimizations but disallows some others. OCaml, being eager, can do the optimizations that Haskell can't and can't do the ones that Haskell can. Similarly, for a programmer, getting easier access to some feature often means less control of something else, or that something else becomes much harder to pull off. The next time you enter into a discussion on the merits of a given language feature, make sure there is not a programming language holding your leash and telling you what is right or wrong!



  6. This post is all about parallel computation from a very high level view. I claim Erlang is not a parallel language in particular. It is not created with the primary goal of speeding up computation and harnessing multiple cores, your GPU and so on. If that is your goal, there are other languages which will probably fare much better for your needs (Haskell, for instance).

    Note however that while Erlang is not a parallel language, its runtime is rather excellent at forcing parallelism out of existing concurrent programs. So when we say Erlang is parallel, we say that Erlang is parallel in a specific way! Recent years have seen much work in Erlang/OTP on making the runtime execute concurrent programs in parallel, and we are reaping the benefits. The reason can be found in the simple observation that an Erlang program has thousands of processes, which gives thousands of executable threads of control. Since you have more than one thread of control, and communication between them is largely asynchronous, you have every opportunity for a parallel speedup.

    Parallelism vs Concurrency
    A fine point indeed is that parallel computation and concurrency are different entities when considering computation. Simon Marlow has a blog post in which he describes the details.

    The gist of it, however, is rather simple: parallel computing is hunting for a program speedup when adding more execution cores to the system running the program. This is a property of the machine. Concurrent computing is writing a program with multiple threads of control such that each thread executes non-deterministically. This is a property of the language.

    If you come with a view from the hardware upwards toward the computing logic, then surely you may fall into the trap of equating the two terms. The reason is that to implement parallel computation on current hardware and software, you use concurrent semantics. With the pthread library, for instance, you are writing a concurrent program which then takes advantage of multiple processors if they are available. To program for the shared-memory architecture, you use mutexes and locks. Most other methods build on top of these primitives. Deep down at the hardware level, it is the presence of a compare-and-swap operation that is at the heart of the idea.

    One may then surmise: are we always going to use these primitives deep down when we implement parallel programs? Are we always going to go parallel from the concurrency primitives? If your answer is largely yes, then you may hold the view from below, from the hardware and up.

    The view from the logic and down is that the language has certain constructs available to it, of which some are parallel. How we are going to implement those features is postponed until later. That is, we squint our eyes in a way such that we cannot see the details deeper down but can concentrate on the general ideas. One could imagine different hardware on which another implementation would be more beneficial, for instance by running on special-purpose hardware for the task, an FPGA or a GPU.

    The main point is that parallelism is a declarative notion in the logic. If you want an array processed in parallel, you are not going to explain how the underlying implementation should squeeze more computational work onto multiple cores. You are just stating that the semantics of your program allow it to do so. Most importantly, in parallel array processing you must be very specific about the order in which elements are processed. For some computations the order doesn't matter; for other computations there is a notion of order and you can't screw it up. This difference in semantics w.r.t. order is among the things that concern a language designer with the view from the logic.

    So there are essentially two views: from the hardware or from the logic. From the perspective of the logic - if you have toyed with formal semantics - concurrency primitives are defined formally in a different way than parallelism primitives. That is, they are different entities. The perspective of the hardware, on the other hand, conflates the ideas because the only way to achieve parallelism is to use concurrency - especially on UNIX.

    Erlang is not a parallel programming language!
    Erlang is not a concurrent functional programming language as such. It is not concurrent in the sense that concurrency was the goal: Erlang was built for robustness and fault tolerance, and to implement robustness and fault tolerance it was decided that concurrency was a good vehicle. Likewise for the decision to make the language functional. These choices were merely made in order to achieve the greater goal, and had to be made along the way.

    Erlang is not a parallel language either! The goal is to provide robustness and fault tolerance. Going parallel does not directly facilitate this goal, and hence from the perspective of the language logic there are very few primitives that have to do with the concept of parallelism. Contrast this with concurrency, of which there are several central primitives built into the language.

    From the view of the hardware and the actual implementation of Erlang/OTP, much attention has been paid in recent years to making the Erlang VM able to execute processes in parallel. This has mainly been done because you can then increase hardware utilization and hope to gain a speedup in your programs. In other words, it is not a necessity, but a nice-to-have feature.

    The main point to take away from this observation is that in Erlang, parallel computation is implicit. If you write your program in an idiomatic way, you will often find that there are automatic speedup gains from going to multiple cores. But you have little direct control of how things are going to be parallel. Rather, you design your program such that it can implicitly take advantage of multiple cores.

    There is an inherent want for many newfangled Erlang programmers to harness the full potential of their new K-core machine (with K > 1). The idea is simply that by using Erlang, the program will run faster because you will stand a chance at getting the utilization of the cores up.

    But this is not, and has not been, the primary focus of Erlang! Hence it is not a given that your program will run faster. To make it go fast, the program has to be written in a style that makes it possible for multiple cores to work on it at the same time, implicitly.

    That said, Erlang was designed in a way that makes automatic use of multiple cores much simpler than in some other languages out there. In fact, in many cases you should be able to speed up an unaltered program just by adding more cores. Given the focus on robustness and fault tolerance, this decision sounds logical. If you have a system that works and is relatively free of bugs, then you should not have to go and change the internals of the program to make it run faster. The danger is that your newly altered program is incorrect, and then it does not matter at all whether it runs faster or not.

    One might state the goal of Erlang's parallel endeavours as: we hope to achieve a relatively good speedup when adding multiple cores, without you having to change the program. That is, if you add more cores, we expect some speedup, but not necessarily the best possible speedup. It is an attempt at a best-effort solution rather than a perfect one. Herein lies the gist of why Erlang is not a parallel programming language: it is not about maximizing the throughput and usage of the many cores, but about using them to automatically enhance an already written program.

    Finally, if one looks at Erlang's history, parallelism came to the language fairly late. While some believe that the language is excellent in a multi-core era because it is (relatively) easy to make the runtime parallel, this was not the goal initially. By sheer luck, some of the design decisions made for Erlang make the language brilliant for running on multiple cores.

    Finally, a word has to be said on the granularity of processes: what is the smallest piece of computation that can be made to work on multiple cores? In Erlang, the grain is finer than most other languages' concept of "threads". And processes are so cheap that it is easy to spawn a new one to do a separate computation. But it is important to note that the grain is coarser than the Haskell concept of a "spark" (see Runtime support for multicore Haskell (PDF) for instance). The spark concept is spot-on for squeezing out parallelism at a much finer grain than what Erlang has. The essential trick for sparks in Haskell is that they share the Thread State Object (context), so you don't have to start new threads to service sparks, making them much cheaper to execute even when there are millions of them. As an explicit method of gaining parallelism, sparks are a very interesting concept.

    In conclusion: Erlang can do parallelism in the sense of adding it on top of concurrency. This means that parallelism is implicit rather than explicit in the program, and hence you want to program in a certain style to make sure your program goes fast. The rest of this post gives ideas for doing exactly that.

    Utilization vs Speedup
    It is very easy to get full utilization of a 128 core machine: Let 1 core do the actual work, and let the remaining 127 cores run an infinite loop.

    All cores will now show 100% utilization, yet the program is not running any faster than before. We would love to have the program run 128 times faster, now that we have 128 times the computational power available. The actual *speedup* is what we are after.

    As a programmer, you have two goals. The first goal is to get the utilization up. You can't hope for speedup if cores are sitting there, idling. *The utilization is an upper bound for speedup*. Next, the goal is to make the additional cores do something better than the endless loop above. That is, decrease the execution time of the program.

    It is worth looking at Amdahl's law in its parallel variant. The central aspect of the law is that *serial parts of a program will slow your program down*. Such a part can only run on a single core, and hence it will force the other cores into idling if they have to wait for it. And when we consider a parallel part, there is a critical path, called the span of the computation (see Work & Span for instance). Making the span smaller will also yield a faster program - but when optimizing for the span, the critical path may jump to another part of the computation.

    Also of importance is Gustafson's law. Amdahl assumes a fixed workload size: if we can complete that workload in half the time, we have a speedup of two. But as the number of cores increases, so does the overhead of orchestrating their communication. Gustafson (et al.) proposed that we instead make the workload size the scaling variable. That is, suppose we solve a workload of size 1 in 1 minute on 1 core. Now, if we have 8 cores, perfect speedup would mean solving a workload of size 8 in 1 minute on 8 cores. Contrast with Amdahl: workload 1, 1/8 minute, 8 cores.
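    To make the two views concrete, here is a small, hedged sketch of the two formulas in Erlang (my own illustration, not from the original laws' papers). P is the parallelizable fraction of the program and N the number of cores:

     %% Amdahl: fixed workload. The serial fraction (1 - P) bounds the speedup.
     amdahl_speedup(P, N) -> 1 / ((1 - P) + P / N).

     %% Gustafson: scaled workload. The workload grows with the core count.
     gustafson_speedup(P, N) -> (1 - P) + P * N.

    With P = 0.95 and N = 8, Amdahl gives roughly 5.9 while Gustafson gives 7.65 - the difference lies purely in what we hold fixed.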

    Erlang and speedup
    While Erlang is not an explicitly parallel language, it does not hurt to write your Erlang programs in a style which invites multiple cores to do work. In this section, we will try to give some general advice on how to do exactly that. To let Erlang execute your program in parallel, you obviously have to structure the program accordingly.

    First, your process run-queue must have enough work that it can be divided among several cores. If there is only ever a single process that can run, that will limit the parallelism of your program.

    Second, be aware of serializing bottlenecks. The executing process is the unit of granularity. So if a large number of workers all go to the same single process for their data, that single process becomes a contention point; that part of the code is serial, and for it there will be no speedup. If queries to that single process dominate your program, you are in bad shape. The extension of this is rather simple: if you have 4 workers processing a queue, then your maximal speedup is 4. This may not be enough on a 16-core machine.

    Third, synchronous calls block more than asynchronous casts. If possible, prefer asynchronously communicating processes over synchronous ones. Long chains of synchronous calls following each other are serialization in disguise: they block out other callers while they are processing, and they are also prone to creating deadlocks. Asynchronous calls will in addition tend to fill up mailboxes with work. It is much more efficient to handle 8 messages in a context-switch time slot than it is to handle 1 message and then switch right away on an empty queue. In some cases you can go pseudo-async by making, say, 1 out of 100 calls synchronous. This is an effective way to ensure that you don't overflow a slow receiver, as it forces in some amount of flow control.
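    A minimal sketch of the pseudo-async idea (my own illustration; send_all/2 is a hypothetical name and it assumes a gen_server called Server that replies ok to {work, Msg}):

     %% Cast most messages, but make every 100th call synchronous so a slow
     %% receiver can push back on the sender and provide some flow control.
     send_all(Server, Msgs) ->
         lists:foldl(
           fun(Msg, N) when N rem 100 =:= 0 ->
                   ok = gen_server:call(Server, {work, Msg}),  % sync: flow control
                   N + 1;
              (Msg, N) ->
                   gen_server:cast(Server, {work, Msg}),       % async: cheap
                   N + 1
           end,
           1, Msgs),
         ok.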

    In Erlang a process is very cheap - far cheaper than a `pthread` thread in C. This leads to a programming methodology where you are not afraid of creating processes, much like you would not be afraid of creating objects in OOP. A million processes is not a limiting factor. A good design choice in Erlang is simply to create a process per incoming request or event. Take an HTTP server, for instance: each GET request spawns a new process to handle that GET. Had processes been heavyweight, this approach would not have been viable at all. The idea is that the data you are going to operate on is morphed into a thread of control, so the data itself can go ask other parties for information, process itself, and so on. There is a strong similarity to OOP here, as each process plays the role of a living object (in OOP the objects are "dead", only to be temporarily revived from their zombie state when the current thread of control comes by).
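    As a rough sketch of the process-per-request style (not a production server; handle/1 is a hypothetical request handler):

     %% The acceptor loop spawns a fresh process for every incoming connection
     %% and immediately goes back to accepting.
     listen(Port) ->
         {ok, LSock} = gen_tcp:listen(Port, [binary, {active, false},
                                             {reuseaddr, true}]),
         accept_loop(LSock).

     accept_loop(LSock) ->
         {ok, Sock} = gen_tcp:accept(LSock),
         Pid = spawn(fun() -> handle(Sock) end),       % one process per request
         ok = gen_tcp:controlling_process(Sock, Pid),
         accept_loop(LSock).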

    This design effectively ensures that our run-queue is always well filled with work to do. Had we opted for a queue-pipeline approach, as is common in many other languages, the number of workers on each queue would have to be tuned to the number of cores, and furthermore you would have to watch out for overloading a specific queue. In our morph-data-into-process design it is the run-queue we have to watch out for, but that is simpler and there are tools for doing it (see for instance the Erlang Solutions esl/jobs application).

    Note, however, that if these many small processes all request data from the same server-like process, then performance will dwindle and suffer. The server-like process is a serializer which we warned about earlier.

    One idiom I commonly use is to have a protected ETS table with the option {read_concurrency, true} set. When the spawned processes want to read data, they go directly into the ETS table and fetch what they need. Writes are serialized through a central process. Of course this scheme can be altered if you need a large number of writes, or general reader-writer access.

    In the newer Erlang/OTP releases (R14 and upwards, I believe) the access to the ETS table is highly parallel in the runtime. Hence, your million processes will not have to wait on each other to read data out of the table.
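    A hedged sketch of the idiom described above (module and function names are my own; a real system would likely wrap the owner in a gen_server and add error handling):

     -module(shared_table).
     -export([start/0, write/2, read/1]).

     %% One owner process serializes all writes; any process may read the
     %% protected, named table directly and in parallel.
     start() ->
         Pid = spawn(fun() ->
                             ets:new(?MODULE, [named_table, protected, set,
                                               {read_concurrency, true}]),
                             writer_loop()
                     end),
         register(shared_table_owner, Pid),
         Pid.

     write(Key, Value) ->
         shared_table_owner ! {write, Key, Value},   % writes funnel through the owner
         ok.

     read(Key) ->
         ets:lookup(?MODULE, Key).                   % reads go straight to ETS

     writer_loop() ->
         receive
             {write, Key, Value} ->
                 ets:insert(?MODULE, {Key, Value}),
                 writer_loop()
         end.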

    Another common trick is to increase the independence of computations. Whenever you can eliminate some sharing, perhaps by recalculating a value or caching it locally, you increase the independence of the computation. In Erlang semantics, processes have exclusive access to their own data, so the independence of data is already really good. What you want to avoid is having to partner with other processes to solve a task - especially if the partner is heavily contended.

    The central point is to be aware of where serialization might occur in your code and then eliminate it. Otherwise, you may see poor utilization and no speedup. There is a nice tool, Percept, which can profile the runnability of processes: when a process or port is able to run on a core.

    Also, you should be aware of the following interesting gotchas:

    I disabled SMP and now my code runs faster!
    Yes, this is a common thing. First, when you benchmark your code, a grave but all too common mistake is to run code optimized for parallel execution on a single core and use that as the baseline. Parallel code often carries overhead due to communication, and that overhead also impedes execution speed if you run the code on a single core only.

    You must benchmark the fastest possible non-parallel version of your program and count the speedup against that. Anything else is simply not fair; it is a way of making your benchmark look better by cheating.

    The overhead is also present when you take your program, written in Erlang, and run it on the non-SMP variant of the Erlang VM. It can often go faster for some workloads. The reason is that your program may not exploit parallelism that well, and thus the overhead of going truly parallel dominates the actual speedup gain. I have seen this with some of my Haskell programs as well.

    There is a fairly high price for suddenly having multiple cores inside the VM doing things. You have to synchronize in the runtime, and even though the synchronization primitives are heavily optimized, they have overhead. Now, if the problem you are looking at does not even saturate a single core, then surely it would be better to run in a single-threaded manner. It is only when you could benefit from a second core that it will help you.

    If, on the other hand, your problem maxes out the core, then adding a new core to the mix has some overhead, but it will make the computation scale further and you may experience a better throughput in the end.

    So if your utilization is well below a single core, disabling SMP is often a way to lower the amount of work your system has to do - and thereby make the code run faster.

    My code has faster than perfect speedup!
    Another common thing that can happen is super-linear speedup. That is, you may have 8 cores and expect a speedup close to 8 -- but all of a sudden you have a speedup of 12. This happens. The reason it happens is usually that when you add cores to your computer, you also alter other resources. Most notably, you often increase the amount of available cache. If you distribute the computation over multiple physical machines, you may also increase the amount of available RAM.

    These factors can improve your computational speed by quite a lot if you can manage computations such that they become independent of each other. It directly reaps the benefit of the added caches as there will be no migration of data from one cache to another. If the computation can suddenly fit into the caches, then the speed can be much increased. Likewise, the same pattern can emerge if you add more RAM to your cluster. Suddenly the central data can be kept mostly in memory and you don't have to load data off the disk anymore.

    Hiding latency - the secret weapon for fast parallel computation

    If there is a single trick resulting in fast parallel speedups, it has to be latency hiding. Note that problems come in three varieties: the simple, the impossible and the interesting.

    The simple problems are the embarrassingly parallel ones. They are simple to solve: split them into equal-sized chunks, solve each part, and then reintegrate the individual solutions into a final result. The Map-Reduce idea lends itself directly to this kind of solution; if your problem is in the simple class, Map-Reduce solves it. The impossible problems are the class of problems for which you have little to no hope of speeding up the computation by adding more cores. A compiler, for instance, can compile multiple modules at once, but the steps of lexing, parsing, code generation, optimization and so on have to happen in that order, in a serial fashion - a large span dominates. Some numerical iterative computations require step N to be fully computed before they can tackle step N+1. These problems are so data-dependent that you have no hope of making those parts parallel. Essentially, the span of each iteration is chained to the next in a way that makes it impossible to speed up.
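    For the simple class, the whole trick fits in a few lines. A hedged Erlang sketch (pmap/2 is my own helper, not a standard library function): spawn a process per element and collect the results in order.

     pmap(F, L) ->
         Parent = self(),
         Refs = [begin
                     Ref = make_ref(),
                     spawn_link(fun() -> Parent ! {Ref, F(X)} end),
                     Ref
                 end || X <- L],
         %% Collect in the original order; each worker tags its answer with its Ref.
         [receive {Ref, Res} -> Res end || Ref <- Refs].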

    This leaves us with the interesting class of problems. In order to compute these, you need to have some amount of sharing/dependence in between the processes. And this is where the latency hiding trick comes into play.

    When one process sends data to another process, there is latency involved. We must copy data into another mailbox, or perhaps copy it to another computer. The bandwidth of the network and the size of the data we have to copy put a lower bound on the latency we can expect. Now, the naive solution is to repeat a two-step process: compute this iteration, then exchange the results with the other processes. But every time we exchange data, the processes sit idle, waiting. If communication time dominates, our speedup will suffer.

    The latency hiding idea almost doesn't warrant a name: "do something else while the data transfers". That's it. Just as an operating system does something else while a process waits for disk IO to complete, a process can do something else while waiting for data to transfer.

    To Erlang programmers this means that the order in which you send and receive matters. If you reorder the receives such that the communication overhead is hidden, then the relevant message will often already be sitting in the mailbox when your program needs it. The rule is: send early and receive late. As soon as you have data another process will need, make it available to that process. And when you need data, fetch it as late as possible, just before you use it. That way, it may already be there in the mailbox, ready for you to use.
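    A hypothetical sketch of the rule (heavy_local_work/1 and combine/2 are invented helpers): send our chunk to the neighbour first, overlap the transfer with local work, and only then wait for the neighbour's data.

     exchange(Neighbour, MyChunk) ->
         Neighbour ! {chunk, self(), MyChunk},    % send early
         Local = heavy_local_work(MyChunk),       % overlap work with the transfer
         receive                                  % receive late
             {chunk, Neighbour, TheirChunk} ->
                 combine(Local, TheirChunk)
         end.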

    The conclusion is that one must take the impact of latencies into consideration. In high performance computing it is of primary importance. A worthy bet is that on modern computers, moving data around is often the bottleneck, not the raw speed of the CPU core. In a distributed system, the latencies may be much higher due to network copying - with a correspondingly higher impact on speed. The CPU cache also plays a crucial role in limiting the speed at which your computations will run.

    Conclusion
    We have given a brief overview of some parallel computation basics. We have argued that concurrency and parallelism are different beasts, and that Erlang is not particularly parallel. We have suggested what one can do to increase parallelism in Erlang programs, hit on a small number of interesting gotchas in parallel computation, and discussed the concept of latency hiding.

    There is much more to be said about Erlang and parallel computation. I particularly like that Erlang does not assume a cache-coherent architecture for instance. But that will be a post for another time since I am all out of blog space and this is already too large.

  7. Google released Google+. Everyone has now blogged about whether it will "rank or tank", but nobody has really looked into the concept of circles from a more technical viewpoint to my knowledge. Let us remedy that.

    Online communication services can all be classified along different aspects. A chat IM message, for example, is usually one-to-one realtime communication; there is an assumption that I will answer a chat message rather quickly, even when I don't have to. IRC - Internet Relay Chat - is different: communication is topical, as I join a channel discussing a certain topic. Communication is many-to-many and there is no assumption that I answer a message right away. Likewise for email, but there the exchange of messages happens at a slower pace and with larger bodies of text. The reason you use more than one service is essentially that different communication services cater to different kinds of communication. The collaborative benefits you gain when developing source code over IRC are not currently paralleled anywhere else, for instance - not as a free, open project at least.

    Which kinds of communication services disrupt your flow of thought by alerting you and requiring you to answer (phones, I am looking at you!)? How realtime/instant is a message? Can you post additional content like photos, images and URLs? Can you split communication into topics of interest? Are topics invite-only or free-to-join by default? How is moderation done on topics? Is it easy to form private groups to discuss something? Is the communication one-to-one, one-to-many, or perhaps even many-to-many? Can you build robots to automate tasks for the service? And so, in the light of Google+:

    Google+'s major point is to work like Facebook does at its core. This means that I am not the primary customer, as I don't use Facebook much; hence, I am probably the wrong guy to judge if it will work out or not. By the core of Facebook, I mean that the primary goals of G+ and FB overlap very much. In the long run, the goal is to form a relation graph between people and give Google access to it.

    The major, essential difference between Google+ and Facebook is the formation of connections. In Facebook, A adds B and then B confirms the relationship. The opposite of this is Twitter: you just follow people you are interested in. Facebook runs a 2-way handshake whereas Twitter's is one-way. Since connections in Facebook are 2-way, we trivially have: if A is connected to B, then B is also connected to A. In other words, the connection graph in Facebook is undirected. Contrast with Twitter: graphs in Twitter are directed, and there are numerous one-way connections. Most notable are celebrity figures with an extremely high number of followers. The communication from such an account is essentially 1-to-many and public.

    So what does Google+ do? It chooses a variant of Twitter. In G+, the graph is directed like in Twitter, but you also label the edges with a name -- your circle designation for the connection. You can have multiple edges to the same person/account, so in reality the graph is a multigraph where edges are discriminated by their circle labels. Another representation is to say that a graph edge is labeled with a set of circle labels.

    When A adds B to his circles, A is adding an edge in the graph from A to B for each circle. So A builds a one-way path to the other end. If B later decides to add A to his circles, then there will be edges in the opposite direction, and they need not be labeled the same at all. This means that A can have B in the circle of close friends while B has A in the circle ObnoxiousCatPhotoSubmitters. This is an interesting skew of connections which comes into play. When you post something to a circle, the recipients do not see the message as originating from that circle - in fact, circles are private to an account and can't be seen by others. This in effect makes a message a low-level value which is passed from a sharer to a list of people the sharer selects via circles. Upon receipt, the message is then cast into the light of the recipient's circles.

    So forming connections is like in Twitter, only edges are labeled with circles given by the source. This is a rather genius move. When A shares, he chooses the circles he wants to share the message with. The system then searches all edges labeled with those circles, and posts on those edges only. If A wants to post tech-related stuff, A can do so in the Tech circle so only people who might care are hit by the sharing.
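    To make the graph model concrete, here is a speculative sketch in Erlang (my own illustration; the representation and the names add_to_circle/4 and audience/3 are invented, not how Google implements it): store the multigraph as a map from {From, To} to the set of circle labels From has put To in.

     %% A adds B to one of A's circles: label the directed edge {A, B}.
     add_to_circle(From, To, Circle, Graph) ->
         maps:update_with({From, To},
                          fun(Circles) -> ordsets:add_element(Circle, Circles) end,
                          ordsets:from_list([Circle]),
                          Graph).

     %% Everyone From reaches when sharing to the given circles: follow only
     %% the outgoing edges whose label set intersects the chosen circles.
     audience(From, Circles, Graph) ->
         Chosen = ordsets:from_list(Circles),
         [To || {{F, To}, Labels} <- maps:to_list(Graph),
                F =:= From,
                ordsets:intersection(Chosen, Labels) =/= []].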

    • The sender, A, has complete control of who sees the message. This is what Google refers to as the ability of G+ to accommodate different kinds of privacy levels on messages. Also, it is the sender, A, that defines his or her own levels of privacy with Circles as they are local to an account. It also blurs the definition of "follower", "friend", "connection", "add" and so on. Some edges are labeled as "close friends" and are thus strong edges from a privacy perspective. Other edges are weak in the sense that you have a connection, but there are limits to what you want to share with the other person.
    • When data arrives at B, messages from A gets sorted according to the circles of B. So when B looks at streams from circles in which A is in, he sees the shared information from A.
    • Consequence: A needs to moderate what he sends along the Tech circle to B. If A begins sending cat photos to his tech friends, he will end up in the ObnoxiousCatPosters circle at B - and only that circle. B is free to recategorize him at any time. The current assumption is that relations between people are formed on one primary ground: "He is family", "She and I are always discussing cool graphics design". This may largely be true.
    • One thing I have not figured out yet: suppose I want to follow B for more than one interest, say cats and dogs. I have a circle of Cats and one of Dogs. But when B posts in his public circle for all to see, his posts will end up at my place in both Cats and Dogs. To the best of my knowledge there is currently no way I can split this. My guess is that Google will have to solve it in the longer run. They have several options, among which are statistical analysis of the posts through machine learning, or allowing #tags as in Twitter.
    • Another thing I have not figured out completely yet: I don't think G+ can substitute for the other communication tools I am already using. Since I am not much of a Facebook user I am not the right demographic, but I don't see G+ replacing email, IRC or Twitter. Twitter is the one it has the best chance of replacing, in my opinion. Email and IRC need a more Wave-esque approach to the problem, I am afraid. I would really like to see an "IRC for the masses" and I am not convinced that Huddle is it. Further, the locality of circles to an account makes them distinctly different from IRC channels. On the other hand, if you have a topical connection graph, you can use machine learning to suggest IRC-like topics later on - and it is not that hard to add proper IRC-like group chat features to Google Chat, which is already integrated.
    I find the technical difference mentioned here quite interesting. It is definitely a generalization of the existing models of connection-forming, and it will be interesting to see if it takes off, and what people will get out of using it.

    So in conclusion: G+ is a funny hybrid between the styles of Facebook and Twitter. The circle concept is central to how the connection graph is formed, and thus central to how G+ works. It will be interesting to see if Facebook will have a counter-solution. Currently, FB can create groups of friends and handle these in many of the ways G+ handles circles. We can only hope that this competition from Google will force Facebook to improve their system as well. But they probably can't get rid of the two-way connections easily.

    When people carry out benchmarks of web servers, their reporting of data is often inadequate. Apart from the trouble of following http://www.mnot.net/blog/2011/05/18/http_benchmark_rules and getting that right, I want to focus on the reported data. The main point is this: don't maim the raw data by taking the mean - give us a link to the raw information so we can draw our own conclusions. And don't use benchmarking tools which destroy the raw data after the test.

    Overload:
    The interesting situation is not the standard situation. What you want to know is not that server X is 30% faster than server Y, but rather what happens when the server gets a spike of incoming traffic. That is, the Poisson distribution teams up against you and you hit serious load. In that case, you want the server that can sustain the load rather than the one that degrades in performance.

    It is far better to quickly dismiss some operations as errors, but serve the requests you did accept, and serve them fast. Raw speed matters less: 30 or even 70% is just a constant factor deciding when you need to scale - either a faster machine or a new one next to the computer you already have. Of course, at 150% it begins to matter a lot which is faster, but for most servers the speed is good enough that it doesn't matter. Stability and the overload situation are more important to optimize for.

    Errors and latency:
    In a benchmark, there is a bar which defines what counts as an error. In some benchmarks, the error is a request which never completed. In other tests, it is a request that took too long. There is little reason to accept a request taking 10 seconds, as the poor user will almost always have reloaded or clicked again by then. Modern users are impatient to the point of being painfully so.

    This leads to the idea that we should regard requests which are too slow as errors as well: if a request doesn't complete in under 5 seconds, it is an error. But even at far lower thresholds users may get bored with your site and stop using it. Even a 100ms bump can mean a lot.

    This leads to the number one virtue of the benchmarker: record the latencies of your measurements.


    You should, ideally, be recording each of your 250,000 requests and its individual latency. But this is usually too much for the benchmark tool to handle. Rather, you want the tool to do random sampling and record, say, 2,500 samples and store them in a file. The cool thing about this is that you can hand the data out when you present your findings. It is like cake to a statistician, who will be able to use the raw data.
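    As a hedged sketch of how a tool could keep such a sample without storing every observation, here is reservoir sampling (algorithm R) in Erlang; sample/2 is my own name and not taken from any benchmarking tool:

     %% Keep a uniform random sample of K latencies from a stream of observations.
     sample(K, Latencies) ->
         {Reservoir, _N} =
             lists:foldl(
               fun(L, {Res, N}) when N =< K ->
                       {setelement(N, Res, L), N + 1};        % fill the first K slots
                  (L, {Res, N}) ->
                       case rand:uniform(N) of
                           J when J =< K -> {setelement(J, Res, L), N + 1};
                           _             -> {Res, N + 1}
                       end
               end,
               {erlang:make_tuple(K, undefined), 1},
               Latencies),
         tuple_to_list(Reservoir).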

    Beware the heretic: Average
    Unfortunately, many benchmark tools report statistics rather than raw data. They will give you the minimum, maximum, average, median, stddev and so on. The very first thing you should do when you get raw data is to plot it. You want to see what the data looks like. Just a simple histogram plot can often tell you a lot about the data at hand:
    This looks fairly okay as a normal distribution of data with a mean of 7 and a standard deviation of 2. In fact it is, because it was randomly generated as such. Here is another one:
    And that one is very interesting. It has an odd latency spike around 25 as well! Again, the example is generated from two normal distributions whose observations we have plotted here as a histogram. For the first one it may be ok to use the stddev and average value because they make sense in that case. But for the second it is most definitely not acceptable. We ask R for the summary of the first:


    > summary(a)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     0.1309  5.6200  6.7750  6.8090  7.9960 13.3500

    For a web server, the Min value tells us something about the lower limit. If the server is 30ms away travelling at the speed of light, that is the minimum value we can ever hope for. The Max observation is interesting as well: if it is too high, the user will not stick around and is gone. The mean is the average value of the observations; we'll get back to the dreaded mean. The 1st quartile, median and 3rd quartile are obtained by lining up all the observations in sorted order to get the distribution function:
    Then you pick the observation 25% in from the left of the X-axis as the 1st quartile, the one at 50% as the median, and finally the one at 75% as the 3rd quartile. If there is no middle value (in the case of an even number of observations) you take the mean of the two middle observations. For the first example plotted here, the median and mean are basically the same value. But what about example 2?
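    As an aside before answering that: the quantile procedure just described is easy to compute yourself from raw samples. A minimal Erlang sketch (quantile/2 and median/1 are my own helper names; it uses linear interpolation between the two neighbouring observations, which is one common convention among several):

     quantile(Q, Samples) when Q >= 0, Q =< 1 ->
         Sorted = lists:sort(Samples),
         N = length(Sorted),
         Pos = Q * (N - 1) + 1,                 % 1-based position in the sorted list
         Lo = trunc(Pos),
         Hi = min(Lo + 1, N),
         Frac = Pos - Lo,
         lists:nth(Lo, Sorted) * (1 - Frac) + lists:nth(Hi, Sorted) * Frac.

     median(Samples) -> quantile(0.5, Samples).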


    > summary(example_2)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     0.1309  5.8300  7.1380  9.4180  8.9150 26.1900 

    This looks bad. The median and mean don't line up, which should normally be a warning sign telling you to plot the histogram. Indeed, if you look at the histogram plot above, you will find that the observations around 25 skew the mean too far to the right. A value of 9.4 is not very likely: you are more likely to get either a 7 (around 85% of the time) or a 25 (around 15% of the time). The median tells us that the most likely value is around 7. The distribution plot should tell us something as well:

    A value of 17 is highly unlikely. Yet, if I had constructed more observations in the 25 area, then the mean would have been 17. That is why the mean is a bad indicator for a data set, and it is why we want the raw data. If you report something, make it the median. Or put whiskers on your values so we can see the development as you increase the load on the server:

    Notice how Mark Nottingham adds small vertical bars to each measurement so he can report the range in which most of the responses fall? That is a neat idea!

    Even better is to use boxplots on each observation so you can see how it relates to the mean value:



    My hunch:
    I think there is a gem to be found here, currently hidden from our scrutiny. I hypothesize that some web servers are far better than others at harnessing latency and keeping it stable, whereas others are more likely to skew the results a lot. This tells you a lot about the robustness of a server and is another key factor to measure. I don't care as much about requests per second as I care about this number. Judging servers on requests per second alone is like claiming your network is better because it has more bandwidth, blindly ignoring the question of latency (a rant for another time).

    Face it: if you measured two web servers correctly, you could do the statistics to figure out whether one were significantly (in the statistical sense) better than the other. In Ostinelli's work, http://www.ostinelli.net/a-comparison-between-misultin-mochiweb-cowboy-nodejs-and-tornadoweb/ for instance, we see that Mochiweb is only 66% faster than Node.js, which isn't going to make a dent in practice. The 25% difference reported between Misultin and Cowboy is even less so. But the response time for Node.js is 6.6 times as large as Mochiweb's (around 600ms compared to around 90ms at a load of 10,000). That is a difference you should be worried about. And that is the reason for my plea:

    Give us latency samples when you benchmark web servers!
    (Edit: Smaller changes to sentences in the beginning to get the narrative to flow better)


    Every once in a while, we see the same discussion popping up again: "Should I concentrate on Node.js or Erlang for my needs?" Let me be clear: I think it is the wrong discussion. Fact is, if you have chosen a language which supports a large number of concurrent tasks at the same time, you are probably going to be ready for the future.

    Divination is hard, but I am not sure we are going to see a fast multi-core revolution with thousands of cores. As per Moore's law, if the growing transistor count had kept translating into more cores, we would have had 2 cores in 2006 (Intel Core 2 Duo), 4 (physical) cores in 2008 (Core i7, also from Intel) and 8 physical cores in 2010. Where is my 8-core laptop today? Even Sandy Bridge is not up there. I have a hunch a couple of things are happening: the market has seen a shift to low-power devices, which matter more to the consumer, most notably through the ARM incursion. People can't utilize a machine with 8 cores since their programs are not written to take advantage of several cores. Finally, it is cheaper to buy infrastructure on-demand at places like Heroku, Google, Amazon, Microsoft, Rackspace and so on.

    But if parallelism is not coming, distribution and concurrency are! Modern programs have to live on several devices at the same time. You check your mail on the mobile phone, on the tablet, on your laptop and on the workstation, if you still use one. Likewise, a modern program will probably live on multiple devices at once, and to harness that, you need efficient distribution. Enter new languages with support for it.

    And this is why the battle must stop. Like the pieces in a game of chess, the point of the languages is to be different. Haskell, Google Go, Erlang and Node.js all come with a given set of features for handling the multi-core problem. But like in chess, the pieces are different and built for different purposes. And here we are only looking at current implementations. Theoretically, the ML family - Standard ML, OCaml and F# - is really well suited for the concurrency and distribution revolution as well, even though the implementations are lacking in robustness.

    The enemy of the state is a large set of old ways of programming. It is not so much the language chosen as it is the ideology. You can't rely on shared information when the RAM is physically split between two devices (and you have no fast, reliable link in between). You can't rely on OOP in the Java sense to save you. And you usually can't get consistency across the distributed parts due to unreliability.

    Rather, see the different languages as stakes in the future of computing. They are different for a reason and they gamble by being different. Personally, I like Erlang because it has what I see as a good view of the problem at hand, and a novel way of solving it. Yes, it can be made better in many ways, but I think Erlang sits at the right point on the "Worse is Better" scale. It actually solves a lot of my real-world gripes.

    Node.js plays different cards in the game. I really like the mix of continuations, lack of blocking, V8 and Javascript on the server side. Go plays it differently yet, taking a much more low-level approach to the problem at hand. Go feels like an updated C with channels, and it has structural subtyping to boot. Haskell is just awesome because the language needed very little change to support concurrency. The semantics of Haskell are on a different level than most others, and with Haskell it is the fabric of computation itself you are shaping to your desire.

    In the endgame of chess, pieces like the knight become less effective, since their cunning ability to jump over other pieces is not as valuable on an open board. Time will tell which piece was the knight. But if you did not bet on every approach, you would not know whether you had merely hit a local optimum.

  10. 4 Functions and one view

    Let us define an Agda module:

    >    module Partition where
    > 
    >    open import Data.Bool
    >    open import Data.Nat
    >    open import Data.Product
    

    In Agda, we can define the option-type as the following hybrid between Haskell and ML:

    >    data Option (A : Set) : Set where
    >      Nothing : Option A
    >      Just : (o : A) -> Option A
    

    That is, an option type is formed either by Nothing or by Just o for some element o of the type A. Likewise, we can define lists as:

    >    data List (A : Set) : Set where
    >      [] : List A
    >      _∷_ : (x : A) -> (xs : List A) -> List A
    

    That is, lists are either the empty list [] or they are formed by the cons operator, where we have an element x of type A, the head, and a tail list, xs. Together these form a list.

    We now define a really simple function on lists. We will ask if there is an element of the list that satisfies a predicate p. That is, we ask “Does there exist an element satisfying the predicate?”

    >    exists : {A : Set} -> (A -> Bool) -> List A -> Bool
    >    exists p [] = false
    >    exists p (x ∷ xs) with p x
    >    ... | true = true
    >    ... | false = exists p xs
    

    The function makes use of Agda's with construct. with p x can in this case be seen as a case p x of, where the two possible variants are then handled to the right of the pipe bar |. The three dots ... signify that nothing else changes in the parameters. Generally, the match in Agda is dependent, so the output of p x could have constrained the other parameters (that is, the other parameters may depend on the output of the match).

    This function is very simple, but it does not yield us a lot of information, should we want to know which element actually existed. Essentially, we have become blind to some information. On the other hand, if we are only interested in the question of existence, then the function will not produce additional information that we later need to throw away.

    Let us add a remedy to our alchemical formulae: We present the find function:

    >    find : {A : Set} -> (A -> Bool) -> List A -> Option A
    >    find f [] = Nothing
    >    find f (x ∷ xs) with f x
    >    ... | true = Just x
    >    ... | false = find f xs
    

    This guy is exists one step up. Here, when our scrutineer finds a matching element, we return the element itself. Thus, if we need the element for further work, we have it at our disposal. In practice, it may seem more expensive than exists, but it is not: first, the x we return is a reference, so it has a fixed size. Furthermore, if we don’t use x, then the compiler is free to declare all code involved in producing x as dead and remove it. Essentially, a compiler like Haskell’s GHC will probably recover the exists version automatically by optimization.

    But the sexiness of this function is still pretty low. We don’t know if there is only a single element matching, or if there are more elements matching. We add another ingredient, and present to you the filter function:

    >    filter : {A : Set} -> (A -> Bool) -> List A -> List A
    >    filter f [] = []
    >    filter f (x ∷ xs) with f x
    >    ... | true = x ∷ filter f xs
    >    ... | false = filter f xs
    

    Now, when the function returns, we get back a list of all the elements matching the predicate. Thus, we can just pick off the head, and in a lazy language we would probably not do more work on the rest of the list if we chose to stop there. It is yet another generalization in our stack. It provides a view with a lot more information than before. We can, for instance, easily detect the case where there are exactly 3 elements satisfying the predicate by pattern matching, and then handle that case specially, should we want to do so.

    But we are not done yet. Everything was merely foreplay. Add the next level of abstraction and generality:

    >    partition : {A : Set} -> (A -> Bool) -> List A
    >              -> List A × List A
    >    partition d xs = part d xs [] []
    >      where
    >        part : ∀ {A : Set} -> (A -> Bool)
    >              -> List A -> List A -> List A
    >              -> List A × List A
    >        part d [] yes no = yes , no
    >        part d (x ∷ xs') yes no with d x
    >        ... | true  = part d xs' (x ∷ yes) no
    >        ... | false = part d xs' yes (x ∷ no)
    

    Partition uses a discriminator function, d, which splits the list into the yes’s — those matching the predicate — and the no’s, those that do not. It is interesting that we can easily walk back down the chain:

    >    filter₁ : {A : Set} -> (A -> Bool) -> List A -> List A
    >    filter₁ p xs = proj₁ (partition p xs)
    > 
    >    find₁ : {A : Set} -> (A -> Bool) -> List A -> Option A
    >    find₁ p xs with filter₁ p xs
    >    ... | [] = Nothing
    >    ... | x ∷ xs' = Just x
    > 
    >    exists₁ : {A : Set} -> (A -> Bool) -> List A -> Bool
    >    exists₁ p xs with find₁ p xs
    >    ... | Nothing = false
    >    ... | Just _  = true
    

    Which shows how partition is an aggressive generalization of exists.

    Essentially, this is a question of how much information you have when using the different functions. With partition, you have the most information: you know exactly which elements matched, and which did not. Again, with enough optimization, it should be possible for a compiler to derive the faster versions. The view this function provides happens to be richer than the views the other functions provide. It is, in some sense, the correct implementation.

    In Agda, views mean something more specific. A view there is a datatype whose sole purpose is to be evidence of some observation. If we compute some outcome, we shove the resulting information into a data type, the view, and then we can later ask the question “Why?” by matching on that type. The Agda tutorials have more details, but the general idea of using a data structure to carry information is what motivated this blog post initially. I also find the relationship between the four functions interesting by itself.

    You may wonder: why do we even bother to have the other definitions around? Mostly, it has to do with optimization hints to the compiler. An implementation like find tells the compiler that we are only interested in a single element, the first match, and that it can throw away other parts of the computation. In other words, it can be seen as a guide to the compiler about our current intent. It is also an intent-hint to the human reader: we make it explicit that we don’t need certain parts of the computation, so the reader is free to ignore those.

    In some languages, the only discriminator is the boolean test. So you are essentially forced into a model where either you only have exists, or you write a specialized version of the above whenever you need access to the elements. If your language misses algebraic data types, or has no construct which can easily stand in for them, then you are doomed to reimplement lots of stuff all the time. Your modularity suffers deeply as a result.

