1. This post serves several purposes. First, we bring an update on the code state: The client is now able to leech, that is download, in a lab setting. The lab consist of a test tracker, an rtorrent client two computers. The tesfile is file of 5000000 bytes containing a specific phrase over and over. This gives good compression ratio in git and thus it is not that irritating to move over the wire if needed.

    Alex Mason did a good job improving several things in the project. He added prioritization to logging, got compilation working on GHC 6.12.x and is now looking at improving every aspect of data parsing. The BCode work, able to handle the bittorrent bencoded data using applicative functors (Data.Applicative) look really good.

    The latter part serves two purposes in parallel: It describes the used idioms and it describes the piece manager code used for leeching torrents. This is a work in progress and there are some things we don’t handle yet.

    Also, I am trying to use John MacFarlane’s excellent pandoc package for typesetting this in blogger.

    The Piecemanager

    The Piece Manager is defined in the module PieceMgrP. It is responsible for keeping track of the torrents internal Piece State. That is, what do we need to download, and what have we already downloaded and can serve to other peers. Remember that in our terminology, a Block is a subset of a piece, given by an offset and a length. You could call it a slice of a piece.

    The basic idiom of the piecemanager is that of an owning process. In traditional semaphore-oriented programming we protect a data structure by a mutex. We could also protect it software transactions, but since we partially derived haskell-torrent from etorrent, we’ll go with the message passing method. We simple invent a process to manage the data structure. Operations on the structure is then serialized by passing them to the process and gettings answers back, RPC-style. It might not be parallel, but it certainly is easy.

    The data structure we want to protect is the piece database:

    > data PieceDB = PieceDB
    >    { pendingPieces :: [PieceNum]
    >    , donePiece     :: [PieceNum]
    >    , inProgress    :: M.Map PieceNum InProgressPiece
    >    , infoMap       :: PieceMap
    >    }
    

    The database contains a list of pieces which are pending for download. This list should, in a perfect world, be maintained with a histogram such that we know the rarity of each piece. A good client prefers rare pieces to young pieces and selects randomly among pieces of the same rarity. A weaker client picks randomly without taking their rarity into consideration. A dumb client, like haskell-torrent currently, just picks them off from a single end. This is bad for the torrent cloud to do, but it is a start. If someone comes up with a data structure which is (practically) efficient for storing histograms, I would be happy to hear about it.

    The donePiece record field is the list of pieces that are done. We keep this around because when a new peer is connected to the client then we need to tell this peer about what pieces we have fully downloaded.

    Then we have Data.Map Map which tells us something about Pieces that are in progress. The InProgress data type has the following structure:

    > data InProgressPiece = InProgressPiece
    >    { ipDone  :: Int
    >    , ipSize  :: Int
    >    , ipHaveBlocks :: S.Set Block
    >    , ipPendingBlocks :: [Block]
    >    } deriving Show
    

    These fields are (in order) how many blocks in the piece we have downloaded, the complete size of the piece, a set of the blocks we have downloaded and a list of blocks pending download. The size of the piece is almost always the same, but the last piece is different if the complete file is not a multiple of the block size.

    Returning to the PieceDB, the last entry describes the complete torrent. The PieceMap tells, for each piece, its offset in the file, its length and its SHA1 digest. Note we do not support multi-file torrents yet, although this code would probably be unaltered. The offset is in the multi-file-case the offset in the concatenation of the files in the torrent.

    Starting the process

    The PieceManager process is started with the start call:

    > start :: LogChannel -> PieceMgrChannel -> FSPChannel -> PieceDB -> IO ()
    > start logC mgrC fspC db = (spawn $ lp db) >> return ()
    >   where lp db = do
    >           msg <- sync $ receive mgrC (const True)
    >           case msg of
    

    We supply a number of CML-channels to the process from the outside and then we spawn off the main loop before returning (). This is probably not good in the long run, where we will need to handle errors in the process. But for now we accept that the code is supposedly error-free and never have any problems.

    The loop code itself synchronizes on messages here named msg. These messages have the form

    > data PieceMgrMsg = GrabBlocks Int [PieceNum] (Channel [(PieceNum, [Block])])
    >                  | StoreBlock PieceNum Block B.ByteString
    >                  | PutbackBlocks [(PieceNum, Block)]
    >                  | GetDone (Channel [PieceNum])
    

    and each of these are going to be explained in the following. A general note is that if the message contains a channel, it is usually a form of RPC-message, where the channel is the return channel on which to send back an answer. One could probably factor this out into a generic RPC-construction with a bit of cleverness, but I have not given it much thought yet.

    In the following, we present some examples of processing these messages.

    Grabbing blocks

    Here is the fragment for grabbing blocks. It is one of the paths in the diagram above.

    > GrabBlocks n eligible c ->
    >    do logMsg logC $ "Grabbing blocks"
    >       let (blocks, db') = grabBlocks' n eligible db
    >       logMsg logC $ "Grabbed..."
    >       sync $ transmit c blocks
    >       lp db'
    

    Basically, this call is a request by a peer process to get exclusive access to n blocks among all the available blocks for a while. This ensures that only this peer will download the blocks, eliminating block collisions. The eligible value is a list of pieces that the peer has already downloaded, and we should of course only choose blocks among those. Our block grabber may honor the request or return an empty list if no block can be satisfied.

    The guts is hidden in the call to the helper grabBlocks’, which we will describe later.

    Storing blocks

    The other path in the diagram is maintained by this code fragment:

    >  StoreBlock pn blk d ->
    >      do FSP.storeBlock fspC pn blk d
    >         let (done, db') = updateProgress db pn blk
    >         if done
    >            then do assertPieceComplete db pn logC
    >                    pieceOk <- FSP.checkPiece fspC pn
    >                    let db'' =
    >                      case pieceOk of
    >                        Nothing ->
    >                          error "PieceMgrP: Piece Nonexisting!"
    >                        Just True -> completePiece db' pn
    >                        Just False -> putBackPiece db' pn
    >                    lp db''
    >             else lp db'
    

    We get a request to store the block blk in piece pn where d is the data we want to store as a ByteString. We invoke the filesystem to actually store the block. In a real world, we would check that this is really what we wanted. If the piece is already complete and checked, we don’t want a stray block to errornously destroy the piece. In general we want more checks like these in the client.

    Then we update the database with the progress on the piece. If the piece is done, we invoke a checking of that piece. Either it is complete and Ok, in which case we mark it as such in the database — or it is not Ok, in which case we put it back for downloading. This happens in the real world at times due to clients with bugs so it is always important not to trust the other client. If the piece completes, we should send out HAVE messages to all connected peers. I plan to make the Peer Manager do that, but the code has not yet been implemented for that.

    The RPC idiom

    When we initially connect to a peer, we will have to transfer the pieces we have. To do this, we construct a BITFIELD message and to construct this, we need the set of pieces which are complete. The GetDone message handles this:

    > GetDone c -> do sync $ transmit c (donePiece db)
    >                 lp db
    

    and the peer which want this calls the function

    > getPieceDone :: PieceMgrChannel -> IO [PieceNum]
    > getPieceDone ch = do
    >  c <- channel
    >  sync $ transmit ch $ GetDone c
    >  sync $ receive c (const True)
    

    I think there is an idiom to be extracted from this code and as it is used quite a lot it would be very beneficial to have in the long run.

    Grabbing blocks, the inner workings

    The diagram above hints that grabbing blocks has a more complicated control flow. The idea of the code is that the control flow is modeled by a tail call into the next box. Let us break up the code:

    > grabBlocks' :: Int -> [PieceNum] -> PieceDB
    >             -> ([(PieceNum, [Block])], PieceDB)
    > grabBlocks' k eligible db = tryGrabProgress k eligible db []
    >   where
    

    So we will try to grab k blocks from the eligible pieces from db. The empty list is the accumulator in which we add the blocks we found as we go.

    >    tryGrabProgress 0 _  db captured = (captured, db) -- Enough, exit
    >    tryGrabProgress n ps db captured =
    >        case ps `intersect` (fmap fst $ M.toList (inProgress db)) of
    >          []  -> tryGrabPending n ps db captured
    >          (h:_) -> grabFromProgress n ps h db captured
    

    Grabbing from the pieces already in progress minimizes the number of “open” pieces. We want to minimize this as a complete piece can be checked for correctness. Correct pieces can then be shared to others and since our download speed is dependent on our upload speed, complete pieces is key. Finding a piece is simply to carry out set intersection. If you’ve read to this point, there is a refactoring opportunity by using M.keys.

    A smarter client, as hinted, would order the eligible pieces such that the rarest pieces were tried first. One could also toy with the idea of pruning: tell the peer what pieces we already have downloaded so it won’t include them in further requests. This would keep the ps parameter small in size.

    There are three exit-points. Either there was a piece, h, we can grab from that one, or there were none among the pieces in progress: we will then seek the list of pending pieces. Finally, if we are requested to grab 0 blocks, we already have enough blocks and can return those we have together with the new database.

    >   grabFromProgress n ps p db captured =
    >       let ipp = fromJust $ M.lookup p (inProgress db)
    >           (grabbed, rest) = splitAt n (ipPendingBlocks ipp)
    >           nIpp = ipp { ipPendingBlocks = rest }
    >           nDb  = db { inProgress = M.insert p nIpp (inProgress db) }
    >       in
    >           if grabbed == []
    >            then tryGrabProgress n (ps \\ [p]) db captured
    >            else tryGrabProgress (n - length grabbed) ps nDb ((p, grabbed) : captured)
    

    Here, we have found the piece p as being in progress. So we find its InProgress structure, and use Data.List.splitAt to cut off as many blocks as possible. We may find that we can’t grab any blocks. This happens when we or other peers are already downloading all pieces. We then prune the piece p from the set of eligible pieces and try again. Otherwise, we update the database and the number of pieces to grab and go back to start. Another hint: We should probably prune p from ps here as well :)

    >   tryGrabPending n ps db captured =
    >       case ps `intersect` (pendingPieces db) of
    >         []    -> (captured, db) -- No (more) pieces to download, return
    >         (h:_) ->
    >             let blockList = createBlock h db
    >                 ipp = InProgressPiece 0 bSz S.empty blockList
    >                 bSz = len $ fromJust $ M.lookup n (infoMap db)
    >                 nDb = db { pendingPieces = (pendingPieces db) \\ [h],
    >                            inProgress    = M.insert h ipp (inProgress db) }
    >             in tryGrabProgress n ps nDb captured
    

    Finally, this part grabs from the pending pieces. Either we can’t find a piece and then we can just exit. In the other case we have found a piece. We then create the blocks of the piece and insert it into the pieces in progress. The tail-call to tryGrabProgress will then find it.

    I hope the similarity to the diagram is clear. When building these tail-call mazes I usually start with the diagram sketch. Then I hand-write them in Graphviz’s dot-notation and create them. Finally I write the code.

    4

    View comments

  2. On Peer Processes
    The diagram above is the control-flow in the Peer Process as it stands right now. This process is responsible for communicating with another Peer in haskell-bittorrent. This post is not about the details of the inner workings. Rather it serves to document the legend of the diagram.
    All the ellipses that are blue signifies other processes. The inbound process waits on the socket for incoming messages and sends them into the Peer process. The outbound process is a queue which dequeues messages as fast as possible to the peer in the other end. Peer Manager is the process which manages a set of peers -- we have one Peer process for each peer. Finally, Piece Manager is responsible for managing the state of the torrent: what subsets of pieces do we have and what is missing before we have the full file?
    A little bit of terminology: a piece is the smallest array of bytes we can SHA1 check for correctness. A block is a subset of a piece which is transferred by the wire protocol. In practice these are always 16K although the protocol provisions for alternative sizes. The reason for this split is that the 16K blocks means we can interleave other messages in between. Further, we can get a hunch on how fast we transfer to a peer by looking at the amount of bytes we can dequeue and smaller blocks means a better granularity.
    The diagram depicts the control flow of the client. Black arrows designate how control transfers from one function to the next (usually by tail-calling this is functional programming after all). The blue dotted arrows specifies synchronization points with external processes. If a blue dotted arrow enters a box it usually means that control can only transfer to this point if the corresponding event tied to the arrow fires. The Piece Mgr and "Grab up to ..." is the exception from the rule however: when arrows are back and forth, it is a rpc call.
    If a blue dotted arrow leaves towards an external process it means we will wait for a synchronization towards that process before continuing out of the box.
    The current state
    Modulo some work in the debugging and QA department, the client should be able to leech in a laboratory setting. We cheat by having no Peer Manager yet and we can essentially not really support many Peers properly yet. When the client portion stabilizes it should be fairly easy to add proper Peer managent. I hope to support leeching in the coming days.
    Wanna help?
    If you want to help out, there is plenty to do. I am still, deliberately, leaving many things open so it should be fairly easy to pick up the code from github and then do some of the simpler stuff. I've not run the excellent hlint by Neil Mitchell - so there should be an opportunity there. There is also an amortized queue implementation which could do with some documentation. Finally, the TODO list is littered with what needs to be done.
    On github: haskell-torrent.
    2

    View comments

  3. I have been writing code lately on a BitTorrent client in Haskell. Here is the grand old history of that effort. Many years ago, I attempted to start a client, Conjure, written in Haskell. Sadly, it really did not make it to the point where it was really usable, for me at least. Then a couple of years passed, I became involved with programming in Erlang. Erlang has a really nice Actor-based concurrency model. Since BitTorrent is highly concurrent and event-driven it was easy to start another implementation effort in Erlang. Thus etorrent was born. Sadly, I lost interest in etorrent - partially because I am partial to statically typed languages of which Erlang is not one. Thus, we come to this december, where I decided to figure out the state of concurrency in Haskell. To really give GHC a whirl, we need to implement something which is fairly complex. Some people implement ray tracers when toying with programming languages, some solve projecteuler problems. I ... write BitTorrent clients.
    BitTorrent is a good maturity check. First, you need decent Disk I/O, decent network I/O and a fairly complete HTTP client. Second, the protocols are easy, but still requires some nontrivial parsing to occur. Finally, the system is fairly concurrent, so it fits nicely into concurrency models if the language has one.
    Initial Thougths:
    Initally, there is the choice of a concurrency framework. In Haskell, there are several such. The built-in Control.Concurrent is fairly low level and fairly simple. At the CS department at Copenhagen University, DIKU, there is currently a course running in CSP and some students decided to use Control.Concurrent.CHP by Neil Brown. I think it will suit them well for the course, but it irritated me that it mapped so weakly to the etorrent model. So Control.Concurrent.CML became the basic library for concurrency. This concurrency model is not entirely unfamiliar to me: it is written by John Reppy for ML.
    The CML model is somewhat reminiscent of the Pi-calculus. In particular, we can transmit channels on channels. This gives a neat method for doing RPC-style operations. Send off a message together with a responder channel and then wait for a response on that channel. This means we are easily able to emulate most of what the Actor model has if needed.
    Haskell-torrent design:
    Haskell-torrent uses a basic design, where a fairly large number of processes communicate with each other to solve the problem of uploading and downloading files. We have a limitation of one single torrent right now, but that should be fixable in time. The process diagram at the header describes the basic layout. Each box is a single process, apart from the "Peer" box, which is multiple processes. The diamond-shapes describes external communication over the network.
    Console: A process responsible for User communication. The idea is to have a simple console-interface on which to log messages (it is heavily used for debugging). The Console can in time also be used for parsing user input and affecting the system. It only responds to a "quit" command at the moment. Logging is done with a LogChannel on which the Console process syncs and prints out information.
    Main: Initial program entry point. Spawns everything else and is the last thing that closes again.
    Status: Keeps track of various torrent-specific statuses. How much is left to download? How many seeders are there? How fast is the torrent currently? The intention is to keep this kind of information around in that process.
    Tracker: Communicates with the tracker. Is a specialized HTTP client. It talks to the Timer so it won't continually request the tracker for information. In due time, it should also be able to do UDP tracking. It is using bencoded documents as per the tracker spec. These bcoded documents are handled as in Conjure with the monadic parser generator Parsec.
    PeerMgr: Handles a set of peers. Gets lists of other peers via the Tracker process and spawns a pool of them. In due time, this process will handle choking/unchoking of peers so we don't clog the TCP/IP stack by trying to talk to everybody at once.
    Listener: Listens on a TCP port and spawns off Peers when they try to connect. Not implemented at all yet, but is fairly easy to get to work.
    OMBox: A very simple implementation on top of CML of what is present in Control.Concurrent.SampleVar. This will let the PeerMgr be able to place things for the Peer to read later and it avoids a deadlock between those two which otherwise would occur. OMBox is currently not cleaned up properly. We need some mechanism for poisoning the communication channel so it closes down again gracefully.
    Peer: The system contains zero or more peers. Each peer is really four processes. One controls the state of the peer and reacts on the events from the peer. Two are responsible for sending messages to the peer: one as a queue, and one blocks on the socket. Finally, a process receives messages on the socket and then sends them to the controller, blocking in the process.
    PieceMgr: Keeps track of the current piece-state. It knows what pieces we have, what we are missing and what to request next. Peers will be using it for asking what to download next.
    FS: Abstracts away the file system for the rest of the client. Is implemented as a communication process which peers ask for blocks to send.
    Current State:
    Currently, the client is pretty flaky. It is able to seed in a controlled lab-setting, but needs much work to be able to leech data from other clients. The TODO-list is fairly long and some of the things on it are pretty hard. But there are also many things which are fairly simple to fix. Feel free to grab anything and help out if it sounds interesting. I am deliberately keeping things simple and leaving doors open. It is my hope that even aspiring Haskell programmers might stand a chance at solving some of the things on the TODO list. I'll happily take any help I can get as in the long run, this is fairly hard to carry out as a single person. The code is on github in my haskell-torrent repository if this should indulge you.
    10

    View comments

  4. The Bittorrent Wire protocol is a protocol used for communication between peers in a bittorrent cloud. The protocol is explicitly defined to balance between conserving bandwidth and being easy to parse. Hence, it is a binary protocol, but it is fairly easy to parse.

    When I wrote etorrent as an experiment, I was faced with the problem of parsing this protocol. In Erlang the task is not hard however, since the language can pattern match on bit-patterns. I began by defining a set of constant values, each corresponding to a message type:

    %% Packet types
    -define(CHOKE, 0:8).
    -define(UNCHOKE, 1:8).
    -define(INTERESTED, 2:8).
    -define(NOT_INTERESTED, 3:8).
    -define(HAVE, 4:8).
    -define(BITFIELD, 5:8).
    -define(REQUEST, 6:8).
    -define(PIECE, 7:8).
    -define(CANCEL, 8:8).
    -define(PORT, 9:8).
    

    The designation 1:8 means that we need to represent the integer value of 1 in 8 bits of information (i.e., in a byte). With these values down, defining a decoder in Erlang is pretty straightforward:

    
    %%--------------------------------------------------------------------
    %% Function: recv_message(Message) -> keep_alive | choke | unchoke |
    %%   interested | not_interested | {have, integer()} | ...
    %% Description: Receive a message from a peer and decode it
    %%--------------------------------------------------------------------
    recv_message(Rate, Message) ->
        MSize = size(Message),
        Decoded =
        case Message of
            <<>> ->
                keep_alive;
            <<?CHOKE>> ->
                choke;
            <<?UNCHOKE>> ->
                unchoke;
            <<?INTERESTED>> ->
                interested;
            <<?NOT_INTERESTED>> ->
                not_interested;
            <<?HAVE, PieceNum:32/big>> ->
                {have, PieceNum};
            <<?BITFIELD, BitField/binary>> ->
                {bitfield, BitField};
            <<?REQUEST, Index:32/big, Begin:32/big, Len:32/big>> ->
                {request, Index, Begin, Len};
            <<?PIECE, Index:32/big, Begin:32/big, Data/binary>> ->
                {piece, Index, Begin, Data};
            <<?CANCEL, Index:32/big, Begin:32/big, Len:32/big>> ->
                {cancel, Index, Begin, Len};
            <<?PORT, Port:16/big>> ->
                {port, Port};
        end,
    {Decoded, etorrent_rate:update(Rate, MSize), MSize}.
    
    This is pretty straightforward idiomatic Erlang as one would write

    it. We convert the numbers into tuples or atoms. The idea is that the atom or the first element of the tuple identifies the type of the message that was sent. Note that in idiomatic Erlang code, the process will crash if the pattern match fails. In that case, we have to handle that problem in another janitorial process.

    I toyed with the idea of doing the same parsing in Haskell. Initially, my thought was that it would be rather hard to beat the Erlang code in size. Looking at the above code, it is almost the specification of the protocol and we can't get rid of the specification.

    In Haskell, I started with the default construction of an algebraic datatype. It simply reflects the tuple/atom construction of Erlang:

    type BitField = B.ByteString
    type PieceNum = Integer
    type PieceOffset = Integer
    type PieceLength = Integer
    data Message = KeepAlive
            | Choke
            | Unchoke
            | Interested
            | NotInterested
            | Have PieceNum
            | BitField BitField
            | Request PieceNum PieceOffset PieceLength
            | Piece PieceNum PieceOffset B.ByteString
            | Cancel PieceNum PieceOffset PieceLength
            | Port Integer
    deriving (Eq, Show)
    

    If it seems more verbose than the initial Erlang counterpart, then it fools you. In the above, we also encode the types of the messages - but this is the only place where we would have to designate the types in the whole program. Type reconstruction by inference will figure out the details for us.

    I expect my data to be presented to me via lazy bytestrings (see Data.ByteString.Lazy). It also turns out that George Giorgidze built the HCodecs package. This package provides decoders for MIDI, Wave and SoundFont2 files. But the real gem of the package is the development of binary parsers for ByteStrings. This turns out to be exactly what we would like to process.

    The parser is a monad. So we can use do-notation to decode the messages. Here is an example of decoding the PIECE-message:

    7 -> do pn <- fromIntegral <$> getWord32be
       os <- fromIntegral <$> getWord32be
       c <- getRemainingLazyByteString         return $ Piece pn p c 

    This looks nice, but the <$> from Control.Applicative hints somewhat at where this will go. Note that we are not using the full power of the monad. In particular, we don't use earlier results from the monad to decide what to do next. Since there is no dependency chain, it means we can just utilize that every monad is also an applicative functor. In particular we can define

    7 -> Piece    <$> gw32 <*> gw32 <*> gw32 <*> getRemainingLazyByteString
    ...
    where gw32 = fromIntegral <$> getWord32be
    

    using the applicative functor to solve the game. With this down, the rest of the parser is easy:

    decodeMsg :: Parser Message
    decodeMsg =
    do m <- getWord8        case m of          0 -> return Choke
        1 -> return Unchoke
        2 -> return Interested
        3 -> return NotInterested
        4 -> Have     <$> gw32
        5 -> BitField <$> getRemainingLazyByteString
        6 -> Request  <$> gw32 <*> gw32 <*> gw32
        7 -> Piece    <$> gw32 <*> gw32 <*> getRemainingLazyByteString
        8 -> Cancel   <$> gw32 <*> gw32 <*> gw32
        9 -> Port     <$> (fromIntegral <$> getWord16be)
        _ -> fail "Incorrect message parse"
    where gw32 = fromIntegral <$> getWord32be
    

    And there you have it: A Haskell version which is more succinct than the Erlang version while being type safe in the process. The reason it works is due to the clever type class hierachy in Haskell: The parser is a monad and all monads are applicative functors. This lets us combine parsers cleverly into an effective decoder for wire protocol messages.

    I do hope it is fairly efficient. Protocol decoding is probably going to be a hot place in the program.

    Data.ByteString.Lazy

    HCodecs

    Erlang code

    Haskell code

    2

    View comments

  5. So I spent the time necessary to look into this Behavior-driven-development idea. Summing it up in one sentence: "BDD is computable specifications". That is, you write a specification which is a program. This program takes as input a component and produces as output one of success or failure. A driver program then tries all the behavior-programs in succession and produces a report to the user or programmer.
    This is a natural extension of test-driven-development, where one writes a computable test for each component in the software system before the component itself is written. The insight is that these tests are specification behaviors in the system so one can name them as such rather than as "tests".
    The idea of building a model of the specification which can be computed is not new. There are strong similarities with Design by Contract (Meyer, 1986) and I am sure you can find older examples in the Lisp community with strong similarities. The BDD term is not defined rigorously, in tradition with most software development terms, sadly. Thus it means different things to different people. Some will say that DbC is BDD and some will say that it is not.
    Main insight: If you have a specification written in English prose, you have no way to automatically check if the program lives up to the specification. You can only test it manually, which is a tedious, cumbersome process prone to failures. If the process is computable, we can get the machines to work for us and work out the details.
    Getting the machine to understand the specification means it needs to be precisely specified, like a computer program. Our hope is that while the program specified may be complex, the specification itself is not. Thus we can specify the behavior of the program and have some assurance that the specification is correct. If the specifications are as hard to write as the main program it turns out to be turtles all the way down.
    There are several ways we could attempt to write down a formal specification. We could, for instance, use Hoare logic - meaning we specify the behavior of the program in an assertion language different from the implementation language. One then tests the specification by verifying that the rules of HL is fulfilled.
    Another path is to make the type system strong enough to capture formal logic and specify the program in the type system. To achieve this, one must have a type system much more expressive and powerful than what most mainstream programming languages provide. Testing the specification then amounts to a type check. Aside: Expressive type systems like the one found in Haskell or ML provides an approximation to this complete correctness. When we say that "type systems eliminate bugs in the program", we mean that the approximation caught a bug which would otherwise have slept into the program. Note that (static) type systems are a knob you can turn. The more expressive the system, the better an approximation to the specification. Turned to 11, the type system is the specification, but most programmers abhor the idea of programming at this level of precision. End of aside.
    The BDD path is to define the specification in the host-language as a program itself. Usually this can be done by a compiler from a domain-specific language (describing behaviors) into the language in which the tested component is written. The advantage of this approach is that you don't need a separate tool chain but can reuse the existing one for running your tests. Testing the specification is done by running the spec-program against the implemented components and observing their behavior.
    Thus, BDD is a runtime-oriented path to verifying the program specification. We check the program correctness by executing the test. I would imagine that we also design the behaviors such that they cover all paths of the component. This is fairly easily achieved by a coverage checker.
    Personally, I see the BDD-idea as a poor-mans formal specification -- it is what you end up with if you just use the available tools rather than define a spec-language. You end up with exactly the same problems as most formal specifications any way: Explaining and understanding what they mean is hard. The problem can somewhat be alleviated by clever tricks where the actual code is hidden in a textual presentation which normal people would have a chance at understanding. This part of BDD circles around the problem of defining a good User Interface for the specification in question. This is the part of BDD I find the most interesting.
    Also, the fine merits of BDD might be the first casualty to any deadline. Deadlines kill. Anything which can be placed into the scheme of "short-term good, long-term bad" and incurs technical debt is turned on as soon as the deadline approaches. Everything falls to a fast cost-benefit analysis. The problem here is that there are no measurable way to evaluate different software development methodologies so an equally good and faster analysis is thrown dice.
    0

    Add a comment

  6. Let there be no doubt, that I like to use PostgreSQL as my main database of choice. I like my databases to use SQL, a declarative language, for my operations and queries. I guess the reason for this is my strong knowledge and experience with functional languages in general. The path from functional programming to SQL is rather short.
    This is a list, in no particular order, why I love PostgreSQL. Note that many of these things are far better explained in the excellent documentation of PgSQL, so you can look up things there.
    • First, PgSQL takes types seriously. Data is typed from the outset and you need to convert data to match the given type. This means we rule out dates like '2009-02-31' by default. MySQL fails this blatantly for instance. Also, it allows us to import data into a typed language fairly easily as soon as we have a type mapping defined.
    • PgSQL has TOAST-storage. This includes several techniques like compression and out-of-line storage. What this means is that you can process data faster since disk I/O is what tend to kill database performance as soon as your data get beyond 'fits-in-memory'.
    • PgSQL can do partial indexes. I keep partial indexes on unprocessed data. Hence, even if the table contains billions of rows, I can quickly grab the unprocessed records and process them. I also keep some partial indexes around for performance.
    • PgSQL does bitmap index scans. This can combine several indexes into a bitmap telling the system what rows or pages to retrieve. Again, it lowers the amount of disk I/O needed to access data. Wonderful for complex queries.
    • PgSQL uses Multi-version concurrency control (MVCC). This means you can do backups of your database without having to worry too much about what happens while you take the backup. It also means you can be sure that your ACID compliance is done correctly without affecting the performance of individual queries.
    • You can disable autocommit by default in the psql client. No more accidentially destroying data.
    • Data Definitions (the DDL) are transaction safe! You can undo an ALTER TABLE statement if it goes wrong. It makes data manipulation into a blissful operation: Alter the table, CREATE VIEW to get the old view of the data back. COMMIT the transaction. You can do this on a live system, which is wonderful.
    • Need to add millions of rows? Use the COPY command (or its psql client \copy cousin). It is a lot faster than INSERT INTO for large amounts of data. My work desktop machine with an old SATA disk imports millions of (rather large) rows in a matter of minutes. This including raising a couple of indexes on the data.
    • Need to export data? CREATE TEMPORARY TABLE export ( ... ), fill it with a SELECT statement, fire up COPY. Done. I've built many reports directly in the guts of the psql prompt with a couple of VIEWs and this technique.
    • PgSQL is fast, especially for complex queries. This is especially true for recent PostgreSQL releases, where performance was the optimization target. And it paid off. Combine it with a query cache or a dedicated key-value store and you have a very stable, robust and fast platform which can scale to thousands of simultaneous users.
    • Built-in full-text-search. I've used this to answer search queries from users on the data with ease.
    And I get to combine PgSQL with Ocaml, awk and R. Don't underestimate the data manipulation power of awk, it is extremely fast.
    0

    Add a comment

  7. The single most important role in a software project, be it Open or closed source is the role of the gatekeeper. This person, or group of persons oversees the general progression and development of code in the project. Their responsibility is to read through the code of all commits and review them for correctness before they are let into the source code baseline upon which further development happens. There are many advantages in having a gatekeeper role in a software project. First and foremost, all code gets some review. Second, the gatekeeper can usually direct people to reuse existing code rather than write their own (At one time, the pine mail reader had several places in which file I/O operations happened for instance). Third, the gatekeeper can make sure that coding guidelines and structure is adhered to. If you are serious about the development, the gatekeeper must have the power of vetoing everything. It doesn't matter if it is the final release theres at stake. If the GK says no, it ain't gonna happen. This is important to ensure the quality in the project. Also, the GK should not have roles beyond the role of reviewing software. The best thing you can do for a software project is to get a model where people are willing to share their code patches with a mailing list so other people can chime in on the code. While the GK has the role of getting people to talk to each other and steer the project in general, getting other developers to work together is very important. I've seen to many projects with cage-isolation in the sense that each developer has their area of responsibility. It works fine until someone quits. Also, the GK hat could walk around among the most experienced developers. As a software vendor, you win if you can continue to improve the quality of your product and the skill of your creative software people. Constructive critique is a tremendous way to get this done. I don't give much for the idea of "Agile" development in general, but wether you adopt it or not, the GK role is crucial anyway. In some kind of perverted way, leading Open Source projects can't be wrong -- and you will find that almost all of them employ a de-facto gatekeeper somewhere.
    0

    Add a comment

  8. The ExSML project is a fork of Moscow ML which aims to replace the current MosML runtime with one based on LLVM. In Moscow ML the following things happen when you compile a file of SML source code: First the file is lexed and parsed into an abstract syntax tree. Then elaboration happens. In the elaboration phase we resolve, through type inference, the types of all expressions. We also resolve overloading such that all types are explicit from then on. Second the program is compiler by the frontend compiler into an Intermediate Representation (IR) called Lambda. This is an extended lambda calculus which is far simpler then core ML. At this step we also compile pattern matches into simple decision DAGs via a pattern match compiler by Peter Sestoft. The backend compiler then transforms lambda into bytecode which the mosml runtime reads and executes. We want to cut at the IR and produce LLVM code instead. However, there is a slight problem: LLVM is essentially a typed assembler whereas Lambda is untyped lambda calculus. To get around this, we have decided to use a typed variant of Lambda. This means we need a type system and we want the type system to be safe. Enter Twelf. The plan is to formalize the safety properties of the type system in Twelf. We are a far way from the goal at the moment as our Twelf-representation lacks many important things: reference operations, exceptions, reference state, recursive types, sum types, product types and polymorphism. But we hope to be able to add these to the representation with a little work. Knowing our language is type safe will give us some obvious advantages later on when we begin doing code transformations.
    1

    View comments

  9. Lately, we saw the emerging trend that newspapers are beginning to die. Good riddance. I've never bought a newspaper, I will never buy a newspaper, and I am pretty sure people younger than my age won't as well. The market is gone, so how are newspapers going to survive? Currently, they have placed their hope in advertising. As Steve Ballmer has said, it is all about advertising, advertising, advertising! This to the point where newspaper sites are infested to the brink with ugly colored flash behemoths in odd colors with nasty animations. Don't even get me started on what I want to do to the marketdroid who imagined that sound is a good idea, but let me divulge it includes a meeting with Mr. Garotte at the very least. I don't buy that advertising will be helping them anytime soon. First, I love to block Ads. I utterly despise ads, so I block them with AdBlock+ in Firefox. I don't want some idiotic marketdroid to invade and misuse my precious desktop space. Goodbye. This also has the nice side-effect that it undermines the Ad-industry and every industry basing their income system on Ads. Luckily, the majority of people can't block it, because they are too uneducated to install ad blocking software. Consequence: I can have my free stuff without paying for it or seeing Ads. But the majority will learn eventually, and that will be the slow and effective demise of Newspapers as we know them. We already face the first part of the "crisis" in Denmark as a major danish newspaper will have us pay for their service. This is a really good thing! I would probably like to pay others who use their time producing interesting material I can read about topics in my sphere of interest. But I also do not lament the death if everything I can get is the "Extended nobility package" where I get Wine and Restaurant reviews, news distilled to 99% irrelevance off of Reuters, and "Feminist" crap. Either newspapers will get a new networked revival or they will die. To survive, newspapers must produce something with a greater pastime value than perusing the /b/ board off 4chan (Might I add that the /b/ board is the sewage spill of the internet) -- or else irrelevancy will hit them where it hurts the most.
    0

    Add a comment

  10. People are talking about the RailsConf keynote "What killed Smalltalk?" by Robert Martin. You can watch the keynote talk at blip.tv. Of course, the Smalltalk guys weren't slow to answer the argument, which revolves around the fact that code easily becomes cluttered in Smalltalk programs. The arch-angel that will save this problem with Ruby is, of course, unveiled to be "Test Driven Development" much to the rejoice of the Audience.

    Martins presentation is flawless in the sense that he does a really good job at capturing the audience. Another interesting point is that he has no slides, just a stack of paper cards on which the key points of the talk is written. In Academia, this would not have worked for most talks I note. Sooner or later you'll at least need a blackboard to draw up and explain a formulae or similar.

    A side argument however is that the dynamically typed languages won over the statically typed languages. It is driven by the observation that Java now has reflection, which must mean it is dynamically typed, right? Unfortunately, this viewpoint is misguided in the sense it lacks delicate detail about the situation at hand. This blog postseeks to remedy the omission.

    Static vs. Dynamic typing is uninteresting

    Taking the argument as being static versus dynamic typing alone is a moot point. Here is why: suppose we want to embed a Java program in Ruby. This can easily be done, if we choose to ignore the types and strip them out. In fact, any statically typed language can be embedded in the dynamic world by stripping the types away. After all, assembly language is untyped, so compilers will at some point have to eschew the typing information for good in their transformation.

    But any dynamically typed language can also trivially be embedded in a language with static type discipline. We invent a new type, lets call it T and then proceed to give everything type T. This is what happens inside the Ruby interpreter written in C. Of course we must then have a way to discriminate the different values of T but this is exactly what low-level operations will do in a dynamically typed language anyway.

    Thus we have a way to convert back and forth between the static and dynamic world, so why do we not regard them as equal? We know they are not equal, but in what kind of sense are they not equal? If we convert from Java to Ruby and then back to the static world, everything will have the magic type T. We have lost the Java type information in the process. But this merely states that the conversion is not isomorphic.

    If one thinks about it, there is little to be gained by studying the trivial world of dynamic typing: everything would look the same once it has been converted into the dynamic world. If we look at the static world, we obtain a different picture.

    Meta-theoretic properties, baby!

    Since Ruby would be trivially typable in the statically typed world, the whole debate about static typing versus dynamic typing falls apart. It is uninteresting because the level at which we communicate is too low. What really matters are the properties of the type systems: what kind of guarantee do we have from the type system weighing in favor of complex type systems -- and the hassle of maintaining well-typed programs weighing in favor of trivial type systems.

    In Ruby, or Smalltalk, the contents of a value (an object) can be changed without having to retype big parts of the program. This is not always true in Java. It holds when there is a certain subtyping relationship between the objects in question, but it can't be done in general, since there is no "duck typing" available. Reflection in Java adds the magic type T so to speak so we can embed any type in a T and we can project any type out of a T. The latter might incur a runtime-error when done however; this is no different from Ruby where everything might potentially incur runtime errors.

    The static type system on the other hand tends to guarantee varying levels of type safety. If the program is well typed (read: passes the type checker), then there will be no type errors at runtime. Since there will be no errors at runtime, we can just strip away type discrimination at runtime as well -- yielding great speedup compared to dynamically typed languages.

    The difference is that Java, even with the addition of reflection, still have better meta-theoretic properties than Ruby, though we have added a way to poison the type system safety. This is not at all new to most static type systems out there, which allows coercing types unsafely if you really know what you are doing.

    Java is also a fairly simple language from a type-complexity perspective. ML is much further up the scale. And Haskell is even further up, with many more complex concepts in the type system. Most theorem provers rely on even more complex type systems to get their work done. In fact they have type system; so powerful they can, in principle, encode correctness specifications of the program. You do not really need unit tests at that level. But it becomes increasingly harder to write correct programs, so most proponents of static typing settles on something further down the scale.

    Also, most advanced languages will infer (deduce) types automatically. For ML it is customary to just specify types at the API-level and omit typing details elsewhere. This brings some of the ease from dynamic typing back into static typing.

    Wrapping up

    One can't possibly say that "dynamic typing won" when the world is much more complex. In reality one must ask: "What static type system won?"; which is not trivially answerable and subject to individual level of expertise, viewpoint, ideology, religion and what not.

    What is clear though should be that the type systems of Java with reflection and Ruby are far from equal. Neither are Java or Haskell for instance. Or Haskell and Coq. Or...

    Based on that, I think the view Mr. Martin puts forth is a bit too quick and simplified. I also hope we can move beyond the simple dynamic/static debate which is hardly interesting.

    And finally the snide remark: What is interesting is the viewpoint that type systems are chastity belts or impose Bondage & Discipline upon programmers when I have just shown, by trivial conversion, how to embed dynamic features into any statically typed language. You just got the skeleton key to the chastity belt, should you need it!

    3

    View comments

Loading