Helping with HaskellTorrent
I hold the ideology that an Open Source project must be easy to hack for people other than the original author, and I am trying to live up to it with HaskellTorrent. In particular, I deliberately leave several things unsolved so that others can join the fray, should they want to. What follows is a list, in no particular order, of some of the things left to be done.
Do note that I tend to keep bugs off this list; many projects use the bug tracker as a way to track what needs doing. Instead, there is a TODO.md file in the top-level directory, and some of its entries are taken from this list.
HAVE message pruning
Torrent files work by splitting a file into pieces and then breaking pieces into blocks. The blocks are then exchanged between peers until every block in a piece is downloaded. At that point, the SHA1 checksum of the piece can be used to verify the piece was downloaded correctly. If this is the case, a HAVE message is broadcast to all peers we are connected to. This notifies other peers of the newly available piece so they can begin requesting it.
However, there is an optimization opportunity: if a peer has already told us (usually by a HAVE message of its own) that it has the piece, there is no reason to tell it about the availability. Thus, we can prune the sending of the HAVE message for those clients. The PeerP process already keeps an IntSet with the necessary information, so the change is fairly simple.
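The pruning could be sketched as follows. The names here are illustrative, not HaskellTorrent's actual API; the point is just that the decision is a single IntSet membership test:

```haskell
import qualified Data.IntSet as IS

-- Hypothetical sketch: a peer process keeps the set of pieces the remote
-- end has told us about (built from its BITFIELD and later HAVE messages).

type PieceNum = Int

-- Decide whether to forward our HAVE for a freshly verified piece.
shouldSendHave :: PieceNum -> IS.IntSet -> Bool
shouldSendHave piece remotePieces = not (piece `IS.member` remotePieces)

-- Prune the broadcast list: keep only the peers that lack the piece.
pruneHave :: PieceNum -> [(peerId, IS.IntSet)] -> [peerId]
pruneHave piece peers =
    [ pid | (pid, have) <- peers, shouldSendHave piece have ]
```

Since the IntSet is already at hand in PeerP, the change amounts to one membership test before queueing the message.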
Optimize Peer/PieceMgr communication
Profiling shows we spend a considerable amount of time in the communication between the Peer and the Piece Manager. When downloading, the Peer tries to keep a pipeline of block REQUEST messages going towards the other end; doing so mitigates the round-trip delay of the internet. Hence, it periodically asks the Piece Manager for new blocks to request, sending the number of blocks it wants together with an IntSet of the pieces the peer at the other end has.
The Piece Manager is responsible for giving peers exclusive access to blocks. There is no reason to download the same block from multiple peers, so it keeps track of which blocks were requested. Also, it must only serve blocks that the given peer can actually deliver. Another goal is to complete a piece as early as possible, so we try to serve blocks from pieces already in progress first.
Currently, the system looks at the pieces in progress and tries to serve from these. If all blocks on the in-progress pieces have been taken, we pick a pending piece at random among those available at the peer. This piece is then promoted to being in progress and the algorithm runs again.
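A self-contained sketch of this allocation policy might look like the following. The real Piece Manager state is richer, and the random choices are replaced here by taking the first candidate:

```haskell
import qualified Data.IntSet as IS
import qualified Data.Map as M

-- Hypothetical sketch of the policy described above, not the actual code.

type PieceNum = Int
type Block    = (PieceNum, Int)        -- (piece, block index within piece)

data St = St
  { inProgress :: M.Map PieceNum [Int] -- unrequested block indices per piece
  , pending    :: IS.IntSet            -- pieces not yet started
  }

-- Hand out up to n blocks that the remote peer can deliver.
grabBlocks :: Int -> IS.IntSet -> St -> ([Block], St)
grabBlocks n peerHas st = go n st []
  where
    go 0 s acc = (reverse acc, s)
    go k s acc =
      case [ (p, bs) | (p, bs) <- M.toList (inProgress s)
                     , p `IS.member` peerHas, not (null bs) ] of
        ((p, b:bs) : _) ->       -- serve from a piece already in progress
          go (k - 1) s { inProgress = M.insert p bs (inProgress s) }
             ((p, b) : acc)
        _ ->                     -- nothing serveable: promote a pending piece
          case IS.toList (IS.intersection (pending s) peerHas) of
            []      -> (reverse acc, s)  -- peer has nothing we need
            (p : _) ->                   -- the real code picks at random
              go k s { inProgress = M.insert p (blocksOf p) (inProgress s)
                     , pending    = IS.delete p (pending s) } acc

    blocksOf _ = [0 .. 3]              -- stand-in: four blocks per piece
```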
There are a couple of nice optimizations possible. First, if the peer is a seeder, it effectively has every piece; there is no reason to keep the IntSet around for those peers, nor to carry out expensive IntSet intersections. Second, if the peer is not a seeder, it would be better to keep a bit array around rather than an IntSet: the set accounts for a good amount of live memory in the Peer processes, and we expect there to be quite a few of those. Third, we could benefit from caching the last piece the peer requested blocks from. Trying this cached element blindly in the Piece Manager is advantageous to us, since we spend less time in the Piece Manager code. It is also nice for the peer: if a piece is cached at the other end, we help by requesting as many blocks as possible from that piece, and disk IO is a point of contention in modern bittorrent client implementations.
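The first optimization could, for instance, be captured in a small sum type, so that seeders never pay for a set at all. Again a sketch, not the current code:

```haskell
import qualified Data.IntSet as IS

-- Hypothetical: represent "what the remote end has" as a sum type so a
-- seeder carries no set and membership/intersection become trivial.

data RemotePieces = Seeder | Has IS.IntSet

hasPiece :: Int -> RemotePieces -> Bool
hasPiece _ Seeder  = True
hasPiece p (Has s) = p `IS.member` s

-- Intersecting with a candidate set is free for seeders.
eligible :: IS.IntSet -> RemotePieces -> IS.IntSet
eligible cand Seeder  = cand
eligible cand (Has s) = cand `IS.intersection` s
```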
Always pick the rarest piece first
We currently pick pending pieces at random. A better scheme is to track their availability and pick the rarest first. The rarest piece is the one we can spread most easily, which maximizes our ability to give back, which in turn maximizes our ability to download fast. The right Haskell cabal library for this is Data.PSQueue in the package PSQueue. If several pieces are eligible at the same rarity, we pick one of them at random. This makes for an excellent use of view patterns in Haskell, should one be interested.
To do this, there is a milestone to reach beforehand: when we receive knowledge of piece availability, either through a BITFIELD or a HAVE message, we should propagate that information to the Piece Manager. In the beginning, we can simply throw the information away there; it paves the way for a PSQueue-based implementation, however.
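The availability tracking could be sketched like this. To keep the example self-contained I use Data.Map from containers as a stand-in for Data.PSQueue, whose minView would replace the scan in real code:

```haskell
import qualified Data.Map.Strict as M

-- Hypothetical sketch of rarest-first bookkeeping; names are illustrative.

type PieceNum = Int
type Avail    = M.Map PieceNum Int  -- piece -> number of peers that have it

-- Called on every HAVE (and once per set bit in a BITFIELD).
haveMsg :: PieceNum -> Avail -> Avail
haveMsg p = M.insertWith (+) p 1

-- All pieces tied for lowest availability; the caller picks one at random.
rarest :: Avail -> [PieceNum]
rarest avail
  | M.null avail = []
  | otherwise    = [ p | (p, c) <- M.toList avail, c == lo ]
  where lo = minimum (M.elems avail)
```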
Increase code quality
Run hlint on the code; there is a target for it in the Makefile, which has recently been fixed. Or find a GHC warning which is not yet enabled in the .cabal file, enable it, and fix the bugs that turn up. I guess some of these are more evil to fix than others, so start with the easier ones.
Another area is tests. Running HaskellTorrent --tests will run the embedded test suite. At the moment this fails on 64-bit architectures (I think; that is as far as my narrowing-down has come). If you spot any area which you can figure out how to test, I would really like to discuss it with you. The code can do with a lot more testing than it has right now.
Discuss use of attoparsec, attoparsec-iteratee and the event library
I am seriously contemplating impaling all network performance bottlenecks once and for all by using this triple. If you are interested in hacking on these, I'll be happy to talk to you about it. It would push the bottleneck down to the Disk layer, I think. The cool thing is that there is code to gain inspiration from.
Another nail in the coffin would be to support the FAST extension in the new parser right away. It would pave the way for the rest of the client to understand this extension, so we would get better and faster communication. It also plugs a hole in the original bittorrent specification.
Use mmap() for Disk IO
The right way to do disk IO is to mmap() the files. Reading and writing them is not going to be hard, but we also need to get the hopenssl library to talk to the mmap()'ed backend so we get fast checksum calculations.
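A minimal sketch of reading one piece via mmap, assuming the mmap package from Hackage (System.IO.MMap); wiring hopenssl to the mapped region is the harder part and is left out:

```haskell
import qualified Data.ByteString as B
import System.IO.MMap (mmapFileByteString)

-- Hypothetical sketch, assuming the `mmap` package. The OS pages the piece
-- in lazily; hopenssl would then checksum the resulting ByteString.

-- Map a single piece of a file into memory as a ByteString.
readPiece :: FilePath -> Int -> Int -> IO B.ByteString
readPiece path pieceSize pieceNum =
    mmapFileByteString path
        (Just (fromIntegral (pieceNum * pieceSize), pieceSize))
```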