The interesting situation is not the standard situation. What you want to know is not that server X is 30% faster than server Y, but rather what happens when the server gets a spike of incoming traffic. That is, when the Poisson process teams up against you and you hit serious load. In that case, you want the server that can sustain the load, not the server that degrades in performance.
It is far better to dismiss some operations as errors quickly, but then actually serve the requests you accepted, and serve them fast. Raw speed matters less: a difference of 30 or even 70% is just a constant factor that decides when you need to scale, either to a faster machine or to a new one next to the computer you already have. Of course, at 150% it begins to matter a lot which server is faster, but for most servers the speed is good enough that it doesn't matter. Stability, and the overload situation, is the more important thing to optimize for.
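One way to get that behavior is to put a hard cap on the amount of concurrent work and reject everything above the cap outright. A minimal sketch, assuming a threaded server; the handler and the capacity of 100 are made up:

    import threading

    MAX_IN_FLIGHT = 100  # made-up capacity; tune for your server
    slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

    def serve(request):
        return "200 OK"  # stand-in for the real work

    def handle(request):
        # Fail fast: reject immediately when full, rather than queueing
        # the request and dragging down the latency of everyone else.
        if not slots.acquire(blocking=False):
            return "503 Service Unavailable"
        try:
            return serve(request)
        finally:
            slots.release()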
Errors and latency:
In a benchmark, there is a bar which defines what counts as an error. In some benchmarks, an error is a request which never completed. In others, it is a request which took too long. There is little reason to accept a request that takes 10 seconds: the poor user will almost always have reloaded or clicked again by then. Modern users are impatient to the point of being painfully so.
This leads to the idea that we should regard requests which are too slow as errors as well. If a request doesn't complete within 5 seconds, it is an error. But at far lower thresholds, users may already become bored with your site and stop using it. Even a 100ms bump can mean a lot.
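In code, the rule is a one-liner. A sketch with made-up latency samples and a 5-second cut-off:

    LIMIT = 5.0  # seconds; pick the bar that fits your users
    latencies = [0.131, 0.250, 6.800, 0.090, 12.100]  # made-up samples

    errors = [t for t in latencies if t > LIMIT]
    rate = 100.0 * len(errors) / len(latencies)
    print("error rate at %.1fs cut-off: %.1f%%" % (LIMIT, rate))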
This leads to virtue numero uno of the benchmarker: record latencies for the measurements.
You should, ideally, be recording each of your 250,000 requests and their individual latencies. But this is usually too much for the benchmark tool to handle. Rather, you want the tool to sample at random, record, say, 2,500 of the requests, and store them in a file. The cool thing about this is that you can hand out the data when you present your findings. It is like cake to a statistician, who will be able to work with the raw data.
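The sampling itself does not require holding all 250,000 numbers in memory: reservoir sampling keeps a uniform random sample in constant space. A minimal sketch; the exponentially distributed latencies are simulated stand-ins for real measurements:

    import random

    def reservoir(stream, k=2500):
        # Keep a uniform random sample of size k from a stream,
        # without ever storing the whole stream (algorithm R).
        kept = []
        for i, latency in enumerate(stream):
            if i < k:
                kept.append(latency)
            else:
                j = random.randint(0, i)  # inclusive on both ends
                if j < k:
                    kept[j] = latency     # evict a random earlier sample
        return kept

    # Sample 2,500 out of 250,000 simulated request latencies to a file.
    stream = (random.expovariate(1.0) for _ in range(250000))
    with open("latencies.txt", "w") as f:
        for t in reservoir(stream):
            f.write("%f\n" % t)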
Beware the heretic: Average
Unfortunately, many benchmark tools report statistics rather than raw data. They will give you the minimum, maximum, average, median, stddev and so on. The very first thing you should do when you get raw data is to plot them. You want to see what the data looks like. Just a simple histogram plot can often tell you a lot about the data at hand:
      Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    0.1309   5.6200   6.7750   6.8090   7.9960  13.3500
For a web server, the Min value tells us something about the lower limit. If the server is 30ms away travelling at the speed of light, that is the minimum value we can ever hope for. The Maximum observation is interesting as well: if it is too high, the user will not hang around and is gone. The Mean is the average value of the observations; we'll get back to the dreaded mean. The 1st quartile, Median and 3rd quartile are obtained by lining up all the observations in sorted order to get the distribution function:
And then you pick the observation 25% in from the left of the X-axis as the 1st quartile, the one at 50% as the median, and finally the one at 75% as the 3rd quartile. If there is no middle value (in the case of an even number of observations) you take the mean of the two middle observations. For the first example plotted here, the median and the mean are basically the same value.
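A sketch of that procedure, where the quartiles are computed as the medians of the lower and upper halves; that is one common convention, and R's default interpolation differs slightly, so the last decimals may not match:

    def median(xs):
        xs = sorted(xs)
        n = len(xs)
        if n % 2 == 1:
            return xs[n // 2]
        return (xs[n // 2 - 1] + xs[n // 2]) / 2.0  # mean of the two middle

    def summary(xs):
        xs = sorted(xs)
        n = len(xs)
        return {
            "Min":     xs[0],
            "1st Qu.": median(xs[: (n + 1) // 2]),  # median of lower half
            "Median":  median(xs),
            "Mean":    sum(xs) / float(n),
            "3rd Qu.": median(xs[n // 2:]),         # median of upper half
            "Max":     xs[-1],
        }

But what about example 2?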
      Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    0.1309   5.8300   7.1380   9.4180   8.9150  26.1900
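The kind of data behind a summary like this is, for instance, a bimodal mixture: a fast path and a slow path. A sketch that generates such a data set, with made-up parameters chosen to roughly match the numbers above:

    import random

    # About 85% of responses land near 7, about 15% near 25.
    xs = [random.gauss(7.0, 1.5) if random.random() < 0.85
          else random.gauss(25.0, 1.0)
          for _ in range(2500)]
    xs.sort()
    mean = sum(xs) / len(xs)
    med = xs[len(xs) // 2]
    print("mean %.2f  median %.2f" % (mean, med))  # mean ~9.7, median ~7.2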
That summary looks bad. The median and the mean don't line up, and this should normally be a warning sign telling you to plot the histogram. If you look at the histogram plot above, you will find that the observations around 25 skew the mean far to the right. A value of 9.4 is in fact not very likely: you are more likely to get either a value around 7 (about 85% of the time) or around 25 (about 15% of the time). The median tells us that a typical value is around 7. The distribution plot should tell us something as well:
A value of 17 is highly unlikely. Yet, had I constructed more observations in the 25 area, the mean would have been 17. That is why the mean is a bad indicator for a data set like this, and it is why we want the raw data. If you report a single number, make it the median. Or put whiskers on your values so we can see the development as you increase the load on the server:
Notice how Mark Nottingham adds small vertical bars to each measurement so he can report the range in which most of the responses fall? That is a neat idea!
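Those bars are cheap to compute once you have the raw samples. A sketch that reports the median with a 5th-95th percentile range per load level; the load levels and latencies are simulated, and the percentile function is the crude nearest-rank kind:

    import random

    def percentile(xs, p):
        xs = sorted(xs)
        return xs[min(len(xs) - 1, int(p * len(xs)))]

    for load in (1000, 5000, 10000):
        lat = [random.expovariate(1000.0 / load) for _ in range(2500)]
        print("load %5d: median %7.2f  [%7.2f .. %7.2f]" % (
            load,
            percentile(lat, 0.50),
            percentile(lat, 0.05),
            percentile(lat, 0.95)))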
Even better is to use a boxplot for each measurement, so you can see how the observations spread around the median:
I think there is a gem to be found here, currently hidden from our scrutiny. I hypothesize that some web servers are far better than others at harnessing latency and keeping it stable, whereas other servers skew the results a lot. That tells you a lot about the robustness of a server, and it is another key factor to measure. I don't care as much about requests per second as I care about this number. Bragging about requests per second is like claiming your network is better because it has more bandwidth, while blindly ignoring the question of latency (a rant for another time).
Face it: if you measured two web servers correctly, you could do the statistics to figure out whether one is significantly (in the statistical sense) better than the other. In Ostinelli's work, http://www.ostinelli.net/a-comparison-between-misultin-mochiweb-cowboy-nodejs-and-tornadoweb/ for instance, we see that mochiweb is only 66% faster than Node.js, which isn't going to make a dent in practice. The reported 25% difference between Misultin and Cowboy even less so. But the response time for Node.js is 6.6 times as large as Mochiweb's (around 600ms compared to around 90ms at a load of 10,000). That is a difference you should be worried about. And that is the reason for my plea:
Give us latency samples when you benchmark web servers!
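With raw samples in hand, the significance check itself is no big deal. A minimal sketch of a permutation test on the difference of medians, with made-up samples; a statistician would reach for a proper test, but the idea is the same:

    import random

    def median(xs):
        xs = sorted(xs)
        n = len(xs)
        return xs[n // 2] if n % 2 else (xs[n // 2 - 1] + xs[n // 2]) / 2.0

    def perm_test(a, b, rounds=10000):
        # How often does randomly reshuffling the server labels produce
        # a median difference at least as large as the observed one?
        observed = abs(median(a) - median(b))
        pooled = a + b
        hits = 0
        for _ in range(rounds):
            random.shuffle(pooled)
            x, y = pooled[:len(a)], pooled[len(a):]
            if abs(median(x) - median(y)) >= observed:
                hits += 1
        return hits / float(rounds)

    server_a = [random.gauss(90.0, 10.0) for _ in range(500)]  # made up
    server_b = [random.gauss(95.0, 10.0) for _ in range(500)]  # made up
    print("p-value: %.4f" % perm_test(server_a, server_b))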
(Edit: Smaller changes to sentences in the beginning to get the narrative to flow better)