Arrow of time

Using Ministat


Recent benchmarks of FreeBSD 8 have provoked discussion of the benchmark results themselves, of benchmark methodology in general, and of upcoming improvements. One of the things mentioned is a small utility long used internally by FreeBSD developers that was recently "blessed" into being a generally available part of the operating system: ministat.

Ministat is a small statistical analysis utility that is mostly used to calculate statistical differences between data sets. It is extremely easy to use; in its most basic mode, which is what is commonly used, it accepts two or more input files, each containing numbers delimited by newlines (i.e. one number per line). The numbers (which in this case are "samples" for the analysis) can be in any C-parseable format.
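As a sketch of that input format, here is how such a file could be written and read back in Python. The filename and the sample values are made up for illustration; any one-number-per-line file will do:

```python
# Write a ministat-compatible input file: one C-parseable number per line.
# "timings.txt" and these values are hypothetical examples.
samples = [12.5, 13.1, 12.9, 14.0, 12.7]

with open("timings.txt", "w") as f:
    for s in samples:
        f.write(f"{s}\n")

# Read the file back the same way ministat parses it:
with open("timings.txt") as f:
    parsed = [float(line) for line in f if line.strip()]

print(parsed)
```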

The canonical example data used in the man page are the "iguana" and "chameleon" files. While I have no idea what these filenames are supposed to suggest about their contents, the data itself is usable. The "chameleon" file contains the following lines:

150
400
720
500
930

And the "iguana" file contains:


Note that the files contain a different number of samples. The samples used here must be "raw" data, taken directly from the measurements, and not processed or smoothed in any way. Running "ministat iguana chameleon" gives output like the following:

x iguana
+ chameleon
[ASCII box plot of the two data sets omitted]
    N           Min           Max        Median           Avg        Stddev
x   7            50           750           200           300     238.04761
+   5           150           930           500           540     299.08193
No difference proven at 95.0% confidence

This example case is deliberately contrived to show how easy it is to have samples that look completely different when viewed casually (look again at the input data!) but whose properties, mostly the standard deviation, lead to no statistically usable conclusion.

As the above ministat output says, no difference between the two data sets can be shown at the 95% confidence level.
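Ministat's verdict comes from a Student's t-test on the two samples. As a rough stdlib-only sketch of the same computation (the sample values are those from the man-page example, matching the summary statistics above; the critical value for 10 degrees of freedom is hard-coded here, where a real implementation would look it up in a t-distribution table):

```python
import math
import statistics

# The two data sets from the ministat man-page example.
iguana = [50, 200, 150, 400, 750, 400, 150]
chameleon = [150, 400, 720, 500, 930]

n1, n2 = len(iguana), len(chameleon)
m1, m2 = statistics.mean(iguana), statistics.mean(chameleon)
v1, v2 = statistics.variance(iguana), statistics.variance(chameleon)

# Pooled sample variance for the equal-variance two-sample t-test.
pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t = (m2 - m1) / math.sqrt(pooled * (1 / n1 + 1 / n2))

# Two-sided critical value of Student's t for df = 10 at 95% confidence.
t_crit = 2.228

print(f"t = {t:.3f}")
if abs(t) < t_crit:
    print("No difference proven at 95.0% confidence")
```

Despite means of 300 and 540, the t statistic comes out around 1.55, well below the critical value, so the difference cannot be distinguished from noise at this sample size and variance.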

The application to benchmarks is, I hope, obvious. One other thing potential users need to keep in mind is to explain what the results represent: the units ("bytes", "iterations", "clock cycles", etc.) and the scale ("kilobytes", "megabytes", etc.) of the data sets. It's important to communicate this information, along with "which is better" (e.g. "bigger is better"), to the readers. It would also be nice to provide detailed instructions on how to reproduce the tests, so the readers can attempt to do so.

I should probably note that there are several similar OS benchmark results of mine circulating around the net, done some years ago, that would probably fail such a statistical analysis. I hope I have learned something over the years :) The FreeBSD wiki has some more information on benchmark procedures.

Note that it is still possible to produce bogus results even with such analysis (which says nothing about the specific environment and setup of the benchmarks), but using a tool such as ministat is a necessary first step toward producing usable results.

The ministat tool is simple enough to be trivially portable to any platform.

#1 Re: Using Ministat

Added on 2009-12-03T03:16 by JWM
"No difference proven at 95.0% confidence"

This is a very poor usage of the statistical language. Statistics doesn't
"prove" anything. Instead, it should read something like "the null hypothesis
that these samples are different cannot be rejected at a 95% confidence
interval." Proof implies that the samples are not being pulled from different
populations but, in fact, there is a 5% probability, given these samples, that
they are being pulled from different populations.

#2 Re: Using Ministat

Added on 2009-12-03T03:18 by JWM

Oops. I apologize for the wrapping.

#3 Re: Using Ministat

Added on 2010-04-09T00:27 by Mark-o-rama-dome
Cool, I found a Linux port here:
