Lies, Damned Lies and Benchmarks
Steven J Vaughan-Nichols
June 18, 2013

Benchmarks can reveal the truth – well, some of the truth – about technologies. A well-constructed benchmark can provide a way to compare performance, reliability and other metrics that can make a difference in comparing product quality and effectiveness. But you have to look closely at what a benchmark is measuring and how it was run – and, even then, take its results with a large grain of salt. Here’s how to approach benchmarks and their touted results.

Let’s start with the fundamentals. We’d all like to have some magic number that reveals “The Truth” about how good any given device or program is. There are no such numbers. Anyone who tells you that there is such a single performance benchmark is either lying or trying to sell you something. What a good benchmark can do is give you accurate data about some performance aspects of a given technology for a specific audience. Then, taken in context, the benchmark can help you decide if that technology is what you need.

Before you can decide whether a benchmark is “good,” you need to know the facts about benchmarks in general. There are two general types: component and system level.

Breaking Down Benchmarks

In the first category, the evaluators look at a subset of a greater system. For example, SunSpider 1.0 focuses on measuring real-world Web browser JavaScript performance. It doesn’t try to measure other related browser performance issues, such as the DOM (Document Object Model) or other browser APIs. In the second category, exemplified by the Business Applications Performance Corporation’s (BAPCo) SYSmark 2012 Lite 1.5, the goal is to measure the overall performance of common business programs on 32-bit Windows 7 and Windows 8 PCs.

Beyond the component/system division is another dividing line between benchmark types: synthetic and application. In the former, the benchmark is written to measure a specific performance characteristic. Common examples of this kind of test are Whetstone, Dhrystone, and supercomputing’s Linpack. The core problem with synthetic benchmarks is that they focus on low-level performance issues. That may be fine if you’re trying to optimize code for speed on a particular platform, but it often doesn’t reflect real-world performance improvements.

Application-level benchmarks, such as SYSmark, Futuremark’s PCMark 8 (which measures application performance on Windows systems), and Peacekeeper (which looks at overall Web browser speed), attempt to evaluate what end-users can expect from a given platform or program. Application-level benchmarks, too, have built-in problems. As Michael J. Miller, CIO of Ziff Brothers Investments, a private investment firm, and former editor-in-chief of PC Magazine, pointed out in 2011:

Real-world benchmarks are always backwards looking. One of the legitimate complaints about SYSmark 2012 is that it doesn’t include many applications that take advantage of GPU computing (or “heterogeneous computing,” as AMD has taken to calling it). In particular, the versions of Internet Explorer and Firefox it uses aren’t the ones that use the GPU for rendering. But I’m not surprised, because it simply takes so long to create a good, repeatable, easy-to-distribute benchmark that the applications are always getting old by the time the test is ready. That’s always been the case.

Benchmarks are also always subject to criticism from those who claim that they’re not measuring products fairly (especially their own). In 2011, for example, AMD quit BAPCo on the grounds that SYSmark was biased in favor of Intel chips and that the benchmark didn’t reflect the growing importance of the graphics processing unit (GPU) for accelerating jobs such as video and audio encoding/decoding and Web browsing. On the other hand, AMD was emphasizing GPU performance in its new chips, so clearly it had its own axe to grind.
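
To make the synthetic/application distinction concrete, here is a minimal sketch in Python of what a synthetic test does. It is an illustration only, not Whetstone, Dhrystone, or Linpack themselves: it times a tight loop of low-level arithmetic and reports a single throughput score, which is exactly why it says little about what real applications will do.

    # A minimal sketch of a synthetic benchmark: time a tight loop of
    # floating-point arithmetic and report an operations-per-second score.
    # It measures raw low-level speed, not anything an end-user program does.
    import time

    def synthetic_score(iterations=5_000_000):
        x = 1.0
        start = time.perf_counter()
        for _ in range(iterations):
            x = x * 1.000001 + 0.000001   # low-level arithmetic only
        elapsed = time.perf_counter() - start
        return iterations / elapsed        # the "score": operations per second

    if __name__ == "__main__":
        print(f"Synthetic score: {synthetic_score():,.0f} ops/sec")

An application-level benchmark, by contrast, scripts real programs doing real work, which is why it takes so much longer to build and dates so quickly.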

Who’s right? Who’s wrong?

You can argue about what should and shouldn’t be measured in any specific benchmark until the cows come home, but Carl Nelson, owner of HardCoreWare, a PC hardware review and benchmarking site, sums matters up nicely when he writes, “You shouldn’t be using one benchmark to determine the performance of a system anyway.” He’s correct. You should use multiple benchmarks before deciding what the Gospel truth is about any given technology.

Benchmarks are limited in other ways. People expect them to produce some perfect, objective truth. They don’t. All the best benchmarks can really do is give you an approximation of how well software or hardware does in a specific circumstance. Even with the best benchmarks, bias can appear.

“Every benchmark needs to be taken with a grain of salt,” says Miller. “The battery life estimates that just about every major notebook vendor uses are perhaps the best example of how even a relatively fair benchmark can be misused. We all know that nearly every notebook gets worse battery life than what the company claims, usually based on a test called MobileMark, which is also created by BAPCo. In part, that’s because to get the best score, the vendors turn off or turn down a number of features (wireless, screen brightness, etc.) to get the best MobileMark score. But it’s also because, to create a fair test, MobileMark focuses on things that are easily repeatable.”

You see this kind of concern raised with any benchmark. For example, back in 1999, I ran the first benchmarks showing that Linux, with Samba and Apache, was a faster file and Web server than Windows NT. I often heard from people, both Linux and Windows system administrators, who complained that I hadn’t optimized the servers. That was absolutely true. But had I done so, I would have been measuring, in part, how good we were at optimizing server performance, rather than measuring how well each platform, in its default configuration, delivered files and Web pages.

So, then as today, you should not merely run a benchmark or read the results. You need to review how the benchmark is meant to be run and how it actually was run. For example, many Windows PC benchmarks are run with the supplied anti-virus (A/V) software turned off. Does this give a more “objective” view of how Windows works on a specific PC? Perhaps, but it also doesn’t reflect the real-world experience of anyone who wants his PC to work well for more than a few hours. Others would argue that to test a PC, all extraneous factors—screen savers, video acceleration, anti-virus software, disk caching—should be adjusted to common settings. There’s merit to both approaches. But to really use the benchmark’s results for your own decision-making, you need to know exactly what assumptions were used on the test bench. In the case cited above, my A/V-equipped PC is very likely to run slower than the same system without anti-virus.

It’s also far too common for vendors to publish results based on equipment that you won’t find in any real server room or data center. For example, the TPC-H benchmark, which is often used to measure DBMS performance, can be greatly influenced by the hardware it’s run on, far more than by the DBMS it’s testing. Curt Monash, a well-regarded technology industry analyst, observed that “Most TPC benchmarks are run on absurdly unrealistic hardware configurations.”

In addition, anytime you see comparisons using benchmarks, make sure that they’re really comparing apples to apples and oranges to oranges. This is actually much harder than it sounds.
With a PC, for example, every component—the processor, the I/O bus, the amount and type of memory, etc.—affects its overall benchmark performance. If any one piece isn’t matched properly with the others, the system’s performance will be slower than their simple sum might lead you to believe.

Indeed, some systems are almost impossible to benchmark “fairly.” When Peter Wayner, contributing editor of the InfoWorld Test Center, recently benchmarked “identical” Amazon EC2 Micro cloud instances with the Da Capo Java benchmarking suite, he found that the “results were all over the map.” While “Medium machines [3.75GB of RAM and a promise of one virtual core with two EC2 Compute Units] were much more consistent … even these numbers weren’t that close to a Swiss watch.” Why were the results so varied? Because on the cloud, you trade control for flexibility. As Wayner says, “On a bad day, you could end up sharing a CPU with 10 crazy guys trying to get rich by minting Bitcoins by melting the processor; on a good day… you could end up sharing a machine with 10 grandmothers who only update the Web pages with a new copy of the church bulletin each Sunday.”

I can’t blame you at this point if you think all benchmarks are worthless. They’re not. You just have to use them carefully, consider precisely what each one tries to do, and look critically at the methodology and the results. That done, benchmarks can help you find the right technology for the right job. Just never think that they’re anything so special that they will automagically make your IT decisions for you. They won’t.
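
If you run tests of your own, in the cloud or anywhere else, one practical lesson from Wayner’s experience is to repeat the same workload several times and report the spread, not just a single number. Below is a minimal sketch of that idea in Python; run_workload() is a hypothetical placeholder for whatever you actually want to measure, not part of any benchmark suite mentioned above.

    # A minimal sketch of quantifying run-to-run noise: time the same workload
    # several times and report the mean and spread instead of a single score.
    # run_workload() is a placeholder; substitute the task you care about.
    import statistics
    import time

    def run_workload():
        total = 0
        for i in range(2_000_000):   # placeholder CPU-bound loop
            total += i * i
        return total

    def benchmark(runs=10):
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            run_workload()
            timings.append(time.perf_counter() - start)
        mean = statistics.mean(timings)
        spread = statistics.stdev(timings)
        print(f"mean {mean:.3f}s, stdev {spread:.3f}s "
              f"({100 * spread / mean:.1f}% relative spread over {runs} runs)")

    if __name__ == "__main__":
        benchmark()

A spread that is a large fraction of the mean, like the numbers Wayner saw on Micro instances, tells you as much about the test environment as it does about the thing being tested.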
