Subject: [l/m 5/5/93] Performance metrics (5/28).c.be FAQ
Date: 5 Feb 1996 13:25:06 GMT

5.Performance Metrics.....<This panel>
6.Temporary scaffold of New FAQ material
7.Music to benchmark by
8.Benchmark types
9.Linpack
10.Network Performance
11.NIST source and .orgs
12.Benchmark Environments
13.SLALOM
14
15.12 Ways to Fool the Masses with Benchmarks
16.SPEC
17.Benchmark invalidation methods
18
19.WPI Benchmark
20.Equivalence
21.TPC
22
23
24
25.Ridiculously short benchmarks
26.Other miscellaneous benchmarks
27
28.References
1.Introduction to FAQ chain and netiquette
2
3.PERFECT
4


Performance/benchmark metric terminology

The usual important quote is:
        What's important is the time it takes to solve MY problem.
This does not help the architect designing the next machine.
It is an arrogant, closed-minded, Gestaltist statement which conflicts
with the analytic/reductionist needs of science.

Synthetic problems/benchmarks have some, if limited, value.
We walk before we run, and we crawl before we walk.  Similarly,
right now, there is more benchmarking noise than signal.

Perhaps the only, and certainly the best, measure is the second (time).
For one of the best-studied metrics, see the atomic clocks of NIST.
Even the second is subject to relativistic effects: Lorentz time dilation.
Don't laugh, this is becoming more important at the picosecond level.

Less reliable measures include:

MIP, GIP, TIP:
MIPS, GIPS, TIPS: Million (giga, billion; tera, trillion) Instructions
        Per Second
        : Meaningless Indicator of Performance
        : "Marketing's" Indicator of Performance
What's an "instruction"?
        An instruction is an event.  It is frequently a minute change in the
        state of a CPU (and the computer).  Frequently, the instruction rate
        is treated as synonymous with the clock rate of a machine: that
        ignores instructions requiring more than one clock tick to execute.

        A common fallacy among naive benchmarkers is that the CPU determines
        the speed of a computation.  This is frequently false.  The people in
        the know these days understand Amdahl's "other law":
        1 MIPS for each 1 MB of main memory at 1 MB/s transfer to disk
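
The rule of thumb above can be turned into a quick balance check.  What
follows is only a sketch: the function name, the dictionary keys, and the
machine figures are made up for illustration, and the 1:1:1 ratios are
taken exactly as this FAQ states them.

```python
# Balance check against the rule of thumb as stated in this FAQ:
# 1 MIPS : 1 MB of main memory : 1 MB/s of disk transfer.

def balance(mips, memory_mb, disk_mb_per_s):
    """Return memory and disk bandwidth per MIPS; 1.0 means balanced."""
    return {
        "mb_per_mips": memory_mb / mips,
        "mb_per_s_per_mips": disk_mb_per_s / mips,
    }

# Hypothetical machine: fast CPU, modest memory, slow disk.
report = balance(mips=100.0, memory_mb=64.0, disk_mb_per_s=5.0)
for name, ratio in report.items():
    verdict = "starved" if ratio < 1.0 else "adequate"
    print(f"{name}: {ratio:.2f} ({verdict})")
```

A ratio well below 1.0 suggests the CPU will spend its time waiting,
whatever its MIPS rating says.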

MFLOPS, GFLOPS, TFLOPS: Million (giga, billion; tera, trillion) Floating-Point
        Operations Per Second
        : The measure ignores non-floating-point instructions.  Particularly
        bad for numeric codes transitioning from 2-D to 3-D since additional
        time is required for array address calculation, and for algorithms
        requiring big non-numeric steps like matrix transposition.
        : the original program name for Frank McMahon's Livermore Loops
        program.
        : one of the metrics used by Dongarra's LINPACK benchmark.
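
To see why the measure ignores non-floating-point work, here is a small
sketch of how an MFLOPS figure is usually produced: a nominal operation
count divided by elapsed time.  The count 2n^3/3 + 2n^2 is the one
conventionally credited for a dense n-by-n solve; the two elapsed times
are hypothetical, as are the function names.

```python
def linpack_flops(n):
    """Nominal floating-point operation count for an n-by-n dense solve."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2

def mflops(flop_count, seconds):
    """Millions of (nominal) floating-point operations per second."""
    return flop_count / seconds / 1.0e6

n = 100
nominal = linpack_flops(n)

# Two hypothetical runs of the same solve.  The second spends extra time
# on address arithmetic and data movement, none of which is counted.
fast = mflops(nominal, 0.010)
slow = mflops(nominal, 0.025)
print(f"nominal flops: {nominal:.0f}")
print(f"run 1: {fast:.2f} MFLOPS   run 2: {slow:.2f} MFLOPS")
```

Both runs did the same floating-point work; only the uncounted work
changed, and the rating moved with it.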

LIPS, KLIPS, MLIPS: Logical "inferences" Per Second --
        from the Logic Programming community (Gabriel LISP benchmarks).
        Also available in Prolog (Evan Tick).  LIPS roughly correspond to
        "calls per second" for very simple predicates.

Packets Per Second: Unit of measure used by the networking and communications
        community.  Sometimes useful.
        : What do they do: make consistent packets?

MHz, GHz, Bits per Second, Bytes per Second, Words per Second:
        : Frequently used to mismeasure the performance of computer networks
        like Ethernet (tm).  This confuses the baseband carrier frequency
        with the data transfer rate.  It's not the truth, but not completely
        false.
        : Also sometimes called Null or Wait instructions.

TPS:    Transactions per second, the metric agreed on by
        the Transaction Processing Performance Council (TPC).
        : What's a transaction?

Stones: An arbitrary unit of computation based on the Whetstone
        (or Dhrystone or other *stone), which is subject to influences
        like compiler optimization or cache metrics.

Normalized metrics

SPECint92, SPECfp92: Normalized metrics based on performance against a
        DEC VAX-11/780.  Based on the SPEC integer/floating-point workloads
        CINT92 and CFP92.
SPECmark89: A normalized metric based on performance against a
        DEC VAX-11/780.  Based on the SPEC Release 1.2b workload (replaced by
        CINT92 and CFP92) on a 780 under glass.

Speed up:

Efficiency:
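
The FAQ leaves these two entries blank.  As a placeholder only, the
sketch below uses the conventional textbook definitions (speedup as
serial time over parallel time, efficiency as speedup per processor);
that choice is an assumption, not something this panel states, and the
timings are hypothetical.

```python
def speedup(t_serial, t_parallel):
    """Conventional speedup: serial run time over parallel run time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, processors):
    """Conventional efficiency: speedup divided by processor count."""
    return speedup(t_serial, t_parallel) / processors

# Hypothetical timings: 120 s on one CPU, 20 s on eight CPUs.
s = speedup(120.0, 20.0)
e = efficiency(120.0, 20.0, 8)
print(f"speedup {s:.1f}x, efficiency {e:.0%}")
```

Both numbers depend entirely on which serial time you pick as the
baseline, which is one more way to fool the masses.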

Our problem isn't counting seconds (intervals or days); it is that we are
not counting instructions, operations, or floating-point operations.

Event counts like instructions or operations are best done by non-intrusive
instruction/operation-counting hardware.  This is expensive, to say the least.
Software profilers/event counters are also sometimes useful, but they are
subject to optimization.

We need to distinguish "virtual" operations or instructions from
real or actual instructions.

Prefixes:
kilo, mega, giga, tera, peta, exa,
milli, micro, nano, pico, femto


Performance metrics are unlike conventional mathematics.
You can't make mathematical inferences (excepting "guaranteed not to
exceed" numbers), and you can't apply all mathematical operators.  The basis
for metric theory is a metric space X and a metric function d()
which maps pairs of elements in X to the real numbers, such that
        a) d(A,B) >= 0, and d(A,B) = 0 only when A = B
        b) d(A,B) = d(B,A)
        c) d(A,C) <= d(A,B) + d(B,C) [triangle inequality]
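
As a rough illustration, the usual metric axioms (non-negativity with
d(A,A) = 0, symmetry, and the triangle inequality d(A,C) <= d(A,B) +
d(B,C)) can be checked by brute force over a small set of results.  The
function name and the run times below are hypothetical; the point is only
that a popular comparison, the ratio of run times, is not a metric.

```python
from itertools import product

def metric_violations(points, d, tol=1e-12):
    """Return the names of the metric axioms that d violates on points."""
    failed = set()
    for a, b in product(points, repeat=2):
        # d must be non-negative, and zero exactly when a == b.
        if d(a, b) < -tol or ((d(a, b) <= tol) != (a == b)):
            failed.add("identity/non-negativity")
        if abs(d(a, b) - d(b, a)) > tol:
            failed.add("symmetry")
    for a, b, c in product(points, repeat=3):
        if d(a, c) > d(a, b) + d(b, c) + tol:
            failed.add("triangle inequality")
    return sorted(failed)

# Hypothetical run times in seconds for three machines on one benchmark.
times = [2.0, 4.0, 8.0]

def abs_diff(a, b):
    return abs(a - b)

def ratio(a, b):
    return a / b   # "machine A takes x times as long as machine B"

print("abs_diff:", metric_violations(times, abs_diff))
print("ratio:   ", metric_violations(times, ratio))
```

The absolute difference passes all three checks on this set; the ratio
fails identity and symmetry, so distances "in ratios" cannot be added or
compared the way one might like.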

You might have a benchmark sized for 128 elements.  A program might not test
well if it used 127 or 129 elements instead.  It is not possible to
infer or interpolate between values because of benchmarking "gotchas."
This is especially bad when dealing with powers of two: an artifact
of computer architecture, but sometimes also due to software (in a base-10
world).

Mathematics derives a large portion of its power because of assumptions of
continuity.  Computers are very discrete objects.  What works for case n might
not work for case n-1 or n+1 (vector architectures for instance).
Some interesting things are learned by simply modifying the size of a benchmark
by one (remember Kernighan and Plauger: beware off-by-1 errors).
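
A minimal sketch of the "change the size by one" experiment follows.  The
kernel is a stand-in (a dot product over Python lists) and all names are
hypothetical; in an interpreted language the power-of-two effects will
mostly be buried in overhead, so treat this as the shape of the
experiment rather than a demonstration of the effect.

```python
import time

def kernel(n):
    """Stand-in workload: a dot product over n elements."""
    a = [float(i) for i in range(n)]
    b = [float(n - i) for i in range(n)]
    return sum(x * y for x, y in zip(a, b))

def best_time(func, n, repeats=5):
    """Smallest wall-clock time over several runs of func(n)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(n)
        best = min(best, time.perf_counter() - start)
    return best

def sweep(func, center, radius=1):
    """Time func at center-radius .. center+radius, one size at a time."""
    return {n: best_time(func, n)
            for n in range(center - radius, center + radius + 1)}

results = sweep(kernel, 128)
for n, seconds in results.items():
    print(f"n={n:4d}  {seconds * 1e6:8.2f} us")
```

On a vector or cached machine the same sweep around a compiled kernel is
where the surprises tend to show up.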

Can you even be assured of consistent measures?
Most benchmarks try to run their tests in standalone conditions to
attain consistency.  This is an artifact of not being able to have a
non-intrusive measurement environment.

Measurement issues:
1) Reproducibility: first and foremost.  You must be able to reproduce
   performance.
2) Accuracy and precision.  Tough because of human limits.
3) Resolution.  Details sometimes count.
4) History (memory).
5) 
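
One way to put reproducibility and resolution into practice is to repeat
a measurement and report the spread alongside the clock's advertised
resolution.  This is only a sketch: the workload and the helper names are
hypothetical, and the timer is whatever the standard library provides.

```python
import statistics
import time

def measure(func, repeats=10):
    """Run func repeatedly and return the wall-clock time of each run."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return samples

def summarize(samples):
    """Report figures that show whether a measurement is reproducible."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "spread": max(samples) - min(samples),
        "clock_resolution": time.get_clock_info("perf_counter").resolution,
    }

samples = measure(lambda: sum(range(100_000)))
summary = summarize(samples)
for name, value in summary.items():
    print(f"{name:17s} {value:.3e} s")
```

If the spread is comparable to the median, or the median is comparable to
the clock resolution, the number is not yet worth reporting.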

Another important issue: measurement tools and environments.
What are some nice ones?
Simple (non-standard) software tools:
Several: 'arch' (names the architecture),
Cray: flotrace, hpm (hardware and software actually), others
SGI/MIPS: gr_osview, ancillary: hinv (hardware inventory), pixie
Convex: syspic,
Obsolete ones: gprof, prof (the names may not vary, but the tools do;
        watch for name collisions)

Other useful tools should be reported.  Why?  Because most people do
not get enough experience with the various kinds of tools out there
to understand their advantages, drawbacks, etc.

Beware of the graphical tools.  They can deceive you.  All performance
monitoring tools can deceive you.  Use them carefully.

Example of a good/useful tool from a 'Class A' measurement environment.
Sample Cray Research, Inc. Hardware Performance Monitor (HPM) output:

hpm VERSION 1.3

   (c) COPYRIGHT CRAY RESEARCH, INC.

    UNPUBLISHED -- ALL RIGHTS RESERVED UNDER
    THE COPYRIGHT LAWS OF THE UNITED STATES

 STOP  (called by EMPTY )
 CP: 0.001s,  Wallclock: 0.038s,  0.2% of 8-CPU Machine
 HWM mem: 97679, HWM stack: 2048, Stack overflows: 0
Group 0:  CPU seconds   :       0.00      CP executing     :         197638

Million inst/sec (MIPS) :      44.47      Instructions     :          52730
Avg. clock periods/inst :       3.75
% CP holding issue      :      42.57      CP holding issue :          84134
Inst.buffer fetches/sec :       0.77M     Inst.buf. fetches:            913
Floating adds/sec       :       0.21M     F.P. adds        :            246
Floating multiplies/sec :       0.23M     F.P. multiplies  :            267
Floating reciprocal/sec :       0.05M     F.P. reciprocals :             54
I/O mem. references/sec :       0.22M     I/O references   :            256
CPU mem. references/sec :      14.58M     CPU references   :          17287

Floating ops/CPU second :       0.48M
 STOP  (called by EMPTY )
 CP: 0.001s,  Wallclock: 0.002s,  4.2% of 8-CPU Machine
 HWM mem: 97679, HWM stack: 2048, Stack overflows: 0

Group 1:  CPU seconds  :        0.00119  CP executing:         198071

  Hold issue condition              % of all CPs       actual # of CPs
Waiting on semaphores              :   0.14                       284
Waiting on shared registers        :   0.00                         0
Waiting on A-registers/funct. units:   9.35                     18520
Waiting on S-registers/funct. units:  27.98                     55418
Waiting on V-registers             :   1.35                      2671
Waiting on vector functional units :   0.00                         9
Waiting on scalar memory references:   0.56                      1101
Waiting on block memory references :   1.86                      3685
 STOP  (called by EMPTY )
 CP: 0.001s,  Wallclock: 0.002s,  4.4% of 8-CPU Machine
 HWM mem: 97679, HWM stack: 2048, Stack overflows: 0

Group 2:  CPU seconds   :        0.00121     CP executing  :          201785

Inst. buffer fetches/sec   :       0.75M  total fetches    :             913
                                          fetch conflicts  :            5265
I/O memory refs/sec        :       0.00M  actual refs      :               0
    avg conflict/ref   0.00:              actual conflicts :             100
Scalar memory refs/sec     :       5.51M  actual refs      :            6668
Block memory refs/sec      :       8.77M  actual refs      :           10619
CPU memory refs/sec        :      14.28M  actual refs      :           17287
    avg conflict/ref   0.15:              actual conflicts :            2668
  CPU memory writes/sec    :       8.66M  actual refs      :           10479
  CPU memory reads/sec     :       5.62M  actual refs      :            6808
 STOP  (called by EMPTY )
 CP: 0.001s,  Wallclock: 0.030s,  0.2% of 8-CPU Machine
 HWM mem: 97679, HWM stack: 2048, Stack overflows: 0

Group 3:  CPU seconds  :        0.00119     CP executing:         198445

 (octal) type of instruction     inst./CPUsec      actual inst.  % of all inst.
(000-017)jump/special           :       5.30M             6315     11.98
(020-077)scalar functional unit :      33.24M            39578     75.07
(100-137)scalar memory          :       5.60M             6668     12.65
(140-157,175)vector integer/log.:       0.01M               14      0.03
(160-174)vector floating point  :       0.00M                2      0.00
(176-177)vector load and store  :       0.12M              141      0.27

  type of operation                ops/CPUsec       actual ops   avg. VL
Vector integer&logical          :       0.12M              138      9.86
Vector floating point           :       0.19M              232    116.00
Scalar functional unit          :      33.24M            39578
=====
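
The derived rates in the Group 0 output can be recomputed from the raw
counts, which is a useful habit with any monitoring tool.  The sketch
below assumes a 6 ns clock period; that figure is NOT printed in the
output and is an assumption here, chosen because it reproduces the
printed rates.  Variable names are made up for illustration.

```python
CLOCK_PERIOD_S = 6.0e-9    # ASSUMPTION: 6 ns clock period (not in the output)

# Raw counts copied from the Group 0 section of the sample output.
clock_periods = 197638     # "CP executing"
instructions = 52730       # "Instructions"
hold_issue_cps = 84134     # "CP holding issue"
fp_ops = 246 + 267 + 54    # F.P. adds + multiplies + reciprocals
cpu_mem_refs = 17287       # "CPU references"

cpu_seconds = clock_periods * CLOCK_PERIOD_S
mips = instructions / cpu_seconds / 1.0e6
cp_per_inst = clock_periods / instructions
hold_pct = 100.0 * hold_issue_cps / clock_periods
mflops = fp_ops / cpu_seconds / 1.0e6
mem_refs_m = cpu_mem_refs / cpu_seconds / 1.0e6

print(f"Million inst/sec (MIPS) : {mips:10.2f}")
print(f"Avg. clock periods/inst : {cp_per_inst:10.2f}")
print(f"% CP holding issue      : {hold_pct:10.2f}")
print(f"Floating ops/CPU second : {mflops:10.2f}M")
print(f"CPU mem. references/sec : {mem_refs_m:10.2f}M")
```

Note how little of this run is floating point: 567 floating-point
operations against 52730 instructions, which is why the MIPS and MFLOPS
figures tell such different stories.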

In memoriam: Rear Adm. Grace Murray Hopper, for all the "nanoseconds"
and "picoseconds" she passed out (30 cm/1 ft copper wires or salt grains).
She will be missed.

                   ^ A  
                s / \ r                
               m /   \ c              
              h /     \ h            
             t /       \ i          
            i /         \ t        
           r /           \ e      
          o /             \ c    
         g /               \ t  
        l /                 \ u
       A /                   \ r
        <_____________________> e   
                Language
 
