1
0
Files
2022-09-29 17:59:04 +03:00
..
2022-09-29 17:59:04 +03:00
2022-09-29 17:59:04 +03:00
2022-09-29 17:59:04 +03:00
2022-09-29 17:59:04 +03:00
2022-09-29 17:59:04 +03:00
2022-09-29 17:59:04 +03:00
2022-09-29 17:59:04 +03:00
2022-09-29 17:59:04 +03:00
2022-09-29 17:59:04 +03:00
2022-09-29 17:59:04 +03:00
2022-09-29 17:59:04 +03:00
2022-09-29 17:59:04 +03:00
2022-09-29 17:59:04 +03:00
2022-09-29 17:59:04 +03:00
2022-09-29 17:59:04 +03:00
2022-09-29 17:59:04 +03:00
2022-09-29 17:59:04 +03:00
2022-09-29 17:59:04 +03:00
2022-09-29 17:59:04 +03:00
2022-09-29 17:59:04 +03:00
2022-09-29 17:59:04 +03:00
2022-09-29 17:59:04 +03:00

TABLE OF CONTENTS

This document is divided into the following sections.

  THE BENCHMARK
  MANIFEST
  RUNNING THE BENCHMARK
  BENCHMARK DATA
  VISUALIZING THE DATA
  ADDING A NEW OPERATION TO THE BENCHMARK
  ADDING A NEW VISUALIZATION METRIC TO THE BENCHMARK


THE BENCHMARK

This benchmark tests the performance of the hardware cache coherency
under varying levels of memory contention and using different
shared-memory operations.

The test method is to have a number of processors contending for a
single cache line by applying one of the following shared-memory
operations:

   fetch_add		- 64bit atomic increment using load-linked/store-cond
   fetchop_fetch_add	- 64bit atomic increment using SN0 fetchop memory
   load			- load 64bit word
   store		- store 64bit word

Each processor applies the shared memory operation to a shared variable
and then spins on locally cached memory (simulating parallel work).

The benchmark collects data for a varying number of processors and a
varying parallel workload.  The data contains per-processor values of
the total wall, user, and system times elapsed, number of successful
operations (nops), number of failed operations (fops), wall, user, and
system time per successful operation, number of page faults and page
reclaims, and voluntary and involuntary context switches.

The raw output of time per operation includes the time spent in the
parallel work loop.  See `VISUALIZING' below for ways to convert the
data into more reasonable formats.

MANIFEST

startup*	- prepare a machine to run benchmark
cleanup*	- release a machine after running benchmark
config.example  - example benchmark configuration file
cycle.c		- sgi cycle counter front-end
cycle.h		- sgi cycle counter front-end
data/		- data collected from previous runs
gen_charts*	- generate ASCII charts of benchmark data
gen_composite*	- generate composite images from single images
gen_everything*	- generate a full set of charts and images from data
gen_single*	- generate single images from benchmark data
rescale*        - match the scale of two ASCII charts, to compare graphs
lat_op.c	- benchmark source code (framework)
lat_op.h	- benchmark source code
lat_op_asm.s	- benchmark source code (shared-memory ops)
makechart.scz	- generate an image from ASCII chart (Wingz macro)
procs_vs_workload* - massage data into an ASCII chart
run_it*		- run the benchmark given a configuration file
savechart.scz	- save a generated chart (Wingz macro)
toggle.scz	- switch between contour and surface view (Wingz macro)
readyprint.scz  - (partially) prepare an ASCII chart for printing (Wingz macro)

RUNNING THE BENCHMARK

Lines beginning with '$' are commands to be executed as is, except as
noted below.  Lines beginning with '#' must be run as root.  BENCH is
the benchmark source directory; RUN is your private data directory.

$ cd BENCH
$ make go
$ cp go RUN
$ make clean

$ cp BENCH/config.example RUN/config

Edit "RUN/config" to suit your needs.

Replace <my-email-address> with your email address, s.t. people who
try to log in will see that you've reserved the machine for the
benchmark.

# BENCH/startup <my-email-address>

The machine is now more or less reserved for the cache contention
benchmark.

$ cd RUN
$ BENCH/run_it

Approximate run time in seconds = `time' * ( `maxnthr' / `delnthr' )
	* ( `maxwork' / `delwork' ) * number of `ops'

A typical run takes 6 - 8 hours.

# BENCH/cleanup

The machine should be back to its original state now.

BENCHMARK DATA

The benchmark will have placed its data in the directories
<arch>.<hostname>.<date>/{op1,op2,...} (one directory per shared
memory operation tested).

Each directory will contain the following files:

   data-<hostname>-w<work>-t<nthr>*
	- data for <work> parallel workload and <nthr> threads (processors)

   hinv-<hostname>
	- the output of hinv(1M) at the time the benchmark was run

   uname-<hostname>
	- the output of "uname -a" at the time the benchmark was run

   errors-<hostname>
	- the stderr output of the various runs (should be empty)

   mpadmin-<hostname>
	- the output of "mpadmin -s" at the time the benchmark was run
	  (all of the processors which you're working with should be
	  isolated and preferably non-preemptive; see mpadmin(1))

   ticks-<hostname>
	- the calibrated ``nop-timer'' number of ticks per microsecond
          (for some amount of repeatability, you can run "go -z<ticks>")

VISUALIZING THE DATA

[N.B. You have to have Wingz installed on your machine to run
gen_single, which is called by gen_everything to create 3D graphs of
the data.  You have to have netpbm installed to run gen_composite,
which is called by gen_everything to put two graphs into one image
file.]

$ gen_everything RUN/charts RUN1 [RUN2]

This will take a while.  While gen_single is running, Wingz will take
over control of your display (for ~5 minutes).

gen_everything will generate the following:

    charts/RUN1_<op>_<metric>.{txt,gif}
    charts/cmp__<op>_<metric>.gif

where <op> is a shared-memory operation in the cache benchmark
and <metric> is one of the following:

   o "absolute" - time in usecs per shared-memory operation
   o "failed" - number of failed operations
   o "fairness_progress" - fairness of successful operations
   o "slowdown" - slowdown of one processor with N others vs. alone

Currently the measure of "fairness" is the difference between the
maximum and minimum number of operations completed by all N CPUs.
Another metric of fairness is the variance or standard deviation of
the number of completed operations per processor; this can be viewed
by changing the "fairness" variable in gen_charts.

For a qualitative view of the difference between two architectures,
look at charts/cmp_*.gif.


ADDING A NEW OPERATION TO THE BENCHMARK

Let's say you wanted to add an operation called "myop" to the
benchmark.  Following these steps should suffice to add a simple
assembly operation.  Anything more complicated is left as an exercise
for the reader.  I apologize in advance for some of the code you'll
have to deal with.

1. In lat_op.c, "Operation definitions" section, add the following
prototypes (macros):

    decl_op_init(myop);
    decl_op(myop);

2. In lat_op.c, "Operation definitions" section, add the following to
the "ops" array:

    ops_op(myop),

3. In lat_op.c, "Operation routines" section, add the following code:

    decl_op_init(myop)
    {
       /* ... myop initialization code goes here ... */
    }

    void use_myop_asm(volatile uint64_t *sharedp, uint64_t *failp,
                      uint64_t *sucp,
                      uint64_t *work_ticks, volatile uint64_t *counterp);

    decl_op(myop)
    {
        use_myop_asm(sharedp, &this_thread->fops, &this_thread->nops,
               this_thread->work_ticks_jittered, counter);
    }

4. For the operation routine, you'll probably want to add some
assembly code to lat_op_asm.s.  Something like the following should
do:

    LEAF(use_myop_asm)
        .set noreorder

        TIMING_BEGIN

	/* ... myop assembly code here ... */

        TIMING_END

        .set reorder
        END(use_myop_asm)

(Note that copying and modifying use_fetch_add_asm is probably your
best bet for correctness.)

Recompiling at this point should provide you with an executable which
can handle "myop", e.g. "go -o myop ..." will run with your operation.

To perform a full run of the benchmark for some set of datapoints, edit
the config file (copy config.example if necessary) and add "myop" to
the "ops" variable, then run the benchmark as above (see ``RUNNING THE
BENCHMARK'').

gen_charts, etc. should work with your new operation's data.


ADDING A NEW VISUALIZATION METRIC TO THE BENCHMARK

All of the metrics for visualization are loosly defined in gen_charts.

Let's assume that you wanted to visualize the maximum number of failed
store conditionals normalized to the 1 processor case.  Adding the
following line to the ``METRICS FOR VISUALIZATION'' section of
gen_charts would work:

	$gen -t "$dirbase $test failure ratio" -d $dir/*/$test \
		-n fops max \
		> $charts/${dirbase}_${test}_failure_ratio.txt

After adding that line to gen_charts, the rest of the graph generation
code should oblige and generate 3D graphs for you.

Run "procs_vs_workload -h" for more information on the flags which can
be applied when generating charts.