'\"macro stdmacro
.TH TLBSTATS 1
.de sh
.br
.ne 5
.PP
\f3\\$1\f1
.PP
..
.UC 4
.SH NAME
tlbstats \- analyze program TLB usage
.SH SYNOPSIS
.B tlbstats
[
.BI \-algorithm " alg_name"
] [ options ] < trace_file
.br
.SH DESCRIPTION
.I tlbstats
analyzes a program's TLB (translation lookaside buffer) usage.
A program's execution time can be split into three categories:
instruction execution time (which
.I "prof -pixie"
or
.I pixstats
reports), TLB fill time (which
.I tlbstats
reports) and memory subsystem time (cache misses, write back stalls, etc.).
By timing the overall execution, running
.I prof
and
.IR tlbstats ,
and subtracting the two reported times from the total,
one can infer how much time is spent in the memory subsystem.
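.PP
As a minimal sketch of that subtraction, the following shell fragment
assumes the three figures have already been converted to the same unit
(seconds here).
All of the numbers and variable names are hypothetical placeholders,
not output from any real run.
.in +5
.nf
```shell
# Hypothetical measurements, all in seconds; substitute your own.
total=120    # overall execution time of the program
instr=85     # instruction execution time (from prof -pixie or pixstats)
tlb=10       # TLB fill time (from tlbstats)

# Whatever is left over is attributed to the memory subsystem.
mem=$((total - instr - tlb))
echo "memory subsystem time: ${mem}s"
```
.fi
.in -5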
.PP
To use
.IR tlbstats ,
first use
.IR pixie (1)
to translate and instrument the executable object module for the
program:
.in +5
pixie -tlbtrace -idtrace_file 9 \f2prog_name\f1
.in -5
.sp
Next, execute the translation on an appropriate input.  This
produces a
.I .Counts
file and a trace file:
.in +5
\f2prog_name\f1 9>trace
.in -5
.sp
Note that only
.IR ksh (1)
can redirect an arbitrary file descriptor.
Finally, use
.I tlbstats
to generate a detailed report on the TLB usage:
.in +5
tlbstats <trace
.in -5
.br
.PP
There are three algorithms supported:
.BR utlbonly ,
.BR irix4.0 ,
and
.BR irix4.0r4k .
Each has a default set of options that corresponds to the
way the actual hardware works.
The options may be changed for experimentation, but
most of them (most importantly the number of entries)
are in reality fixed by the underlying hardware.
.PP
The
.B utlbonly
algorithm is a simple algorithm that treats every TLB miss as if it could
be handled by the very fast utlb miss handler alone.
.PP
The
.B irix4.0
algorithm simulates the IRIX 4.0 three-level TLB handler system running
on an R3000.
This is the default.
.PP
The
.B irix4.0r4k
algorithm simply changes the defaults to those of an R4000; the algorithm
itself is the same as the
.B irix4.0
algorithm.
.PP
The following options may be given:
.TP 14
.BI \-entries " n"
Assume (wish) that there are
.I n
entries in the TLB (excluding wired entries).
.TP
.BI \-wired " n"
Assume that there are
.I n
wired entries (entries that never hold page table entries, only
pointers to page tables).
The R3000 is hard-wired at 8, with 1 being used by IRIX.
.TP
.BI \-cachehit " %n"
The fast TLB handler uses a single load-word instruction to fetch the
page table entry.
This option sets the cache hit rate for that load.
.TP
.BI \-cachepenalty " cycles"
The number of
.I cycles
needed to fill the cache when a cache miss occurs in the TLB miss handler.
For an IP6 (PI) this is 5 cycles.
For an IP12 (4D30, 4D35, RPC) it is between 16 and 40 cycles,
depending on whether there is anything in the write buffer.
For the POWER Series machines the number depends on whether the word
is found in the second-level cache or in main memory, and on the speed
of the CPU.
For an IP7 (25 MHz), a second-level miss takes 24 CPU cycles (40 ns cycles),
and a main memory miss takes 39 to 57 CPU cycles (depending on whether
a writeback is required).
For an IP7 (33 MHz), a second-level miss takes 25 CPU cycles (30 ns cycles),
and a main memory miss takes 39 to 57 CPU cycles (depending on whether
a writeback is required).
For an IP7 (40 MHz), a second-level miss takes 28 CPU cycles (25 ns cycles),
and a main memory miss takes 39 to 57 CPU cycles (depending on whether
a writeback is required).
For the R4000 with a second-level cache, a cache access takes 5 to 6
external cycles (13.3 ns cycles with a 75 MHz R4000).
.TP
.BI \-pagesize " n"
Assume a page size of
.I n
bytes.
.TP
.B \-maps2
If true, then each TLB entry maps an even and odd virtual page (this
is what the R4000 does).
.TP
.B \-v
Turn on verbose mode.
This option may be given more than once.
.TP
.BI \-1stpenalty " n"
Assume the first-level (utlb) handler takes
.I n
cycles.
For the R3000 we assume 10 instructions at 1 cycle each.
For the R4000, simulation shows 24 internal cycles.
On the R4000, however, the i-cache is only 8K, so the probability
that the utlbmiss handler will not be in the cache is fairly high.
The handler occupies 2 cache lines and each second-level fetch takes
11 to 12 internal clocks, so the total handler
time is more like 24 + 2*11 = 46 internal clocks, or 23 external
clocks.
.TP
.BI \-1.5penalty " n"
Assume the intermediate segment handler takes
.I n
cycles.
The 1.5 level TLB handler kicks in when a program has more
disjoint segments than the number of wired entries.
A segment is 512 pages, so with a 4K page size it covers 2 MB.
.TP
.BI \-2ndpenalty " n"
Assume the second level TLB handler takes
.I n
cycles.
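.PP
The arithmetic behind several of the figures quoted in the option
descriptions can be checked with a short shell fragment.
This is only a sketch: the variable names are made up for
illustration, and the ratio of two internal clocks per external clock
is inferred from the 46 to 23 conversion given for the R4000.
.in +5
.nf
```shell
# R4000 first-level handler: 24 internal cycles plus two second-level
# fetches (one per cache line) at 11 internal clocks each.
internal=$((24 + 2 * 11))
# Assumption: two internal clocks per external clock.
external=$((internal / 2))

# Segment size: 512 pages of 4096 bytes each, expressed in megabytes.
segment_bytes=$((512 * 4096))
segment_mb=$((segment_bytes / 1024 / 1024))

# Cycle time in ns from the clock rate in MHz, for a 25 MHz IP7,
# and the duration of its 24-cycle second-level miss.
cycle_ns=$((1000 / 25))
miss_ns=$((24 * cycle_ns))

echo "$internal $external $segment_mb $cycle_ns $miss_ns"
```
.fi
.in -5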
.PP
In reality, the amount of trace data generated by any reasonable
program is vastly larger than one wishes to store on disk.
The following commands (using
.IR ksh (1))
may be used to pipe the trace data directly to
.IR tlbstats :
.in +5
mknod pipe p
.br
tlbstats <pipe&
.br
\f2prog options\f1 9>pipe
.in -5
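.PP
The same named-pipe pattern can be tried without
.I pixie
or
.I tlbstats
installed.
In the sketch below,
.I printf
stands in for the instrumented program writing to file descriptor 9 and
.I wc
stands in for
.IR tlbstats ;
both stand-ins, the temporary directory, and the use of
.I mkfifo
in place of
.I mknod
are assumptions made for illustration only.
.in +5
.nf
```shell
# Work in a scratch directory so the pipe does not clash with anything.
dir=$(mktemp -d)
fifo="$dir/pipe"
mkfifo "$fifo"                        # same effect as: mknod pipe p

# Reader (stand-in for tlbstats) starts first, in the background.
wc -c <"$fifo" >"$dir/count" &

# Writer (stand-in for the instrumented program): descriptor 9 is
# redirected to the pipe and the data is written to that descriptor.
( printf 'trace-data' >&9 ) 9>"$fifo"

wait                                  # let the reader drain the pipe
count=$(tr -d ' ' <"$dir/count")
rm -r "$dir"
echo "bytes through pipe: $count"
```
.fi
.in -5
Starting the reader before the writer mirrors the order of the
commands above; the writer blocks until the reader has opened the pipe.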
.SH "SEE ALSO"
pixie(1), prof(1)
.SH BUGS
The penalty for a second level miss (1000 cycles) is a guess.
.PP
The random drop-in algorithm differs from the actual hardware algorithm.
.PP
Any effects on the TLB from the operating system or other processes are
not handled.