'\"macro stdmacro
.TH TLBSTATS 1
.de sh
.br
.ne 5
.PP
\f3\\$1\f1
.PP
..
.UC 4
.SH NAME
tlbstats \- analyze program TLB usage
.SH SYNOPSIS
.B tlbstats
[ \-algorithm alg_name ] [ options ] < trace_file
.br
.SH DESCRIPTION
.I tlbstats
analyzes a program's TLB (translation lookaside buffer) usage.
A program's performance can be split into three categories:
instruction execution time (which
.I "prof \-pixie"
or
.I pixstats
reports), TLB fill time (which
.I tlbstats
reports), and memory subsystem time (cache misses, write-back stalls, etc.).
By timing the overall execution and running
.I prof
and
.IR tlbstats ,
one can infer how much time is spent in the memory subsystem.
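.PP
Stated as a formula, and assuming that the three categories do not
overlap and that all quantities are expressed in the same unit:
.in +5
memory subsystem time = total time \- instruction time \- TLB fill time
.in -5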
.PP
To use
.IR tlbstats ,
first use
.IR pixie (1)
to translate and instrument the executable object module for the
program:
.in +5
pixie \-tlbtrace \-idtrace_file 9 \f2prog_name\f1
.in -5
.sp
Next, run the instrumented program on an appropriate input.
This produces a
.I .Counts
file and a trace file:
.in +5
\f2prog_name\f1 9>trace
.in -5
.sp
Note that this requires a shell that can redirect an arbitrary
file descriptor, such as
.IR ksh (1).
Finally, use
.I tlbstats
to generate a detailed report on the TLB usage:
.in +5
tlbstats <trace
.in -5
.br
.PP
Three algorithms are supported:
.BR utlbonly ,
.BR irix4.0 ,
and
.BR irix4.0r4k .
Each has a default set of options that corresponds to the
way the actual hardware works.
The options may be changed to experiment with various ideas, but
most of them are in reality fixed by the underlying hardware
(most importantly the number of entries).
.PP
The
.B utlbonly
algorithm is a simple algorithm that treats all TLB misses as if they could
be handled by the very fast 'utlb' miss handler alone.
.PP
The
.B irix4.0
algorithm simulates the IRIX 4.0 three-level TLB handler system running
on an R3000.
This is the default.
.PP
The
.B irix4.0r4k
algorithm simply changes the defaults to those of an R4000; the algorithm
itself is the same as the
.B irix4.0
algorithm.
.PP
The following options may be given:
.TP 14
.BI \-entries " n"
Assume that there are
.I n
entries in the TLB (excluding wired entries).
.TP
.BI \-wired " n"
Assume that there are
.I n
wired entries (entries never used for page table entries, only
for pointers to page tables).
The R3000 is hard-wired at 8 entries, 1 of which is used by IRIX.
.TP
.BI \-cachehit " %n"
The fast TLB handler uses a single load word to get the page table
entry.
This option sets the cache hit rate for this load word.
.TP
.BI \-cachepenalty " cycles"
The number of
.I cycles
needed to fill the cache when a cache miss occurs in the TLB miss handler.
For an IP6 (PI) this is 5 cycles.
For an IP12 (4D30, 4D35, RPC) it is between 16 and 40 cycles,
depending on whether there is anything in the write buffer.
For the POWER Series machines the number depends on whether the word
is found in the second-level cache or in main memory, and on the speed
of the CPU.
For an IP7 (25 MHz), a second-level miss takes 24 CPU cycles (40 ns cycles),
and a main memory miss takes 39 to 57 CPU cycles (depending on whether
a writeback is required).
For an IP7 (33 MHz), a second-level miss takes 25 CPU cycles (30 ns cycles),
and a main memory miss takes 39 to 57 CPU cycles (depending on whether
a writeback is required).
For an IP7 (40 MHz), a second-level miss takes 28 CPU cycles (25 ns cycles),
and a main memory miss takes 39 to 57 CPU cycles (depending on whether
a writeback is required).
For the R4000 with a second-level cache, a cache access takes 5 to 6
external cycles (13.3 ns cycles with a 75 MHz R4000).
.TP
.BI \-pagesize " n"
Assume a page size of
.I n
bytes.
.TP
.B \-maps2
If true, each TLB entry maps an even and an odd virtual page (this
is what the R4000 does).
.TP
.B \-v
Turn on verbose mode.
This option may be given more than once.
.TP
.BI \-1stpenalty " n"
Assume the first-level (utlb) handler takes
.I n
cycles.
For the R3000 we assume 10 instructions at 1 cycle each.
For the R4000, simulation shows 24 internal cycles.
On the R4000, however, the i-cache is only 8K, so the probability
that the utlbmiss handler is not in the cache is fairly high.
Each second-level fetch takes 11 to 12 internal clocks, so the total handler
time is more like 24 + 2*11 = 46 internal clocks, or 23 external
clocks (the handler requires 2 cache lines).
.TP
.BI \-1.5penalty " n"
Assume the intermediate segment handler takes
.I n
cycles.
The 1.5-level TLB handler is invoked when a program has more
disjoint segments than the number of wired entries.
A segment is 512 pages, so with a 4K page size a segment covers 2 MB.
.TP
.BI \-2ndpenalty " n"
Assume the second-level TLB handler takes
.I n
cycles.
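.PP
As an illustration of how these options combine with
.BR \-algorithm ,
the following sketch starts from the R4000 defaults and overrides the
number of TLB entries and the first-level handler cost.
The values 96 and 46 are arbitrary examples chosen for this
illustration, not recommendations:
.in +5
tlbstats \-algorithm irix4.0r4k \-entries 96 \-1stpenalty 46 <trace
.in -5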
.PP
In practice, the amount of trace data generated by any reasonable
program is far larger than one would wish to store on disk.
The following commands (using
.IR ksh (1))
may be used to pipe the trace output directly to
.IR tlbstats :
.in +5
mknod pipe p
.br
tlbstats <pipe &
.br
\f2prog options\f1 9>pipe
.in -5
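.PP
The whole procedure, from instrumenting the program to cleaning up,
might look like the following
.IR ksh (1)
sketch.
The
.B wait
and
.B rm
steps are additions for illustration; they assume the FIFO is named
.I pipe
and is no longer needed once
.I tlbstats
has finished:
.in +5
pixie \-tlbtrace \-idtrace_file 9 \f2prog_name\f1
.br
mknod pipe p
.br
tlbstats <pipe &
.br
\f2prog options\f1 9>pipe
.br
wait
.br
rm pipe
.in -5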
.SH "SEE ALSO"
.IR pixie (1),
.IR prof (1)
.SH BUGS
The penalty for a second-level miss (1000 cycles) is a guess.
.PP
The random drop-in algorithm differs from the actual hardware algorithm.
.PP
Any effects on the TLB from the operating system or other processes are
not modeled.