The performance counters and timestamp
MSRs are accessed through specialized machine
instructions (i.e., RDMSR, WRMSR, and RDTSC)
or through higher-level APIs such as the Performance
Application Programming Interface (PAPI)
(London, Moore, Mucci, Seymour, & Luczak,
2001). A set of control registers is also provided
to select which of the available performance
monitoring events should be counted in the
available counter set. The advantages of using
on-chip performance counters are: (1) they do
not cost anything in addition to the off-the-shelf
processor and (2) they can be used with a very
low overhead. For instance, copying the current
64-bit timestamp counter into memory (user or
kernel) through the Intel RDTSC instruction costs
less than 100 cycles.
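
The cost of a timestamp read can be estimated by issuing two reads back to back and keeping the smallest difference seen. The following is a minimal sketch, not a calibrated benchmark. It assumes a GCC- or Clang-compatible compiler, whose x86intrin.h header declares the __rdtsc() intrinsic for the RDTSC instruction. The helper names read_ticks and min_read_cost are hypothetical, and on non-x86 targets the sketch falls back to clock_gettime, so the result is then in nanoseconds rather than cycles.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   /* GCC/Clang header declaring __rdtsc() */
#endif

/* Return the current tick count: the raw 64-bit TSC on x86,
 * otherwise nanoseconds from the monotonic clock (fallback assumption). */
static uint64_t read_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* Estimate the cost of one timestamp read as the minimum difference
 * between two back-to-back reads over `samples` trials (samples > 0).
 * Taking the minimum discards trials disturbed by interrupts. */
static uint64_t min_read_cost(int samples)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < samples; i++) {
        uint64_t t0 = read_ticks();
        uint64_t t1 = read_ticks();
        uint64_t delta = t1 - t0;
        if (delta < best)
            best = delta;
    }
    return best;
}
```

Calling min_read_cost(1000) from a test program gives a rough upper bound on the per-read overhead. The value depends on the processor, and virtualized environments may report considerably more than bare hardware.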
Countable events on the Intel Xeon processor
include branch predictions, prediction misses,
misaligned memory references, cache misses
and transfers, I/O bus transactions, memory bus
transactions, instruction decoding, micro-op
execution, and floating-point assistance. These
events are counted on a per-logical-core basis; that
is, the Intel performance counter features do not
provide any means of differentiating event counts
across different threads or processes. Certain
architectures, however, such as the IBM PowerPC
604e (IBM Corporation, 1998), do provide the
ability to trigger an interrupt when performance
counters become negative or wrap around. This interrupt
can be filtered on a per-processor basis and used
to support a crude means of thread association
for infrequent events.
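
One common way to exploit an interrupt-on-negative counter is to preload it so that it turns negative after a chosen number of events; the handler can then note which thread was running. The following is a software model only, assuming a 32-bit counter whose most significant bit marks "negative". It touches no real registers, and the names counter_is_negative, preload_for_period, and simulate_events are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* A 32-bit counter is treated as "negative" when its top bit is set. */
static int counter_is_negative(uint32_t counter)
{
    return (counter & 0x80000000u) != 0;
}

/* Preload value that makes the counter turn negative after exactly
 * `period` further events (1 <= period <= 0x80000000). */
static uint32_t preload_for_period(uint32_t period)
{
    return 0x80000000u - period;
}

/* Model the hardware: count `events` increments and return how many
 * times the "counter negative" interrupt would have fired. */
static unsigned simulate_events(uint32_t period, unsigned events)
{
    uint32_t counter = preload_for_period(period);
    unsigned interrupts = 0;

    for (unsigned i = 0; i < events; i++) {
        counter++;
        if (counter_is_negative(counter)) {
            /* A real handler would record the current thread here. */
            interrupts++;
            counter = preload_for_period(period);   /* re-arm */
        }
    }
    return interrupts;
}
```

With a period of 1000, a run of 3500 events fires the modeled interrupt three times. The attribution remains crude because only the thread running at the moment of the interrupt is recorded, which is why the approach suits infrequent events.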
On-chip performance counters have limited
use in profiling characteristics specific to multi-
threaded programming. Nevertheless, on-chip
timestamp collection can be useful for measur-
ing execution time intervals (Wolf, 2003). For
example, the context-switch time of an
operating system can be measured easily by
inserting RDTSC into the kernel's
context-switching code. Coupling timestamp
features with compiler-based instrumentation
can be an effective way to measure lock wait
and hold times.
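
Lock wait and hold times can be gathered by timestamping the entry to an acquire call, the moment the lock is obtained, and the release. The following is a hand-written sketch of the wrappers that such instrumentation might insert, not the output of any particular compiler. It assumes a C11 compiler and uses a simple atomic_flag spinlock; the timed_lock type and its functions are hypothetical names. Timestamps taken on different cores are assumed to be comparable, which holds only where the TSCs are synchronized.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   /* GCC/Clang header declaring __rdtsc() */
#endif

/* Raw TSC on x86, monotonic-clock nanoseconds elsewhere (fallback). */
static uint64_t read_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

typedef struct {
    atomic_flag flag;        /* the lock itself */
    uint64_t acquired_at;    /* timestamp of the latest acquisition */
    uint64_t total_wait;     /* ticks spent waiting for the lock */
    uint64_t total_hold;     /* ticks spent holding the lock */
    uint64_t acquisitions;   /* number of successful acquisitions */
} timed_lock;

static void timed_lock_init(timed_lock *l)
{
    atomic_flag_clear(&l->flag);
    l->acquired_at = l->total_wait = l->total_hold = l->acquisitions = 0;
}

static void timed_lock_acquire(timed_lock *l)
{
    uint64_t start = read_ticks();
    while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
        ;   /* spin until the lock is free */
    /* Statistics are updated only while the lock is held. */
    l->acquired_at = read_ticks();
    l->total_wait += l->acquired_at - start;
    l->acquisitions++;
}

static void timed_lock_release(timed_lock *l)
{
    l->total_hold += read_ticks() - l->acquired_at;
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}
```

Dividing total_wait and total_hold by acquisitions gives the mean wait and hold time per acquisition. Because the statistics are written only while the lock is held, they need no further synchronization.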
On-Chip Debugging Interfaces and In-Circuit Emulators (ICE)
Performance counters are only useful for counting
global events in the system. Additional functionality
is therefore needed to perform more powerful
inspection of execution and register/memory
state. One way to provide this functionality is to
augment the “normal” target processor with extra
debugging capabilities. The term in-circuit emulator
(ICE) refers to the use of a substitute processor
module that “emulates” the target microprocessor
and provides additional debugging functionality
(Collins, 1997).
ICE modules are usually plugged directly into
the microprocessor socket using a specialized
adapter, as shown in Figure 12. Many modern
microprocessors, however, provide explicit sup-
port for ICE, including most x86 and PowerPC-
based CPUs. A special debug connector on the
motherboard normally provides access to the
on-chip ICE features.
Two key standards define debugging function-
ality adopted by most ICE solutions: JTAG (IEEE,
2001) and the more recent Nexus (IEEE-ISTO,
2003). The Nexus debugging interface is a super-
set of JTAG and consists of between 25 and 100
auxiliary message-based channels that connect
directly to the target processor. The Nexus specification
defines a number of different “classes”
of support, each representing a capability set
composed from the following features:
Ownership trace messaging (OTM), which
facilitates ownership tracing by providing
visibility of which process identity (ID)
or operating system task is activated. An
OTM is transmitted to indicate when a new
process/task is activated, thereby allowing
development tools to trace ownership flow.
For embedded processors that implement