Digital Signal Processing Reference
scheduling techniques based on these tree regions. The simulation of synchronous
exceptions, i.e., irregular control flow due to page faults, division by zero, etc., is
optimized by code motion techniques [31]. However, only the behavior of the target
processor is modeled: the simulation includes neither a timing model nor a model
of the target processor's memory organization.
Jones et al. use large translation units (LTUs) [36] to speed up simulation of
the EnCore processor, which is compatible with the ARC architecture. Several
basic blocks are translated at once within an LTU, reducing compilation overhead
and improving simulation speed. The simulator supports two operation modes:
(1) a fast, cycle-approximate functional simulation, and (2) a cycle-accurate
timing model. As in typical static compiled simulation, code is generated in
the C programming language and built with a standard C compiler. Shared
libraries are generated from the C source code and dynamically loaded at runtime.
The generated code is stored persistently and can thus be reused across several
simulation runs to improve the simulation speed further.
Parallel Simulation
Parallel Embra [40] combines binary translation with loose timing constraints. It
relies on the underlying shared-memory system for event ordering and synchroniza-
tion when distributing the simulation across up to 64 cores. Similarly, Parallel
Mambo [71] distributes functional simulation across multiple host cores. COREMU
[72] uses a thin library layer to handle communication between multiple instances
of sequential QEMU-based emulators, with negligible uniprocessor emulation
overhead.
Almer et al. [1] demonstrated scalable multi-core simulation using parallel dynamic
binary translation. They simulated up to 2,048 cores of the ARCompact instruction
set architecture on a 32-core x86 host machine, reaching simulation speeds of up to
25,307 MIPS. Raghav et al. [55] developed an interpreting simulator for thousand-
core architectures running on a general-purpose graphics processing unit (GPGPU).
For instruction decoding, they used look-up tables mapped to the texture memory
of the GPGPU. The experimental evaluation covered ARM and x86 multicores
with 32 to 1,024 cores using an Nvidia GeForce GTX 295 graphics card. When
simulating 1,024 cores executing the same application program, up to 1,000 MIPS
could be emulated; when the application programs differed, the simulation speed
dropped to 5 MIPS.
Böhm et al. [8] use decoupled dynamic binary translation to distribute the
translation work to multiple cores of the host machine. Over a large set of
benchmarks they achieve an 11.5% speedup on average on a quad-core Intel Xeon
processor. Qin et al. [54] translate frequently interpreted code pages to C++ and use
the host machine's compiler to generate a dynamically linked library. The translation
process is distributed to multiple cores to speed up the simulation.