Digital Signal Processing Reference
scheduling techniques based on these tree regions. The simulation of synchronous
exceptions, i.e., irregular control flow due to page faults, division by zero, etc., is
processor is modeled. The simulation includes neither a timing model nor a model
of the target processor's memory organization.
the EmCore processor, a processor which is compatible with the ARC architecture.
Several basic blocks are translated at once within an LTU, leading to reduced
compilation overhead and improved simulation speed. The simulator supports two
operation modes: (1) a fast cycle-approximate functional simulation, and (2) a
cycle-accurate timing model. Similar to typical static compiled simulation, code
generation is performed using the C programming language in combination with
a standard C compiler. Shared libraries are generated from the C source code and
dynamically loaded at runtime. The generated code is stored permanently and can
thus be reused across several simulation runs to improve the simulation speed
further.
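
The reuse of stored translations across simulation runs can be illustrated with a small sketch. It is not the simulator's actual implementation: the names `emit_c_source`, `TranslationCache`, and `demo` are hypothetical, the "translator" is a stub that emits placeholder C text, and the compile and dynamic-loading steps are omitted so that the example runs without a C compiler. It only shows the idea of keying generated code by a hash of the guest code and storing it on disk.

```python
import hashlib
import tempfile
from pathlib import Path


def emit_c_source(guest_code: bytes) -> str:
    """Hypothetical translator stub: emits one C comment line per guest byte."""
    body = "\n".join(f"    /* guest byte 0x{b:02x} */" for b in guest_code)
    return f"void translated_unit(void)\n{{\n{body}\n}}\n"


class TranslationCache:
    """Stores generated code on disk, keyed by a hash of the guest code."""

    def __init__(self, cache_dir: Path) -> None:
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.translations = 0  # how often the translator actually ran

    def lookup(self, guest_code: bytes) -> Path:
        key = hashlib.sha256(guest_code).hexdigest()
        path = self.cache_dir / f"{key}.c"
        if not path.exists():
            # Cache miss: translate and persist the result. A real simulator
            # would also compile this file into a shared library and load it.
            path.write_text(emit_c_source(guest_code))
            self.translations += 1
        return path


def demo() -> tuple:
    """Performs two lookups of the same guest code through separate caches."""
    with tempfile.TemporaryDirectory() as tmp:
        unit = bytes([0x20, 0x4A, 0x00, 0x00])  # arbitrary example bytes
        first_run = TranslationCache(Path(tmp))
        first_run.lookup(unit)
        # A second cache object on the same directory stands in for a later run.
        second_run = TranslationCache(Path(tmp))
        second_run.lookup(unit)
        return first_run.translations, second_run.translations
```

In this sketch the first lookup triggers a translation and the second finds the stored file, so only the first run pays the translation cost.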
Parallel Simulation
relies on the underlying shared memory system for event ordering and synchronization.
uses a thin library layer to handle the communication between multiple instances
of sequential QEMU-based emulators with negligible uniprocessor emulation
overhead.
binary translation. They simulated up to 2,048 cores of the ARCompact instruction
set architecture on a 32-core x86 host machine reaching simulation speeds of up to
core architectures running on a general-purpose graphics processing unit (GPGPU).
For instruction decoding, they used look-up tables which were mapped to the texture
memory of the GPGPU. The experimental evaluation was done for ARM and x86
multicores with 32 to 1,024 cores using an Nvidia GeForce GTX 295 graphics
card. When simulating 1,024 cores executing the same application program, up to
1,000 MIPS could be emulated; when the application programs were different, the
simulation speed dropped to 5 MIPS.
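
Table-driven instruction decoding, as mentioned above, can be sketched for a toy instruction set. The 16-bit encoding, the three operations, and the names `DECODE_TABLE`, `decode`, and `step` are assumptions made for illustration only; they do not correspond to ARM, x86, or the simulator described in the text, and the table here lives in ordinary host memory rather than GPU texture memory.

```python
from typing import Callable, List, Tuple

# Hypothetical toy encoding: bits 15..12 opcode, 11..8 rd, 7..4 rs, 3..0 imm4.
Handler = Callable[[List[int], int, int, int], None]


def op_add(regs: List[int], rd: int, rs: int, imm: int) -> None:
    regs[rd] = (regs[rd] + regs[rs]) & 0xFFFF


def op_addi(regs: List[int], rd: int, rs: int, imm: int) -> None:
    regs[rd] = (regs[rs] + imm) & 0xFFFF


def op_mov(regs: List[int], rd: int, rs: int, imm: int) -> None:
    regs[rd] = regs[rs]


def op_illegal(regs: List[int], rd: int, rs: int, imm: int) -> None:
    raise ValueError("illegal instruction")


# One entry per opcode value; decoding is a single indexed load, not a chain
# of comparisons. Unassigned opcodes map to the illegal-instruction handler.
DECODE_TABLE: List[Tuple[str, Handler]] = [("illegal", op_illegal)] * 16
DECODE_TABLE[0x1] = ("add", op_add)
DECODE_TABLE[0x2] = ("addi", op_addi)
DECODE_TABLE[0x3] = ("mov", op_mov)


def decode(word: int) -> Tuple[str, Handler, int, int, int]:
    """Splits an instruction word into fields and looks up its handler."""
    name, handler = DECODE_TABLE[(word >> 12) & 0xF]
    return name, handler, (word >> 8) & 0xF, (word >> 4) & 0xF, word & 0xF


def step(regs: List[int], word: int) -> str:
    """Decodes and executes one instruction; returns its mnemonic."""
    name, handler, rd, rs, imm = decode(word)
    handler(regs, rd, rs, imm)
    return name
```

Because every instruction takes the same path through `decode`, the lookup itself does not branch on the opcode, which is one reason such tables suit data-parallel hardware.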
translation to multiple cores of the host machine. Over a large set of benchmarks
they achieve an average speedup of 11.5% on a quad-core Intel Xeon processor. Qin et
the host machine to generate a dynamically linked library. The translation process
is distributed to multiple cores to speed up the simulation.
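
Distributing the translation work over several host cores can be sketched with a worker pool. This is an illustration under stated assumptions, not the approach of the cited work: `translate_unit` is a hypothetical stand-in for the expensive translation step, and `translate_all` and its `workers` parameter are names introduced here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict


def translate_unit(guest_code: bytes) -> str:
    """Hypothetical stand-in for an expensive translation step."""
    return " ".join(f"{b:02x}" for b in guest_code)


def translate_all(units: Dict[int, bytes], workers: int = 4) -> Dict[int, str]:
    """Translates every unit on a worker pool; keys are guest start addresses."""
    addresses = sorted(units)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with addresses.
        translated = list(pool.map(translate_unit, (units[a] for a in addresses)))
    return dict(zip(addresses, translated))
```

In CPython, threads run in parallel only while the work releases the global interpreter lock, for example while waiting on an external compiler process; for pure-Python translation work a process pool would be the closer analogue.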