Digital Signal Processing Reference
scheduling techniques based on these tree regions. The simulation of synchronous
exceptions, i.e., irregular control flow due to page faults, division by zero, etc., is
optimized by code motion techniques [31]. However, only the behavior of the target
processor is modeled: the simulation includes neither a timing model nor a model
of the target processor's memory organization.
Jones et al. use large translation units (LTUs) [36] to speed up simulation of
the EnCore processor, which is compatible with the ARC architecture. Several
basic blocks are translated at once within an LTU, reducing compilation overhead
and improving simulation speed. The simulator supports two operation modes:
(1) a fast, cycle-approximate functional simulation, and (2) a cycle-accurate
timing model. As in typical static compiled simulation, code is generated in
the C programming language and built with a standard C compiler. Shared
libraries are generated from the C source code and dynamically loaded at runtime.
The generated code is stored persistently and can thus be reused across several
simulation runs to improve the simulation speed further.
Parallel Simulation
Parallel Embra [40] combines binary translation with loose timing constraints. It
relies on the underlying shared-memory system for event ordering and synchroniza-
tion when distributing the simulation across up to 64 cores. Similarly, Parallel
Mambo [71] distributes functional simulation across multiple host cores. COREMU
[72] uses a thin library layer to handle communication between multiple instances
of sequential QEMU-based emulators, with negligible uniprocessor emulation
overhead.
Almer et al. [1] demonstrated scalable multi-core simulation using parallel dynamic
binary translation. They simulated up to 2,048 cores of the ARCompact instruction
set architecture on a 32-core x86 host machine, reaching simulation speeds of up to
25,307 MIPS. Raghav et al. [55] developed an interpreting simulator for thousand-
core architectures running on a general-purpose graphics processing unit (GPGPU).
For instruction decoding, they used look-up tables mapped to the texture memory
of the GPGPU. The experimental evaluation covered ARM and x86 multicores
with 32 to 1,024 cores using an Nvidia GeForce GTX 295 graphics card. When
simulating 1,024 cores executing the same application program, up to 1,000 MIPS
could be emulated; when the application programs differed, the simulation speed
dropped to 5 MIPS.
Böhm et al. [8] use decoupled dynamic binary translation to distribute the
translation work to multiple cores of the host machine. Over a large set of
benchmarks they achieve an 11.5% speedup on average on a quad-core Intel Xeon
processor. Qin et al. [54] translate frequently interpreted code pages to C++ and use
the host machine's compiler to generate a dynamically linked library. The translation
process is distributed to multiple cores to speed up the simulation.