Accelerating Data Race Detection Utilizing On-Chip Data-Parallel Cores - Runtime Verification


The SG picks up the shared-modified state of the address from the cache response message from the LLC controller. Since the SG is private to a core, it need only monitor coherence messages destined to the local first-level cache. Only a single additional shared-modified bit is required to pass this information. The filtering mechanism does not alter the cache coherence scheme in any way, which is desirable because coherence protocols are highly optimized designs. The temporary signatures will be the size of an RD signature per core, which is 256 bytes for a 2048-bit signature. Prior work [26] has proposed an algorithm that uses coherence state information to detect data races. Also, software-based data race detection mechanisms [27] have employed techniques to filter stack and duplicate addresses to improve performance. However, to the best of our knowledge, this is the first work to utilize a coherence-based filtering technique to improve the accuracy of a data race detection tool that already works at near-hardware speed.
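The signature sizing and the role of the shared-modified bit can be illustrated with a minimal software sketch. It is not the hardware design described here: the hash function, the 64-byte cache line size, and the names `Signature` and `record_access` are assumptions introduced purely for illustration, and the sketch assumes that an address is recorded only when the cache response carries the shared-modified bit.

```python
SIGNATURE_BITS = 2048                  # signature width stated in the text
SIGNATURE_BYTES = SIGNATURE_BITS // 8  # 2048 bits / 8 = 256 bytes per core


class Signature:
    """Fixed-size bit-vector signature (hypothetical software model)."""

    def __init__(self, bits=SIGNATURE_BITS):
        self.bits = bits
        self.data = bytearray(bits // 8)

    def _index(self, address):
        # Assumption: 64-byte cache lines, so the low 6 bits are the offset.
        line = address >> 6
        # Assumption: a simple multiplicative hash; the real hash is not
        # specified in this section.
        return (line * 2654435761) % self.bits

    def insert(self, address):
        i = self._index(address)
        self.data[i // 8] |= 1 << (i % 8)

    def may_contain(self, address):
        # Like any hashed signature, this can report false positives
        # but never false negatives.
        i = self._index(address)
        return bool(self.data[i // 8] & (1 << (i % 8)))


def record_access(signature, address, shared_modified):
    """Record an address only if the coherence response marked it
    shared-modified (assumed filtering rule); return whether it was kept."""
    if shared_modified:
        signature.insert(address)
        return True
    return False
```

Under these assumptions, an access whose response lacks the shared-modified bit never reaches the signature, so the filter consumes one extra bit per response message and leaves the coherence protocol itself untouched.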

4 Evaluation Infrastructure

Despite recent heterogeneous designs [7-9], some of which are already on the market, the optimal design of a multicore CPU with on-chip data-parallel cores is still unclear. The memory hierarchy design and shared-memory consistency models are ambiguous, and the programming model is still in a nascent state. Nevertheless, such designs provide a suitable infrastructure to off-load the task of CPU data race detection to on-chip accelerator cores. In this work, we describe a generic execution model and propose a data race detector inspired by these designs.

4.1 Heterogeneous Execution Environment

We utilize a heterogeneous multicore processor, consisting of CPU and GPU cores on the same die, as shown in Figure 2. The cores and their respective LLCs are connected through a common on-chip interconnection network. Communicating through the shared on-chip interconnection network improves the efficiency of GUARD. The CPU and GPU cores work on different address spaces, and hence we do not consider the complexities of coherence between CPU and GPU cores in our design. We base our evaluation on a GPU SM, with 8 SPs, that can each support up to 1024 threads. This configuration is modeled on the NVIDIA GeForce® 8600 GTS. Various parameters of the simulated CPU and GPU cores are given in Table 1.

To simulate the multicore CPU in detail, we use Simics [28] combined with GEMS [29]. The GPU cores are simulated using GPGPU-sim [30]. The on-chip interconnection network is simulated using Garnet [31]. The GUARD GPU kernel is compiled using CUDA 2.3 [24]. We evaluate GUARD with applications from two widely used benchmark suites: PARSEC [32] and SPLASH-2 [33]. Our evaluation reports data from 15 programs in total: seven PARSEC and eight SPLASH-2 programs, as indicated in Table 2. Using Simics and GEMS, we simulate a many-core system based on Sun Microsystems' UltraSPARC® III processor running the Solaris® 8 operating system. All the benchmark programs are written in C/C++ and parallelized using either OpenMP or Pthreads. They are compiled using GCC 4.5.2 at the -O3 optimization level. The reported results are based on running the selected benchmarks for 1 billion instructions in total from the
