Table 1. System configuration parameters for the heterogeneous CPU-GPU evaluation infrastructure

CPU                                                 GPU
Cores                        4/8/16/32              Warp Size                      32
Frequency                    2600MHz                Frequency                      1300MHz
Pipeline Width               4                      SIMD Pipeline Width            8
L1 Cache (Size/Assoc/Line)   32KB / 2 / 64B         L1 Cache (Size/Assoc/Line)     32KB / 2 / 64B
L2 Cache (Size/Assoc/Line)   2MB / 4 / 64B          L2 Cache (Size/Assoc/Line)     512KB / 4 / 64B
RoB / IW Size                64 / 32                Shared Memory per Core         16KB
MSHR / TLB Entries           256 / 64               Threads / Registers per Core   1024 / 16384
L2 / DRAM Access Latency     6 / 200 Cycles         Memory Channels                8
start of their respective parallel sections, also known as the region of interest. Because full-system simulation is extremely time-consuming, we simulate 1 billion instructions within the region of interest; we find that this is sufficient to comprehensively evaluate both GUARD's ability to detect data race conditions and its performance characteristics.
Cycle-accurate simulators are used to evaluate the performance impact of GUARD on the CPU application being monitored. GUARD GPU kernel invocations, data-transfer operations, and signature-comparison operations are all simulated in a cycle-accurate manner. We enable the L1 data cache in the GPU to improve the performance of the GUARD kernel. We could potentially also store the signature table in GPU shared memory; however, we rely on the L1 data caches instead, since their access times are similar. Shared memory in the GPU is explicitly managed by the programmer, so each time the CPU updates the signature table at its regular interval, the copy of the table in shared memory would also have to be updated manually, which would impose additional overhead.
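To make this trade-off concrete, the following is a minimal CUDA sketch, not GUARD's actual kernel, of a signature check that reads the table directly from global memory so that repeated accesses are served by the GPU L1 data cache. The SigEntry layout, the single-word Bloom signatures, and the names check_signatures and conflict_flags are illustrative assumptions; a shared-memory variant would additionally have to stage the table into __shared__ storage at every kernel launch and again after every CPU update.

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative per-thread signature entry; GUARD's real encoding may differ.
struct SigEntry {
    unsigned long long read_sig;   // Bloom-filter bits of addresses read
    unsigned long long write_sig;  // Bloom-filter bits of addresses written
};

// Each GPU thread probes one signature-table entry held in global memory.
// On the simulated GPU these loads are served by the L1 data cache, so no
// explicit staging into shared memory is needed.
__global__ void check_signatures(const SigEntry* sig_table, int n,
                                 unsigned long long probe_write_sig,
                                 int* conflict_flags)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    SigEntry e = sig_table[i];

    // A non-empty intersection with the probing write signature marks a
    // potential read-write or write-write conflict for a later precise check.
    unsigned long long rw = e.read_sig  & probe_write_sig;
    unsigned long long ww = e.write_sig & probe_write_sig;
    conflict_flags[i] = (rw | ww) ? 1 : 0;
}

int main() {
    const int n = 4;
    SigEntry h_tab[n] = {{0x1ull, 0x0ull},   // reads address hashing to bit 0
                         {0x0ull, 0x2ull},   // writes address hashing to bit 1
                         {0x4ull, 0x0ull},   // reads address hashing to bit 2
                         {0x0ull, 0x0ull}};  // no accesses recorded
    SigEntry* d_tab = nullptr;
    int* d_flags = nullptr;
    int h_flags[n];

    cudaMalloc(&d_tab, n * sizeof(SigEntry));
    cudaMalloc(&d_flags, n * sizeof(int));
    cudaMemcpy(d_tab, h_tab, n * sizeof(SigEntry), cudaMemcpyHostToDevice);

    // Probe with a write signature covering bits 1 and 2: entries 1 and 2 conflict.
    check_signatures<<<1, 32>>>(d_tab, n, 0x6ull, d_flags);

    cudaMemcpy(h_flags, d_flags, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("entry %d conflict: %d\n", i, h_flags[i]);

    cudaFree(d_tab);
    cudaFree(d_flags);
    return 0;
}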
5   Evaluation
This section presents a detailed evaluation of GUARD. We first examine the effectiveness of our scheme in detecting data races, then move on to the performance characteristics of GUARD, and finally discuss the performance-accuracy trade-off achievable given the limited on-chip resources available.
Table 2 shows the number of data races GUARD detects. GUARD is based on the happened-before principle used by prior work such as SigRace [5], and is therefore expected to capture the same set of data races. It is worth pointing out that, like SigRace, GUARD does not capture all potential data races: it captures only those races that violate the happened-before relation at runtime. GUARD works at address-level granularity, so each reported data race corresponds to a unique address. The ability to detect actual data races demonstrates the effectiveness of GUARD. Some of the race conditions reported here are benign or intentional. Nevertheless, it is essential for a concurrency-bug detection tool to report all potential bugs and let the programmer decide on their severity.
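As a point of reference, the sketch below illustrates the happened-before filter in a hedged, simplified form: two accesses race only when they touch the same address, at least one is a write, and neither epoch is ordered before the other under a vector-clock comparison. The Access record, the vc_leq helper, and the fixed thread count are illustrative assumptions and do not reflect GUARD's actual metadata layout.

#include <cstdio>

constexpr int kThreads = 4;   // number of monitored CPU threads (assumed)

// Illustrative access record at address-level granularity.
struct Access {
    unsigned long long addr;
    bool is_write;
    unsigned vc[kThreads];    // vector-clock snapshot taken at the access
};

// vc_leq(a, b) is true when a happened before (or equals) b,
// i.e. a <= b componentwise.
__host__ __device__ bool vc_leq(const unsigned* a, const unsigned* b) {
    for (int i = 0; i < kThreads; ++i)
        if (a[i] > b[i]) return false;
    return true;
}

// Two accesses race only if they touch the same address, at least one is a
// write, and neither epoch is ordered before the other -- that is, the
// happened-before relation is violated at runtime.
__host__ __device__ bool races(const Access& x, const Access& y) {
    if (x.addr != y.addr) return false;            // one report per unique address
    if (!x.is_write && !y.is_write) return false;  // read-read never races
    return !vc_leq(x.vc, y.vc) && !vc_leq(y.vc, x.vc);
}

int main() {
    Access w = {0x1000, true,  {2, 0, 0, 0}};      // write by thread 0
    Access r = {0x1000, false, {0, 3, 0, 0}};      // read by thread 1, unordered
    printf("race reported: %d\n", races(w, r));    // prints 1
    return 0;
}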
 