GPU Kernel Parallelization. We map the H-B algorithm to the GPU in the following way: each GPU thread is assigned to perform signature comparison between two CPU threads X and Y. Each thread is also assigned a particular signature combination among RD_X-WR_Y, WR_X-RD_Y, or WR_X-WR_Y. To speed up the H-B data race detection algorithm, we parallelize the GPU kernel at different levels:
- Between two CPU threads X and Y, three different GPU threads are used to compare the three signature combinations (RD_X-WR_Y, WR_X-RD_Y, and WR_X-WR_Y) in parallel.
- The current signature of CPU X can be compared with all 16 signatures of CPU Y in parallel. We evaluate three levels of parallelization (throttling) for this: full, half, and quart. In full throttle, 16 different GPU threads are used to H-B compare the current signature of CPU X with the 16 signatures of CPU Y in parallel. Half and quart throttle, on the other hand, use 8 and 4 GPU threads, respectively.
- We read the 2048-bit signatures in chunks of the 64-bit unsigned integer data type for the bitwise AND calculations of the intersection operation of the H-B algorithm. We further parallelize GUARD's GPU kernel by utilizing different threads to perform the bitwise AND calculations on different chunks of the same signature; a sketch of this chunked comparison appears after this list.
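To make the two parallelization levels concrete, the following is a minimal CUDA sketch of the chunk-parallel H-B comparison. The kernel name hb_compare, the array layout, and the race_flags output are our illustrative assumptions, not GUARD's actual interface; the sketch assumes full throttle, with one thread block per SIG_Y entry and one thread per 64-bit chunk.

```cuda
#include <cstdint>

// 2048-bit signature = 32 chunks of 64 bits (layout assumed for this sketch).
constexpr int CHUNKS = 2048 / 64;

// Full-throttle H-B comparison of one signature of CPU thread X against
// the 16 queued signatures of CPU thread Y. Launch as, e.g.,
//   hb_compare<<<16, CHUNKS>>>(d_sig_x, d_sigs_y, 16, d_flags);
// blockIdx.x selects the SIG_Y entry; threadIdx.x selects the 64-bit chunk.
__global__ void hb_compare(const uint64_t *sig_x, const uint64_t *sigs_y,
                           int num_sigs_y, int *race_flags)
{
    int sig_idx   = blockIdx.x;     // which SIG_Y entry
    int chunk_idx = threadIdx.x;    // which 64-bit chunk of the signature
    if (sig_idx >= num_sigs_y || chunk_idx >= CHUNKS) return;

    // The bitwise AND implements the signature intersection of the H-B check.
    uint64_t overlap = sig_x[chunk_idx] & sigs_y[sig_idx * CHUNKS + chunk_idx];
    if (overlap != 0)
        race_flags[sig_idx] = 1;    // benign race: all writers store the same value
}
```

Under half or quart throttle, 8 or 4 blocks would be launched instead, each looping over two or four SIG_Y entries.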
GPU Kernel Synchronization. The GPU kernel synchronizes all the threads after the comparison of the current SIG_X with all the present SIG_Y entries, using a custom synchronization function gpu_sync(). The current SIG_X is then graduated before each thread moves to a new SIG_X. This lock-step behavior ensures the correctness of signature data accessed by GPU threads by avoiding untimely overwriting of SIG_X by CPU X. Since GUARD's GPU kernel can utilize several thread blocks spread across multiple SMs, it is essential for gpu_sync() to be able to synchronize across SMs. While the CUDA library function __syncthreads() [24] can only synchronize threads within a block, gpu_sync() utilizes a global mutex variable and atomic operations to synchronize across multiple SMs. gpu_sync() is inspired by the GPU lock-based synchronization discussed by Xiao and Feng [25].
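A minimal sketch of such a lock-based inter-block barrier, in the spirit of Xiao and Feng [25], is shown below; the variable names are ours, and GUARD's actual implementation may differ.

```cuda
// Global mutex shared by all thread blocks; assumed to be reset to 0
// before the kernel launch (our simplification).
__device__ int g_mutex = 0;

// Inter-block barrier: goal_val must equal the number of blocks (gridDim.x).
__device__ void gpu_sync(int goal_val)
{
    // One representative thread per block checks in at the barrier.
    if (threadIdx.x == 0) {
        atomicAdd(&g_mutex, 1);
        // Spin until every block has incremented the mutex.
        while (atomicAdd(&g_mutex, 0) < goal_val) { /* busy wait */ }
    }
    // Hold the remaining threads of this block until thread 0 returns.
    __syncthreads();
}
```

Note that such a barrier is only safe when all thread blocks are resident on the SMs simultaneously; a spinning block would otherwise wait forever for one that was never scheduled. For repeated barriers within one kernel, goal_val would grow by gridDim.x per call (or the mutex would be reset), as in [25].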
3.3 Coherence-Based Filtering
Using signatures to compress the memory access trace can lead to incorrect data race detection (false positives), as discussed in Section 3.1. GUARD compresses load (LD) and store (ST) addresses into separate read (RD) and write (WR) signatures of the same size for comparison purposes. However, we observe that LD instructions generally outnumber ST instructions by ten to one, which makes LD instructions the major source of false positives in GUARD. The false positive rate can be reduced by increasing the signature size, but doing so increases the signature table size and the signature comparison effort, leading to a significant performance penalty.
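For illustration, the sketch below shows one plausible Bloom-filter-style signature encoding; the hash function and the helper sig_insert are our assumptions, as the paper does not specify the encoding here.

```cuda
#include <cstdint>

constexpr int SIG_BITS = 2048;   // signature width from Section 3.2

// Hypothetical signature insert: hash the address to one of SIG_BITS
// positions and set that bit. Two distinct addresses that hash to the
// same bit make otherwise disjoint RD and WR signatures intersect,
// which is exactly the false positive discussed in the text.
__host__ __device__ inline void sig_insert(uint64_t *sig, uintptr_t addr)
{
    unsigned bit = (unsigned)((addr >> 3) % SIG_BITS);  // drop byte offset
    sig[bit / 64] |= 1ULL << (bit % 64);
}
```

Because LD instructions dominate, the RD signature fills up, and thus aliases, fastest; this is why filtering loads is more attractive than enlarging the signatures.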
In this section, we discuss a novel coherence-based filtering mechanism that improves the accuracy of data race detection in GUARD. The filtering mechanism utilizes coherence state information to identify the LD instructions that access private and shared read-only addresses, and filters them out. This way, only LD instructions that access shared addresses modified by other threads are compressed into the RD signature.
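As a concrete, simplified view of the filter's decision, the sketch below assumes MESI-style coherence states and a per-line flag indicating whether another thread has written the line; both are our illustrative assumptions about how the state information might be exposed.

```cuda
// Hypothetical MESI-style cache line states.
enum CoherenceState { MODIFIED, EXCLUSIVE, SHARED, INVALID };

// Decide whether a load must be recorded in the RD signature.
// Loads to private lines (Exclusive/Modified in this core's cache) and
// to shared read-only lines cannot race and are filtered out; only
// loads to shared lines modified by another thread are recorded.
bool must_record_load(CoherenceState state, bool written_by_other)
{
    if (state == EXCLUSIVE || state == MODIFIED)
        return false;   // private to this thread
    if (state == SHARED && !written_by_other)
        return false;   // shared read-only
    return true;        // shared and modified elsewhere: record it
}
```

Only the loads for which must_record_load returns true would then be inserted into the RD signature, e.g., via a helper like the sig_insert sketch above.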