GPU Kernel Parallelization. We map the H-B algorithm to the GPU in the following way: each GPU thread is assigned to perform signature comparison between two CPU threads X and Y. Each thread is also assigned a particular signature combination among RD_X-WR_Y, WR_X-RD_Y, or WR_X-WR_Y. To speed up the H-B data race detection algorithm, we parallelize the GPU kernel at different levels:
- Between two CPU threads X and Y, three different GPU threads are used to compare the three signature combinations (RD_X-WR_Y, WR_X-RD_Y, and WR_X-WR_Y) in parallel.
- The current signature of CPU X can be compared with all 16 signatures of CPU Y in parallel. We evaluate three levels of parallelization (throttling) for this: full, half, and quart. In full throttle, 16 different GPU threads are used to H-B compare the current signature of CPU X with the 16 signatures of CPU Y in parallel. Half and quart throttle, on the other hand, use 8 and 4 GPU threads, respectively.
- We read the 2048-bit signatures in chunks of the 64-bit unsigned integer data type for the bitwise AND calculations of the intersection operation of the H-B algorithm. We further parallelize GUARD's GPU kernel by utilizing different threads to perform the bitwise AND calculations on different chunks of the same signature; a sketch of this chunked comparison appears after this list.
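To make the two parallelization levels concrete, the following is a minimal CUDA sketch of the chunk-parallel H-B comparison. The kernel name hb_compare, the array layout, and the race_flags output are our illustrative assumptions, not GUARD's actual interface; the sketch assumes full throttle, with one thread block per SIG_Y entry and one thread per 64-bit chunk.

```cuda
#include <cstdint>

// 2048-bit signature = 32 chunks of 64 bits (layout assumed for this sketch).
constexpr int CHUNKS = 2048 / 64;

// Full-throttle H-B comparison of one signature of CPU thread X against
// the 16 queued signatures of CPU thread Y. Launch as, e.g.,
//   hb_compare<<<16, CHUNKS>>>(d_sig_x, d_sigs_y, 16, d_flags);
// blockIdx.x selects the SIG_Y entry; threadIdx.x selects the 64-bit chunk.
__global__ void hb_compare(const uint64_t *sig_x, const uint64_t *sigs_y,
                           int num_sigs_y, int *race_flags)
{
    int sig_idx   = blockIdx.x;     // which SIG_Y entry
    int chunk_idx = threadIdx.x;    // which 64-bit chunk of the signature
    if (sig_idx >= num_sigs_y || chunk_idx >= CHUNKS) return;

    // The bitwise AND implements the signature intersection of the H-B check.
    uint64_t overlap = sig_x[chunk_idx] & sigs_y[sig_idx * CHUNKS + chunk_idx];
    if (overlap != 0)
        race_flags[sig_idx] = 1;    // benign race: all writers store the same value
}
```

Under half or quart throttle, 8 or 4 blocks would be launched instead, each looping over two or four SIG_Y entries.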
GPU Kernel Synchronization. The GPU kernel synchronizes all the threads after the comparison of the current SIG_X with all the present SIG_Y entries, using a custom synchronization function gpu_sync(). The current SIG_X is then graduated before each thread moves to a new SIG_X. This lock-step behavior ensures the correctness of signature data accessed by GPU threads by avoiding untimely overwriting of SIG_X by CPU X. Since GUARD's GPU kernel can utilize several thread blocks spread across multiple SMs, it is essential for gpu_sync() to be able to synchronize across SMs. While the CUDA library function __syncthreads() [24] can only synchronize threads within a block, gpu_sync() utilizes a global mutex variable and atomic operations to synchronize across multiple SMs. gpu_sync() is inspired by the GPU lock-based synchronization discussed by Xiao and Feng [25].
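A minimal sketch of such a lock-based inter-block barrier, in the spirit of Xiao and Feng [25], is shown below; the variable names are ours, and GUARD's actual implementation may differ.

```cuda
// Global mutex shared by all thread blocks; assumed to be reset to 0
// before the kernel launch (our simplification).
__device__ int g_mutex = 0;

// Inter-block barrier: goal_val must equal the number of blocks (gridDim.x).
__device__ void gpu_sync(int goal_val)
{
    // One representative thread per block checks in at the barrier.
    if (threadIdx.x == 0) {
        atomicAdd(&g_mutex, 1);
        // Spin until every block has incremented the mutex.
        while (atomicAdd(&g_mutex, 0) < goal_val) { /* busy wait */ }
    }
    // Hold the remaining threads of this block until thread 0 returns.
    __syncthreads();
}
```

Note that such a barrier is only safe when all thread blocks are resident on the SMs simultaneously; a spinning block would otherwise wait forever for one that was never scheduled. For repeated barriers within one kernel, goal_val would grow by gridDim.x per call (or the mutex would be reset), as in [25].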
3.3 Coherence-Based Filtering
Using signatures to compress the memory access trace can lead to incorrect data race detection (false positives), as discussed in Section 3.1. GUARD compresses load (LD) and store (ST) addresses into separate read (RD) and write (WR) signatures of the same size for comparison purposes. However, we observe that LD instructions generally outnumber ST instructions by ten to one, which makes LD instructions the major source of false positives in GUARD. The false positive rate can be reduced by increasing the signature size, but doing so increases the signature table size and the signature comparison effort, leading to a significant performance penalty.
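For illustration, the sketch below shows one plausible Bloom-filter-style signature encoding; the hash function and the helper sig_insert are our assumptions, as the paper does not specify the encoding here.

```cuda
#include <cstdint>

constexpr int SIG_BITS = 2048;   // signature width from Section 3.2

// Hypothetical signature insert: hash the address to one of SIG_BITS
// positions and set that bit. Two distinct addresses that hash to the
// same bit make otherwise disjoint RD and WR signatures intersect,
// which is exactly the false positive discussed in the text.
__host__ __device__ inline void sig_insert(uint64_t *sig, uintptr_t addr)
{
    unsigned bit = (unsigned)((addr >> 3) % SIG_BITS);  // drop byte offset
    sig[bit / 64] |= 1ULL << (bit % 64);
}
```

Because LD instructions dominate, the RD signature fills up, and thus aliases, fastest; this is why filtering loads is more attractive than enlarging the signatures.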
In this section, we discuss a novel coherence-based filtering mechanism that improves the accuracy of data race detection in GUARD. The filtering mechanism utilizes coherence state information to identify the LD instructions that access private and shared read-only addresses, and filters them out. This way, only LD instructions that access shared addresses modified by other threads are compressed into the RD signature.
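As a concrete, simplified view of the filter's decision, the sketch below assumes MESI-style coherence states and a per-line flag indicating whether another thread has written the line; both are our illustrative assumptions about how the state information might be exposed.

```cuda
// Hypothetical MESI-style cache line states.
enum CoherenceState { MODIFIED, EXCLUSIVE, SHARED, INVALID };

// Decide whether a load must be recorded in the RD signature.
// Loads to private lines (Exclusive/Modified in this core's cache) and
// to shared read-only lines cannot race and are filtered out; only
// loads to shared lines modified by another thread are recorded.
bool must_record_load(CoherenceState state, bool written_by_other)
{
    if (state == EXCLUSIVE || state == MODIFIED)
        return false;   // private to this thread
    if (state == SHARED && !written_by_other)
        return false;   // shared read-only
    return true;        // shared and modified elsewhere: record it
}
```

Only the loads for which must_record_load returns true would then be inserted into the RD signature, e.g., via a helper like the sig_insert sketch above.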