Accelerating Data Race Detection Utilizing On-Chip Data-Parallel Cores - Runtime Verification

Kernel Parallelizations. We observe that not all parallelization opportunities discussed in Section 3.2 work equally well. In addition to throttling, we also discussed utilizing multiple threads to compare the chunks inside each signature. While throttling has a significant impact on the performance of GUARD, signature chunk-level parallelism does not improve performance significantly. With chunk-level parallelism, each GPU thread performs a very short computation (comparing two 64-bit unsigned integers), so the overhead of managing a large number of GPU threads is not recovered by the work each thread does. This indicates that the H-B algorithm used in GUARD benefits more from coarse-grained parallelism than from fine-grained parallelism.
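
To make the granularity argument concrete, the following is a minimal sketch of a chunk-wise signature comparison. It assumes, purely for illustration, that signatures behave like Bloom-filter bit vectors and that two signatures conflict when some pair of corresponding 64-bit chunks shares a set bit; the text only states that chunks are compared as 64-bit unsigned integers. The names `split_chunks` and `signatures_overlap` are hypothetical and are not part of GUARD.

```python
CHUNK_BITS = 64
MASK64 = (1 << CHUNK_BITS) - 1


def split_chunks(signature, sig_bits):
    """Split a signature (held as a Python int) into 64-bit chunks."""
    return [(signature >> shift) & MASK64
            for shift in range(0, sig_bits, CHUNK_BITS)]


def signatures_overlap(sig_a, sig_b, sig_bits=2048):
    """Return True if any pair of corresponding chunks shares a set bit.

    Assumption: overlap is tested with a bitwise AND per chunk. The text
    does not specify the exact comparison GUARD performs.
    """
    chunks_a = split_chunks(sig_a, sig_bits)
    chunks_b = split_chunks(sig_b, sig_bits)
    # Under chunk-level parallelism, each (a, b) pair below would be the
    # entire workload of one GPU thread: a single 64-bit operation.
    return any(a & b for a, b in zip(chunks_a, chunks_b))
```

Under this assumption a 2048-bit signature splits into 32 chunks, so chunk-level parallelism would launch 32 threads that each perform one 64-bit operation, which suggests why the per-thread work may be too small to pay for thread management.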

Customizable Design. The high performance of full throttle mode is obtained at the cost of utilizing a larger amount of on-chip GPU resources, as shown in Table 3. If on-chip resources are constrained, we could instead select a smaller signature size and still achieve better performance for the same level of throttling, as shown in Figure 4. However, this comes at the cost of a higher false positive rate. GUARD allows customizing either of these parameters, signature size or throttling, to achieve the performance goal we set for a particular accuracy constraint. This level of performance-accuracy customizability is hard to achieve in hardware-based data race detection mechanisms.

Coherence Filtering. In Section 3.3 we introduced a novel coherence-based filtering mechanism to reduce the false positive rate of data race detection using signatures. Here, we evaluate the impact of coherence-based filtering on GUARD. The coherence-based mechanism filters 93.6% of all LD instructions, which results in filtering out accesses to 96.56% of unique addresses. With filtering, the false positive rate drops significantly, as shown (w/ Filter) in Figure 4:

- from 18.8% to 4.8% for 2048-bit signatures

- from 37.9% to 9.6% for 1024-bit signatures

- from 89.9% to 65.6% for 512-bit signatures

Additionally, the filtering mechanism achieves this improvement without missing any data race conditions in our experiments. Thus, coherence-based filtering proves to be very effective in improving the accuracy of GUARD. Our evaluations are based on the MOSI coherence protocol; however, the filtering mechanism can easily be adapted to other coherence protocols. With filtering, the false positive rate for 1024-bit signatures is now under 10%. Hence, half throttling with 1024-bit signatures can be utilized to run GUARD with negligible performance overhead, reasonable accuracy, and low GPU utilization. This is particularly attractive for CPUs with a higher number of cores, as the GPU resources required to perform data race detection at full throttling can become quite large, as shown in Table 3.
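
The choice of 1024-bit signatures under a 10% accuracy constraint can be reproduced from the false positive rates reported above. The sketch below is only an illustration of that selection step: the table `FP_RATE` copies the percentages from the text, while the helper names `relative_reduction` and `smallest_signature` are hypothetical and not part of GUARD.

```python
# False positive rates (%) reported in the text, keyed by signature size in bits.
FP_RATE = {
    2048: {"no_filter": 18.8, "filter": 4.8},
    1024: {"no_filter": 37.9, "filter": 9.6},
    512: {"no_filter": 89.9, "filter": 65.6},
}


def relative_reduction(sig_bits):
    """Fraction by which filtering lowers the false positive rate."""
    rates = FP_RATE[sig_bits]
    return (rates["no_filter"] - rates["filter"]) / rates["no_filter"]


def smallest_signature(max_fp_percent, filtered=True):
    """Smallest signature size whose false positive rate meets the constraint.

    Returns None when no reported configuration satisfies it.
    """
    key = "filter" if filtered else "no_filter"
    candidates = [bits for bits, rates in FP_RATE.items()
                  if rates[key] <= max_fp_percent]
    return min(candidates) if candidates else None
```

With these numbers, `smallest_signature(10.0)` selects 1024 bits when filtering is enabled, whereas without filtering no reported size meets a 10% constraint. The relative reduction is roughly three quarters for the 2048-bit and 1024-bit signatures and considerably smaller for 512-bit signatures.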

Bandwidth Utilization. Signature transfer between CPU and GPU consumes on-chip bandwidth. For a 2048-bit signature, we observe that GUARD utilizes less than 15% of the on-chip bandwidth provided by current designs [7] to transfer signatures. This bandwidth utilization can be further reduced by using additional hardware to compress the signatures [5] before transferring them through the on-chip interconnection network.
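
As a rough aid for reasoning about this cost, the fraction of link bandwidth consumed by signature traffic can be estimated from the signature size, the transfer rate, and the link capacity. The sketch below is a back-of-the-envelope model only: the function names are hypothetical, and the transfer rate and link bandwidth are inputs the reader must supply, since the text does not report them.

```python
def signature_bytes(sig_bits):
    """Size of one uncompressed signature in bytes."""
    return sig_bits // 8


def bandwidth_fraction(sig_bits, signatures_per_second, link_bytes_per_second):
    """Fraction of link bandwidth consumed by uncompressed signature transfers.

    Assumption: every signature is sent in full, with no protocol overhead.
    """
    traffic = signature_bytes(sig_bits) * signatures_per_second
    return traffic / link_bytes_per_second
```

A 2048-bit signature occupies 256 bytes, so halving the signature size halves the traffic in this model for the same transfer rate.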
