Kernel Parallelizations. We observe that not all parallelization opportunities discussed
in Section 3.2 work equally well. In addition to throttling, we also discussed utilizing
multiple threads to compare the chunks inside each signature. While we observe that
throttling has a significant impact on the performance of GUARD, signature chunk-level
parallelism does not improve performance significantly. With chunk-level parallelism,
each GPU thread performs only a very short computation (comparing two 64-bit unsigned
integers), so the overhead of managing a large number of GPU threads is not amortized
by the work each thread does. This indicates that the H-B algorithm used in GUARD
benefits more from coarse-grained parallelism than from fine-grained parallelism.
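To make the granularity argument concrete, the following is a minimal Python sketch of the two units of work, not GUARD's actual kernel. It assumes that a signature is a fixed-width bit vector and that "comparing" two chunks means testing whether they share any set bit; the function names are hypothetical and chosen only for illustration.

```python
CHUNK_BITS = 64
CHUNK_MASK = (1 << CHUNK_BITS) - 1


def split_into_chunks(signature, signature_bits):
    """Split a signature (held as a Python int) into 64-bit chunks, lowest chunk first."""
    return [(signature >> shift) & CHUNK_MASK
            for shift in range(0, signature_bits, CHUNK_BITS)]


def chunks_overlap(chunk_a, chunk_b):
    """Fine-grained unit of work: a single comparison of two 64-bit chunks.

    Assumption: two chunks 'match' when they share at least one set bit.
    """
    return (chunk_a & chunk_b) != 0


def signatures_overlap(sig_a, sig_b, signature_bits):
    """Coarse-grained unit of work: one worker walks all chunks of a signature pair."""
    chunks_a = split_into_chunks(sig_a, signature_bits)
    chunks_b = split_into_chunks(sig_b, signature_bits)
    return any(chunks_overlap(a, b) for a, b in zip(chunks_a, chunks_b))
```

Under these assumptions, a 2048-bit signature splits into 32 chunks, so chunk-level parallelism would dedicate a thread to each call of `chunks_overlap`, which is a single AND and a test. The coarse-grained variant gives each worker the whole loop in `signatures_overlap` instead.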
Customizable Design. The high performance of full throttle mode comes at the
cost of using a larger amount of on-chip GPU resources, as shown in Table 3. If on-chip
resources are constrained, we could instead select a smaller signature size and still achieve
better performance for the same level of throttling, as shown in Figure 4. However, this
comes at the cost of a higher false positive rate. GUARD allows customizing either of
these parameters, signature size or throttling, to achieve the performance goal we set
for a particular accuracy constraint. This level of performance-accuracy customizability
is hard to achieve in hardware-based data race detection mechanisms.
Coherence Filtering. In Section 3.3 we introduced a novel coherence-based filtering
mechanism to reduce the false positive rate of data race detection using signatures. Here,
we evaluate the impact of the coherence-based filtering on GUARD. The coherence-
based mechanism filters 93.6% of all LD instructions, which results in filtering out
accesses to 96.56% of unique addresses. With filtering, the false positive rate drops
significantly as shown (w/ Filter) in Figure 4:
- from 18.8% to 4.8% for 2048-bit signatures
- from 37.9% to 9.6% for 1024-bit signatures
- from 89.9% to 65.6% for 512-bit signatures
Additionally, the filtering mechanism achieves this improvement without missing
any data race conditions in our experiments. Thus, coherence-based filtering proves
to be very effective in improving the accuracy of GUARD. Our evaluations are based
on the MOSI coherence protocol; however, the filtering mechanism can easily be adapted
to other coherence protocols. With filtering, the false positive rate for 1024-bit signatures
is now under 10%. Hence, half throttling with 1024-bit signatures can be used to
run GUARD with negligible performance overhead, reasonable accuracy, and low GPU
utilization. This is particularly attractive for CPUs with a higher number of cores, as the
GPU resources required to perform data race detection at full throttling can become
quite large, as shown in Table 3.
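The arithmetic behind these observations can be checked with a short Python sketch. The false positive rates are the ones reported above; the helper names and the idea of picking the smallest signature that meets a false positive limit are illustrative assumptions, not part of GUARD itself.

```python
# False positive rates (%) reported in the text, keyed by signature size in bits:
# (without filtering, with coherence-based filtering).
FALSE_POSITIVE_RATES = {
    2048: (18.8, 4.8),
    1024: (37.9, 9.6),
    512: (89.9, 65.6),
}


def relative_reduction(before, after):
    """Relative drop in false positive rate, in percent of the unfiltered rate."""
    return 100.0 * (before - after) / before


def smallest_signature_within(limit_percent):
    """Hypothetical helper: smallest signature size whose filtered rate meets the limit."""
    candidates = [bits
                  for bits, (_, with_filter) in FALSE_POSITIVE_RATES.items()
                  if with_filter <= limit_percent]
    return min(candidates) if candidates else None


for bits, (without_filter, with_filter) in FALSE_POSITIVE_RATES.items():
    drop = relative_reduction(without_filter, with_filter)
    print(f"{bits}-bit: {without_filter}% -> {with_filter}% ({drop:.1f}% relative reduction)")
```

With these numbers, filtering removes roughly three quarters of the false positives for 2048-bit and 1024-bit signatures but only about 27% for 512-bit signatures. A 10% false positive limit is first met by the 1024-bit signature, which is consistent with the half-throttling configuration suggested above.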
Bandwidth Utilization. Signature transfer between CPU and GPU consumes on-chip
bandwidth. For a 2048-bit signature, we observe that GUARD utilizes less than 15%
of the on-chip bandwidth provided by current designs [7] to transfer signatures. This
bandwidth utilization can be reduced further by using additional hardware to compress
the signatures [5] before they are transferred over the on-chip interconnection network.