Table 1. System configuration parameters for the heterogeneous CPU-GPU evaluation infrastructure

CPU                                                 GPU
Cores                        4/8/16/32              Warp Size                      32
Frequency                    2600MHz                Frequency                      1300MHz
Pipeline Width               4                      SIMD Pipeline Width            8
L1 Cache (Size/Assoc/Line)   32KB / 2 / 64B         L1 Cache (Size/Assoc/Line)     32KB / 2 / 64B
L2 Cache (Size/Assoc/Line)   2MB / 4 / 64B          L2 Cache (Size/Assoc/Line)     512KB / 4 / 64B
RoB / IW Size                64 / 32                Shared Memory per Core         16KB
MSHR / TLB Entries           256 / 64               Threads / Registers per Core   1024 / 16384
L2 / DRAM Access Latency     6 / 200 Cycles         Memory Channels                8
start of their respective parallel sections, also known as the region of interest. Because full-system simulation is extremely time-consuming, we simulate 1 billion instructions within the region of interest; we find that this is sufficient to comprehensively evaluate both GUARD's ability to detect data race conditions and its performance characteristics.
Cycle-accurate simulators are used to evaluate the performance impact of GUARD on the CPU application being monitored. GUARD GPU kernel invocations, data-transfer operations, and signature-comparison operations are all simulated in a cycle-accurate manner. We enable the L1 data cache in the GPU to improve the performance of the GUARD kernel. We could potentially also store the signature table in GPU shared memory; however, we rely on the L1 data caches instead, since their access times are similar. Shared memory in the GPU is explicitly managed by the programmer, so each time the CPU updates the signature table at its regular interval, the copy of the table in shared memory would also have to be updated manually, which would impose additional overhead.
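To make this trade-off concrete, the following is a minimal CUDA sketch, not GUARD's actual kernel, of a signature check that reads the table directly from global memory so that repeated accesses are served by the GPU L1 data cache. The SigEntry layout, the single-word Bloom signatures, and the names check_signatures and conflict_flags are illustrative assumptions; a shared-memory variant would additionally have to stage the table into __shared__ storage at every kernel launch and again after every CPU update.

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative per-thread signature entry; GUARD's real encoding may differ.
struct SigEntry {
    unsigned long long read_sig;   // Bloom-filter bits of addresses read
    unsigned long long write_sig;  // Bloom-filter bits of addresses written
};

// Each GPU thread probes one signature-table entry held in global memory.
// On the simulated GPU these loads are served by the L1 data cache, so no
// explicit staging into shared memory is needed.
__global__ void check_signatures(const SigEntry* sig_table, int n,
                                 unsigned long long probe_write_sig,
                                 int* conflict_flags)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    SigEntry e = sig_table[i];

    // A non-empty intersection with the probing write signature marks a
    // potential read-write or write-write conflict for a later precise check.
    unsigned long long rw = e.read_sig  & probe_write_sig;
    unsigned long long ww = e.write_sig & probe_write_sig;
    conflict_flags[i] = (rw | ww) ? 1 : 0;
}

int main() {
    const int n = 4;
    SigEntry h_tab[n] = {{0x1ull, 0x0ull},   // reads address hashing to bit 0
                         {0x0ull, 0x2ull},   // writes address hashing to bit 1
                         {0x4ull, 0x0ull},   // reads address hashing to bit 2
                         {0x0ull, 0x0ull}};  // no accesses recorded
    SigEntry* d_tab = nullptr;
    int* d_flags = nullptr;
    int h_flags[n];

    cudaMalloc(&d_tab, n * sizeof(SigEntry));
    cudaMalloc(&d_flags, n * sizeof(int));
    cudaMemcpy(d_tab, h_tab, n * sizeof(SigEntry), cudaMemcpyHostToDevice);

    // Probe with a write signature covering bits 1 and 2: entries 1 and 2 conflict.
    check_signatures<<<1, 32>>>(d_tab, n, 0x6ull, d_flags);

    cudaMemcpy(h_flags, d_flags, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("entry %d conflict: %d\n", i, h_flags[i]);

    cudaFree(d_tab);
    cudaFree(d_flags);
    return 0;
}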
5   Evaluation
This section presents a detailed evaluation of GUARD. We first examine the effectiveness of our scheme in detecting data races, then move on to the performance characteristics of GUARD, and finally discuss the performance-accuracy trade-off achievable given the limited on-chip resources available.
Table 2 shows the number of data races GUARD detects. GUARD is based on the happened-before principle used by prior work such as SigRace [5], and is therefore expected to capture the same set of data races. It is worth pointing out that, like SigRace, GUARD does not capture all potential data races: it captures only those races that violate the happened-before relation at runtime. GUARD works at address-level granularity, so each reported data race corresponds to a unique address. The ability to detect actual data races demonstrates the effectiveness of GUARD. Some of the race conditions reported here are benign or intentional. Nevertheless, it is essential for a concurrency-bug detection tool to report all potential bugs and let the programmer decide on their severity.
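As a point of reference, the sketch below illustrates the happened-before filter in a hedged, simplified form: two accesses race only when they touch the same address, at least one is a write, and neither epoch is ordered before the other under a vector-clock comparison. The Access record, the vc_leq helper, and the fixed thread count are illustrative assumptions and do not reflect GUARD's actual metadata layout.

#include <cstdio>

constexpr int kThreads = 4;   // number of monitored CPU threads (assumed)

// Illustrative access record at address-level granularity.
struct Access {
    unsigned long long addr;
    bool is_write;
    unsigned vc[kThreads];    // vector-clock snapshot taken at the access
};

// vc_leq(a, b) is true when a happened before (or equals) b,
// i.e. a <= b componentwise.
__host__ __device__ bool vc_leq(const unsigned* a, const unsigned* b) {
    for (int i = 0; i < kThreads; ++i)
        if (a[i] > b[i]) return false;
    return true;
}

// Two accesses race only if they touch the same address, at least one is a
// write, and neither epoch is ordered before the other -- that is, the
// happened-before relation is violated at runtime.
__host__ __device__ bool races(const Access& x, const Access& y) {
    if (x.addr != y.addr) return false;            // one report per unique address
    if (!x.is_write && !y.is_write) return false;  // read-read never races
    return !vc_leq(x.vc, y.vc) && !vc_leq(y.vc, x.vc);
}

int main() {
    Access w = {0x1000, true,  {2, 0, 0, 0}};      // write by thread 0
    Access r = {0x1000, false, {0, 3, 0, 0}};      // read by thread 1, unordered
    printf("race reported: %d\n", races(w, r));    // prints 1
    return 0;
}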
 