SG picks up the shared-modified state of the address from the cache response mes-
sage from the LLC controller. Since the SG is private to a core, it need only moni-
tor coherence messages destined to the local first-level cache. Only a single additional
shared-modified bit is required to pass this information. The filtering mechanism does
not alter the cache coherence scheme in any way, which is desirable because coherence
protocols are highly optimized designs. The temporary signatures will be the size of an
RD signature per core, which is 256 bytes for a 2048-bit signature. Prior work [26] has proposed an algorithm
that uses coherence state information to detect data races. Also, software-based data
race detection mechanisms [27] have employed techniques to filter stack and duplicate
addresses to improve performance. However, to the best of our knowledge, this is the
first work to utilize a coherence-based filtering technique to improve the accuracy of a
data race detection tool that already works at near-hardware speed.
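The signature sizing and the role of the shared-modified bit can be illustrated with a minimal sketch. Everything beyond the 2048-bit width is an assumption made for illustration: the class name Signature, the function record_access, the single-hash indexing, the 64-byte block size, and in particular the policy that an address is recorded only when its cache response carries the shared-modified bit. It is not the hardware design described here.

```python
SIGNATURE_BITS = 2048                    # signature width stated in the text
SIGNATURE_BYTES = SIGNATURE_BITS // 8    # 2048 bits / 8 = 256 bytes per core


class Signature:
    """Fixed-size bit-vector signature (hypothetical single-hash variant)."""

    def __init__(self, bits=SIGNATURE_BITS):
        self.bits = bits
        self.data = bytearray(bits // 8)

    def _index(self, address):
        # Assumed hash: drop a 64-byte block offset, then fold the block
        # number into the signature width.
        return (address >> 6) % self.bits

    def insert(self, address):
        i = self._index(address)
        self.data[i // 8] |= 1 << (i % 8)

    def may_contain(self, address):
        # True means "possibly present"; false positives are possible.
        i = self._index(address)
        return bool(self.data[i // 8] & (1 << (i % 8)))


def record_access(signature, address, shared_modified):
    """Assumed filter policy: keep only addresses flagged shared-modified."""
    if not shared_modified:
        return False          # filtered out, signature left untouched
    signature.insert(address)
    return True
```

Under these assumptions, an access whose response does not carry the shared-modified bit never sets a bit in the signature, so it cannot contribute to a false positive later on.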
4 Evaluation Infrastructure
In spite of the recent heterogeneous designs [7-9], some of which are already in the
market, the optimal design of a multicore CPU with on-chip data-parallel cores is still
unclear. The memory hierarchy design and shared memory consistency models are am-
biguous and the programming model is still in its nascent state. Nevertheless, such
designs provide a suitable infrastructure to off-load the task of CPU data race detection
to on-chip accelerator cores. In this work, we describe a generic execution model and
propose a data race detector inspired by these designs.
4.1 Heterogeneous Execution Environment
We utilize a heterogeneous multicore processor, consisting of CPU and GPU cores on
the same die, as shown in Figure 2. The cores and their respective LLCs are con-
nected through a common on-chip interconnection network. Communicating through
the shared on-chip interconnection network improves the efficiency of GUARD. These
cores work on different address spaces and hence we do not consider the complexities
of coherence between CPU and GPU cores in our design. We base our evaluation on
a GPU SM, with 8 SPs, that can each support up to 1024 threads. This is modeled on the
NVIDIA GeForce® 8600 GTS. The parameters of the simulated CPU and GPU cores
are given in Table 1.
To simulate the multicore CPU in detail, we use Simics [28] combined with GEMS [29].
The GPU cores are simulated using GPGPU-sim [30]. The on-chip interconnection network
is simulated using Garnet [31]. The GUARD GPU kernel is compiled using CUDA
2.3 [24]. We evaluate GUARD with applications from two widely used benchmark
suites: PARSEC [32] and SPLASH-2 [33]. Our evaluation reports data from 15 pro-
grams in total: seven PARSEC and eight SPLASH-2 programs as indicated in Table 2.
Using Simics and GEMS, we simulate a many-core system with Sun Microsystems'
UltraSPARC® III processors running the Solaris® 8 operating system. All the benchmark
programs are written in C/C++ and parallelized using either OpenMP or Pthreads.
They are compiled using GCC 4.5.2 at -O3 optimization level. The reported results are
based on running the selected benchmarks for 1 billion instructions in total from the
 