isLP bit to 0. In this way we logically move a block between LP and RP by simply changing a bit value. Inserting, replacing or removing a block does not itself update the isLP bit of the corresponding way; the bit must be set separately.
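As a minimal sketch of this mechanism (the names Way, isLP and ASSOC are our own illustrative assumptions, not taken from the hardware described above), flipping a single flag relabels a physical way as belonging to LP or RP without copying any data:

```cpp
#include <array>
#include <cstdint>

constexpr int ASSOC = 16;  // hypothetical associativity

struct Way {
    uint64_t tag   = 0;
    bool     valid = false;
    bool     isLP  = false;  // true: way currently belongs to LP; false: RP
};

using CacheSet = std::array<Way, ASSOC>;

// Logically move a block between partitions by flipping the bit;
// the tag and data stay in the same physical way.
inline void moveToRP(CacheSet& set, int way) { set[way].isLP = false; }
inline void moveToLP(CacheSet& set, int way) { set[way].isLP = true; }
```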
In the case of a local LRU replacement policy, each set has its own hardware to implement LRU for that particular set. In random-LRU, the LRU hardware needs to maintain records only for the LP_n ways of LP (in each set). However, since there is no physical movement of blocks within a set, the ways belonging to the LP section change dynamically, and the LRU records must follow them. For example, if the LRU hardware implements a data structure to maintain the LRU records, then the structure holds LP_n nodes, each containing one aging variable and a pointer indicating the way in which the corresponding block resides. A detailed hardware-level explanation of the LRU policy is beyond the scope of this paper; we assume that logical block movement adds no extra hardware overhead for implementing LRU in LP.
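To make the bookkeeping concrete, below is a hypothetical C++ sketch of such a per-set record structure. LruNode, touch, victimWay and the aging scheme are illustrative assumptions consistent with the description above (LP_n nodes, each with one aging variable and a way pointer); they are not the paper's actual design.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One record per LP block: an aging variable plus a pointer (here a way
// index) identifying the physical way that currently holds the block.
struct LruNode {
    uint32_t age = 0;   // larger value = accessed longer ago
    int      way = -1;  // physical way holding this LP block
};

struct LpLruRecords {
    std::vector<LruNode> nodes;  // exactly LP_n nodes per set

    explicit LpLruRecords(int lp_n) : nodes(lp_n) {}

    // On an access to the LP block tracked by node i: age every record,
    // then mark node i as most recently used.
    void touch(std::size_t i) {
        for (auto& n : nodes) ++n.age;
        nodes[i].age = 0;
    }

    // Victim selection: the node with the largest age is the LRU block;
    // return the physical way it points to.
    int victimWay() const {
        std::size_t oldest = 0;
        for (std::size_t i = 1; i < nodes.size(); ++i)
            if (nodes[i].age > nodes[oldest].age) oldest = i;
        return nodes[oldest].way;
    }
};
```

Because each node stores a way pointer rather than a fixed position, relabeling a way via its isLP bit only requires updating (or reassigning) the pointer in one node, which mirrors the claim that logical block movement adds no extra LRU hardware.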
4 Experimental Evaluation
4.1 Tiled Chip Multiprocessor
We used a 16-core tiled CMP architecture [1] to evaluate all the replacement policies. Each tile has a processor, a private L1-cache and an L2-cache. The tiles (or processor nodes) are connected to each other over a 2D mesh, popularly known as a network-on-chip (NoC). The L2-cache in each tile can be private or shared among all processors on the chip. In this paper we assume a shared cache, where the slice located in each tile is called a cache-bank. Each bank is itself an independent set-associative cache. All the experimental results shown in this section are for the entire LLC, combining the results of all the banks.
4.2 Experimental Setup
To evaluate the proposed cache management technique, we performed simulations by running benchmarks on the multi-core simulator GEMS [14], driven by a full-system functional simulator. GEMS includes Ruby, a timing simulator of multiprocessor memory systems. We used the MESI CMP-based cache controller in GEMS. The configurations of the processor, cache memory and main memory used in our experiments are given in Table 1. To compute the latencies incurred at the L1 caches, L2 banks and directories, we used Princeton's Garnet [15] network simulator; its parameters are listed in Table 2. We used six multi-threaded applications from the PARSEC [16] benchmark suite for simulation. Note that our proposed replacement policy applies only to the L2 cache; the behavior of the L1 caches remains unchanged.
 