isLP bit to 0. In this way we logically move a block between LP and RP by simply changing a bit value. Inserting, replacing or removing a block does not itself update the isLP bit of the corresponding way; the bit must be set separately.
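As a minimal sketch of this mechanism (the names Way, isLP and ASSOC are our own illustrative assumptions, not taken from the hardware described above), flipping a single flag relabels a physical way as belonging to LP or RP without copying any data:

```cpp
#include <array>
#include <cstdint>

constexpr int ASSOC = 16;  // hypothetical associativity

struct Way {
    uint64_t tag   = 0;
    bool     valid = false;
    bool     isLP  = false;  // true: way currently belongs to LP; false: RP
};

using CacheSet = std::array<Way, ASSOC>;

// Logically move a block between partitions by flipping the bit;
// the tag and data stay in the same physical way.
inline void moveToRP(CacheSet& set, int way) { set[way].isLP = false; }
inline void moveToLP(CacheSet& set, int way) { set[way].isLP = true; }
```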
In the case of a local LRU replacement policy, each set has its own hardware to implement LRU for that particular set. In random-LRU, the LRU hardware needs to maintain records only for the LP_n ways of LP (in each set). However, since there is no physical movement of blocks within a set, the ways belonging to the LP section change dynamically, and the LRU records must follow them. For example, if the LRU hardware implements a data structure to maintain the LRU records, then the structure holds LP_n nodes, each containing one aging variable and a pointer indicating the way in which the corresponding block resides. A detailed hardware-level explanation of the LRU policy is beyond the scope of this paper; we assume that logical block movement adds no extra hardware overhead for implementing LRU in LP.
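To make the bookkeeping concrete, below is a hypothetical C++ sketch of such a per-set record structure. LruNode, touch, victimWay and the aging scheme are illustrative assumptions consistent with the description above (LP_n nodes, each with one aging variable and a way pointer); they are not the paper's actual design.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One record per LP block: an aging variable plus a pointer (here a way
// index) identifying the physical way that currently holds the block.
struct LruNode {
    uint32_t age = 0;   // larger value = accessed longer ago
    int      way = -1;  // physical way holding this LP block
};

struct LpLruRecords {
    std::vector<LruNode> nodes;  // exactly LP_n nodes per set

    explicit LpLruRecords(int lp_n) : nodes(lp_n) {}

    // On an access to the LP block tracked by node i: age every record,
    // then mark node i as most recently used.
    void touch(std::size_t i) {
        for (auto& n : nodes) ++n.age;
        nodes[i].age = 0;
    }

    // Victim selection: the node with the largest age is the LRU block;
    // return the physical way it points to.
    int victimWay() const {
        std::size_t oldest = 0;
        for (std::size_t i = 1; i < nodes.size(); ++i)
            if (nodes[i].age > nodes[oldest].age) oldest = i;
        return nodes[oldest].way;
    }
};
```

Because each node stores a way pointer rather than a fixed position, relabeling a way via its isLP bit only requires updating (or reassigning) the pointer in one node, which mirrors the claim that logical block movement adds no extra LRU hardware.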
4 Experimental Evaluation
4.1 Tiled Chip Multiprocessor
We used a 16-core tiled CMP architecture [1] to evaluate all the replacement policies. Each tile has a processor, a private L1-cache and an L2-cache. The tiles (or processor nodes) are connected to each other over a 2D mesh, popularly known as a network-on-chip (NoC). The L2-cache in each tile can be private or shared among all processors on the chip. In this paper we assume a shared cache, where the slice located in each tile is called a cache-bank. Each bank is itself an independent set-associative cache. All the experimental results shown in this section are for the entire LLC, combining the results of all the banks.
4.2 Experimental Setup
To evaluate the proposed cache management technique, we performed simulations by running benchmarks on the multi-core simulator GEMS [14], driven by a full-system functional simulator. GEMS includes Ruby, a timing simulator of multiprocessor memory systems. We used the MESI CMP-based cache controller in GEMS. The configurations of the processor, cache memory and main memory used in our experiments are given in Table 1. To compute the latencies incurred at the L1 caches, L2 banks and directories, we used Princeton's Garnet [15] network simulator; its parameters are listed in Table 2. We used six multi-threaded applications from the PARSEC [16] benchmark suite for simulation. Note that our proposed replacement policy applies only to the L2 cache; the behavior of the L1 caches remains unchanged.
 