workloads. But a shared LLC increases hit time because of its much larger cache size compared to a private LLC.
Replacement policy plays a major role in the performance of CMP last-level caches [1]. The most efficient cache replacement algorithm is to always discard the block that will not be needed for the longest time in the future. This optimal policy is referred to as Belady's optimal algorithm [2], but it is not implementable in practice because it is generally impossible to predict how far in the future a block will be needed. Recent studies have found a huge performance gap between LRU and Belady's optimal algorithm: the miss rate under the LRU policy can be up to 197% higher than under optimal replacement [3].
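To make the gap concrete, the two policies can be compared with a small trace-driven sketch (an illustration, not from the paper): LRU evicts the least recently used block online, while Belady's algorithm looks ahead in the trace and evicts the block whose next use is farthest in the future.

```python
from collections import OrderedDict

def lru_misses(trace, capacity):
    """Simulate an LRU cache over a block-address trace and count misses."""
    cache = OrderedDict()
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used
            cache[block] = True
    return misses

def belady_misses(trace, capacity):
    """Offline optimal (Belady): evict the block reused farthest in the future."""
    cache = set()
    misses = 0
    for i, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            def next_use(b):
                try:
                    return trace.index(b, i + 1)
                except ValueError:
                    return float('inf')    # never reused again: ideal victim
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses
```

A cyclic trace over four blocks with a three-entry cache shows the pathology: LRU misses on every access (16 misses over 16 references), whereas Belady's algorithm misses only 8 times.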
As mentioned in [4], the least-recently-used (LRU) policy has two major disadvantages. First, a cache block can remain in the cache long after its last use, because it takes a long time for the block to age into the LRU position; until then, the block occupies cache space unnecessarily. Such blocks are called dead blocks [4]. The gap between a block's death time (the time when the block is last used) and its replacement time worsens as associativity increases. Second, some blocks in L2 may never be reused; such blocks are dead immediately after being loaded into the L2 cache. This is caused by bursty spatial reuse of different bytes of the same cache line, which is aggravated as block size increases. Such bursty reuse makes the block MRU (most recently used) in L1 while it is accessed only once in L2: even though the block is accessed rapidly in L1, it is touched in L2 just once. So the principle of temporal locality holds in L1 but not in L2. These blocks are called never-reused blocks [4]. Dead and never-reused blocks are an unnecessary burden on the LRU policy and degrade performance.
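The first problem can be quantified with a toy model (an illustration under assumed worst-case traffic, not from the paper): once a block in an LRU set is dead, at least `associativity` further misses to that set must occur before it is finally evicted, so its useless residence time grows linearly with associativity.

```python
def dead_interval(associativity):
    """Misses to a set between a block's last use and its eviction under LRU.

    Toy model: the dead block 'X' sits at the MRU position after its last
    use, and every subsequent access to the set is a miss on a new block.
    Intervening hits would only delay eviction further.
    """
    cache = ['X']                      # 'X' just used: MRU end of the stack
    t = 0
    while 'X' in cache:
        t += 1
        cache.insert(0, f'new{t}')     # each miss inserts a new MRU block
        if len(cache) > associativity:
            cache.pop()                # evict the LRU block
    return t
```

Under this model the dead block survives exactly `associativity` misses, e.g. 4 misses in a 4-way set but 16 in a 16-way set, matching the observation that the death-to-replacement gap worsens with associativity.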
For highly associative LLCs, true LRU imposes additional complexity and area overhead. That is why most current processors use a pseudo-LRU replacement policy, which has lower hardware complexity than true LRU and performs almost as well. There are other interesting proposals for cache replacement policies (discussed in Section 2), but most of them involve complex mechanisms and hardware overhead for maintaining aging bits, counters, etc. LRU performs well in many applications with high temporal locality [1]. Hence, instead of completely abandoning the concept of LRU, researchers have proposed many innovative techniques to improve the performance of LRU-based policies [3, 5-7]. These proposals retain the concepts of aging and temporal locality while improving the performance of LRU.
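The most common pseudo-LRU variant is the binary-tree scheme, which tracks only associativity-1 bits per set instead of a full recency order. A minimal sketch of tree-based pseudo-LRU (an illustration of the general technique, not any specific processor's implementation) follows; each internal tree bit points toward the less recently used half of the ways.

```python
class TreePLRU:
    """Binary-tree pseudo-LRU for a power-of-two associativity.

    Uses ways-1 tree bits per set; bit value 0 steers victim search left.
    """
    def __init__(self, ways):
        assert ways & (ways - 1) == 0 and ways > 1
        self.ways = ways
        self.bits = [0] * (ways - 1)

    def touch(self, way):
        """On an access to `way`, flip bits on its path to point away from it."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:
                self.bits[node] = 1        # steer future victims right
                node, hi = 2 * node + 1, mid
            else:
                self.bits[node] = 0        # steer future victims left
                node, lo = 2 * node + 2, mid

    def victim(self):
        """Follow the tree bits down to the pseudo-LRU way."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[node] == 0:
                node, hi = 2 * node + 1, mid
            else:
                node, lo = 2 * node + 2, mid
        return lo
```

After touching ways 0, 1, 2, 3 in order in a 4-way set, the tree points back at way 0, just as true LRU would; the policies diverge only on some interleaved access patterns, which is the price paid for using 3 bits instead of a full ordering.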
In this paper we propose a mechanism called random-LRU to improve the performance of LRU in the LLC at much lower hardware cost than true LRU as well as other LRU-based policies. Caches using local replacement policies maintain separate replacement hardware for every set. In our proposed policy we do not implement LRU for the entire set. Instead, we divide the ways of each set into two partitions: random partition (RP) and replacement