Random-LRU: A Replacement Policy
for Chip Multiprocessors
Shirshendu Das, Nagaraju Polavarapu,
Prateek D. Halwe, and Hemangee K. Kapoor
Indian Institute of Technology Guwahati, Guwahati, India
{shirshendu,n.polavarapu,h.prateek,hemangee}@iitg.ernet.in
Abstract. As the number of cores and the associativity of the last level cache (LLC) on a Chip Multiprocessor increase, the role of the replacement policy becomes more vital. Although the pure least recently used (LRU) policy has some issues, it has generally been believed that variants of LRU perform better than other policies. Therefore, a lot of work has been proposed to improve the performance of LRU-based policies. However, it has been shown that true LRU imposes additional complexity and area overheads when implemented on highly associative LLCs. Most LRU-based works are motivated more by performance improvement than by reducing the area and hardware overhead of the true LRU scheme. In this paper we propose an LRU-based cache replacement policy, especially for the LLC, that improves the performance of LRU and reduces the area and hardware cost of pure LRU by more than half. We use a combination of random and LRU replacement for each cache set: instead of applying LRU to the entire set, we apply it only to some of the ways within the set. Experiments conducted on a full-system simulator show 36% and 11% improvements in miss rate and CPI, respectively.
Keywords: LRU, Pseudo-LRU, Tiled CMP, NUCA, Random-LRU.
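As a rough illustration of the idea sketched in the abstract, where LRU state is kept for only part of each set while the remaining ways are handled randomly, the following C++ fragment shows one possible per-set victim selector. The class name, the split between lru_ways and the remaining ways, and the equal-probability choice between the two regions are assumptions made for illustration, not the exact mechanism evaluated in the paper.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Hypothetical sketch: true LRU order is tracked only for the first
// `lru_ways` ways of a set; victims in the remaining ways are picked
// at random, so LRU metadata and update logic shrink accordingly.
class RandomLruSet {
public:
    RandomLruSet(unsigned assoc, unsigned lru_ways)
        : assoc_(assoc), lru_ways_(lru_ways), stack_(lru_ways) {
        for (unsigned i = 0; i < lru_ways_; ++i) stack_[i] = i;  // MRU..LRU order
    }

    // On a hit to a way inside the LRU-managed region, move it to the MRU position.
    void touch(unsigned way) {
        if (way >= lru_ways_) return;                 // random region keeps no state
        auto it = std::find(stack_.begin(), stack_.end(), way);
        stack_.erase(it);
        stack_.insert(stack_.begin(), way);
    }

    // On a miss, choose a victim way for replacement.
    unsigned victim() const {
        // Assumed 50/50 choice between the LRU-managed and random regions.
        if (lru_ways_ == assoc_ || std::rand() % 2 == 0)
            return stack_.back();                              // LRU way of the tracked region
        return lru_ways_ + std::rand() % (assoc_ - lru_ways_); // random untracked way
    }

private:
    unsigned assoc_;
    unsigned lru_ways_;
    std::vector<unsigned> stack_;   // LRU stack for the tracked ways only
};
```

Because the stack covers only lru_ways entries rather than the full associativity, the per-set state and the reordering work on each hit are correspondingly smaller than for true LRU over the whole set.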
1 Introduction
Chip Multiprocessors (CMPs), which contain multiple CPUs on the same die, are the main components for building today's computer systems [1]. In the long run, the number of cores in CMPs is expected to increase, along with the size of the on-chip Last Level Cache (LLC).
CMP cache architectures are mainly of two types [1]: i) CMP with private LLC and ii) CMP with shared LLC. We assume L2 as the LLC throughout this paper. Private L2 caches are relatively small and physically placed very close to the core, hence the cache access time is very low. However, they suffer from a capacity problem: since each cache is small, it incurs many capacity misses. On the other hand, a shared L2 is comparatively large and stores only a single copy of each data block; all requesting cores share the same block, and the cache storage can be dynamically allocated to the cores depending on their