Exploiting Disk Layout and Block Access History for I/O Prefetch - Advanced Operating Systems and Kernel Applications

Unlike existing file-level prefetch policies, DiskSeen directly accesses disk blocks via their LBNs, including both file data blocks and metadata blocks such as inode and indirect blocks. The challenge is that, because these blocks are prefetched at the disk level, nothing is known about their semantics except their LBNs. In other words, we do not know which file a block belongs to or what type of block it is, and back-translating LBNs to files and offsets is cumbersome. To make LBN-based prefetched blocks usable by high-level I/O routines, we treat a disk partition as a raw device file from which blocks are read and placed in the prefetching area. Only when a high-level I/O request is issued do we check the LBNs of the requested blocks against those of the prefetched blocks resident in the prefetching area. If a match is found, the prefetched block is moved into the caching area to satisfy the I/O request. This design significantly simplifies the implementation.
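The lookup path described above can be illustrated with a minimal sketch. It is a user-space toy model, not DiskSeen's kernel implementation: the class name BlockBuffer, the helper fake_read, the 4KB block size, and the absence of any eviction policy are all assumptions made for illustration.

```python
BLOCK_SIZE = 4096  # assumed block size in bytes (hypothetical)


class BlockBuffer:
    """Toy model of a buffer split into a prefetching area and a caching area."""

    def __init__(self, read_block):
        # read_block(lbn) -> bytes stands in for reading the raw device file.
        self.read_block = read_block
        self.prefetching = {}  # LBN -> data, fetched but not yet requested
        self.caching = {}      # LBN -> data, requested at least once

    def prefetch(self, lbns):
        """Disk-level prefetch: fetch by LBN only, with no file or type information."""
        for lbn in lbns:
            if lbn not in self.prefetching and lbn not in self.caching:
                self.prefetching[lbn] = self.read_block(lbn)

    def request(self, lbn):
        """High-level I/O request for one block; returns (data, source)."""
        if lbn in self.caching:
            return self.caching[lbn], "cache"
        if lbn in self.prefetching:
            # Match found: move the block from the prefetching area
            # into the caching area to satisfy the request.
            data = self.prefetching.pop(lbn)
            self.caching[lbn] = data
            return data, "prefetch"
        # Miss in both areas: read on demand and cache the result.
        data = self.read_block(lbn)
        self.caching[lbn] = data
        return data, "disk"


def fake_read(lbn):
    """Hypothetical stand-in for a raw-device read."""
    return bytes([lbn % 256]) * BLOCK_SIZE


buf = BlockBuffer(fake_read)
buf.prefetch([100, 101])
print(buf.request(100)[1])  # prefetch
print(buf.request(100)[1])  # cache
print(buf.request(200)[1])  # disk
```

Because the match is made purely on LBNs at request time, the prefetcher never needs to know which file or offset a block corresponds to, which is the simplification the paragraph describes.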

Performance Evaluation

Our experimental system is a machine with a 3.0GHz Intel Pentium 4 processor, 512MB of memory, and a Western Digital WD1600JB 160GB 7200RPM hard drive with an 8MB cache. The OS is Red Hat Linux WS4 with the Linux 2.6.11 kernel using the Ext3 file system. In the DiskSeen configuration, T, the access index gap threshold, is set to 2048, and S, which is used to determine the trail extent, is set to 128. All other system parameters are left at their default values.

For analysis of experimental results across different benchmarks, the CVS, diff, and grep benchmarks use the source code tree of Linux kernel 2.6.11, about 236MB in size, as their data set.

Workloads

To analyze the performance of DiskSeen in different scenarios, we select six representative data-intensive benchmarks with different access patterns and measure their execution times. The six benchmarks include two synthetic workloads, strided and reversed, and four real-life applications. We briefly introduce the six workloads as follows.

• strided - a synthetic program that reads a 1GB file in a strided fashion, reading every other 4KB of data from the beginning to the end of the file. There is a small amount of compute time after each read.

• reversed - a synthetic program that sequentially reads a 1GB file from its end to its beginning.

• CVS - a version control utility widely used in software development environments. We use the command cvs -q diff to compare a user's working directory with the CVS repository. Two identical sets of data are stored on disk with 50GB of space between them.

• diff - a Linux tool that compares files character by character. Similar to CVS, it accesses two data sets.

• grep - a text search tool that scans a collection of files for lines containing a match for a keyword in a given expression.

• TPC-H - a widely used decision support benchmark that runs business-oriented queries against a database system. We use PostgreSQL 7.3.18 as the database server, and the data set is generated using scale factor 1. Query 4 is used in the experiments.

Experimental Results

To examine the performance of sequence-based prefetching and history-aware prefetching in DiskSeen, we show the execution times of the benchmarks on the stock Linux kernel, and the times for their first and second runs on the kernel with the DiskSeen scheme, in Table 1. Note that,
