Thread-Level Parallelism - Computer Architecture: A Quantitative Approach


A Multiprogramming and OS Workload

Our next study is a multiprogrammed workload consisting of both user activity and OS activity. The workload used is two independent copies of the compile phases of the Andrew benchmark, a benchmark that emulates a software development environment. The compile phase consists of a parallel version of the Unix “make” command executed using eight processors. The workload runs for 5.24 seconds on eight processors, creating 203 processes and performing 787 disk requests on three different file systems. The workload is run with 128 MB of memory, and no paging activity takes place.
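
To put the quoted totals in perspective, the short sketch below converts them into average rates over the run. It uses only the figures stated above (5.24 seconds, eight processors, 203 processes, 787 disk requests); the variable and function names are illustrative, and the averages say nothing about how the activity is distributed across the run.

```python
# Figures quoted in the text for the multiprogrammed "make" workload.
RUN_TIME_S = 5.24
PROCESSORS = 8
PROCESSES_CREATED = 203
DISK_REQUESTS = 787


def per_second(count, seconds=RUN_TIME_S):
    """Average rate of an event over the whole run."""
    return count / seconds


# Average rates over the full run (not per phase).
process_rate = per_second(PROCESSES_CREATED)
disk_rate = per_second(DISK_REQUESTS)

# Average number of disk requests issued per process created.
requests_per_process = DISK_REQUESTS / PROCESSES_CREATED

# Total processor time available across all eight processors.
processor_seconds = RUN_TIME_S * PROCESSORS

print(f"{process_rate:.1f} processes created per second")
print(f"{disk_rate:.1f} disk requests per second")
print(f"{requests_per_process:.2f} disk requests per process")
print(f"{processor_seconds:.2f} processor-seconds available")
```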

The workload has three distinct phases: compiling the benchmarks, which involves substantial compute activity; installing the object files in a library; and removing the object files. The last phase is completely dominated by I/O, and only two processes are active (one for each of the runs). In the middle phase, I/O also plays a major role, and the processor is largely idle. The overall workload is much more system and I/O intensive than the highly tuned commercial workload.

For the workload measurements, we assume the following memory and I/O systems:

■ Level 1 instruction cache: 32 KB, two-way set associative with a 64-byte block, 1 clock cycle hit time.

■ Level 1 data cache: 32 KB, two-way set associative with a 32-byte block, 1 clock cycle hit time. We vary the L1 data cache to examine its effect on cache behavior.

■ Level 2 cache: 1 MB unified, two-way set associative with a 128-byte block, 10 clock cycle hit time.

■ Main memory: Single memory on a bus with an access time of 100 clock cycles.

■ Disk system: Fixed-access latency of 3 ms (less than normal to reduce idle time).
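
The latencies listed above can be combined into an average memory access time (AMAT) estimate. The following is a minimal sketch, assuming the usual two-level formula in which the L2 hit time is the L1 miss penalty and the main memory access time is the L2 miss penalty; it ignores bus contention and write traffic. The miss rates passed in the example are hypothetical placeholders, not measurements from this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    name: str
    access_cycles: int  # hit time (caches) or access time (main memory)


# Latencies taken from the parameter list in the text.
L1 = Level("L1 cache", 1)
L2 = Level("L2 cache", 10)
MEMORY = Level("main memory", 100)


def amat(l1_miss_rate, l2_local_miss_rate):
    """Average memory access time in clock cycles.

    AMAT = L1 hit time
           + L1 miss rate * (L2 hit time + L2 local miss rate * memory time)
    """
    l1_miss_penalty = L2.access_cycles + l2_local_miss_rate * MEMORY.access_cycles
    return L1.access_cycles + l1_miss_rate * l1_miss_penalty


# Hypothetical miss rates, chosen only to exercise the formula.
example = amat(l1_miss_rate=0.02, l2_local_miss_rate=0.25)

# Bounds: every access hits in L1, or every access goes all the way to memory.
best_case = amat(0.0, 0.0)
worst_case = amat(1.0, 1.0)

print(f"example AMAT: {example:.2f} cycles")
print(f"best case: {best_case:.0f} cycle, worst case: {worst_case:.0f} cycles")
```

Under these assumptions, an access that misses in both caches costs 1 + 10 + 100 = 111 clock cycles, which is why even small changes in miss rate matter.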

Figure 5.16 shows how the execution time breaks down for the eight processors using the parameters just listed. Execution time is broken down into four components:

1. Idle: Execution in the kernel mode idle loop

2. User: Execution in user code

3. Synchronization: Execution or waiting for synchronization variables

4. Kernel: Execution in the OS that is neither idle nor in synchronization access
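
The four components above amount to a classification rule applied to each interval of execution. The sketch below is one possible reading of that rule, not the method used by the measurement tools: the precedence (idle first, then synchronization, then user, then kernel) is an assumption, and the sample cycle counts are hypothetical values made up to exercise the code.

```python
from collections import defaultdict

CATEGORIES = ("idle", "user", "synchronization", "kernel")


def classify(mode, in_idle_loop, in_sync):
    """Map one execution interval to one of the four components.

    The precedence order here is an assumption made for illustration.
    """
    if mode == "kernel" and in_idle_loop:
        return "idle"
    if in_sync:
        return "synchronization"
    if mode == "user":
        return "user"
    return "kernel"  # OS execution that is neither idle nor synchronization


def breakdown(samples):
    """Return the fraction of total cycles spent in each category."""
    totals = defaultdict(int)
    for mode, in_idle_loop, in_sync, cycles in samples:
        totals[classify(mode, in_idle_loop, in_sync)] += cycles
    grand_total = sum(totals.values())
    return {name: totals[name] / grand_total for name in CATEGORIES}


# Hypothetical intervals: (mode, in_idle_loop, in_sync, cycles).
SAMPLES = [
    ("kernel", True, False, 600),
    ("user", False, False, 250),
    ("user", False, True, 10),
    ("kernel", False, True, 20),
    ("kernel", False, False, 120),
]

fractions = breakdown(SAMPLES)
for name in CATEGORIES:
    print(f"{name:>15}: {fractions[name]:.1%}")
```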

FIGURE 5.16 The distribution of execution time in the multiprogrammed parallel “make” workload. The high fraction of idle time is due to disk latency when only one of the eight processors is active. These data and the subsequent measurements for this workload were collected with the SimOS system [Rosenblum et al. 1995]. The actual runs and data collection were done by M. Rosenblum, S. Herrod, and E. Bugnion of Stanford University.

This multiprogramming workload has a significant instruction cache performance loss, at least for the OS. The instruction cache miss rate in the OS for a 64-byte block size, two-way set associative cache varies from 1.7% for a 32 KB cache to 0.2% for a 256 KB cache. User-level
