Fairly soon, large chips will have tens of billions of transistors. Such chips are far too large to design one gate and one wire at a time. The human effort required would render the chips obsolete by the time they were finished. The only feasible approach is to use cores (essentially libraries) containing fairly large subassemblies and to place and interconnect them on the chip as needed. Designers then have to determine which CPU core to use for the control processor and which special-purpose processors to throw in to help it. Putting more of the burden on software running on the control processor makes the system slower but yields a smaller (and cheaper) chip. Having multiple special-purpose processors for audio and video processing takes up chip area, increasing the cost, but produces higher performance at a lower clock rate, which means lower power consumption and less heat dissipation. Thus chip designers increasingly contend with these macroscopic trade-offs rather than worrying about where to place each transistor.
Audiovisual applications are very data intensive. Huge amounts of data have to be processed quickly, so typically 50% to 75% of the chip area is devoted to memory in one form or another, and the amount is rising. The design issues here are numerous. How many levels of cache should be used? Should the cache(s) be split or unified? How big should each cache be? How fast should each be? Should some actual memory go on the chip, too? Should it be SRAM or SDRAM? The answer to each of these questions has major implications for the performance, energy consumption, and heat dissipation of the chip.
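These questions are easiest to see as a design space. The C declarations below are a hypothetical sketch of that space; the names (cache_level, memory_plan, and so on) are invented for illustration and do not come from any real design tool. Each field simply encodes one of the questions above.

```c
#include <stdbool.h>
#include <stdint.h>

enum mem_kind { MEM_SRAM, MEM_SDRAM };   /* SRAM or SDRAM? */

/* One level of on-chip cache. */
struct cache_level {
    bool     split;           /* split I- and D-caches, or unified? */
    uint32_t size_kib;        /* how big should it be?              */
    uint32_t latency_cycles;  /* how fast should it be?            */
};

/* The overall on-chip memory plan. */
struct memory_plan {
    int      num_levels;          /* how many levels of cache?      */
    struct cache_level level[3];  /* parameters for each level      */
    bool     memory_on_chip;      /* put actual memory on chip too? */
    enum mem_kind on_chip_kind;   /* if so, SRAM or SDRAM?          */
};
```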
Besides design of the processors and memory system, another issue of considerable consequence is the communication system: how do all the cores communicate with each other? For small systems, a single bus will usually do the trick, but for larger ones it rapidly becomes a bottleneck. Often the problem can be solved by going to multiple buses or possibly a ring from core to core. In the latter case, arbitration is handled by passing a small packet called a token around the ring. To transmit, a core must first capture the token. When it is done, it puts the token back on the ring so it can continue circulating. This protocol prevents collisions on the ring.
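A short simulation makes the arbitration rule concrete. The C sketch below is illustrative only, not any real ring's implementation (wants_to_send is a made-up stand-in for a core's transmit queue): the token visits each core in turn, and a core may transmit only while holding it, which is why no two cores can ever collide.

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_CORES 4

/* Hypothetical stand-in: does this core have data queued to send? */
static bool wants_to_send(int core)
{
    return core % 2 == 0;   /* pretend the even-numbered cores do */
}

int main(void)
{
    int token = 0;   /* the core currently holding the token */

    for (int step = 0; step < 2 * NUM_CORES; step++) {
        if (wants_to_send(token))
            printf("core %d captures the token and transmits\n", token);
        /* Done (or nothing to send): put the token back on the
         * ring so it continues circulating to the next core. */
        token = (token + 1) % NUM_CORES;
    }
    return 0;
}
```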
As an example of an on-chip interconnect, look at the IBM CoreConnect, illustrated in Fig. 8-13. It is an architecture for connecting cores on a single-chip heterogeneous multiprocessor, especially complete system-on-a-chip designs. In a sense, CoreConnect is to one-chip multiprocessors what the PCI bus was to the Pentium: the glue that holds all the parts together. (In modern Core i7 systems, PCIe plays that role, but unlike PCI it is a point-to-point network rather than a shared bus.) However, unlike the PCI bus, CoreConnect was designed without any requirement to be backward compatible with legacy equipment or protocols, and without the constraints of board-level buses, such as limits on the number of pins the edge connector can have.
CoreConnect consists of three buses. The processor bus is a high-speed, synchronous, pipelined bus with 32, 64, or 128 data lines clocked at 66, 133, or 183 MHz. The maximum throughput is thus 23.4 Gbps (vs. 4.2 Gbps for the PCI bus).
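The 23.4-Gbps figure is just the widest data path times the fastest clock. The C snippet below checks both quoted numbers; peak_gbps is a made-up helper for this calculation, not part of any CoreConnect API.

```c
#include <stdio.h>

/* Peak throughput = data-path width (bits) x clock rate (MHz),
 * which gives Mbps; divide by 1000 to get Gbps. */
static double peak_gbps(int data_bits, double clock_mhz)
{
    return data_bits * clock_mhz / 1000.0;
}

int main(void)
{
    printf("CoreConnect processor bus: %.1f Gbps\n", peak_gbps(128, 183.0));
    printf("PCI bus:                   %.1f Gbps\n", peak_gbps(64, 66.0));
    return 0;
}
```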