can be rewritten to use prefetching, with the help of a little code unrolling. Assuming
a cache line fits four consecutive elem values, a better structure of the loop might be:
const int kLookAhead = 4;            // Experiment to find which fixed distance works best
Prefetch(&elem[0]);                  // Prefetch first four elements
for (int i = 0; i < 4 * n; i += 4) { // Unrolled. Process and step 4 elements at a time
    Prefetch(&elem[i + kLookAhead]); // Prefetch cache line a few elements ahead
    Process(elem[i + 0]);            // Process the elements that have already
    Process(elem[i + 1]);            // been fetched into memory on the
    Process(elem[i + 2]);            // previous iteration
    Process(elem[i + 3]);
}
If the Process() function takes at least a quarter of the time of a memory stall, the
next cache line of elem values should have arrived in cache by the next iteration. (If it
is faster than that, it might be better to prefetch data two or more cache lines ahead.)
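The Prefetch() calls above are placeholders for a platform-specific mechanism. As a
concrete (hypothetical) rendering of the same loop, GCC and Clang expose the
__builtin_prefetch() intrinsic, which compiles to the target's prefetch instruction
where one exists:

// A sketch of the loop above using the GCC/Clang intrinsic. The second
// argument (0) marks the access as a read; the third (3) requests high
// temporal locality. Prefetching a few elements past the end of elem is
// harmless, as prefetch instructions do not fault on invalid addresses.
const int kLookAhead = 4;
__builtin_prefetch(&elem[0], 0, 3);
for (int i = 0; i < 4 * n; i += 4) {
    __builtin_prefetch(&elem[i + kLookAhead], 0, 3);
    Process(elem[i + 0]);
    Process(elem[i + 1]);
    Process(elem[i + 2]);
    Process(elem[i + 3]);
}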
For nonlinear structures, prefetching becomes trickier. The most straightforward
approach is to greedily prefetch all possible branches and hope for the best. For
example, the preorder traversal of a binary tree could be greedily prefetched as in the
following code.
void PreorderTraversal(Node *pNode)
{
    if (pNode == NULL) return;        // Terminate recursion at leaf children
    Prefetch(pNode->left);            // Greedily prefetch left traversal path
    Process(pNode);                   // Process the current node
    Prefetch(pNode->right);           // Greedily prefetch right traversal path
    PreorderTraversal(pNode->left);   // Recursively visit left subtree
    PreorderTraversal(pNode->right);  // then recursively visit right subtree
}
Greedy prefetching is simple to implement, but it is not very efficient. It is quite
possible that the prefetched areas will have been flushed out of the cache by the
processing function before the code returns to visit that part of the tree. Prefetching
can be more effective when used together with linearization (described in Section 13.5).
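For example (a sketch only, assuming the tree has been linearized into an array in
preorder as per Section 13.5), traversal order then matches storage order and the same
look-ahead pattern used earlier for the elem array applies directly:

// Sketch: nodes[] is assumed to hold the tree linearized in preorder,
// so the next nodes to visit sit at known, consecutive addresses.
const int kNodeLookAhead = 4; // Tune for node size relative to cache line size
for (int i = 0; i < numNodes; i++) {
    Prefetch(&nodes[i + kNodeLookAhead]); // Prefetching past the end is harmless
    Process(&nodes[i]);
}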
Another CPU architecture mechanism for addressing the high latency of memory
fetches is the use of nonblocking loads. Nonblocking loads allow the CPU to continue
executing instructions after a cache miss, stalling only if the register that memory
is being loaded into is accessed before the data is ready. Unless the hardware allows
more than one outstanding load, the CPU will also stall if a second load that misses
the cache is issued while the first one is still pending. While a load is outstanding,
hit-under-miss execution of loads is usually allowed; that is, a subsequent load
referring to data already in cache does not cause the CPU to stall.
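To benefit from nonblocking loads, code can be arranged so that loads are issued as
early as possible and their results consumed as late as possible. A hypothetical sketch
(the names pObjA, pObjB, and UnrelatedWork() are invented for illustration):

// With nonblocking loads, and hardware support for two outstanding
// misses, issuing both loads up front lets their latencies overlap;
// any stall moves to the first instruction that uses the results.
int a = pObjA->value;   // Load issued; may miss in cache
int b = pObjB->value;   // Issued while the first load is still in flight
UnrelatedWork();        // Independent work helps hide the miss latency
int sum = a + b;        // First use of a and b; stalls only if not yet ready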
 