can be rewritten to use prefetching, with the help of a little code unrolling. Assuming
a cache line fits four consecutive elem values, a better structure of the loop might be:
const int kLookAhead = 4;            // Experiment to find which fixed distance works best
Prefetch(&elem[0]);                  // Prefetch first four elements
for (int i = 0; i < 4 * n; i += 4) { // Unrolled. Process and step 4 elements at a time
    Prefetch(&elem[i + kLookAhead]); // Prefetch cache line a few elements ahead
    Process(elem[i + 0]);            // Process the elements that have already
    Process(elem[i + 1]);            // been fetched into memory on the
    Process(elem[i + 2]);            // previous iteration
    Process(elem[i + 3]);
}
If the Process() function takes at least a quarter of the time of a memory stall, the
next cache line of elem values should have arrived in cache by the next iteration. (If it
is faster than that, it might be better to prefetch data two or more cache lines ahead.)
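The Prefetch() calls above are placeholders for a platform-specific mechanism. As a
concrete (hypothetical) rendering of the same loop, GCC and Clang expose the
__builtin_prefetch() intrinsic, which compiles to the target's prefetch instruction
where one exists:

// A sketch of the loop above using the GCC/Clang intrinsic. The second
// argument (0) marks the access as a read; the third (3) requests high
// temporal locality. Prefetching a few elements past the end of elem is
// harmless, as prefetch instructions do not fault on invalid addresses.
const int kLookAhead = 4;
__builtin_prefetch(&elem[0], 0, 3);
for (int i = 0; i < 4 * n; i += 4) {
    __builtin_prefetch(&elem[i + kLookAhead], 0, 3);
    Process(elem[i + 0]);
    Process(elem[i + 1]);
    Process(elem[i + 2]);
    Process(elem[i + 3]);
}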
For nonlinear structures, prefetching becomes trickier. The most straightforward
approach is to greedily prefetch all possible branches and hope for the best. For
example, the preorder traversal of a binary tree could be greedily prefetched as in the
following code.
void PreorderTraversal(Node *pNode)
{
    if (pNode == NULL) return;        // Terminate recursion at leaf children
    Prefetch(pNode->left);            // Greedily prefetch left traversal path
    Process(pNode);                   // Process the current node
    Prefetch(pNode->right);           // Greedily prefetch right traversal path
    PreorderTraversal(pNode->left);   // Recursively visit left subtree
    PreorderTraversal(pNode->right);  // then recursively visit right subtree
}
Greedy prefetching is simple to implement, but it is not very efficient. It is quite
possible that the prefetched areas will have been flushed out of the cache by the
processing function before the code returns to visit that part of the tree. Prefetching
can be more effective when used together with linearization (described in Section 13.5).
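For example (a sketch only, assuming the tree has been linearized into an array in
preorder as per Section 13.5), traversal order then matches storage order and the same
look-ahead pattern used earlier for the elem array applies directly:

// Sketch: nodes[] is assumed to hold the tree linearized in preorder,
// so the next nodes to visit sit at known, consecutive addresses.
const int kNodeLookAhead = 4; // Tune for node size relative to cache line size
for (int i = 0; i < numNodes; i++) {
    Prefetch(&nodes[i + kNodeLookAhead]); // Prefetching past the end is harmless
    Process(&nodes[i]);
}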
Another CPU architecture mechanism for addressing the high latency of memory
fetches is the use of nonblocking loads. Nonblocking loads allow the CPU to continue
executing instructions after a cache miss, stalling only if the register that memory
is being loaded into is accessed before the data is ready. Unless the hardware allows
more than one outstanding load, the CPU will also stall if a second load that misses
the cache is issued while the first one is still pending. While a load is outstanding,
hit-under-miss execution of loads is usually allowed; that is, a subsequent load
referring to data already in cache does not cause the CPU to stall.
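To benefit from nonblocking loads, code can be arranged so that loads are issued as
early as possible and their results consumed as late as possible. A hypothetical sketch
(the names pObjA, pObjB, and UnrelatedWork() are invented for illustration):

// With nonblocking loads, and hardware support for two outstanding
// misses, issuing both loads up front lets their latencies overlap;
// any stall moves to the first instruction that uses the results.
int a = pObjA->value;   // Load issued; may miss in cache
int b = pObjB->value;   // Issued while the first load is still in flight
UnrelatedWork();        // Independent work helps hide the miss latency
int sum = a + b;        // First use of a and b; stalls only if not yet ready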
 