Figure 3.9 Magnified results on sample cells from the BAEC dataset. From left to right: Chan and Vese, GAC, GAC with adaptive force, the proposed method, and the manual ground truth.
Computing a global quantity such as the maximum over all texels is an expensive operation. Using parallel reduction, this kind of operation can be
performed efficiently on the GPU [57]. The original texture is passed to the shader program as a parameter, and the target rendering texture area is half its size. Every neighborhood of four texels is considered and evaluated for the operation of interest. As shown in Figure 3.10, the maximum value of each group of four texels is written to the target texture. The result is passed back to the shader, and the target render area is reduced by half again. This operation is performed log N times, where N is the width of the original texture, and the result is the single maximum value among the texels. A tree-based parallel reduction using the ping-pong method of texture
memory access is bandwidth bound. In CUDA, shared memory is used to reduce
the bandwidth requirement and, with sequential (instead of interleaved) addressing, is free of shared-memory bank conflicts. Parallel reduction of N elements requires log(N) steps that together perform the same O(N) operations as a sequential algorithm. With P processing cores (P physically parallel threads), the total time complexity is reduced to O(N/P + log N), compared to O(N) for a sequential reduction.
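A minimal CUDA sketch of such a shared-memory reduction is given below, here specialized to finding a maximum. The kernel and host names (maxReduce and the small driver in main), the element count, and the block size of 256 threads are illustrative choices rather than details of the original implementation; each launch reduces the data by a factor of 2 * blockDim.x, and a handful of launches isolate the single maximum, in line with the O(N/P + log N) bound above.

#include <cstdio>
#include <cstdlib>
#include <cfloat>
#include <cuda_runtime.h>

// One reduction pass: each block reduces 2 * blockDim.x input elements to a
// single partial maximum written to out[blockIdx.x]. The in-block tree uses
// sequential addressing, so shared-memory accesses are free of bank conflicts.
// blockDim.x must be a power of two.
__global__ void maxReduce(const float* in, float* out, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x * 2 + threadIdx.x;

    // Each thread loads two elements and combines them on the way in,
    // halving the number of blocks needed.
    float m = -FLT_MAX;
    if (i < n)              m = in[i];
    if (i + blockDim.x < n) m = fmaxf(m, in[i + blockDim.x]);
    sdata[tid] = m;
    __syncthreads();

    // Tree reduction in shared memory with sequential addressing.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

int main()
{
    const int N = 1 << 20;          // number of values to reduce
    const int threads = 256;

    // Host data with a known maximum for verification.
    float* h = (float*)malloc(N * sizeof(float));
    for (int i = 0; i < N; ++i) h[i] = (float)(i % 1000);
    h[12345] = 1.0e6f;

    float *dIn, *dOut;
    cudaMalloc(&dIn,  N * sizeof(float));
    cudaMalloc(&dOut, N * sizeof(float));   // generous scratch buffer
    cudaMemcpy(dIn, h, N * sizeof(float), cudaMemcpyHostToDevice);

    // Launch repeatedly, swapping the two buffers, until one value remains.
    int n = N;
    while (n > 1) {
        int blocks = (n + threads * 2 - 1) / (threads * 2);
        maxReduce<<<blocks, threads, threads * sizeof(float)>>>(dIn, dOut, n);
        float* tmp = dIn; dIn = dOut; dOut = tmp;
        n = blocks;
    }

    float result;
    cudaMemcpy(&result, dIn, sizeof(float), cudaMemcpyDeviceToHost);
    printf("max = %f\n", result);   // expected: 1000000.0
    cudaFree(dIn); cudaFree(dOut); free(h);
    return 0;
}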
Figure 3.10 An example of using reduction to find the maximum value in the texture. The lightest shaded texel from each set of four neighboring texels is kept until the single white texel is isolated.
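For comparison, the texture ping-pong scheme illustrated in Figure 3.10 can be sketched in CUDA as well, with two plain device buffers standing in for the source and target render textures; the original formulation uses a fragment shader with render-to-texture, so the kernel below (maxPool2x2, with an illustrative host driver) is only an analogue of that approach, and it assumes the texture width is a power of two.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One ping-pong pass: each thread reads a 2x2 neighborhood of the w x w
// source buffer and writes the maximum of the four texels to the half-sized
// destination, as in Figure 3.10. w is assumed to be a power of two.
__global__ void maxPool2x2(const float* src, float* dst, int w)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int half = w / 2;
    if (x >= half || y >= half) return;

    float a = src[(2 * y)     * w + 2 * x];
    float b = src[(2 * y)     * w + 2 * x + 1];
    float c = src[(2 * y + 1) * w + 2 * x];
    float d = src[(2 * y + 1) * w + 2 * x + 1];
    dst[y * half + x] = fmaxf(fmaxf(a, b), fmaxf(c, d));
}

// Host driver: repeat the pass log2(w) times, swapping ("ping-ponging") the
// source and destination buffers, until a single texel remains.
float textureMax(float* dA, float* dB, int w)
{
    dim3 block(16, 16);
    while (w > 1) {
        int half = w / 2;
        dim3 grid((half + block.x - 1) / block.x,
                  (half + block.y - 1) / block.y);
        maxPool2x2<<<grid, block>>>(dA, dB, w);
        float* tmp = dA; dA = dB; dB = tmp;
        w = half;
    }
    float m;
    cudaMemcpy(&m, dA, sizeof(float), cudaMemcpyDeviceToHost);
    return m;
}

int main()
{
    const int w = 512;                            // texture width
    float* h = (float*)malloc(w * w * sizeof(float));
    for (int i = 0; i < w * w; ++i) h[i] = (float)(i % 100);
    h[777] = 12345.0f;                            // known maximum

    float *dA, *dB;
    cudaMalloc(&dA, w * w * sizeof(float));
    cudaMalloc(&dB, w * w * sizeof(float) / 4);   // half-sized target suffices
    cudaMemcpy(dA, h, w * w * sizeof(float), cudaMemcpyHostToDevice);

    printf("max = %f\n", textureMax(dA, dB, w));  // expected: 12345.0
    cudaFree(dA); cudaFree(dB); free(h);
    return 0;
}

Because every pass of this variant is a separate launch that streams through global memory, it is bandwidth bound, which is exactly the cost the shared-memory reduction above avoids.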