ASIC Architecture to Determine Object Centroids from Gray-Scale Images Using Marching Pixels (WiMoA 2011 and ICCSEA 2011) Part 2

Local Datapath

The local data path of the MP architecture is characterized by orthogonal serial data connections between the CUs. Figure 6 shows a black-box representation of a CU. It depicts all input data from the neighboring CUs that are needed to compute the y-coordinate of the object's centroid.

Fig. 6. Data inputs of the Flooding-CU

Note 3. For better readability, the indices of the von Neumann neighborhood are substituted as follows (compare subsection 3.1):

tmp3E-93_thumb[2]

The data flow diagram in figure 7 illustrates the forward calculation, which computes all required values in the y direction. The inputs are the pixels of the von Neumann neighborhood (identified by the last index in the signal names) and the CU's own pixel value (P0). The results are the bounding box state w0 as well as the zeroth moment (m00) and the first moment (m10). All signal names in gray boxes are output registers; signal names without a box are the CU inputs already shown in figure 6.


Fig. 7. Data flow diagram (forward calculation)
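
For orientation, the quantities named above can be written in their usual global form. This is a sketch under assumptions: p(x, y) denotes the pixel value at position (x, y) inside the object's bounding box, and the index convention (m10 as the first moment in y) simply follows the wording of this section. The CUs accumulate these sums step by step from their neighbors' values rather than evaluating them in closed form.

```latex
m_{00} = \sum_{x}\sum_{y} p(x, y), \qquad
m_{10} = \sum_{x}\sum_{y} y \, p(x, y), \qquad
y_c = \frac{m_{10}}{m_{00}}
```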

To carry out the backward calculation, the moments produced by the forward calculation and the bounding box state at the current pixel position are required. In addition, the edge information ex|0 and ey|0 has to be derived from the bounding box states w4 and w6 (the bottom and right pixel neighbors). Figure 8 shows the data dependencies in the same way as figure 7.

Fig. 8. Data flow diagram including the edge detection (backward calculation)

Prototype Chip

Design Considerations

Before the global chip architecture could be established, we had to investigate whether the algorithms described above can be implemented in hardware. The design analysis led to the following results:

— All CU arithmetic units (adders, subtracters, comparators) are implemented as bit-serial modules. The main reason is that the required area is significantly smaller than for a bit-parallel CU implementation. Once the CU chip layout is clean for a given gray-scale pixel resolution, it can be reused for any image resolution. The CU design no longer depends on the array resolution.

— A hierarchical design strategy in three phases is recommended. The CU itself, a line of CUs and the CU-array composed of CU-lines including the chip IO resources are designed separately as hardware blocks.

— A flat design strategy is not recommended due to congestion problems during the physical chip routing process.
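
The area argument for bit-serial arithmetic can be illustrated with a small software model. The sketch below is not taken from the chip's behavioral description; it assumes operands are streamed least significant bit first and that a single carry register is kept between clock cycles. The helper names are hypothetical.

```python
def to_bits_lsb_first(value, width):
    """Split a non-negative integer into `width` bits, least significant first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits_lsb_first(bits):
    """Reassemble an integer from bits given least significant first."""
    return sum(bit << i for i, bit in enumerate(bits))


def bit_serial_add(a_bits, b_bits):
    """Add two equally long bit streams one bit per step.

    Models a single full adder plus one carry flip-flop: each loop
    iteration corresponds to one clock cycle. Returns the sum bits and
    the final carry.
    """
    carry = 0
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return sum_bits, carry


# Usage: add two 8-bit values; the result needs a ninth bit (the carry).
WIDTH = 8
bits, carry_out = bit_serial_add(to_bits_lsb_first(200, WIDTH),
                                 to_bits_lsb_first(100, WIDTH))
total = from_bits_lsb_first(bits) + (carry_out << WIDTH)
```

The hardware cost of this scheme is one full adder and one flip-flop regardless of the word width, at the price of one clock cycle per bit. That trade-off is what makes the CU small and independent of the array resolution.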

Physical Implementation

The physical implementation of the MP chip was carried out for the binary image processing flow described in subsection 4.1, applying the design considerations listed above. In this case the bit-serial CU has to be designed only once. The behavioral description of the entire chip is consistently generic with respect to both the vertical and the horizontal MP array resolutions n and m. Table 1 summarizes the layout results for one CU, one CU-line and the entire 64×64 MP chip design driven by a 50 MHz clock.

Table 1. Layout parameters of the chip modules using a 90 nm CMOS technology

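| Parameter | CU | CU-line | Chip |
|---|---|---|---|
| Ports: data inputs | 18 | 640 | 64 |
| Ports: control inputs | 6 | 6 | 4 |
| Ports: data outputs | 12 | 768 | 64 |
| Ports: control outputs | 1 (single value in the source; column unclear) | | |
| Used metal layers | 3 | 6 | 7 |
| Standard cells/macro blocks | 172 | 25/64 | 767/64 |
| Gates | 536 | 34413 | 2207516 |
| Height / µm | 41.22 | 45.14 | 3636 |
| Width / µm | 41.60 | 2755.00 | 3424 |
| Critical path latency / ns | 1.48 | 2.59 | 9.67 |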
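
A few consistency checks on the Table 1 figures can be done with simple arithmetic. The sketch below only uses numbers stated in Table 1 and the 50 MHz clock mentioned in the text. Reading the gate-count differences as "glue logic outside the replicated blocks" is an assumption, not a statement from the paper.

```python
# Values copied from Table 1 (90 nm CMOS, 64x64 array).
N = 64
CU_GATES = 536
LINE_GATES = 34413
CHIP_GATES = 2207516
CHIP_CRITICAL_PATH_NS = 9.67
CLOCK_MHZ = 50.0

# Gates in one CU-line beyond its 64 replicated CUs.
line_extra_gates = LINE_GATES - N * CU_GATES      # 109

# Gates in the chip beyond its 64 replicated CU-lines.
chip_extra_gates = CHIP_GATES - N * LINE_GATES    # 5084

# Timing: clock period at 50 MHz versus the chip-level critical path.
clock_period_ns = 1000.0 / CLOCK_MHZ              # 20.0 ns
slack_ns = clock_period_ns - CHIP_CRITICAL_PATH_NS

# Upper bound on the clock frequency implied by the critical path alone.
max_clock_mhz = 1000.0 / CHIP_CRITICAL_PATH_NS

print(line_extra_gates, chip_extra_gates, round(slack_ns, 2), round(max_clock_mhz, 1))
```

Under this reading, almost all of the chip's gates sit in the replicated CUs, and the 9.67 ns critical path leaves roughly half of the 20 ns clock period unused.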

Figure 9 shows the resulting prototype chip as a GDSII database representation. The square chip core is formed by 64×64 MP calculation units. The pad ring consists of the data input and output pads (64 pads each) located at the top and bottom chip edges. The left and right pad ring segments contain eight pairs of core power supply pads (four each).

Fig. 9. Chip prototype layout (without bonding pads)

Benchmark Comparison

To demonstrate the performance of chip architectures derived from our MP design strategy, we compared them with simulation results for two different TMS320 DSP platforms. The older C6416 and the current DaVinci platform (DM6446) were simulated at a virtual CPU clock of 500 MHz. For this purpose, a software benchmark running on the DSP core models was written in manually optimized C code. The computation of centroids is based on projections of binary and gray-scale image objects (with a bit width of eight), where the algorithmic approach is similar to [10].
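
A projection-based centroid computation of the kind used as the software baseline can be sketched as follows. This is not the authors' optimized C benchmark; it is a plain Python illustration, assuming the image is a list of rows of non-negative pixel values and contains a single object. The function names are hypothetical.

```python
def projections(image):
    """Return the row projection (sum per row) and column projection (sum per column)."""
    row_proj = [sum(row) for row in image]
    col_proj = [sum(col) for col in zip(*image)]
    return row_proj, col_proj


def centroid_from_projections(image):
    """Compute the gray-scale centroid (cx, cy) from the two projections.

    Returns None for an empty (all-zero) image, where the centroid is undefined.
    """
    row_proj, col_proj = projections(image)
    mass = sum(row_proj)  # zeroth moment
    if mass == 0:
        return None
    cy = sum(y * v for y, v in enumerate(row_proj)) / mass
    cx = sum(x * v for x, v in enumerate(col_proj)) / mass
    return cx, cy


# Usage: a small 8-bit gray-scale object, brighter on its right side.
image = [
    [0,  0,  0, 0],
    [0, 10, 30, 0],
    [0, 10, 30, 0],
]
print(centroid_from_projections(image))  # (1.75, 1.5)
```

For a binary image the same code applies with pixel values 0 and 1. A sequential processor has to touch every pixel once to build the projections, which is why the latency of this approach grows with the object area.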

In addition to the absolute worst-case latencies shown in figure 10, the achieved speedups (figure 11) are plotted as a function of the resolution N² of square worst-case objects.

Fig. 10. MP-latencies vs. DSP-benchmarks for squared worst case objects

Fig. 11. Speedups for squared worst case objects

Table 2. MP architectures versus emergence and self-organization

| Attribute | True? | Comment |
|---|---|---|
| Emergence | | |
| Micro-macro effect | yes | |
| Radical novelty | yes | |
| Coherence | yes | |
| Local interacting components | yes | |
| Dynamic (latency) | yes | |
| Decentralized control | yes | |
| Bidirectional link | no | no feedback of the emergent |
| Robustness and flexibility | limited | robustness against disturbances in image objects |
| Self-organization | | |
| Increasing in order | irrelevant | no design target |
| Autonomy | yes | |
| Robustness (flexibility) | yes | no change of the local CU behavior |
| Dynamic | irrelevant | self-contained during an MP processing cycle |

Conclusions and Outlook

This paper presents an architecture overview as well as the design strategy for realizing a Marching Pixels prototype chip. The Marching Pixels concept is an alternative design paradigm for determining global image information in an elegant fashion on a dedicated hardware platform. We showed that an array of simply structured hardware modules is able to extract the centroids of image objects of any shape and size faster than commonly used DSP platforms.

We note that the benchmark comparison results are based on serial data input and on the largest possible worst-case object, which depends on the MP array resolution. The computation latencies decrease dramatically when the line-parallel data input scheme supported by our chip architecture is used and when the image contains several objects with smaller form factors than the worst-case object.

Finally, Table 2 compares the attributes of emergence and self-organization defined in [13] with the properties of our MP architectures.

At the moment, the emergent sends no feedback to the locally interacting MPs. Therefore, there is no termination condition that stops the algorithm while a given image is being computed. The algorithm always has to remain active for the time needed to compute the worst-case object. A remedy for this drawback has already been found: the projection data in x and y are continuously stored in register banks at the CU-array boundaries. If the centroids remain at their positions, these data registers no longer change. This is the signal to terminate the algorithm and to output the centroid image. Applying this mechanism to the present MP architecture supplements the system characteristic with the still-missing bidirectional link.
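
The proposed termination mechanism amounts to comparing the boundary projection registers between consecutive steps. The following sketch models that idea in software only. The `step` and `read_projections` callables are hypothetical stand-ins for one MP array update and for reading the boundary register banks, and `max_steps` plays the role of the worst-case run time.

```python
def run_until_stable(step, initial_state, read_projections, max_steps):
    """Advance the array until the boundary projections stop changing.

    Returns the final state and the number of steps executed. If the
    projections never settle, the loop stops after `max_steps`, which
    corresponds to the current behavior of always running for the
    worst-case time.
    """
    state = initial_state
    previous = read_projections(state)
    for steps_done in range(1, max_steps + 1):
        state = step(state)
        current = read_projections(state)
        if current == previous:
            # Registers unchanged: the centroids have stopped moving.
            return state, steps_done
        previous = current
    return state, max_steps


# Toy usage: a counter that saturates at 5 stands in for the MP array.
toy_step = lambda s: min(s + 1, 5)
toy_read = lambda s: (s,)
result = run_until_stable(toy_step, 0, toy_read, max_steps=100)
print(result)  # (5, 6)
```

The extra step in the toy example (six steps to reach a state that is already final after five) reflects that stability can only be detected one step after it occurs.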
