DRAM ICs. The complexity of application processor SoCs with many internal clients competing for external memory resources slows down access. The large number of client memory requests must be buffered and passed through multiple levels of interconnect networks to ensure consistency and arbitration. As a result, fetching data from external memory into the processing engines incurs a long latency. Thus, when designing and implementing AR and CV functions, one should understand these bottlenecks and reduce memory bandwidth as much as possible; AR and CV algorithms and implementations must take these limitations into account. Next, we highlight some of the rules that can be followed to improve throughput and reduce memory bandwidth.
The most important rule is to avoid reading data multiple times: instead, move chunks of data into local memories and apply as many operations as possible to the local data. This is similar to how the graph model in the OpenVX framework [11] works, where intermediate data nodes of the graph are kept in local memory.
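As an illustration, the following sketch builds a two-node OpenVX graph; the image size and the two filter kernels are arbitrary choices for the example. The virtual image between the nodes is exactly such an intermediate data node: the runtime is free to keep it in local memory and never spill it to external DRAM.

#include <VX/vx.h>

// Minimal OpenVX graph sketch (image size and kernels are illustrative).
// The virtual image "tmp" is an intermediate data node that the runtime
// may place in local memory.
void build_and_run(vx_context ctx) {
    vx_graph graph = vxCreateGraph(ctx);
    vx_image in  = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);
    vx_image tmp = vxCreateVirtualImage(graph, 0, 0, VX_DF_IMAGE_VIRT);
    vx_image out = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);

    vxGaussian3x3Node(graph, in, tmp);  // smooth into the virtual image
    vxMedian3x3Node(graph, tmp, out);   // consume it without a DRAM round trip

    vxVerifyGraph(graph);               // gives the runtime a chance to plan buffer placement
    vxProcessGraph(graph);
}

Because virtual images have no host-visible backing store, graph verification can decide on tiling and buffer placement before any data is moved.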
Other optimization options are to compress data during transfers to and from external SDRAM, or to hide large latencies by prefetching data. Where deterministic access is mandatory, double-buffered DMA transfers into local memory are used: while one buffer waits for data, the other buffer is processed without wait-states.
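A minimal sketch of this ping-pong scheme follows, with std::async standing in for the DMA controller (fetch_tile and process_tile are hypothetical stand-ins, not part of any real HAL): the transfer of tile i+1 overlaps the processing of tile i, so the compute side never stalls.

#include <cstddef>
#include <future>
#include <vector>

// Hypothetical stand-ins: on real hardware fetch_tile would be a DMA
// descriptor kicked off by the controller, not a CPU-side copy.
std::vector<float> fetch_tile(std::size_t index) { return std::vector<float>(4096, float(index)); }
void process_tile(const std::vector<float>&) { /* local-memory compute */ }

// Ping-pong double buffering: the transfer of tile i+1 runs while tile i
// is being processed, so processing proceeds without wait-states.
void run_double_buffered(std::size_t num_tiles) {
    auto pending = std::async(std::launch::async, fetch_tile, std::size_t{0});
    for (std::size_t i = 0; i < num_tiles; ++i) {
        std::vector<float> current = pending.get();   // tile i is ready
        if (i + 1 < num_tiles)                        // start filling the other buffer
            pending = std::async(std::launch::async, fetch_tile, i + 1);
        process_tile(current);                        // compute while the next transfer runs
    }
}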
When moving from 2D or 3D sparse features to full 3D dense point clouds, the databases become huge, and hence more effort is required to manage external memory access efficiently. Matching or comparing one large database (e.g., a point cloud) against another (e.g., using Iterative Closest Point (ICP)) cannot be performed by exhaustive search, since that would lead to O(n²) complexity. Index structures such as binary trees are needed to reduce the complexity to O(n log n). But this leads to another challenge: the data access becomes non-deterministic, and prefetching is no longer possible. GPUs solve this with a different method.
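ICP implementations commonly use a k-d tree, one such binary space-partitioning structure, for the nearest-neighbour queries. The minimal 3-D version below (all names are ours, for illustration) shows both the O(log n) average query and why the access pattern is data dependent: which branch is pruned depends on the query point itself, so the addresses touched cannot be known ahead of time.

#include <algorithm>
#include <array>
#include <limits>
#include <vector>

using Point = std::array<float, 3>;

// In-place k-d tree build: each level splits at the median of the next
// axis (x, y, z, x, ...), using the same midpoint rule as the query.
void build(std::vector<Point>& pts, int lo, int hi, int axis) {
    if (hi - lo <= 1) return;
    int mid = (lo + hi) / 2;
    std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi,
                     [axis](const Point& a, const Point& b) { return a[axis] < b[axis]; });
    build(pts, lo, mid, (axis + 1) % 3);
    build(pts, mid + 1, hi, (axis + 1) % 3);
}

float dist2(const Point& a, const Point& b) {
    float s = 0;
    for (int i = 0; i < 3; ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return s;
}

// Recursive nearest-neighbour query. The pruning test on the far branch
// depends on the data, which is what makes the memory access pattern
// non-deterministic and defeats prefetching.
void nearest(const std::vector<Point>& pts, int lo, int hi, int axis,
             const Point& q, int& best, float& bestD2) {
    if (lo >= hi) return;
    int mid = (lo + hi) / 2;
    float d2 = dist2(q, pts[mid]);
    if (d2 < bestD2) { bestD2 = d2; best = mid; }
    float diff = q[axis] - pts[mid][axis];
    int nearLo = diff < 0 ? lo : mid + 1, nearHi = diff < 0 ? mid : hi;
    int farLo  = diff < 0 ? mid + 1 : lo, farHi  = diff < 0 ? hi : mid;
    nearest(pts, nearLo, nearHi, (axis + 1) % 3, q, best, bestD2);
    if (diff * diff < bestD2)  // descend the far side only if it could hold a closer point
        nearest(pts, farLo, farHi, (axis + 1) % 3, q, best, bestD2);
}

int nearest_neighbor(const std::vector<Point>& pts, const Point& q) {
    int best = -1;
    float bestD2 = std::numeric_limits<float>::max();
    nearest(pts, 0, static_cast<int>(pts.size()), 0, q, best, bestD2);
    return best;
}

A full ICP iteration would query the nearest neighbour of every transformed source point against such a tree and then re-estimate the rigid transform from the resulting correspondences.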
GPUs make use of caches and run a large number of tasks in parallel. Tasks that have to wait while their data are fetched from external memory hand their computation unit over to the next task that is ready. If enough tasks are initiated, the memory latencies are hidden and all available computation units can be fully utilized.
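The effect can be mimicked on a CPU with thread oversubscription. The sketch below is only an analogy (the real mechanism is the GPU's hardware warp scheduler, not OS threads), but it shows the principle: each task first stalls on a simulated memory fetch, and with many tasks in flight the stalls of some overlap the compute of others, so total wall time approaches the latency divided by the available parallelism rather than the latency times the task count.

#include <chrono>
#include <future>
#include <thread>
#include <vector>

// Each "task" stalls on a simulated memory fetch before computing.
long task(int i) {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));  // simulated memory latency
    return static_cast<long>(i) * i;                             // compute phase
}

int main() {
    // Oversubscribe: far more tasks in flight than hardware threads, so
    // a stalled task's execution slot goes to one that is ready.
    std::vector<std::future<long>> inflight;
    for (int i = 0; i < 256; ++i)
        inflight.push_back(std::async(std::launch::async, task, i));
    long sum = 0;
    for (auto& f : inflight) sum += f.get();  // wall time << 256 * 10 ms
    return 0;
}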
Figure 7.17 shows the architecture of the Metaio AR Engine. Following the considerations stated in the previous sections, the accelerator blocks only access local memories, which ideally can be read with zero cycles of delay. The buffers are prefilled and emptied by an autonomous DMA controller. As the matcher typically runs in parallel with other algorithms and has to access large databases in external memory, it has its own interface to the memory controller. To retain maximal flexibility in streamlining data flows for the individual application, a programmable core has been added as an embedded control unit. It is software programmable and sets up all the individual operations and data transfers. This flexibility also allows the implementation of an OpenVX-like graph model of the data flow between the individual units. The engine can operate completely autonomously. It communicates with the host CPU (typically an ARM core) via interrupts and through a dedicated host interface. The host can access parts of the engine's internal memories to set up parameters and commands.
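As a purely hypothetical sketch (the engine's actual register map and command set are not given here), such a host interface typically boils down to writing parameters and a command word into the engine's memory, kicking it off, and waiting for the completion interrupt:

#include <cstdint>

// Entirely hypothetical register layout; the real memory map belongs to
// the SoC. The pattern itself (write parameters, write a command, start,
// wait for an interrupt/status bit) is the generic one.
struct EngineRegs {
    volatile std::uint32_t param[4];  // kernel parameters set up by the host
    volatile std::uint32_t command;   // operation selector
    volatile std::uint32_t start;     // write 1 to start the configured graph
    volatile std::uint32_t status;    // bit 0 raised on completion, with the IRQ
};

void run_engine(EngineRegs* regs, std::uint32_t cmd, const std::uint32_t p[4]) {
    for (int i = 0; i < 4; ++i) regs->param[i] = p[i];
    regs->command = cmd;
    regs->start = 1;
    while ((regs->status & 1u) == 0) {
        // a real driver would block on the interrupt instead of polling
    }
}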
To achieve an optimal AR user experience, HW and SW have to evolve together so that they use all processing resources available on an application processor. For