DRAM ICs. The complexity of application processor SoCs with many internal clients competing for external memory resources slows down access. The large number of client memory requests must be buffered and passed through multiple levels of interconnect networks to ensure consistency and arbitration. As a result, fetching data from external memory into the processing engines incurs a long latency. Thus, when designing and implementing AR and CV functions, one should understand these bottlenecks and reduce memory bandwidth as much as possible; AR and CV algorithms and implementations must take these limitations into account. Next, we highlight some of the rules that can be followed to improve throughput and reduce memory bandwidth.
The most important rule is to avoid reading data multiple times: instead, move chunks of data into local memories and apply as many operations as possible to the local data. This is similar to how the graph model in the OpenVX framework [11] works, where intermediate data nodes of the graph are kept in local memory.
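As an illustration, the following sketch builds a two-node OpenVX graph; the image size and the two filter kernels are arbitrary choices for the example. The virtual image between the nodes is exactly such an intermediate data node: the runtime is free to keep it in local memory and never spill it to external DRAM.

#include <VX/vx.h>

// Minimal OpenVX graph sketch (image size and kernels are illustrative).
// The virtual image "tmp" is an intermediate data node that the runtime
// may place in local memory.
void build_and_run(vx_context ctx) {
    vx_graph graph = vxCreateGraph(ctx);
    vx_image in  = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);
    vx_image tmp = vxCreateVirtualImage(graph, 0, 0, VX_DF_IMAGE_VIRT);
    vx_image out = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);

    vxGaussian3x3Node(graph, in, tmp);  // smooth into the virtual image
    vxMedian3x3Node(graph, tmp, out);   // consume it without a DRAM round trip

    vxVerifyGraph(graph);               // gives the runtime a chance to plan buffer placement
    vxProcessGraph(graph);
}

Because virtual images have no host-visible backing store, graph verification can decide on tiling and buffer placement before any data is moved.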
Other optimization options are to compress data during transfers to and from external SDRAM, or to hide large latencies by prefetching data. Where deterministic access is mandatory, double-buffered DMA transfers into local memory are used: while one buffer waits for data, the other buffer is processed without wait-states.
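A minimal sketch of this ping-pong scheme follows, with std::async standing in for the DMA controller (fetch_tile and process_tile are hypothetical stand-ins, not part of any real HAL): the transfer of tile i+1 overlaps the processing of tile i, so the compute side never stalls.

#include <cstddef>
#include <future>
#include <vector>

// Hypothetical stand-ins: on real hardware fetch_tile would be a DMA
// descriptor kicked off by the controller, not a CPU-side copy.
std::vector<float> fetch_tile(std::size_t index) { return std::vector<float>(4096, float(index)); }
void process_tile(const std::vector<float>&) { /* local-memory compute */ }

// Ping-pong double buffering: the transfer of tile i+1 runs while tile i
// is being processed, so processing proceeds without wait-states.
void run_double_buffered(std::size_t num_tiles) {
    auto pending = std::async(std::launch::async, fetch_tile, std::size_t{0});
    for (std::size_t i = 0; i < num_tiles; ++i) {
        std::vector<float> current = pending.get();   // tile i is ready
        if (i + 1 < num_tiles)                        // start filling the other buffer
            pending = std::async(std::launch::async, fetch_tile, i + 1);
        process_tile(current);                        // compute while the next transfer runs
    }
}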
When moving from 2D or 3D sparse features to full 3D dense point clouds, the databases become huge, and hence more effort is required to manage external memory access efficiently. Matching or comparing one large database (e.g., a point cloud) against another (e.g., using Iterative Closest Point (ICP)) cannot be performed by exhaustive search, since that would lead to O(n²) complexity. Index structures such as binary trees are needed to reduce the complexity to O(n log n). But this leads to another challenge: the data access becomes non-deterministic, and prefetching is no longer possible. GPUs solve this with a different method.
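ICP implementations commonly use a k-d tree, one such binary space-partitioning structure, for the nearest-neighbour queries. The minimal 3-D version below (all names are ours, for illustration) shows both the O(log n) average query and why the access pattern is data dependent: which branch is pruned depends on the query point itself, so the addresses touched cannot be known ahead of time.

#include <algorithm>
#include <array>
#include <limits>
#include <vector>

using Point = std::array<float, 3>;

// In-place k-d tree build: each level splits at the median of the next
// axis (x, y, z, x, ...), using the same midpoint rule as the query.
void build(std::vector<Point>& pts, int lo, int hi, int axis) {
    if (hi - lo <= 1) return;
    int mid = (lo + hi) / 2;
    std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi,
                     [axis](const Point& a, const Point& b) { return a[axis] < b[axis]; });
    build(pts, lo, mid, (axis + 1) % 3);
    build(pts, mid + 1, hi, (axis + 1) % 3);
}

float dist2(const Point& a, const Point& b) {
    float s = 0;
    for (int i = 0; i < 3; ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return s;
}

// Recursive nearest-neighbour query. The pruning test on the far branch
// depends on the data, which is what makes the memory access pattern
// non-deterministic and defeats prefetching.
void nearest(const std::vector<Point>& pts, int lo, int hi, int axis,
             const Point& q, int& best, float& bestD2) {
    if (lo >= hi) return;
    int mid = (lo + hi) / 2;
    float d2 = dist2(q, pts[mid]);
    if (d2 < bestD2) { bestD2 = d2; best = mid; }
    float diff = q[axis] - pts[mid][axis];
    int nearLo = diff < 0 ? lo : mid + 1, nearHi = diff < 0 ? mid : hi;
    int farLo  = diff < 0 ? mid + 1 : lo, farHi  = diff < 0 ? hi : mid;
    nearest(pts, nearLo, nearHi, (axis + 1) % 3, q, best, bestD2);
    if (diff * diff < bestD2)  // descend the far side only if it could hold a closer point
        nearest(pts, farLo, farHi, (axis + 1) % 3, q, best, bestD2);
}

int nearest_neighbor(const std::vector<Point>& pts, const Point& q) {
    int best = -1;
    float bestD2 = std::numeric_limits<float>::max();
    nearest(pts, 0, static_cast<int>(pts.size()), 0, q, best, bestD2);
    return best;
}

A full ICP iteration would query the nearest neighbour of every transformed source point against such a tree and then re-estimate the rigid transform from the resulting correspondences.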
GPUs make use of caches and run a large number of tasks in parallel. Tasks that have to wait while their data are fetched from external memory hand their computation unit over to the next task that is ready. If enough tasks are initiated, the memory latencies are hidden and all available computation units can be fully utilized.
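The effect can be mimicked on a CPU with thread oversubscription. The sketch below is only an analogy (the real mechanism is the GPU's hardware warp scheduler, not OS threads), but it shows the principle: each task first stalls on a simulated memory fetch, and with many tasks in flight the stalls of some overlap the compute of others, so total wall time approaches the latency divided by the available parallelism rather than the latency times the task count.

#include <chrono>
#include <future>
#include <thread>
#include <vector>

// Each "task" stalls on a simulated memory fetch before computing.
long task(int i) {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));  // simulated memory latency
    return static_cast<long>(i) * i;                             // compute phase
}

int main() {
    // Oversubscribe: far more tasks in flight than hardware threads, so
    // a stalled task's execution slot goes to one that is ready.
    std::vector<std::future<long>> inflight;
    for (int i = 0; i < 256; ++i)
        inflight.push_back(std::async(std::launch::async, task, i));
    long sum = 0;
    for (auto& f : inflight) sum += f.get();  // wall time << 256 * 10 ms
    return 0;
}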
Figure 7.17 shows the architecture of the Metaio AR Engine. Following the considerations stated in the previous sections, the accelerator blocks only access local memories, which ideally can be read with zero cycles of delay. The buffers are prefilled and emptied by an autonomous DMA controller. As the matcher typically runs in parallel with other algorithms and has to access large databases in external memory, it has its own interface to the memory controller. To retain maximal flexibility in streamlining data flows for the individual application, a programmable core has been added as an embedded control unit. It is software programmable and sets up all the individual operations and data transfers. This flexibility also allows the implementation of an OpenVX-like graph model of the data flow between the individual units. The engine can operate completely autonomously. It communicates with the host CPU (typically an ARM core) via interrupts and through a dedicated host interface. The host can access parts of the engine's internal memories to set up parameters and commands.
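As a purely hypothetical sketch (the engine's actual register map and command set are not given here), such a host interface typically boils down to writing parameters and a command word into the engine's memory, kicking it off, and waiting for the completion interrupt:

#include <cstdint>

// Entirely hypothetical register layout; the real memory map belongs to
// the SoC. The pattern itself (write parameters, write a command, start,
// wait for an interrupt/status bit) is the generic one.
struct EngineRegs {
    volatile std::uint32_t param[4];  // kernel parameters set up by the host
    volatile std::uint32_t command;   // operation selector
    volatile std::uint32_t start;     // write 1 to start the configured graph
    volatile std::uint32_t status;    // bit 0 raised on completion, with the IRQ
};

void run_engine(EngineRegs* regs, std::uint32_t cmd, const std::uint32_t p[4]) {
    for (int i = 0; i < 4; ++i) regs->param[i] = p[i];
    regs->command = cmd;
    regs->start = 1;
    while ((regs->status & 1u) == 0) {
        // a real driver would block on the interrupt instead of polling
    }
}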
To achieve an optimal AR user experience, HW and SW have to evolve together so that they use all processing resources available on an application processor. For