• Texture Coding (TC) performs discrete cosine transform and quantization over motion-compensated residuals ('texture').
• Texture Update (TU) uses the output of TC to locally reconstruct the original frame as it would appear after decoding. This reconstructed frame can later serve as a reference frame.
• Entropy Coding (EC) encodes the motion vectors to produce a compressed bitstream.
• Bitstream Packetizing (BP) prepares the packets containing the output data.
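The four stages above can be sketched as a toy pipeline. This is an illustrative simplification, not the reference code: the quantization step, the 4-bit "entropy" code, and the packet size are all placeholder assumptions, and the 8x8 DCT of a real encoder is omitted.

```python
# Toy sketch of the four MPEG4 encoder stages; all parameters are
# illustrative assumptions, not values from the reference code.

def texture_coding(residual, qstep=8):
    # TC: quantize the motion-compensated residual.
    # (A real encoder first applies an 8x8 DCT; omitted here.)
    return [round(v / qstep) for v in residual]

def texture_update(coeffs, prediction, qstep=8):
    # TU: dequantize and add back the prediction, rebuilding the frame
    # exactly as the decoder will, for later use as a reference frame.
    return [c * qstep + p for c, p in zip(coeffs, prediction)]

def entropy_coding(motion_vectors):
    # EC: placeholder fixed-length code standing in for a real VLC table.
    return "".join(format(mv & 0xF, "04b") for mv in motion_vectors)

def bitstream_packetizing(payload, size=16):
    # BP: split the encoded bits into fixed-size packets.
    return [payload[i:i + size] for i in range(0, len(payload), size)]
```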
These functional blocks are implemented in application kernels (computationally intensive nested loops) that have been optimized for compilation on VLIW architectures, in particular for execution on the ADRES processor [5] used in our MPSoC platform.
The RRM can trade off application performance against resource usage by selecting a specific parallelization to execute on the platform. To do so, the RRM needs different parallel versions of the same application, i.e., different binaries that perform the same functionality while using different numbers of computing elements.
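A minimal sketch of this selection policy follows. The version table and its (cores, frames-per-second) figures are hypothetical, introduced only to illustrate the idea of picking the best-performing version that fits the free computing elements.

```python
# Hypothetical RRM trade-off sketch: among the available parallel
# versions of the application, pick the fastest one whose core
# requirement fits the currently free computing elements.
# The table below is illustrative, not measured data.

VERSIONS = {
    "seq":  {"cores": 1, "fps": 11},
    "par2": {"cores": 2, "fps": 20},
    "par3": {"cores": 3, "fps": 27},
}

def select_version(free_cores):
    feasible = {n: v for n, v in VERSIONS.items() if v["cores"] <= free_cores}
    if not feasible:
        return None  # no version fits the resource budget
    # Maximize performance within the budget.
    return max(feasible, key=lambda n: feasible[n]["fps"])
```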
A set of parallel versions has been generated starting from a sequential implementation based on the MPEG4 Simple Profile reference code. First, the initial sequential version was pruned and cleaned to set up the parallelization procedure. Then, the sequential application was parallelized using the MPSoC Parallelization Assist (MPA) tool [6].
MPA is a tool that supports MPSoC programmers in investigating different parallelization alternatives for a given application. Once the programmer specifies a parallelization, MPA automatically inserts into the sequential code all the program lines needed to spawn parallel threads and to implement inter-thread communication. To generate different versions of the same application, the programmer only has to profile the sequential application, understand how kernels can be assigned to different threads, and specify the different parallelization opportunities to MPA, without any further handmade modification of the application code.
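The kind of code MPA inserts can be illustrated, in spirit, by the following sketch: a thread is spawned per kernel and a queue carries data between them. This is not MPA output; the producer/consumer split and the squaring kernel are stand-ins chosen for the example.

```python
# Illustrative sketch (not MPA-generated code) of spawning parallel
# threads and adding inter-thread communication to a sequential loop.

import threading
import queue

def producer(out_q, data):
    for item in data:
        out_q.put(item)    # hand each work item to the next thread
    out_q.put(None)        # end-of-stream marker

def consumer(in_q, results):
    while True:
        item = in_q.get()
        if item is None:
            break
        results.append(item * item)  # stand-in for a real kernel

def run_pipeline(data):
    q = queue.Queue()
    results = []
    t1 = threading.Thread(target=producer, args=(q, data))
    t2 = threading.Thread(target=consumer, args=(q, results))
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results
```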
MPA can handle parallelization at the functional level, at the data level, or as a combination of both. In practice, different functional kernels can be distributed over different threads (functional parallelization), or the same kernel(s) can be split over different threads with respect to loop indices (data parallelization). In the latter case, each thread performs the same functionality over a different part of the dataset.
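Data-level parallelization can be sketched as follows: the same kernel runs in every thread, but each thread covers a disjoint slice of the loop index range. The row-sum kernel and the row-wise split are assumptions made for the example.

```python
# Sketch of data parallelization: one kernel, disjoint index slices,
# one slice per thread. The row-sum kernel is a placeholder.

import threading

def kernel(frame, lo, hi, out):
    # Each thread processes rows lo..hi-1 of the frame.
    for i in range(lo, hi):
        out[i] = sum(frame[i])

def data_parallel(frame, n_threads):
    out = [0] * len(frame)
    step = (len(frame) + n_threads - 1) // n_threads  # ceil division
    threads = [
        threading.Thread(
            target=kernel,
            args=(frame, t * step, min((t + 1) * step, len(frame)), out))
        for t in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```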
Once the different parallel versions are generated with MPA, the obtained code can be compiled and the resulting binaries executed on the target platform. In particular, Fig. 9.2 shows the parallel versions of the MPEG4 encoder studied in this chapter. In Fig. 9.2, the functional blocks are drawn as solid boxes, while the thread partitioning is represented by dotted lines.
In this case study, each thread needs a computing element on which to execute, and a computing element cannot execute more than one thread. Thus, the number of threads