Table 7.3 Play event classification results, obtained by multiple feature types
Play category  MPEG-7 audio + MFCC  MPEG-7 motion + audio  MPEG-7 motion + MFCC  MPEG-7 motion + audio + MFCC
Pass           70.0 %               85.2 %                 85.2 %                94.3 %
Run            59.7 %               91.0 %                 92.5 %                89.6 %
FG/XP          75.0 %               87.5 %                 87.5 %                93.8 %
K/P            69.0 %               82.8 %                 82.8 %                93.1 %
Overall        67.0 %               87.0 %                 87.5 %                92.5 %
Table 7.4 Play event classification results, obtained by three sets of features, based on motion combined with other modalities
Method                        Pass    Run     FG/XP   K/P
MPEG-7 motion                 79.5 %  92.5 %  87.5 %  65.5 %
MPEG-7 motion + audio         85.2 %  91.0 %  87.5 %  82.8 %
MPEG-7 motion + audio + MFCC  94.3 %  89.6 %  93.8 %  93.1 %
different networks. This variety in the database ensured that the sample space of the
current work was diverse and included all the major broadcasters.
Table 7.3 shows the classification results obtained by combining MPEG-7 motion and audio descriptors with MFCC features. Classification accuracy increases as the multi-modal features are combined. Combining the MPEG-7 audio descriptors with MFCC features gives an overall increase of 10 %. Combining the audio features with the motion descriptor features gives an increase of 5 %. Combining all three feature types produces the highest overall classification accuracy, 92.5 %.
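The overall figures in Table 7.3 are not the simple average of the four per-category accuracies. For instance, the four values in the last column average to 92.7 %, against a reported overall of 92.5 %. This suggests that, as is common, the overall accuracy is pooled over all test samples, so categories with more samples carry more weight. The sketch below computes both quantities. It assumes plain Python lists of true and predicted category labels. The function name `per_class_and_overall_accuracy` and the toy labels are hypothetical and are not taken from the chapter.

```python
from collections import Counter


def per_class_and_overall_accuracy(y_true, y_pred):
    """Return ({category: accuracy}, overall accuracy) for paired label lists."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    totals = Counter(y_true)  # number of test samples per true category
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions per category
    per_class = {c: hits[c] / n for c, n in totals.items()}
    # Overall accuracy pools all samples, so larger categories carry more weight
    overall = sum(hits.values()) / len(y_true)
    return per_class, overall


# Toy example (hypothetical labels, not data from the chapter)
y_true = ["Pass", "Pass", "Pass", "Pass", "Run", "Run"]
y_pred = ["Pass", "Pass", "Pass", "Run", "Run", "Pass"]
per_class, overall = per_class_and_overall_accuracy(y_true, y_pred)
print(per_class)                                 # {'Pass': 0.75, 'Run': 0.5}
print(round(overall, 3))                         # 0.667
print(sum(per_class.values()) / len(per_class))  # 0.625, the unweighted mean
```

Here the pooled accuracy (0.667) exceeds the unweighted mean (0.625) because the better-classified category has more samples. The same effect, in either direction, can separate the Overall row of Table 7.3 from a plain average of the rows above it.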
Combining multi-modal features in a suitable way can improve classification, but there are always trade-offs to consider. Some features may reduce the classification accuracy for a particular category while improving the overall performance of the system. Table 7.4 shows how the classification results vary as audio features are added to the motion features. Accuracy for Run falls from 92.5 % with motion alone to 91.0 % with motion and audio, and to 89.6 % when MFCC features are also added. Accuracy for Pass, FG/XP, and K/P either stays the same or rises.
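The trade-off can be made explicit by differencing consecutive rows of Table 7.4. In the minimal Python sketch below, the dictionary simply transcribes the table, and the names `TABLE_7_4` and `accuracy_deltas` are illustrative only. For each step that adds a modality, it reports the change in percentage points per category and lists the categories that got worse.

```python
# Per-category accuracies (%) transcribed from Table 7.4
CATEGORIES = ["Pass", "Run", "FG/XP", "K/P"]
TABLE_7_4 = {
    "MPEG-7 motion":                [79.5, 92.5, 87.5, 65.5],
    "MPEG-7 motion + audio":        [85.2, 91.0, 87.5, 82.8],
    "MPEG-7 motion + audio + MFCC": [94.3, 89.6, 93.8, 93.1],
}


def accuracy_deltas(before, after):
    """Change in accuracy (percentage points) per category, rounded to 0.1."""
    return {c: round(a - b, 1) for c, b, a in zip(CATEGORIES, before, after)}


# Compare each feature set with the one before it (rows are in insertion order)
steps = list(TABLE_7_4)
for prev, curr in zip(steps, steps[1:]):
    deltas = accuracy_deltas(TABLE_7_4[prev], TABLE_7_4[curr])
    worse = [c for c, d in deltas.items() if d < 0]
    print(f"{prev} -> {curr}: {deltas}; categories that got worse: {worse}")
```

Run is the only category flagged at either step, while K/P gains the most. This is the kind of category-level cost that the overall figure hides.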
7.7 Summary
This chapter has covered a broad spectrum of video segmentation, indexing, retrieval, and classification techniques applicable to news and sports videos. A shot detection algorithm for MPEG video data in the compressed domain can be built on the energy histogram of DCT coefficients. The detection results can be improved by using the ratio between two sliding windows to attenuate the low-pass filtered frame distances. The advantage of this approach is a high detection rate at low computational complexity. In a subsequent step, news videos can be segmented at the shot, group-of-shots, and story levels, where the template frequency model can be applied to capture spatio-temporal information. This facilitates video retrieval