scale-invariant feature transform (SIFT) [4] and motion features such as dense motion [5]. We also
detect objects in each frame and compute the relative positions between them. We aim to represent
actions as precisely as possible so that classification results improve; hence, the feature
extraction step is very important and the features must be chosen carefully. Another
subproblem is how to combine the features above. Previous research shows that a single kind of
feature does not yield sufficient accuracy, while combining different features improves it.
However, finding an efficient combination is still a complex problem, since it depends on the
dataset and the kinds of features. In this chapter, we use both early fusion and late fusion
techniques to combine features [6, 7].
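To make the distinction concrete, the sketch below contrasts the two fusion strategies on toy feature vectors. The feature dimensions, the synthetic data, and the scikit-learn SVM classifiers are our own illustrative assumptions, not the chapter's actual pipeline:

```python
# A minimal sketch of early vs. late fusion; feature dimensions, the
# synthetic data, and the SVM classifiers are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200
sift_feats = rng.normal(size=(n, 128))   # stand-in image features (e.g., SIFT)
motion_feats = rng.normal(size=(n, 64))  # stand-in motion features
labels = rng.integers(0, 2, size=n)      # two hypothetical action classes

# Early fusion: concatenate feature vectors, train one classifier.
early_x = np.hstack([sift_feats, motion_feats])
early_clf = SVC(probability=True).fit(early_x, labels)

# Late fusion: train one classifier per feature type, then combine
# their class-probability outputs (here, a simple average).
clf_img = SVC(probability=True).fit(sift_feats, labels)
clf_mot = SVC(probability=True).fit(motion_feats, labels)
late_scores = (clf_img.predict_proba(sift_feats) +
               clf_mot.predict_proba(motion_feats)) / 2
late_pred = late_scores.argmax(axis=1)
```

Early fusion lets one classifier model interactions between feature types, while late fusion lets each feature type be modeled by the classifier best suited to it; which works better depends on the dataset, as noted above.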
Last but not least, to solve the cooking action classification problem, we use a Bayesian network
(BN) classifier, for three main reasons. First, a BN supports automatic structure learning, which
means we do not need to design the network's structure by hand. Second, in cooking videos the
sequence of actions for each kind of dish has its own characteristics, so the classifier can learn
this pattern and use it to achieve better classification results. Third, a BN is easy to maintain:
the parameters of its nodes can be updated, and its structure can be modified by adding or
removing nodes. For these three reasons, we have chosen the BN as the classifier for our system.
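As an illustration of the first and third properties, the sketch below learns a BN structure automatically from data and fits its parameters, assuming the pgmpy library. The discretized feature columns and the class variable are hypothetical, and the exact pgmpy API may differ between versions:

```python
# A minimal sketch of BN classification with automatic structure learning,
# assuming the pgmpy library; column names and data are hypothetical.
import pandas as pd
from pgmpy.estimators import HillClimbSearch, BicScore, MaximumLikelihoodEstimator
from pgmpy.models import BayesianNetwork
from pgmpy.inference import VariableElimination

# Hypothetical discretized features per video segment plus an action label.
data = pd.DataFrame({
    "color_bin":  [0, 1, 1, 0, 2, 2, 1, 0],
    "motion_bin": [1, 1, 0, 0, 2, 1, 2, 0],
    "action":     [0, 0, 1, 1, 2, 2, 2, 1],
})

# 1) Automatic structure learning: no hand-designed network needed.
structure = HillClimbSearch(data).estimate(scoring_method=BicScore(data))

# 2) Parameter learning on the discovered structure; both parameters and
#    structure are easy to update later as new data arrives.
bn = BayesianNetwork(structure.edges())
bn.add_nodes_from(data.columns)  # keep columns left isolated by the search
bn.fit(data, estimator=MaximumLikelihoodEstimator)

# Classify a new observation by querying the posterior over "action".
posterior = VariableElimination(bn).query(
    variables=["action"], evidence={"color_bin": 1, "motion_bin": 2})
print(posterior)
```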
To build a gesture recognition system, our main contributions are as follows:
• Representing cooking motion by image features, including color histogram, local binary
pattern (LBP), edge orientation histogram (EOH), PHOG, SIFT, and the relative positions
of objects (a short extraction sketch follows this list).
• Representing cooking motion by motion features, using the dense trajectories feature.
• Combining image features and motion features by both early fusion and late fusion
techniques.
• Using a BN classifier for gesture recognition.
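As a concrete illustration of the image-feature side, the sketch below extracts two of the listed descriptors, a color histogram and an LBP histogram, from one frame using OpenCV and scikit-image. The synthetic frame and the histogram parameters are illustrative assumptions:

```python
# A minimal sketch of two of the listed image features (color histogram
# and LBP) for a single frame; frame data and parameters are assumptions.
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

# Stand-in for a real video frame (H x W x BGR channels).
frame = np.random.randint(0, 256, (120, 160, 3), dtype=np.uint8)
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Color histogram: one 32-bin histogram per BGR channel, concatenated.
color_hist = np.concatenate([
    cv2.calcHist([frame], [c], None, [32], [0, 256]).ravel()
    for c in range(3)
])
color_hist /= color_hist.sum()  # normalize to a distribution

# LBP: uniform patterns with 8 neighbors at radius 1, then a histogram
# (uniform LBP with P=8 yields P+2 = 10 distinct codes).
lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)

# Early-fused per-frame descriptor (cf. the fusion discussion above).
frame_descriptor = np.concatenate([color_hist, lbp_hist])
```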
2 Related work
Since the flourishing of computer vision began, the human action recognition problem has
appeared in many applications, especially on smart devices. Generally, a typical human action
recognition system works by extracting some kinds of features and/or combining them in a
certain way. Most such systems use both global and local image features. Many researchers have
tried to determine which feature best describes human actions and whether different features
complement each other. The following features have been studied in depth in recent years to
answer this question.
One answer came from a successful system built by a research team from Columbia
University. It used SIFT [4] as an image feature, space-time interest points (STIP) [8] as a
motion feature, and Mel-frequency cepstral coefficients (MFCC) [9] as a sound feature. Overall,
STIP was the best motion feature for describing human actions. However, to achieve better
results, complementary features should be combined, including image features, motion features,
and even sound features. This is an important conclusion that other teams agree with.
Researchers from International Business Machines Corporation (IBM) built another human
action recognition system [10]. It used many image features, including SIFT [4], GIST [11],
color histogram, color moments, and wavelet texture. For motion features, it used STIP [8]
combined with HOG/HOF [12]. From their experimental results, they concluded that combining
several features raises recognition accuracy, the same conclusion the Columbia team reached
[13]. Besides these, the Nikon team's system is a simpler one [14]. It used scene cut