Gesture recognition in cooking video based on image features and motion features using Bayesian network classifier - Emerging Trends in Image Processing, Computer Vision, and Pattern Recognition


In parallel with image feature extraction, we extract motion features from the videos. In our method, we use dense trajectories and the motion boundary histogram (MBH) descriptor [5] for action representation. The main reason for choosing this motion feature is that each cooking action is characterized by different simple motions: for example, a cutting action involves vertical motions, while a mixing action is mostly described by circular motions. Moreover, cooking videos contain many fine motions, so we use the dense trajectories feature, which is well suited to representing even fine motions. In addition, the MBH descriptor expresses only the boundaries of foreground motion and suppresses background and camera motion. It is therefore appropriate for action representation in this step.

To compute the optical flow from the dense samples above, we use the Farnebäck algorithm [20] because it is one of the fastest algorithms for computing dense optical flow. Next, we track points through the optical flow fields to obtain trajectories over a sequence of 15 consecutive frames. To describe the motion feature, each video is divided into blocks of size N × M × L; that is, each optical flow matrix is scaled to size N × M, and each block contains L optical flow fields. Then, each block is divided into n_σ × n_σ × n_t cells. Lastly, we calculate the MBH feature for each cell and concatenate the results. For the motion feature, we also use BoFs [16] to increase the effectiveness of recognition, as we do for the SIFT feature among the image features.
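As a rough illustration of the per-cell MBH computation described above, the following minimal sketch builds an orientation histogram from the spatial gradients of each optical flow component. It assumes the flow for one cell is already available as two 2-D lists (horizontal and vertical components, for instance from a Farnebäck implementation); the function names, the 8-bin quantization, and the L2 normalization are illustrative choices, not details taken from the text.

```python
import math

def gradient(field):
    """Spatial gradient of a 2-D list (needs at least 2 rows and 2 columns).

    Uses central differences inside the grid and one-sided differences
    at the borders.
    """
    h, w = len(field), len(field[0])
    gx = [[0.0] * w for _ in range(h)]
    gy = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
            gx[y][x] = (field[y][x1] - field[y][x0]) / (x1 - x0)
            gy[y][x] = (field[y1][x] - field[y0][x]) / (y1 - y0)
    return gx, gy

def mbh_histogram(component, bins=8):
    """Magnitude-weighted orientation histogram of one flow component's gradient."""
    gx, gy = gradient(component)
    hist = [0.0] * bins
    for row_x, row_y in zip(gx, gy):
        for dx, dy in zip(row_x, row_y):
            mag = math.hypot(dx, dy)
            if mag == 0.0:
                continue  # locally constant flow contributes nothing
            angle = math.atan2(dy, dx) % (2 * math.pi)
            hist[int(angle / (2 * math.pi) * bins) % bins] += mag
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist

def mbh_cell(flow_u, flow_v, bins=8):
    """Concatenate the histograms of the horizontal and vertical flow components."""
    return mbh_histogram(flow_u, bins) + mbh_histogram(flow_v, bins)

# A uniform flow field (e.g. pure camera translation) has zero gradient,
# so its descriptor is all zeros; a flow field that varies across the cell
# produces a non-zero histogram.
constant = [[2.0] * 4 for _ in range(4)]
ramp = [[float(x) for x in range(4)] for _ in range(4)]
camera_only = mbh_cell(constant, constant)
moving_edge = mbh_cell(ramp, constant)
```

Because the histogram is built from flow gradients rather than the flow itself, locally constant motion drops out, which is the property that makes MBH robust to camera motion. A block descriptor would then be obtained by concatenating the cell descriptors.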

3.5 BNs Training

For the three main reasons given in the first section, we choose BNs as our classifier. In our approach, we use three separate networks that play different roles in the classification step. Since the features fall into several categories (label information, image features, and motion features), we use a different BN for training and classifying each category. Using three different BNs should give a better classification result than using only one network. Moreover, the three networks can be trained at the same time, which could reduce training time.

In this step, we create three BNs to classify feature vectors into an action class. First of all, we have a BN built from ground-truth label data, which represents the probability of the subsequent action's label given the previously identified action labels. It is calculated using Bayes' theorem:

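$$P(A_i \mid \mathrm{PALs}) = \frac{P(\mathrm{PALs} \mid A_i)\, P(A_i)}{P(\mathrm{PALs})} \qquad (3)$$

where A_i is the ith action and PALs are the previous action labels.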
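To make the label-based network more concrete, here is a minimal sketch of estimating the probability of the next action label from ground-truth label sequences. It makes two simplifying assumptions that are not stated in the text: only the single most recent label is used as the PALs, and Laplace smoothing (the hypothetical `alpha` parameter) is added so that unseen transitions keep a small non-zero probability. The conditional is estimated directly from transition counts, which for unsmoothed counts matches what Bayes' theorem yields from the joint and marginal frequencies.

```python
from collections import Counter, defaultdict

def train_label_model(sequences, alpha=1.0):
    """Estimate P(next action | previous action) from labeled sequences.

    sequences: list of lists of action labels in temporal order.
    alpha: Laplace smoothing constant (an illustrative assumption).
    """
    labels = sorted({a for seq in sequences for a in seq})
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count each (previous label, next label) pair.
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev in labels:
        total = sum(counts[prev].values()) + alpha * len(labels)
        model[prev] = {a: (counts[prev][a] + alpha) / total for a in labels}
    return model

# Toy ground-truth data; the label names are made up for illustration.
sequences = [
    ["cut", "mix", "bake"],
    ["cut", "mix", "mix", "bake"],
    ["cut", "bake"],
]
model = train_label_model(sequences)
most_likely_after_cut = max(model["cut"], key=model["cut"].get)
```

Conditioning on a longer history of previous labels would follow the same pattern, with a tuple of labels as the dictionary key instead of a single label.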

For the second BN, we have a graph in which the nodes' values are extracted from high-level features. In this BN (see Figure 2), the human node indicates whether a human is present in the frame. Similarly, the hand node indicates whether hands are in the frame. The tool-using, container-using, and food-using nodes are determined from the relative position between the hands and the objects. In addition, there is a status-changing node, which expresses the change of the ingredients inside the cooking container. Finally, the action label nodes depend on the nodes identified above; their conditional probability formula is given below.
