In parallel with image feature extraction, we extract motion features from the videos.
In our method, we use dense trajectories and the motion boundary histogram (MBH) descriptor
[ 5 ] for action representation. The main reason for choosing these motion features is that each
cooking action is characterized by different simple motions: for example, the cutting action
involves vertical motions, while the mixing action is mostly described by circular motions.
Moreover, cooking videos contain many fine motions, so we use dense trajectory features,
which are well suited to representing even fine motions. In addition, the MBH descriptor
captures only the boundaries of foreground motion and suppresses background and camera
motion. Thus, it is well suited to action representation in this step.
To compute the optical flow for the densely sampled points, we use the Farnebäck algorithm [ 20 ]
because it is one of the fastest algorithms for computing dense optical flow. Next, we track
the points through the optical flow fields to obtain trajectories over a sequence of 15
consecutive frames. To describe the motion features, each video is divided into blocks of size
N × M × L; that is, each optical flow field is scaled to size N × M and each block contains
L optical flow fields. Then, each block is divided into n_σ × n_σ × n_t cells. Lastly, we compute
the MBH feature for each cell and concatenate the results. For the motion features, we also use
BoFs [ 16 ], as we do for the SIFT features among the image features, to improve recognition
performance.
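The per-cell MBH computation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names (`gradient`, `mbh_component`, `mbh_descriptor`), the 8 orientation bins, the default cell counts, and the L2 normalization are all assumptions, and the flow fields are assumed to be already computed and scaled to N × M.

```python
import math

def gradient(img):
    """Central-difference spatial gradient of a 2D list (borders clamped)."""
    h, w = len(img), len(img[0])
    gx = [[img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
           for x in range(w)] for y in range(h)]
    gy = [[img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
           for x in range(w)] for y in range(h)]
    return gx, gy

def mbh_component(frames, n_sigma=2, n_t=3, bins=8):
    """Orientation histograms of the gradient of ONE flow component.

    frames: list of L flow fields, each an N x M 2D list (u or v component).
    Returns n_sigma * n_sigma * n_t * bins values, L2-normalized.
    All parameter defaults are assumptions made for this sketch.
    """
    L, N, M = len(frames), len(frames[0]), len(frames[0][0])
    hist = [[0.0] * bins for _ in range(n_sigma * n_sigma * n_t)]
    for t, frame in enumerate(frames):
        gx, gy = gradient(frame)
        ct = t * n_t // L                      # temporal cell index
        for y in range(N):
            cy = y * n_sigma // N              # vertical cell index
            for x in range(M):
                cx = x * n_sigma // M          # horizontal cell index
                mag = math.hypot(gx[y][x], gy[y][x])
                if mag == 0.0:
                    continue                   # constant flow contributes nothing
                angle = math.atan2(gy[y][x], gx[y][x]) % (2 * math.pi)
                b = min(int(angle / (2 * math.pi) * bins), bins - 1)
                cell = (ct * n_sigma + cy) * n_sigma + cx
                hist[cell][b] += mag           # magnitude-weighted vote
    flat = [v for cell_hist in hist for v in cell_hist]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]

def mbh_descriptor(u_frames, v_frames, **kwargs):
    """Join the histograms of the horizontal and vertical flow components."""
    return mbh_component(u_frames, **kwargs) + mbh_component(v_frames, **kwargs)
```

Because the histogram is built from the spatial gradient of the flow rather than the flow itself, a uniform flow field (such as a pure camera translation) yields an all-zero descriptor, which is the property the text relies on.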
3.5 BNs Training
For the three main reasons given in the first section, we choose BNs as our classifier. In our
approach, we use three separate networks that play different roles in this classification step.
Since there are several categories of features, derived from label information, image features,
and motion features, we use a different BN for training and classifying each category. By using
three different BNs, the classification result should be better than with only one network.
Moreover, we can train the three networks at the same time, which means training time could be
reduced.
In this step, we create three BNs to classify feature vectors into an action class. First
of all, we have a BN built from ground-truth label data, which represents the probability of
the subsequent action's label given the previously identified action labels. It is calculated
using Bayes' theorem:
$$P(A_i \mid \mathrm{PALs}) = \frac{P(\mathrm{PALs} \mid A_i)\, P(A_i)}{P(\mathrm{PALs})}$$
where A_i is the i-th action and PALs are the previous action labels.
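As a rough illustration of the label-based BN, the sketch below estimates the probability of the next action given the previous action labels by counting over ground-truth sequences and applying Bayes' theorem. The function names, the `history` parameter (how many previous labels form the PALs), and the action names used in the usage example are hypothetical and are not taken from the original method.

```python
from collections import Counter

def fit_label_model(sequences, history=1):
    """Collect counts from ground-truth action label sequences.

    history is an assumed parameter: the number of previous labels in PALs.
    """
    action_counts = Counter()   # count(A_i)
    pal_counts = Counter()      # count(PALs)
    joint_counts = Counter()    # count(PALs, A_i)
    for seq in sequences:
        for k in range(history, len(seq)):
            pals = tuple(seq[k - history:k])
            action = seq[k]
            action_counts[action] += 1
            pal_counts[pals] += 1
            joint_counts[(pals, action)] += 1
    return action_counts, pal_counts, joint_counts

def posterior(action, pals, model):
    """P(A_i | PALs) = P(PALs | A_i) * P(A_i) / P(PALs), from raw counts."""
    action_counts, pal_counts, joint_counts = model
    pals = tuple(pals)
    total = sum(action_counts.values())
    if pal_counts[pals] == 0 or action_counts[action] == 0:
        return 0.0              # unseen history or unseen action
    prior = action_counts[action] / total                              # P(A_i)
    likelihood = joint_counts[(pals, action)] / action_counts[action]  # P(PALs | A_i)
    evidence = pal_counts[pals] / total                                # P(PALs)
    return likelihood * prior / evidence

# Usage with made-up label sequences:
sequences = [["cut", "mix", "bake"], ["cut", "mix", "fry"], ["peel", "cut", "mix"]]
model = fit_label_model(sequences)
p_bake_after_mix = posterior("bake", ["mix"], model)
```

With raw counts the three factors reduce to count(PALs, A_i) / count(PALs); keeping them separate mirrors the Bayes form and makes it easy to substitute smoothed estimates for any one term.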
For the second BN, we have a graph whose node values are extracted from high-level
features. In this BN (see Figure 2), the human node indicates whether a human is present
in the frame. Similarly, the hand node indicates whether hands are in the frame. The
tool-using, container-using, and food-using nodes are determined from the relative position
between the hands and the objects. In addition, there is a status-changing node, which
expresses changes in the ingredients inside the cooking container. Finally, the action label
nodes depend on the nodes identified above, with the conditional probability formula given
below