Gesture recognition in cooking video based on image features and motion features using Bayesian network classifier - Emerging Trends in Image Processing, Computer Vision, and Pattern Recognition

4 Experiments

4.1 Dataset

In our experiments, we use the ACE dataset, which contains five sets for training and two sets
for testing. The dataset covers five egg-based menus and eight kinds of cooking actions
performed by actors. The ingredients used are egg, ham, milk, oil, and salt; the cooking
utensils include a frying pan, saucepan, bowl, knife, and chopsticks, among others. The videos
were captured by a Kinect sensor, which provides synchronized color and depth image
sequences. Each video is 5 to 10 min long and contains from 2000 to over 10,000 frames. Each
frame is 480 × 640 pixels and is assigned an action label indicating the type of action
performed by the actors in the video.

In this dataset, all dishes are based on egg; sometimes ham or seasonings are added. Each dish
has its own color: for example, boiled egg is white or brown from the eggshell, while omelet
is yellow and pink from the egg and ham. Therefore, we extract image features such as the
color histogram and color moments to distinguish between dishes. In addition, because each
cooking action requires a different cooking tool with a characteristic shape, we also use
edge-based image features; for example, the cutting action requires a knife, while the mixing
action requires chopsticks.
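
The two color features named above can be sketched in a few lines. This is only an illustration under stated assumptions, not the extraction code used in the experiments: the function names, the bin count, and the use of the signed cube root of the third central moment as "skewness" (one common definition of color moments) are all choices made here, and pixels are assumed to be 8-bit RGB tuples.

```python
import math

def color_histogram(pixels, bins=8):
    """Per-channel histogram of 8-bit RGB pixels, normalized per channel.

    `bins` is a hypothetical setting and is assumed to divide 256.
    """
    hist = [[0] * bins for _ in range(3)]
    width = 256 // bins
    for px in pixels:
        for c in range(3):
            hist[c][min(px[c] // width, bins - 1)] += 1
    n = len(pixels)
    # Concatenate the three channel histograms into one feature vector.
    return [count / n for channel in hist for count in channel]

def color_moments(pixels):
    """Mean, standard deviation, and skewness for each RGB channel."""
    n = len(pixels)
    feats = []
    for c in range(3):
        values = [px[c] for px in pixels]
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / n
        third = sum((v - mean) ** 3 for v in values) / n
        std = math.sqrt(var)
        # Signed cube root of the third central moment (assumed definition).
        skew = math.copysign(abs(third) ** (1 / 3), third)
        feats.extend([mean, std, skew])
    return feats
```

Concatenating the two outputs gives a compact per-frame color descriptor; a mostly yellow frame and a mostly white frame would differ chiefly in the blue-channel entries.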

4.2 Parameter Setting

Several parameters are used throughout our processing pipeline. In the preprocessing step,
the parameter d_off = 1090 depends on the particular Kinect device. The thresholds used to
detect hands, hand_size_min and hand_size_max, are obtained from the training data. For
motion extraction, the parameters are N × M = 480 × 640, L = 20, n_σ = 2, and n_t = 3.
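
As a rough illustration of how size thresholds of this kind might be applied, the sketch below keeps only candidate regions whose area falls between hand_size_min and hand_size_max. It rests on assumptions not stated in the text: that the thresholds are pixel areas, and that they are taken as the smallest and largest hand areas observed in the training data. The function names and the sample values are hypothetical.

```python
def estimate_hand_size_range(training_areas):
    """Assumed rule: use the smallest and largest training hand areas."""
    return min(training_areas), max(training_areas)

def filter_hand_candidates(region_areas, hand_size_min, hand_size_max):
    """Return indices of regions whose area lies within the hand-size range."""
    return [
        i
        for i, area in enumerate(region_areas)
        if hand_size_min <= area <= hand_size_max
    ]

# Hypothetical areas (in pixels) of hand regions seen in training frames.
hand_size_min, hand_size_max = estimate_hand_size_range([1800, 2500, 3200])
# Candidate regions in a new frame: too small, plausible, plausible, too large.
kept = filter_hand_candidates([400, 2000, 3100, 9000], hand_size_min, hand_size_max)
```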

Lastly, we simply use w_1 = w_2 = 1 in Equation (6), because optimizing the values of w_1 and
w_2 is a hard problem. Nevertheless, with w_1 = w_2 = 1 we still obtain good results, as
expected.

4.3 Results

Our original work, published as a paper at IPCV2014 [21], has been extended by adding more
features. We evaluate the recognition precision of the image features, the motion features,
and their combination on the ACE dataset. The results of using image features alone and
motion features alone are shown in the first and second columns of Table 1. When using only
image features, some actions, such as boiling, breaking, and seasoning, cannot be classified;
their precision is 0%. Other actions, including baking and cutting, achieve better precisions
of 36.9% and 26.1%, respectively. However, when only motion features are applied, all actions
have better precision than when only image features are used.
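
For readers who want to reproduce this kind of per-action figure, the sketch below computes precision for each action from frame-level labels. It assumes precision in the standard sense (correct predictions of an action divided by all predictions of that action); the exact metric definition is not given in this section, and the convention of reporting 0.0 for an action that is never predicted is a choice made here. The function name and sample labels are hypothetical.

```python
from collections import Counter

def per_class_precision(true_labels, predicted_labels):
    """Return {action: precision} from parallel lists of frame labels."""
    true_positive = Counter()
    predicted = Counter()
    for t, p in zip(true_labels, predicted_labels):
        predicted[p] += 1
        if t == p:
            true_positive[p] += 1
    classes = sorted(set(true_labels) | set(predicted_labels))
    # An action that is never predicted gets 0.0 by convention (assumption).
    return {
        c: (true_positive[c] / predicted[c] if predicted[c] else 0.0)
        for c in classes
    }

# Toy frame labels, not taken from the ACE dataset.
truth = ["boiling", "boiling", "cutting", "cutting", "baking"]
guess = ["cutting", "cutting", "cutting", "baking", "baking"]
scores = per_class_precision(truth, guess)
```

In this toy example "boiling" is never predicted, so its precision is reported as 0, which mirrors how an action can end up at 0% when a feature set fails to separate it from the others.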
