Image Processing Reference
In our experiments, we use the ACE dataset, which contains five sets for training and two sets
for testing. The dataset covers five egg-based menus and eight kinds of cooking actions
performed by actors. The ingredients used in the dataset include egg, ham, milk, oil, and salt,
and the cooking utensils include a frying pan, saucepan, bowl, knife, chopsticks, etc. The videos
were captured by a Kinect sensor, which provides synchronized color and depth image
sequences. Each video is 5 to 10 min long and contains from 2000 to over 10,000 frames.
Each frame is 480 × 640 pixels and is assigned an action label indicating the type of
action performed by the actors in the video.
In this dataset, all dishes are based on egg, sometimes with ham or seasonings added.
Each dish has its own color: for example, boiled egg is white or brown from the eggshell,
while omelet is yellow and pink from the egg and ham. Therefore, we extract color features
such as the color histogram and color moments to distinguish between dishes.
In addition, because each cooking action requires a different cooking tool with a characteristic
shape (for example, cutting requires a knife while mixing requires chopsticks), we also use
edge-based image features.
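As an illustration of the color features mentioned above, the following is a minimal sketch of a per-channel color histogram and the first three color moments. It is not the authors' implementation: the function names, the choice of 8 bins per channel, the L1 normalization, and the use of the cube root of the third central moment as the skewness term are all assumptions made for this example.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Concatenate per-channel histograms and L1-normalize the result.

    `bins=8` is an illustrative choice, not a value from the text.
    """
    hists = []
    for c in range(image.shape[2]):
        h, _ = np.histogram(image[:, :, c], bins=bins, range=(0, 256))
        hists.append(h)
    feat = np.concatenate(hists).astype(np.float64)
    return feat / feat.sum()

def color_moments(image):
    """Return mean, standard deviation, and skewness term for each channel."""
    pixels = image.reshape(-1, image.shape[2]).astype(np.float64)
    mean = pixels.mean(axis=0)
    std = pixels.std(axis=0)
    # Cube root of the third central moment (one common convention).
    skew = np.cbrt(((pixels - mean) ** 3).mean(axis=0))
    return np.concatenate([mean, std, skew])

# Synthetic 480 x 640 frame with a saturated first channel (test data only).
frame = np.zeros((480, 640, 3), dtype=np.uint8)
frame[:, :, 0] = 255

hist = color_histogram(frame)   # length 3 * bins
moments = color_moments(frame)  # length 9 for a 3-channel image
```

The two vectors can be concatenated into a single descriptor per frame; how they are combined with the edge features is not specified here.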
4.2 Parameter Setting
Several parameters are used throughout our processing pipeline. In the preprocessing step, the
parameter d_off = 1090 depends on the particular Kinect device. The thresholds used to detect
hands, hand_size_min and hand_size_max, are obtained from the training data. For motion
extraction, the parameters are N × M = 480 × 640, L = 20, n_σ = 2, and n_t = 3.
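The role of the two hand-size thresholds can be sketched as a simple area filter over candidate regions. This is only an illustration under stated assumptions: the text does not say how hand_size_min and hand_size_max are derived from the training data, so the min/max rule in `estimate_thresholds` is hypothetical, as are the function names and all numeric areas below.

```python
def estimate_thresholds(training_hand_areas):
    """Hypothetical rule: take the smallest and largest labeled hand area."""
    return min(training_hand_areas), max(training_hand_areas)

def filter_hand_candidates(region_areas, hand_size_min, hand_size_max):
    """Return indices of regions whose pixel area lies within the thresholds."""
    return [
        i for i, area in enumerate(region_areas)
        if hand_size_min <= area <= hand_size_max
    ]

# Illustrative pixel areas only; these are not values from the ACE dataset.
training_areas = [2100, 3400, 5800, 2900]
hand_size_min, hand_size_max = estimate_thresholds(training_areas)

candidate_areas = [120, 2500, 4100, 30000]
kept = filter_hand_candidates(candidate_areas, hand_size_min, hand_size_max)
```

With these sample values, only the second and third candidates fall inside the learned range; very small regions (noise) and very large ones (for example, a tabletop) are rejected.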
Lastly, we simply use w_1 = w_2 = 1 in Equation (6), because optimizing the values of w_1 and
w_2 is a hard problem. Even with w_1 = w_2 = 1, we still obtain good results as more features
are added.
We evaluate the recognition precision of image features, motion features, and their combination
on the ACE dataset. The results of using image features alone and motion features alone are
shown in the first and second columns of Table 1. When only image features are used, some
actions, such as boiling, breaking, and seasoning, cannot be classified at all (precision of 0%),
while other actions, namely baking and cutting, achieve better precisions of 36.9% and 26.1%,
respectively. However, when only motion features are used, every action achieves better
precision than with image features alone.
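To make the per-action figures concrete, the following is a minimal sketch of how per-class precision could be computed from frame-level labels. It assumes the standard definition (correct predictions of a class divided by all predictions of that class) and reports 0.0 for a class that is never predicted; whether the evaluation here uses exactly this convention is not stated. The function name and the toy label lists are illustrative.

```python
from collections import Counter

def per_class_precision(true_labels, predicted_labels):
    """Return {label: precision}, with 0.0 for labels never predicted."""
    predicted = Counter(predicted_labels)
    correct = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    labels = set(true_labels) | set(predicted_labels)
    return {
        label: (correct[label] / predicted[label]) if predicted[label] else 0.0
        for label in labels
    }

# Toy frame-level labels for illustration only.
truth = ["cutting", "cutting", "mixing", "boiling"]
preds = ["cutting", "mixing", "mixing", "mixing"]

precision = per_class_precision(truth, preds)
```

In this toy example, "boiling" is never predicted and therefore receives a precision of 0.0, which mirrors the kind of 0% entries described above.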