Gesture recognition in cooking video based on image features and motion features using Bayesian network classifier - Emerging Trends in Image Processing, Computer Vision, and Pattern Recognition


If this ratio is higher than a threshold, the row is taken as the border separating the table and floor areas.

For human detection, we use both the depth image and the segmented floor area detected in the previous step. When a human appears in the scene, the distance from the depth sensor to the human body is much smaller than the distance to the floor. Therefore, the pixels whose grayscale values are higher than the average grayscale value of the floor area are taken to belong to the human body.
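The thresholding rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the depth image is a 2-D list of grayscale values in which larger means nearer to the sensor, and `floor_mask` (a hypothetical name) marks the floor pixels found in the previous step.

```python
def human_mask(depth, floor_mask):
    """Mark pixels whose grayscale exceeds the mean grayscale of the floor area.

    depth      : 2-D list of grayscale depth values (assumed: larger = nearer)
    floor_mask : 2-D list of booleans, True where the pixel belongs to the floor
    """
    floor_values = [
        d
        for d_row, m_row in zip(depth, floor_mask)
        for d, m in zip(d_row, m_row)
        if m
    ]
    if not floor_values:
        raise ValueError("floor_mask selects no pixels")
    floor_mean = sum(floor_values) / len(floor_values)
    # A pixel is a human-body candidate when it is brighter than the floor mean.
    return [[d > floor_mean for d in d_row] for d_row in depth]
```

In a real frame the floor is not perfectly uniform, so a margin above the floor mean would probably be needed; the text does not specify one.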

Next, for hand detection, we use color images and skin detection. First, we choose some points that lie in a skin area. Based on the colors of these points, we obtain the color range of the pixels in the skin area. Each pixel is then classified by its color. After that, we eliminate the areas whose size is not in the range [hand_size_min, hand_size_max], where these two thresholds bound the size of a hand. If more than two areas remain, only the two largest are selected, because the smaller areas are mostly noise.
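The hand-detection steps above can be sketched as follows. The sketch makes several assumptions that the text does not state: the color range is a per-channel minimum and maximum over the sampled points, areas are 4-connected regions, and area size is the pixel count. All function names are hypothetical.

```python
from collections import deque


def skin_range(image, sample_points):
    """Per-channel (min, max) color range over hand-picked skin points."""
    samples = [image[r][c] for r, c in sample_points]
    lo = tuple(min(s[i] for s in samples) for i in range(3))
    hi = tuple(max(s[i] for s in samples) for i in range(3))
    return lo, hi


def skin_mask(image, lo, hi):
    """Classify each pixel as skin when every channel lies inside the range."""
    return [
        [all(lo[i] <= px[i] <= hi[i] for i in range(3)) for px in row]
        for row in image
    ]


def connected_areas(mask):
    """Return the 4-connected regions of True pixels as lists of (row, col)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    areas = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, area = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                area.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            areas.append(area)
    return areas


def detect_hands(image, sample_points, hand_size_min, hand_size_max):
    """Keep size-filtered skin areas and return at most the two largest."""
    lo, hi = skin_range(image, sample_points)
    areas = connected_areas(skin_mask(image, lo, hi))
    areas = [a for a in areas if hand_size_min <= len(a) <= hand_size_max]
    return sorted(areas, key=len, reverse=True)[:2]
```

Sorting by size after the range filter matches the order in the text: the size thresholds remove implausible areas first, and the two-largest rule only resolves what is left.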

In addition, we apply object detection models. Objects in this case are cooking tools such as a frying pan, a pan, and chopsticks, and ingredients such as egg, ham, and seasoning. For object detection, we use color images because they contain more information than depth images. For each object, we collect about 100 image samples. Then, an image feature is calculated for each sample, and the resulting training data are used for one-versus-all SVM classification. At test time, a sliding window is used to detect objects and recognize which kind of object each one is.
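A minimal sketch of the test-time procedure follows. It assumes linear one-versus-all classifiers that have already been trained, a square window, and a caller-supplied `extract` function for the image feature. The window size, stride, score threshold, and all names here are hypothetical, since the text does not give them.

```python
def sliding_windows(height, width, win, stride):
    """Yield the top-left corner of every win x win window inside the image."""
    for top in range(0, height - win + 1, stride):
        for left in range(0, width - win + 1, stride):
            yield top, left


def one_vs_all_predict(feature, classifiers):
    """Return the label with the highest linear decision score, and that score.

    classifiers maps label -> (weights, bias), one binary classifier per class.
    """
    scores = {
        label: sum(w * x for w, x in zip(weights, feature)) + bias
        for label, (weights, bias) in classifiers.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


def detect_objects(image, win, stride, extract, classifiers, min_score=0.0):
    """Classify every window and keep those whose best score exceeds min_score."""
    detections = []
    for top, left in sliding_windows(len(image), len(image[0]), win, stride):
        patch = [row[left:left + win] for row in image[top:top + win]]
        label, score = one_vs_all_predict(extract(patch), classifiers)
        if score > min_score:
            detections.append((top, left, label, score))
    return detections
```

Rejecting windows whose best score is not positive is one way to handle background regions in which no classifier fires; overlapping detections would still need to be merged in practice.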

3.3 Image Feature Extraction

Image features play an important role in representing motion, and they can be extracted faster than other features. We therefore use image features as one of the main features for fast motion description. Because many kinds of image features could be used for our problem, we need to consider which ones are the most characteristic for representing motion.

After the preprocessing step, image features are extracted from the frames of the video. In practice, however, we extract features only from key-frames, which are chosen as one frame out of every k frames. In our experiment, we use k = 10 to reduce the amount of computation, because ten consecutive frames differ little from one another. In our research, LBP [18], EOH [19], PHOG [3], and SIFT [4] are used because they characterize the content of frames, including information about the context, and are easy to extract.
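Key-frame sampling can be sketched as below. The text does not say which frame of each group of k is kept, so this sketch assumes the first one; `key_frame_indices` is a hypothetical name.

```python
def key_frame_indices(num_frames, k=10):
    """Return the indices of key-frames, one for every k consecutive frames.

    Assumption: the first frame of each group of k is taken as the key-frame.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return list(range(0, num_frames, k))
```

For a 35-frame clip with k = 10, this keeps frames 0, 10, 20, and 30, so feature extraction runs on four frames instead of 35.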

To capture more detail, each image feature is extracted from the cells of a 4 × 8 grid over the frame. For the PHOG feature, we use eight gradient bins and a highest level of l = 2. For SIFT features, we apply the BoF method [16] to increase the effectiveness of recognition. After the keypoint detection step, a histogram of the keypoints is calculated based on a codebook, which is a collection of millions of keypoints. This yields several different kinds of feature vectors. Next, we apply the early fusion technique in [6] to join these feature vectors into a single feature vector, which serves as the image feature vector characterizing a frame of the video.
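The three operations in this paragraph (per-cell extraction, a BoF histogram, and early fusion) can be sketched as follows. Several points are assumptions: the grid is read as 4 rows by 8 columns, codeword assignment uses the nearest codeword by squared Euclidean distance, and early fusion is taken to be plain concatenation, which may be simpler than the technique in [6]. `cell_descriptor` stands in for any per-cell feature such as LBP or EOH, and all names are hypothetical.

```python
def grid_cells(frame, rows=4, cols=8):
    """Yield the cells of a rows x cols grid over a 2-D frame, row by row."""
    h, w = len(frame), len(frame[0])
    for i in range(rows):
        for j in range(cols):
            r0, r1 = i * h // rows, (i + 1) * h // rows
            c0, c1 = j * w // cols, (j + 1) * w // cols
            yield [row[c0:c1] for row in frame[r0:r1]]


def grid_descriptor(frame, cell_descriptor, rows=4, cols=8):
    """Concatenate the descriptor of every grid cell into one vector."""
    vector = []
    for cell in grid_cells(frame, rows, cols):
        vector.extend(cell_descriptor(cell))
    return vector


def bof_histogram(descriptors, codebook):
    """Count, for each codeword, the descriptors that are nearest to it."""
    hist = [0] * len(codebook)
    for d in descriptors:
        nearest = min(
            range(len(codebook)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(d, codebook[i])),
        )
        hist[nearest] += 1
    return hist


def early_fusion(*feature_vectors):
    """Join several feature vectors into one by concatenation."""
    fused = []
    for vector in feature_vectors:
        fused.extend(vector)
    return fused
```

With concatenation, the fused vector length is the sum of the individual lengths, so features with very different scales would normally be normalized before fusion.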

3.4 Motion Feature Extraction

Motion features are indispensable for our problem because of their efficiency in representing motion. However, they require an enormous amount of computation and are much more complex than image features. Therefore, in our research, we use one of the fastest and densest motion features, which was studied by Wang et al. [5].
