If this ratio is higher than a threshold, the row is taken as the border that separates the
table area from the floor area.
For human detection, we use both the depth image and the segmented floor area
detected in the previous step. When a human appears in the scene, the distance from the depth
sensor to the human body is much smaller than the distance to the floor. Therefore, pixels whose
grayscale values are higher than the average grayscale value of the floor area are taken to
belong to the human body.
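The thresholding rule above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `human_mask`, the nested-list image layout, and the choice to skip pixels already labeled as floor are assumptions made for the example.

```python
def human_mask(depth, floor_mask):
    """Flag pixels whose grayscale exceeds the mean grayscale of the floor area.

    depth:      2D list of grayscale depth values (higher = nearer, as in the text).
    floor_mask: 2D list of booleans, True where the pixel was segmented as floor.
    """
    rows, cols = len(depth), len(depth[0])
    floor_values = [
        depth[r][c] for r in range(rows) for c in range(cols) if floor_mask[r][c]
    ]
    if not floor_values:
        raise ValueError("floor_mask selects no pixels")
    floor_mean = sum(floor_values) / len(floor_values)

    # Assumption: pixels already labeled as floor are never marked as human.
    return [
        [(not floor_mask[r][c]) and depth[r][c] > floor_mean for c in range(cols)]
        for r in range(rows)
    ]
```

In practice the comparison would probably need a margin above the floor mean to tolerate sensor noise; the text does not specify one, so none is used here.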
In the next step, hand detection, we use color images and skin detection. First, we choose
some points that lie in a skin area. Based on the colors of these points, we obtain the color
range of the pixels in the skin area. Then, each pixel is classified by its color. Next, we eliminate
areas whose size is outside the range [hand_size_min, hand_size_max], the
thresholds that bound the size of a hand. If more than two areas remain, only the two
largest areas are kept because the smaller areas are mostly noise.
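The hand detection steps above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the color range is derived from the sample points or how areas are formed, so a per-channel min/max range and 4-connected regions are assumed, and all function names are hypothetical.

```python
from collections import deque


def color_range(samples):
    """Per-channel (min, max) over the colors of hand-picked skin points."""
    return [(min(channel), max(channel)) for channel in zip(*samples)]


def skin_mask(image, ranges):
    """Classify each pixel as skin if every channel falls inside its range."""
    return [
        [all(lo <= v <= hi for v, (lo, hi) in zip(pixel, ranges)) for pixel in row]
        for row in image
    ]


def connected_areas(mask):
    """Return 4-connected regions of True pixels as lists of (row, col)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    areas = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, area = deque([(r, c)]), []
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                area.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            areas.append(area)
    return areas


def hand_areas(mask, hand_size_min, hand_size_max):
    """Drop areas outside the size range, then keep the two largest."""
    kept = [a for a in connected_areas(mask)
            if hand_size_min <= len(a) <= hand_size_max]
    return sorted(kept, key=len, reverse=True)[:2]
```

Area size is measured here in pixels; the text does not state the unit of hand_size_min and hand_size_max, so that is also an assumption.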
Besides, we also apply some object detection models. Objects in this case are cooking tools
such as a frying pan, pan, and chopsticks, and ingredients such as egg, ham, and seasoning. For object
detection, we use color images because they contain more information than depth images. For
each object, we collect about 100 sample images. Then, an image feature is calculated for
each sample image, and the resulting training data is used for one-versus-all SVM
classification. At test time, a sliding window is used to detect objects and recognize
which kind of object each one is.
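The test-time procedure can be sketched roughly as follows. This is not the authors' code: it assumes square windows with a fixed stride and linear one-versus-all classifiers that have already been trained, with each class represented by a hypothetical (weights, bias) pair. SVM training itself is omitted.

```python
def sliding_windows(width, height, win, stride):
    """Yield (x, y, win, win) boxes that scan the frame left to right, top to bottom."""
    for y in range(0, height - win + 1, stride):
        for x in range(0, width - win + 1, stride):
            yield (x, y, win, win)


def one_vs_all_predict(feature, models):
    """Score a feature vector with every per-class linear classifier.

    models: {label: (weights, bias)}, assumed to come from one-versus-all
            training done beforehand.
    Returns the (label, score) pair with the highest decision score.
    """
    scores = {
        label: sum(w * f for w, f in zip(weights, feature)) + bias
        for label, (weights, bias) in models.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```

A detector built on these helpers would compute the image feature inside each window, call `one_vs_all_predict`, and keep windows whose best score exceeds some threshold; that threshold is not given in the text.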
3.3 Image Feature Extraction
Image features play an important role in representing motion, and they are extracted faster than
other features. We therefore use image features as one of the main features for fast
motion description. Many kinds of image features could be used for our
problem, so we need to consider which ones are the most characteristic for representing
motion.
Following the preprocessing step, image features are, in principle, extracted from every frame in the video.
In practice, however, we only extract features from key-frames, choosing one frame
out of every k frames in the video. In our experiment, we use k = 10 to reduce the amount of
computation, because ten consecutive frames differ very little. In our research, LBP
[18], EOH [19], PHOG [3], and SIFT [4] are used because they can characterize the content of
frames, including information about the context, and are easy to extract.
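Key-frame selection with k = 10 can be sketched as below. The text does not say which frame of each group of k is chosen, so taking the first one is an assumption, and `key_frame_indices` is a hypothetical helper name.

```python
def key_frame_indices(num_frames, k=10):
    """Return the index of one frame per block of k consecutive frames.

    Assumption: the first frame of each block is used as the key-frame.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return list(range(0, num_frames, k))
```

For a 35-frame clip this selects frames 0, 10, 20, and 30, so feature extraction runs on 4 frames instead of 35.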
To capture more detail, each image feature is extracted from the cells of a 4 × 8
grid over the frame. For the PHOG feature, we use eight gradient bins and a highest level of
l = 2. Moreover, for SIFT features, we apply the BoF method [16] to increase the effectiveness of
recognition. After the keypoint detection step, a histogram of the keypoints is calculated based on
a codebook, which is a collection of millions of keypoints. This gives us several different kinds
of feature vectors. Next, we apply the early fusion technique in [6] to join these feature vectors
into a single feature vector, which is the image feature vector that characterizes
a frame in the video.
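The grid, BoF, and fusion steps can be sketched as follows. Several points are assumptions made for illustration: the 4 × 8 grid is read as 4 rows by 8 columns, keypoints are assigned to the nearest codeword by squared Euclidean distance, the histogram is normalized to sum to 1, and early fusion is taken to mean plain concatenation, which may differ in detail from the technique in [6]. All function names are hypothetical.

```python
def grid_cells(height, width, rows=4, cols=8):
    """Return (top, bottom, left, right) pixel bounds of each cell, row by row."""
    return [
        (r * height // rows, (r + 1) * height // rows,
         c * width // cols, (c + 1) * width // cols)
        for r in range(rows)
        for c in range(cols)
    ]


def bof_histogram(descriptors, codebook):
    """Normalized histogram of nearest-codeword assignments (bag of features)."""
    counts = [0] * len(codebook)
    for d in descriptors:
        nearest = min(
            range(len(codebook)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(d, codebook[i])),
        )
        counts[nearest] += 1
    total = sum(counts)
    return [n / total for n in counts] if total else [0.0] * len(codebook)


def early_fusion(*vectors):
    """Join several feature vectors into one by concatenation."""
    return [value for vector in vectors for value in vector]
```

With this layout, a 240 × 320 frame yields 32 cells of 60 × 40 pixels each, and the fused vector's length is the sum of the lengths of the individual feature vectors.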
3.4 Motion Feature Extraction
Motion features are indispensable for our problem because of their efficiency in representing
motion. However, they require an enormous amount of computation and are much more
complex than image features. Therefore, in our research, we use one of the fastest and densest
motion features, which was studied by Wang et al. [5].