On Importance of Interactions and Context in Human Action Recognition

Abstract

This paper is focused on the automatic recognition of human events in static images. Popular techniques use knowledge of the human pose for inferring the action, and the most recent approaches tend to combine pose information with either knowledge of the scene or of the objects with which the human interacts. Our approach takes a step forward in this direction by combining the human pose with the scene in which the human is placed, together with the spatial relationships between humans and objects. Based on standard, simple descriptors like HOG and SIFT, recognition performance is enhanced when these three types of knowledge are taken into account. Results obtained on the PASCAL 2010 Action Recognition Dataset demonstrate that our technique reaches state-of-the-art results using simple descriptors and classifiers.

Keywords: Scene Understanding, Action Recognition, Spatial Interaction Modeling.

Introduction

The enormous amount of images generated daily by millions of Internet users demands robust and generic image understanding techniques for the automatic indexing and annotation of the human events displayed in pictures, for subsequent search and retrieval. In essence, the main goal of this Image Understanding process is to automatically assign semantic labels to images in which humans appear. This process tries to bridge the semantic gap between the low-level image representation and the high-level (Natural Language) descriptions given by humans [1]. In this domain, the recognition of human activities in static images is of great importance, since (i) humans are the most frequent subject in images (and videos), and (ii) knowledge about the scene and nearby objects can be exploited for inferring the human action [4].


Progress in the automatic recognition of human events in image databases has led to interesting recent approaches that take into account multiple sources of knowledge found in an image, namely (i) the pose of the human body [2], (ii) the kind of scene in which the human is performing her/his action [3], and (iii) the objects with which the human interacts in order to perform the action [5].

On the one hand, a popular strategy for pose estimation is based on fitting an articulated model of the human body, as for example in [8]. The task of pose estimation is then to find the parameters of the articulated model that best correspond to the human in the image or video of interest. However, these types of approaches are computationally expensive and require good-quality images of humans in both training and testing data, which is not always achievable. For this reason, other approaches rely on appearance matching instead [11]. Such approaches are purely bottom-up, and the pose is represented by appearance feature vectors.

Pose-based methods for action recognition constitute successful solutions to the recognition of actions such as walking or running [2]. However, pose alone is not enough for the analysis of more complex human behaviours such as riding a horse, phoning, or working with a computer. In these cases, knowledge about the scene (outdoor or indoor) and the interacting objects (horses, phones, screens) should be exploited.

The scene in which the human is performing an action can provide discriminative information: for example, playing an instrument is usually observed at concerts, while working with a computer is often seen in an office environment. Scene analysis can be done in two main ways: (i) scene region segmentation (sky, road, buildings, etc.), and (ii) holistic scene classification (indoor or outdoor, mountain view or forest, bedroom or kitchen, etc.) using the global appearance of the scene [6]. Segmentation methods usually provide detailed information about those regions of the scene that are particularly important for action understanding. Alternatively, obtaining a single label for the whole scene (indoor, outdoor, etc.) has proven sufficient for action recognition: Marszalek et al. [3] studied the relevant scene classes and their correlation with human activities, and showed that incorporating scene information effectively increases the recognition rate.

The analysis of the interactions of the human with other objects is also of importance: the aforementioned actions involve not only a human, but also the interaction with object(s) in the scene. Towards this end, interesting methods for spatial interaction modeling have been presented [4,5,7,8]. The majority of these interaction models are Bayesian. Such models provide coherent inference, but they are computationally expensive and require a proper initialization. The authors of [9] proposed a simple and elegant solution that models spatial interactions such as "on top", "above", "below", "next to", "near", "far", and "overlap". These relationships provide a high-level semantic interpretation of human-object and object-object interactions.

In this paper we propose a strategy for the efficient combination of three sources of knowledge: human pose, scene label, and object interaction. The main contribution of our work relies on jointly taking into account the pose of the human, the interactions between the human and objects, and the context. An important property of our technique is that we make no assumptions about the objects that can appear in the scene. In particular, this goal extends the work of Li and Fei-Fei [4], who propose a generic model that incorporates several sources of knowledge, including event, scene, and objects. Their model, restricted to sports activities only, ignores the spatial and interactive relationships among the objects in the scene. Our approach also extends the action recognition technique of Gupta et al. [5], which applies spatial constraints on the locations of objects. However, their model requires strong prior knowledge in order to distinguish between manipulable and scene objects, and their approach is tested only on sports datasets, where no context information is available.


Fig. 1. Human action recognition

The rest of the paper is organized as follows: Section 2 presents our framework for action recognition and details the models used for pose, scene, and interaction analysis. Since we aim to demonstrate the benefits of combining these three types of knowledge, we restrict our models to standard descriptors like HOG and SIFT in order to better evaluate the gain obtained by their combination. Section 3 provides the experimental results and shows that our approach achieves state-of-the-art performance on the PASCAL VOC 2010 Challenge [13]. Finally, Section 4 draws conclusions and outlines future research.

Human Action Recognition

The overall pipeline for action recognition is illustrated in Fig. 1. Initially, given an image and the bounding box of the human, salient features are extracted. Then, object detection based on Recursive Coarse-to-Fine Localization [10] is performed. Next, the human pose is estimated, scene analysis is carried out, and the interactions between human, scene, and objects are analysed. Finally, using a classification procedure, an estimate of the human action is computed. While feature extraction and object detection are used in a straightforward manner, the pose, scene, and spatial interaction analysis are detailed next.

Pose Estimation

Pose estimation is achieved by fusing knowledge about the local appearance and the local shape of the human, whose location is given by a bounding box provided by the dataset. The appearance of a human pose is computed in the area of the bounding box using the Bag-of-Words (BoW) technique. The shape of the human pose is represented with histograms of oriented gradients (HOG) [11], which capture edge orientation in the region of interest. In order to keep spatial information, we apply a Pyramid of HOG (PHOG) [12], which captures local contour information while preserving spatial constraints. The final human pose model H_P results from the concatenation of the appearance and shape representations.
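As an illustration of this step, the shape part could be assembled roughly as follows. This is a minimal sketch, not the exact PHOG formulation of [12]: it applies scikit-image's HOG to each cell of a two-level spatial grid, and the function names, grid depth, and cell size are our own illustrative choices.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def phog_descriptor(crop, levels=2, cell_size=(64, 64)):
    """Pyramid of HOG: HOG histograms over a coarse-to-fine grid of
    subregions (1x1, 2x2, ...), each resized to a fixed size so the
    final descriptor length does not depend on the crop size."""
    h, w = crop.shape
    parts = []
    for level in range(levels):
        n = 2 ** level                       # n x n grid at this level
        for i in range(n):
            for j in range(n):
                cell = crop[i * h // n:(i + 1) * h // n,
                            j * w // n:(j + 1) * w // n]
                cell = resize(cell, cell_size, anti_aliasing=True)
                parts.append(hog(cell, orientations=9,
                                 pixels_per_cell=(8, 8),
                                 cells_per_block=(1, 1)))
    return np.concatenate(parts)

def pose_model(crop, bow_appearance_hist):
    """H_P: concatenation of the appearance (BoW) and shape (PHOG) parts."""
    return np.concatenate([bow_appearance_hist, phog_descriptor(crop)])
```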


Fig. 2. Spatial histogram

Scene Model

Scene analysis is done using SIFT features and the BoW approach, enhanced with the spatial pyramid presented in [6]. In our work we use a spatial pyramid over the background with two levels: level zero covers the entire background region, and level one consists of three horizontal bars defined by the human bounding box. The global scene of the image is represented with a histogram H_BG, which is the concatenation of the histograms of both levels.
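A minimal sketch of this two-level background pyramid, assuming the SIFT descriptors have already been quantized into visual-word ids with known y-coordinates (all variable names are illustrative):

```python
import numpy as np

def scene_histogram(words, ys, box_top, box_bottom, vocab_size):
    """H_BG: level 0 is a BoW histogram over the whole background;
    level 1 stacks BoW histograms of three horizontal bars cut at the
    top and bottom of the person's bounding box."""
    words = np.asarray(words)
    ys = np.asarray(ys)

    def bow(mask):
        h = np.bincount(words[mask], minlength=vocab_size).astype(float)
        return h / max(h.sum(), 1.0)         # L1-normalize each region

    level0 = bow(np.ones(len(words), dtype=bool))
    bars = [bow(ys < box_top),
            bow((ys >= box_top) & (ys < box_bottom)),
            bow(ys >= box_bottom)]
    return np.concatenate([level0] + bars)
```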

Spatial Interaction

To handle spatial interactions we combine two interaction models: (i) a local interaction model and (ii) a global interaction model, adapted from [9].

Local Interaction. The local interaction model H_LI is a SIFT-based BoW histogram calculated over the local neighbourhood around the bounding box. This neighbourhood defines a local context that helps to analyse the interactions between the human and the objects being manipulated by the human.
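For instance, the local context region might be obtained by enlarging the person's bounding box; the enlargement factor below is an illustrative guess, not a value taken from the paper:

```python
def local_context_box(x1, y1, x2, y2, img_w, img_h, scale=1.5):
    """Enlarge the person's box by `scale` (clipped to the image);
    the SIFT BoW histogram computed inside this region gives H_LI."""
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * scale / 2.0, (y2 - y1) * scale / 2.0
    return (max(0, cx - w), max(0, cy - h),
            min(img_w, cx + w), min(img_h, cy + h))
```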

Global Interaction. A basic description of the actions in a scene can be obtained from the types of objects observed in it. Given $N_O$ object detections $O = [O_1, O_2, \ldots, O_{N_O}]$ in the image $I$, object occurrence can be represented as a histogram $H_O$:

$$H_O = \sum_{i=1}^{N_O} u_i, \qquad \lVert u_i \rVert_1 = P_i,$$

where $u_i$ is a vector with exactly one nonzero element; the index of that element indicates the class of the object $O_i$, and its L1-norm $\lVert u_i \rVert_1$ equals the detection probability $P_i$.
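In code, the occurrence histogram amounts to accumulating each detection's confidence into the bin of its predicted class; a minimal sketch, assuming detections arrive as (class_id, score) pairs:

```python
import numpy as np

def occurrence_histogram(detections, num_classes):
    """H_O = sum_i u_i: detection i adds its probability P_i to the
    bin of its predicted class."""
    h_o = np.zeros(num_classes)
    for class_id, score in detections:
        h_o[class_id] += score
    return h_o
```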

In addition, it is important to incorporate a model of how these objects are distributed in the scene. Such a model can be obtained by analysing the interactions across all the objects in the scene. The interaction between two objects $i$ and $j$ can be represented by a sparse spatial interaction feature $d_{ij}$, which bins the relative location of the detection windows of $i$ and $j$ into one of the canonical semantic relations above, below, on top, next to, near, and far (see Fig. 2).
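The following sketch illustrates one plausible binning of $d_{ij}$; the distance thresholds and geometric tests are our own assumptions, and [9] defines the precise rules:

```python
import numpy as np

RELATIONS = ["above", "below", "ontop", "next-to", "near", "far"]

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / max(area(a) + area(b) - inter, 1e-9)

def spatial_feature(a, b):
    """d_ij: one-hot vector binning the relative location of detection
    windows a and b into one canonical semantic relation."""
    d = np.zeros(len(RELATIONS))
    ax, ay = (a[0] + a[2]) / 2.0, (a[1] + a[3]) / 2.0
    bx, by = (b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0
    diag = np.hypot(a[2] - a[0], a[3] - a[1])    # scale reference
    dist = np.hypot(ax - bx, ay - by)
    if iou(a, b) > 0.5:
        rel = "ontop"
    elif a[3] <= b[1]:
        rel = "above"                            # a entirely above b
    elif a[1] >= b[3]:
        rel = "below"
    elif dist < 0.5 * diag:
        rel = "next-to"
    elif dist < 2.0 * diag:
        rel = "near"
    else:
        rel = "far"
    d[RELATIONS.index(rel)] = 1.0
    return d
```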

Fig. 3. Classification process

Table 1. Average precision results on the PASCAL Action Dataset using different cues

Action               H_P     H_P & H_BG   H_P & H_BG & H_INT
Walking              67.0    64.0         62.0
Running              75.3    75.4         76.9
Phoning              45.8    42.0         45.5
Playing instrument   45.6    55.6         54.5
Taking photo         22.4    28.6         32.9
Reading              27.0    25.8         31.7
Riding bike          64.5    65.4         75.2
Riding horse         72.8    87.6         88.1
Using PC             48.9    62.6         64.1
Average              52.1    56.3         59.0

Furthermore, every image $I$ can be represented with an interaction matrix $H_I$. Every element $h^I_{kl}$ of the matrix $H_I$ represents the spatial interaction between classes $k$ and $l$:

$$h^I_{kl} = \sum_{i \in O_k} \sum_{j \in O_l} d_{ij},$$

where $O_k$ and $O_l$ denote the sets of detections of objects of classes $k$ and $l$, respectively.

The global interaction model H_GI is then represented as the concatenation of H_O and H_I, and the final spatial interaction model H_INT is defined as the concatenation of the local and global interaction models, H_LI and H_GI.
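Putting the pieces together, a sketch of the global and final interaction models, reusing the illustrative occurrence_histogram and spatial_feature helpers from the sketches above:

```python
import numpy as np

def interaction_matrix(detections, num_classes):
    """H_I: the (k, l) entry accumulates d_ij over all ordered pairs of
    detections whose classes are k and l; each entry is a vector of
    relation counts. `detections` holds (class_id, score, box) triples."""
    h_i = np.zeros((num_classes, num_classes, len(RELATIONS)))
    for i, (ci, _, box_i) in enumerate(detections):
        for j, (cj, _, box_j) in enumerate(detections):
            if i != j:
                h_i[ci, cj] += spatial_feature(box_i, box_j)
    return h_i

def interaction_model(detections, num_classes, h_li):
    """H_INT = [H_LI, H_GI], where H_GI = [H_O, H_I (flattened)]."""
    h_o = occurrence_histogram([(c, s) for c, s, _ in detections],
                               num_classes)
    h_i = interaction_matrix(detections, num_classes).ravel()
    return np.concatenate([h_li, h_o, h_i])
```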

Classification

At this stage, the images, represented with histograms, are classified using a Support Vector Machine (SVM) classifier (see Fig. 3), trained and tested on the respective image sets. A histogram intersection kernel is used to introduce non-linearity into the decision functions. In order to fuse the multiple image representations H_P, H_BG, and H_INT, we concatenate the normalized histograms.
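A minimal sketch of this stage with scikit-learn, passing the intersection kernel as a precomputed Gram matrix (the fusion and helper names are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

def l1_normalize(h):
    return h / max(np.abs(h).sum(), 1e-9)

def fuse(h_p, h_bg, h_int):
    """Fused representation: concatenation of normalized histograms."""
    return np.concatenate([l1_normalize(h_p), l1_normalize(h_bg),
                           l1_normalize(h_int)])

def intersection_kernel(A, B):
    """Histogram intersection: K(x, y) = sum_d min(x_d, y_d)."""
    return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

def train_and_predict(X_train, y_train, X_test):
    """X_train/X_test: rows are fused histograms; y_train: action labels."""
    clf = SVC(kernel="precomputed")
    clf.fit(intersection_kernel(X_train, X_train), y_train)
    return clf.predict(intersection_kernel(X_test, X_train))
```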


Fig. 4. Correctly classified examples of walking (a), running (b), phoning (c), playing instrument (d), taking photo (e), reading (f), riding bike (g), riding horse (h), using PC (i)


Fig. 5. Misclassified examples of walking (a), running (b), phoning (c), playing instrument (d), taking photo (e), reading (f), riding bike (g), riding horse (h), using PC (i)

Experimental Results

Instead of applying our technique to sports datasets as in [4,5], we test our approach on the more challenging dataset of the PASCAL VOC Challenge 2010 [13]. The main feature of this dataset is that each person is annotated with a bounding box together with the activity they are performing: phoning, playing a musical instrument, reading, riding a bicycle or motorcycle, riding a horse, running, taking a photograph, using a computer, or walking. To train the spatial interaction model based on object detections we use the 20 object classes: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, and tv/monitor.

To evaluate the importance of context and interactions, three main experiments were conducted: (i) using only the pose model, (ii) using the pose and scene models, and (iii) using the pose, scene, and spatial interaction models (see Table 1). A selection of correctly classified and misclassified examples is shown in Figures 4 and 5. The difficulty of the dataset lies in its mixture of simple actions (walking, running), actions with unknown objects (phoning, playing an instrument, taking a photo, reading), and actions with known objects (riding a bike, riding a horse, using a PC). The evaluation of results is performed by computing precision-recall curves and average precision measures.
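Per-class average precision can be computed from one-vs-rest classifier scores in the standard way; a short sketch using scikit-learn (the function name is illustrative):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def per_class_ap(y_true, scores, class_ids):
    """Average precision per action class from one-vs-rest scores.
    y_true: true labels; scores[:, c]: confidence for class c."""
    return {c: average_precision_score(y_true == c, scores[:, c])
            for c in class_ids}
```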

Table 2. Comparison of average precision scores on the PASCAL Action Dataset; columns follow the action order of Table 1 (details on these methods can be found in [13])

Method                           Walk   Run    Phone  Instr  Photo  Read   Bike   Horse  PC     Avg
WILLOW LSVM                      41.5   73.6   40.4   29.9   17.6   32.2   53.5   62.2   45.8   44.1
SURREY MK KDA                    68.6   86.5   52.6   53.5   32.8   35.9   81.0   89.3   59.2   62.2
WILLOW SVMSIFT                   56.4   78.3   47.9   29.1   26.0   21.7   53.5   76.7   42.9   48.1
WILLOW A SVMSIFT 1-A LSVM        56.9   81.7   49.2   37.7   24.3   22.2   73.2   77.1   53.7   52.9
UMCO DHOG KSVM                   60.4   83.0   53.5   43.0   34.1   32.0   67.9   68.8   45.9   54.3
BONN ACTION                      61.1   78.5   47.5   51.1   32.4   31.9   64.5   69.1   53.9   54.4
NUDT SVM WHGO SIFT C-LLM         71.5   79.5   47.2   47.9   24.9   24.5   74.2   81.0   58.6   56.6
INRIA SPM HT                     61.8   84.6   53.2   53.6   30.4   30.2   78.2   88.4   60.9   60.1
CVC SEL                          72.5   85.1   49.8   52.8   24.9   34.3   74.2   85.5   64.1   60.4
CVC BASE                         69.2   86.5   56.2   56.5   25.4   34.7   75.1   83.6   60.0   60.8
UCLEAR SVM DOSP MULTFEATS        70.1   87.3   47.0   57.8   32.5   26.9   78.8   89.7   60.0   61.1
Our method (H_P & H_BG & H_INT)  62.0   76.9   45.5   54.5   32.9   31.7   75.2   88.1   64.1   59.0

As Table 1 shows, for simple actions (walking, running) pose information is the most important cue. The minor improvement for the running class can be explained by the fact that running is usually observed outdoors with groups of people, while walking does not follow such a pattern and can occur equally indoors and outdoors; thus, adding context and interaction information decreases its recognition rate.

Next, for actions involving interactions with unknown objects there is no single pattern. Results for phoning are better when the pose model is used alone; this has two explanations: (i) the typical pose is discriminative enough for this action, and (ii) the bounding box containing the human usually occupies almost the whole image, so there is little room for context and objects in the scene. An action like playing an instrument improves significantly with the scene model, since this activity often means "playing at a concert", with a quite particular and distinguishable context, e.g. cluttered dark indoor scenes. Even though we observe an increase in performance for taking a photo, its recognition rate remains low due to significant variations in appearance. The recognition results for the reading class increase significantly when adding the object interaction model, as reading is usually observed in indoor environments, where many objects like sofas, chairs, or tables can be detected.

Finally, actions like riding a bike, riding a horse, and using a PC improve significantly (13.5% on average per class) when the complete model (Pose & Scene & Interaction) is used, compared with results based on the pose model only. This shows the particular importance of context and spatial object interaction information for action recognition.

Comparing our results with the state of the art in Table 2, we can see that our method performs on average around 3% behind the best results reported in [13]. However, our work uses a simplified model based only on SIFT and HOG features, while the team that achieved the best results built their action recognition framework on 18 different variations of SIFT [13].

Conclusions and Future Work

In this paper, our main hypothesis is that human event recognition requires modelling the relationships between the humans and the environment where the event happens. In order to assess this hypothesis, we propose a strategy for the efficient combination of three sources of knowledge: human pose, scene appearance, and spatial object interaction. The experimental results on the recent PASCAL 2010 Action Recognition Challenge show a significant gain in recognition rate when our full model is used.

In future work, we will extend our method with additional features and test it on other datasets. Moreover, motion information should be added in order to model spatio-temporal interactions, which are of interest for video-based action recognition.
