3.4 Methods for Relational User Attribute Inference
3.4.1 User Feature Extraction
We aim to extract a rich set of user features from user-generated multimedia profiles
and posts. The features fall into six types: sociolinguistic, unigram, topic-based,
profile photo, profile photo face, and post photo features. Both textual and visual
content are considered: the first three types are text-based and the last three are
based on visual content.
Textual User Features. For each user, we aggregate the profile and all posts
into a single document for textual feature extraction. Previous work has extensively
studied the effectiveness of different textual features for attribute classification
[26-28]. Generally, unigram and sociolinguistic models with term presence achieve
good results across attribute inference tasks, so we use both types of features.
The sociolinguistic feature is constructed by retaining sociolinguistic words or
signs (e.g., umm, uh-huh). The unigram model removes these signs to
construct its feature. Instead of using all the words in the data collection for feature
representation, we use a simple method to select discriminative words for each
attribute category. The basic idea is to measure each word by a score:
\[ s(t_i) = \frac{\sum_k \left( ATT_{w_{ki}} - AVE_{w_i} \right)^2}{N_{w_i}} \cdot \log\left( N_{w_i} \right) \tag{3.1} \]
where ATT_{w_{ki}} is the number of occurrences of word w_i in the k-th attribute value;
AVE_{w_i} and N_{w_i} denote, respectively, the average and total number of occurrences
of word w_i over the attribute type. We select the 10,000 words with the highest scores
and use word presence as the feature weight for the sociolinguistic feature. Each user is
therefore represented as a 10,000-dimensional binary feature vector. For the topic-based
feature, LDA [4] is applied to extract latent topics from user profiles and posts. After
topic distillation, each user is represented as a distribution over the derived topic space.
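The word-scoring and presence-encoding steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the score takes the squared deviation of each per-value count from the average, divides by the total count, and weights by the log of the total count. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def word_scores(docs_by_value):
    """Score each word from tokenized user documents grouped by attribute value.

    docs_by_value maps an attribute value (e.g. "female") to a list of
    token lists, one per user.
    """
    # ATT[k][w]: occurrences of word w among users with attribute value k
    att = {k: Counter(w for doc in docs for w in doc)
           for k, docs in docs_by_value.items()}
    vocab = set().union(*att.values())
    scores = {}
    for w in vocab:
        counts = [att[k][w] for k in att]
        n = sum(counts)              # N_w: total occurrences of w
        ave = n / len(counts)        # AVE_w: average occurrences per value
        # ASSUMPTION: squared deviation from the average
        spread = sum((c - ave) ** 2 for c in counts)
        scores[w] = spread / n * math.log(n)
    return scores


def select_top_words(scores, k):
    """Return the k highest-scoring words (ties broken alphabetically)."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


def presence_vector(doc, selected_words):
    """Binary term-presence vector of a user's document over selected words."""
    present = set(doc)
    return [1 if w in present else 0 for w in selected_words]


# Toy data (hypothetical): two gender values, two users each
docs_by_value = {
    "female": [["love", "shopping", "the"], ["love", "the"]],
    "male": [["football", "the"], ["football", "game", "the"]],
}
scores = word_scores(docs_by_value)
top = select_top_words(scores, 2)
vec = presence_vector(["love", "love", "the"], top)
```

In this toy example, "the" occurs equally often under both attribute values and so scores zero, while "love" and "football" each occur under only one value and are selected. The source selects the top 10,000 words rather than 2.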
Visual User Features. We extract visual features from two types of photos:
profile photos and post photos. Each profile photo is represented as an 809-dimensional
feature vector [41], consisting of an 81-dimensional color moment, a 37-dimensional
edge histogram, a 120-dimensional wavelet texture feature, a 59-dimensional LBP
feature [24], and a 512-dimensional GIST feature [30]. In addition, profile photos
usually contain human faces. Since faces are useful for identifying face-related
attributes such as age and gender, we detect face regions in profile photos
and extract the same 809-dimensional feature vector from them to construct the profile
face feature. We use a face detection tool (see footnote 9) that can largely handle
different face poses and scales in web images. The appearance of post photos varies
across a wide range of concepts. To obtain a compact and semantic representation for
each user, we explicitly map each user's post photos onto a pre-defined concept list.
The concept list
9 http://www.faceplusplus.com/en/
 
 