Table 3.3 The statistics of the collected Google+ data

#Users   #Profile photos   #Posts    #Post photos   #Attached objects
2,548    2,548             846,339   88,988         333,331

in Google+. We call a user popular if he/she has a considerable number of followers and
shares rich content. This strategy to some extent avoids data sparsity and relieves the
annotation task, because we can leverage the popular users' profile information from
other platforms such as Facebook 6 and Wikipedia. 7 Referring to the information on
their homepages on other platforms can greatly reduce the annotation workload as well
as improve the accuracy of annotation. Note that since the ground truth for most general
users' attributes is not obtainable, labeling them would rely on annotators' subjective
judgement after comprehensive consideration, which could bias the evaluation. However,
we aim to develop a model that infers user attributes by exploiting online
interaction and multimedia information. This guarantees the applicability of our
model to both general users and celebrities. To collect our dataset, we first built a top
20,000 celebrity ID list from Google+ Social Statistics. 8 For each user in the list,
we submitted his/her ID to the Google+ API and crawled the profile information and
the 500 most recent posts (if applicable). Both the profile text metadata and the profile
photos of users were crawled. For each post, we downloaded the textual content and attached
sources such as articles, photos, and video descriptions. The initial dataset contains
19,624 users. Users with fewer than 20 posts were filtered out. We then preprocessed the
data to filter out non-individual and non-English users. This results in 2,548 celebrities
and 846,339 posts. Table 3.3 lists the statistics of the collected dataset.
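The filtering steps described above can be sketched as follows. This is only a minimal illustration: the text does not say how non-individual or non-English accounts were identified, so the `UserRecord` layout and its `is_individual` and `language` fields are hypothetical and assumed to be filled in by an earlier step. Only the threshold of 20 posts comes from the text.

```python
from dataclasses import dataclass, field
from typing import List

MIN_POSTS = 20  # users with fewer posts are dropped, as stated in the text


@dataclass
class UserRecord:
    # Hypothetical record layout; field names are assumptions for illustration.
    user_id: str
    is_individual: bool   # assumed flag: False for brands, organizations, etc.
    language: str         # assumed language code of the user's content
    posts: List[str] = field(default_factory=list)


def keep_user(user: UserRecord) -> bool:
    """Return True if the user passes all three filters."""
    return (
        len(user.posts) >= MIN_POSTS
        and user.is_individual
        and user.language == "en"
    )


def filter_users(users: List[UserRecord]) -> List[UserRecord]:
    """Keep only individual, English-language users with enough posts."""
    return [u for u in users if keep_user(u)]
```

Applied to the 19,624 crawled users, filters of this kind are what reduce the dataset to the 2,548 celebrities reported above.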
As mentioned above, we study six types of user attributes: gender, age,
relationship, occupation, interest, and sentiment orientation. We invited eight active
social network users as annotators. Three annotators are assigned to each user
record. The annotators are asked to consult information from Facebook,
Wikipedia, and Google Search to complete the attribute annotation of each celebrity
user. A label is accepted as ground truth if at least two annotators agree on it.
Table 3.4 shows the distribution of each attribute.
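The agreement rule (a label becomes ground truth when at least two of the three annotators agree) can be sketched as below. The function name and the choice to return `None` when no label reaches the threshold are assumptions, since the text does not say how full disagreement was handled.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(votes: Sequence[str], min_agree: int = 2) -> Optional[str]:
    """Return the label chosen by at least `min_agree` annotators, else None."""
    if not votes:
        return None
    # most_common(1) yields the single most frequent label and its count.
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None
```

With three votes and `min_agree=2`, at most one label can reach the threshold, so the result is unambiguous. For example, `majority_label(["male", "male", "female"])` yields `"male"`, while three different votes yield `None`.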
Table 3.4 Number of labels for each user attribute

Attribute               Count
Gender                  1,808; 740
Age                     728; 1,820
Relationship            1,228; 1,321
Occupation              68; 500; 210; 261; 13; 31; 307; 88; 560; 20; 11; 141; 28; 131; 179
Interest                685; 179; 174; 385; 891; 70; 704; 91; 169; 21; 152; 47;
Sentiment orientation   1,371; 62; 1,115

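To read the counts in Table 3.4 as proportions, a small helper such as the one below may be useful. It is only an illustrative sketch: the table does not name the classes behind each count, so the shares are reported by position, and the helper name `class_shares` is an assumption.

```python
from typing import List, Sequence


def class_shares(counts: Sequence[int]) -> List[float]:
    """Convert per-class label counts into fractions of the total."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must sum to a positive number")
    return [c / total for c in counts]


# Counts copied from Table 3.4; class names are not given in the table.
table_3_4 = {
    "Gender": [1808, 740],
    "Age": [728, 1820],
    "Sentiment orientation": [1371, 62, 1115],
}

for attribute, counts in table_3_4.items():
    shares = class_shares(counts)
    print(attribute, [f"{s:.1%}" for s in shares])
```

Proportions like these make class imbalance easy to spot, for instance the small middle class under sentiment orientation.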
6 http://www.facebook.com/
7 http://www.wikipedia.org/
8 http://socialstatistics.com/
 