Social Network Analysis in Community-Built Databases - Community-Built Databases: Research and Development

retrieved the date and time of user registration. To pair user accounts across

different Wikipedia projects, we used only the account name (as no other reliable

information is available). Next, we considered data within some predefined time

range only and constructed a binary matrix containing information about the

presence of a given user account in the analyzed projects. In the next step, we reduced

the dimension of this matrix using NMF (with rank 3). The resulting rank-3 approximation

differs slightly from the original matrix. In our interpretation, a change from 0 to 1 in

attribute A of user U means that many users with attributes similar to those of user U

also possessed attribute A. This change can be considered a recommendation that the

user is interested in this project (or a prediction that the user will become

interested in this project in the future). To verify this, we compared these proposals with data

observed several years later. Only users participating in more than two

projects were selected.
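
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy user-by-project matrix, the function names `nmf` and `suggest`, the multiplicative-update solver, and the 0.5 threshold used to turn the rank-3 approximation back into 0/1 values are all hypothetical, since the text does not specify them.

```python
import numpy as np

def nmf(V, rank, iters=500, seed=0, eps=1e-9):
    """Approximate V by W @ H with non-negative factors.

    Uses multiplicative updates for the Frobenius norm; the solver
    choice is an assumption, the text only says "NMF (with rank 3)".
    """
    rng = np.random.default_rng(seed)
    n_users, n_projects = V.shape
    W = rng.random((n_users, rank)) + eps
    H = rng.random((rank, n_projects)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def suggest(V, rank=3, threshold=0.5):
    """Return a boolean mask of entries that change from 0 to 1.

    The threshold is a hypothetical way to binarize the approximation.
    """
    W, H = nmf(V.astype(float), rank)
    approx = W @ H
    return (approx >= threshold) & (V == 0)

# Toy data: rows are user accounts, columns are projects (1 = account present).
V = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 0, 1, 1],
])

suggestions = suggest(V)
for user, project in zip(*np.nonzero(suggestions)):
    print(f"suggest project {project} to user {user}")
```

Each `True` entry in `suggestions` marks a project the user has not joined but that users with similar rows have.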

3.3.2.1 Obtained Results

The experiment results are summarized in Table 3.1. We constructed several

datasets containing information about user accounts from different years (line 1).

From each dataset we selected a random sample of data (line 2). For the year

2008, we selected three samples of different sizes. Next, we computed our

baseline, the probability of a random suggestion being true with respect to the data

from 8/3/2010 (line 3). Using the aforementioned process, we generated

suggestions (line 4) and computed the precision (line 5) and recall (line 6) of these

suggestions. The results show that this process can significantly improve on random

suggestions (from 2-9% up to 20-27%), but only in a limited scope (recall 4-17%).

It seems that some correlation exists between the sample size and recall, but this

correlation is probably not direct and likely depends on some other property.
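
The evaluation described above can be illustrated with a small worked example. This is a sketch under assumptions: the function name `evaluate`, the toy accounts and projects, and the reading of the baseline as the share of all absent (user, project) pairs that are present in the later snapshot are hypothetical. Precision and recall follow their standard definitions.

```python
from itertools import product

def evaluate(suggested, earlier, later, users, projects):
    """Return (baseline, precision, recall) for a set of suggestions.

    All arguments except users/projects are sets of (user, project) pairs.
    """
    # Pairs that were absent in the earlier snapshot: every possible suggestion.
    candidates = set(product(users, projects)) - earlier
    # Pairs that were absent earlier but present in the later snapshot.
    newly_joined = candidates & later
    hits = suggested & newly_joined
    baseline = len(newly_joined) / len(candidates) if candidates else 0.0
    precision = len(hits) / len(suggested) if suggested else 0.0
    recall = len(hits) / len(newly_joined) if newly_joined else 0.0
    return baseline, precision, recall

users = ["u1", "u2", "u3"]
projects = ["en", "de", "cs", "commons"]
earlier = {("u1", "en"), ("u1", "de"), ("u2", "en"), ("u3", "cs")}
later = earlier | {("u1", "commons"), ("u2", "de")}
suggested = {("u1", "commons"), ("u2", "cs")}

baseline, precision, recall = evaluate(suggested, earlier, later, users, projects)
print(baseline, precision, recall)  # 0.25 0.5 0.5
```

In this toy case 2 of the 8 absent pairs appear later (baseline 25%), 1 of the 2 suggestions is correct (precision 50%), and 1 of the 2 new memberships is found (recall 50%).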

Recomputing the NMF every time we need suggestions for one particular user

is clearly inefficient. There are two options: we can preprocess the data from

time to time, or we can use OnlineNMF (see [8]) to update the computed data

whenever needed. Such an approach can be used not only to discover users' interest

in projects but, in the same manner, also in documents, sets of documents, or topics in

general. It would also be interesting to investigate the differences in behavior of

similar matrix decomposition methods, such as the singular value decomposition

(SVD, [22]) and especially the semidiscrete decomposition method (SDD, [28]),

which is well suited to binary data.
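
To illustrate the preprocessing option, the sketch below scores a single user against a factor matrix that was computed earlier, without refactoring the whole dataset. It is not the OnlineNMF algorithm of [8]; it is a simple fold-in step chosen for illustration, and the function name `fold_in`, the matrix `H`, and the user vector `v` are hypothetical.

```python
import numpy as np

def fold_in(v, H, iters=200, eps=1e-9):
    """Find non-negative weights w such that w @ H approximates v, with H fixed.

    Only the rank-sized vector w is updated, so the cost per user is small.
    """
    w = np.full(H.shape[0], 0.5)
    HHt = H @ H.T   # computed once per user, size rank x rank
    vHt = v @ H.T
    for _ in range(iters):
        w *= vHt / (w @ HHt + eps)
    return w

# Hypothetical preprocessed factor: 3 latent groups over 5 projects.
H = np.array([
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
])

# A user who currently participates only in project 0.
v = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

w = fold_in(v, H)
scores = w @ H
# Rank only the projects the user has not joined yet.
best = int(np.argmax(np.where(v == 0, scores, -np.inf)))
print(best)  # 1
```

Here project 1 is ranked first because it shares a latent group with project 0, the only project the user has joined.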

Table 3.1 Results of the recommendation experiment (compared with the data from 8/3/2010)

Data from: 1/1/2007; 1/1/2008; 1/1/2009

Sample size: 1,156; 751; 1,309; 2,838; 2,660

Random change: 9%; 2%

Suggestions: 150; 346; 165; 1,478; 393

Precision: 26%; 27%; 20%

Recall: 4%; 15%; 4%; 17%; 14%
