retrieved the date and time of user registration. To pair user accounts across
different Wikipedia projects, we used only the account name (as no other reliable
information is available). Next, we considered data within some predefined time
range only and constructed a binary matrix containing information about the
presence of a given user account in the analyzed projects. In the next step, we reduced
the dimension of this matrix using NMF (with rank 3). The resulting low-rank
reconstruction differs slightly from the original matrix. We interpret a change
from 0 to 1 in attribute A of user U as meaning that many users with attributes
similar to those of user U also possess attribute A. Such a change can be considered
a recommendation that the user may be interested in this project (or a prediction
that the user will become interested in it in the future). To verify this, we compared
these proposals with data observed several years later. Only users participating in
more than two projects were selected.
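
The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiment: the factorization uses the standard multiplicative update rules for the squared error, the names `nmf`, `suggest`, and `presence` are hypothetical, and the cut-off of 0.5 used to turn the real-valued reconstruction back into 0/1 values is an assumption (the text does not say how the reconstruction was binarized). The toy matrix is factored with rank 2 only because it is tiny; the experiment used rank 3.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def nmf(v, rank, iterations=300, seed=0, eps=1e-9):
    """Factor v (users x projects) into non-negative w (users x rank) and
    h (rank x projects) with multiplicative updates for the squared error."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # h <- h * (w^T v) / (w^T w h), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # w <- w * (v h^T) / (w h h^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h

def suggest(v, rank=3, threshold=0.5):
    """Return (user, project) cells that are 0 in v but reach the
    (assumed) threshold in the low-rank reconstruction w @ h."""
    w, h = nmf(v, rank)
    approx = matmul(w, h)
    return [(u, p) for u, row in enumerate(v) for p, val in enumerate(row)
            if val == 0 and approx[u][p] >= threshold]

# Toy presence matrix: rows are user accounts, columns are projects.
presence = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
suggestions = suggest(presence, rank=2)
```

Each pair in `suggestions` identifies a project in which the account is currently absent but which the reconstruction associates with accounts of similar behavior.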
3.3.2.1 Obtained Results
The results of the experiment are summarized in Table 3.1. We constructed several
datasets containing information about user accounts from different years (line 1).
From each dataset we selected a random sample of data (line 2). For the year
2008, we selected three samples of different sizes. Next, we computed our
baseline: the probability that a random suggestion is true with respect to the data
from 8/3/2010 (line 3). Using the aforementioned process, we generated
suggestions (line 4) and computed the precision (line 5) and recall (line 6) of these
suggestions. The results show that this process improves substantially on random
suggestions (from 2-9% up to 20-27%), but only in a limited scope (recall 4-17%).
There appears to be some correlation between the sample size and recall, but it is
not direct and probably depends on some other property of the data.
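
The three measures can be illustrated with a short sketch. The definitions below are assumptions about how such an evaluation is typically set up, not a description of the authors' code: candidate cells are those that are 0 in the earlier matrix, a suggestion counts as true if the cell is 1 in the later matrix, and the baseline is the fraction of candidate cells that became 1. The function name `evaluate` and the toy matrices are hypothetical.

```python
def evaluate(before, after, suggested):
    """Compare suggestions against later data.

    before, after: binary user x project matrices (lists of rows).
    suggested: iterable of (user, project) cells that were 0 in `before`.
    Returns (baseline, precision, recall).
    """
    suggested = set(suggested)
    # Cells where a suggestion is possible at all.
    candidates = [(u, p) for u, row in enumerate(before)
                  for p, val in enumerate(row) if val == 0]
    # Candidate cells that actually changed from 0 to 1.
    became_true = {(u, p) for u, p in candidates if after[u][p] == 1}
    hits = len(suggested & became_true)
    baseline = len(became_true) / len(candidates) if candidates else 0.0
    precision = hits / len(suggested) if suggested else 0.0
    recall = hits / len(became_true) if became_true else 0.0
    return baseline, precision, recall

# Toy data: two users, four projects.
before = [[1, 0, 0, 0],
          [0, 1, 0, 0]]
after = [[1, 1, 0, 0],
         [0, 1, 1, 0]]
baseline, precision, recall = evaluate(
    before, after, [(0, 1), (0, 2), (1, 0), (1, 3)])
```

In this toy case two of the six candidate cells became 1, and one of the four suggestions was correct, so the baseline is 1/3, the precision 1/4, and the recall 1/2.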
Recomputing the NMF every time we need suggestions for one particular user
is clearly not efficient. There are two options: we can preprocess the data from
time to time, or we can use the OnlineNMF (see [ 8 ]) to update the computed data
whenever needed. Such an approach can be used not only to discover users' interest
in projects, but in the same manner also in documents, sets of documents, or topics in
general. It would also be interesting to investigate the differences in behavior of
similar matrix decomposition methods, such as the singular value decomposition
(SVD, [ 22 ]) and especially the semidiscrete decomposition method (SDD, [ 28 ]),
which is well suited to binary data.
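
One possible reading of the first option (periodic preprocessing) is sketched below: the project-factor matrix from the last full factorization is kept fixed, and only the coefficients of a single user are fitted when suggestions are requested. This is a simple fold-in step under that assumption, not the OnlineNMF algorithm of [ 8 ]; the names `fold_in` and `scores` and the example factor matrix `h` are hypothetical.

```python
def fold_in(user_row, h, iterations=200, eps=1e-9):
    """Fit non-negative coefficients w (one per factor) so that w @ h
    approximates user_row, keeping the factor matrix h fixed."""
    rank, m = len(h), len(h[0])
    w = [1.0] * rank
    # Precompute h @ h^T (rank x rank) and user_row @ h^T (length rank).
    hht = [[sum(h[i][k] * h[j][k] for k in range(m)) for j in range(rank)]
           for i in range(rank)]
    vht = [sum(user_row[k] * h[i][k] for k in range(m)) for i in range(rank)]
    for _ in range(iterations):
        # Multiplicative update: w <- w * (v h^T) / (w h h^T), element-wise.
        den = [sum(w[j] * hht[j][i] for j in range(rank)) for i in range(rank)]
        w = [w[i] * vht[i] / (den[i] + eps) for i in range(rank)]
    return w

def scores(w, h):
    """Reconstructed project scores for one user: w @ h."""
    return [sum(w[i] * h[i][k] for i in range(len(h)))
            for k in range(len(h[0]))]

# Hypothetical precomputed factors: two groups over four projects.
h = [[1.0, 1.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
user_row = [1, 1, 0, 0]          # present in the first two projects only
w = fold_in(user_row, h)
user_scores = scores(w, h)
```

Here the third project receives a score of about 2/3 although the user is not yet present in it, which would count as a suggestion under a cut-off of 0.5 (the cut-off is again an assumption).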
Table 3.1 Results of the recommendation experiment (compared with the data from 8/3/2010)

Data from        1/1/2007   1/1/2008   1/1/2008   1/1/2008   1/1/2009
Sample size         1,156        751      1,309      2,838      2,660
Random change          9%         9%         9%         9%         2%
Suggestions           150        346        165      1,478        393
Precision             26%        27%        27%        27%        20%
Recall                 4%        15%         4%        17%        14%