The Semantics of Tagging - Social Semantics: The Search for Meaning on the Web


from del.icio.us are situated in the interval [ < α < ]. Figure 5.8 shows that both experimental conditions and the aggregated data from del.icio.us have similar exponents. Our results show that a similar 732391 249359 holds for both the 'tag suggestion' and 'no tag suggestion' conditions.

5.3.3.2 Kolmogorov-Smirnov Complexity

Determining whether a particular distribution is a 'good fit' for a power-law is difficult, as most goodness-of-fit tests employ some sort of Gaussian (normality) assumption that is inappropriate for non-normal power-law distributions. However, the Kolmogorov-Smirnov Test (abbreviated as the 'KS Test') can be employed as a 'goodness-of-fit' test for any distribution without implicit parametric assumptions, and is thus ideal for measuring the goodness-of-fit of a given finite distribution to a power-law function. Intuitively, given a reference distribution P (perhaps produced by some well-known function like a power-law) and a sample distribution Q of size n, where one is testing the null hypothesis that Q is drawn from P, one simply compares the cumulative frequencies of P and Q. The greatest discrepancy between the two distributions (the D-statistic) is then tested against the critical value for n, which varies per function.
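The comparison of cumulative frequencies described above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the study: it assumes a continuous power-law with density proportional to x^(-alpha) above a lower bound, and the names `power_law_cdf`, `ks_statistic`, `alpha` and `x_min` are hypothetical choices made for this example.

```python
def power_law_cdf(x, alpha, x_min):
    """CDF of a continuous power law (assumed form) for x >= x_min, alpha > 1."""
    if x < x_min:
        return 0.0
    return 1.0 - (x / x_min) ** (1.0 - alpha)


def ks_statistic(sample, cdf):
    """Greatest gap between the empirical CDF of `sample` and a reference `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF steps from i/n up to (i + 1)/n at x,
        # so the gap is checked on both sides of the step.
        d = max(d, abs((i + 1) / n - f), abs(f - i / n))
    return d


# Usage: a sample placed exactly on the quantiles of the reference distribution.
alpha, x_min, n = 2.5, 1.0, 10
sample = [x_min * (1 - (i + 0.5) / n) ** (-1 / (alpha - 1)) for i in range(n)]
print(ks_statistic(sample, lambda x: power_law_cdf(x, alpha, x_min)))
```

A smaller D means the sample's cumulative frequencies track the reference more closely. Tag frequencies are discrete counts, so a discrete power-law may be the more faithful reference in practice.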

For a power-law distribution generating function, we can get a critical p-value by generating artificial data using the scaling exponent α and lower bound equal to those found in the supposed fitted power-law distribution. A power-law is fit to each artificially generated data set, and the KS test is then run on each one, comparing it to its own fitted power-law. The p-value is then just the fraction of times the D-statistic of an artificially generated distribution is larger than the D-statistic of the empirically found distribution. Therefore, the larger the p-value, the more likely a genuine power-law has been found in the empirical data. According to Clauset, “once we have calculated our p-value, we need to make a decision about whether it is small enough to rule out the power-law hypothesis” (emphasis added) (2007). The power-law hypothesis is simply that the distribution was generated by a power-law generating function. The null hypothesis is that by chance a function would generate the power-law distribution observed in the empirical data. We shall also use p
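The resampling procedure for the p-value can be sketched as follows. This is a simplified illustration under stated assumptions rather than the code behind the reported results: it assumes a continuous power-law, holds the lower bound `x_min` fixed instead of re-estimating it for each artificial data set, and uses the standard maximum-likelihood estimate of the exponent. All function and parameter names are hypothetical.

```python
import math
import random


def fit_alpha(sample, x_min):
    """Maximum-likelihood exponent of a continuous power law with fixed x_min."""
    return 1.0 + len(sample) / sum(math.log(x / x_min) for x in sample)


def ks_distance(sample, alpha, x_min):
    """D-statistic between the sample and a power law with the given parameters."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = 1.0 - (x / x_min) ** (1.0 - alpha)
        d = max(d, abs((i + 1) / n - f), abs(f - i / n))
    return d


def draw_power_law(n, alpha, x_min, rng):
    """Draw n values from the power law by inverse-transform sampling."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]


def bootstrap_p_value(sample, x_min, trials=200, seed=0):
    """Fraction of artificial data sets whose D exceeds the empirical D."""
    rng = random.Random(seed)
    alpha = fit_alpha(sample, x_min)
    d_empirical = ks_distance(sample, alpha, x_min)
    larger = 0
    for _ in range(trials):
        synthetic = draw_power_law(len(sample), alpha, x_min, rng)
        # Each artificial data set is compared to its own fitted power law.
        d_synthetic = ks_distance(synthetic, fit_alpha(synthetic, x_min), x_min)
        if d_synthetic > d_empirical:
            larger += 1
    return larger / trials


# Usage: evenly spaced values between 1 and 2 are far from a power law,
# so few or no artificial data sets should fit worse than they do.
flat = [1.0 + (i + 0.5) / 200 for i in range(200)]
print(bootstrap_p_value(flat, x_min=1.0))
```

Every value in `sample` must be at least `x_min`. More trials give a finer-grained p-value at the cost of run time.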

The KS test for all 11 tagged web-pages, testing both the 'tag suggestion' and 'no tag suggestion' conditions, is given in Fig. 5.9. The average D-statistic for the 'no tag suggestion' condition is 0.0313 (S.D. 0.0118) with p = 0.08 (p ≤ 0.1, power-law found). For the 'tag suggestion' condition the average D-statistic is 0.0724 (S.D. 0.0256) with p = 0.48 (p > 0.1, no power-law found). These results show that the power-law function is significant only in the 'no tag suggestion' condition, and that the fit is closer for the 'no tag suggestion' condition than for the 'tag suggestion' condition. The D-statistic ranged from 0.0170 to 0.0552 for the 'no tag suggestion' condition, yet from 0.0428 to 0.1318 for the 'tag suggestion' condition. Thus, the power-law only significantly appears without tag suggestions, and with tag suggestions a power-law cannot be reliably found. This is surprising, as tag
