actions of the individual users. First, we examine how power-law distributions
form at the top (the first 25 positions) of tag distributions for each site. For this, we
employ a method from information theory, namely the Kullback-Leibler divergence.
Second, we study the dynamics of the entire tag distributions, including all tags used
for a site, and we show that the relative weights of the top and tail of tag distributions
converge to stable ratios in the data sets.
5.2.4.1 Kullback-Leibler Divergence: Definition
In probability and information theory, the Kullback-Leibler divergence (also known
as "relative entropy" or "information divergence") represents a natural distance
measure between two probability distributions $P$ and $Q$ (in our case, $P$ and $Q$ are
two vectors representing discrete probability distributions). Formally, the Kullback-
Leibler divergence between $P$ and $Q$ is defined as:

$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \qquad (5.4)$$
The Kullback-Leibler distance is a non-negative, convex function, i.e.
$D_{KL}(P \,\|\, Q) \ge 0, \,\forall P, Q$ (note that $D_{KL}(P \,\|\, Q) = 0$ iff. $P$ and $Q$ coincide). Also, unlike
other distance measures it is not symmetric, i.e. in general
$D_{KL}(P \,\|\, Q) \ne D_{KL}(Q \,\|\, P)$.
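These properties can be illustrated with a small sketch. The function below is a direct transcription of Eq. 5.4 for discrete probability vectors (using the usual convention that terms with $P(x) = 0$ contribute zero); the example distributions are hypothetical, chosen only to show non-negativity, the identity-of-indiscernibles property, and the asymmetry noted above.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as equal-length
    probability vectors (Eq. 5.4). Terms with p(x) = 0 contribute 0 by
    convention; q(x) must be positive wherever p(x) is positive."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical example distributions.
p = [0.5, 0.3, 0.2]
q = [0.7, 0.2, 0.1]

# D_KL(P || P) = 0: the divergence vanishes iff the distributions coincide.
assert kl_divergence(p, p) == 0.0

# Non-negative, but not symmetric: D_KL(P || Q) != D_KL(Q || P) in general.
print(kl_divergence(p, q))  # > 0
print(kl_divergence(q, p))  # > 0, and different from the line above
```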
5.2.4.2 Application to Tag Dynamics
We use two complementary ways to detect whether a distribution has converged to
a steady state using the Kullback-Leibler divergence:
The first is to take the relative entropy between every two consecutive points in
time of the distribution, where each point in time represents some change in the
distribution. Again, in our data, tag distributions are based on the rank-ordered
tag frequencies for the top 25 highest-ranked unique tags for any one website.
Each point in time was a given month where the tag distribution had changed;
months where there was no tagging change were not counted as time points.
Using this methodology, a tag distribution that was 'stable' would show the
relative entropy converging to and remaining at zero over time. If the Kullback-
Leibler divergence between two consecutive time points becomes zero (or close
to zero), it suggests that the shape of the distribution has stopped evolving. This
technique may be most useful when it is completely unknown whether or not the
tagging of a particular site has stabilized at all.
The second method involves taking the relative entropy of the tag distribution at
each time step with respect to the final tag distribution for that site, i.e. the
distribution at the time the measurement was taken (the last observation in the data).
This method is most useful for heavily tagged sites where it is already known or
suspected that the final distribution has already converged to a power-law.
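The two convergence checks can be sketched as follows. The monthly tag-count vectors below are hypothetical, and two details are assumptions not specified in the text: counts are normalized to probability vectors, and a small smoothing constant avoids zero denominators when a tag is absent in one month; the direction of the divergence (newer distribution relative to older) is likewise a choice.

```python
import math

def kl(p, q):
    # D_KL(P || Q) for discrete probability vectors (Eq. 5.4).
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def normalize(counts, eps=1e-9):
    # Turn raw tag counts into a probability vector; the eps smoothing is
    # an assumption, used so absent tags never yield a zero denominator.
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]

# Hypothetical monthly frequency vectors for the top-ranked tags of one
# site, aligned by tag rank; only months where the distribution changed.
monthly = [
    [30, 10, 5, 2],
    [60, 22, 9, 5],
    [118, 45, 19, 9],
    [240, 90, 38, 18],
]
dists = [normalize(m) for m in monthly]

# Check 1: relative entropy between consecutive time points. A 'stable'
# distribution drives these values toward (and keeps them at) zero.
consecutive = [kl(dists[t + 1], dists[t]) for t in range(len(dists) - 1)]

# Check 2: relative entropy of each time step with respect to the final
# (last observed) distribution; the last value is zero by construction.
to_final = [kl(d, dists[-1]) for d in dists]

print(consecutive)
print(to_final)
```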