actions of the individual users. First, we examine how power-law distributions
form at the top (the first 25 positions) of tag distributions for each site. For this, we
employ a method from information theory, namely the Kullback-Leibler divergence.
Second, we study the dynamics of the entire tag distributions, including all tags used
for a site, and we show that the relative weights of the top and tail of tag distributions
converge to stable ratios in the data sets.
5.2.4.1 Kullback-Leibler Divergence: Definition
In probability and information theory, the Kullback-Leibler divergence (also known
as "relative entropy" or "information divergence") represents a natural distance
measure between two probability distributions $P$ and $Q$ (in our case, $P$ and $Q$ are
two vectors representing discrete probability distributions). Formally, the Kullback-
Leibler divergence between $P$ and $Q$ is defined as:

$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \qquad (5.4)$$
The Kullback-Leibler distance is a non-negative, convex function, i.e.
$D_{KL}(P \,\|\, Q) \ge 0, \,\forall P, Q$ (note that $D_{KL}(P \,\|\, Q) = 0$ iff. $P$ and $Q$ coincide). Also, unlike
other distance measures it is not symmetric, i.e. in general
$D_{KL}(P \,\|\, Q) \ne D_{KL}(Q \,\|\, P)$.
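These properties can be illustrated with a small sketch. The function below is a direct transcription of Eq. 5.4 for discrete probability vectors (using the usual convention that terms with $P(x) = 0$ contribute zero); the example distributions are hypothetical, chosen only to show non-negativity, the identity-of-indiscernibles property, and the asymmetry noted above.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as equal-length
    probability vectors (Eq. 5.4). Terms with p(x) = 0 contribute 0 by
    convention; q(x) must be positive wherever p(x) is positive."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical example distributions.
p = [0.5, 0.3, 0.2]
q = [0.7, 0.2, 0.1]

# D_KL(P || P) = 0: the divergence vanishes iff the distributions coincide.
assert kl_divergence(p, p) == 0.0

# Non-negative, but not symmetric: D_KL(P || Q) != D_KL(Q || P) in general.
print(kl_divergence(p, q))  # > 0
print(kl_divergence(q, p))  # > 0, and different from the line above
```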
5.2.4.2 Application to Tag Dynamics
We use two complementary ways to detect whether a distribution has converged to
a steady state using the Kullback-Leibler divergence:
The first is to take the relative entropy between every two consecutive points in
time of the distribution, where each point in time represents some change in the
distribution. Again, in our data, tag distributions are based on the rank-ordered
tag frequencies for the top 25 highest-ranked unique tags for any one website.
Each point in time was a given month where the tag distribution had changed;
months where there was no tagging change were not counted as time points.
Using this methodology, a tag distribution that was 'stable' would show the
relative entropy converging to and remaining at zero over time. If the Kullback-
Leibler divergence between two consecutive time points becomes zero (or close
to zero), it suggests that the shape of the distribution has stopped evolving. This
technique may be most useful when it is completely unknown whether or not the
tagging of a particular site has stabilized at all.
The second method involves taking the relative entropy of the tag distribution at
each time step with respect to the final tag distribution for that site, i.e. the
distribution at the time the measurement was taken (the last observation in the data).
This method is most useful for heavily tagged sites where it is already known or
suspected that the final distribution has already converged to a power-law.
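The two convergence checks can be sketched as follows. The monthly tag-count vectors below are hypothetical, and two details are assumptions not specified in the text: counts are normalized to probability vectors, and a small smoothing constant avoids zero denominators when a tag is absent in one month; the direction of the divergence (newer distribution relative to older) is likewise a choice.

```python
import math

def kl(p, q):
    # D_KL(P || Q) for discrete probability vectors (Eq. 5.4).
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def normalize(counts, eps=1e-9):
    # Turn raw tag counts into a probability vector; the eps smoothing is
    # an assumption, used so absent tags never yield a zero denominator.
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]

# Hypothetical monthly frequency vectors for the top-ranked tags of one
# site, aligned by tag rank; only months where the distribution changed.
monthly = [
    [30, 10, 5, 2],
    [60, 22, 9, 5],
    [118, 45, 19, 9],
    [240, 90, 38, 18],
]
dists = [normalize(m) for m in monthly]

# Check 1: relative entropy between consecutive time points. A 'stable'
# distribution drives these values toward (and keeps them at) zero.
consecutive = [kl(dists[t + 1], dists[t]) for t in range(len(dists) - 1)]

# Check 2: relative entropy of each time step with respect to the final
# (last observed) distribution; the last value is zero by construction.
to_final = [kl(d, dists[-1]) for d in dists]

print(consecutive)
print(to_final)
```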