with the axes being log(rank order) and log(frequency). For example, in Figure 1.12 the
word “the” as described in the above scatterplot would have the coordinates x = log(1),
y = log(69,971). The data conform to Zipf's law to the extent that the plotted points
appear to fall along a single straight-line segment.
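The straight-line check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the counts are synthetic, generated from an ideal 1/r law anchored at the top count of 69,971 from the example, and the helper name loglog_slope is hypothetical. The sketch fits a least-squares line to the points (log r, log f); for ideal Zipfian data the fitted slope should be very close to -1.

```python
import math

def loglog_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    `counts` must be sorted from most to least frequent and contain
    only positive values.
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic counts following an ideal 1/r law (an assumption, not real data).
top_count = 69971
counts = [top_count / rank for rank in range(1, 1001)]

slope = loglog_slope(counts)
print(f"fitted slope: {slope:.3f}")
```

With real word counts the points scatter around the line, so the fitted slope and the size of the residuals together indicate how well the data conform to Zipf's law.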
Formally, let L be the number of elements in the data set, let r be the rank of a
data element when ordered from the most to the least frequent, and let η be the value
of the exponent characterizing the distribution. Zipf's law then predicts that, out of a
population of L elements, the frequency of elements of rank r, f(r; η, L), is given by

$$f(r; \eta, L) = \frac{1/r^{\eta}}{\sum_{n=1}^{L} \left(1/n^{\eta}\right)}. \qquad (2.54)$$
Thus, in the example of the frequency of words in the English language given in the
first chapter, L is the number of words in the English language and, if we use the classic
version of Zipf's law, the exponent η is one. The distribution function f(r; η, L) is
the fraction of the time the rth most common word occurs within a given language.
Moreover, it is easily seen from its definition that the distribution is normalized, that is,
the predicted frequencies sum to one:

$$\sum_{r=1}^{L} f(r; \eta, L) = 1. \qquad (2.55)$$
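As a minimal sketch, the distribution in (2.54) can be evaluated numerically and the normalization (2.55) checked directly. The function name zipf_distribution and the population size L = 1000 are assumptions chosen for illustration; they do not come from the text.

```python
def zipf_distribution(eta, L):
    """Return [f(1), ..., f(L)] as defined in Eq. (2.54)."""
    if L < 1:
        raise ValueError("L must be at least 1")
    # Unnormalized weights 1/r**eta for ranks r = 1 .. L.
    weights = [1.0 / r**eta for r in range(1, L + 1)]
    normalization = sum(weights)  # the denominator of Eq. (2.54)
    return [w / normalization for w in weights]

L = 1000    # illustrative population size (assumption)
eta = 1.0   # classic Zipf exponent

freqs = zipf_distribution(eta, L)
total = sum(freqs)

print(f"f(1) = {freqs[0]:.4f}, f(2) = {freqs[1]:.4f}")
print(f"sum of all frequencies = {total:.12f}")
```

Because every weight is divided by the same sum, the frequencies add up to one for any exponent η and any finite L, up to floating-point rounding.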
The simplest case of Zipf's law is a “1/f function.” Given a set of Zipfian distributed
frequencies, sorted from most common to least common, the second most common
frequency will occur 1/2 as often as the first. The third most common frequency will occur
1/3 as often as the first. The nth most common frequency will occur 1/n as often as the
first. However, this cannot hold exactly, because items must occur an integer number of
times: there cannot be 2.5 occurrences of a word. Nevertheless, over fairly wide ranges,
and to a fairly good approximation, many natural and social phenomena obey Zipf's law.
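A quick arithmetic check makes the integer constraint concrete. The sketch below assumes an ideal 1/n law anchored at the top count of 69,971 quoted earlier for the word “the”; the remaining values are predictions of that idealized law, not measured counts.

```python
top_count = 69971  # count of the most common word, from the earlier example

# Counts predicted by an ideal 1/n law for ranks 1 through 5.
predicted_counts = [top_count / n for n in range(1, 6)]

for n, predicted in enumerate(predicted_counts, start=1):
    status = "integer" if predicted.is_integer() else "not an integer"
    print(f"rank {n}: predicted count {predicted:.2f} ({status})")
```

Already at rank 2 the ideal law asks for a half occurrence, so observed counts can only follow the law approximately.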
Mathematically, it is not possible for the classic version of Zipf's law to hold exactly
if there are infinitely many words in a language, since the sum of all relative frequencies
in the denominator of (2.54) is equal to a harmonic series that diverges as L diverges:

$$\sum_{r=1}^{\infty} \frac{1}{r} = \infty. \qquad (2.56)$$
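One standard way to see this divergence, sketched here as a supporting step rather than as the author's own argument, is to compare each term with an integral. Since 1/x ≤ 1/r for every x between r and r + 1, the partial sum over a finite population of L elements is bounded below by a logarithm:

```latex
\sum_{r=1}^{L} \frac{1}{r}
  \;\ge\; \sum_{r=1}^{L} \int_{r}^{r+1} \frac{dx}{x}
  \;=\; \int_{1}^{L+1} \frac{dx}{x}
  \;=\; \ln(L+1)
  \;\longrightarrow\; \infty
  \quad \text{as } L \to \infty .
```

The growth is only logarithmic, which is why the classic law can still be normalized for any finite vocabulary.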
In the English language, the frequencies of the approximately 1,000 most frequently
used words are empirically found to be approximately proportional to 1/r^η, where η is
just slightly above one. As long as the exponent η exceeds 1, it is possible for such a
law to hold with infinitely many words, since if η > 1 the sum no longer diverges,

$$\zeta(\eta) = \sum_{r=1}^{\infty} \frac{1}{r^{\eta}} < \infty, \qquad (2.57)$$

where ζ is the Riemann zeta function.
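The contrast between the two regimes can be illustrated numerically. The sketch below uses a hypothetical helper, zeta_partial_sum, and picks η = 2 as a convenient convergent case because its limit, ζ(2) = π²/6, is known in closed form; the cutoffs are arbitrary choices for illustration.

```python
import math

def zeta_partial_sum(eta, terms):
    """Sum of 1/r**eta for r = 1 .. terms."""
    return sum(1.0 / r**eta for r in range(1, terms + 1))

for terms in (10, 1_000, 100_000):
    harmonic = zeta_partial_sum(1.0, terms)    # eta = 1: keeps growing
    convergent = zeta_partial_sum(2.0, terms)  # eta = 2: levels off
    print(f"{terms:>7} terms: eta=1 -> {harmonic:.4f}, eta=2 -> {convergent:.6f}")

print(f"zeta(2) = pi^2/6 = {math.pi**2 / 6:.6f}")
```

For η only slightly above one, as in the English word data, the sum still converges, but far more slowly than in the η = 2 case shown here.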
Just why data from complex webs, such as those of Auerbach, Lotka, Willis and
others shown in Chapter 1, conform to the distribution of Zipf is a matter of some
 