Information Technology Reference
In-Depth Information
Term Weighting
A special case of term weighting is
binary weighting, which is normally
used when Boolean searches are carried out. In
binary weighting, an index term has a weight of
false if it does not appear in the document and true
otherwise. Retrieval is carried out by evaluating
a Boolean expression, where the values true
and false correspond to the value of the proposition
“the term t belongs to document d” and are
combined with Boolean operators (that is, “and”,
“or”, “not”) in order to create complex queries.
Binary weighting is still very popular because
it is easy to implement and allows great
expressivity in describing the user's information
need through a query.
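
The retrieval step described above can be illustrated with a minimal Python sketch. The toy collection `docs`, the helper names `weight` and `search`, and the example query are assumptions made here for illustration; they are not taken from the text.

```python
# Toy collection (hypothetical): each document is reduced to its set of index terms.
docs = {
    "d1": {"music", "retrieval", "index"},
    "d2": {"music", "melody"},
    "d3": {"text", "retrieval"},
}

def weight(term, doc_id):
    """Binary weight: True if the term appears in the document, False otherwise."""
    return term in docs[doc_id]

def search(predicate):
    """Return the ids of the documents for which the Boolean expression is true."""
    return sorted(doc_id for doc_id in docs if predicate(doc_id))

# Query: music AND (retrieval OR melody) AND NOT text
hits = search(
    lambda d: weight("music", d)
    and (weight("retrieval", d) or weight("melody", d))
    and not weight("text", d)
)
print(hits)  # ['d1', 'd2']
```

Each document either satisfies the expression or does not, so the result is an unranked set; this is the main difference from the graded weights discussed in the rest of the section.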
The last phase of an indexing process is related to
a main consideration: index terms do not describe
the content of a document to the same extent.
It has already been mentioned that stop-words,
which are frequent inside the collection, are not
good descriptors because they do not allow the
differentiation between documents. On the other
hand, it can be argued that a set of terms
that is peculiar to a particular document, in
which the terms are extensively used, is a very good
descriptor of that document because it allows
its exact identification. Clearly, the importance
of a term in describing a document varies along
a continuum that ranges from totally irrelevant
to totally relevant.
For textual documents it has been proposed
that the frequency with which a word appears in a
document is directly proportional to its relevance,
while the frequency with which it appears in the collection
is inversely proportional to its relevance.
These considerations gave rise to a very popular
weighting scheme called term frequency-inverse
document frequency, tf·idf for short. There
are a number of variants of this scheme,
which share the same principles:
Application to the Music Domain
If a musical lexical unit, for any chosen dimension,
appears frequently inside a given document,
it is very likely that listeners will remember it.
Moreover, a frequent lexical unit can be part of the
music material that is proposed and developed by
the composer, or can also be part of the composer's
personal style. Finally, frequent lexical units have
a good chance of being part of a user query. Thus,
term frequency seems to be a reasonable choice
for music documents as well. On the other hand, a
lexical unit that is very common inside a collection
of documents can be related to the style of a thematic
collection (the chord progression of blues songs,
the accent on the upbeat in reggae music),
can correspond to a simple musical gesture (a
repeated note, a major scale), or can be the most
used solution for particular passages (the
descending bass connecting two chords, a seventh
chord introducing a modulation). Moreover, a
user may not use frequent lexical units as parts
of a query because it is clear that they will not
address any particular document. Thus, inverse
document frequency seems to be a reasonable
choice as well.
Yet, some care has to be taken in the direct
application of a tf·idf weighting scheme to music in-
Term frequency is computed, for each term
t and each document d, from a monotonically
increasing function of the number of occurrences
of t in d.
Inverse document frequency is computed, for
the set of documents d_t that belong to collection
C and contain at least one occurrence of
term t, from a monotonically decreasing function
of the size of d_t normalized by the size of C.
A widely used implementation of the two
monotonic functions for the computation of tf·idf
is reported in the following formula:

$$
\mathrm{tf \cdot idf}_{t,d} = w_{t,d} = \frac{\mathrm{freq}(t,d)}{\max_{l}\, \mathrm{freq}(l,d)} \times \log \frac{|C|}{|d_t|}
$$
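
As a rough illustration of the formula above, the following Python sketch computes the weight of every term in every document of a small collection. The function name `tf_idf`, the toy collection, and the use of the natural logarithm are assumptions made here; the text does not specify the base of the logarithm.

```python
import math
from collections import Counter

def tf_idf(collection):
    """Compute w[d][t] = freq(t, d) / max_l freq(l, d) * log(|C| / |d_t|).

    `collection` maps a document id to its list of index terms.
    The natural logarithm is an assumption; the base only rescales the weights.
    """
    counts = {doc_id: Counter(terms) for doc_id, terms in collection.items()}
    n_docs = len(collection)  # |C|

    # |d_t|: number of documents that contain at least one occurrence of t.
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())

    weights = {}
    for doc_id, term_counts in counts.items():
        if not term_counts:  # an empty document has no weighted terms
            weights[doc_id] = {}
            continue
        max_freq = max(term_counts.values())  # max_l freq(l, d)
        weights[doc_id] = {
            term: (freq / max_freq) * math.log(n_docs / doc_freq[term])
            for term, freq in term_counts.items()
        }
    return weights

# Hypothetical toy collection.
collection = {
    "d1": ["the", "the", "blues", "blues", "blues"],
    "d2": ["the", "jazz"],
    "d3": ["the", "the", "reggae"],
}
w = tf_idf(collection)
print(w["d1"]["the"])    # 0.0, because "the" occurs in every document
print(w["d1"]["blues"])  # log(3), the most frequent term of d1, found only there
```

A term that occurs in every document, like a stop-word, gets weight zero regardless of how often it appears, while a term concentrated in a single document gets the highest weight.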