dexing because of the evident difference between
textual and musical communication. One thing
that is worth mentioning is that users access the
two media very differently. In particular, music
documents are accessed many times by users,
who may choose to not listen to the complete
song, but only to a part of the song. Moreover, it
is common practice for radio stations to broadcast
only the parts of the songs with the sung melody,
skipping the intro and the coda, and fading out
during long guitar solos. The computation of
the relative importance with which a lexical unit
describes a document should also account for these
aspects. Moreover, listeners are likely to remember
and use in their queries the part of the song where
the title is sung, which becomes more relevant
regardless of its frequency inside the documents
and inside the collection. Yet, there have been very
few studies that investigate the best weighting
scheme for music indexing, and in many cases
a direct implementation of the tf·idf scheme (such as
the one presented in this section) is used.
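As a rough illustration of such a direct implementation, the sketch below computes tf·idf weights for lexical units. It assumes one common variant (raw term frequency multiplied by log(N / df)); the exact formula adopted in this section may differ, and the function name `tf_idf_weights` and the sample units are hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(documents):
    """Return one {unit: weight} dict per document.

    Assumption: tf is the raw count of a lexical unit in the document and
    idf = log(N / df), with N the collection size and df the number of
    documents containing the unit. Other variants are equally plausible.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each unit appears.
    df = Counter(unit for doc in documents for unit in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({u: tf[u] * math.log(n_docs / df[u]) for u in tf})
    return weights

# Hypothetical lexical units, e.g. short melodic patterns encoded as strings.
collection = [
    ["A", "B", "A", "C"],
    ["A", "D"],
    ["B", "B", "E"],
]
weights = tf_idf_weights(collection)
```

In this toy collection the unit "C" occurs once in the first document yet receives a higher weight than "A", which occurs twice but also appears in another document. Nothing in the scheme accounts for where in the song a unit occurs, which is the limitation discussed above.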
Note that the ability to assign different weights
to lexical units is an important difference between
information retrieval and approaches based on
recognition, such as approximate string matching
techniques. The former allows documents to be ranked
according to the relevance of their lexical units as
content descriptors, while the latter allows documents
to be ranked according to the degree to which an
excerpt of each document matches the query. In other
words, with approximate matching a good match with an
almost irrelevant excerpt may give a higher rank
than a more approximate match with a highly relevant
excerpt. It could be advisable to extend weighting
approaches to methods other than indexing as well.
To this end, a mixed approach of indexing with
approximate matching has been proposed in Basaldella
and Orio (2006), where each index term was
represented by a statistical model and the final
weight of each index term of the query was computed
by combining the tf·idf scheme with the probability
with which it was generated by the model.
Retrieval Techniques
Once indexes have been built through the four
steps described earlier, and both the collection of
documents and the user query have been indexed,
it is possible to perform retrieval. Note that
the query also has to be analyzed and indexed in
order to retrieve relevant documents, because the
similarity between the query and the documents is
computed using indexes only.
Different approaches can be applied to retrieval;
the most intuitive one, which has been extensively
applied in the experiments reported in the following
sections, is the Vector-Space Model (VSM). According
to the VSM, both documents and queries are
represented as K-variate vectors of descriptor
weights $w_{t,d}$, where K is the total number of
unique descriptors or indexes. Then, document $d_i$
is represented as $d_i = (w_{i1}, \ldots, w_{iK})$,
while query q is represented as
$q = (q_1, \ldots, q_K)$. The weight $w_{t,d}$ of
index term t within document d is computed according
to the tf·idf scheme already described. Query
descriptor weights are usually binary values, that
is, $q_t = 1$ if term t occurs within query q, and
0 otherwise.
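The vector representation just described can be sketched as follows. This is a minimal illustration, assuming that each document is already available as a dictionary of descriptor weights; the helper names and the numeric weights are hypothetical.

```python
def build_vocabulary(indexed_docs):
    """Sorted list of the K unique descriptors in the collection."""
    return sorted({t for doc in indexed_docs for t in doc})

def document_vector(weights, vocabulary):
    """K-variate vector of weights; absent descriptors get weight 0."""
    return [weights.get(t, 0.0) for t in vocabulary]

def query_vector(query_terms, vocabulary):
    """Binary query vector: 1 if the descriptor occurs in the query."""
    present = set(query_terms)
    return [1 if t in present else 0 for t in vocabulary]

# Hypothetical descriptor weights for two documents.
doc_weights = [{"A": 0.8, "B": 0.4}, {"B": 1.1, "C": 0.3}]
vocab = build_vocabulary(doc_weights)
d_vectors = [document_vector(w, vocab) for w in doc_weights]
q = query_vector(["B", "C", "Z"], vocab)
```

Here K = 3, and the query descriptor "Z", which never occurs in the collection, has no component in the vector space and is ignored in this sketch.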
The retrieval status value (RSV) is the cosine
of the angle between the query vector and the
document vector. That is:
$$\mathrm{RSV}(d, q) = \cos(d, q) = \frac{d \cdot q}{|d|\,|q|}$$
where d and q are the vector representations of
the document and the query, respectively, and $|x|$
is the norm of vector x. As the cosine function
normalizes the RSV to the query and document
lengths, long documents have the same chance of
being retrieved as short ones.
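A minimal sketch of the RSV computation and of the resulting ranking is given below. The function names and the sample vectors are hypothetical, and returning 0 for an all-zero vector is an assumption made here to avoid a division by zero.

```python
import math

def rsv(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    dot = sum(x * y for x, y in zip(d, q))
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_q = math.sqrt(sum(y * y for y in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_d * norm_q)

def rank(doc_vectors, q):
    """Document indices sorted by decreasing RSV."""
    scores = [rsv(d, q) for d in doc_vectors]
    return sorted(range(len(doc_vectors)), key=lambda i: scores[i], reverse=True)

short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]   # same proportions, ten times larger weights
other_doc = [0.0, 1.0, 3.0]
query = [1, 1, 0]
ranking = rank([short_doc, long_doc, other_doc], query)
```

The vectors short_doc and long_doc obtain the same RSV up to floating-point rounding, which illustrates the length normalization, while other_doc is ranked last.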
In order to be comparable, both documents
and queries need to be transformed. This process
usually corresponds to the segmentation of
music documents into their lexical units, and to a
more complex query processing. The latter can