Information Technology Reference
In-Depth Information
show that all documents contain words such as
“music”, “computer”, “note”, “algorithm”, and so
on. Moreover, these words are probably evenly
distributed across the collection, so they contribute
very little to distinguishing the content of a particular
document from that of the others. In this case too,
a collection-dependent stop-list can be created, and
words belonging to the stop-list can be ignored in
subsequent phases of document indexing. The stop-list
can be computed automatically by analyzing a
representative sample of the collection and adding to
the stop-list all the words that consistently appear in
all (or in a high percentage) of the analyzed documents.
Clearly, this kind of analysis also highlights the words
that have a grammatical function and no semantic
content, as described above, so a two-step removal of
stop-words can be avoided. Nevertheless, the designer
of an IR system can choose to remove only the words
that are both frequent and uninformative, keeping the
ones that are merely frequent.
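The automatic construction of a collection-dependent stop-list described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `build_stop_list`, the parameter `min_doc_fraction`, its default of 0.9, and the toy sample are all assumptions made for the example, and documents are assumed to be already tokenized into lexical units.

```python
from collections import Counter


def build_stop_list(sample_docs, min_doc_fraction=0.9):
    """Return the lexical units found in at least `min_doc_fraction` of the sample.

    `sample_docs` is a list of documents, each given as a list of tokens.
    The 0.9 default is an illustrative assumption, not a recommended value.
    """
    if not sample_docs:
        return set()
    doc_freq = Counter()
    for tokens in sample_docs:
        # Count each unit once per document (document frequency),
        # regardless of how often it occurs inside the document.
        doc_freq.update(set(tokens))
    threshold = min_doc_fraction * len(sample_docs)
    return {unit for unit, df in doc_freq.items() if df >= threshold}


# Hypothetical sample from a collection about music and computing.
sample = [
    ["the", "music", "of", "bach"],
    ["the", "algorithm", "for", "music", "retrieval"],
    ["a", "note", "on", "the", "music", "collection"],
    ["the", "computer", "music", "journal"],
]
stop_list = build_stop_list(sample, min_doc_fraction=0.9)
print(sorted(stop_list))
```

In this toy sample both "the" (a purely grammatical word) and "music" (a topical word that appears everywhere) end up in the stop-list, which is why a single pass can replace the two-step removal. A designer who wants to keep words that are frequent but still informative would need an additional filter on top of this one.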
Application to the Music Domain
It is difficult to state whether or not a musical
lexical unit has a meaning, and hence to create a
priori a stop-list of musical lexical units that can
be ignored during indexing. It is preferable to approach
the problem by considering how good a discriminator
a particular unit is between different music documents.
For instance, in the case of indexing of melodic
intervals, a lexical unit of two notes that form a major
second is likely to be present in almost all of the
documents, and thus is not a good index term for a
collection of “cantate” of tonal Western music, and
probably for any collection of music documents.
A single major chord is unlikely to be a good
discriminator as well. Depending on the particular
set of features used to index a music collection,
the designer of the indexing and retrieval engine
can make a number of choices about the possible
stop-list of lexical units.
The choice of the particular stop-list to use, if
any, could be driven by both musicological and
computational motivations and by the characteristics
of the music collection itself. A statistical
analysis of the distribution of lexical units across
documents may highlight the potential stop-words.
It has to be noted that this approach is not usually
exploited in the literature on music indexing and
retrieval. The term “stop-list” is quite infrequent in
music retrieval, and the common approach is to select
the parameters carefully so as to avoid computing
lexical units that are believed to be uninformative
about the document content. What is important for
this discussion is to highlight the fact that not all
lexical units are equally informative about the
document content and its differences from other
documents in the collection (conveying this is the aim
of term weighting, described below) and that some
lexical units may be totally uninformative, acting as
a sort of background noise.
Stemming
Many words, though spelled differently, can be
considered variants that stem from a common
morphological root. This is the case of the English
words “music”, “musical” (adjective and noun),
“musicology”, and “musician”; the number of variants
may increase if singular and plural forms are taken
into account, together with gender information (which
does not apply to English but applies to most European
languages) and other possible variants that are
peculiar to some languages. Moreover, in many
languages verbs are conjugated, that is, the form
of the verb varies depending on mood, person,
and tense. Thus a textual document may contain
different word variants, which are identified as
different by lexical analysis but share a similar
meaning. Intuitively, it can be considered that a
textual document could be relevant for a given
information need even if it does not contain the