Basic Concepts of Indexing
Early models and experiments on textual information retrieval date back to the 1970s. Textual information retrieval, which for years has simply been called information retrieval (IR) tout court, has a long history in which many different approaches have been proposed and tested. Since indexing is one of the core elements of an IR system, many approaches have been devised to optimize indexing in terms of computational cost, memory storage, and retrieval effectiveness; most of all, these approaches have been extensively tested and validated experimentally on standard test collections, in particular in the framework of the Text REtrieval Conference (TREC). For this reason, the main ideas underlying textual indexing will be reviewed here, together with possible applications to music indexing.

Textual indexing is based on four main steps, performed in sequence:
1. Lexical analysis
2. Stop-words removal
3. Stemming
4. Term weighting
It has to be noted that existing IR systems may not follow all these steps; in particular, the effectiveness of stemming has often been debated, at least for languages with a simple morphology such as English.
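As a purely illustrative sketch of how the four steps fit together (the stop-word list, the crude suffix-stripping stemmer, and the tf-idf weighting below are simplifying assumptions, standing in for the fuller treatment of each step), a minimal indexing pipeline might look like this:

```python
import math
import re
from collections import Counter

# Illustrative stop-word list; operational systems use much longer ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "for", "on"}

def lexical_analysis(text):
    """Step 1: split the text into candidate index terms."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(terms):
    """Step 2: discard terms that carry little content."""
    return [t for t in terms if t not in STOP_WORDS]

def stem(term):
    """Step 3: crude suffix stripping, standing in for a proper stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def tf_idf(doc_terms, collection):
    """Step 4: weight each term of a document against the whole collection."""
    n_docs = len(collection)
    weights = {}
    for term, freq in Counter(doc_terms).items():
        doc_freq = sum(1 for doc in collection if term in doc)
        weights[term] = freq * math.log(n_docs / doc_freq)
    return weights

def index_document(text):
    return [stem(t) for t in remove_stop_words(lexical_analysis(text))]

documents = [
    "Early models of textual information retrieval",
    "Indexing textual documents for effective retrieval",
]
collection = [index_document(d) for d in documents]
print(tf_idf(collection[0], collection))
```

Terms shared by every document receive a zero weight in this toy collection, which already hints at why weighting matters for retrieval effectiveness.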
Lexical Analysis
The first step of indexing consists in analyzing the content of a document in order to find its candidate index terms. In the case of textual documents, index terms are the words that form the document; thus lexical analysis corresponds to parsing the document to extract its individual words. Lexical analysis is straightforward for European languages, where blanks, commas, and dots are clear separators between two subsequent words. Attention has to be paid to some particular cases: for example, an acronym whose letters are separated by dots has to be considered a single term and not a sequence of one-letter terms. For these languages, the creation of a lexical analyzer can rely on regular expressions and normally poses only implementation issues. For other languages, such as Chinese and Japanese, the written text is not necessarily divided into terms by special characters, and the grouping of ideograms into words has to be inferred from the context. Automatic lexical analysis for these languages is nontrivial and has been an active research area for years.
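As a minimal sketch of such a regular-expression-based analyzer (the pattern and the normalization step are illustrative choices, not prescribed by the text), a dotted acronym can be matched before ordinary words so that it survives as a single term:

```python
import re

# Try dotted acronyms (e.g., "U.S.A.") first, then ordinary alphabetic words.
TOKEN_PATTERN = re.compile(r"(?:[A-Za-z]\.){2,}|[A-Za-z]+")

def tokenize(text):
    """Return candidate index terms, keeping a dotted acronym as one term."""
    tokens = TOKEN_PATTERN.findall(text)
    # Normalize: drop the dots inside acronyms and lower-case everything.
    return [t.replace(".", "").lower() for t in tokens]

print(tokenize("The U.S.A. hosts the Text REtrieval Conference (TREC)."))
# ['the', 'usa', 'hosts', 'the', 'text', 'retrieval', 'conference', 'trec']
```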
Application to the Music Domain

As discussed in the previous section, the first issue in music indexing is the choice of the dimensions to be used as content descriptors. This choice also influences the approach to lexical analysis. For instance, if rhythm is used to index music documents, the attack times of the different notes have to be automatically detected and filtered, which is an easy task for symbolic documents and can be carried out with good results also for documents in audio format. On the other hand, if harmony is used to compute indexes, lexical analysis has to rely on complex techniques for the automatic extraction of chords from a polyphonic music document, which is still an error-prone task, especially in the case of audio documents, even though encouraging results have been obtained (Gómez & Herrera, 2004). The automatic extraction of high-level features from symbolic and audio music formats is a very interesting research area, studied by a very active research community, but it is beyond the aims of this discussion. For simplicity, it is assumed that a sequence of features is already available, describing some high-level characteristics of a music document, related to one or more of its dimensions. It is also assumed that the feature extraction is affected by errors that should be taken into account during the design of the indexing scheme.
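Purely as an illustration of how such extraction errors might be accommodated (the chord labels and the n-gram scheme below are assumptions of this sketch, not a method prescribed by the chapter), a sequence of automatically extracted chords can be turned into overlapping n-gram index terms, so that a single misrecognized chord corrupts only the few terms it takes part in:

```python
from collections import Counter

def chord_ngrams(chords, n=2):
    """Turn an extracted chord sequence into overlapping n-gram index terms."""
    return ["-".join(chords[i:i + n]) for i in range(len(chords) - n + 1)]

# Hypothetical output of a chord-extraction step (possibly containing errors).
extracted = ["C", "Am", "F", "G", "C"]
index_terms = chord_ngrams(extracted, n=2)
print(Counter(index_terms))
# Counter({'C-Am': 1, 'Am-F': 1, 'F-G': 1, 'G-C': 1})
```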