to the microphone (which may also change
over the course of a single input sample).
database which needs to be examined with the
more complex search and matching algorithms.
Our own work with REPRED shows the value of
the independently developed MPEG-7 Melody
Sequence DS representation as a suitable means
of encoding and representing hummed input
queries.
There have been some efforts to assemble test
suites of data as a common resource for MIR
researchers, such as the vocal input query collection
of Unal et al. (2003), the compendium of music
collections maintained by Byrd (2006), and the
audio description contest associated with the 2004
ISMIR symposium (Cano, Gomez, Gouyon,
Herrera, & Koppenberger, 2006; ISMIR, 2004). MIR
systems and their relative performance advantages
can be evaluated more objectively when they are
compared using common criteria, databases or
test input. The MIR community as a whole will
benefit from continuing this trend to cooperate
in the testing and evaluation of disparate systems
and algorithms.
The duration of a typical extraneous note in
a transcription is long enough that it cannot
be distinguished from genuine short notes
as vocalized by the subject.
Beyond these likely sources of transcription
error, errors introduced by the human input
source must also be anticipated.
Pitch and duration contour representations are
not suitable for vocal input due to the relatively
high number of input errors in a typical query.
While this type of encoding affords fast and
efficient search algorithms, query strings must
have a lower error rate for this type of search to
be effective.
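
The trade-off described above can be illustrated with a minimal sketch, assuming melodies are stored as contour strings over the hypothetical symbols U (up), D (down) and S (same). Exact substring search is fast but fails on a single query error, whereas an approximate search tolerates it at greater cost. The function names and sample strings are illustrative assumptions and are not taken from any system discussed here; the approximate search is a standard dynamic-programming edit distance against every substring of the stored string.

```python
def exact_match(query, melody):
    """Fast exact search: True only if the query occurs verbatim."""
    return query in melody


def best_substring_distance(query, melody):
    """Smallest edit distance between the query and any substring of melody.

    Dynamic programming in which a match may start at any position of the
    melody for free; insertions, deletions and substitutions each cost 1.
    """
    # Column for the empty melody prefix: i query symbols must be deleted.
    prev = list(range(len(query) + 1))
    best = prev[-1]
    for symbol in melody:
        cur = [0]  # a match may begin at any melody position at no cost
        for i, q in enumerate(query, start=1):
            cur.append(min(
                prev[i] + 1,                  # skip a melody symbol
                cur[i - 1] + 1,               # skip a query symbol
                prev[i - 1] + (q != symbol),  # match or substitute
            ))
        best = min(best, cur[-1])
        prev = cur
    return best


melody = "UUDSUDDUSU"        # hypothetical stored contour
clean_query = "UDSUDD"       # error-free query
noisy_query = "UDSUUD"       # same query with one contour error

print(exact_match(clean_query, melody))              # True
print(exact_match(noisy_query, melody))              # False
print(best_substring_distance(noisy_query, melody))  # 1
```

The exact search rejects the noisy query outright, while the approximate search still finds the intended passage at a distance of one edit.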
Even in the absence of errors introduced by
the transcription process, the reliability of pitch
contour is dependent upon the user's familiarity
with the phrase to be vocalized. For unfamiliar
melodies, even a simple ternary pitch contour will
likely contain several errors, regardless of the
musical skills of the user. Contour representations
with more than three gradations are unlikely to
correctly capture the intended melody in the vocal
rendition of a nonexpert user.
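
As an illustration of the ternary pitch contour mentioned above, the following sketch encodes a note sequence as one symbol per transition. It assumes pitches are given as MIDI note numbers and uses the symbols U, D and S for up, down and same; the function name, the symbols and the tolerance parameter are assumptions made for this example only.

```python
def ternary_contour(pitches, tolerance=0.0):
    """Encode successive pitch changes as U (up), D (down) or S (same).

    pitches: sequence of pitch values, assumed here to be MIDI note numbers.
    tolerance: changes no larger than this are treated as "same".
    """
    symbols = []
    for prev, cur in zip(pitches, pitches[1:]):
        step = cur - prev
        if step > tolerance:
            symbols.append("U")
        elif step < -tolerance:
            symbols.append("D")
        else:
            symbols.append("S")
    return "".join(symbols)


intended = [60, 62, 64, 62, 60]  # hypothetical target phrase
sung = [60, 62, 61, 62, 60]      # third note sung too low

print(ternary_contour(intended))  # UUDD
print(ternary_contour(sung))      # UDUD
```

In this example a single mis-sung note alters two of the four contour symbols, which suggests how quickly contour errors can accumulate in an unfamiliar melody.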
Rather than using actual note durations, sys-
tems can more accurately represent the rhythm
of the user's input by encoding the times of the
onsets of individual notes instead. Additionally,
users tend to shorten long notes when vocalizing,
and the degree of compression depends on the
musical ability of the user; representing onset
times on a logarithmic scale rather than a linear
scale reduces the negative effects of this phenomenon.
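
One possible reading of the onset-based, logarithmic rhythm representation is sketched below. It assumes note onsets are given in seconds and expresses each inter-onset interval as a base-2 logarithm relative to the first interval; the choice of reference interval, the base and the function name are assumptions made for illustration, not details taken from the systems discussed.

```python
import math


def log_onset_intervals(onsets):
    """Return log2 of each inter-onset interval relative to the first one.

    onsets: strictly increasing note onset times, assumed to be in seconds.
    """
    intervals = [b - a for a, b in zip(onsets, onsets[1:])]
    if not intervals or any(i <= 0 for i in intervals):
        raise ValueError("need at least two strictly increasing onset times")
    reference = intervals[0]  # assumption: the first interval sets the scale
    return [math.log2(i / reference) for i in intervals]


# Two equal intervals followed by one twice as long.
print(log_onset_intervals([0.0, 0.5, 1.0, 2.0]))  # [0.0, 0.0, 1.0]
```

Because only onsets are used, the point at which the singer releases a note does not affect the encoding, and on the logarithmic scale a given proportional shortening of an interval produces the same offset regardless of the interval's absolute length.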
As we have seen, the most recent MIR systems
which include query-by-humming components
still utilize more complex approximate matching
algorithms in order to perform their searches.
Continued improvements in accuracy and in
computational efficiency can be expected by find-
ing new data representations, parallel searching
and pruning techniques to reduce the area of the
Acknowledgment
The author's work reported here was supported in
part by research grants awarded by the National
Science Foundation under contracts EIA-9214887,
EIA-9214892, IIS-9213823, CCR-9527151 and
EIA-9634485. The author gratefully acknowl-
edges Naoko Kosugi of NTT Laboratories for
providing a set of sample input queries from his
group's research studies.
References
Attneave, F., & Olson, R. K. (1971). Pitch as a
medium: A new approach to psychophysical scaling.
American Journal of Psychology, 84, 147-166.
Bainbridge, D., Nevill-Manning, C. G., Witten, I.
H., Smith, L. A. & McNab, R. J. (1999). Towards