Audio Analysis Applications for Music (information science)

Introduction

The last decade has seen a revolution in the use of digital audio: The CD, which one decade earlier had taken over the home audio market, is starting to be replaced by electronic media which are distributed over the Internet and stored on computers or portable devices in compressed formats. The need has arisen for software to manage and manipulate the gigabytes of data in these music collections, and with the continual increase in computer speed, memory and disk storage capacity, the development of many previously infeasible applications has become possible.

This article provides a brief review of automatic analysis of digital audio recordings with musical content, a rapidly expanding research area which finds numerous applications. One application area is the field of music information retrieval, where content-based indexing, classification and retrieval of audio data are needed in order to manage multimedia databases and libraries, and are also useful in music retailing and commercial information services. Another application area is music software for the home and studio, where automatic beat tracking and transcription of music are much desired goals. In systematic musicology, audio analysis algorithms are being used in the study of expressive interpretation of music. Other emerging applications which make use of audio analysis are music recommender systems, playlist generators, visualisation systems, and software for automatic synchronisation of audio with other media and/or devices.


We illustrate recent developments with three case studies of systems which analyse specific aspects of music (Dixon, 2004). The first system is BeatRoot (Dixon, 2001a, 2001c), a beat tracking system that finds the temporal location of musical beats in an audio recording, analogous to the way that people tap their feet in time to music. The second system is JTranscriber, an interactive automatic transcription system based on Dixon (2000a, 2000b), which recognises musical notes and converts them into MIDI format, displaying the audio data as a spectrogram with the MIDI data overlaid in piano roll notation, and allowing interactive monitoring and correction of the extracted MIDI data. The third system is the Performance Worm (Dixon, Goebl, & Widmer, 2002), a real-time system for visualising musical expression, which presents a two-dimensional animation of variations in tempo and loudness (Langner & Goebl, 2002, 2003).

Space does not permit the description of the many other music content analysis applications, such as: audio fingerprinting, where recordings can be uniquely identified with a high degree of accuracy, even with poor sound quality and in noisy environments (Wang, 2003); music summarisation, where important parts of songs such as choruses are identified automatically; instrument identification, using machine learning techniques to classify sounds by their source instruments; and melody and bass line extraction, essential components of query-by-example systems, where music databases can be searched by singing or whistling a small part of the desired piece. At the end of the article, we discuss emerging and future trends and research opportunities in audio content analysis.

Background

Early research in musical audio analysis is reviewed by Roads (1996). The problems that received the most attention were pitch detection, spectral analysis and rhythm recognition, areas which correspond respectively to the three most important aspects of music: melody, harmony and rhythm.

Pitch detection is the estimation of the fundamental frequency of a signal, usually assuming it to be monophonic. Methods include: time domain algorithms such as counting of zero-crossings and autocorrelation; frequency domain methods such as Fourier analysis and the phase vocoder; and auditory models which combine time and frequency domain information based on an understanding of human auditory processing. Recent work extends these methods to find the predominant pitch (e.g., the melody note) in a polyphonic mixture (Gomez, Klapuri, & Meudic, 2003; Goto & Hayamizu, 1999).
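
As an illustration of the time domain approach, the following is a minimal sketch of autocorrelation-based pitch estimation for a single monophonic frame. The function name, the search range (fmin, fmax) and the use of an unnormalised autocorrelation are assumptions made for this example; practical pitch detectors add normalisation, interpolation between lags and voicing decisions.

```python
import math

def estimate_pitch_autocorr(samples, sample_rate, fmin=50.0, fmax=1000.0):
    """Estimate the fundamental frequency (Hz) of a monophonic frame.

    Returns None if no lag in the search range has positive autocorrelation
    (for example, for a silent frame).
    """
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)
    best_lag, best_value = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself delayed by `lag` samples.
        value = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    # The lag of the strongest self-similarity is taken as the period.
    return sample_rate / best_lag if best_lag else None

# Usage: a 200 Hz sine sampled at 8 kHz has a period of exactly 40 samples.
sr = 8000
frame = [math.sin(2 * math.pi * 200.0 * i / sr) for i in range(800)]
pitch = estimate_pitch_autocorr(frame, sr)
```

Because the resolution is limited to whole-sample lags, the estimate is coarse at high frequencies; this is one reason frequency domain and auditory-model methods are also used.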

Spectral analysis is a well-understood research area with many algorithms available for analysing various classes of signals, such as the short time Fourier transform, wavelets and other more signal-specific time-frequency distributions. Building upon these methods, the specific application of automatic music transcription has a long research history (Chafe, Jaffe, Kashima, Mont-Reynaud, & Smith, 1985; Dixon, 2000a, 2000b; Kashino, Nakadai, Kinoshita, & Tanaka, 1995; Klapuri, 1998, 2003; Klapuri, Virtanen, & Holm, 2000; Marolt, 1997, 1998, 2001; Martin, 1996; Mont-Reynaud, 1985; Moorer, 1975; Piszczalski & Galler, 1977; Sterian, 1999; Watson, 1985). Certain features are common to many of these systems: producing a time-frequency representation of the signal, finding peaks in the frequency dimension, tracking these peaks over the time dimension to produce a set of partials, and combining the partials to produce a set of notes. The differences between systems are usually related to the assumptions made about the input signal (e.g., the number of simultaneous notes, types of instruments, fastest notes, or musical style), and the means of decision making (e.g., using heuristics, neural nets or probabilistic reasoning).

The problem of extracting rhythmic content from a musical performance, and in particular finding the rate and temporal location of musical beats, has also attracted considerable interest in recent times (Allen & Dannenberg, 1990; Cemgil, Kappen, Desain, & Honing, 2000; Desain, 1993; Desain & Honing, 1989; Dixon, 2001a; Eck, 2000; Goto & Muraoka, 1995, 1999; Large & Kolen, 1994; Longuet-Higgins, 1987; Rosenthal, 1992; Scheirer, 1998; Schloss, 1985). Previous work had concentrated on rhythmic parsing of musical scores, lacking the tempo and timing variations that are characteristic of performed music, but recent tempo and beat tracking systems work quite successfully on a wide range of performed music.

Music performance research is only starting to take advantage of the possibility of audio analysis software, following work such as Scheirer (1995) and Dixon (2000a). Previously, general purpose signal visualisation tools combined with human judgement had been used to extract performance parameters from audio data. The main problem in music signal analysis is the development of algorithms to extract sufficiently high level content, since it requires the type of musical knowledge possessed by a musically literate human listener. Such “musical intelligence” is difficult to encapsulate in rules or algorithms that can be incorporated into computer programs. In the following sections, three systems are presented which take the approach of encoding as much as possible of this intelligence in the software and then presenting the results in an intuitive format which can be edited via a graphical user interface, so that the systems can be used in practical settings even when not 100% correct. This approach has proved to be very successful in performance research (Dixon et al., 2002; Goebl & Dixon, 2001; Widmer, 2002; Widmer, Dixon, Goebl, Pampalk, & Tobudic, 2003).

BeatRoot

Compared with complex cognitive tasks such as playing chess, beat tracking (identifying the basic rhythmic pulse of a piece of music) does not appear to be particularly difficult, as it is performed by people with little or no musical training, who tap their feet, clap their hands or dance in time with music. However, while chess programs compete with world champions, no computer program has yet matched the beat tracking ability of an average musician, although recent systems are approaching this target. In this section, we describe BeatRoot, a system which estimates the rate and times of musical beats in expressively performed music (for a full description, see Dixon, 2001a, 2001c).

BeatRoot models the perception of beat by two interacting processes (see Figure 1): The first finds the rate of the beats (tempo induction), and the second synchronises a pulse sequence with the music (beat tracking). At any time, there may exist multiple hypotheses regarding each of these processes; these are modelled by a multiple agent architecture in which agents representing each hypothesis compete and cooperate in order to find the best solution. The user interface presents a graphical representation of the music and the extracted beats, and allows the user to edit and recalculate results based on the editing. Input to BeatRoot is either digital audio or symbolic music data such as MIDI. This data is processed off-line to detect salient rhythmic events, using an onset detection algorithm which finds peaks in the slope of the amplitude envelope of the signal (or a set of frequency bands of the signal). The timing of these events is then analysed to generate hypotheses of the tempo at various metrical levels.
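
The onset detection step described above can be sketched as follows. This is a simplified illustration, not BeatRoot's implementation: the frame-wise RMS envelope, the frame size and the fixed threshold are assumptions chosen for brevity, and only a single frequency band is used.

```python
import math

def amplitude_envelope(samples, frame_size):
    """RMS amplitude of consecutive non-overlapping frames."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_size]) / frame_size)
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]

def detect_onsets(samples, sample_rate, frame_size=256, threshold=0.05):
    """Return onset times (seconds) at local peaks in the envelope slope."""
    env = amplitude_envelope(samples, frame_size)
    # slope[i] is the rise in amplitude from frame i to frame i + 1.
    slope = [env[i] - env[i - 1] for i in range(1, len(env))]
    onsets = []
    for i, s in enumerate(slope):
        prev_s = slope[i - 1] if i > 0 else 0.0
        next_s = slope[i + 1] if i + 1 < len(slope) else 0.0
        # Keep only sufficiently steep rises that are local maxima.
        if s > threshold and s >= prev_s and s > next_s:
            onsets.append((i + 1) * frame_size / sample_rate)
    return onsets
```

The time resolution of this sketch is one frame (32 ms at the default settings and an 8 kHz sampling rate); overlapping frames would give finer resolution at the cost of more computation.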

Figure 1. System architecture of BeatRoot

First, inter-onset intervals (IOIs), the time differences between pairs of onsets, are calculated, and then a clustering algorithm is used to find groups of similar IOIs which represent the various musical units (e.g., half notes, quarter notes). Information about the clusters is combined by identifying near integer relationships between clusters, in order to produce a ranked list of tempo hypotheses, which is then passed to the beat tracking subsystem.
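
The tempo induction idea can be sketched in a few lines. The greedy one-dimensional clustering, the cluster width, the maximum IOI and the scoring rule below are assumptions made for illustration, and the actual system's clustering and ranking differ in detail.

```python
def cluster_iois(onsets, cluster_width=0.025, max_ioi=2.5):
    """Group inter-onset intervals (seconds) into clusters of similar duration.

    Returns a list of (mean_ioi, count) pairs, ordered by increasing IOI.
    """
    # IOIs are taken between all pairs of onsets, not only adjacent ones.
    iois = sorted(
        b - a
        for i, a in enumerate(onsets)
        for b in onsets[i + 1:]
        if b - a <= max_ioi
    )
    clusters = []  # each entry is [sum_of_iois, count]
    for ioi in iois:
        if clusters and abs(ioi - clusters[-1][0] / clusters[-1][1]) <= cluster_width:
            clusters[-1][0] += ioi
            clusters[-1][1] += 1
        else:
            clusters.append([ioi, 1])
    return [(total / count, count) for total, count in clusters]

def rank_tempo_hypotheses(clusters, tolerance=0.05, max_multiple=8):
    """Rank clusters, rewarding those with other clusters at near-integer multiples."""
    ranked = []
    for mean, count in clusters:
        score = count
        for other_mean, other_count in clusters:
            ratio = other_mean / mean
            multiple = round(ratio)
            if 2 <= multiple <= max_multiple and abs(ratio - multiple) <= tolerance:
                score += other_count
        ranked.append((mean, score))
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Usage: onsets every 0.5 s give a top-ranked beat period of 0.5 s (120 BPM).
onsets = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ranked = rank_tempo_hypotheses(cluster_iois(onsets))
```

Each entry of the ranked list is a candidate beat period in seconds with its score; dividing 60 by the period gives the tempo in beats per minute.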

The beat tracking subsystem uses a multiple agent architecture to find sequences of events which match the various tempo hypotheses, and rates each sequence to determine the most likely sequence of beat times. Each agent represents a specific hypothesis about the rate and the timing of the beats, which is updated as the agent matches the detected onsets to predicted beat times. The agent also evaluates its beat tracking, based on how evenly the beat times are spaced, how many predicted beats correspond to actual events, and the salience of the matched events, which is calculated from the signal amplitude at the time of the onset. At the end of processing, the agent with the highest score outputs its sequence of beats as the solution to the beat tracking problem.
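
A much reduced sketch of this agent-based search is given below. Each "agent" is simply one run of a tracking function from a given tempo hypothesis and starting onset; the matching tolerance, the tempo update factor of 0.2 and the scoring formula are assumptions made for the example, and agents here do not interact, unlike those in the architecture described above.

```python
def track_beats(onsets, saliences, period, start_index, tolerance=0.07):
    """Run one agent from a starting onset; return (score, beat_times)."""
    beats = [onsets[start_index]]
    score = saliences[start_index]
    prediction = beats[-1] + period
    while prediction <= onsets[-1] + tolerance:
        # Find the detected onset nearest to the predicted beat time.
        nearest = min(range(len(onsets)), key=lambda i: abs(onsets[i] - prediction))
        error = onsets[nearest] - prediction
        if abs(error) <= tolerance and onsets[nearest] > beats[-1]:
            # Matched: accept the onset as a beat and nudge the tempo towards it.
            beats.append(onsets[nearest])
            period += 0.2 * error
            # Reward salient onsets, discounted by the timing error.
            score += saliences[nearest] * (1.0 - abs(error) / (2 * tolerance))
        else:
            # No onset near the prediction: interpolate a beat without reward.
            beats.append(prediction)
        prediction = beats[-1] + period
    return score, beats

def best_agent(onsets, saliences, periods, n_starts=3):
    """Try every tempo hypothesis from the first few onsets; keep the best score."""
    agents = [
        track_beats(onsets, saliences, period, start)
        for period in periods
        for start in range(min(n_starts, len(onsets)))
    ]
    return max(agents, key=lambda agent: agent[0])
```

Here `periods` would come from the ranked tempo hypotheses and `saliences` from the onset detector. Note that a score of this kind tends to favour faster metrical levels, which is one reason a practical system needs further evaluation criteria and user correction.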

BeatRoot is written in C++ for Linux, and comprises about 10,000 lines of code, with a graphical user interface consisting of 1,000 lines of Java. The user interface allows playback of the music with the beat times marked by clicks, and provides a graphical display of the signal and the beats with editing functions for correction of errors or selection of alternate metrical levels (Figure 2).

The lack of a standard corpus for testing beat tracking makes objective evaluation of the system difficult. The automatic beat tracking algorithm has been tested on several sets of data: a set of 13 complete piano sonatas, a large collection of solo piano performances of two Beatles songs and a small set of pop songs. In each case, the system found an average of over 90% of the beats (Dixon, 2001a), and compared favourably to another state-of-the-art tempo tracker (Dixon, 2001b). Tempo induction results were almost always correct, so the errors were usually related to the phase of the beat, such as choosing as beats the onsets halfway between the correct beat times.

Figure 2. Screen shot of BeatRoot processing the first five seconds of a Mozart piano sonata, showing the inter-beat intervals in ms (top), calculated beat times (long vertical lines), spectrogram (centre), waveform (below) marked with detected onsets (short vertical lines) and the control panel (bottom)

Presently, BeatRoot is being used in a large scale study of interpretation in piano performance (Widmer, 2002; Widmer et al., 2003), to extract symbolic data from audio CDs for automatic analysis.

JTranscriber

The goal of an automatic music transcription system is to create, from an audio recording, some form of symbolic notation (usually common music notation) representing the piece that was played. For classical music, this should be the same as the score from which the performer played the piece. There are several reasons why this goal can never be fully reached, for example, that there is no one-to-one correspondence between scores and performances, and that masking makes it impossible to measure everything that occurs in a musical performance. Recent attempts at transcription report note detection rates around 90% for solo piano music (Dixon, 2000a; Klapuri, 1998; Marolt, 2001), which is sufficient to be somewhat useful to musicians.

A full transcription system is normally conceptualised in two stages: the signal processing stage, in which the pitch and timing of all notes is detected, producing a symbolic representation (often in MIDI format), and the notation stage, in which the symbolic data is interpreted in musical terms and presented as a score. This second stage involves tasks such as finding the key signature and time signature, following tempo changes, quantising the onset and offset times of the notes, choosing suitable enharmonic spellings for notes, assigning notes to voices in polyphonic passages, and finally laying out the musical symbols on the page. Here, we address only the first stage of the problem, detecting the pitch and timing of all notes, or in more concrete terms, converting audio data to MIDI.
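
For the pitch part of the audio-to-MIDI conversion, a detected fundamental frequency has to be mapped to a MIDI note number. The sketch below assumes twelve-tone equal temperament with A4 tuned to 440 Hz (MIDI note 69) and the common naming convention in which MIDI note 60 is called C4; neither assumption comes from the system described here.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def frequency_to_midi(freq_hz, a4_hz=440.0):
    """Nearest MIDI note number for a frequency, assuming equal temperament."""
    return int(round(69 + 12 * math.log2(freq_hz / a4_hz)))

def midi_to_name(note):
    """Note name with octave number, using the convention that note 60 is C4."""
    return f"{NOTE_NAMES[note % 12]}{note // 12 - 1}"

# Usage: 440 Hz maps to MIDI note 69, named "A4".
name = midi_to_name(frequency_to_midi(440.0))
```

Choosing between enharmonic spellings (for example, C# versus Db) belongs to the notation stage and is not attempted here.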

The data is processed according to Figure 3: The audio data is averaged to a single channel and downsampled to increase processing speed. A short time Fourier transform (STFT) is used to create a time-frequency image of the signal, with the user selecting the type, size and spacing of the windows. Using a technique developed for the phase vocoder (Flanagan & Golden, 1966), a more accurate estimate of the sinusoidal energy in each frequency bin can be calculated from the rate of change of phase in each bin.
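
The phase-based refinement can be illustrated for a single frequency bin. The sketch below is an assumption-laden simplification: it evaluates one DFT bin directly rather than a full STFT, uses a Hann window, and estimates only the frequency of the component in that bin from the phase advance between two frames one hop apart. The function and parameter names are hypothetical.

```python
import cmath
import math

def dft_bin(frame, k):
    """Complex DFT coefficient of bin k for a Hann-windowed frame."""
    n = len(frame)
    return sum(
        x * 0.5 * (1 - math.cos(2 * math.pi * i / n))
        * cmath.exp(-2j * math.pi * k * i / n)
        for i, x in enumerate(frame)
    )

def refine_frequency(samples, start, frame_size, hop, k, sample_rate):
    """Estimate the frequency (Hz) in bin k from its phase advance over one hop."""
    phase1 = cmath.phase(dft_bin(samples[start:start + frame_size], k))
    phase2 = cmath.phase(dft_bin(samples[start + hop:start + hop + frame_size], k))
    # Phase advance expected for a sinusoid exactly at the bin centre.
    expected = 2 * math.pi * k * hop / frame_size
    deviation = (phase2 - phase1) - expected
    # Wrap the deviation into the range [-pi, pi].
    deviation -= 2 * math.pi * round(deviation / (2 * math.pi))
    return (k / frame_size + deviation / (2 * math.pi * hop)) * sample_rate

# Usage: with 512-sample frames at 8 kHz, bin 28 is centred on 437.5 Hz,
# but the phase advance reveals that the tone is closer to 443 Hz.
sr = 8000
tone = [math.sin(2 * math.pi * 443.0 * i / sr + 0.3) for i in range(1024)]
estimate = refine_frequency(tone, 0, 512, 64, 28, sr)
```

The hop must be short enough that the true phase deviation stays within half a cycle; otherwise the wrapped value is ambiguous.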

Figure 3. Data processing steps in JTranscriber

The next step is to calculate the peaks in the magnitude spectrum, and to combine the frequency estimates to give a set of time-frequency atoms, which represent packets of energy localised in time and frequency. These are then combined with the atoms from neighbouring frames (time slices), to create a set of frequency tracks, representing the partials of musical notes. Frequency tracks are assigned to musical notes by estimating the most likely set of fundamental frequencies that would give rise to the observed tracks, and the pitch, onset time, duration and amplitude of each note are estimated from its constituent partials.
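
Peak picking and the linking of peaks into frequency tracks can be sketched as below. The greedy nearest-neighbour linking and the maximum frequency jump of 20 Hz are assumptions for illustration; the assignment of tracks to notes via fundamental frequency estimation is not shown.

```python
def spectral_peaks(magnitudes, bin_hz, threshold):
    """Frequencies (Hz) of local maxima above threshold in one magnitude spectrum."""
    return [
        k * bin_hz
        for k in range(1, len(magnitudes) - 1)
        if magnitudes[k] > threshold
        and magnitudes[k] > magnitudes[k - 1]
        and magnitudes[k] >= magnitudes[k + 1]
    ]

def track_partials(frames_of_peaks, max_jump_hz=20.0):
    """Link peaks in successive frames into frequency tracks.

    Each track is a list of (frame_index, frequency) pairs.
    """
    finished, active = [], []
    for t, peaks in enumerate(frames_of_peaks):
        unused = list(peaks)
        still_active = []
        for track in active:
            last_freq = track[-1][1]
            candidates = [f for f in unused if abs(f - last_freq) <= max_jump_hz]
            if candidates:
                # Continue the track with the closest unclaimed peak.
                nearest = min(candidates, key=lambda f: abs(f - last_freq))
                unused.remove(nearest)
                track.append((t, nearest))
                still_active.append(track)
            else:
                # No continuation in this frame: the track ends.
                finished.append(track)
        # Peaks that continue no existing track start new ones.
        active = still_active + [[(t, f)] for f in unused]
    return finished + active
```

In practice the peak frequencies passed to the tracker would be the refined estimates rather than bin centres, and short tracks would typically be discarded as noise.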

An example of the output is displayed in Figure 4, showing a spectrogram representation of the signal using a logarithmic frequency scale, labelled with the corresponding musical note names, and the transcribed notes superimposed over the spectrogram in piano roll notation. (The piano roll notation is coloured and partially transparent, whereas the spectrogram is black and white, which makes the data easily distinguishable on the screen. In the grey-scale diagram, the coloured notes are difficult to see; here they are surrounded by a solid frame to help identify them.) An interactive editing system allows the user to correct any errors made by the automatic transcription system, and also to assign notes to different voices (different colours) and insert high level musical structure information. It is also possible to listen to the original and reconstructed signals (separately or simultaneously) for comparison.

An earlier version of the transcription system was written in C++; the current version is implemented entirely in Java. The system was tested on a large database of solo piano music consisting of professional performances of 13 Mozart piano sonatas, or around 100,000 notes (Dixon, 2000a), with the result that approximately 10-15% of the notes were missed, and a similar number of the reported notes were false. The most typical errors made by the system are thresholding errors (discarding played notes because they are below the threshold set by the user, or including spurious notes which are above the given threshold) and octave errors (or more generally, where a harmonic of one tone is taken to be the fundamental of another, and vice versa).

The Performance Worm

Skilled musicians communicate high-level information such as musical structure and emotion when they shape the music by the continuous modulation of aspects such as tempo and loudness. That is, artists go beyond what is prescribed in the score, and express their interpretation of the music and their individuality by varying certain musical parameters within acceptable limits. This is referred to as expressive music performance, and is an important part of Western art music, particularly classical music. The Performance Worm (Dixon et al., 2002) is a real-time system for tracking and visualising the tempo and dynamics of a performance in an appealing graphical format which provides insight into the expressive patterns applied by skilled artists. This representation also forms the basis for automatic recognition of performers’ style (Widmer, 2002; Widmer et al., 2003).

Figure 4. Transcription of the opening 10s of the second movement of Mozart’s Piano Sonata K332. The transcribed notes are superimposed over the spectrogram of the audio signal (see text). It is not possible to distinguish fundamental frequencies from harmonics of notes merely by viewing the spectrogram.

The system takes input from the sound card (or from a file), and measures the dynamics and tempo, displaying them as a trajectory in a 2-dimensional performance space (Langner & Goebl, 2002, 2003). The measurement of dynamics is straightforward: It can be calculated directly as the RMS energy expressed in decibels, or, by applying a standard psychoacoustic calculation (Zwicker & Fastl, 1999), the perceived loudness can be computed and expressed in sones. The difficulty lies in creating a tempo tracking system which is robust to timing perturbations yet responsive to changes in tempo. This is performed by an adaptation of the tempo induction subsystem of BeatRoot, modified to work in real time. The major difference is the online IOI clustering algorithm, which continuously outputs a tempo estimate based only on the musical data up to the time of processing. The clustering algorithm finds groups of IOIs of similar duration in the most recent eight seconds of music, and calculates a weighted average IOI representing the tempo for each cluster. The tempo estimates are adjusted to accommodate information from musically-related clusters, and then smoothed over time by matching each cluster with previous tempo hypotheses. Figure 5 shows the development over time of the highest ranked tempo hypothesis with the corresponding dynamics.
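
The simpler of the two dynamics measures, RMS energy in decibels, can be computed as in the following sketch. The reference amplitude of 1.0 (full scale) and the floor of -120 dB for silent frames are assumptions; the psychoacoustic loudness calculation in sones is considerably more involved and is not shown.

```python
import math

def rms_db(frame, reference=1.0, floor_db=-120.0):
    """RMS level of a frame in decibels relative to `reference` amplitude."""
    if not frame:
        return floor_db
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    if rms <= 0.0:
        # Avoid log10(0) for digital silence.
        return floor_db
    return max(floor_db, 20.0 * math.log10(rms / reference))

# Usage: a full-scale sine has an RMS of 1/sqrt(2), about -3.01 dB.
sine = [math.sin(2 * math.pi * i / 100) for i in range(1000)]
level = rms_db(sine)
```

Computing this value for successive short frames gives the dynamics curve that is paired with the tempo estimate to form the trajectory.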

The Performance Worm is implemented in about 4,000 lines of Java, and runs in real time on standard desktop computers. The graphical user interface provides buttons for scaling and translating the axes, selecting the metrical level, setting parameters, loading and saving files, and playing, pausing and stopping the animation.

Apart from the real-time visualisation of performance data, the Worm can also load data from other programs, such as the more accurate beat tracking data produced by BeatRoot. This function enables the accurate comparison of different performers playing the same piece, in order to characterise the individual interpretive style of the performer. Current investigations include the use of AI pattern matching algorithms to learn to recognise performers by the typical trajectories that their playing produces.

Future Trends

Research in music content analysis is progressing rapidly, making it difficult to summarise the various branches of investigation. One major initiative addresses the possibility of interacting with music at the semantic level, which involves the automatic generation of metadata, using machine learning and data mining techniques to discover relationships between low-level features and high-level concepts. Another important trend is the automatic computation of musical similarity for organising and navigating large music collections.

Figure 5. Screen shot of the Performance Worm showing the trajectory to bar 30 of Rachmaninov’s Prelude op.23 no.6 played by Vladimir Ashkenazy. The horizontal axis shows tempo in beats per minute, and the vertical axis shows loudness in sones. The most recent points are largest and darkest; the points shrink and fade into the background as the animation proceeds.

Conclusion

The three systems discussed are research prototypes, whose performance could be improved in several ways, for example, by specialisation to suit music of a particular style or limited complexity, or by the incorporation of high-level knowledge of the piece being analysed.

This is particularly relevant to performance research, where the musical score is usually known. By supplying a beat tracking or performance analysis system with the score, most ambiguities are resolved, giving the possibility of a fully automatic and accurate analysis.

Both dynamic programming and Bayesian approaches have proved successful in score following (e.g., for automatic accompaniment, Raphael, 2001), and it is likely that one of these approaches would also be adequate for our purposes. A more complex alternative would be a learning system which automatically extracts the high-level knowledge required for the system to fine-tune itself to the input data (Dixon, 1996). In any case, the continuing rapid growth in computing power and processing techniques ensures that content-based analysis of music will play an increasingly important role in many areas of human interaction with music.

Key Terms

Automatic Transcription: The process of extracting the musical content from an audio signal and representing it in standard music notation.

Beat Tracking: The process of finding the times of musical beats in an audio signal, including following tempo changes, similar to the way that people tap their feet in time to music.

Clustering Algorithm: An algorithm which sorts data into groups of similar items, where the category boundaries are not known in advance.

Frequency Domain: The representation of a signal as a function of frequency, for example as the sum of sinusoidal waves of different amplitudes and frequencies.

Music Content Analysis: The analysis of an audio signal in terms of higher-level (cognitive) properties such as melody, harmony and rhythm, or in terms of a description of the signal’s component sounds and the sound sources which generated them.

Music Information Retrieval: The research field concerning the automation of access to music information through the use of digital computers.

Onset Detection: The process of finding the start times of notes in an audio signal.

Time Domain: The representation of a signal, such as the amplitude or pressure of a sound wave, as a function of time.
