Research of Daniele Salvati

Audio and Acoustic Signal Processing - Computer Audition


Audio and Acoustic Signal Processing

Audio and acoustic signal processing involves the study, modeling, and manipulation of audio and acoustic signals. This field includes recording and playback, data compression, filtering, enhancement, and recognition. Key tasks in audio and acoustic signal processing are:

Audio Signal Processing in the 21st Century

Computer Audition

Computer Audition is a subfield of Audio and Acoustic Signal Processing focused on the automated analysis, understanding, and interpretation of auditory information. This interdisciplinary area combines principles from signal processing, machine learning, and psychoacoustics. Practical applications of computer audition include acoustic scene analysis, music analysis, human-computer interaction, and surveillance.

Acoustic Scene Analysis

Acoustic Scene Analysis involves the automatic analysis and understanding of the acoustic environment. The main tasks of an acoustic scene analysis system are acoustic event detection (when), acoustic source localization (where), acoustic event classification (who), and acoustic scene classification.

Acoustic Source Localization

The aim of an acoustic source localization system is to estimate the position of acoustic sources in space by analyzing the sound field with a microphone array, which is a set of microphones arranged to capture the spatial information of sound. Spatial localization of acoustic sources has received considerable attention, and baseline techniques are now available that perform well in a wide range of real-world conditions, including indoor and outdoor scenarios, reverberant and noisy environments, and near-field and far-field monitoring.

Today, acoustic source localization methods can be broadly divided into two classes: indirect methods based on the time difference of arrival (TDOA), and direct methods. Indirect methods first estimate the time difference between the arrivals of the acoustic wavefront at pairs of microphones, and then compute the source position from these TDOAs using geometric considerations. The generalized cross-correlation (GCC) is considered the baseline practical method for TDOA estimation, although improved versions are often used in practice. The multichannel cross-correlation coefficient (MCCC), for example, combines TDOA estimates obtained with the GCC with a spatial prediction error to provide a more robust estimate of the source position.
Direct methods, on the other hand, estimate the source position in a single step by evaluating a spatial power function over the candidate positions, and they are generally considered more robust in noisy and reverberant conditions than TDOA-based methods. Steered response power (SRP) localization computes the output power of a beamformer steered towards each target position of interest. The conventional SRP uses the delay-and-sum beamformer, which time-aligns the array signals to steer the array in a given direction and then sums them to estimate the output power of the spatial filter. The SRP phase transform (SRP-PHAT) is a widely used filtered version of SRP beamforming. The PHAT filter gives equal weight to every frequency by dividing the cross-spectrum by its magnitude. The SRP-PHAT can be computed efficiently with the global coherent field (GCF) approach, which coherently sums the GCC-PHAT functions of the microphone pairs at each candidate point. The minimum variance distortionless response (MVDR) filter is a well-known data-dependent beamformer that provides better resolution than the conventional delay-and-sum beamformer.
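
The TDOA estimation step for one microphone pair can be illustrated with a minimal GCC-PHAT sketch in Python with NumPy. It is only an illustration under stated assumptions, not the implementation used in the research described here: the function name gcc_phat, its parameters, and the small regularization constant are hypothetical choices, and the sign convention (a positive delay means the first signal arrives later) is an assumption of this sketch.

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """Estimate the delay (in seconds) of x1 relative to x2 with GCC-PHAT.

    Hypothetical helper for illustration. A positive result means that
    x1 is a delayed copy of x2.
    """
    # Zero-pad so that the circular correlation behaves like a linear one.
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)

    # Cross-spectrum, then PHAT weighting: divide by the magnitude so that
    # only the phase of each frequency bin is kept.
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12  # small constant avoids division by zero

    # Back to the lag domain.
    cc = np.fft.irfft(cross, n=n)

    # Restrict the search to physically plausible lags if requested.
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)

    # Reorder so that index max_shift corresponds to zero lag.
    cc = np.concatenate((cc[n - max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(cc)) - max_shift
    return lag / fs
```

With a microphone spacing d and speed of sound c, max_tau would typically be set to d / c, since larger delays cannot be produced by a single propagating wavefront.
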
Yet another class of high-resolution methods is based on subspace analysis and decomposition. The multiple signal classification (MUSIC) method exploits the orthogonality between the signal and noise subspaces to build a spatial spectrum and estimate the directions of arrival (DOAs) of the sources. The estimation of signal parameters via rotational invariance techniques (ESPRIT) is also based on subspace decomposition and exploits the rotational invariance of the signal subspace. The diagonal unloading beamforming (DUB) provides high resolution at low complexity by subtracting a suitable diagonal matrix from the covariance matrix, which removes as much of the signal subspace as possible from the covariance matrix.
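
The subspace orthogonality used by MUSIC can be sketched for the simplest case: a narrowband signal, a uniform linear array, and far-field sources. The function name music_spectrum, its arguments, and the steering vector convention below are assumptions made for this illustration, and the number of sources is assumed to be known in advance.

```python
import numpy as np

def music_spectrum(X, n_sources, spacing, wavelength, angles_deg):
    """MUSIC pseudo-spectrum for a uniform linear array (illustrative sketch).

    X          : complex array of shape (n_mics, n_snapshots), narrowband data
    n_sources  : assumed number of sources
    spacing    : distance between adjacent microphones
    wavelength : wavelength of the narrowband signal (same unit as spacing)
    angles_deg : candidate DOAs in degrees, measured from broadside
    """
    n_mics, n_snap = X.shape

    # Sample spatial covariance matrix.
    R = X @ X.conj().T / n_snap

    # eigh returns eigenvalues in ascending order, so the first
    # n_mics - n_sources eigenvectors span the noise subspace.
    _, eigvecs = np.linalg.eigh(R)
    En = eigvecs[:, : n_mics - n_sources]

    # Far-field steering vectors for every candidate angle.
    m = np.arange(n_mics)[:, None]
    theta = np.deg2rad(np.asarray(angles_deg, dtype=float))[None, :]
    A = np.exp(-2j * np.pi * spacing / wavelength * m * np.sin(theta))

    # Steering vectors of true sources are (nearly) orthogonal to the noise
    # subspace, so the projection is small and its inverse peaks there.
    proj = np.sum(np.abs(En.conj().T @ A) ** 2, axis=0)
    return 1.0 / (proj + 1e-12)
```

The DOA estimates are then taken as the n_sources largest peaks of the returned pseudo-spectrum over the scanned angles.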