Research of Daniele Salvati

Audio and Acoustic Signal Processing - Computer Audition


Audio and Acoustic Signal Processing

Audio and acoustic signal processing concerns the study, modeling, and manipulation of audio and acoustic signals. The field deals with recording and playback, data compression, filtering, enhancement, and recognition. It involves tasks such as audio modeling, coding, and transmission; spatial audio recording and reproduction; signal enhancement; source separation; system identification and dereverberation; echo reduction; acoustic sensor array processing; auditory modeling; detection and classification of acoustic scenes and events; analysis and synthesis of acoustic environments; analysis, processing, and synthesis of musical signals; music information retrieval; bioacoustics and medical acoustics; and audio security.

Computer Audition

Computer audition is a subfield of audio and acoustic signal processing that deals with the automated analysis, understanding, and interpretation of auditory information. It is an interdisciplinary field that combines principles from signal processing, machine learning, and psychoacoustics. Computer audition has practical applications in acoustic scene analysis, music analysis, human-computer interaction, and surveillance.

Acoustic Scene Analysis

Acoustic scene analysis refers to the process of automatically analyzing and understanding the acoustic environment. The principal tasks of an acoustic scene analysis system are: acoustic event detection (when), acoustic source localization (where), acoustic event classification (what), and acoustic scene classification.

Acoustic Source Localization

The aim of an acoustic source localization system is to estimate the position of acoustic sources in space by analyzing the sound field with a microphone array, i.e., a set of microphones arranged to capture the spatial information of sound. Spatial localization of acoustic sources has received considerable attention, and baseline techniques are now available that offer good performance in a wide range of real-world conditions, including indoor/outdoor scenarios, reverberant and noisy environments, and near-field/far-field monitoring.

Today, acoustic source localization methods can be broadly divided into two classes: TDOA-based indirect methods and direct methods. Indirect methods first estimate the time differences of arrival of the acoustic wavefront between microphone pairs, and then derive the source position from geometric considerations. The generalized cross-correlation (GCC) is considered the baseline practical method for TDOA estimation, although improved variants are often used in practice. The multichannel cross-correlation coefficient (MCCC), for example, combines the TDOA estimates obtained with the GCC with a prediction of the spatial error to provide a more robust estimate of the source position.

Direct methods, on the other hand, estimate the position of an acoustic source in a single step by exploiting a power density function that represents the spatially relevant information distribution; they are generally considered more robust than TDOA-based methods under noisy and reverberant conditions. Steered response power (SRP) localization computes the output power of a beamformer steered towards each candidate position of interest. The conventional SRP uses the delay-and-sum beamformer, which synchronizes the array signals to steer the array in a given direction and sums them to estimate the power of the spatial filter. The SRP phase transform (SRP-PHAT) is a widely used filtered SRP beamformer: the PHAT filter assigns equal importance to each frequency by dividing the cross-power spectrum by its magnitude. The SRP-PHAT can be efficiently computed with the global coherent field (GCF) approach, which coherently sums the GCC-PHAT functions of the microphone pairs for each candidate point of interest. Among data-dependent beamformers, the minimum variance distortionless response (MVDR) filter is a well-known method that provides better resolution than the conventional beamformer.
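As an illustration of the GCC-PHAT idea described above, the following minimal sketch (Python with NumPy; the function name, signal lengths, and delay are chosen here purely for illustration) whitens the cross-power spectrum of two microphone signals and picks the lag of the resulting correlation peak:

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """Estimate the TDOA between two microphone signals via GCC-PHAT.

    The cross-power spectrum is whitened by its magnitude (the PHAT
    weighting), and the TDOA is the lag that maximizes the resulting
    cross-correlation. With this correlation ordering, a delayed copy
    of x1 in x2 yields a negative lag.
    """
    n = len(x1) + len(x2)  # zero-pad to avoid circular-correlation artifacts
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cps = X1 * np.conj(X2)
    cps /= np.abs(cps) + 1e-12  # PHAT: equal weight to every frequency
    cc = np.fft.irfft(cps, n=n)
    max_shift = n // 2
    if max_tau is not None:  # restrict to physically plausible lags
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs  # TDOA in seconds

# Example: the same noise burst delayed by 5 samples on the second channel
rng = np.random.default_rng(0)
fs = 16000
sig = rng.standard_normal(1024)
delay = 5
x1 = sig
x2 = np.concatenate((np.zeros(delay), sig[:-delay]))
tau = gcc_phat(x1, x2, fs)  # |tau| * fs is close to 5 samples
```

In practice `max_tau` would be bounded by the microphone spacing divided by the speed of sound, since larger delays are physically impossible for a given pair.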
Yet another class of high-resolution methods is based on subspace analysis and decomposition. The multiple signal classification (MUSIC) method exploits the orthogonality between the signal and noise subspaces to build a spatial spectrum and localize the source DOAs. The estimation of signal parameters via rotational invariance techniques (ESPRIT) is also based on subspace decomposition, exploiting the rotational invariance of the array geometry. The diagonal unloading beamforming (DUB) achieves high resolution with low complexity by subtracting a suitable diagonal matrix from the covariance matrix, removing as much as possible of the signal subspace from it.
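A minimal narrowband MUSIC sketch for a uniform linear array can make the subspace idea concrete. Everything below (array size, half-wavelength spacing, a single source at 20 degrees, the noise level) is an illustrative assumption, not a prescribed setup:

```python
import numpy as np

def music_spectrum(R, n_src, d, wavelength, angles):
    """MUSIC pseudo-spectrum for a uniform linear array.

    R: spatial covariance matrix (M x M), n_src: assumed number of
    sources, d: inter-microphone spacing, angles: candidate DOAs in
    radians. Peaks of the returned spectrum indicate source directions.
    """
    M = R.shape[0]
    # Eigendecomposition: the smallest eigenvalues span the noise subspace
    w, V = np.linalg.eigh(R)
    En = V[:, : M - n_src]  # noise-subspace eigenvectors
    m = np.arange(M)
    P = np.empty(len(angles))
    for i, th in enumerate(angles):
        # Steering vector of the ULA towards angle th
        a = np.exp(-2j * np.pi * d * m * np.sin(th) / wavelength)
        # Orthogonality: projection onto the noise subspace vanishes at a DOA
        P[i] = 1.0 / np.real(a.conj() @ En @ En.conj().T @ a)
    return P

# Simulate one narrowband source at 20 degrees on an 8-mic ULA
rng = np.random.default_rng(0)
M, d, wl = 8, 0.5, 1.0  # half-wavelength spacing (normalized units)
theta_true = np.deg2rad(20.0)
m = np.arange(M)
a_true = np.exp(-2j * np.pi * d * m * np.sin(theta_true) / wl)
snapshots = 200
s = rng.standard_normal(snapshots) + 1j * rng.standard_normal(snapshots)
noise = 0.1 * (rng.standard_normal((M, snapshots))
               + 1j * rng.standard_normal((M, snapshots)))
X = np.outer(a_true, s) + noise
R = X @ X.conj().T / snapshots  # sample spatial covariance

angles = np.deg2rad(np.linspace(-90, 90, 361))
P = music_spectrum(R, 1, d, wl, angles)
est = np.rad2deg(angles[np.argmax(P)])  # close to 20 degrees
```

The sharp peak comes from the near-orthogonality of the true steering vector to the noise subspace; this is also what gives MUSIC its resolution advantage over conventional beamforming scans.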