Neuromorphic Architectures for Real-Time Speech Processing in Noisy Environments
Abstract
Processing speech in acoustically cluttered environments (the cocktail party problem) remains an extremely difficult task for artificial auditory systems. Human listeners excel at isolating and parsing speech using biologically efficient neural networks, whereas conventional digital signal processing systems suffer from latency, high energy dissipation, and poor performance in multi-talker or noisy conditions. To overcome these shortcomings, neuromorphic computing has emerged as a promising approach that emulates the structure and function of the biological auditory pathway. This paper proposes the Attention-Gated Spiking-FullSubNet (AGS-FSN), a framework that integrates a biologically inspired cochlear front-end, attention-gated spiking neural units, and attention-directed preservation of temporal fidelity. The cochlear unit performs frequency decomposition with a neuromorphic filter bank that retains the phase and timing representations crucial to sound localization and separation. The gated spiking neurons follow an event-driven paradigm, which reduces computational cost and makes the system highly energy-efficient. A FullSubNet-like attention mechanism selectively amplifies the relevant parts of the target speech while pushing interfering background speech aside. Extensive experiments on noisy real-world datasets show that AGS-FSN achieves state-of-the-art speech enhancement and recognition performance, surpassing conventional deep learning models by a large margin in both accuracy and power consumption. The design targets neuromorphic hardware such as field-programmable gate arrays and event-based silicon processors, offering a practical path to low-power, real-time auditory edge AI.
AGS-FSN represents a step toward robust, scalable, and biologically inspired speech processing across diverse acoustic conditions.
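
To make the idea of a gated, event-driven spiking unit concrete, the following is a minimal sketch of a leaky integrate-and-fire neuron whose input is scaled by a gate signal. It is an illustration only, not the AGS-FSN implementation: the function name `gated_lif`, the multiplicative form of the gate, and the parameters `tau`, `v_thresh`, `v_reset`, and `dt` are all assumptions introduced here.

```python
import numpy as np


def gated_lif(inputs, gate, tau=20.0, v_thresh=1.0, v_reset=0.0, dt=1.0):
    """Simulate one leaky integrate-and-fire neuron with a multiplicative gate.

    inputs : 1-D array of input current, one value per time step
    gate   : 1-D array in [0, 1]; 0 blocks the input, 1 passes it unchanged
    Returns a 0/1 spike train with the same length as `inputs`.

    All names and default values are hypothetical, chosen for illustration.
    """
    inputs = np.asarray(inputs, dtype=float)
    gate = np.asarray(gate, dtype=float)
    decay = np.exp(-dt / tau)  # fraction of membrane potential kept per step
    v = 0.0
    spikes = np.zeros(inputs.shape[0], dtype=np.int8)
    for t in range(inputs.shape[0]):
        # Leak the membrane potential, then integrate the gated input.
        v = decay * v + gate[t] * inputs[t]
        if v >= v_thresh:
            spikes[t] = 1  # emit a spike event
            v = v_reset    # reset the membrane after firing
    return spikes


if __name__ == "__main__":
    x = np.full(100, 0.3)                             # constant drive
    g = np.concatenate([np.ones(50), np.zeros(50)])   # gate open, then closed
    s = gated_lif(x, g)
    print("spikes while gate open:  ", int(s[:50].sum()))
    print("spikes while gate closed:", int(s[50:].sum()))
```

With the gate closed the neuron receives no input and emits no spikes, so downstream units have nothing to process during those steps. This is the property that an event-driven design would exploit to save computation, under the assumptions stated above.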