Creating ears for AI: speech enhancement techniques for listening to natural human conversations with distant microphones
Shoko Araki
NTT Communication Science Laboratories, Japan
Plenary Lecture 5 / 11:20-12:10, Oct. 27 (Thu)
Abstract
AI technology continues to develop rapidly, and the use of voice interfaces in real environments has been expanding year by year. This has created a growing demand for speech recognition and communication analysis of everyday multi-speaker conversations in various noisy environments, such as offices, living rooms, and public areas, making speech signal processing, and more specifically speech enhancement technology, increasingly important.
When we capture speech with distant microphones in natural everyday sound environments, such as conversations at meetings, interfering sounds like ambient noise, reverberation, and extraneous speakers' voices are included in the captured signals and degrade the quality of the target speaker's speech. Speech enhancement technologies such as noise reduction, dereverberation, and source separation remove these interfering sounds from the recording and cleanly extract the target speaker's voice; these key technologies make voice interfaces usable in everyday environments. Although speech enhancement for a single speaker in noisy, reverberant environments has achieved high performance, multi-speaker scenarios such as conversational situations remain challenging.
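To make the time-frequency masking idea behind many such noise-reduction methods concrete, here is a minimal, hypothetical sketch (not the speaker's actual system): it assumes the opening of the recording is noise only, estimates a per-frequency noise floor from it, and applies a soft spectral gate in the STFT domain. The function name, window length, and noise-interval assumption are all illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_gate(noisy, fs, noise_seconds=0.5, floor=0.1):
    """Toy single-channel noise reduction via time-frequency masking.

    Assumes the first `noise_seconds` of `noisy` contain noise only;
    this is an illustrative assumption, not a general-purpose design.
    """
    f, t, Y = stft(noisy, fs=fs, nperseg=512)
    n_noise = np.sum(t <= noise_seconds)
    # Per-frequency noise magnitude estimated from the noise-only frames.
    noise_mag = np.mean(np.abs(Y[:, :n_noise]), axis=1, keepdims=True)
    # Soft mask: keep TF bins well above the noise floor, attenuate the rest.
    snr = np.abs(Y) / (noise_mag + 1e-12)
    mask = np.clip(1.0 - 1.0 / (snr + 1e-12), floor, 1.0)
    _, enhanced = istft(mask * Y, fs=fs, nperseg=512)
    return enhanced[: len(noisy)]
```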
In this talk, I first revisit the core component technologies, including multi-channel speech separation and dereverberation, and then show that their joint optimization achieves high performance. I will also introduce some new concepts in speech enhancement, such as selective hearing, which extracts only the speech signal of a single target speaker from a complex mixture of sounds.
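For intuition about the multi-channel side, the following is a hedged sketch of a classical delay-and-sum beamformer for a uniform linear array; the geometry, steering angle, and parameter names are illustrative assumptions, and the separation and dereverberation methods discussed in the talk are considerably more advanced.

```python
import numpy as np
from scipy.signal import stft, istft

def delay_and_sum(mics, fs, angle_deg, spacing=0.05, c=343.0):
    """Classical delay-and-sum beamformer for a uniform linear array.

    mics: array of shape (n_mics, n_samples); angle_deg: assumed target
    direction relative to broadside. Illustrative only.
    """
    n_mics = mics.shape[0]
    f, t, Y = stft(mics, fs=fs, nperseg=512)        # Y: (n_mics, F, T)
    # Per-microphone plane-wave delays for a source at angle_deg.
    delays = np.arange(n_mics) * spacing * np.sin(np.deg2rad(angle_deg)) / c
    # Steering vector re-aligns the phases so the target adds coherently
    # across microphones while diffuse noise averages out.
    steer = np.exp(2j * np.pi * f[None, :] * delays[:, None])  # (n_mics, F)
    aligned = Y * steer[:, :, None]
    _, out = istft(aligned.mean(axis=0), fs=fs, nperseg=512)
    return out
```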
Non-linear signal processing for underwater acoustics: theory and oceanographic applications
Julien Bonnel (ICA Early Career Award Recipient)
Woods Hole Oceanographic Institution, USA
Plenary Lecture 6 / 10:30-11:20, Oct. 28 (Fri)
Abstract
Lobsters, whales and submarines have little in common. Except that they all produce low-frequency sounds, like many other ocean dwellers that use sound for communication, foraging, navigation and other purposes. However, unraveling and using the underwater cacophony is not at all simple. This is particularly true for low-frequency (f < 500 Hz) propagation in coastal waters (water depth D < 200 m), because the environment acts as a dispersive waveguide: the acoustic field is described by a set of modes that propagate with frequency-dependent speeds. In this context, extracting relevant information from acoustic recordings requires understanding the propagation and using physics-based processing. In this presentation, we will show how to analyze low-frequency data recorded on a single hydrophone. We will notably review modal propagation and time-frequency analysis. We will then show how these can be combined into a non-linear signal processing method dedicated to extracting modal information from a single receiver, and how such information can be used to localize sound sources and/or characterize the oceanic environment. The whole method will be illustrated with several experimental examples, including geoacoustic inversion on the New England Mud Patch and baleen whale localization in the Arctic.
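To make the dispersion idea concrete, here is a hedged sketch based on the textbook ideal waveguide (isovelocity water, pressure-release surface, perfectly rigid bottom), not the actual environments treated in the talk: each mode m has a cutoff frequency and a frequency-dependent group speed, so a broadband pulse received at range r spreads into one arrival curve per mode in the time-frequency plane.

```python
import numpy as np

def ideal_waveguide_arrivals(r, D, c=1500.0, modes=4, freqs=None):
    """Modal arrival times t_m(f) = r / v_g,m(f) for an ideal waveguide.

    Textbook model: isovelocity water of depth D, pressure-release surface,
    rigid bottom. Mode m has cutoff f_m = (2m - 1) * c / (4 * D) and group
    speed v_g,m(f) = c * sqrt(1 - (f_m / f)^2). Illustrative only.
    """
    if freqs is None:
        freqs = np.linspace(10.0, 500.0, 200)
    arrivals = {}
    for m in range(1, modes + 1):
        f_cut = (2 * m - 1) * c / (4 * D)
        f = freqs[freqs > f_cut]           # a mode only propagates above cutoff
        v_group = c * np.sqrt(1.0 - (f_cut / f) ** 2)
        arrivals[m] = (f, r / v_group)     # dispersion curve for mode m
    return arrivals

# Example: 10 km range in 50 m of water; near-cutoff frequencies arrive later.
curves = ideal_waveguide_arrivals(r=10e3, D=50.0)
```

Roughly speaking, the non-linear processing referred to in the abstract exploits such a propagation model to warp the time axis so that each dispersed mode becomes nearly a pure tone, which can then be isolated by standard time-frequency filtering; the sketch above only illustrates the underlying dispersion, not the warping itself.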