Cross-Tower Network for Jointly Suppressing Acoustic Echo and Background Noise

Title
Cross-Tower Network for Jointly Suppressing Acoustic Echo and Background Noise
Author
박송규
Advisor(s)
장준혁
Issue Date
2021. 2
Publisher
Hanyang University
Degree
Doctor
Abstract
Acoustic echo and background noise are representative factors that degrade speech quality in applications such as speech communication and speech recognition. To eliminate these interferences, many acoustic echo and background noise reduction algorithms have been studied over the past decades, particularly for the case in which echo and noise coexist. However, when the two interferences are estimated without regard to each other and then processed by separate modules, accurate estimation and removal become difficult because of the correlation between the echo and the noise. In addition, as the number of microphones in smart devices increases, an integrated echo and noise reduction algorithm that exploits spatial information for multi-channel processing is needed. Therefore, this thesis proposes a cross-tower network that suppresses acoustic echo and background noise while each estimate takes the other into account. In the proposed cross-tower network, the acoustic echo and the background noise to be removed are estimated in parallel by separate towers, and both are suppressed together from the microphone input. Conventional echo and noise reduction methods have the disadvantage of estimating echo and noise independently; in the proposed cross-tower network, the echo and noise are estimated repeatedly, and the estimates are shared with adjacent towers so that each estimate is continuously informed by the other. To this end, additional loss functions for the echo and noise estimates, which are intermediate outputs of the cross-tower network, are added to the overall network. This thesis presents two algorithms for jointly suppressing acoustic echo and background noise: 1) a frequency-domain cross-tower network with adversarial training; 2) a time-domain cross-tower network with attention masks.
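The idea of adding loss terms for the intermediate echo and noise estimates can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's actual objective: the mean-squared-error criterion, the function names `mse` and `cross_tower_loss`, and the weighting parameter `aux_weight` are all hypothetical choices made for this example.

```python
def mse(estimate, reference):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(estimate, reference)) / len(estimate)


def cross_tower_loss(speech_est, speech_ref,
                     echo_ests, echo_ref,
                     noise_ests, noise_ref,
                     aux_weight=0.5):
    """Main loss on the final output plus auxiliary losses on the
    intermediate echo and noise estimates of every stage.

    echo_ests / noise_ests are lists holding one estimate per stage.
    aux_weight is a hypothetical balancing factor (an assumption).
    """
    main = mse(speech_est, speech_ref)
    aux = sum(mse(e, echo_ref) for e in echo_ests)
    aux += sum(mse(n, noise_ref) for n in noise_ests)
    return main + aux_weight * aux


# Toy usage with one intermediate stage per tower.
loss = cross_tower_loss(
    speech_est=[0.0, 0.0], speech_ref=[0.0, 0.0],
    echo_ests=[[1.0, 1.0]], echo_ref=[0.0, 0.0],
    noise_ests=[[0.0, 0.0]], noise_ref=[0.0, 0.0],
)
print(loss)
```

Supervising every intermediate estimate, rather than only the final output, is what would let each tower receive a direct training signal for the interference it is meant to estimate.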
The proposed frequency-domain cross-tower network is constructed from fully-connected layers, and the log-power spectra (LPS) of the microphone input and the far-end signal are used as the first input of the network. The echo and noise LPS estimated by each tower are then concatenated as intermediate input features. To satisfy the assumption that speech, echo, and noise are uncorrelated, adversarial training is used so that each tower of the cross-tower network focuses on the component it is meant to estimate. The proposed time-domain cross-tower network is designed as an end-to-end structure by modifying Conv-TasNet, which was recently introduced as a state-of-the-art algorithm in speech separation. Unlike the conventional frequency-domain methods, the intermediate input feature is created without concatenation, by multiplying the latent features of the estimated echo or noise to emphasize the part that should be estimated and subtracting the latent features that do not need to be considered. To make learning more effective and improve performance by extracting only the relevant elements of the two latent features, attention masks are applied by adding attention mechanisms to the encoder and decoder. Also, unlike the proposed frequency-domain cross-tower network, the proposed time-domain cross-tower network is extended to multi-channel processing that jointly suppresses echo and noise using spatial information learned through compressed spatial cue blocks. The proposed cross-tower network outperforms conventional methods. In particular, on the evaluation DB sets the time-domain cross-tower network improved PESQ by 0.11, ERLE by 9.26, STOI by 0.1, and SDR by 3.1 compared to the latest conventional algorithm, the convolutional recurrent network (CRN). When the time-domain cross-tower network is extended to multi-channel processing, the improvement is even greater.
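Two quantities named above, the log-power spectrum (LPS) input feature and the echo return loss enhancement (ERLE) metric, can be sketched as follows. This is a minimal illustration, not the thesis's processing chain: the abstract does not give the frame length, window, or exact ERLE definition, so the unwindowed naive DFT, the natural logarithm, the `eps` floor, and the function names are assumptions. ERLE is written here in the commonly used form of a power ratio in dB between the microphone signal and the residual.

```python
import cmath
import math


def log_power_spectrum(frame, eps=1e-12):
    """LPS of one real-valued frame via a naive DFT (non-negative bins only).

    No analysis window is applied; a practical front end would typically
    use one. eps avoids log(0) and is an assumed constant.
    """
    n = len(frame)
    lps = []
    for k in range(n // 2 + 1):
        x_k = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        lps.append(math.log(abs(x_k) ** 2 + eps))
    return lps


def erle_db(mic, residual, eps=1e-12):
    """ERLE in dB: ratio of microphone power to residual power.

    Assumes both signals cover an echo-only (far-end single-talk) segment.
    """
    p_mic = sum(x * x for x in mic) / len(mic)
    p_res = sum(x * x for x in residual) / len(residual)
    return 10.0 * math.log10((p_mic + eps) / (p_res + eps))


# Toy usage: a unit impulse has a flat spectrum, and attenuating the
# signal amplitude by a factor of 10 corresponds to about 20 dB of ERLE.
print(log_power_spectrum([1.0, 0.0, 0.0, 0.0]))
print(erle_db([1.0] * 4, [0.1] * 4))
```

In a frequency-domain setup such as the one described, the LPS of the microphone and far-end frames would be concatenated to form the first network input.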
The proposed cross-tower network can be used as a pre-processor for improving voice quality in environments where echo and noise coexist.
URI
https://repository.hanyang.ac.kr/handle/20.500.11754/159354
http://hanyang.dcollection.net/common/orgView/200000485416
Appears in Collections:
GRADUATE SCHOOL[S](대학원) > ELECTRONICS AND COMPUTER ENGINEERING(전자컴퓨터통신공학과) > Theses (Ph.D.)
Files in This Item:
There are no files associated with this item.
