
Full metadata record

DC Field | Value | Language
dc.contributor.author | 장준혁 (Chang, Joon-Hyuk) | -
dc.date.accessioned | 2022-10-19T05:26:35Z | -
dc.date.available | 2022-10-19T05:26:35Z | -
dc.date.issued | 2021-01 | -
dc.identifier.citation | IEEE ACCESS, v. 9, pp. 8954-8960 | en_US
dc.identifier.issn | 2169-3536 | en_US
dc.identifier.uri | https://ieeexplore.ieee.org/document/9312676/ | en_US
dc.identifier.uri | https://repository.hanyang.ac.kr/handle/20.500.11754/175538 | -
dc.description.abstract | Speech synthesis has developed to the point where attention-based end-to-end text-to-speech (TTS) models synthesize natural, human-level speech. However, it is difficult to form stable attention when synthesizing text longer than the trained length, or document-level text. In this paper, we propose a neural speech synthesis model that can synthesize more than 5 min of speech at once using training data comprising short speech clips of less than 10 s. The model can be used for tasks that must synthesize document-level speech in one pass, such as a singing voice synthesis (SVS) system or a book reading system. First, through curriculum learning, our model automatically increases the length of the speech trained at each epoch while reducing the batch size, so that long sentences can be trained within a limited graphics processing unit (GPU) capacity. During synthesis, document-level text is synthesized by attending only to the contexts needed at the current time step and masking the rest through an attention-masking mechanism. A Tacotron2-based speech synthesis model and a duration predictor were used in the experiments, and the results showed that the proposed method synthesizes document-level speech with markedly lower character and attention error rates and higher quality than the existing model. | en_US
dc.description.sponsorship | This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean Government through the Ministry of Science and ICT (MSIT) (Deep learning multi-speaker prosody and emotion cloning technology based on a high-quality end-to-end model using a small amount of data) under Grant 2020-0-00059. | en_US
dc.language.iso | en | en_US
dc.publisher | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC | en_US
dc.subject | Speech synthesis; document-level neural TTS; curriculum learning; attention masking; Tacotron2; MelGAN; DeepVoice3; ParaNet; MultiSpeech | en_US
dc.title | Document-Level Neural TTS Using Curriculum Learning and Attention Masking | en_US
dc.type | Article | en_US
dc.relation.volume | 9 | -
dc.identifier.doi | 10.1109/ACCESS.2020.3049073 | en_US
dc.relation.page | 8954-8960 | -
dc.relation.journal | IEEE ACCESS | -
dc.contributor.googleauthor | Hwang, Sung-Woong | -
dc.contributor.googleauthor | Chang, Joon-Hyuk | -
dc.relation.code | 2021000011 | -
dc.sector.campus | S | -
dc.sector.daehak | COLLEGE OF ENGINEERING[S] | -
dc.sector.department | SCHOOL OF ELECTRONIC ENGINEERING | -
dc.identifier.pid | jchang | -
dc.identifier.orcid | https://orcid.org/0000-0003-2610-2323 | -
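
The abstract describes a curriculum that grows the admitted utterance length each epoch while shrinking the batch size to stay within GPU memory. The sketch below is a minimal illustration of that idea only; the class name, the linear schedule, and all parameter values are assumptions for illustration, not the paper's published code.

# Hypothetical sketch of the length/batch-size curriculum from the abstract:
# each epoch admits longer training utterances while the batch size shrinks
# so the total audio per batch stays roughly constant on one GPU.

from dataclasses import dataclass

@dataclass
class CurriculumSchedule:
    start_len_s: float = 2.0   # max utterance length admitted at epoch 0
    end_len_s: float = 10.0    # full clip length (training data is < 10 s)
    start_batch: int = 64      # batch size while clips are short
    ramp_epochs: int = 20      # epochs over which the length grows

    def max_len(self, epoch: int) -> float:
        """Linearly grow the admitted utterance length with the epoch."""
        t = min(epoch, self.ramp_epochs) / self.ramp_epochs
        return self.start_len_s + t * (self.end_len_s - self.start_len_s)

    def batch_size(self, epoch: int) -> int:
        """Shrink the batch so seconds-per-batch stays roughly constant."""
        budget = self.start_batch * self.start_len_s  # seconds per batch
        return max(1, int(budget // self.max_len(epoch)))

sched = CurriculumSchedule()
for epoch in (0, 10, 20):
    print(epoch, round(sched.max_len(epoch), 1), sched.batch_size(epoch))
# epoch 0: 2.0 s clips, batch 64; epoch 10: 6.0 s, batch 21; epoch 20: 10.0 s, batch 12

A data loader would filter or bucket utterances by sched.max_len(epoch) and rebuild its batches each epoch; keeping seconds-per-batch constant is one plausible way to honor the fixed GPU budget the abstract mentions.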
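The abstract's second ingredient is inference-time attention masking: at each decoder step, only the encoder contexts needed at the current time step are visible, and the rest of the document is masked. The following is a sketch of one windowed variant of that idea, assuming PyTorch; the window logic, function name, and focus-tracking rule are illustrative assumptions, not the paper's exact mechanism.

# Hypothetical sketch of inference-time attention masking: at each decoder
# step only a window of encoder states around the current focus is visible;
# everything else is masked out before the softmax.

import torch
import torch.nn.functional as F

def masked_attention(query, keys, values, focus, window=30):
    """One decoder step of windowed (masked) content-based attention.

    query:  (B, d)     current decoder state
    keys:   (B, T, d)  encoder states for the whole document
    values: (B, T, d)  usually the same encoder states
    focus:  (B,)       encoder index each stream currently attends to
    """
    B, T, _ = keys.shape
    scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)  # (B, T)

    # Mask every position outside [focus - window, focus + window].
    pos = torch.arange(T, device=keys.device).expand(B, T)
    visible = (pos - focus.unsqueeze(1)).abs() <= window
    scores = scores.masked_fill(~visible, float("-inf"))

    weights = F.softmax(scores, dim=1)                        # (B, T)
    context = torch.bmm(weights.unsqueeze(1), values).squeeze(1)
    # Advance the focus to the attention peak for the next step.
    new_focus = weights.argmax(dim=1)
    return context, weights, new_focus

Because the softmax only ever sees a bounded neighborhood of the text, attention cannot collapse or skip across a 5-minute document the way an unmasked attention trained on 10 s clips can, which is the failure mode the abstract targets.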



Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
