1. College of Electrical Engineering, Zhejiang University, Hangzhou 310027, China
2. Institute of Robotics, Zhejiang University, Yuyao 315400, China
E-mail: weizhao_ee@zju.edu.cn;
‡Corresponding authors
Published: 23 July 2022
Published Online: 31 May 2022
Received: 21 October 2021
Accepted: 09 January 2022
WEI ZHAO, LI XU. Efficient decoding self-attention for end-to-end speech synthesis [J]. Frontiers of Information Technology & Electronic Engineering, 2022, 23(7): 1127-1138. DOI: 10.1631/FITEE.2100501.
Self-attention networks are widely used in text-to-speech (TTS) synthesis because of their parallel structure and strong sequence-modeling capability. However, when end-to-end speech synthesis is performed with autoregressive decoding, inference is relatively slow owing to the quadratic complexity in sequence length, and this efficiency problem becomes even more severe when the deployment device has no graphics processing unit (GPU). To address this problem, we propose an efficient decoding self-attention (EDSA) network as an alternative. Through a dynamic programming decoding procedure, TTS model inference is effectively accelerated to linear computational complexity. Experimental results on Mandarin and English datasets show that the proposed EDSA model achieves 720% and 50% higher inference speed on the central processing unit (CPU) and GPU, respectively, with almost identical performance. The method can therefore ease the deployment of such models when GPU resources are limited. In addition, the proposed model may outperform the baseline Transformer TTS on out-of-domain utterances.
Self-attention has been innovatively applied to text-to-speech (TTS) because of its parallel structure and superior strength in modeling sequential data. However, when used in end-to-end speech synthesis with an autoregressive decoding scheme, its inference speed becomes relatively low due to the quadratic complexity in sequence length. This problem becomes particularly severe on devices without graphics processing units (GPUs). To alleviate the dilemma, we propose an efficient decoding self-attention (EDSA) module as an alternative. Combined with a dynamic programming decoding procedure, TTS model inference can be effectively accelerated to have a linear computation complexity. We conduct studies on Mandarin and English datasets and find that our proposed model with EDSA can achieve 720% and 50% higher inference speed on the central processing unit (CPU) and GPU, respectively, with almost the same performance. Thus, this method may make the deployment of such models easier when there are limited GPU resources. In addition, our model may perform better than the baseline Transformer TTS on out-of-domain utterances.
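The abstract does not describe the internals of EDSA or its dynamic programming decoder, so the following Python/NumPy sketch is not the paper's method. It is only a hedged illustration of the complexity gap the abstract refers to: naive causal self-attention recomputes attention over all previous positions at every decoding step (quadratic total cost), whereas a recurrent, kernelized "linear attention"-style decoder (in the spirit of Katharopoulos et al.) carries the history in fixed-size running sums, so total cost grows linearly with sequence length. All function names and the feature map phi here are illustrative assumptions, not quantities defined in the paper.

# Minimal NumPy sketch (not the paper's EDSA): contrasts quadratic-cost naive
# autoregressive self-attention decoding with a linear-cost recurrent variant.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def naive_decode(Q, K, V):
    """Standard causal self-attention: step t attends over all t+1 past
    positions, so the total work grows as O(T^2 * d)."""
    T, d = Q.shape
    out = np.zeros_like(V)
    for t in range(T):
        scores = Q[t] @ K[:t + 1].T / np.sqrt(d)   # O(t) work at step t
        out[t] = softmax(scores) @ V[:t + 1]
    return out

def linear_decode(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """Recurrent decoding with a kernel feature map phi: the running sums
    S and z summarize all history, so each step costs O(d^2) regardless of t,
    giving O(T) total growth in sequence length."""
    T, d = Q.shape
    S = np.zeros((d, d))   # running sum of phi(k_t) v_t^T
    z = np.zeros(d)        # running sum of phi(k_t)
    out = np.zeros_like(V)
    for t in range(T):
        S += np.outer(phi(K[t]), V[t])
        z += phi(K[t])
        q = phi(Q[t])
        out[t] = (q @ S) / (q @ z + 1e-6)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d = 6, 4
    Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
    print(naive_decode(Q, K, V).shape, linear_decode(Q, K, V).shape)

Both routines return a (T, d) array of context vectors; the point of the comparison is the per-step cost, not the numerical outputs, which differ because the kernelized recurrence is not algebraically equivalent to softmax attention.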
Efficient decoding self-attention network for end-to-end speech synthesis
Keywords: Efficient decoding; End-to-end; Self-attention; Speech synthesis