1. College of Electrical Engineering, Zhejiang University, Hangzhou 310027, China
2. Institute of Robotics, Zhejiang University, Yuyao 315400, China
E-mail: weizhao_ee@zju.edu.cn;
‡Corresponding authors
Published: 23 July 2022
Published Online: 31 May 2022
Received: 21 October 2021
Accepted: 09 January 2022
WEI ZHAO, LI XU. Efficient decoding self-attention for end-to-end speech synthesis [J]. Frontiers of Information Technology & Electronic Engineering, 2022, 23(7): 1127-1138. DOI: 10.1631/FITEE.2100501.
Self-attention networks are widely used in text-to-speech (TTS) synthesis because of their parallel structure and strong sequence-modeling capability. However, when end-to-end speech synthesis is performed with autoregressive decoding, inference is relatively slow owing to the quadratic complexity in sequence length, and this efficiency problem becomes even more severe when the deployment device has no graphics processing unit (GPU). To address this problem, we propose an efficient decoding self-attention (EDSA) network as an alternative. Through a dynamic programming decoding procedure, TTS model inference is effectively accelerated to linear computational complexity. Experimental results on Mandarin and English datasets show that the proposed EDSA model achieves 720% and 50% higher inference speed on the central processing unit (CPU) and GPU, respectively, with almost identical performance. The method can therefore ease the deployment of such models when GPU resources are limited. In addition, the proposed model may outperform the baseline Transformer TTS on out-of-domain utterances.
Self-attention has been innovatively applied to text-to-speech (TTS) because of its parallel structure and superior strength in modeling sequential data. However, when used in end-to-end speech synthesis with an autoregressive decoding scheme, its inference speed becomes relatively low due to the quadratic complexity in sequence length. This problem becomes particularly severe on devices without graphics processing units (GPUs). To alleviate the dilemma, we propose an efficient decoding self-attention (EDSA) module as an alternative. Combined with a dynamic programming decoding procedure, TTS model inference can be effectively accelerated to have a linear computation complexity. We conduct studies on Mandarin and English datasets and find that our proposed model with EDSA can achieve 720% and 50% higher inference speed on the central processing unit (CPU) and GPU, respectively, with almost the same performance. Thus, this method may make the deployment of such models easier when there are limited GPU resources. In addition, our model may perform better than the baseline Transformer TTS on out-of-domain utterances.
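The abstract does not describe the internals of EDSA or its dynamic programming decoder, so the following Python/NumPy sketch is not the paper's method. It is only a hedged illustration of the complexity gap the abstract refers to: naive causal self-attention recomputes attention over all previous positions at every decoding step (quadratic total cost), whereas a recurrent, kernelized "linear attention"-style decoder (in the spirit of Katharopoulos et al.) carries the history in fixed-size running sums, so total cost grows linearly with sequence length. All function names and the feature map phi here are illustrative assumptions, not quantities defined in the paper.

# Minimal NumPy sketch (not the paper's EDSA): contrasts quadratic-cost naive
# autoregressive self-attention decoding with a linear-cost recurrent variant.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def naive_decode(Q, K, V):
    """Standard causal self-attention: step t attends over all t+1 past
    positions, so the total work grows as O(T^2 * d)."""
    T, d = Q.shape
    out = np.zeros_like(V)
    for t in range(T):
        scores = Q[t] @ K[:t + 1].T / np.sqrt(d)   # O(t) work at step t
        out[t] = softmax(scores) @ V[:t + 1]
    return out

def linear_decode(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """Recurrent decoding with a kernel feature map phi: the running sums
    S and z summarize all history, so each step costs O(d^2) regardless of t,
    giving O(T) total growth in sequence length."""
    T, d = Q.shape
    S = np.zeros((d, d))   # running sum of phi(k_t) v_t^T
    z = np.zeros(d)        # running sum of phi(k_t)
    out = np.zeros_like(V)
    for t in range(T):
        S += np.outer(phi(K[t]), V[t])
        z += phi(K[t])
        q = phi(Q[t])
        out[t] = (q @ S) / (q @ z + 1e-6)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d = 6, 4
    Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
    print(naive_decode(Q, K, V).shape, linear_decode(Q, K, V).shape)

Both routines return a (T, d) array of context vectors; the point of the comparison is the per-step cost, not the numerical outputs, which differ because the kernelized recurrence is not algebraically equivalent to softmax attention.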
Efficient decoding self-attention network for end-to-end speech synthesis
Keywords: Efficient decoding; End-to-end; Self-attention; Speech synthesis