School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China
Jiangsu Key Laboratory of Security Technology for Industrial Cyberspace, Zhenjiang 212013, China
Duolin HUANG, E-mail: 2211708034@stmail.ujs.edu.cn
Qirong MAO, E-mail: mao_qr@ujs.edu.cn
Zhongchen MA, E-mail: zhongchen_ma@ujs.edu.cn
Zhishen ZHENG, E-mail: 1209103822@qq.com
Sidheswar ROUTRYAR, E-mail: sidheswar69@gmail.com
Elias-Nii-Noi OCQUAYE, E-mail: eocquaye@ujs.edu.cn
Print publication date: 2021-05
Online publication date: 2021-01-29
Received: 2019-12-10
Revised: 2020-11-18
Duolin HUANG, Qirong MAO, Zhongchen MA, et al. Latent discriminative representation learning for speaker recognition[J]. Frontiers of Information Technology & Electronic Engineering, 2021, 22(5): 697-708. DOI: 10.1631/FITEE.1900690.
Extracting discriminative speaker-specific representations from speech signals and transforming them into fixed-length vectors are key steps in speaker identification and verification systems. In this study, we propose a latent discriminative representation learning method for speaker recognition. The learned representations are intended to be not only discriminative but also relevant. Specifically, we introduce an additional speaker embedding lookup table to explore the relevance between different utterances from the same speaker. Moreover, a reconstruction constraint for learning a linear mapping matrix is introduced to make the representations more discriminative. Experimental results demonstrate that the proposed method outperforms state-of-the-art methods on the Apollo dataset used in the Fearless Steps Challenge at INTERSPEECH 2019 and on the TIMIT dataset.
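The abstract names two ingredients, a speaker embedding lookup table and a reconstruction constraint on a linear mapping matrix, without giving their exact form. The following is a minimal sketch under stated assumptions, not the authors' formulation: the function `latent_losses`, the squared-distance relevance term, and the use of the transpose of `W` for reconstruction are all hypothetical choices made only for illustration.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def latent_losses(x, speaker_id, W, table):
    """Return (relevance, reconstruction) for one utterance embedding x.

    Assumed, not taken from the paper:
      - W is a k x d linear mapping into a latent space.
      - table maps each speaker id to one shared k-dimensional entry.
      - relevance pulls W x toward the speaker's table entry, so that
        utterances from the same speaker are tied together.
      - reconstruction asks that W^T (W x) stays close to x.
    """
    z = matvec(W, x)                            # latent representation
    relevance = sq_dist(z, table[speaker_id])   # distance to shared entry
    x_hat = matvec(transpose(W), z)             # map back with W^T
    reconstruction = sq_dist(x, x_hat)
    return relevance, reconstruction


# Toy example: 3-dimensional utterance embedding, 2-dimensional latent space.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
table = {"spk_a": [1.0, 2.0], "spk_b": [-1.0, 0.0]}
x = [1.0, 2.0, 0.5]

rel, rec = latent_losses(x, "spk_a", W, table)
print(rel, rec)  # 0.0 0.25
```

In this toy case the latent vector coincides with the table entry for `spk_a`, so the relevance term is zero, while the reconstruction term measures the component of `x` that the mapping discards. A training procedure would minimize a weighted sum of such terms together with a classification loss; the weighting is left open here because the abstract does not specify it.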
Keywords: Speaker recognition; Latent discriminative representation learning; Speaker embedding lookup table; Linear mapping matrix