Robust cross-modal retrieval with alignment refurbishment
Jinyi GUO, Jieyu DING
1. School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
2. School of Mathematics and Statistics, Qingdao University, Qingdao 266071, China
E-mail: jinyi_g@njust.edu.cn
‡Corresponding author
Print publication date: 2023-10-0
Received: 2022-10-27
Accepted: 2023-02-16
Guo JY, Ding JY, 2023. Robust cross-modal retrieval with alignment refurbishment. Frontiers of Information Technology & Electronic Engineering, 24(10):1403-1415. https://doi.org/10.1631/FITEE.2200514
Cross-modal retrieval aims to achieve mutual retrieval between modalities by establishing consistent alignments across different modal data. Many cross-modal retrieval methods have been proposed and have achieved excellent results; however, they are trained with clean cross-modal pairs, which are semantically matched but costly to annotate compared with the noise-aligned data (i.e., paired but semantically mismatched) that are easily available on the Internet. When trained with noise-aligned data, these methods degrade dramatically in performance. Therefore, we propose robust cross-modal retrieval with alignment refurbishment (RCAR), which significantly reduces the impact of noise on the model. Specifically, RCAR first conducts multi-task learning to slow down overfitting to the noise and make the data separable. Then, it uses a two-component beta-mixture model to divide the pairs into clean and noisy alignments, and refurbishes the alignment labels according to the posterior probability of the noise-alignment component. In addition, we define two noise types in the noise-alignment paradigm: partial noise and complete noise. Experimental results show that, compared with popular cross-modal retrieval methods, RCAR achieves more robust performance under both types of noise.
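To make the alignment-refurbishment step concrete, below is a minimal Python sketch of the two ingredients the abstract names: fitting a two-component beta-mixture model (BMM) to per-pair training losses rescaled to (0, 1) with EM, and turning the posterior probability of the noise component into refurbished soft alignment labels. The helper names (fit_bmm, weighted_beta_mom), the initialization, and the convex-combination refurbishment rule are illustrative assumptions, not the authors' exact implementation.

import numpy as np
from scipy.stats import beta as beta_dist

def weighted_beta_mom(x, w):
    # Weighted method-of-moments estimates of Beta(a, b) parameters.
    mu = np.average(x, weights=w)
    var = np.average((x - mu) ** 2, weights=w)
    common = mu * (1.0 - mu) / max(var, 1e-8) - 1.0
    return max(mu * common, 1e-2), max((1.0 - mu) * common, 1e-2)

def fit_bmm(losses, n_iters=10, eps=1e-4):
    # Fit a two-component beta-mixture model to per-pair losses in (0, 1) via EM.
    # Component 0 is initialized toward low losses (clean alignments),
    # component 1 toward high losses (noisy alignments).
    x = np.clip(losses, eps, 1.0 - eps)                # keep Beta densities finite
    a, b = np.array([2.0, 4.0]), np.array([4.0, 2.0])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iters):
        # E-step: responsibility of each component for each pair.
        pdf = np.stack([pi[k] * beta_dist.pdf(x, a[k], b[k]) for k in range(2)])
        gamma = pdf / (pdf.sum(axis=0, keepdims=True) + 1e-12)
        # M-step: weighted method-of-moments update for each component.
        for k in range(2):
            a[k], b[k] = weighted_beta_mom(x, gamma[k])
        pi = gamma.mean(axis=1)
    return a, b, pi, gamma[1]                          # gamma[1]: P(noisy | loss)

# Toy usage: 900 mostly-clean pairs plus a 100-pair noisy tail.
losses = np.concatenate([np.random.beta(2, 8, 900), np.random.beta(8, 2, 100)])
_, _, _, p_noise = fit_bmm(losses)
refurbished = 1.0 - p_noise      # soft alignment label per pair (assumed rule)
is_clean = p_noise < 0.5         # hard clean/noise split, if one is needed

The method-of-moments M-step keeps the sketch short; a fuller implementation could instead maximize the weighted Beta log-likelihood, and the 0.5 posterior threshold for the hard split is a tunable choice rather than anything prescribed by the paper.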
Keywords: Cross-modal retrieval; Robust learning; Alignment correction; Beta-mixture model