Robust cross-modal retrieval with alignment refurbishment
Jinyi GUO, Jieyu DING
1. School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
2. School of Mathematics and Statistics, Qingdao University, Qingdao 266071, China
E-mail: jinyi_g@njust.edu.cn
‡Corresponding author
Print publication date: 2023-10-0
Received: 2022-10-27
Accepted: 2023-02-16
Guo JY, Ding JY, 2023. Robust cross-modal retrieval with alignment refurbishment. Frontiers of Information Technology & Electronic Engineering, 24(10):1403-1415. https://doi.org/10.1631/FITEE.2200514
Cross-modal retrieval aims to achieve mutual retrieval between modalities by establishing consistent alignments across different modal data. Many cross-modal retrieval methods have been proposed and have achieved excellent results; however, they are trained with clean cross-modal pairs, which are semantically matched but costly to annotate compared with the noise-aligned data (i.e., paired but semantically mismatched) that are easily available on the Internet. When trained with noise-aligned data, these methods degrade dramatically in performance. Therefore, we propose robust cross-modal retrieval with alignment refurbishment (RCAR), which significantly reduces the impact of noise on the model. Specifically, RCAR first conducts multi-task learning to slow down overfitting to the noise and make the data separable. Then, it uses a two-component beta-mixture model to divide the pairs into clean and noisy alignments, and refurbishes the alignment labels according to the posterior probability of the noise-alignment component. In addition, we define two noise types in the noise-alignment paradigm: partial noise and complete noise. Experimental results show that, compared with popular cross-modal retrieval methods, RCAR achieves more robust performance under both types of noise.
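To make the alignment-refurbishment step concrete, below is a minimal Python sketch of the two ingredients the abstract names: fitting a two-component beta-mixture model (BMM) to per-pair training losses rescaled to (0, 1) with EM, and turning the posterior probability of the noise component into refurbished soft alignment labels. The helper names (fit_bmm, weighted_beta_mom), the initialization, and the convex-combination refurbishment rule are illustrative assumptions, not the authors' exact implementation.

import numpy as np
from scipy.stats import beta as beta_dist

def weighted_beta_mom(x, w):
    # Weighted method-of-moments estimates of Beta(a, b) parameters.
    mu = np.average(x, weights=w)
    var = np.average((x - mu) ** 2, weights=w)
    common = mu * (1.0 - mu) / max(var, 1e-8) - 1.0
    return max(mu * common, 1e-2), max((1.0 - mu) * common, 1e-2)

def fit_bmm(losses, n_iters=10, eps=1e-4):
    # Fit a two-component beta-mixture model to per-pair losses in (0, 1) via EM.
    # Component 0 is initialized toward low losses (clean alignments),
    # component 1 toward high losses (noisy alignments).
    x = np.clip(losses, eps, 1.0 - eps)                # keep Beta densities finite
    a, b = np.array([2.0, 4.0]), np.array([4.0, 2.0])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iters):
        # E-step: responsibility of each component for each pair.
        pdf = np.stack([pi[k] * beta_dist.pdf(x, a[k], b[k]) for k in range(2)])
        gamma = pdf / (pdf.sum(axis=0, keepdims=True) + 1e-12)
        # M-step: weighted method-of-moments update for each component.
        for k in range(2):
            a[k], b[k] = weighted_beta_mom(x, gamma[k])
        pi = gamma.mean(axis=1)
    return a, b, pi, gamma[1]                          # gamma[1]: P(noisy | loss)

# Toy usage: 900 mostly-clean pairs plus a 100-pair noisy tail.
losses = np.concatenate([np.random.beta(2, 8, 900), np.random.beta(8, 2, 100)])
_, _, _, p_noise = fit_bmm(losses)
refurbished = 1.0 - p_noise      # soft alignment label per pair (assumed rule)
is_clean = p_noise < 0.5         # hard clean/noise split, if one is needed

The method-of-moments M-step keeps the sketch short; a fuller implementation could instead maximize the weighted Beta log-likelihood, and the 0.5 posterior threshold for the hard split is a tunable choice rather than anything prescribed by the paper.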
Keywords: Cross-modal retrieval; Robust learning; Alignment correction; Beta-mixture model