

College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
Tianjin Key Lab of Machine Learning, Tianjin University, Tianjin 300350, China
School of Computer Science, University of Technology Sydney, Sydney 2007, Australia
Yahong HAN, E-mail: yahong@tju.edu.cn
Received: 25 December 2020
Revised: 2021-04-22
Published: 2021-05
Yahong HAN, Aming WU, Linchao ZHU, et al. Visual commonsense reasoning with directional visual connections[J]. Frontiers of Information Technology & Electronic Engineering, 2021, 22(5): 625-637. DOI: 10.1631/FITEE.2000722.
To boost research into cognition-level visual understanding, i.e., making an accurate inference based on a thorough understanding of visual details, visual commonsense reasoning (VCR) has been proposed. Compared with traditional visual question answering, which requires models only to select correct answers, VCR requires models to select not only the correct answers, but also the correct rationales. Recent research into human cognition has indicated that brain function or cognition can be considered as a global and dynamic integration of local neuron connectivity, which is helpful in solving specific cognition tasks. Inspired by this idea, we propose a directional connective network to achieve VCR by dynamically reorganizing the visual neuron connectivity that is contextualized using the meaning of questions and answers, and by leveraging the directional information to enhance the reasoning ability. Specifically, we first develop a GraphVLAD module to capture visual neuron connectivity to fully model visual content correlations. Then, a contextualization process is proposed to fuse sentence representations with visual neuron representations. Finally, based on the output of contextualized connectivity, we propose directional connectivity, which includes a ReasonVLAD module, to infer answers and rationales. Experimental results on the VCR dataset and visualization analysis demonstrate the effectiveness of our method.
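The abstract does not spell out how the GraphVLAD module aggregates visual features. As a rough illustration of the VLAD-style pooling that the module name suggests, the sketch below implements a generic soft-assignment aggregation in NumPy. The function name `soft_vlad`, the array shapes, and the dot-product similarity used for assignment are all assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def soft_vlad(features, centers):
    """Hypothetical helper: pool N local descriptors (N, D) around K centers (K, D)
    into a single (K, D) VLAD-style matrix using soft assignment."""
    # Soft-assign each descriptor to the centers via a softmax over similarities.
    logits = features @ centers.T                        # (N, K)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    assign = np.exp(logits)
    assign = assign / assign.sum(axis=1, keepdims=True)  # each row sums to 1

    # Accumulate assignment-weighted residuals between descriptors and centers.
    residuals = features[:, None, :] - centers[None, :, :]   # (N, K, D)
    vlad = (assign[:, :, None] * residuals).sum(axis=0)      # (K, D)

    # Normalize each center's row, then L2-normalize the whole descriptor.
    vlad = vlad / (np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12)
    return vlad / (np.linalg.norm(vlad) + 1e-12)

# Toy data: 50 local descriptors of dimension 8, pooled around 4 centers.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 8))
centers = rng.normal(size=(4, 8))
vlad = soft_vlad(features, centers)
print(vlad.shape)  # (4, 8)
```

In a trained model the centers would be learned parameters rather than random draws; the toy data here only shows the shapes involved.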