School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China
The State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
LI Ping, E-mail: patriclouis.lee@gmail.com
Print publication date: 2021-06
Received: 2020-08-25
Revised: 2021-04-01
Ping Li, Chao Tang, Xianghua Xu. Video summarization with a graph convolutional attention network [J]. Frontiers of Information Technology & Electronic Engineering, 2021, 22(6): 902-913. DOI: 10.1631/FITEE.2000429.
Video summarization has established itself as a fundamental technique for generating compact and concise video, which eases the management and browsing of large-scale video data. Existing methods fail to fully consider the local and global relations among video frames, which degrades summarization performance. To address this problem, we propose a graph convolutional attention network (GCAN) for video summarization. GCAN consists of two parts, embedding learning and context fusion, where embedding learning includes a temporal branch and a graph branch. In particular, GCAN uses dilated temporal convolution to model local cues and temporal self-attention to exploit global cues for video frames. It learns a graph embedding via a multi-layer graph convolutional network to reveal the intrinsic structure of frame samples. The context fusion part combines the output streams from the temporal branch and the graph branch to create a context-aware representation of frames, on which importance scores are evaluated for selecting representative frames to generate the video summary. Experiments on two benchmark databases, SumMe and TVSum, show that the proposed GCAN approach achieves superior performance compared to several state-of-the-art alternatives in three evaluation settings.
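The two-branch design described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the cosine-similarity graph, the identity weight matrices, the chaining of attention after the dilated convolution, fusion by concatenation, and the sigmoid scoring are all assumptions made for illustration, and nothing here is trained.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dilated_conv(frames, kernel, dilation):
    """Dilated temporal convolution with zero padding (local cues)."""
    count, half, dim = len(frames), len(kernel) // 2, len(frames[0])
    out = []
    for i in range(count):
        acc = [0.0] * dim
        for k, w in enumerate(kernel):
            j = i + (k - half) * dilation  # neighbours are `dilation` frames apart
            if 0 <= j < count:
                acc = [a + w * f for a, f in zip(acc, frames[j])]
        out.append(acc)
    return out

def self_attention(frames):
    """Scaled dot-product self-attention across all frames (global cues)."""
    dim = len(frames[0])
    transposed = [list(col) for col in zip(*frames)]
    scores = matmul(frames, transposed)
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, frames)

def cosine_adjacency(frames):
    """Assumed graph construction: non-negative cosine similarity between frames."""
    norms = [math.sqrt(sum(v * v for v in f)) for f in frames]
    n = len(frames)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and norms[i] > 0 and norms[j] > 0:
                dot = sum(a * b for a, b in zip(frames[i], frames[j]))
                adj[i][j] = max(0.0, dot / (norms[i] * norms[j]))
    return adj

def gcn_layer(adj, feats, weight):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    return [[max(0.0, v) for v in row] for row in matmul(matmul(norm, feats), weight)]

def score_frames(frames, kernel=(0.25, 0.5, 0.25), dilation=2, score_weights=None):
    """Fuse temporal and graph embeddings, then map each frame to a score in (0, 1)."""
    dim = len(frames[0])
    identity = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    temporal = self_attention(dilated_conv(frames, kernel, dilation))
    adj = cosine_adjacency(frames)
    graph = gcn_layer(adj, gcn_layer(adj, frames, identity), identity)  # two layers
    fused = [t + g for t, g in zip(temporal, graph)]  # concatenate both branches
    weights = score_weights or [1.0 / (2 * dim)] * (2 * dim)  # placeholder, untrained
    return [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(weights, row))))
            for row in fused]

def select_summary(scores, k):
    """Keep the k highest-scoring frames, returned in temporal order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

# Toy 2-D frame features; real systems would use deep features per frame.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]]
scores = score_frames(frames)
summary = select_summary(scores, k=2)
```

In a trained model the identity matrices and the uniform scoring weights would be replaced by learned parameters, and the graph could be built differently; the sketch only shows how local, global, and graph-structured information can flow into a single per-frame importance score.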
Keywords: Temporal learning; Self-attention mechanism; Graph convolutional network; Context fusion; Video summarization