Unsupervised object detection with scene-adaptive concept learning

Shiliang PU; Wei ZHAO; Weijie CHEN; Shicai YANG; Di XIE; Yunhe PAN

doi:10.1631/FITEE.2000567


Special Feature on Visual Knowledge | Updated: 2022-05-19

- Unsupervised object detection with scene-adaptive concept learning
- Frontiers of Information Technology & Electronic Engineering, 2021, Vol. 22, No. 5, pp. 638-651
- Affiliations:
  
  Hikvision Research Institute, Hangzhou 310051, China
  College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
- Author bio:
  
  Di XIE, E-mail: xiedi@hikvision.com
- Funds:
  
  Project supported by the National Key R&D Program of China (No. 2020AAA010400X) and the Hikvision Open Fund, China
- DOI: 10.1631/FITEE.2000567
  CLC number: TP391
- Published in print: 2021-05
  
  Received: 2020-10-20
  
  Revised: 2021-04-01
SHILIANG PU, WEI ZHAO, WEIJIE CHEN, et al. Unsupervised object detection with scene-adaptive concept learning[J]. Frontiers of Information Technology & Electronic Engineering, 2021, 22(5): 638-651. DOI: 10.1631/FITEE.2000567.

Abstract

Object detection is one of the most active research directions in computer vision. It has made impressive progress in academia and has many valuable applications in industry. However, mainstream detection methods still have two shortcomings: (1) even a model that is well trained on large amounts of data does not generalize well to new scenes; (2) once a model is deployed, it cannot evolve autonomously as unlabeled scene data accumulate. To address these problems, and inspired by visual knowledge theory, we propose a scene-adaptive evolution algorithm for unsupervised video object detection that uses the concept of object groups to reduce the impact of scene changes. First, we extract a large number of object proposals from unlabeled data with a pre-trained detection model. Second, we build a visual knowledge dictionary of object concepts by clustering the proposals, in which each cluster center represents an object prototype. Third, we examine the relations between different clusters and the object information of different groups, and propose a graph-based group information propagation strategy to determine the category of each object concept, which effectively distinguishes positive from negative proposals. Finally, we use the resulting pseudo labels to fine-tune the pre-trained model so that it adapts to the new scene. Multiple experiments verify the effectiveness of the proposed method and show significant performance improvements.
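The pipeline described in the abstract (cluster the proposals, relate the clusters through a graph, then derive pseudo labels) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so the plain k-means step, the Gaussian affinity between cluster centers, the use of mean detector confidence as the per-cluster seed score, and all names and parameters (`kmeans`, `propagate_cluster_scores`, `pseudo_labels`, `sigma`, `alpha`, `threshold`) are assumptions chosen only for illustration.

```python
import numpy as np


def kmeans(features, k, iters=20, seed=0):
    """Plain Lloyd's k-means (assumed clustering step); returns (centers, assignment)."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)].copy()
    assign = np.zeros(len(features), dtype=int)
    for _ in range(iters):
        # Squared Euclidean distance from every proposal to every center.
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = features[assign == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers, assign


def propagate_cluster_scores(centers, seed_scores, sigma=1.0, alpha=0.5, steps=20):
    """Smooth per-cluster scores over a graph whose nodes are cluster centers.

    Hypothetical propagation rule: Gaussian affinities with self-loops,
    row-normalised, mixed with the seed scores at every step.
    """
    seed_scores = np.asarray(seed_scores, dtype=float)
    d2 = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    affinity = np.exp(-d2 / (2.0 * sigma ** 2))  # diagonal is 1 (self-loop)
    transition = affinity / affinity.sum(axis=1, keepdims=True)
    scores = seed_scores.copy()
    for _ in range(steps):
        scores = alpha * (transition @ scores) + (1.0 - alpha) * seed_scores
    return scores


def pseudo_labels(features, confidences, k=2, threshold=0.5):
    """Assign a 0/1 pseudo label to every proposal via its cluster's score."""
    centers, assign = kmeans(features, k)
    # Assumed seed: mean detector confidence of the proposals in each cluster.
    seed = np.array([
        confidences[assign == j].mean() if np.any(assign == j) else 0.0
        for j in range(k)
    ])
    scores = propagate_cluster_scores(centers, seed)
    labels = (scores[assign] >= threshold).astype(int)
    return labels, scores, assign


# Toy data: two tight groups of 1-D "proposal features" with detector confidences.
toy_features = np.array([[0.0], [0.1], [0.2], [0.3], [10.0], [10.1], [10.2], [10.3]])
toy_confidences = np.array([0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1])
toy_labels, toy_scores, toy_assign = pseudo_labels(toy_features, toy_confidences)
```

In this toy setting the proposals in the high-confidence group receive pseudo label 1 and the others 0. The resulting labels would then serve as targets for fine-tuning the pre-trained detector, a step not shown here.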

Keywords

Visual knowledge; Unsupervised video object detection; Scene-adaptive learning

