1. College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
2. Songshan Laboratory, Zhengzhou 450000, China
3. School of Electronic Engineering, Xidian University, Xi'an 710401, China
‡ Corresponding author
Received: 12 May 2024
Revised: 18 September 2024
Published: June 2025
Deng LI, Peng LI, Aming WU, et al. Prototype-guided cross-task knowledge distillation[J]. Frontiers of information technology & electronic engineering, 2025, 26(6): 912-929. DOI: 10.1631/FITEE.2400383.
Recently, large-scale pretrained models have demonstrated their advantages in various tasks. However, due to their enormous computational complexity and storage demands, it is challenging to deploy large-scale models in real-world scenarios. Most existing knowledge distillation methods require the teacher model and the student model to share the same label space, which restricts the application of pretrained models in real-world scenarios. To alleviate the constraint of different label spaces, we propose a prototype-guided cross-task knowledge distillation (ProC-KD) method to transfer the intrinsic local-level object knowledge of the teacher network to various downstream task scenarios. First, to better learn generalized knowledge in cross-task scenarios, we present a prototype learning module that learns the invariant intrinsic local representations of objects from the teacher network. Second, for diverse downstream tasks, we propose a task-adaptive feature augmentation module that enhances the student network features with the learned generalized prototype representations and guides the learning of the student network to improve its generalization ability. Experimental results on various visual tasks demonstrate the effectiveness of our approach in cross-task knowledge distillation scenarios.
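To make the two-module pipeline described above concrete, the following is a minimal, illustrative PyTorch sketch of prototype-guided distillation: learnable prototypes summarize teacher features, a gated fusion augments student features with prototype knowledge, and a feature-matching loss guides the student. All class names, dimensions, and the specific gating/assignment choices are assumptions made for illustration; they do not reproduce the authors' ProC-KD implementation.

# Illustrative sketch only; names and design details are assumptions, not the ProC-KD code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeLearning(nn.Module):
    """Keeps K learnable prototypes and softly assigns teacher features to them."""
    def __init__(self, num_prototypes: int, feat_dim: int):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, feat_dim))

    def forward(self, teacher_feats: torch.Tensor) -> torch.Tensor:
        # teacher_feats: (N, D) local object features from the teacher network
        sim = F.normalize(teacher_feats, dim=-1) @ F.normalize(self.prototypes, dim=-1).t()
        assign = sim.softmax(dim=-1)        # soft assignment to prototypes, (N, K)
        return assign @ self.prototypes     # prototype-based reconstruction, (N, D)

class TaskAdaptiveAugmentation(nn.Module):
    """Enhances student features with prototype knowledge via a learned gate."""
    def __init__(self, student_dim: int, feat_dim: int):
        super().__init__()
        self.project = nn.Linear(student_dim, feat_dim)  # align student and teacher dims
        self.gate = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.Sigmoid())

    def forward(self, student_feats: torch.Tensor, proto_feats: torch.Tensor) -> torch.Tensor:
        s = self.project(student_feats)
        g = self.gate(torch.cat([s, proto_feats], dim=-1))
        return g * proto_feats + (1.0 - g) * s           # task-adaptively fused feature

def distillation_loss(student_feats, teacher_feats, proto_module, aug_module):
    proto_feats = proto_module(teacher_feats)
    enhanced = aug_module(student_feats, proto_feats)
    # Pull the prototype-enhanced student representation toward the teacher's features.
    return F.mse_loss(enhanced, teacher_feats)

if __name__ == "__main__":
    proto = PrototypeLearning(num_prototypes=32, feat_dim=256)
    aug = TaskAdaptiveAugmentation(student_dim=128, feat_dim=256)
    t = torch.randn(16, 256)  # toy teacher local features
    s = torch.randn(16, 128)  # toy student local features
    print(distillation_loss(s, t, proto, aug).item())

In a full training loop this feature-level term would be added to the student's task loss, so the prototypes carry teacher knowledge across tasks even when the teacher and student label spaces differ.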