
State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou 310027, China
Alibaba Group, Hangzhou 310024, China
Yining QI received her BS degree in Control Science and Engineering at Zhejiang University, Hangzhou, China, in 2019. She is currently an MS candidate in Control Science and Engineering at Zhejiang University. Her research interests are cloud network, network diagnostics, and network measurement. E-mail: qyning710@gmail.com
Chongrong FANG, E-mail: chongrongfang.zju@gmail.com
Haoyu LIU, E-mail: haoyu_liu@zju.edu.cn
Daxiang KANG, E-mail: daxiang.kdx@alibaba-inc.com
Biao LYU, E-mail: lubiao.lb@alibaba-inc.com
Peng CHENG received his BS degree in Automation and his PhD degree in Control Science and Engineering at Zhejiang University, in 2004 and 2009, respectively. From 2012 to 2013, he worked as a research fellow in Information System Technology and Design Pillar at Singapore University of Technology and Design, Singapore. He is currently a professor at the College of Control Science and Engineering, Zhejiang University. He is a corresponding expert of Front Inform Technol Electron Eng. His research interests include networked sensing and control, cyber-physical systems, and control system security.
Jiming CHEN received his BS and PhD degrees in Control Science and Engineering at Zhejiang University, in 2000 and 2005, respectively. He was a visiting researcher at the University of Waterloo, Canada, from 2008 to 2010. He is the Deputy Director of the State Key Laboratory of Industrial Control Technology and a member of the Academic Committee at Zhejiang University. He is now serving as an editor of Front Inform Technol Electron Eng. His research interests include the Internet of Things, sensor networks, networked control, and control system security. He was a recipient of the Fok Ying Tung Young Teacher Award of the Ministry of Education and the IEEE ComSoc Asia-Pacific Outstanding Young Researcher Award. He is an IEEE Vehicular Technology Society distinguished lecturer and an IEEE fellow. E-mail: cjm@zju.edu.cn
Received: 2020-04-06
Revised: 2021-05-07
Published in print: 2021-08
Yining QI, Chongrong FANG, Haoyu LIU, et al. A survey of cloud network fault diagnostic systems and tools[J]. Frontiers of Information Technology & Electronic Engineering, 2021, 22(8): 1031-1045. DOI: 10.1631/FITEE.2000153.
In recent years, cloud computing has become vital infrastructure that supports everyday life and production. However, as cloud networks grow more complex, failures occur more often and cause huge economic losses. To guarantee cloud network performance and limit the damage caused by failures, cloud network diagnostics has therefore become of great interest to cloud service providers. Because of the characteristics of the cloud network (e.g., virtualization and multi-tenancy), porting traditional network diagnostic tools to the cloud network faces several difficulties. Additionally, many existing tools cannot solve problems that are unique to the cloud network. In this paper, we summarize the state-of-the-art cloud network diagnostic systems and tools that can be used in production cloud networks, and classify them according to their features. Moreover, we analyze the differences between cloud network diagnostics and traditional network diagnostics based on the characteristics of the cloud network. Considering the operational requirements of the cloud network, we propose points that should be taken into account when designing a cloud network diagnostic tool. Finally, we discuss the opportunities and challenges that cloud network diagnostics will face in future development.