Decentralized multi-agent reinforcement learning with networked agents: recent advances

Kaiqing Zhang; Zhuoran Yang; Tamer Başar

doi:10.1631/FITEE.1900661

Regular article | Updated: 2022-06-06

- Decentralized multi-agent reinforcement learning with networked agents: recent advances
- Frontiers of Information Technology & Electronic Engineering, 2021, Vol. 22, No. 6, pp. 802-814
- Affiliations:
  Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, IL 61801, USA
  Department of Operations Research and Financial Engineering, Princeton University, NJ 08544, USA
- Author bio:
  Kaiqing ZHANG, E-mail: kzhang66@illinois.edu
  Zhuoran YANG, E-mail: zy6@princeton.edu
  Tamer BAŞAR, E-mail: basar1@illinois.edu
- Funds:
  Project supported in part by the US Army Research Laboratory (ARL) Cooperative Agreement (No. W911NF-17-2-0196), and in part by the Air Force Office of Scientific Research (AFOSR) Grant (No. FA9550-19-1-0353)
- DOI: 10.1631/FITEE.1900661
- Received: 2019-11-30
  Revised: 2020-04-29
  Published in print: 2021-06
Kaiqing ZHANG, Zhuoran YANG, Tamer BAŞAR. Decentralized multi-agent reinforcement learning with networked agents: recent advances[J]. Frontiers of Information Technology & Electronic Engineering, 2021, 22(6): 802-814. DOI: 10.1631/FITEE.1900661.

Abstract

Multi-agent reinforcement learning (MARL) has long been a significant research topic in both machine learning and control systems. Recent development of (single-agent) deep reinforcement learning has created a resurgence of interest in developing new MARL algorithms, especially those founded on theoretical analysis. In this paper, we review recent advances on a sub-area of this topic: decentralized MARL with networked agents. In this scenario, multiple agents perform sequential decision-making in a common environment, without the coordination of any central controller, while being allowed to exchange information with their neighbors over a communication network. Such a setting finds broad applications in the control and operation of robots, unmanned vehicles, mobile sensor networks, and the smart grid. This review covers several of our research endeavors in this direction, as well as progress made by other researchers along the same line. We hope that this review promotes additional research efforts in this exciting yet challenging area.
