Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, IL 61801, USA
Department of Operations Research and Financial Engineering, Princeton University, NJ 08544, USA
Kaiqing ZHANG, E-mail: kzhang66@illinois.edu
Zhuoran YANG, E-mail: zy6@princeton.edu
Tamer BAŞAR, E-mail: basar1@illinois.edu
Print publication date: 2021-06
Received: 2019-11-30
Revised: 2020-04-29
Kaiqing ZHANG, Zhuoran YANG, Tamer BAŞAR. Decentralized multi-agent reinforcement learning with networked agents: recent advances[J]. Frontiers of Information Technology & Electronic Engineering, 2021, 22(6): 802-814. DOI: 10.1631/FITEE.1900661.
Multi-agent reinforcement learning (MARL) has long been a significant research topic in both machine learning and control systems. Recent development of (single-agent) deep reinforcement learning has created a resurgence of interest in developing new MARL algorithms, especially those founded on theoretical analysis. In this paper, we review recent advances on a sub-area of this topic: decentralized MARL with networked agents. In this scenario, multiple agents perform sequential decision-making in a common environment, without the coordination of any central controller, while being allowed to exchange information with their neighbors over a communication network. Such a setting finds broad applications in the control and operation of robots, unmanned vehicles, mobile sensor networks, and the smart grid. This review covers several of our research endeavors in this direction, as well as progress made by other researchers along the line. We hope that this review promotes additional research efforts in this exciting yet challenging area.
Keywords: Reinforcement learning; Multi-agent systems; Networked systems; Consensus optimization; Distributed optimization; Game theory