

Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410000, China
Hao-nan WANG, E-mail: wanghaonan14@nudt.edu.cn
Ning LIU, E-mail: liuning17a@nudt.edu.cn
Yi-yun ZHANG, E-mail: zhangyiyun213@163.com
Da-wei FENG, E-mail: fengdawei@nudt.edu.cn
Feng HUANG, E-mail: huangfeng@nudt.edu.cn
Dong-sheng LI, E-mail: dsli@nudt.edu.cn
Yi-ming ZHANG, E-mail: zhangyiming@nudt.edu.cn
Received: 29 September 2019
Revised: 4 June 2020
Published Online: 15 October 2020
Published: December 2020
Hao-nan WANG, Ning LIU, Yi-yun ZHANG, et al. Deep reinforcement learning: a survey[J]. Frontiers of Information Technology & Electronic Engineering, 2020, 21(12): 1726-1744. DOI: 10.1631/FITEE.1900533.
Deep reinforcement learning (RL) has become one of the most popular topics in artificial intelligence research. It has been widely used in various fields, such as end-to-end control, robotic control, recommendation systems, and natural language dialogue systems. In this survey, we systematically categorize the deep RL algorithms and applications, and provide a detailed review of existing deep RL algorithms by dividing them into model-based methods, model-free methods, and advanced RL methods. We thoroughly analyze the advances, including exploration, inverse RL, and transfer RL. Finally, we outline the current representative applications and analyze four open problems for future research.
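To make the "model-free" family named in the abstract concrete, the sketch below runs tabular Q-learning on a toy 5-state chain MDP. It is an illustrative example under assumed settings (the chain environment, learning rate, and epsilon are made up for this sketch), not an algorithm taken from the survey; deep RL methods such as DQN replace the Q-table below with a neural network function approximator.

```python
import random

# Toy environment: states 0..4 on a chain; state 4 is terminal and
# entering it yields reward 1. All other transitions yield reward 0.
N_STATES = 5
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1   # assumed hyperparameters

def step(state, action):
    """Deterministic chain dynamics; reward only for reaching the goal."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, float(done), done

def train(episodes=500, seed=0):
    random.seed(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if random.random() < EPS:            # epsilon-greedy exploration
                a = random.choice(ACTIONS)
            else:                                # greedy, ties broken randomly
                best = max(q[(s, x)] for x in ACTIONS)
                a = random.choice([x for x in ACTIONS if q[(s, x)] == best])
            s2, r, done = step(s, a)
            target = r if done else r + GAMMA * max(q[(s2, x)] for x in ACTIONS)
            q[(s, a)] += ALPHA * (target - q[(s, a)])   # temporal-difference update
            s = s2
    return q

q = train()
# Greedy policy for the non-terminal states (1 = move right everywhere).
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
```

The learned policy moves right in every non-terminal state, and the Q-values decay geometrically with distance from the goal (roughly GAMMA to the power of the remaining steps), which is the behavior model-free value methods recover from reward signals alone, without a dynamics model.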