The Journal of China Universities of Posts and Telecommunications, 2014, Vol. 21, Issue (5): 94-104. DOI: 10.1016/S1005-8885(14)60337-X

Autonomic discovery of subgoals in hierarchical reinforcement learning

XIAO Ding, SHI Chuan

  1. Beijing University of Posts and Telecommunications
  • Received: 2013-11-11  Revised: 2014-06-23  Online: 2014-10-31  Published: 2014-10-30
  • Contact: XIAO Ding  E-mail: dxiao@bupt.edu.cn

Abstract: The option framework is a promising way to discover hierarchical structure in reinforcement learning (RL) and thereby accelerate learning. The key to option discovery is how an agent can autonomously find useful subgoals from the trails it has traversed. Analyzing the agent's actions along these trails yields useful heuristics: the agent not only passes through subgoals more frequently, but its effective actions at subgoals are also restricted. Consequently, subgoals can be regarded as the most action-restricted states on the paths. For grid-world environments, the unique-direction value (UDV), which reflects this action-restricted property, is introduced to identify such states. The UDV approach is then used to form options autonomously, both offline and online. Experiments show that the approach finds subgoals correctly, and that Q-learning with the options discovered in both the offline and the online processes accelerates learning significantly.
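The abstract describes the idea but not the exact unique-direction value formula. The Python sketch below illustrates one plausible reading, under the assumption that a state's UDV-style score combines how often the state is visited with how strongly its outgoing moves collapse to a single direction; the function and parameter names (find_subgoals, top_k, min_visits) are illustrative and not taken from the paper.

```python
# Illustrative sketch only: the abstract does not give the exact UDV formula,
# so the scoring below -- visit frequency times the dominance of one outgoing
# direction -- is an assumed reading of "most action-restricted states".
from collections import defaultdict

def find_subgoals(trajectories, top_k=2, min_visits=5):
    """Pick candidate subgoals from recorded grid-world trails.

    trajectories: list of trails, each a list of (state, action) pairs,
                  where action is a move direction such as 'N', 'S', 'E', 'W'.
    Returns the top_k states with the highest UDV-style score.
    """
    visits = defaultdict(int)                            # how often each state is passed
    dir_counts = defaultdict(lambda: defaultdict(int))   # outgoing-direction histogram per state

    for trail in trajectories:
        for state, action in trail:
            visits[state] += 1
            dir_counts[state][action] += 1

    scores = {}
    for state, n in visits.items():
        if n < min_visits:
            continue
        # Fraction of visits that leave the state in its single dominant direction:
        # close to 1.0 means the state is strongly action-restricted (e.g. a doorway).
        restriction = max(dir_counts[state].values()) / n
        scores[state] = n * restriction                  # frequently visited AND action-restricted

    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

In the standard options framework, each selected state would then serve as the termination state of an option whose internal policy drives the agent toward it, and Q-learning treats these options as additional, temporally extended actions alongside the primitive moves.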

Key words: hierarchical reinforcement learning, option, Q-learning, subgoal, UDV
