The Journal of China Universities of Posts and Telecommunications, 2014, Vol. 21, Issue (5): 94-104. DOI: 10.1016/S1005-8885(14)60337-X

Autonomic discovery of subgoals in hierarchical reinforcement learning

XIAO Ding, SHI Chuan

  1. Beijing University of Posts and Telecommunications
  • Received: 2013-11-11  Revised: 2014-06-23  Online: 2014-10-31  Published: 2014-10-30
  • Contact: XIAO Ding  E-mail: dxiao@bupt.edu.cn

Abstract: The option framework is a promising way to discover hierarchical structure in reinforcement learning (RL) and thereby accelerate learning. The key to option discovery is how an agent can autonomously find useful subgoals from the trails it has traversed. Analyzing the agent's actions along these trails yields useful heuristics: the agent not only passes through subgoals more frequently, but its effective actions at subgoals are also restricted. Consequently, subgoals can be regarded as the most action-restricted states on the paths. For grid-world environments, the unique-direction value (UDV), which reflects this action-restricted property, is introduced to identify such states. The UDV approach is then used to form options autonomously, both offline and online. Experiments show that the approach finds subgoals correctly, and that Q-learning with the options discovered in both the offline and the online processes accelerates learning significantly.
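The abstract describes the idea but not the exact unique-direction value formula. The Python sketch below illustrates one plausible reading, under the assumption that a state's UDV-style score combines how often the state is visited with how strongly its outgoing moves collapse to a single direction; the function and parameter names (find_subgoals, top_k, min_visits) are illustrative and not taken from the paper.

```python
# Illustrative sketch only: the abstract does not give the exact UDV formula,
# so the scoring below -- visit frequency times the dominance of one outgoing
# direction -- is an assumed reading of "most action-restricted states".
from collections import defaultdict

def find_subgoals(trajectories, top_k=2, min_visits=5):
    """Pick candidate subgoals from recorded grid-world trails.

    trajectories: list of trails, each a list of (state, action) pairs,
                  where action is a move direction such as 'N', 'S', 'E', 'W'.
    Returns the top_k states with the highest UDV-style score.
    """
    visits = defaultdict(int)                            # how often each state is passed
    dir_counts = defaultdict(lambda: defaultdict(int))   # outgoing-direction histogram per state

    for trail in trajectories:
        for state, action in trail:
            visits[state] += 1
            dir_counts[state][action] += 1

    scores = {}
    for state, n in visits.items():
        if n < min_visits:
            continue
        # Fraction of visits that leave the state in its single dominant direction:
        # close to 1.0 means the state is strongly action-restricted (e.g. a doorway).
        restriction = max(dir_counts[state].values()) / n
        scores[state] = n * restriction                  # frequently visited AND action-restricted

    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

In the standard options framework, each selected state would then serve as the termination state of an option whose internal policy drives the agent toward it, and Q-learning treats these options as additional, temporally extended actions alongside the primitive moves.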

Key words: hierarchical reinforcement learning, option, Q-learning, subgoal, UDV
