
# Exploitation-Oriented Learning PS-r^{#}

## Kazuteru Miyazaki^{*} and Shigenobu Kobayashi^{**}

^{*}Department of Assessment and Research for Degree Awarding, National Institution for Academic Degrees and University Evaluation, 1-29-1 Gakuennishimachi, Kodaira, Tokyo 187-8587, Japan

^{**}Graduate School of Interdisciplinary Science and Engineering, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama, Kanagawa 226-8502, Japan

Keywords: PS-r^{*}, partially observable Markov decision process, exploitation-oriented learning (XoL)

Exploitation-oriented learning (XoL) is a novel approach to goal-directed learning from interaction. Whereas reinforcement learning focuses on learning itself and ensures optimality in Markov decision process (MDP) environments, XoL aims to learn a rational policy that obtains rewards continuously and very quickly. PS-r^{*}, a form of XoL, learns a useful rational policy that is not inferior to a random walk in partially observable Markov decision process (POMDP) environments with a single reward type. PS-r^{*}, however, requires O(MN^{2}) memory, where N is the number of sensory input types and M is the number of action types. We propose PS-r^{#}, which learns a useful rational policy in POMDPs using only O(MN) memory. The effectiveness of PS-r^{#} is confirmed in numerical examples.
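As a rough illustration of the abstract's memory claim, the sketch below counts table entries for the two methods. Only the O(MN^{2}) versus O(MN) scaling comes from the paper; the concrete values of N and M, and the assumption that each order term corresponds to one table entry per (observation, action) combination, are illustrative.

```python
# Illustrative comparison of memory scaling for PS-r* vs. the proposed PS-r#.
# N = number of sensory input (observation) types, M = number of action types.
# Constant factors are hypothetical; only the asymptotic orders are from the paper.

def ps_r_star_cells(N: int, M: int) -> int:
    """PS-r* requires O(M * N^2) memory entries."""
    return M * N * N

def ps_r_sharp_cells(N: int, M: int) -> int:
    """PS-r# requires O(M * N) memory entries."""
    return M * N

# Example: 1000 observation types, 10 action types.
N, M = 1000, 10
print(ps_r_star_cells(N, M))   # 10000000 entries
print(ps_r_sharp_cells(N, M))  # 10000 entries
```

For even moderately large observation spaces, the quadratic dependence on N dominates, which is why reducing the requirement to O(MN) matters in POMDPs with many sensory input types.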


*J. Adv. Comput. Intell. Intell. Inform.*, Vol.13, No.6, pp. 624-630, 2009.


This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.