We developed two online, data-driven methods for estimating an objective function in continuous-time linear and nonlinear deterministic systems. The primary focus is the challenge posed by unknown input dynamics (the control mapping function) of the expert system, a critical element for solving the problem online. Our methods leverage both the learner's and the expert's data to solve the problem effectively.
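As a hedged illustration (not taken from the paper, which also covers nonlinear systems and treats the input dynamics as unknown), the linear-quadratic special case of the problem can be stated as follows: given expert dynamics and feedback

    \dot{x} = A x + B u, \qquad u = -K x,

find cost weights Q \succeq 0 and R \succ 0 such that K is optimal for

    J = \int_0^\infty \left( x^\top Q x + u^\top R u \right) dt,

i.e. K = R^{-1} B^\top P, where P solves the algebraic Riccati equation

    A^\top P + P A - P B R^{-1} B^\top P + Q = 0.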
Deep learning (DL) and reinforcement learning (RL) methods appear to be indispensable for achieving human-level or super-human AI systems. At the same time, both DL and RL have strong connections with brain function and with neuroscientific findings. In this review, we summarize the talks and discussions in the "Deep Learning and Reinforcement Learning" session of the International Symposium on Artificial Intelligence and Brain Science.
Our brain can be viewed as a network of largely hierarchically organized neural circuits that individually control specific functions but, acting in parallel, enable complex and simultaneous behaviors. Indeed, many of our daily actions require concurrent information processing in sensorimotor, associative, and limbic circuits that are dynamically and hierarchically modulated by sensory information and previous learning. This organization of information processing in biological organisms has served as a major inspiration for artificial intelligence and has helped to create in silico systems capable of matching or even outperforming humans in several specific tasks, including visual recognition and strategy-based games.
Neural Netw, December 2021
This paper proposes a model-free imitation learning method, Entropy-Regularized Imitation Learning (ERIL), that minimizes the reverse Kullback-Leibler (KL) divergence. ERIL combines forward and inverse reinforcement learning (RL) under the framework of an entropy-regularized Markov decision process. The inverse RL step computes the log-ratio between two distributions by evaluating two binary discriminators.
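A hedged, generic sketch of the density-ratio trick the abstract alludes to (not ERIL's actual discriminator architecture): a binary classifier trained to separate samples from two distributions p and q recovers log p(x) - log q(x) through its logit, since the optimal discriminator under equal class priors satisfies D(x) = p(x) / (p(x) + q(x)).

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Toy samples from two distributions: p (e.g., expert data) and q (e.g., learner data).
    rng = np.random.default_rng(0)
    x_p = rng.normal(loc=1.0, scale=1.0, size=(1000, 1))
    x_q = rng.normal(loc=0.0, scale=1.0, size=(1000, 1))

    # Binary discriminator: label 1 for samples from p, label 0 for samples from q.
    X = np.vstack([x_p, x_q])
    y = np.concatenate([np.ones(len(x_p)), np.zeros(len(x_q))])
    clf = LogisticRegression().fit(X, y)

    def log_ratio(x):
        # logit(D(x)) = log D(x) - log(1 - D(x)), which approximates log p(x) - log q(x)
        return clf.decision_function(x)

    print(log_ratio(np.array([[0.5]])))  # for these Gaussians the true log-ratio at 0.5 is 0.0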
Modular reinforcement learning decomposes a monolithic task into several sub-tasks with their own sub-goals and learns each one in parallel to solve the original problem. Similar learning patterns can be traced in animal brains. Recent evidence in neuroscience shows that animals use separate systems for processing rewards and punishments, suggesting a different perspective for modularizing reinforcement learning tasks.
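As a hedged toy illustration of reward/punishment modularization (not the specific architecture studied in the paper), one can maintain two Q-tables, each updated from only one component of the reward signal, and act greedily on their sum:

    import numpy as np

    n_states, n_actions = 10, 4
    alpha, gamma, eps = 0.1, 0.95, 0.1
    Q_reward = np.zeros((n_states, n_actions))   # module driven by positive rewards
    Q_punish = np.zeros((n_states, n_actions))   # module driven by negative rewards (punishments)

    def select_action(s, rng):
        # Epsilon-greedy on the combined value of both modules.
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q_reward[s] + Q_punish[s]))

    def update(s, a, r, s_next):
        # Each module receives only its own component of the reward signal.
        r_pos, r_neg = max(r, 0.0), min(r, 0.0)
        for Q, r_part in ((Q_reward, r_pos), (Q_punish, r_neg)):
            td_target = r_part + gamma * Q[s_next].max()
            Q[s, a] += alpha * (td_target - Q[s, a])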
Front Neurorobot, December 2019
A deep Q-network (DQN) (Mnih et al., 2013), a typical deep reinforcement learning method, is an extension of Q-learning. In DQN, a Q function that expresses the action values under all states is approximated by a convolutional neural network.
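A minimal PyTorch sketch of such a Q-network, following the convolutional architecture described in Mnih et al. (2013): a stack of four 84x84 grayscale frames goes through two convolutional layers and a hidden fully connected layer, and the output layer produces one action value per discrete action.

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        """Convolutional Q-function approximator in the style of Mnih et al. (2013)."""
        def __init__(self, num_actions, in_channels=4):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
                nn.Flatten(),
                nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
                nn.Linear(256, num_actions),  # one Q-value per action
            )

        def forward(self, frames):
            # frames: (batch, 4, 84, 84) stack of grayscale frames, already scaled to [0, 1]
            return self.net(frames)

    q = QNetwork(num_actions=6)
    q_values = q(torch.zeros(1, 4, 84, 84))  # -> tensor of shape (1, 6)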
Front Neurorobot, September 2018
This paper proposes Cooperative and competitive Reinforcement And Imitation Learning (CRAIL), which selects an appropriate policy from a set of heterogeneous modules and trains all of them in parallel. Each learning module has its own network architecture and improves its policy using an off-policy reinforcement learning algorithm and behavior cloning from samples collected by a behavior policy constructed as a combination of all the module policies. Since the mixing weights are determined by each module's performance, a better policy is automatically selected according to the learning progress.
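A hedged sketch of performance-weighted policy mixing (the specific weighting rule in CRAIL may differ): each module's probability of being selected by the behavior policy grows with its recent average return.

    import numpy as np

    def mixing_weights(recent_returns, temperature=1.0):
        """Softmax over each module's recent average return (hypothetical weighting rule)."""
        r = np.asarray(recent_returns, dtype=float)
        z = (r - r.max()) / temperature      # subtract the max for numerical stability
        w = np.exp(z)
        return w / w.sum()

    def behavior_action(state, modules, recent_returns, rng):
        # Pick a module in proportion to its performance, then act with its policy.
        w = mixing_weights(recent_returns)
        k = rng.choice(len(modules), p=w)
        return modules[k](state), k

    # Example: three dummy policies with different recent performance.
    rng = np.random.default_rng(0)
    modules = [lambda s: 0, lambda s: 1, lambda s: 2]
    print(mixing_weights([1.0, 5.0, 2.0]))   # the best-performing module dominates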
In recent years, neural networks have enjoyed a renaissance as function approximators in reinforcement learning. Two decades after Tesauro's TD-Gammon achieved near top-level human performance in backgammon, the deep reinforcement learning algorithm DQN achieved human-level performance in many Atari 2600 games. The purpose of this study is twofold.
EM-based policy search methods estimate a lower bound of the expected return from the histories of episodes and iteratively update the policy parameters by maximizing that lower bound, which makes gradient calculation and learning-rate tuning unnecessary. Previous algorithms such as Policy learning by Weighting Exploration with the Returns, Fitness Expectation Maximization, and EM-based Policy Hyperparameter Exploration implemented mechanisms to discard useless low-return episodes, either implicitly or using a fixed baseline determined by the experimenter. In this paper, we propose an adaptive baseline method for discarding worse samples from the reward history and examine different baselines, including the mean and multiples of standard deviations from the mean.
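A hedged numpy sketch of the adaptive-baseline idea (the exact update rule in the paper may differ): episodes whose return falls below a baseline computed from the recent return history, e.g. the mean plus a multiple of the standard deviation, are discarded before the EM-style reward-weighted parameter update.

    import numpy as np

    def reward_weighted_update(thetas, returns, k=0.0):
        """EM-style update using only episodes above an adaptive baseline.

        thetas:  (n_episodes, n_params) policy parameters used in each episode
        returns: (n_episodes,) episodic returns
        k:       baseline = mean + k * std of the returns (k = 0 gives the mean)
        """
        returns = np.asarray(returns, dtype=float)
        baseline = returns.mean() + k * returns.std()
        keep = returns > baseline                  # discard low-return episodes
        if not np.any(keep):
            return thetas.mean(axis=0)             # fall back if nothing survives
        w = returns[keep] - baseline               # weight by return above the baseline
        return np.average(thetas[keep], axis=0, weights=w)

    # Example with random data.
    rng = np.random.default_rng(0)
    thetas = rng.normal(size=(50, 3))
    returns = rng.normal(size=50)
    print(reward_weighted_update(thetas, returns, k=1.0))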
Free-energy based reinforcement learning (FERL) was proposed for learning in high-dimensional state and action spaces. However, FERL only works well with binary, or close to binary, state inputs, where the number of active states is smaller than the number of inactive states. In FERL, the value function is approximated by the negative free energy of a restricted Boltzmann machine (RBM).
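A hedged sketch of the value approximation used in FERL: for an RBM with binary hidden units, the free energy of a visible configuration encoding the state and action has a standard closed form, and its negative serves as the action-value estimate.

    import numpy as np

    def negative_free_energy(x, W, b_visible, b_hidden):
        """-F(x) of a binary RBM; in FERL this plays the role of the value estimate.

        x:         (n_visible,) binary vector encoding state and action
        W:         (n_visible, n_hidden) weight matrix
        b_visible: (n_visible,) visible biases
        b_hidden:  (n_hidden,) hidden biases
        """
        pre_activation = b_hidden + x @ W
        # -F(x) = b_visible . x + sum_j softplus(b_hidden_j + (W^T x)_j)
        return x @ b_visible + np.sum(np.logaddexp(0.0, pre_activation))

    # Example with a tiny random RBM.
    rng = np.random.default_rng(0)
    n_visible, n_hidden = 8, 4
    W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
    x = rng.integers(0, 2, size=n_visible).astype(float)
    print(negative_free_energy(x, W, np.zeros(n_visible), np.zeros(n_hidden)))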
The linearly solvable Markov decision process (LMDP) is a class of optimal control problems in which the Bellman equation can be converted into a linear equation by an exponential transformation of the state value function (Todorov, 2009b). In an LMDP, the optimal value function and the corresponding control policy are obtained by solving an eigenvalue problem in a discrete state space, or an eigenfunction problem in a continuous state space, using knowledge of the system dynamics and the action, state, and terminal cost functions. In this study, we evaluate the effectiveness of the LMDP framework in real robot control, in which the dynamics of the body and the environment have to be learned from experience.
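For reference, a hedged sketch of the exponential transformation in the discrete case: with state cost q(s), passive dynamics p(s' | s), state value function v(s), and desirability function z(s) = \exp(-v(s)), the Bellman equation becomes linear in z,

    z(s) = \exp(-q(s)) \sum_{s'} p(s' \mid s)\, z(s'),

or, in matrix form, z = \operatorname{diag}(e^{-q})\, P z, which is solved as a linear or eigenvalue problem in a discrete state space and as an eigenfunction problem in a continuous state space.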
Free-energy based reinforcement learning (FERL) was proposed for learning in high-dimensional state and action spaces, which cannot be handled by standard function-approximation methods. In this study, we propose a scaled version of FERL to achieve more robust and more efficient learning. The action-value function is approximated by the negative free energy of a restricted Boltzmann machine, divided by a constant scaling factor related to the size of the Boltzmann machine (the square root of the number of state nodes in this study).
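In formula form (as stated above, with F the free energy of the restricted Boltzmann machine and N_s the number of state nodes):

    \hat{Q}(s, a) = -\frac{F(s, a)}{\sqrt{N_s}}.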
Most conventional policy gradient reinforcement learning (PGRL) algorithms neglect (or do not explicitly make use of) a term in the average-reward gradient with respect to the policy parameter. That term involves the derivative of the stationary state distribution, which corresponds to the sensitivity of that distribution to changes in the policy parameter. Although the bias introduced by this omission can be reduced by setting the forgetting rate gamma for the value functions close to 1, these algorithms do not permit gamma to be set exactly to 1.
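A hedged way to write the term in question: with stationary state distribution d^{\pi_\theta}(s) and average reward \eta(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, r(s, a), the product rule gives

    \nabla_\theta \eta = \sum_s d^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, r(s, a)
                       + \sum_s \nabla_\theta d^{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, r(s, a),

where the second term, involving the derivative of the stationary state distribution, is the one that conventional PGRL algorithms neglect or use only implicitly.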
Neural Netw, December 2008
Understanding the design principles of reward functions is a substantial challenge in both artificial intelligence and neuroscience. Successful acquisition of a task usually requires not only rewards for goals but also rewards for intermediate states that promote effective exploration. This paper proposes a method for designing the 'intrinsic' rewards of autonomous agents by combining constrained policy gradient reinforcement learning and embodied evolution.