We present a trajectory optimization approach to reinforcement learning in continuous state and action spaces, called probabilistic differential dynamic programming (PDDP). Our method represents the system dynamics using Gaussian processes (GPs) and iteratively performs local dynamic programming around a nominal trajectory in Gaussian belief space. Unlike model-based policy search methods, PDDP does not require a policy parameterization and learns a time-varying control policy via successive forward-backward sweeps. We give a convergence analysis of the iterative scheme, showing that the algorithm converges globally to a stationary point under certain conditions. We show that prior model knowledge can be incorporated into the proposed framework to speed up learning, and that a generalized optimization criterion based on the predicted cost distribution can be employed to enable risk-sensitive learning. We demonstrate the effectiveness and efficiency of the proposed algorithm on nontrivial tasks. Compared with a state-of-the-art GP-based policy search method, PDDP offers a superior combination of learning speed, data efficiency, and applicability.
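The forward-backward sweep structure mentioned above can be illustrated in its simplest setting. The sketch below is not PDDP: it assumes a known scalar linear system x[t+1] = a*x[t] + b*u[t] with quadratic cost, for which a single backward sweep (the Riccati recursion) yields the time-varying feedback gains exactly. There is no GP model and no belief-space propagation here, and all function names and parameter values are hypothetical, chosen only for illustration.

```python
# Minimal sketch (assumed setting, not the paper's algorithm): finite-horizon
# LQR on a known scalar linear system, showing how a backward sweep produces
# a time-varying policy without any policy parameterization.

def backward_sweep(a, b, q, r, q_final, horizon):
    """Return gains K[0..horizon-1] for the control law u[t] = -K[t] * x[t]."""
    p = q_final  # curvature of the value function at the final step
    gains = [0.0] * horizon
    for t in reversed(range(horizon)):
        k = (a * b * p) / (r + b * b * p)   # optimal gain at step t
        gains[t] = k
        p = q + a * a * p - a * b * p * k   # Riccati recursion for step t
    return gains


def forward_sweep(a, b, gains, x0):
    """Roll the system forward under the time-varying feedback law."""
    xs, us = [x0], []
    for k in gains:
        u = -k * xs[-1]
        us.append(u)
        xs.append(a * xs[-1] + b * u)
    return xs, us


def total_cost(xs, us, q, r, q_final):
    """Sum of running costs q*x^2 + r*u^2 plus the terminal cost q_final*x_T^2."""
    running = sum(q * x * x + r * u * u for x, u in zip(xs, us))
    return running + q_final * xs[-1] ** 2


# Hypothetical example values: an unstable open-loop system (a > 1).
A, B, Q, R, Q_FINAL, HORIZON, X0 = 1.2, 1.0, 1.0, 0.1, 10.0, 20, 1.0

gains = backward_sweep(A, B, Q, R, Q_FINAL, HORIZON)
xs, us = forward_sweep(A, B, gains, X0)
controlled_cost = total_cost(xs, us, Q, R, Q_FINAL)

# Baseline for comparison: the same rollout with zero control.
xs_free, us_free = forward_sweep(A, B, [0.0] * HORIZON, X0)
uncontrolled_cost = total_cost(xs_free, us_free, Q, R, Q_FINAL)
```

For nonlinear or learned dynamics, differential dynamic programming repeats this pair of sweeps around the current nominal trajectory using local approximations; per the abstract, PDDP carries this out with GP-based dynamics in Gaussian belief space.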

Source
http://dx.doi.org/10.1109/TNNLS.2017.2764499
