The stochastic dynamics of reinforcement learning is studied using a master equation formalism. We consider two different problems: Q-learning for a two-agent game, and the multiarmed bandit problem with policy gradient as the learning method. The master equation is constructed by introducing a probability distribution over continuous policy parameters, or over both continuous policy parameters and discrete state variables (a more advanced case). We use a version of the moment closure approximation to solve for the stochastic dynamics of the models. Our method gives accurate estimates for the mean and the (co)variance of policy variables. For the case of the two-agent game, we find that the variance terms are finite at steady state and derive a system of algebraic equations for computing them directly.
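To make the setting concrete, here is a minimal Monte Carlo sketch of the kind of stochastic learning dynamics the abstract refers to: a policy-gradient (REINFORCE-style) learner on a hypothetical two-armed bandit, with the mean and variance of the policy parameter estimated over an ensemble of independent trajectories. The reward probabilities, learning rate, and softmax parametrization are illustrative assumptions, not taken from the paper; the paper's master-equation and moment-closure machinery would predict these same ensemble moments analytically rather than by simulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-armed bandit: arm 0 pays 1 with prob 0.6, arm 1 with prob 0.4.
# These numbers are illustrative, not from the paper.
p_reward = np.array([0.6, 0.4])
alpha = 0.1      # learning rate (assumed)
n_steps = 500
n_runs = 2000    # independent learning trajectories (the "ensemble")

# Softmax (sigmoid) policy with a single preference parameter theta per run:
# P(choose arm 0) = 1 / (1 + exp(-theta)).
theta = np.zeros(n_runs)

for _ in range(n_steps):
    p0 = 1.0 / (1.0 + np.exp(-theta))              # P(arm 0) for each run
    arm = (rng.random(n_runs) >= p0).astype(int)   # sample 0 or 1
    reward = (rng.random(n_runs) < p_reward[arm]).astype(float)
    # REINFORCE update: d/dtheta log pi(a) is (1 - p0) for arm 0 and -p0 for arm 1.
    grad = np.where(arm == 0, 1.0 - p0, -p0)
    theta += alpha * reward * grad

# Ensemble moments of the policy parameter -- the quantities the paper's
# moment closure approximation targets directly.
print("mean(theta) ~", theta.mean())
print("var(theta)  ~", theta.var())
```

Since arm 0 is rewarded more often, the ensemble mean of theta drifts positive, while the run-to-run variance stays finite; the paper's contribution is to compute such moments from the master equation without simulating the ensemble.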
DOI: http://dx.doi.org/10.1103/PhysRevE.107.034112