In reinforcement learning (RL), function approximation errors are known to lead to Q-value overestimations, which can greatly degrade policy performance. This article presents the distributional soft actor-critic (DSAC) algorithm, an off-policy RL method for continuous control settings that improves policy performance by mitigating Q-value overestimations. We first show theoretically that learning a distribution function of state-action returns can effectively mitigate Q-value overestimations because it adaptively adjusts the update step size of the Q-value function. Then, a distributional soft policy iteration (DSPI) framework is developed by embedding the return distribution function into maximum entropy RL. Finally, we present a deep off-policy actor-critic variant of DSPI, called DSAC, which directly learns a continuous return distribution while keeping the variance of the state-action returns within a reasonable range to address exploding and vanishing gradient problems. We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.
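The variance-bounding idea from the abstract can be illustrated with a minimal sketch (not the authors' implementation): a Gaussian return-distribution critic whose log standard deviation is clamped to an assumed range and which is trained by negative log-likelihood of a soft target return. The use of PyTorch, the network layout, the bound values, and all names here are assumptions made purely for illustration.

```python
# Hedged sketch of a return-distribution critic in the spirit of DSAC.
# Everything below (names, bounds, architecture) is an illustrative assumption.
import torch
import torch.nn as nn

LOG_STD_MIN, LOG_STD_MAX = -4.0, 2.0  # assumed bounds keeping the return variance in a reasonable range


class GaussianReturnCritic(nn.Module):
    """Maps (state, action) to the mean and std of a Gaussian return distribution."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # outputs: [mean, log_std]
        )

    def forward(self, state, action):
        mean, log_std = self.net(torch.cat([state, action], dim=-1)).chunk(2, dim=-1)
        log_std = torch.clamp(log_std, LOG_STD_MIN, LOG_STD_MAX)  # bounded variance
        return mean, log_std.exp()


def critic_loss(critic, state, action, target_return):
    """Negative log-likelihood of the (soft) target return under the predicted
    Gaussian; the learned std scales the effective update step adaptively.
    target_return is expected to have shape (batch, 1), matching the mean."""
    mean, std = critic(state, action)
    dist = torch.distributions.Normal(mean, std)
    return -dist.log_prob(target_return).mean()
```

Because the log-likelihood is scaled by the predicted variance, samples the critic is uncertain about produce smaller effective updates, which is one intuition for why a return distribution can temper Q-value overestimation.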


Source
http://dx.doi.org/10.1109/TNNLS.2021.3082568

Publication Analysis

Top Keywords

Keyword                      Count
distributional soft          12
Q-value overestimations      12
soft actor-critic            8
reinforcement learning       8
policy performance           8
continuous control           8
distribution function        8
state-action returns         8
return distribution          8
actor-critic off-policy      4
