The exploration-exploitation dilemma persists in off-policy reinforcement learning (RL) algorithms, limiting both policy performance and sample efficiency. To tackle this challenge, a novel historical decision-making regularized maximum entropy (HDMRME) RL algorithm is developed to strike a balance between exploration and exploitation. Built upon the maximum entropy RL framework, a historical decision-making regularization method is proposed to enhance the exploitation capability of RL policies. The theoretical analysis proves the convergence of HDMRME, investigates its exploration-exploitation tradeoff, examines the disparity between the Q-function learned through HDMRME and the classical one, and analyzes the suboptimality of the trained policy. The performance of HDMRME is evaluated across various continuous-action control tasks from the MuJoCo and OpenAI Gym platforms. Comparative experiments demonstrate that HDMRME achieves superior sample efficiency and more competitive performance than other state-of-the-art RL algorithms.
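Since the abstract only outlines the method, the sketch below is a minimal, hypothetical illustration of a maximum-entropy (SAC-style) policy objective augmented with a historical decision-making regularizer. The regularizer form (a log-likelihood term on previously executed actions), the names `GaussianPolicy`, `hdmrme_policy_loss`, and the weights `alpha` and `beta` are assumptions for illustration only, not the formulation given in the paper.

```python
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Toy Gaussian policy used only to make the sketch runnable."""

    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Linear(state_dim, 2 * action_dim)

    def forward(self, states):
        mean, log_std = self.net(states).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())


def hdmrme_policy_loss(policy, q_fn, states, hist_actions, alpha=0.2, beta=0.1):
    """Maximum-entropy policy loss plus a hypothetical historical
    decision-making regularizer; the exact regularizer used by HDMRME
    is defined in the paper and may differ from this sketch."""
    dist = policy(states)
    actions = dist.rsample()                      # reparameterized sample
    log_probs = dist.log_prob(actions).sum(-1)

    # SAC-style maximum-entropy term: minimize alpha*log_pi(a|s) - Q(s, a)
    q_values = q_fn(states, actions).squeeze(-1)
    max_ent_loss = (alpha * log_probs - q_values).mean()

    # Assumed regularizer: keep probability mass on historical decisions,
    # encouraging exploitation of actions the agent actually executed.
    hist_reg = -dist.log_prob(hist_actions).sum(-1).mean()

    return max_ent_loss + beta * hist_reg


# Example usage on random data (illustrative only).
policy = GaussianPolicy(state_dim=3, action_dim=2)
critic = nn.Linear(3 + 2, 1)                      # toy critic over [s, a]
q_fn = lambda s, a: critic(torch.cat([s, a], dim=-1))
states = torch.randn(8, 3)
hist_actions = torch.randn(8, 2)
loss = hdmrme_policy_loss(policy, q_fn, states, hist_actions)
loss.backward()
```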
DOI: http://dx.doi.org/10.1109/TNNLS.2024.3481887