Human-in-the-loop for reinforcement learning (RL) is usually employed to overcome the challenge of sample inefficiency, in which the human expert provides advice for the agent when necessary. The current human-in-the-loop RL (HRL) results mainly focus on discrete action space. In this article, we propose a Q value-dependent policy (QDP)-based HRL (QDP-HRL) algorithm for continuous action space. Considering the cognitive costs of human monitoring, the human expert only selectively gives advice in the early stage of agent learning, where the agent implements human-advised action instead. The QDP framework is adapted to the twin delayed deep deterministic policy gradient algorithm (TD3) in this article for the convenience of comparison with the state-of-the-art TD3. Specifically, the human expert in the QDP-HRL considers giving advice in the case that the difference between the twin Q -networks' output exceeds the maximum difference in the current queue. Moreover, to guide the update of the critic network, the advantage loss function is developed using expert experience and agent policy, which provides the learning direction for the QDP-HRL algorithm to some extent. To verify the effectiveness of QDP-HRL, the experiments are conducted on several continuous action space tasks in the OpenAI gym environment, and the results demonstrate that QDP-HRL greatly improves learning speed and performance.
Download full-text PDF |
Source |
---|---|
http://dx.doi.org/10.1109/TNNLS.2023.3289315 | DOI Listing |
Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!