First return, then explore.

Nature

Uber AI Labs, San Francisco, CA, USA.

Published: February 2021

Article Abstract

Reinforcement learning promises to solve complex sequential-decision problems autonomously by specifying a high-level reward function only. However, reinforcement learning algorithms struggle when, as is often the case, simple and intuitive rewards provide sparse and deceptive feedback. Avoiding these pitfalls requires a thorough exploration of the environment, but creating algorithms that can do so remains one of the central challenges of the field. Here we hypothesize that the main impediment to effective exploration originates from algorithms forgetting how to reach previously visited states (detachment) and failing to first return to a state before exploring from it (derailment). We introduce Go-Explore, a family of algorithms that addresses these two challenges directly through the simple principles of explicitly 'remembering' promising states and returning to such states before intentionally exploring. Go-Explore solves all previously unsolved Atari games and surpasses the state of the art on all hard-exploration games, with orders-of-magnitude improvements on the grand challenges of Montezuma's Revenge and Pitfall. We also demonstrate the practical potential of Go-Explore on a sparse-reward pick-and-place robotics task. Additionally, we show that adding a goal-conditioned policy can further improve Go-Explore's exploration efficiency and enable it to handle stochasticity throughout training. The substantial performance gains from Go-Explore suggest that the simple principles of remembering states, returning to them, and exploring from them are a powerful and general approach to exploration, an insight that may prove critical to the creation of truly intelligent learning agents.
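
The remember, return, explore loop described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the corridor environment `ChainEnv`, the function `go_explore`, the visit-count selection weights, and the choice to "return" by restoring a saved environment state are all assumptions made for illustration.

```python
import random


class ChainEnv:
    """Hypothetical toy task: a 1-D corridor with a single reward at the far end."""

    def __init__(self, length=20):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def get_state(self):
        return self.pos

    def restore(self, state):
        # Assumption: the environment can be reset to any saved state.
        self.pos = state

    def step(self, action):
        # action is -1 (left) or +1 (right); positions are clamped to the corridor.
        self.pos = max(0, min(self.length, self.pos + action))
        done = self.pos == self.length
        reward = 1.0 if done else 0.0
        return self.pos, reward, done


def go_explore(env, iterations=2000, explore_steps=10, seed=0):
    rng = random.Random(seed)
    start = env.reset()
    # Remember: archive maps a cell (here, the position) to how it was reached.
    archive = {start: {"state": env.get_state(), "traj": [], "score": 0.0,
                       "visits": 0, "done": False}}
    for _ in range(iterations):
        # Select a non-terminal cell, favouring rarely visited ones (illustrative heuristic).
        cells = [c for c in archive if not archive[c]["done"]]
        weights = [1.0 / (1 + archive[c]["visits"]) ** 0.5 for c in cells]
        entry = archive[rng.choices(cells, weights=weights, k=1)[0]]
        entry["visits"] += 1

        # Return: go back to the remembered state before exploring.
        env.restore(entry["state"])
        traj, score = list(entry["traj"]), entry["score"]

        # Explore: take random actions and record new or improved cells.
        for _ in range(explore_steps):
            action = rng.choice((-1, 1))
            cell, reward, done = env.step(action)
            traj.append(action)
            score += reward
            known = archive.get(cell)
            better = known is None or score > known["score"] or (
                score == known["score"] and len(traj) < len(known["traj"]))
            if better:
                archive[cell] = {"state": env.get_state(), "traj": list(traj),
                                 "score": score, "done": done,
                                 "visits": 0 if known is None else known["visits"]}
            if done:
                break
    return archive
```

Because every archived cell stores the action sequence that reached it, the exploring agent never loses access to a state it has already found, and each exploration burst starts from a previously reached state rather than from scratch. Calling `go_explore(ChainEnv())` returns the archive; the entry for the final position holds a trajectory that can be replayed from `reset()`.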

Source
http://dx.doi.org/10.1038/s41586-020-03157-9
