First return, then explore.

Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, Jeff Clune

Nature

Uber AI Labs, San Francisco, CA, USA.

Published: February 2021

Reinforcement learning promises to solve complex sequential-decision problems autonomously by specifying a high-level reward function only. However, reinforcement learning algorithms struggle when, as is often the case, simple and intuitive rewards provide sparse and deceptive feedback. Avoiding these pitfalls requires a thorough exploration of the environment, but creating algorithms that can do so remains one of the central challenges of the field. Here we hypothesize that the main impediment to effective exploration originates from algorithms forgetting how to reach previously visited states (detachment) and failing to first return to a state before exploring from it (derailment). We introduce Go-Explore, a family of algorithms that addresses these two challenges directly through the simple principles of explicitly 'remembering' promising states and returning to such states before intentionally exploring. Go-Explore solves all previously unsolved Atari games and surpasses the state of the art on all hard-exploration games, with orders-of-magnitude improvements on the grand challenges of Montezuma's Revenge and Pitfall. We also demonstrate the practical potential of Go-Explore on a sparse-reward pick-and-place robotics task. Additionally, we show that adding a goal-conditioned policy can further improve Go-Explore's exploration efficiency and enable it to handle stochasticity throughout training. The substantial performance gains from Go-Explore suggest that the simple principles of remembering states, returning to them, and exploring from them are a powerful and general approach to exploration, an insight that may prove critical to the creation of truly intelligent learning agents.
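The three principles named in the abstract (remember states, return to them, explore from them) can be illustrated with a minimal sketch. This is not the authors' implementation: the 1-D chain environment, the uniform choice among remembered cells, the use of the raw position as the cell key, and the names `go_explore` and `step` are all assumptions made for illustration. Returning is done here by replaying stored actions, which only works because the toy environment is deterministic.

```python
import random


def step(pos, action, n_states):
    """Deterministic toy dynamics: move left or right on a bounded chain."""
    return min(max(pos + action, 0), n_states - 1)


def go_explore(n_states=30, iterations=200, explore_steps=5, seed=0):
    """Archive-based exploration on a hypothetical 1-D chain environment."""
    rng = random.Random(seed)
    # Remember: map each discovered cell (here, the position itself) to the
    # shortest action sequence found so far that reaches it from the start.
    archive = {0: []}
    for _ in range(iterations):
        # Select a remembered cell (uniformly at random; an assumption).
        cell = rng.choice(sorted(archive))
        # Return: replay the stored actions from the start state.
        pos = 0
        for action in archive[cell]:
            pos = step(pos, action, n_states)
        trajectory = list(archive[cell])
        # Explore: take random actions from the restored state.
        for _ in range(explore_steps):
            action = rng.choice((-1, 1))
            pos = step(pos, action, n_states)
            trajectory.append(action)
            # Keep new cells, and shorter routes to already known cells.
            if pos not in archive or len(trajectory) < len(archive[pos]):
                archive[pos] = list(trajectory)
    return archive


if __name__ == "__main__":
    found = go_explore()
    print(f"cells discovered: {len(found)}")
```

Because the archive stores a route to every discovered cell, exploration can always resume from the frontier instead of restarting from scratch, which is the intuition behind avoiding detachment and derailment in this toy setting.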

DOI: http://dx.doi.org/10.1038/s41586-020-03157-9
