In this work we create agents that can perform well beyond a single, individual task, that exhibit much wider generalisation of behaviour to a massive, rich space of challenges. We define a universe of tasks within an environment domain and demonstrate the ability to train agents that are generally capable across this vast space and beyond. The environment is natively multi-agent, spanning the continuum of competitive, cooperative, and independent games, which are situated within procedurally generated physical 3D worlds. The resulting space is exceptionally diverse in terms of the challenges posed to agents, and as such, even measuring the learning progress of an agent is an open research problem. We propose an iterative notion of improvement between successive generations of agents, rather than seeking to maximise a singular objective, allowing us to quantify progress despite tasks being incomparable in terms of achievable rewards. We show that through constructing an open-ended learning process, which dynamically changes the training task distributions and training objectives such that the agent never stops learning, we achieve consistent learning of new behaviours. The resulting agent is able to score reward in every one of our humanly solvable evaluation levels, with behaviour generalising to many held-out points in the universe of tasks. Examples of this zero-shot generalisation include good performance on Hide and Seek, Capture the Flag, and Tag. Through analysis and hand-authored probe tasks we characterise the behaviour of our agent, and find interesting emergent heuristic behaviours such as trial-and-error experimentation, simple tool use, option switching, and cooperation. Finally, we demonstrate that the general capabilities of this agent could unlock larger scale transfer of behaviour through cheap finetuning.
Notes:

[Final calculation]
(8 TPUs) * (1.23e14 FLOP/TPU-s) * (0.1 utilization) / (50k steps/s) = 1.968e9 FLOP/step
(32 agents) * (383B steps/agent) * (1.968e9 FLOP/step) = 2.412e22 FLOP

==========================
NOTES BELOW

Section 6.1: Each agent is trained using 8 TPUv3s and consumes approximately 50,000 agent steps (observations) per second. Multiple agents interacting probably means a fairly low utilization rate, so let's assume 0.10.
8 * 1.23e14 * 0.1 / 50k = 1.968e9 FLOP per step

The paper doesn't say exactly how many agents they train in each population. The original PBT paper uses 32 agents for one task (in general it uses between 10 and 80), so as a guesstimate let's go with that.

Figure 16: They train over 5 generations. Summing the number of steps, it looks like there were roughly 383B steps per agent.
32 * 383B * 1.968e9 = 2.412e22

Final estimate: 2.412e22 FLOP

I do a confidence interval analysis here and find a 90% CI of 6.9e21 to 1.3e23, so we can call this estimate "likely" (within 1 OOM): https://colab.research.google.com/drive/1wGSTQxBExY6Fa0-d7msVumf5-KnsWLe6?usp=sharing
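A minimal sketch reproducing the arithmetic above. The 10% utilization, the 32-agent population (borrowed from the PBT paper), and the ~383B steps per agent read off Figure 16 are assumptions of this estimate, not figures stated directly in the paper.

```python
# Back-of-the-envelope reproduction of the training-compute estimate above.
# Assumptions: 10% TPU utilization, a PBT population of 32 agents,
# and ~383B agent steps per agent over 5 generations (Figure 16).

TPU_V3_PEAK_FLOPS = 1.23e14   # peak FLOP/s for one TPUv3 chip
TPUS_PER_AGENT = 8            # "each agent is trained using 8 TPUv3s" (Section 6.1)
UTILIZATION = 0.10            # assumed, given multi-agent interaction overhead
STEPS_PER_SECOND = 50_000     # agent steps (observations) consumed per second (Section 6.1)

POPULATION_SIZE = 32          # guesstimate, borrowed from the original PBT paper
STEPS_PER_AGENT = 383e9       # rough sum over the 5 generations in Figure 16

# FLOP per agent step = (sustained FLOP/s) / (steps/s)
flop_per_step = TPUS_PER_AGENT * TPU_V3_PEAK_FLOPS * UTILIZATION / STEPS_PER_SECOND
total_flop = POPULATION_SIZE * STEPS_PER_AGENT * flop_per_step

print(f"FLOP per agent step: {flop_per_step:.3e}")   # ~1.968e9
print(f"Total training FLOP: {total_flop:.3e}")      # ~2.41e22
```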
Size Notes: Figure 16 shows the number of steps per generation for each agent. In total there are 1.5e10 + 4.0e10 + 2.5e10 + 1.1e11 + 2e11 = 3.9e11 steps per agent.
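A quick check of that sum, using the approximate per-generation step counts read off Figure 16; note it comes to ~3.9e11 steps per agent, slightly above the ~383B figure used in the compute estimate above, which does not materially change the final estimate.

```python
# Approximate agent steps per generation, read off Figure 16.
steps_per_generation = [1.5e10, 4.0e10, 2.5e10, 1.1e11, 2.0e11]
steps_per_agent = sum(steps_per_generation)
print(f"Total steps per agent: {steps_per_agent:.2e}")  # ~3.9e11
```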
Notes: estimate described here: https://docs.google.com/document/d/1S9xZyCeITDOs-P1W_-liNW0WgVN-OLsSudVrPXMaLqw/edit?usp=sharing