Paper, Tags: #reinforcement-learning
In chess and Go, tree-based planning methods have been used where a perfect simulator is available. But in real-world problems, the dynamics governing the environment are often complex and unknown.
MuZero: an algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics.
Applied iteratively, the model that MuZero learns predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function. MuZero is evaluated on 57 different Atari games, a domain where model-based planning approaches have historically struggled.
Model-free RL: estimate the optimal policy and/or value function directly from interactions with the environment. However, these methods fall short in domains that require precise and sophisticated lookahead, such as chess and Go.
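As a contrast with the planning-based approach, here is a minimal sketch of model-free value estimation: tabular Q-learning against a hypothetical environment object with `reset()`, `step(action)`, and an `actions` list. None of this is from the paper; it only illustrates learning directly from interaction, with no model of the dynamics.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Model-free control: learn Q(state, action) directly from observed
    transitions, without ever modelling the environment's dynamics."""
    Q = defaultdict(float)  # maps (state, action) -> value estimate

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # one-step TD update toward the bootstrapped target
            best_next = max(Q[(next_state, a)] for a in env.actions)
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])

            state = next_state
    return Q
```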
In this paper, MuZero, a model-based RL method, achieves state-of-the-art performance on Atari 2600. MuZero builds upon AlphaZero's powerful search and search-based policy iteration algorithms, but incorporates a learned model into the training procedure. It also extends AlphaZero to a broader set of environments, including single-agent domains and non-zero rewards at intermediate time-steps.
The main idea of the algorithm: predict those aspects of the future that are directly relevant for planning. The model receives the observation as input and transforms it into a hidden state, which is then updated iteratively by a recurrent process that takes as input the previous hidden state and a hypothetical next action. At every step, the model predicts the policy (the move to play), the value function (the predicted winner), and the immediate reward (the points scored by playing a move).
There's no direct constraint or requirement for the hidden state to capture all information necessary to reconstruct the original observation. Instead, the hidden states are free to represent state in whatever way is relevant to predicting current and future values and policies.
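A minimal sketch of how these pieces could fit together, using the paper's naming for the representation (h), dynamics (g), and prediction (f) functions, but with hypothetical toy dense networks (the actual architecture uses residual convolutional networks and more elaborate heads):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMuZeroModel(nn.Module):
    """Toy stand-in for MuZero's three learned functions:
    h: representation  observation -> hidden state
    g: dynamics        (hidden state, action) -> (next hidden state, reward)
    f: prediction      hidden state -> (policy logits, value)
    Note there is no decoder: the hidden state is never asked to
    reconstruct the observation, only to support these predictions."""

    def __init__(self, obs_dim: int, num_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.num_actions = num_actions
        self.h = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.g = nn.Sequential(
            nn.Linear(hidden_dim + num_actions, hidden_dim), nn.ReLU())
        self.reward_head = nn.Linear(hidden_dim, 1)
        self.policy_head = nn.Linear(hidden_dim, num_actions)
        self.value_head = nn.Linear(hidden_dim, 1)

    def initial_inference(self, obs):
        """Encode a real observation once, then predict policy and value."""
        state = self.h(obs)
        return state, self.policy_head(state), self.value_head(state)

    def recurrent_inference(self, state, action):
        """Unroll the model one step for a hypothetical action (LongTensor index)."""
        a = F.one_hot(action, self.num_actions).float()
        next_state = self.g(torch.cat([state, a], dim=-1))
        reward = self.reward_head(next_state)
        return (next_state, reward,
                self.policy_head(next_state), self.value_head(next_state))
```

The real observation is encoded only once via `initial_inference`; everything after that unrolls `recurrent_inference` on hidden states, so no knowledge of the true environment dynamics is needed.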
The model is represented as a Markov decision process (MDP), consisting of two components:
- a state transition model that predicts the next state
- a reward model that predicts the expected reward received during that transition
Once such a model is constructed, it is straightforward to apply MDP planning algorithms such as Monte Carlo tree search (MCTS) to compute the optimal policy.
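MuZero itself plans with MCTS (as in AlphaZero). The sketch below deliberately uses something much simpler, a depth-limited exhaustive lookahead over the toy model from the previous snippet, just to make "planning entirely inside the learned model" concrete:

```python
import torch

def plan_with_model(model, obs, depth=3, gamma=0.997):
    """Pick an action by exhaustively unrolling the learned model `depth`
    steps ahead. Cost grows as num_actions**depth, so this only works for
    toy action spaces; MuZero uses MCTS instead. `obs` is a (1, obs_dim) tensor."""
    with torch.no_grad():
        root_state, _, _ = model.initial_inference(obs)

        def best_return(state, steps_left):
            # Bootstrap with the value head once the search horizon is reached.
            if steps_left == 0:
                return model.value_head(state).item()
            best = float("-inf")
            for a in range(model.num_actions):
                action = torch.tensor([a])
                next_state, reward, _, _ = model.recurrent_inference(state, action)
                ret = reward.item() + gamma * best_return(next_state, steps_left - 1)
                best = max(best, ret)
            return best

        scores = []
        for a in range(model.num_actions):
            action = torch.tensor([a])
            next_state, reward, _, _ = model.recurrent_inference(root_state, action)
            scores.append(reward.item() + gamma * best_return(next_state, depth - 1))
        return max(range(model.num_actions), key=lambda a: scores[a])
```

Example call, assuming the toy model above: `plan_with_model(ToyMuZeroModel(obs_dim=8, num_actions=4), torch.randn(1, 8))`.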
More specifics on the model are in Section 3 of the paper.
MuZero doesn't need any knowledge of the game rules or environment dynamics.