The recent breakthrough successes in machine learning are mainly attributed to scale: namely large-scale attention-based architectures and datasets of unprecedented scale. This paper investigates the impact of training at scale for chess. Unlike traditional chess engines that rely on complex heuristics, explicit search, or a combination of both, we train a 270M parameter transformer model with supervised learning on a dataset of 10 million chess games. We annotate each board in the dataset with action-values provided by the powerful Stockfish 16 engine, leading to roughly 15 billion data points. Our largest model reaches a Lichess blitz Elo of 2895 against humans, and successfully solves a series of challenging chess puzzles, without any domain-specific tweaks or explicit search algorithms. We also show that our model outperforms AlphaZero's policy and value networks (without MCTS) and GPT-3.5-turbo-instruct. A systematic investigation of model and dataset size shows that strong chess performance only arises at sufficient scale. To validate our results, we perform an extensive series of ablations of design choices and hyperparameters.
Notes: ≈1.04e22 FLOP (6ND approximation). "Board states 𝑠 are encoded as FEN strings which we convert to fixed-length strings of 77 characters where the ASCII-code of each character is one token." So each board is 77 tokens, plus 1 token for the action, giving 78 tokens per action-value data point. "For our largest training dataset, based on 10M games, this results in 15.32B action-value estimates", so tokens per epoch = 15.32e9 × 78 = 1.1950e12. The model is a dense transformer. "We train for 10 million steps, which corresponds to 2.67 epochs for a batch size of 4096 with 15.32B data points", but Appendix A.2 mentions 5.35 epochs. I use the higher of the two values (5.35): most likely the model was trained for 5.35 epochs and the 2.67-epoch checkpoint was used as the final model. The 6ND approximation for 5.35 epochs gives 6 × 270e6 × 1.1950e12 × 5.35 ≈ 1.04e22 FLOP.
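The estimate above can be reproduced with a short sketch using the standard 6ND approximation (6 FLOP per parameter per training token); all inputs are the figures quoted in the notes:

```python
# Compute estimate via the 6*N*D transformer-FLOP approximation.
# Figures are taken from the notes; 5.35 epochs is the higher of the
# two epoch counts mentioned in the paper (Appendix A.2).

N_PARAMS = 270e6                 # "roughly 270 million parameters"
TOKENS_PER_EXAMPLE = 77 + 1      # 77-token FEN board + 1 action token
N_EXAMPLES = 15.32e9             # action-value estimates in the dataset
EPOCHS = 5.35                    # from Appendix A.2

tokens_per_epoch = N_EXAMPLES * TOKENS_PER_EXAMPLE  # ~1.195e12 tokens
total_flop = 6 * N_PARAMS * tokens_per_epoch * EPOCHS

print(f"{total_flop:.3e}")       # ~1.036e+22 FLOP
```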
Size Notes: 15.32B examples × 78 tokens per example = 1.1950e12 tokens. Training is supervised. I count each action-value (board state, action, and Stockfish 16's numeric evaluation of the state) as one data point. "For our largest training dataset, based on 10M games, this results in 15.32B action-value estimates".
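As a cross-check of this accounting, the token total and the 2.67-epoch figure implied by the quoted training setup (10M steps at batch size 4096) can be verified directly:

```python
# Cross-check of the dataset-size accounting: total tokens per epoch,
# and the epoch count implied by 10M steps at batch size 4096.
N_EXAMPLES = 15.32e9          # action-value estimates ("data points")
TOKENS_PER_EXAMPLE = 78       # 77-token FEN string + 1 action token
STEPS = 10e6                  # "We train for 10 million steps"
BATCH_SIZE = 4096

tokens_per_epoch = N_EXAMPLES * TOKENS_PER_EXAMPLE
epochs = STEPS * BATCH_SIZE / N_EXAMPLES

print(f"{tokens_per_epoch:.4e}")  # ~1.1950e+12 tokens
print(f"{epochs:.2f}")            # ~2.67 epochs, matching the paper
```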
Notes: "Our largest model has roughly 270 million parameters."