{"id":17766518,"url":"https://github.com/pythonnut/alphazero-othello","last_synced_at":"2025-07-22T23:03:50.212Z","repository":{"id":82963611,"uuid":"320945687","full_name":"PythonNut/alphazero-othello","owner":"PythonNut","description":"An implementation of the AlphaZero algorithm for playing Othello (aka. Reversi)","archived":false,"fork":false,"pushed_at":"2020-12-21T06:16:29.000Z","size":10958,"stargazers_count":6,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-01T14:51:56.691Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"TeX","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PythonNut.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-12-12T23:33:23.000Z","updated_at":"2025-01-16T17:18:06.000Z","dependencies_parsed_at":null,"dependency_job_id":"399d7cbb-e38f-4de6-bc0a-ca40d60837ed","html_url":"https://github.com/PythonNut/alphazero-othello","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/PythonNut/alphazero-othello","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PythonNut%2Falphazero-othello","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PythonNut%2Falphazero-othello/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PythonNut%2Falphazero-othello/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PythonNut%2Falphazero-othello/manifests","owner_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PythonNut","download_url":"https://codeload.github.com/PythonNut/alphazero-othello/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PythonNut%2Falphazero-othello/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266586905,"owners_count":23952205,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-22T02:00:09.085Z","response_time":66,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-26T20:30:49.945Z","updated_at":"2025-07-22T23:03:50.173Z","avatar_url":"https://github.com/PythonNut.png","language":"TeX","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AlphaZero-Othello\n\nAn implementation of the AlphaZero algorithm that plays Othello (aka. Reversi)\n\n\u003cimg width=500 src=\"https://raw.githubusercontent.com/PythonNut/alphazero-othello/main/figures/az_iagno_first_win.png\"/\u003e\nFigure 1: The final board after AlphaZero-Othello beat Iagno (Medium difficulty) for the first time!\n\n## What is Othello?\n\nOthello is an abstract strategy board game for two players. 
\nIt is played on an 8×8 board with 64 pieces, called disks, which have different colors on each side (usually black and white).\nThe players alternate placing disks with their assigned color facing up.\nIf a contiguous line of a player's disks along one of the eight directions becomes surrounded on both ends by those of their opponent, all of the middle disks are flipped to the opponent's color.\nEvery move must produce at least one of these flips.\nA player must pass if and only if they have no valid moves, and the game ends when both players pass.\nThe winner is the player with the most disks of their color on the final board. \n\nOn the strength of the best known computer programs relative to humans, [Wikipedia](https://en.wikipedia.org/wiki/Reversi) has this to say:\n\n\u003e Good Othello computer programs play very strongly against human opponents. This is mostly due to difficulties in human look-ahead peculiar to Othello: The interchangeability of the disks and therefore apparent strategic meaninglessness (as opposed to chess pieces for example) makes an evaluation of different moves much harder. \n\n## What is AlphaZero?\n\nAlphaZero is a computer program created by DeepMind that plays Chess, Shogi, and Go. It descends from the AlphaGo program, which famously became the first to achieve superhuman performance playing Go.\nNotably, AlphaZero learns to play these games tabula rasa (i.e. given only the rules of the game itself) and is able to achieve superhuman performance in all three.\nThe AlphaZero algorithm is easily adapted to other games.\n\n### Main ideas\nAlphaZero uses a neural network which takes in the state `s` of the board and outputs two things: `p(s)`, a distribution over the set of actions, and `v(s)`, a prediction of which player will win the game. 
The network is trained to minimize the following loss\n\n\u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\dpi{300}\u0026space;\\bg_white\u0026space;L(\\theta)\u0026space;=\u0026space;\\sum_t\u0026space;\\left(\u0026space;\\left(v_\\theta(s_t)\u0026space;-\u0026space;z_t\\right\u0026space;)^2\u0026space;-\u0026space;\\hat{\\pi}(s_t)^T\\log\\left(p_\\theta(s_t)\\right)\u0026space;\\right)\" target=\"_blank\"\u003e\u003cimg width=300 src=\"https://latex.codecogs.com/png.latex?\\dpi{300}\u0026space;\\bg_white\u0026space;L(\\theta)\u0026space;=\u0026space;\\sum_t\u0026space;\\left(\u0026space;\\left(v_\\theta(s_t)\u0026space;-\u0026space;z_t\\right\u0026space;)^2\u0026space;-\u0026space;\\hat{\\pi}(s_t)^T\\log\\left(p_\\theta(s_t)\\right)\u0026space;\\right)\" title=\"L(\\theta) = \\sum_t \\left( \\left(v_\\theta(s_t) - z_t\\right )^2 - \\hat{\\pi}(s_t)^T\\log\\left(p_\\theta(s_t)\\right) \\right)\" /\u003e\u003c/a\u003e\n\nwhere `zₜ` is the outcome of the game (from perspective of time `t`) and `̂π(s)` is an improved policy.\n`zₜ` is easy to calculate (we just look at who won in the end) but the computation of `̂π(s)` is more involved.\n\nIn order to calculate `̂π(s)` we use Monte Carlo Tree Search (MCTS).\nTo explain how this works, define\n\n* `Q(s, a)` is the average reward after taking action `a` from state `s`.\n* `N(s, a)` is the number of times action `a` was taken at state `s`.\n* `P(s, a)` is the probability of taking action `a` from state `s` (according to `p(s)`)\n\nThese quantities (which are just implemented using hash tables in practice) are used to calculate\n\n\u003ca href=\"https://www.codecogs.com/eqnedit.php?latex=\\dpi{300}\u0026space;\\bg_white\u0026space;U(s,\u0026space;a)\u0026space;=\u0026space;Q(s,\u0026space;a)\u0026space;\u0026plus;\u0026space;c_{\\mathrm{puct}}\u0026space;P(s,\u0026space;a)\\frac{\\sqrt{\\sum_b\u0026space;N(s,\u0026space;b)}}{1\u0026space;\u0026plus;\u0026space;N(s,\u0026space;a)}\" 
target=\"_blank\"\u003e\u003cimg width=300 src=\"https://latex.codecogs.com/png.latex?\\dpi{300}\u0026space;\\bg_white\u0026space;U(s,\u0026space;a)\u0026space;=\u0026space;Q(s,\u0026space;a)\u0026space;\u0026plus;\u0026space;c_{\\mathrm{puct}}\u0026space;P(s,\u0026space;a)\\frac{\\sqrt{\\sum_b\u0026space;N(s,\u0026space;b)}}{1\u0026space;\u0026plus;\u0026space;N(s,\u0026space;a)}\" title=\"U(s, a) = Q(s, a) + c_{\\mathrm{puct}} P(s, a)\\frac{\\sqrt{\\sum_b N(s, b)}}{1 + N(s, a)}\" /\u003e\u003c/a\u003e\n\nwhich is called the upper confidence bound of `Q(s, a)` (here, `c_puct` is a hyperparameter that controls how much exploration is done).\nDuring the search, the `a` that maximises `U(s, a)` is chosen.\nThis is done recursively until the game ends, and then the necessary updates to `Q` and `N` are made at each step back up the call chain.\nThe search is repeated from the root node many times.\nAs the number of simulations increases, `Q(s, a)` becomes more accurate and `U(s, a)` approaches `Q(s, a)`.\nAfter all of the simulations are complete, we assign `N(s, a)/sum(N(s, b) for all b)` to `̂π(s, a)`, and `̂π(s)` and `zₜ` are used to train the network, producing an improved policy and value network for the next iteration. \n\n## What is AlphaZero-Othello?\n\nAlphaZero-Othello is an implementation of the AlphaZero algorithm that learns to play Othello.\nIt is written in pure Python, using the PyTorch library to accelerate numerical computations.\nThe goal was to write the simplest and most readable implementation possible.\n\n* 100% of the code is written by me\n* Multithreaded self-play\n* Multithreaded evaluation arena\n* Uses a single GPU on a single node (i.e. 
it is not distributed)\n* Self-play, evaluation, and training all happen synchronously (unlike in the original AlphaZero)\n\n### Network architecture\nThe policy and value network is built from residual blocks of the following form:\n```python\nSequential(\n    SumModule(\n        # main branch: two 3x3 convolutions (stride 1, padding 1)\n        Sequential(\n            Conv2d(n, k, 3, 1, 1),\n            BatchNorm2d(k),\n            ReLU(),\n            Conv2d(k, k, 3, 1, 1),\n            BatchNorm2d(k),\n            ReLU(),\n        ),\n        # skip branch: 1x1 convolution to match channel counts\n        Conv2d(n, k, 1, 1, 0),\n    ),\n    ReLU(),\n)\n```\nwhere the `n` of one block equals the `k` of the previous block, and the second branch of the `SumModule` is replaced with the identity if `n == k`.\nFive blocks are used and the channel counts are `[16, 32, 64, 128, 128, 128]`.\n\nThe output of the residual tower is then split.\nThe branch that computes `pi` is\n```python\nSequential(\n    Conv2d(128, 16, 1, 1, 0),\n    BatchNorm2d(16),\n    Flatten(),\n    Linear(16 * 8 * 8, 256),\n    BatchNorm1d(256),\n    Linear(256, 8 * 8 + 1),\n    LogSoftmax(dim=1),\n)\n```\nSimilarly, the branch that computes `v` is\n```python\nSequential(\n    Conv2d(128, 16, 1, 1, 0),\n    BatchNorm2d(16),\n    Flatten(),\n    Linear(16 * 8 * 8, 256),\n    BatchNorm1d(256),\n    Linear(256, 1),\n    Tanh(),\n)\n```\n### Parameters\n\nEvery round of self-play consists of 100 games, and each move is based on 25 MCTS simulations. 
\n20 iterations of training data are preserved in the history buffer.\nThe `c_puct` parameter is set to `3`, and at the root of the MCTS search, Dirichlet noise with `alpha = 0.9` is mixed with the estimates of `pi` (25% noise).\n\n### Results\n\nI have played many games against the trained agent and I have never won.\nThe agent is able to beat the Iagno engine on \"Medium\" difficulty but cannot beat Iagno on \"Hard\" difficulty.\n\nAlthough the results are not spectacular, they are understandable.\nAlphaGo Zero was trained on 4.9 million games of self-play with 1600 simulations per MCTS, while this agent was trained on about 40 thousand games with 25 simulations per MCTS.\n\n## Demo\n\nRun [`demo.ipynb`](https://colab.research.google.com/github/PythonNut/alphazero-othello/blob/master/demo.ipynb) on Google Colab!\n\n## References\n* [AlphaGo Zero paper](https://www.nature.com/articles/nature24270)\n* [AlphaZero paper](http://arxiv.org/abs/1712.01815)\n* [MuZero paper](http://arxiv.org/abs/1911.08265)\n* [jonathan-laurent/AlphaZero.jl](https://github.com/jonathan-laurent/AlphaZero.jl)\n* [This post](https://web.stanford.edu/~surag/posts/alphazero.html) by Surag Nair as well as [suragnair/alpha-zero-general](https://github.com/suragnair/alpha-zero-general)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpythonnut%2Falphazero-othello","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpythonnut%2Falphazero-othello","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpythonnut%2Falphazero-othello/lists"}