{"id":15293046,"url":"https://github.com/patrickfrank1/chesspos","last_synced_at":"2025-04-13T12:23:44.477Z","repository":{"id":62561996,"uuid":"247558466","full_name":"patrickfrank1/chesspos","owner":"patrickfrank1","description":"Embedding based chess position search and embedding learning for chess positions","archived":false,"fork":false,"pushed_at":"2021-12-26T18:48:57.000Z","size":5875,"stargazers_count":13,"open_issues_count":12,"forks_count":6,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-05T17:57:36.904Z","etag":null,"topics":["chess-database","embeddings","faiss","metric-learning","similarity-search","tensorflow","triplet-loss"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/patrickfrank1.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-03-15T21:56:34.000Z","updated_at":"2024-12-10T08:09:54.000Z","dependencies_parsed_at":"2022-11-03T15:17:13.823Z","dependency_job_id":null,"html_url":"https://github.com/patrickfrank1/chesspos","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/patrickfrank1%2Fchesspos","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/patrickfrank1%2Fchesspos/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/patrickfrank1%2Fchesspos/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/patrickfrank1%2Fchesspos/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/patrickfra
nk1","download_url":"https://codeload.github.com/patrickfrank1/chesspos/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240079054,"owners_count":19744716,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chess-database","embeddings","faiss","metric-learning","similarity-search","tensorflow","triplet-loss"],"created_at":"2024-09-30T16:38:44.939Z","updated_at":"2025-02-23T10:32:15.771Z","avatar_url":"https://github.com/patrickfrank1.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Warning: The codebase is undergoing a major refactoring\n\nI am currently integrating [DVC](https://dvc.org/) into the project. This requires\na complete refactoring of the code base. Also, the embedding search will be moved\ninto its own repository, since it is strictly speaking a downstream application.\n\nExpect:\n- unstable master branch\n- outdated readme\n- broken code\n\nPlease check back in a bit to see the new, much better version of this project!\n\n# chesspos: embedding learning for chess positions\nEmbedding based chess position search and embedding learning for chess positions\n\nThis repository allows you to search a chess position against billions of chess positions in millions of games and retrieve similar positions. You can also build your own database with the provided tools. Additionally, the project experiments with embeddings learned from bitboards using the triplet neural network architecture. 
Feel free to try out your own embedding models to improve the embedding based search retrieval.\n\n## Guide\n\n1. Install the package\n2. Demo: Search your positions in a provided database\n3. Extract positions from your own database for search and metric learning\n4. Train and evaluate chess position embeddings\n5. Experiment and Contribute\n6. Cite this project\n\n## 1. Install the package\n\nMake sure you have python3 installed. You will also need the following packages:\n- [h5py](https://github.com/h5py/h5py) to read and write large chunks of data\n- [python-chess](https://github.com/niklasf/python-chess) for parsing chess games\n- [faiss](https://github.com/facebookresearch/faiss) for billion scale nearest neighbor search\n\nand numpy.\n\nAdditionally for the metric learning part of this project you will need [tensorflow (v2)](https://www.tensorflow.org/).\n\nAll packages except for faiss can be pip installed. To install faiss either use anaconda, e.g.\n\n```conda install faiss-cpu -c pytorch```\n\nor follow alternative instructions like [here](https://gist.github.com/korakot/d0a49d7280bd3fb856ae6517bfe8da7a) or [here](https://stackoverflow.com/questions/47967252/installing-faiss-on-google-colaboratory).\n\nFinally pip install this package from source.\n```\ngit clone https://github.com/patrickfrank1/chesspos.git\ncd chesspos\npython -m pip install .\n# test if installation was successful, the following should run without error\npython -c \"import chesspos\"\n```\nCongratulations you have successfully installed the package. 
It contains the following modules:\n- `chesspos.binary_index`: functions for loading and searching bitboards in faiss,\n- `chesspos.convert`: convert between different chess position encodings like fen, bitboards and chess.Board(),\n- `chesspos.embedding_index`: functions for loading and searching embeddings in faiss,\n- `chesspos.models`: tensorflow models for embedding learning,\n- `chesspos.monitoring`: functions to monitor metric learning progress, in particular a callback to track triplet classification accuracy,\n- `chesspos.pgnextract`: functions to extract and save bitboards from pgn files,\n- `chesspos.preprocessing`: prepare triplet generators for metric learning,\n- `chesspos.utils`: general purpose functions.\n\nFurthermore, this repository contains folders for tests, demos, command line tools and data files.\n\n## 2. Demo: Search your positions in a provided database\n\nNow that you have installed the package, you can check out the demo notebook at [./demo/query_bitboard_db.ipynb](./demo/query_bitboard_db.ipynb) or see some [examples live in your browser](https://mybinder.org/v2/gh/patrickfrank1/chesspos/6e9ec55ccab91ed9bd23c9f1a80e3b981c466c2a?filepath=demo%2Fquery_bitboard_example.ipynb).\n\n![animation of demo notebook](./demo/gif/animation.gif)\n\nThe demo enables you to search a small database of bitboards for similar positions. I provide some more precompiled databases. 
The following databases contain high quality games that are generated from [freely available lichess games](https://database.lichess.org/), where we only extracted games with both players above elo 2000 and a time control greater than or equal to 60+1 seconds.\n\n|          file/link              | positions [million] | download size | RAM needed |\n|:-------------------------------:|:-------------------:|:-------------:|:----------:|\n| [index_2013.faiss.bz2][1]       |                 1.7 |         12 MB |     171 MB |\n| [index_2014.faiss.bz2][2]       |                11.5 |         80 MB |     1.2 GB |\n| [index_2015.faiss.bz2][3]       |                  47 |        324 MB |     4.6 GB |\n| [index_2020_01_02.faiss.bz2][5] |                 510 |        3.6 GB |      50 GB |\n\n[1]:https://chess-position-files.s3.amazonaws.com/index/index_2013.faiss.bz2\n[2]:https://chess-position-files.s3.amazonaws.com/index/index_2014.faiss.bz2\n[3]:https://chess-position-files.s3.amazonaws.com/index/index_2015.faiss.tar.bz2\n[5]:https://chess-position-files.s3.amazonaws.com/index/index_2020_01_02.faiss.bz2\n\nHowever, as you can find out by playing with the notebook, the similarity search with bitboards is not optimal, which is why we explore metric learning later on.\n\n## 3. Extract positions from your own database for search and metric learning\n\nThe `tools` folder provides useful command line scripts to preprocess pgn files that contain chess positions. If you are stuck, display the help with `python3 path/to/tool.py -h`.\n\n#### 3.1 Extract positions\n\nTo extract bitboards from all positions of all games in a pgn file, open a terminal in the `tools` folder and run:\n```bash\npython3 pgn_extract.py ../data/raw/test.pgn --save_position ../data/bitboards/test-bb1.h5\n```\n\nThis command takes as input the path to your pgn file and writes the bitboards to an h5 file at the path specified via `--save_position`. 
Note: you can drop the .pgn and .h5 file endings and the program will still parse the right files. To ease the file writing process and occupy less RAM you can use the `--chunksize` flag, so that your data will be written in chunks, e.g. `--chunksize 10000`.\n\nWe can also utilize this script to extract tuples of positions for metric learning. To do so, run:\n```bash\npython3 pgn_extract.py ../data/raw/test --save_position ../data/bitboards/test-bb1 --tuples True --save_tuples ../data/train_small/test2-tuples-strong\n```\nThis will extract tuples from each game by virtue of the method `tuple_generator` in `chesspos.pgnextract`. Each generated tuple has the shape (15, 773) and contains a randomly sampled position *game[i][j]* of game i, positions that follow it in the same game, and randomly sampled positions from the next game:\n```\ntuple = (game[i][j], game[i][j+1], game[i][j+2], game[i][j+3], game[i][j+4], game[i][(j+14) mod len(game[i])], game[i+1][rand1], ..., game[i+1][rand9])\n```\n\nFurthermore, the command line script implements two simple filters to subsample big pgn files. For example\n```bash\npython3 pgn_extract.py ../data/raw/test --save_position ../data/bitboards/test-bb2 --chunksize 10000 --tuples True --save_tuples ../data/train_small/test2-tuples-strong --filter elo_min=2400 --filter time_min=61\n```\nselects only games in which both players have an elo greater than or equal to 2400 and where the time control is greater than or equal to 61. 
The time control is calculated as *seconds + seconds per move*, which means a bullet game (60s+0s) is discarded whereas a bullet game with increment (60s+1s) is kept.\n\n#### 3.2 Build a faiss database from bitboards for search\n\nUse the following script to create a binary index from positions encoded as bitboards.\n```bash\npython3 index_from_bitboards.py ../data/bitboards/testdir --table_key position_ --save_path ../data/test_index2\n```\nThis command will take all h5 files from the `../data/bitboards/testdir` directory and extract bitboards from all datasets in all h5 files which contain `position_` in their dataset name. The recommended (and also default) value is *position*, since bitboards created with the tool in section 3.1 use this name for bitboard datasets. The finished index is saved to `../data/test_index2` and can be used as in the [demo notebook](./demo/query_bitboard_db.ipynb).\n\n## 4. Train and evaluate chess position embeddings\n\n#### 4.1 Embedding model\n\nFirst I tried a simple triplet network architecture to learn a position embedding. This, however, quickly turned out to be too simple an approach. Instead, I propose a triplet autoencoder architecture, as presented in Figure 1, to learn chess position embeddings.\n\nWhile designing the network architecture I also took inspiration from [Courty, Flamary, Ducoffe: Learning Wasserstein Embeddings](https://github.com/mducoffe/Learning-Wasserstein-Embeddings) and [CrimyTheBold/tripletloss](https://github.com/CrimyTheBold/tripletloss).\n\n![triplet autoencoder](./demo/img/triplet_network.png)\n\nThe idea behind this architecture is inspired by word embeddings like word2vec in that subsequent positions in a game are similar and should therefore have similar embeddings. This is what the triplet network learns. 
However, this implicit classification discards a lot of information that is encoded in the chess position. I therefore introduced the autoencoder to ensure that the position's information is encoded in the embedding and to act as a regularizer.\n\nI provide two models, which are trained on more than 50 million triplets:\n- with shallow encoder/decoder networks to 128 dimensions [here](https://chess-position-files.s3.amazonaws.com/model/shallow128.tar.bz2)\n- with deep encoder/decoder networks to 64 dimensions [here](https://chess-position-files.s3.amazonaws.com/model/deep64.tar.bz2)\n\nas well as some [training triplets (11G)](https://chess-position-files.s3.amazonaws.com/tuples/train.tar.bz2) and [validation triplets (1.4G)](https://chess-position-files.s3.amazonaws.com/tuples/validation.tar.bz2). You can generate your own training data with the script in section **3.1**.\n\nFor inference use the `model_inference.py` command line script from `tools`. This script takes a directory with bitboards stored in h5 files and appends the inferred embeddings to those h5 files. 
These files are then used to create an index as discussed in section **4.3**.\n\n\n#### 4.2 Train your own embeddings\n\nYou can train your own embeddings using a similar architecture (but different encoder/decoder networks) using the `train_model.py` command line script in `tools`:\n```bash\npython3 train_model.py path/to/config.json\n```\nwhere the config file has the following fields (and default values):\n```json\n{\n\t\"train_dir\": \"path/to/directory/with/train/samples\",\n\t\"validation_dir\": \"path/to/directory/with/validation/samples\",\n\t\"save_dir\": \"path/to/save/directory\",\n\t\"input_size\": 773,\n\t\"embedding_size\": 32,\n\t\"alpha\": 0.2,\n\t\"triplet_weight_ratio\": 10.0,\n\t\"hidden_layers\": [],\n\t\"train_batch_size\": 16,\n\t\"validation_batch_size\": 16,\n\t\"train_steps_per_epoch\": 1000,\n\t\"validation_steps_per_epoch\": 100,\n\t\"train_sampling\": [\"easy\",\"semihard\",\"hard\"],\n\t\"validation_sampling\": [\"easy\",\"semihard\",\"hard\"],\n\t\"tf_callbacks\": [\"early_stopping\",\"triplet_accuracy\", \"checkpoints\"],\n\t\"save_stats\": true,\n\t\"hide_tf_warnings\": true\n}\n```\n`alpha` is the separation margin between positive and negative samples in the triplet loss, `triplet_weight_ratio` is a hyperparameter that combines triplet loss and autoencoder loss by weighting the triplet loss e.g. 10 times higher, and `train_sampling` selects the way in which triplets are sampled from the provided tuples. To get a better understanding of what is going on, consider looking at `tools/train_model.py`, `chesspos/models.py`, `chesspos/preprocessing.py` and `chesspos/monitoring.py`.\n\n#### 4.3 Build a faiss database from embeddings\n\nTake your trained model (or alternatively one of the two provided above) and generate embeddings for a database of bitboards (stored in h5 files). 
I provide the `model_inference.py` script for embedding generation in the `tools` folder, to be used like this:\n```bash\npython3 model_inference.py path/to/model path/to/bitboard/files\n```\nOptionally, you can save the inferred embeddings as float16 values (saves 50% memory on disk) with the `--float16 True` flag. You can also specify a batch size with e.g. `--batch_size 4096`, specify the table prefix for the bitboard tables with `--table_prefix` (the default is *position*) and specify a prefix for the embedding tables with `--embedding_table_prefix` (the default is *test_embedding*). **Warning: the h5 files with bitboards are updated in-place: an embedding table is added for each bitboard table.**\n\nThen use the `index_from_embedding.py` script from the `tools` folder on those generated embeddings. This will create an index file that is drastically smaller than the stored embeddings, so that the index can be searched in RAM!\n```bash\npython3 index_from_embedding.py PCA16,SQ4 path/to/embeddings\n```\nwhere `PCA16,SQ4` can be any valid [faiss index factory](https://github.com/facebookresearch/faiss/wiki/The-index-factory) string and the second argument is the path to the previously written embedding files. For additional info open the manual with `python3 index_from_embedding.py -h`.\n\nThe index compresses the embeddings, in case of `PCA32,SQ4` to 16 bytes per position, which is much smaller than 92 bytes per embedding for a bitboard. In that way the database can comprise many more positions. However, the nearest positions cannot be retrieved from the index itself but have to be retrieved from file instead. If you provide the files that were used to create the index, then you will be able to retrieve the bitboards of nearest positions as well as the embeddings themselves. 
For a demonstration see [./demo/inspect_embeddings.ipynb](./demo/inspect_embeddings.ipynb).\n\nSince the embedding files require a large amount of disk space you can also delete the embedding tables or provide the original h5 files that only store the bitboards. These files will be much smaller on disk, but you will only be able to restore the bitboards and not the embeddings. For a demo check out [./demo/query_bitboard_db.ipynb](./demo/query_bitboard_db.ipynb) again.\n\nI also provide links to some precompiled indices below with bitboards and embeddings (use with demo scripts).\n\n| Embedding model  | Positions    | Indices         | Bitboards                                                      |\n|------------------|--------------|-----------------|----------------------------------------------------------------|\n| [shallow 128][6] | 1.7 million  | [2013_s128][7]  | [2013_bitboards][8]                                            |\n| [deep 64][9]     | 1.7 million  | [2013_d64][10]  | [2013_bitboards][8]                                            |\n| [shallow 128][6] | 11.5 million | [2014_s128][11] | [2014_bitboards][12]                                           |\n| [deep 64][9]     | 11.5 million | [2014_d64][13]  | [2014_bitboards][12]                                           |\n| [shallow 128][6] | 907 million  | [all_s128][14]  | [2013_bitboards][8]+[2014_bitboards][12]+[other_bitboards][15] |\n| [deep 64][9]     | 907 million  | [all_d64][16]   | [2013_bitboards][8]+[2014_bitboards][12]+[other_bitboards][15] 
|\n\n[6]:https://chess-position-files.s3.amazonaws.com/model/shallow128.tar.bz2\n[7]:https://chess-position-files.s3.amazonaws.com/index/2013_s128.tar\n[8]:https://chess-position-files.s3.amazonaws.com/bitboards/2013_bitboards.tar.bz2\n[9]:https://chess-position-files.s3.amazonaws.com/model/deep64.tar.bz2\n[10]:https://chess-position-files.s3.amazonaws.com/index/2013_d64.tar\n[11]:https://chess-position-files.s3.amazonaws.com/index/2014_s128.tar\n[12]:https://chess-position-files.s3.amazonaws.com/bitboards/2014_bitboards.tar.bz2\n[13]:https://chess-position-files.s3.amazonaws.com/index/2013_d64.tar\n[14]:https://chess-position-files.s3.amazonaws.com/index/all_s128.tar\n[15]:https://chess-position-files.s3.amazonaws.com/bitboards/other_bitboards.tar.bz2\n[16]:https://chess-position-files.s3.amazonaws.com/index/all_d64.tar\n\n## 5. Experiment and Contribute\n\nIf you like this project and want to extend it, there are two main challenges to solve, as outlined in the sections above. You can focus on embedding learning or on embedding compression.\n\nA few ideas to improve embedding learning:\n- sample triplets/tuples in a different way (e.g. from openings / endgames to improve search for that particular part of the game)\n- tune the triplet-autoencoder hyperparameters, encoder structure, ...\n- come up with a better neural network architecture for metric learning\n\nA few ideas for improving embedding compression:\n- test different faiss indices, find best compression/accuracy tradeoff\n- try inverted file indices\n\nOther things:\n- retrieve games that belong to the retrieved positions (information is all there)\n- calculate triplet accuracy as tf metric instead of tf callback\n- expose position search as api\n\n## 6. 
Cite this project\n\nIf you use this project in your work, please consider citing it.\n```\n@misc{frank2020chesspos,\ntitle={chesspos: embedding learning for chess positions},\nauthor={Frank, Patrick},\nurl={https://github.com/patrickfrank1/chesspos},\nyear={2020},\nmonth={04}\n}\n```\n\n#### License\n\nThis project is licensed under the terms of the GNU GPLv3.0 license.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpatrickfrank1%2Fchesspos","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpatrickfrank1%2Fchesspos","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpatrickfrank1%2Fchesspos/lists"}