{"id":19168391,"url":"https://github.com/bloomberg/koan","last_synced_at":"2025-04-13T08:26:20.943Z","repository":{"id":66125182,"uuid":"320400604","full_name":"bloomberg/koan","owner":"bloomberg","description":"A word2vec negative sampling implementation with correct CBOW update.","archived":false,"fork":false,"pushed_at":"2021-11-08T16:37:42.000Z","size":387,"stargazers_count":260,"open_issues_count":0,"forks_count":18,"subscribers_count":9,"default_branch":"main","last_synced_at":"2025-03-27T00:11:25.394Z","etag":null,"topics":["cbow","cpp","skipgram","word-embeddings","word2vec"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bloomberg.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-12-10T22:02:35.000Z","updated_at":"2025-03-07T01:29:49.000Z","dependencies_parsed_at":"2023-04-01T14:19:08.807Z","dependency_job_id":null,"html_url":"https://github.com/bloomberg/koan","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bloomberg%2Fkoan","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bloomberg%2Fkoan/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bloomberg%2Fkoan/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bloomberg%2Fkoan/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bloomberg","download_url":"https://codeload.git
hub.com/bloomberg/koan/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248682718,"owners_count":21144820,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cbow","cpp","skipgram","word-embeddings","word2vec"],"created_at":"2024-11-09T09:42:30.439Z","updated_at":"2025-04-13T08:26:20.924Z","avatar_url":"https://github.com/bloomberg.png","language":"C++","readme":"\u003cp align=\"center\"\u003e\u003cimg src=\"koan.png\" width=\"200\"\u003e\u003c/p\u003e\n\n\u003e ... the Zen attitude is that words and truth are incompatible, or at least that no words can capture truth.\n\u003e \n\u003e Douglas R. Hofstadter\n\nA word2vec negative sampling implementation with correct CBOW update. kōan only depends on Eigen.\n\n_Authors_: Ozan İrsoy, Adrian Benton, Karl Stratos\n\nThanks to Cyril Khazan for helping kōan better scale to many threads.\n\n## Menu\n\n- [Rationale](#rationale)\n- [Building](#building)\n- [Quick start](#quick-start)\n- [Installation](#installation)\n- [License](#license)\n- [Benchmarks](#benchmarks)\n\n## Rationale\n\nAlthough continuous bag of words (CBOW) embeddings can be trained more quickly than skipgram (SG) embeddings, it is a common belief that SG embeddings tend to perform better in practice. This was observed by the original authors of Word2Vec [1] and also in subsequent work [2].  
However, we found that popular implementations of word2vec with negative sampling such as [word2vec](https://github.com/tmikolov/word2vec/) and [gensim](https://github.com/RaRe-Technologies/gensim/) do not implement the CBOW update correctly, thus potentially leading to misconceptions about the performance of CBOW embeddings when trained correctly.\n\nWe release kōan so that others can efficiently train CBOW embeddings using the corrected weight update. See this [technical report](https://arxiv.org/abs/2012.15332) for benchmarks of kōan vs. gensim word2vec negative sampling implementations.  If you use kōan to learn word embeddings for your own work, please cite:\n\n\u003e Ozan İrsoy, Adrian Benton, and Karl Stratos. \"Corrected CBOW Performs as well as Skip-gram.\" The 2nd Workshop on Insights from Negative Results in NLP (__2021__).\n\n[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.\n\n[2] Karl Stratos, Michael Collins, and Daniel Hsu. Model-based word embeddings from decompositions of count matrices. 
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing\n(Volume 1: Long Papers), pages 1282–1291, 2015.\n\nSee [here](https://doi.org/10.5281/zenodo.5542319) for kōan embeddings trained on the English cleaned Common Crawl corpus (C4).\n\n## Building\n\nYou need a C++17-supporting compiler to build koan (tested with g++ 7.5.0, 8.4.0, 9.3.0, and clang 11.0.3).\n\nTo build koan and all tests:\n```\nmkdir build\ncd build\ncmake ..\ncmake --build ./\n```\n\nRun tests with (assuming you are still under `build`):\n```\n./test_gradcheck\n./test_utils\n```\n\n## Installation\n\nInstallation is as simple as placing the koan binary on your `PATH`\n(you might need sudo):\n\n```\ncmake --install ./\n```\n\n## Quick Start\n\nTo train word embeddings on [Wikitext-2](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/), first clone and build koan:\n\n```\ngit clone --recursive git@github.com:bloomberg/koan.git\ncd koan\nmkdir build\ncd build\ncmake .. \u0026\u0026 cmake --build ./\ncd ..\n```\n\nDownload and unzip the Wikitext-2 corpus:\n\n```\ncurl https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip --output wikitext-2-v1.zip\nunzip wikitext-2-v1.zip\nhead -n 5 ./wikitext-2/wiki.train.tokens\n```\n\nAnd learn CBOW embeddings on the training fold with:\n\n```\n./build/koan -V 2000000 \\\n             --epochs 10 \\\n             --dim 300 \\\n             --negatives 5 \\\n             --context-size 5 \\\n             -l 0.075 \\\n             --threads 16 \\\n             --cbow true \\\n             --min-count 2 \\\n             --file ./wikitext-2/wiki.train.tokens\n```\n\nor skipgram embeddings by running with `--cbow false`. Run `./build/koan --help` for a full list of command-line arguments and descriptions.  
Learned embeddings will be saved to `embeddings_${CURRENT_TIMESTAMP}.txt` in the present working directory.\n\n## License\n\nPlease read the [LICENSE](LICENSE) file.\n\n## Benchmarks\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"word2vec_train_times_cbow.png\" width=\"400\"\u003e\u003cimg src=\"word2vec_train_times_sg.png\" width=\"400\"\u003e\u003c/p\u003e\n\nSee the [report](https://arxiv.org/abs/2012.15332) for more details.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbloomberg%2Fkoan","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbloomberg%2Fkoan","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbloomberg%2Fkoan/lists"}