{"id":18674977,"url":"https://github.com/jaanli/sentence_word2vec","last_synced_at":"2025-04-12T02:06:20.956Z","repository":{"id":72811237,"uuid":"74535648","full_name":"jaanli/sentence_word2vec","owner":"jaanli","description":"word2vec with a context based on sentences.","archived":false,"fork":false,"pushed_at":"2017-01-30T20:16:40.000Z","size":34,"stargazers_count":15,"open_issues_count":0,"forks_count":4,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-12T02:06:14.621Z","etag":null,"topics":["ops","sentence","sentence-word2vec","tensorflow","word2vec"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jaanli.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-11-23T03:07:46.000Z","updated_at":"2020-03-03T06:54:23.000Z","dependencies_parsed_at":"2023-02-23T12:16:07.753Z","dependency_job_id":null,"html_url":"https://github.com/jaanli/sentence_word2vec","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaanli%2Fsentence_word2vec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaanli%2Fsentence_word2vec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaanli%2Fsentence_word2vec/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaanli%2Fsentence_word2vec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jaanli","dow
nload_url":"https://codeload.github.com/jaanli/sentence_word2vec/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248505862,"owners_count":21115354,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ops","sentence","sentence-word2vec","tensorflow","word2vec"],"created_at":"2024-11-07T09:21:30.907Z","updated_at":"2025-04-12T02:06:20.945Z","avatar_url":"https://github.com/jaanli.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# sentence_word2vec\nword2vec with a context based on sentences, in C++.\n\nThis is based on the tensorflow implementation of word2vec.\n\nHowever, the context for the model is defined differently:\n\n* the context for the model is defined in terms of sentences.\n* the context for a given word is the rest of words in a sentence.\n\nThis is implemented in C++ in the `sentence_word2vec_kernels.cc` file.\n\nWhy might this be useful? This can be used to model playlists or\nuser histories for recommendation! 
Or any other kind of 'bagged' data.\n\n## Usage\n\nTo compile the C++ ops used:\n```\ngit clone https://github.com/altosaar/sentence_word2vec\ncd sentence_word2vec\n# pull the models repo submodule\ngit submodule update --init\n./compile_ops.sh\n```\n\nTo get the text8 data and split it into sentences for testing:\n```\n./get_data.sh\n```\n\nTo run the code with a sentence-level context window:\n```\npython word2vec_optimized.py -- \\\n    --train_data text8_split \\\n    --eval_data questions-words.txt \\\n    --save_path /tmp \\\n    --sentence_level True\n```\n\nOn a MacBook Air with the following CPU, training runs at around 17k words/second. This is up from around 2k words/second with a [manual Python implementation](https://github.com/altosaar/scirec).\n```\n➜  ~ sysctl -n machdep.cpu.brand_string\nIntel(R) Core(TM) i7-4650U CPU @ 1.70GHz\n```\n\nThis directory contains models for unsupervised training of word embeddings\nusing the model described in:\n(Mikolov et al.) [Efficient Estimation of Word Representations in Vector Space](http://arxiv.org/abs/1301.3781),\nICLR 2013.\n\nDetailed instructions and a description of this model are available in the\nTensorFlow tutorials:\n\n* [Word2Vec Tutorial](http://tensorflow.org/tutorials/word2vec/index.md)\n\nFile | What's in it?\n--- | ---\n`word2vec.py` | A version of word2vec implemented using TensorFlow ops and minibatching.\n`word2vec_optimized.py` | A version of word2vec implemented using C ops that does no minibatching.\n`sentence_word2vec_kernels.cc` | Kernels for the custom input and training ops, including sentence-level contexts.\n`sentence_word2vec_ops.cc` | The declarations of the custom ops.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaanli%2Fsentence_word2vec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjaanli%2Fsentence_word2vec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaanli%2Fsentence_word2vec/lists"}