{"id":13856716,"url":"https://github.com/maciejkula/glove-python","last_synced_at":"2025-05-15T07:07:04.320Z","repository":{"id":21865689,"uuid":"25189174","full_name":"maciejkula/glove-python","owner":"maciejkula","description":"Toy Python implementation of http://www-nlp.stanford.edu/projects/glove/","archived":false,"fork":false,"pushed_at":"2022-02-19T11:37:43.000Z","size":419,"stargazers_count":1257,"open_issues_count":66,"forks_count":321,"subscribers_count":45,"default_branch":"master","last_synced_at":"2025-05-15T01:46:26.707Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/maciejkula.png","metadata":{"files":{"readme":"readme.md","changelog":"changelog.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-10-14T03:42:15.000Z","updated_at":"2025-05-11T14:20:01.000Z","dependencies_parsed_at":"2022-07-20T00:16:57.624Z","dependency_job_id":null,"html_url":"https://github.com/maciejkula/glove-python","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maciejkula%2Fglove-python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maciejkula%2Fglove-python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maciejkula%2Fglove-python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maciejkula%2Fglove-python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/maciejkula","download_url":"https://codeload.github.com/maciejkula/glove-python/tar.gz/refs/heads/master"
,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254292042,"owners_count":22046426,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-05T03:01:10.358Z","updated_at":"2025-05-15T07:06:59.291Z","avatar_url":"https://github.com/maciejkula.png","language":"Python","funding_links":[],"categories":["Python","APIs and Libraries"],"sub_categories":["Knowledge Graphs"],"readme":"# glove-python\n\n[![Circle CI](https://circleci.com/gh/maciejkula/glove-python.svg?style=svg)](https://circleci.com/gh/maciejkula/glove-python)\n\nA toy Python implementation of [GloVe](http://www-nlp.stanford.edu/projects/glove/).\n\nGloVe produces dense vector embeddings of words, where words that occur together are close in the resulting vector space.\n\nWhile this produces embeddings which are similar to [word2vec](https://code.google.com/p/word2vec/) (which has a great Python implementation in [gensim](http://radimrehurek.com/gensim/models/word2vec.html)), the method is different: GloVe produces embeddings by factorizing the logarithm of the corpus word co-occurrence matrix.\n\nThe code uses asynchronous stochastic gradient descent, and is implemented in Cython. Most likely, it contains a tremendous number of bugs.\n\n## Installation\nInstall from PyPI using pip: `pip install glove_python`.\n\nNote for OSX users: due to its use of OpenMP, glove-python does not compile under Clang. To install it, you will need a reasonably recent version of `gcc` (from Homebrew for instance). 
This should be picked up by `setup.py`; if it is not, please open an issue.\n\nBuilding with the default Python distribution included in OSX is also not supported; please try the version from Homebrew or Anaconda.\n\n## Usage\nProducing the embeddings is a two-step process: creating a co-occurrence matrix from the corpus, and then using it to produce the embeddings. The `Corpus` class helps in constructing a corpus from an iterable of tokens; the `Glove` class trains the embeddings (with a sklearn-esque API).\n\nThere is also support for rudimentary paragraph vectors. A paragraph vector (in this case) is an embedding of a paragraph (a multi-word piece of text) in the word vector space in such a way that the paragraph representation is close to the words it contains, adjusted for the frequency of words in the corpus (in a manner similar to tf-idf weighting). These can be obtained after having trained word embeddings by calling the `transform_paragraph` method on the trained model.\n\n## Examples\n`example.py` has some example code for running simple training scripts: `ipython -i -- examples/example.py -c my_corpus.txt -t 10` should process your corpus, run 10 training epochs of GloVe, and drop you into an `ipython` shell where `glove.most_similar('physics')` should produce a list of similar words.\n\nIf you want to process a Wikipedia corpus, you can pass a file from [here](http://dumps.wikimedia.org/enwiki/latest/) into the `example.py` script using the `-w` flag. Running `make all-wiki` should download a small Wikipedia dump file, process it, and train the embeddings. 
Building the co-occurrence matrix will take some time; training the vectors can be sped up by increasing the training parallelism to match the number of physical CPU cores available.\n\nRunning this on my machine yields roughly the following results:\n\n```\nIn [1]: glove.most_similar('physics')\nOut[1]:\n[('biology', 0.89425889335342257),\n ('chemistry', 0.88913708236100086),\n ('quantum', 0.88859617025616333),\n ('mechanics', 0.88821824562025431)]\n\nIn [4]: glove.most_similar('north')\nOut[4]:\n[('west', 0.99047203572917908),\n ('south', 0.98655786905501008),\n ('east', 0.97914140138065575),\n ('coast', 0.97680427897282185)]\n\nIn [6]: glove.most_similar('queen')\nOut[6]:\n[('anne', 0.88284931171714842),\n ('mary', 0.87615260138308615),\n ('elizabeth', 0.87362497374226267),\n ('prince', 0.87011034923161801)]\n\nIn [19]: glove.most_similar('car')\nOut[19]:\n[('race', 0.89549347066796814),\n ('driver', 0.89350343749207217),\n ('cars', 0.83601334715106568),\n ('racing', 0.83157724991920212)]\n```\n\n## Development\nPull requests are welcome.\n\nWhen making changes to the `.pyx` extension files, you'll need to run `python setup.py cythonize` in order to produce the extension `.c` and `.cpp` files before running `pip install -e .`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaciejkula%2Fglove-python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaciejkula%2Fglove-python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaciejkula%2Fglove-python/lists"}