{"id":19608101,"url":"https://github.com/jfilter/hyperhyper","last_synced_at":"2025-06-27T04:32:34.965Z","repository":{"id":62569951,"uuid":"189021107","full_name":"jfilter/hyperhyper","owner":"jfilter","description":"🧮 Python package to construct word embeddings for small data using PMI and SVD","archived":false,"fork":false,"pushed_at":"2020-10-25T21:46:18.000Z","size":435,"stargazers_count":17,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-06-10T08:07:45.957Z","etag":null,"topics":["embeddings","nlp","pmi","pmi-svd","ppmi","python","python-package","word-analogy","word-embeddings","word-similarity"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jfilter.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-05-28T12:09:00.000Z","updated_at":"2025-02-02T21:02:06.000Z","dependencies_parsed_at":"2022-11-03T16:32:17.577Z","dependency_job_id":null,"html_url":"https://github.com/jfilter/hyperhyper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jfilter/hyperhyper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfilter%2Fhyperhyper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfilter%2Fhyperhyper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfilter%2Fhyperhyper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfilter%2Fhyperhyper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jfilter","download_url":"https://codeload.github.com/jfilter/hyperhyper/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfilter%2Fhyperhyper/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260704449,"owners_count":23049461,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["embeddings","nlp","pmi","pmi-svd","ppmi","python","python-package","word-analogy","word-embeddings","word-similarity"],"created_at":"2024-11-11T10:14:09.126Z","updated_at":"2025-06-27T04:32:34.941Z","avatar_url":"https://github.com/jfilter.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# `hyperhyper` [![Build Status](https://travis-ci.com/jfilter/hyperhyper.svg?branch=master)](https://travis-ci.com/jfilter/hyperhyper) [![PyPI](https://img.shields.io/pypi/v/hyperhyper.svg)](https://pypi.org/project/hyperhyper/) [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/hyperhyper.svg)](https://pypi.org/project/hyperhyper/) [![PyPI - Downloads](https://img.shields.io/pypi/dm/hyperhyper)](https://pypistats.org/packages/hyperhyper)\n\n`hyperhyper` is a Python package to construct word embeddings for small data.\n\n## Why?\n\nNowadays, [word embeddings](https://en.wikipedia.org/wiki/Word_embedding) are mostly associated with [Word2vec](https://en.wikipedia.org/wiki/Word2vec) or [fastText](https://en.wikipedia.org/wiki/FastText).\nThese approaches focus on scenarios, where an abundance of data is available.\nAnd big players such as Facebook provide ready-to-use [pre-trained word embeddings](https://fasttext.cc/docs/en/crawl-vectors.html).\nSo often you don't have to train new word embeddings from scratch.\nBut sometimes you do.\n\nWord2vec or fastText require a lot of data – but texts, especially domain-specific texts, may be scarce.\nThere exist alternative methods based on counting co-locations (word pairs) that require fewer data to work.\nThis package implements these approaches (somewhat) efficiently.\n\n## Installation\n\n```bash\npip install hyperhyper\n```\n\nTo enable all features (such as pre-processing with spaCy):\n\n```bash\npip install hyperhyper[full]\n```\n\n## Usage\n\n```python\nimport hyperhyper as hy\n\n# download and uncomproess the data\n# wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2010.en.shuffled.gz \u0026\u0026 gzip -d news.2010.en.shuffled.gz\ncorpus = hy.Corpus.from_file(\"news.2010.en.shuffled\")\nbunch = hy.Bunch(\"news_bunch\", corpus)\n\n# `hyperhyper` is built open `gensim`. So you can get word embeddings in a keyed vectors format.\n# https://radimrehurek.com/gensim/models/keyedvectors.html\nvectors, results = bunch.svd(keyed_vectors=True)\n\nresults[\"results\"][1]\n\u003e\u003e\u003e {\"name\": \"en_ws353\",\n \"score\": 0.6510955349164682,\n \"oov\": 0.014164305949008499,\n \"fullscore\": 0.641873218557878}\n\nvectors.most_similar(\"berlin\")\n\u003e\u003e\u003e [(\"vienna\", 0.6323208808898926),\n (\"frankfurt\", 0.5965485572814941),\n (\"munich\", 0.5737138986587524),\n (\"amsterdam\", 0.5511572360992432),\n (\"stockholm\", 0.5423270463943481)]\n```\n\nCheck out the [examples](./examples).\n\nThe general concepts:\n\n-   preprocess data once and save them in a `bunch`\n-   cache all results and also record their performance on test data\n-   make it easy to fine-tune parameters for your data\n\nMore documentation may be forthcoming. Until then you have to read the [source code](./hyperhyper).\n\n## Performance Optimization\n\n### Install MKL\n\nIf you have an Intel CPU, it's recommended to use [MKL](https://en.wikipedia.org/wiki/Math_Kernel_Library) to speed up numeric executions.\nOtherwise, the default [OpenBLAS](https://en.wikipedia.org/wiki/OpenBLAS) will get installed when initially installing `hyperhyper`.\n\nIt can be challenging to correctly set up MKL.\nA conda package by Intel may help you.\n\n```bash\nconda install -c intel intelpython3_core\npip install hyperhyper\n```\n\nVerify wheter `mkl_info` is present in the numpy config:\n\n```python\n\u003e\u003e\u003e import numpy\n\u003e\u003e\u003e numpy.__config__.show()\n```\n\n### Disable Numerical Multithreading\n\nFurther, disable the internal multithreading ability of MKL or OpenBLAS (numerical libraries).\nThis speeds up computation because you should do multiprocessing on an outer loop anyhow.\nBut you can also leave the default to take advantage of all cores for your numerical computations.\n[Some Tweets why multithreading with OpenBLAS can cause problems.](https://twitter.com/honnibal/status/1067920534585917440)\n\n```bash\nexport OPENBLAS_NUM_THREADS=1\nexport MKL_NUM_THREADS=1\n```\n\n## Background\n\n`hyperhyper` is based on research by Omer Levy et al. from 2015 ([the paper](https://aclweb.org/anthology/papers/Q/Q15/Q15-1016/)).\nThe authors published the code they used in their experiments as [Hyperwods](https://bitbucket.org/omerlevy/hyperwords).\nInitially, I [tried](https://github.com/jfilter/hyperwords) to port their original software to Python 3 but I ended up re-writing large parts of it.\nSo this package was born.\n\n\n![How pairs are counted](./docs/imgs/window.svg)\n\nThe basic idea: Construct pairs of words that appear together in sentences (within a given window size).\nThen do some math magic around matrix operations (PPMI, SVD) to get low-dimensional embeddings.\n\nThe count-based word-embeddings by `hyperhyper` are deterministic.\nSo multiple runs of experiments with identical parameters will yield the same results.\nWord2vec and others unstable.\nDue to randomness, their results will vary.\n\n`hyperhyper` is built upon the seminal Python NLP package [gensim](https://radimrehurek.com/gensim/).\n\nLimitations: With `hyperhyper` you will run into (memory) problems if you need large vocabularies (set of possible words).\nIt's fine if you have a vocabulary up until ~ 50k.\nWord2vec and fastText especially solve this [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality).\nIf you're interested in details you should read the aforementioned excellent [paper by Omer Levy et al.](https://aclweb.org/anthology/papers/Q/Q15/Q15-1016/).\n\n### Scientific Literature\n\nThis software is based on ideas stemming from the following papers:\n\n-   Improving Distributional Similarity with Lessons Learned from Word Embeddings, Omer Levy, Yoav Goldberg, Ido Dagan, TACL 2015. [Paper](https://aclweb.org/anthology/papers/Q/Q15/Q15-1016/) [Code](https://bitbucket.org/omerlevy/hyperwords)\n    \u003e Recent trends suggest that neural-network-inspired word embedding models outperform traditional count-based distributional models on word similarity and analogy detection tasks. We reveal that much of the performance gains of word embeddings are due to certain system design choices and hyperparameter optimizations, rather than the embedding algorithms themselves. Furthermore, we show that these modifications can be transferred to traditional distributional models, yielding similar gains. In contrast to prior reports, we observe mostly local or insignificant performance differences between the methods, with no global advantage to any single approach over the others.\n-   The Influence of Down-Sampling Strategies on SVD Word Embedding Stability, Johannes Hellrich, Bernd Kampe, Udo Hahn, NAACL 2019. [Paper](https://aclweb.org/anthology/papers/W/W19/W19-2003/) [Code](https://github.com/hellrich/hyperwords) [Code](https://github.com/hellrich/embedding_downsampling_comparison)\n    \u003e The stability of word embedding algorithms, i.e., the consistency of the word representations they reveal when trained repeatedly on the same data set, has recently raised concerns. We here compare word embedding algorithms on three corpora of different sizes, and evaluate both their stability and accuracy. We find strong evidence that down-sampling strategies (used as part of their training procedures) are particularly influential for the stability of SVD-PPMI-type embeddings. This finding seems to explain diverging reports on their stability and lead us to a simple modification which provides superior stability as well as accuracy on par with skip-gram embedding\n\n## Development\n\nInstall and use [poetry](https://python-poetry.org/).\n\n## Contributing\n\nIf you have a **question**, found a **bug** or want to propose a new **feature**, have a look at the [issues page](https://github.com/jfilter/hyperhyper/issues).\n\n**Pull requests** are especially welcomed when they fix bugs or improve the code quality.\n\n## Future Work / TODO\n\n-   evaluation for analogies\n-   implement counting in a more efficient programming language, e.g. Cython.\n\n## `hyperhyper`?\n\n[![Scooter – Hyper Hyper (Song)](https://img.youtube.com/vi/7Twnmhe948A/0.jpg)](https://www.youtube.com/watch?v=7Twnmhe948A \"Scooter – Hyper Hyper\")\n\n## Acknowledgments\n\nBuilding upon the work by Omer Levy et al. for [Hyperwords](https://bitbucket.org/omerlevy/hyperwords).\n\n## License\n\nBSD-2-Clause\n\n## Sponsoring\n\nThis work was created as part of a [project](https://github.com/jfilter/ptf) that was funded by the German [Federal Ministry of Education and Research](https://www.bmbf.de/en/index.html).\n\n\u003cimg src=\"./bmbf_funded.svg\"\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjfilter%2Fhyperhyper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjfilter%2Fhyperhyper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjfilter%2Fhyperhyper/lists"}