# Context Encoders (ConEc)

With this code you can train and evaluate Context Encoders (ConEc), an extension of word2vec that learns word embeddings from large corpora, creates out-of-vocabulary embeddings on the spot, and distinguishes between multiple meanings of a word based on its local context.
For further details on the model and experiments, please refer to the [paper](https://arxiv.org/abs/1706.02496) - and of course, if any of this code was helpful for your research, please consider citing it:
```
    @inproceedings{horn2017conecRepL4NLP,
      author       = {Horn, Franziska},
      title        = {Context encoders as a simple but powerful extension of word2vec},
      booktitle    = {Proceedings of the 2nd Workshop on Representation Learning for NLP},
      year         = {2017},
      organization = {Association for Computational Linguistics},
      pages        = {10--14}
    }
```

The code is intended for research purposes. It should run with both Python 2.7 and Python 3, though with no guarantees (please open an issue if you find a bug)!

### installation

Either download the code from here and include the conec folder in your `$PYTHONPATH`, or install the library components only via pip:
```
$ pip install conec
```

### conec library components

dependencies: `numpy, scipy`

- `word2vec.py`: code to train a standard word2vec model, adapted from the corresponding [gensim](https://radimrehurek.com/gensim/) implementation.
- `context2vec.py`: code to build a sparse context matrix from a large collection of texts; this context matrix can then be multiplied with the corresponding word2vec embeddings to give the context encoder embeddings:

```python
import numpy as np
from conec import word2vec, context2vec

# get the text for training (Text8Corpus is adapted from gensim)
sentences = word2vec.Text8Corpus('data/text8')
# train the word2vec model
w2v_model = word2vec.Word2Vec(sentences, mtype='cbow', hs=0, neg=13, vector_size=200, seed=3)
# get the global context matrix for the text
context_model = context2vec.ContextModel(sentences, min_count=w2v_model.min_count,
                                         window=w2v_model.window, wordlist=w2v_model.wv.index2word)
context_mat = context_model.get_context_matrix(fill_diag=False, norm='max')
# multiply the context matrix with the (length-normalized) word2vec embeddings
# to get the context encoder (ConEc) embeddings
conec_emb = context_mat.dot(w2v_model.wv.vectors_norm)
# renormalize so the word embeddings have unit length again
conec_emb = conec_emb / np.linalg.norm(conec_emb, axis=1, keepdims=True)
```


### examples

additional dependencies: `sklearn`

`test_analogy.py` and `test_ner.py` contain the code to replicate the analogy and named entity recognition (NER) experiments discussed in the aforementioned paper.

To run the analogy experiment, it is assumed that the [`text8 corpus`](http://mattmahoney.net/dc/text8.zip) or the [`1-billion corpus`](http://code.google.com/p/1-billion-word-language-modeling-benchmark/) as well as the [`analogy questions`](https://code.google.com/archive/p/word2vec/) are in a `data` directory.

To run the named entity recognition experiment, it is assumed that the corresponding [`training and test files`](http://www.cnts.ua.ac.be/conll2003/ner/) are located in the `data/conll2003` directory.


If you have any questions, please don't hesitate to send me an [email](mailto:cod3licious@gmail.com), and of course, if you should find any bugs or want to contribute other improvements, pull requests are very welcome!
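The combination step at the heart of ConEc is just a (sparse) matrix multiplication followed by renormalization. Here is a minimal numpy-only sketch of that math with random stand-in data; the array names and sizes are purely illustrative, not part of conec's API:

```python
import numpy as np

rng = np.random.default_rng(3)
n_words, dim = 5, 4

# stand-in for the length-normalized word2vec embedding matrix
w2v_emb = rng.normal(size=(n_words, dim))
w2v_emb /= np.linalg.norm(w2v_emb, axis=1, keepdims=True)

# stand-in for the global context matrix (rows: words, cols: context words),
# row-normalized by its maximum entry, mimicking norm='max'
context_mat = rng.random(size=(n_words, n_words))
context_mat /= context_mat.max(axis=1, keepdims=True)

# ConEc embeddings: each word's vector becomes a
# context-weighted average of the word2vec embeddings
conec_emb = context_mat.dot(w2v_emb)

# renormalize so every embedding has unit length again
conec_emb /= np.linalg.norm(conec_emb, axis=1, keepdims=True)
```

In the real pipeline the context matrix is a scipy sparse matrix built from the whole corpus, but the arithmetic is the same.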
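The out-of-vocabulary embeddings mentioned above follow the same idea for a single unseen word: count the context words around its occurrences and take the count-weighted average of their trained embeddings. A toy sketch of that idea (the vocabulary, vectors, and variable names are all made up for illustration):

```python
import numpy as np

# tiny stand-in vocabulary with random unit-length embeddings
vocab = {"the": 0, "cat": 1, "sat": 2, "mat": 3}
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(vocab), 4))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# local context of an unseen word, e.g. "dog" in "the dog sat on the mat"
context_words = ["the", "sat", "the", "mat"]
counts = np.zeros(len(vocab))
for w in context_words:
    counts[vocab[w]] += 1

# OOV embedding: context-count-weighted average of the known
# embeddings, renormalized to unit length
oov_vec = counts.dot(emb)
oov_vec /= np.linalg.norm(oov_vec)
```

Because the context counts come from the word's local occurrences, the same mechanism also yields different vectors for different senses of a word.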