{"id":18541861,"url":"https://github.com/cltk/enm_models_cltk","last_synced_at":"2026-02-08T09:33:28.826Z","repository":{"id":76838897,"uuid":"541765096","full_name":"cltk/enm_models_cltk","owner":"cltk","description":"Models for Middle English provided by CLTK","archived":false,"fork":false,"pushed_at":"2023-10-28T23:38:23.000Z","size":5165,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-08-12T21:32:43.798Z","etag":null,"topics":["cltk","enm","middle-english","nlp"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cltk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-09-26T20:06:49.000Z","updated_at":"2024-10-26T04:50:43.000Z","dependencies_parsed_at":null,"dependency_job_id":"69bb50a2-50d0-4d08-99b9-3d245f66ec2b","html_url":"https://github.com/cltk/enm_models_cltk","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/cltk/enm_models_cltk","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cltk%2Fenm_models_cltk","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cltk%2Fenm_models_cltk/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cltk%2Fenm_models_cltk/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cltk%2Fenm_models_cltk/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cltk","download_url":"https://codeload.github.com/cltk/enm_models_cltk/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cltk%2Fenm_models_cltk/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29226470,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-08T09:15:18.648Z","status":"ssl_error","status_checked_at":"2026-02-08T09:14:33.745Z","response_time":57,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cltk","enm","middle-english","nlp"],"created_at":"2024-11-06T20:06:32.424Z","updated_at":"2026-02-08T09:33:28.806Z","avatar_url":"https://github.com/cltk.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# CLTK models for Middle English\n\n## How the word2vec model was trained\n\nWe first need to install some mathematics and machine learning packages.\n\n```bash\npip install scipy sklearn gensim\n```\n\n\nWe import a corpus of Middle English texts\n```python\nimport pickle\nwith open('./middle_english_txt-.p','rb') as p:\n    corpus = pickle.load(p)\n```\n\nThe normalization process involves removal of punctuation, special characters and numbers.\n\n```python\nCHARACTERS_TO_REMOVE = \"...\"\n\ndef remove_punc(text):\n    for ele in text:\n        if ele in CHARACTERS_TO_REMOVE:\n            return text.lower().replace(ele, \"\")\n        else:\n            return text.lower()\n            \nnew_data = []\n\nfor texts in corpus:\n    part_text = []\n    for word in texts:\n        part_text.append(remove_punc(word))\n    new_data.append(part_text)\n```\n\nThe data was in a form of a list of lists of strings or a list of sentences, where a sentence is a list of words.\n\nThen we use `Word2Vec` class from **gensim**\n```python\n# time to try gensim to create word2vecs\n# see NLP in action, 6.2.4\nfrom gensim.models.word2vec import Word2Vec\n\nnum_features = 50\nmin_word_count = 30\nnum_workers = 2\nwindow_size = 20\nsubsampling = 1e-3\n\nmodel = Word2Vec(\n    new_data,\n    workers=num_workers,\n    vector_size=num_features,\n    min_count=min_word_count,\n    window=window_size,\n    sample=subsampling)\n\nmodel.init_sims(replace=True)\nme_w2v_model = \"me_word_embeddings_model.bin\"\nmodel.save(me_w2v_model)\n\n```\n\nThe model is now saved in the file `me_word_embeddings_model.bin`.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcltk%2Fenm_models_cltk","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcltk%2Fenm_models_cltk","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcltk%2Fenm_models_cltk/lists"}