{"id":34077500,"url":"https://github.com/clips/wordkit","last_synced_at":"2025-12-14T10:05:20.276Z","repository":{"id":62589398,"uuid":"122481122","full_name":"clips/wordkit","owner":"clips","description":"Featurize words into orthographic and phonological vectors.","archived":false,"fork":false,"pushed_at":"2023-05-20T11:27:37.000Z","size":779,"stargazers_count":41,"open_issues_count":2,"forks_count":10,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-09-22T20:15:27.536Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/clips.png","metadata":{"files":{"readme":"README.MD","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-02-22T13:18:57.000Z","updated_at":"2025-05-08T07:34:00.000Z","dependencies_parsed_at":"2022-11-03T17:52:35.674Z","dependency_job_id":"ac75f384-e3f0-4828-8486-70d047825f1e","html_url":"https://github.com/clips/wordkit","commit_stats":{"total_commits":397,"total_committers":2,"mean_commits":198.5,"dds":0.007556675062972307,"last_synced_commit":"c8225246982a52aea424f55db1ce47bf3c910ff7"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/clips/wordkit","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clips%2Fwordkit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clips%2Fwordkit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clips%2Fwordkit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clips%2Fwordkit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/owners/clips","download_url":"https://codeload.github.com/clips/wordkit/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clips%2Fwordkit/sbom","scorecard":{"id":291811,"data":{"date":"2025-08-11","repo":{"name":"github.com/clips/wordkit","commit":"f10b190983c5f65fd098d105cdab0db89e8816b1"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":1.7,"checks":[{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Code-Review","score":0,"reason":"Found 0/19 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 
0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses 
fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: GNU General Public License v3.0: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Vulnerabilities","score":0,"reason":"15 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: PYSEC-2021-356 / GHSA-2ww3-fxvq-293j","Warn: Project is vulnerable to: PYSEC-2024-167 / GHSA-cgvx-9447-vcch","Warn: Project is vulnerable to: PYSEC-2021-859 / GHSA-f8m6-h2c7-8h9x","Warn: Project is vulnerable to: PYSEC-2019-106 / GHSA-mr7p-25v2-35wr","Warn: Project is vulnerable to: PYSEC-2022-5 / GHSA-rqjh-jp2r-59cj","Warn: Project is vulnerable to: PYSEC-2021-856 / GHSA-5545-2q6w-2gh6","Warn: Project is vulnerable to: GHSA-6p56-wp2h-9hxr","Warn: Project is vulnerable to: PYSEC-2019-108 / GHSA-9fq2-x9r6-wfmf","Warn: Project is vulnerable to: PYSEC-2021-857 / GHSA-f7c7-j99h-c22f","Warn: Project is vulnerable to: 
GHSA-fpfv-jqm9-f5jm","Warn: Project is vulnerable to: PYSEC-2017-1 / GHSA-frgw-fgh6-9g52","Warn: Project is vulnerable to: PYSEC-2020-73","Warn: Project is vulnerable to: PYSEC-2025-49 / GHSA-5rjg-fvgr-3xxf","Warn: Project is vulnerable to: GHSA-cx63-2mw6-8hw5","Warn: Project is vulnerable to: PYSEC-2022-43012 / GHSA-r9hx-vwmv-q579"],"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 12 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-17T18:26:59.212Z","repository_id":62589398,"created_at":"2025-08-17T18:26:59.212Z","updated_at":"2025-08-17T18:26:59.212Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":27725961,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-14T02:00:11.348Z","response_time":56,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-12-14T10:05:14.680Z","updated_at":"2025-12-14T10:05:20.271Z","avatar_url":"https://g
ithub.com/clips.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# wordkit\n\n![overview](images/wordkit_overview.png)\n\nThis is the repository of the `wordkit` package, a `Python 3.X` package for the featurization of words into orthographic and phonological vectors.\n\n## Overview\n\n`wordkit` is a package for working with words.\nThe package contains a variety of functions that allow you to:\n* Extract words from lexical databases in a structured format.\n* Normalize phonological strings across languages and databases.\n* Featurize words for use in computational psycholinguistic models using the following features:\n    * Open ngrams\n    * Character ngrams\n    * Holographic features\n    * Consonant Vowel mapping (patpho)\n    * Onset Nucleus Coda mapping\n* Find synonyms, homographs, and homophones across languages.\n* Fuse lexical databases, also crosslingually.\n* Sample from (subsets of) corpora by frequency of occurrence.\n\nand much more.\n\n## Installation\n\n`wordkit` is on pip.\n\n`pip install wordkit`\n\n## Examples\n\nSee the [examples](examples) for some ways in which you can use `wordkit`.\nAll examples assume you have wordkit installed (see above).\n\n## More\n\nIf, after working through the examples, you want to dive deeper into `wordkit`, check out the following documentation.\n\n`wordkit` is a modular system, and contains two broad families of components.\nThe subpackages are documented using separate `README.MD` files.\nFeel free to click ahead to find descriptions of the contents of subpackages.\n\n* [corpora](wordkit/corpora)\n* [features](wordkit/features)\n\nIn general, a `wordkit` pipeline consists of one or more readers, which extract structured information from corpora.\nThis information is then sent to one or more transformers, which are assigned either pre-defined features or a feature extractor.\n\n## Paper\n\nA paper that describes `wordkit` was accepted at LREC 2018.\nIf you use `wordkit` 
in your research, please cite the following paper:\n\n```\n@InProceedings{TULKENS18.249,\n  author = {Tulkens, Stéphan and Sandra, Dominiek and Daelemans, Walter},\n  title = {WordKit: a Python Package for Orthographic and Phonological Featurization},\n  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},\n  year = {2018},\n  month = {may},\n  date = {7-12},\n  location = {Miyazaki, Japan},\n  editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},\n  publisher = {European Language Resources Association (ELRA)},\n  address = {Paris, France},\n  isbn = {979-10-95546-00-9},\n  language = {english}\n  }\n```\n\n\nAdditionally, if you use any of the corpus readers in `wordkit`, you MUST cite the accompanying corpora and transformers.\nAll of these references can be found in the docstrings of the applicable classes.\n\n## Example\n\nThis example shows one big `wordkit` pipeline.\n\n```python\nimport pandas as pd\n\nfrom wordkit.corpora import celex_english, celex_dutch\nfrom wordkit.features import LinearTransformer, NGramTransformer, fourteen\nfrom string import ascii_lowercase\n\n# The fields we want to extract from our corpora.\nfields = ('orthography', 'frequency', 'phonology', 'syllables')\n\n# Link to epw.cd\nenglish = celex_english(\"epw.cd\",\n                        fields=fields)\n# Link to dpw.cd\ndutch = celex_dutch(\"dpw.cd\",\n                    fields=fields)\n\n# Merge both corpora and renumber the rows.\nwords = pd.concat([english, dutch], sort=False).reset_index(drop=True)\n\n# We filter both corpora to only contain monosyllables and words\n# with only alphabetical characters.\nwords = words[[len(x) == 1 for x in words[\"syllables\"]]]\nwords = words[[not set(x) - set(ascii_lowercase)\n 
             for x in words[\"orthography\"]]]\n\n# words.iloc[0] =\u003e\n# orthography                      a\n# phonology                   (e, ɪ)\n# syllables                ((e, ɪ),)\n# frequency                   844672\n# log_frequency              5.92669\n# frequency_per_million        21363\n# zipf_score                 4.32966\n# length                           1\n\n# You can also query specific words\nwind = words[words['orthography'] == \"wind\"]\n\n# This gives\n# wind =\u003e\n#        orthography        phonology  ... zipf_score  length\n# 146523        wind  (w, a, ɪ, n, d)  ...   0.015757       4\n# 146524        wind     (w, ɪ, n, d)  ...   1.683096       4\n# 313527        wind     (w, ɪ, n, t)  ...   2.042675       4\n\n# Now, let's transform into features.\n# Orthography is a linear transformer with the fourteen segment feature set.\no = LinearTransformer(fourteen, field='orthography')\n# For phonology we use ngrams.\np = NGramTransformer(n=3, field='phonology')\n\nX_o = o.fit_transform(words)\nX_p = p.fit_transform(words)\n\n# Get the feature vector length for each featurizer.\no.vec_len # 126\np.vec_len # 5415\n```\n\n## Corpora\n\n`wordkit` currently offers readers for the following corpora.\nNote that, while we offer predefined fields for all these corpora, any fields present in these data can be retrieved by `wordkit` in addition to the fields we define.\nThe Lexicon Projects, for example, also contain lexicality information, accuracy information, and so on.\nThese can be retrieved by passing the appropriate field names to the `fields` argument.\n\n#### BPAL\n\n[Download](http://www.pc.rhul.ac.uk/staff/c.davis/Utilities/B-Pal.zip)\n\nYou have to extract the `nwphono.txt` file from the `.exe` file.\nThe corpus is not available for download in a more practical format.\n\n[Publication](http://www.pc.rhul.ac.uk/staff/c.davis/Articles/Davis_Perea_in_press.pdf)\n\n```\nFields:     Orthography, Phonology, Frequency\nLanguages:  Spanish\n```\n\n#### 
Celex\n\nCurrently not freely available.  \n\n```\nFields:     Orthography, Phonology, Syllables, Frequency\nLanguages:  Dutch, German, English\n```\n\n**WARNING:** the Celex frequency norms are no longer thought to be correct. Please use the `SUBTLEX` frequencies instead.\nYou can use the Celex corpus with `SUBTLEX` frequency norms by using a pandas merge.\nIf you use `CELEX` frequency norms at a psycholinguistic conference, you _will_ get yelled at.\n\n#### CMUDICT\n\n[Download](https://github.com/cmusphinx/cmudict)\n\nWe can read the `cmudict.dict` file from the above repository.\n\n```\nFields:     Orthography, Syllables  \nLanguages:  American English\n```\n\n#### Deri\n\n[Download](https://drive.google.com/open?id=0B7R_gATfZJ2aUWsxbF9CSEo1Y00)\n\nDownload the `pron_data.tar.gz` file and unzip it. We use the `gold_data_train` file.\n\n[Publication](http://isi.edu/~aderi/papers/g2p.pdf)\n\n```\nFields:     Orthography, Phonology  \nLanguages:  lots\n```\n\n**WARNING:** we manually checked the Dutch, Spanish and German phonologies in this corpus, and a lot of them seem to be incorrectly transcribed or extracted. Only use this corpus if you don't have another resource for your language.\n\n#### Lexique\n\n[Download](http://lexique.org/telLexique.php)  \n\nDownload the zip file; we use the `lexique382.txt` file.\n\n[Publication](https://link.springer.com/content/pdf/10.3758/BF03195598.pdf)\n\nNote that this is the publication for Lexique version 2. 
Lexique 3 does not seem to have an associated publication in English.\n\n```\nFields:     Orthography, Phonology, Frequency, Syllables  \nLanguages:  French\n```\n\n**NOTE:** the currently implemented reader is for version 3.82 of Lexique (the most recent version as of May 2018).\n\n#### SUBTLEX\n\nCheck the link below for the various SUBTLEX corpora and their associated publications.\nWe support all of the formats listed there.\n\n[Link](http://crr.ugent.be/programs-data/subtitle-frequencies)\n\n```\nFields:     Orthography, Frequency  \nLanguages:  Dutch, American English, Greek,\n            British English, Polish, Chinese,\n            Spanish\n```\n\n#### Wordnet\n\nWe support all the tab-separated formats of the open multilingual WordNet.\nIf you use any of these WordNets, please cite the appropriate source, as well as the official WordNet reference.\n\n[Link](http://compling.hss.ntu.edu.sg/omw/)\n\n```\nFields: Orthography, Semantics\nLanguages: lots\n```\n\n#### Lexicon projects\n\nWe support all lexicon projects.\nThese contain RT data with which you can validate models.\n\n[Link](http://crr.ugent.be/programs-data/lexicon-projects)\n\n```\nFields: Orthography, rt\nLanguages: Dutch, British English, American English, French\n```\n\n## Experiments\n\nThe code for replicating the experiments in the `wordkit` paper can be found [here](https://github.com/stephantul/lrec2018).\n\n## Requirements\n\n- ipapy\n- numpy\n- pandas\n- reach (for the semantics)\n- nltk (for wordnet-related semantics)\n\n\n## Contributors\n\nStéphan Tulkens\n\n## License\n\nGPL v3\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclips%2Fwordkit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fclips%2Fwordkit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclips%2Fwordkit/lists"}