{"id":18557720,"url":"https://github.com/aosingh/lexpy","last_synced_at":"2025-09-15T08:17:28.131Z","repository":{"id":54308995,"uuid":"108201578","full_name":"aosingh/lexpy","owner":"aosingh","description":"Python package for lexicon; Trie and DAWG implementation.","archived":false,"fork":false,"pushed_at":"2024-12-01T05:10:18.000Z","size":14883,"stargazers_count":55,"open_issues_count":4,"forks_count":7,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-08-27T23:05:49.331Z","etag":null,"topics":["dawg","directed-acyclic-word-graph","graph","lexicon","suffix-tree","trie"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aosingh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-10-25T01:02:22.000Z","updated_at":"2024-08-19T13:59:36.000Z","dependencies_parsed_at":"2024-03-17T05:27:47.765Z","dependency_job_id":"369500a0-430a-4b78-8d7f-2407eea55bb9","html_url":"https://github.com/aosingh/lexpy","commit_stats":{"total_commits":98,"total_committers":4,"mean_commits":24.5,"dds":0.09183673469387754,"last_synced_commit":"dede12485e1ed9d5c624a6c4d44856ea43329771"},"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/aosingh/lexpy","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aosingh%2Flexpy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aosingh%2Flexpy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aosingh%2Flexpy/releases","m
anifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aosingh%2Flexpy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aosingh","download_url":"https://codeload.github.com/aosingh/lexpy/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aosingh%2Flexpy/sbom","scorecard":{"id":201030,"data":{"date":"2025-08-11","repo":{"name":"github.com/aosingh/lexpy","commit":"422446c21df6d7fb329a2ae518b85730d6b6eb5f"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3.4,"checks":[{"name":"Code-Review","score":0,"reason":"Found 1/18 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Dangerous-Workflow","score":10,"reason":"no dangerous workflow patterns detected","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices 
Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"Token-Permissions","score":0,"reason":"detected GitHub workflow tokens with excessive permissions","details":["Warn: no topLevel permission defined: .github/workflows/lexpy_build.yaml:1","Info: no jobLevel write permissions found"],"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security 
policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: GNU General Public License v3.0: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":0,"reason":"Project has not signed or included provenance with any releases.","details":["Warn: release artifact v1.1.0 not signed: https://api.github.com/repos/aosingh/lexpy/releases/159598474","Warn: release artifact v1.0.0 not signed: https://api.github.com/repos/aosingh/lexpy/releases/75726175","Warn: release artifact v1.1.0 does not have provenance: https://api.github.com/repos/aosingh/lexpy/releases/159598474","Warn: release artifact v1.0.0 does not have provenance: https://api.github.com/repos/aosingh/lexpy/releases/75726175"],"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":-1,"reason":"internal error: error during branchesHandler.setup: internal error: githubv4.Query: Resource not accessible by integration","details":null,"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection 
settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 16 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Pinned-Dependencies","score":0,"reason":"dependency not pinned by hash detected -- score normalized to 0","details":["Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/lexpy_build.yaml:16: update your workflow using https://app.stepsecurity.io/secureworkflow/aosingh/lexpy/lexpy_build.yaml/main?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/lexpy_build.yaml:19: update your workflow using https://app.stepsecurity.io/secureworkflow/aosingh/lexpy/lexpy_build.yaml/main?enable=pin","Warn: pipCommand not pinned by hash: .github/workflows/lexpy_build.yaml:25","Warn: pipCommand not pinned by hash: .github/workflows/lexpy_build.yaml:26","Warn: pipCommand not pinned by hash: .github/workflows/lexpy_build.yaml:27","Info:   0 out of   2 GitHub-owned GitHubAction dependencies pinned","Info:   0 out of   3 pipCommand dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build 
process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}}]},"last_synced_at":"2025-08-16T22:51:25.003Z","repository_id":54308995,"created_at":"2025-08-16T22:51:25.003Z","updated_at":"2025-08-16T22:51:25.003Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":275225974,"owners_count":25427022,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-15T02:00:09.272Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dawg","directed-acyclic-word-graph","graph","lexicon","suffix-tree","trie"],"created_at":"2024-11-06T21:37:49.924Z","updated_at":"2025-09-15T08:17:28.098Z","avatar_url":"https://github.com/aosingh.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Lexpy\n\n[![lexpy](https://github.com/aosingh/lexpy/actions/workflows/lexpy_build.yaml/badge.svg)](https://github.com/aosingh/lexpy/actions)\n[![Downloads](https://pepy.tech/badge/lexpy)](https://pepy.tech/project/lexpy)\n[![PyPI version](https://badge.fury.io/py/lexpy.svg)](https://pypi.python.org/pypi/lexpy)\n\n[![Python 3.7](https://img.shields.io/badge/python-3.7-blue.svg)](https://www.python.org/downloads/release/python-370/)\n[![Python 3.8](https://img.shields.io/badge/python-3.8-blue.svg)](https://www.python.org/downloads/release/python-380/)\n[![Python 
3.9](https://img.shields.io/badge/python-3.9-blue.svg)](https://www.python.org/downloads/release/python-390/)\n[![Python 3.10](https://img.shields.io/badge/python-3.10-blue.svg)](https://www.python.org/downloads/release/python-3100/)\n[![Python 3.11](https://img.shields.io/badge/python-3.11-blue.svg)](https://www.python.org/downloads/release/python-3110/)\n[![Python 3.12](https://img.shields.io/badge/python-3.12-blue.svg)](https://www.python.org/downloads/release/python-3120/)\n\n\n\n[![PyPy3.7](https://img.shields.io/badge/python-PyPy3.7-blue.svg)](https://www.pypy.org/download.html)\n[![PyPy3.8](https://img.shields.io/badge/python-PyPy3.8-blue.svg)](https://www.pypy.org/download.html)\n[![PyPy3.9](https://img.shields.io/badge/python-PyPy3.9-blue.svg)](https://www.pypy.org/download.html)\n\n\n\n- A lexicon is a data structure that stores a set of words. The difference between \na dictionary and a lexicon is that in a lexicon there are **no values** associated with the words. \n\n- A lexicon is similar to a list or a set of words, but the internal representation is different and optimized\nfor faster searches of words, prefixes and wildcard patterns. \n\n- More precisely, the search time for a given word is O(W), where W is the length of the word. \n\n- Two important lexicon data structures are **_Trie_** and **_Directed Acyclic Word Graph (DAWG)_**.\n\n# Install\n\n`lexpy` can be installed from the Python Package Index (`PyPI`) using `pip`. 
The only installation requirement is Python 3.7 or higher.\n\n```commandline\npip install lexpy\n```\n\n# Interface\n\n| **Interface Description**                                                                                                     \t| **Trie**                           \t| **DAWG**                           \t|\n|-------------------------------------------------------------------------------------------------------------------------------\t|------------------------------------------\t|------------------------------------------\t|\n| Add a single word                                                                                                             \t| `add('apple', count=2)`                            \t| `add('apple', count=2)`                            \t|\n| Add multiple words                                                                                                            \t| `add_all(['advantage', 'courage'])`       \t| `add_all(['advantage', 'courage'])`       \t|\n| Check if a word exists                                                                                                        \t| `in` operator                             \t| `in` operator                             \t|\n| Search using wildcard expression                                                                                              \t| `search('a?b*', with_count=True)`            | `search('a?b*', with_count=True)`            |\n| Search for prefix matches                                                                                                     \t| `search_with_prefix('bar', with_count=True)` | `search_with_prefix('bar')`               \t|\n| Search for similar words within a given edit distance. 
Here, the notion of edit distance is the same as Levenshtein distance \t| `search_within_distance('apble', dist=1, with_count=True)` \t| `search_within_distance('apble', dist=1, with_count=True)` \t|\n| Get the number of nodes in the automaton \t| `len(trie)` \t| `len(dawg)` \t|\n\n\n# Examples\n\n## Trie\n\n### Build from an input list, set, or tuple of words.\n\n```python\nfrom lexpy import Trie\n\ntrie = Trie()\n\ninput_words = ['ampyx', 'abuzz', 'athie', 'athie', 'athie', 'amato', 'amato', 'aneto', 'aneto', 'aruba', \n               'arrow', 'agony', 'altai', 'alisa', 'acorn', 'abhor', 'aurum', 'albay', 'arbil', 'albin', \n               'almug', 'artha', 'algin', 'auric', 'sore', 'quilt', 'psychotic', 'eyes', 'cap', 'suit', \n               'tank', 'common', 'lonely', 'likeable' 'language', 'shock', 'look', 'pet', 'dime', 'small' \n               'dusty', 'accept', 'nasty', 'thrill', 'foot', 'steel', 'steel', 'steel', 'steel', 'abuzz']\n\ntrie.add_all(input_words) # You can pass any sequence type or a file-like object here\n\nprint(trie.get_word_count())\n\n\u003e\u003e\u003e 48\n```\n\n### Build from a file or file path.\n\nIn the file, words should be newline-separated.\n\n```python\n\nfrom lexpy import Trie\n\n# Either\ntrie = Trie()\ntrie.add_all('/path/to/file.txt')\n\n# Or\nwith open('/path/to/file.txt', 'r') as infile:\n    trie.add_all(infile)\n\n```\n\n### Check if exists using the `in` operator\n\n```python\nprint('ampyx' in trie)\n\n\u003e\u003e\u003e True\n```\n\n### Prefix search\n\n```python\nprint(trie.search_with_prefix('ab'))\n\n\u003e\u003e\u003e ['abhor', 'abuzz']\n```\n\n```python\n\nprint(trie.search_with_prefix('ab', with_count=True))\n\n\u003e\u003e\u003e [('abuzz', 2), ('abhor', 1)]\n\n```\n\n### Wildcard search using `?` and `*`\n\n- `?` = 0 or 1 occurrence of any character\n\n- `*` = 0 or more occurrences of any character\n\n```python\nprint(trie.search('a*o*'))\n\n\u003e\u003e\u003e ['amato', 'abhor', 'aneto', 'arrow', 'agony', 
'acorn']\n\nprint(trie.search('a*o*', with_count=True))\n\n\u003e\u003e\u003e [('amato', 2), ('abhor', 1), ('aneto', 2), ('arrow', 1), ('agony', 1), ('acorn', 1)]\n\nprint(trie.search('su?t'))\n\n\u003e\u003e\u003e ['suit']\n\nprint(trie.search('su?t', with_count=True))\n\n\u003e\u003e\u003e [('suit', 1)]\n\n```\n\n### Search for similar words using the notion of Levenshtein distance\n\n```python\nprint(trie.search_within_distance('arie', dist=2))\n\n\u003e\u003e\u003e ['athie', 'arbil', 'auric']\n\nprint(trie.search_within_distance('arie', dist=2, with_count=True))\n\n\u003e\u003e\u003e [('athie', 3), ('arbil', 1), ('auric', 1)]\n\n```\n\n### Increment word count\n\n- You can either add a new word or increment the counter for an existing word.\n\n```python\n\ntrie.add('athie', count=1000)\n\nprint(trie.search_within_distance('arie', dist=2, with_count=True))\n\n\u003e\u003e\u003e [('athie', 1003), ('arbil', 1), ('auric', 1)]\n```\n\n# Directed Acyclic Word Graph (DAWG)\n\n- DAWG supports the same set of operations as a Trie. The difference is the number of nodes in a DAWG is always\nless than or equal to the number of nodes in Trie. \n\n- They both are Deterministic Finite State Automata. However, DAWG is a minimized version of the Trie DFA.\n\n- In a Trie, prefix redundancy is removed. In a DAWG, both prefix and suffix redundancies are removed.\n\n- In the current implementation of DAWG, the insertion order of the words should be **alphabetical**.\n\n- The implementation idea of DAWG is borrowed from http://stevehanov.ca/blog/?id=115\n\n\n```python\nfrom lexpy import Trie, DAWG\n\ntrie = Trie()\ntrie.add_all(['advantageous', 'courageous'])\n\ndawg = DAWG()\ndawg.add_all(['advantageous', 'courageous'])\n\nlen(trie) # Number of Nodes in Trie\n23\n\ndawg.reduce() # Perform DFA minimization. 
Call this every time a chunk of words is added to the DAWG.\n\nlen(dawg) # Number of nodes in DAWG\n21\n\n```\n\n## DAWG\n\nThe APIs are exactly the same as the Trie APIs.\n\n### Build a DAWG\n\n```python\nfrom lexpy import DAWG\ndawg = DAWG()\n\ninput_words = ['ampyx', 'abuzz', 'athie', 'athie', 'athie', 'amato', 'amato', 'aneto', 'aneto', 'aruba', \n               'arrow', 'agony', 'altai', 'alisa', 'acorn', 'abhor', 'aurum', 'albay', 'arbil', 'albin', \n               'almug', 'artha', 'algin', 'auric', 'sore', 'quilt', 'psychotic', 'eyes', 'cap', 'suit', \n               'tank', 'common', 'lonely', 'likeable' 'language', 'shock', 'look', 'pet', 'dime', 'small' \n               'dusty', 'accept', 'nasty', 'thrill', 'foot', 'steel', 'steel', 'steel', 'steel', 'abuzz']\n\n\ndawg.add_all(input_words)\ndawg.reduce()\n\ndawg.get_word_count()\n\n\u003e\u003e\u003e 48\n\n```\n\n### Check if exists using the `in` operator\n\n```python\nprint('ampyx' in dawg)\n\n\u003e\u003e\u003e True\n```\n\n### Prefix search\n\n```python\nprint(dawg.search_with_prefix('ab'))\n\n\u003e\u003e\u003e ['abhor', 'abuzz']\n```\n\n```python\n\nprint(dawg.search_with_prefix('ab', with_count=True))\n\n\u003e\u003e\u003e [('abuzz', 2), ('abhor', 1)]\n\n```\n\n### Wildcard search using `?` and `*`\n\n`?` = 0 or 1 occurrence of any character\n\n`*` = 0 or more occurrences of any character\n\n```python\nprint(dawg.search('a*o*'))\n\n\u003e\u003e\u003e ['amato', 'abhor', 'aneto', 'arrow', 'agony', 'acorn']\n\nprint(dawg.search('a*o*', with_count=True))\n\n\u003e\u003e\u003e [('amato', 2), ('abhor', 1), ('aneto', 2), ('arrow', 1), ('agony', 1), ('acorn', 1)]\n\nprint(dawg.search('su?t'))\n\n\u003e\u003e\u003e ['suit']\n\nprint(dawg.search('su?t', with_count=True))\n\n\u003e\u003e\u003e [('suit', 1)]\n\n```\n\n### Search for similar words using the notion of Levenshtein distance\n\n```python\nprint(dawg.search_within_distance('arie', dist=2))\n\n\u003e\u003e\u003e ['athie', 'arbil', 
'auric']\n\nprint(dawg.search_within_distance('arie', dist=2, with_count=True))\n\n\u003e\u003e\u003e [('athie', 3), ('arbil', 1), ('auric', 1)]\n\n```\n\n### Alphabetical order insertion\n\nIf you insert a word that is lexicographically out of order, a ``ValueError`` is raised.\n```python\ndawg.add('athie', count=1000)\n```\nValueError\n\n```text\nValueError: Words should be inserted in Alphabetical order. \u003cPrevious word - thrill\u003e, \u003cCurrent word - athie\u003e\n```\n\n### Increment the word count\n\n- You can either add an alphabetically greater word with a specific count or increment the count of the previously added word.\n\n```python\n\n\ndawg.add_all(['thrill']*20000) # or dawg.add('thrill', count=20000)\n\nprint(dawg.search('thrill', with_count=True))\n\n\u003e\u003e\u003e [('thrill', 20001)]\n\n```\n\n## Special Characters\n\nSpecial characters, except `?` and `*`, are matched literally. \n\n```python\nfrom lexpy import Trie\nt = Trie()\nt.add('a©')\n```\n\n```python\nt.search('a©')\n\u003e\u003e\u003e ['a©']\n\n```\n\n```python\nt.search('a?')\n\u003e\u003e\u003e ['a©']\n```\n\n```python\nt.search('?©')\n\u003e\u003e\u003e ['a©']\n```\n\n## Trie vs DAWG\n\n\n![Number of nodes comparison](https://github.com/aosingh/lexpy/blob/main/lexpy_trie_dawg_nodes.png)\n\n![Build time comparison](https://github.com/aosingh/lexpy/blob/main/lexpy_trie_dawg_time.png)\n\n\n\n# Future Work\n\nThese are some ideas that I would love to work on next, in this order. Pull requests and discussions are welcome.\n\n- Merge trie and DAWG features in one data structure\n  -  Support all functionality and still be as compressed as possible.\n- Serialization / Deserialization\n    - Pickle is definitely an option. \n- Server (TCP or HTTP) to serve queries over the network.\n\n\n# Fun Facts\n1. The 45-letter word pneumonoultramicroscopicsilicovolcanoconiosis is the longest English word that appears in a major dictionary.\nSo for all English words, the search time is bounded by O(45). \n2. 
The longest technical word (not in any dictionary) is the chemical name of the protein [titin](https://en.wikipedia.org/wiki/Titin). It has 189,819\nletters, and it is disputed whether it is a word.\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faosingh%2Flexpy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faosingh%2Flexpy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faosingh%2Flexpy/lists"}