{"id":18264565,"url":"https://github.com/modeltc/general-sam-py","last_synced_at":"2025-04-04T21:30:36.940Z","repository":{"id":199997855,"uuid":"704571667","full_name":"ModelTC/general-sam-py","owner":"ModelTC","description":"Python bindings for general-sam and some utilities","archived":false,"fork":false,"pushed_at":"2024-10-18T05:48:47.000Z","size":86,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":8,"default_branch":"main","last_synced_at":"2024-10-18T08:34:43.515Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ModelTC.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE-APACHE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-13T14:44:20.000Z","updated_at":"2024-10-18T05:48:38.000Z","dependencies_parsed_at":"2024-10-20T13:07:01.438Z","dependency_job_id":null,"html_url":"https://github.com/ModelTC/general-sam-py","commit_stats":null,"previous_names":["modeltc/general-sam-py"],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ModelTC%2Fgeneral-sam-py","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ModelTC%2Fgeneral-sam-py/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ModelTC%2Fgeneral-sam-py/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ModelTC%2Fgeneral-sam-py/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ModelTC","download_url":"https://codeload.github.com/ModelTC/general-sam-py/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247251973,"owners_count":20908601,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-05T11:15:08.741Z","updated_at":"2025-04-04T21:30:36.630Z","avatar_url":"https://github.com/ModelTC.png","language":"Python","readme":"# general-sam-py\n\n[![PyPI version](https://img.shields.io/pypi/v/general-sam.svg)](https://pypi.org/project/general-sam/)\n[![License](https://img.shields.io/badge/license-MIT%2FApache--2.0-informational.svg)](#license)\n[![Build status](https://github.com/ModelTC/general-sam-py/actions/workflows/ci.yml/badge.svg)](https://github.com/ModelTC/general-sam-py/actions)\n\nPython bindings for [`general-sam`](https://github.com/ModelTC/general-sam)\nand some utilities.\n\n```mermaid\nflowchart LR\n  init((ε))\n  a((a))\n  b((b))\n  ab((ab))\n  bc(((bc)))\n  abc((abc))\n  abcb((abcb))\n  abcbc(((abcbc)))\n\n  init -- a --\u003e a\n  init -- b --\u003e b\n  a -- b --\u003e ab\n  b -- c --\u003e bc\n  init -- c --\u003e bc\n  ab -- c --\u003e abc\n  bc -- b --\u003e abcb\n  abc -- b --\u003e abcb\n  abcb -- c --\u003e abcbc\n```\n\n\u003e The suffix automaton of abcbc.\n\n## Installation\n\n```sh\npip install general-sam\n```\n\n## Usage\n\n### `GeneralSam`\n\n```python\nfrom general_sam import GeneralSam\n\nsam = GeneralSam.from_bytes(b\"abcbc\")\n\n# \"cbc\" is a suffix of \"abcbc\"\nstate = sam.get_root_state()\nstate.feed_bytes(b\"cbc\")\nassert state.is_accepting()\n\n# \"bcb\" is not a suffix of \"abcbc\"\nstate = sam.get_root_state()\nstate.feed_bytes(b\"bcb\")\nassert not state.is_accepting()\n```\n\n```python\nfrom general_sam import GeneralSam\n\nsam = GeneralSam.from_chars(\"abcbc\")\nstate = sam.get_root_state()\n\n# \"b\" is not a suffix but at least a substring of \"abcbc\"\nstate.feed_chars(\"b\")\nassert not state.is_accepting()\n\n# \"bc\" is a suffix of \"abcbc\"\nstate.feed_chars(\"c\")\nassert state.is_accepting()\n\n# \"bcbc\" is a suffix of \"abcbc\"\nstate.feed_chars(\"bc\")\nassert state.is_accepting()\n\n# \"bcbcbc\" is not a substring, much less a suffix of \"abcbc\"\nstate.feed_chars(\"bc\")\nassert not state.is_accepting() and state.is_nil()\n```\n\n```python\nfrom general_sam import GeneralSam, GeneralSamState, build_trie_from_chars\n\ntrie, _ = build_trie_from_chars([\"hello\", \"Chielo\"])\nsam = GeneralSam.from_trie(trie)\n\n\ndef fetch_state(s: str) -\u003e GeneralSamState:\n    state = sam.get_root_state()\n    state.feed_chars(s)\n    return state\n\n\nassert fetch_state(\"lo\").is_accepting()\nassert fetch_state(\"ello\").is_accepting()\nassert fetch_state(\"elo\").is_accepting()\n\nstate = fetch_state(\"el\")\nassert not state.is_accepting() and not state.is_nil()\n\nstate = fetch_state(\"bye\")\nassert not state.is_accepting() and state.is_nil()\n```\n\n### `VocabPrefixAutomaton`\n\n```python\nfrom general_sam import CountInfo, VocabPrefixAutomaton\n\nvocab = [\"歌曲\", \"聆听歌曲\", \"播放歌曲\", \"歌词\", \"查看歌词\"]\nautomaton = VocabPrefixAutomaton(vocab, bytes_or_chars=\"chars\")\n\n# NOTE: CountInfo instances are actually related to the sorted `vocab`:\n_ = [\"播放歌曲\", \"查看歌词\", \"歌曲\", \"歌词\", \"聆听歌曲\"]\n\n# Case 1:\n#   一起 | 聆 | 听 | 歌\nstate = automaton.get_root_state()\n\n# prepend '歌'\ncnt_info = automaton.prepend_feed(state, \"歌\")\nassert cnt_info is not None and cnt_info == CountInfo(\n    str_cnt=2, tot_cnt_lower=2, tot_cnt_upper=4\n)\n\n# found '歌曲' at the index 0 and '歌词' at the index 3 prefixed with '歌'\nselected_idx = automaton.get_order_slice(cnt_info)\nassert frozenset(selected_idx) == {0, 3}\nselected_vocab = [vocab[i] for i in selected_idx]\nassert frozenset(selected_vocab) == {\"歌曲\", \"歌词\"}\n\n# prepend 听\ncnt_info = automaton.prepend_feed(state, \"听\")\n# found nothing prefixed with '听歌'\nassert cnt_info is None\nassert not state.is_nil()\n\n# prepend 聆\ncnt_info = automaton.prepend_feed(state, \"聆\")\nassert cnt_info is not None and cnt_info == CountInfo(\n    str_cnt=1, tot_cnt_lower=4, tot_cnt_upper=5\n)\n\n# found '聆听歌曲' at the index 1 prefixed with '聆听歌'\nselected_idx = automaton.get_order_slice(cnt_info)\nassert frozenset(selected_idx) == {1}\nselected_vocab = [vocab[i] for i in selected_idx]\nassert frozenset(selected_vocab) == {\"聆听歌曲\"}\n\n# prepend 一起\nassert not state.is_nil()\n# found nothing prefixed with '一起聆听歌'\ncnt_info = automaton.prepend_feed(state, \"一起\")\nassert state.is_nil()\n\n# Case 2:\n#   来 | 查看 | 歌词\nstate = automaton.get_root_state()\n\n# prepend 歌词\ncnt_info = automaton.prepend_feed(state, \"歌词\")\nassert cnt_info is not None and cnt_info == CountInfo(\n    str_cnt=1, tot_cnt_lower=3, tot_cnt_upper=4\n)\n\n# found '歌词' at the index 3 prefixed with '歌词'\nselected_idx = automaton.get_order_slice(cnt_info)\nassert frozenset(selected_idx) == {3}\nselected_vocab = [vocab[i] for i in selected_idx]\nassert frozenset(selected_vocab) == {\"歌词\"}\n\n# prepend 查看\ncnt_info = automaton.prepend_feed(state, \"查看\")\nassert cnt_info is not None and cnt_info == CountInfo(\n    str_cnt=1, tot_cnt_lower=1, tot_cnt_upper=2\n)\n\n# found '查看歌词' at the index 4 prefixed with '查看歌词'\nselected_idx = automaton.get_order_slice(cnt_info)\nassert frozenset(selected_idx) == {4}\nselected_vocab = [vocab[i] for i in selected_idx]\nassert frozenset(selected_vocab) == {\"查看歌词\"}\n\n# prepend 来\nassert not state.is_nil()\n# found nothing prefixed with '来查看歌词'\ncnt_info = automaton.prepend_feed(state, \"来\")\nassert state.is_nil()\n```\n\n### `GreedyTokenizer`\n\n```python\nfrom general_sam import GeneralSam, GreedyTokenizer, build_trie_from_chars\n\nvocab = [\"a\", \"ab\", \"b\", \"bc\", \"c\", \"d\", \"e\", \"f\", \"cd\", \"abcde\"]\ntrie, token_to_trie_node = build_trie_from_chars(vocab)\n\ntrie_node_to_token = [-1] * trie.num_of_nodes()\nfor i, j in enumerate(token_to_trie_node):\n    trie_node_to_token[j] = i\n\nsam = GeneralSam.from_trie(trie)\ntokenizer = GreedyTokenizer.from_sam_and_trie(sam, trie)\n\n\ndef tokenize(s: str):\n    return [(trie_node_to_token[i], j) for i, j in tokenizer.tokenize_str(s)]\n\n\nassert tokenize(\"abcde\") == [(9, 5)]\nassert tokenize(\"abcdf\") == [(1, 2), (8, 2), (7, 1)]\nassert tokenize(\"abca\") == [(1, 2), (4, 1), (0, 1)]\n```\n\n## License\n\n- \u0026copy; 2023 Chielo Newctle \\\u003c[ChieloNewctle@gmail.com](mailto:ChieloNewctle@gmail.com)\\\u003e\n- \u0026copy; 2023 ModelTC Team\n\nThis project is licensed under either of\n\n- [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0) ([`LICENSE-APACHE`](LICENSE-APACHE))\n- [MIT license](https://opensource.org/licenses/MIT) ([`LICENSE-MIT`](LICENSE-MIT))\n\nat your option.\n\nThe [SPDX](https://spdx.dev) license identifier for this project is `MIT OR Apache-2.0`.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmodeltc%2Fgeneral-sam-py","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmodeltc%2Fgeneral-sam-py","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmodeltc%2Fgeneral-sam-py/lists"}