{"id":16510749,"url":"https://github.com/anki-code/tokenize-output","last_synced_at":"2025-12-12T00:53:04.281Z","repository":{"id":43483086,"uuid":"266193569","full_name":"anki-code/tokenize-output","owner":"anki-code","description":"Get identifiers, names, paths, URLs and words from the terminal command output.","archived":false,"fork":false,"pushed_at":"2023-04-12T09:37:39.000Z","size":39,"stargazers_count":7,"open_issues_count":2,"forks_count":3,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-08-24T18:30:31.408Z","etag":null,"topics":["console","output","shell","terminal","tokenization","tokenizer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/anki-code.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"custom":["https://github.com/anki-code","https://www.buymeacoffee.com/xxh"]}},"created_at":"2020-05-22T19:39:27.000Z","updated_at":"2025-01-11T12:10:58.000Z","dependencies_parsed_at":"2025-06-11T14:50:04.860Z","dependency_job_id":null,"html_url":"https://github.com/anki-code/tokenize-output","commit_stats":null,"previous_names":["tokenizer/tokenize-output"],"tags_count":10,"template":false,"template_full_name":null,"purl":"pkg:github/anki-code/tokenize-output","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anki-code%2Ftokenize-output","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anki-code%2Ftokenize-output/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anki-code%2Ftokenize-output/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anki-code%2Ftokenize-output/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/anki-code","download_url":"https://codeload.github.com/anki-code/tokenize-output/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anki-code%2Ftokenize-output/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":275982707,"owners_count":25564148,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-19T02:00:09.700Z","response_time":108,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["console","output","shell","terminal","tokenization","tokenizer"],"created_at":"2024-10-11T15:56:53.209Z","updated_at":"2025-09-19T18:29:45.675Z","avatar_url":"https://github.com/anki-code.png","language":"Python","funding_links":["https://github.com/anki-code","https://www.buymeacoffee.com/xxh"],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\nGet identifiers, names, paths, URLs and words from the command output.\u003cbr\u003e \nThe \u003ca href=\"https://github.com/anki-code/xontrib-output-search\"\u003exontrib-output-search\u003c/a\u003e for \u003ca href=\"https://xon.sh/\"\u003exonsh shell\u003c/a\u003e is using this library.\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e  \nIf you like the idea click ⭐ on the repo and stay tuned by watching releases.\n\u003c/p\u003e\n\n## Install\n```shell script\npip install -U tokenize-output\n```\n\n## Usage\n\nYou can use `tokenize-output` command as well as export the tokenizers in Python:\n```python\nfrom tokenize_output.tokenize_output import *\ntokenizer_split(\"Hello world!\")\n# {'final': set(), 'new': {'Hello', 'world!'}}\n```\n\n#### Words tokenizing\n```shell script\necho \"Try https://github.com/xxh/xxh\" | tokenize-output -p\n# Try\n# https://github.com/xxh/xxh\n```\n\n#### JSON, Python dict and JavaScript object tokenizing\n```shell script\necho '{\"Try\": \"xonsh shell\"}' | tokenize-output -p\n# Try\n# shell\n# xonsh\n# xonsh shell\n```    \n\n#### env tokenizing\n```shell script\necho 'PATH=/one/two:/three/four' | tokenize-output -p\n# /one/two\n# /one/two:/three/four\n# /three/four\n# PATH\n```    \n\n## Development\n\n### Tokenizers\nTokenizer is a functions which extract tokens from the text.\n\n| Priority | Tokenizer  | Text example  | Tokens |\n| ---------| ---------- | ----- | ------ |\n| 1        | **dict**   | `{\"key\": \"val as str\"}` | `key`, `val as str` |\n| 2        | **env**    | `PATH=/bin:/etc` | `PATH`, `/bin:/etc`, `/bin`, `/etc` |   \n| 3        | **split**  | `Split  me \\n now!` | `Split`, `me`, `now!` |   \n| 4        | **strip**  | `{Hello}!.` | `Hello` |   \n\nYou can create your tokenizer and add it to `tokenizers_all` in `tokenize_output.py`.\n\nTokenizing is a recursive process where every tokenizer returns `final` and `new` tokens. \nThe `final` tokens directly go to the result list of tokens. The `new` tokens go to all \ntokenizers again to find new tokens. As result if there is a mix of json and env data \nin the output it will be found and tokenized in appropriate way.  \n\n### How to add tokenizer\n\nYou can start from `env` tokenizer:\n\n1. [Prepare regexp](https://github.com/tokenizer/tokenize-output/blob/25b930cfadf8291e72a72144962e411e47d28139/tokenize_output/tokenize_output.py#L10)\n2. [Prepare tokenizer function](https://github.com/tokenizer/tokenize-output/blob/25b930cfadf8291e72a72144962e411e47d28139/tokenize_output/tokenize_output.py#L57-L70)\n3. [Add the function to the list](https://github.com/tokenizer/tokenize-output/blob/25b930cfadf8291e72a72144962e411e47d28139/tokenize_output/tokenize_output.py#L139-L144) and [to the preset](https://github.com/tokenizer/tokenize-output/blob/25b930cfadf8291e72a72144962e411e47d28139/tokenize_output/tokenize_output.py#L147).\n4. [Add test](https://github.com/tokenizer/tokenize-output/blob/25b930cfadf8291e72a72144962e411e47d28139/tests/test_tokenize.py#L34-L35).\n5. Now you can test and debug (see below).\n\n### Test and debug\nRun tests:\n```shell script\ncd ~\ngit clone https://github.com/anki-code/tokenize-output\ncd tokenize-output\npython -m pytest tests/\n```\nTo debug the tokenizer:\n```shell script\necho \"Hello world\" | ./tokenize-output -p\n```\n\n## Related projects\n* [xontrib-output-search][XONTRIB_OUTPUT_SEARCH] for [xonsh shell][XONSH]\n\n[XONTRIB_OUTPUT_SEARCH]: https://github.com/anki-code/xontrib-output-search\n[XONSH]: https://xon.sh/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanki-code%2Ftokenize-output","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fanki-code%2Ftokenize-output","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanki-code%2Ftokenize-output/lists"}