{"id":20215984,"url":"https://github.com/thudm/icetk","last_synced_at":"2025-04-06T02:12:47.397Z","repository":{"id":62570185,"uuid":"440944101","full_name":"THUDM/icetk","owner":"THUDM","description":"A unified tokenization tool for Images, Chinese and English.","archived":false,"fork":false,"pushed_at":"2023-03-23T16:36:44.000Z","size":26,"stargazers_count":151,"open_issues_count":6,"forks_count":17,"subscribers_count":10,"default_branch":"main","last_synced_at":"2025-03-30T01:13:07.707Z","etag":null,"topics":["tokenization","transformer"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/THUDM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-12-22T18:04:16.000Z","updated_at":"2024-12-04T08:08:41.000Z","dependencies_parsed_at":"2024-06-18T17:12:10.338Z","dependency_job_id":null,"html_url":"https://github.com/THUDM/icetk","commit_stats":{"total_commits":8,"total_committers":3,"mean_commits":"2.6666666666666665","dds":0.5,"last_synced_commit":"ca254ee935a2038f046b9bbe57df579226992d30"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2Ficetk","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2Ficetk/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2Ficetk/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2Ficetk/manifests","owner_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/owners/THUDM","download_url":"https://codeload.github.com/THUDM/icetk/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247423516,"owners_count":20936626,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["tokenization","transformer"],"created_at":"2024-11-14T06:25:46.532Z","updated_at":"2025-04-06T02:12:47.376Z","avatar_url":"https://github.com/THUDM.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ICE Tokenizer\n\n- Token IDs `[0, 20000)` are image tokens.\n- Token IDs `[20000, 20100)` are common tokens, mainly punctuation. E.g., `icetk[20000] == '\u003cunk\u003e'`, `icetk[20003] == '\u003cpad\u003e'`, `icetk[20006] == ','`.\n- Token IDs `[20100, 83823)` are English tokens.\n- Token IDs `[83823, 145653)` are Chinese tokens.\n- Token IDs `[145653, 150000)` are rare tokens. E.g., `icetk[145803] == 'α'`.\n\nYou can install the package with pip:\n```\npip install icetk\n```\n\n## Tokenization\n\n```python\nfrom icetk import icetk\ntokens = icetk.tokenize('Hello World! I am icetk.')\n# tokens == ['▁Hello', '▁World', '!', '▁I', '▁am', '▁ice', 'tk', '.']\nids = icetk.encode('Hello World! I am icetk.')\n# ids == [39316, 20932, 20035, 20115, 20344, 22881, 35955, 20007]\nen = icetk.decode(ids)\n# en == 'Hello World! I am icetk.' 
# decoding recovers the input exactly (as long as it contains no \u003cunk\u003e)\n\nids = icetk.encode('你好世界！这里是 icetk。')\n# ids == [20005, 94874, 84097, 20035, 94947, 22881, 35955, 83823]\n\nids = icetk.encode(image_path='test.jpeg', image_size=256, compress_rate=8)\n# ids == tensor([[12738, 12430, 10398,  ...,  7236, 12844, 12386]], device='cuda:0')\n# ids.shape == torch.Size([1, 1024])\nimg = icetk.decode(image_ids=ids, compress_rate=8)\n# img.shape == torch.Size([1, 3, 256, 256])\nfrom torchvision.utils import save_image\nsave_image(img, 'recover.jpg')\n\n# add special tokens\nicetk.add_special_tokens(['\u003cstart_of_image\u003e', '\u003cstart_of_english\u003e', '\u003cstart_of_chinese\u003e'])\n\n# line breaks: \\n becomes a space by default; pass ignore_linebreak=False to keep it\nicetk.decode(icetk.encode('abc\\nhi', ignore_linebreak=False))\n# 'abc\\nhi'\nicetk.decode(icetk.encode('abc\\nhi'))\n# 'abc hi'\n\n# discourage rare composed tokens\nicetk.tokenize('//--------')\n# ['▁//', '--------']\nicetk.text_tokenizer.discourage_ids(range(125653,130000)) # or use icetk.text_tokenizer.discourage_tokens\nicetk.tokenize('//--------')\n# ['▁//', '-', '-', '-', '-', '-', '-', '-', '-']\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthudm%2Ficetk","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthudm%2Ficetk","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthudm%2Ficetk/lists"}