{"id":15601035,"url":"https://github.com/lucidrains/charformer-pytorch","last_synced_at":"2025-08-20T22:32:05.491Z","repository":{"id":38892041,"uuid":"381767714","full_name":"lucidrains/charformer-pytorch","owner":"lucidrains","description":"Implementation of the GBST block from the Charformer paper, in Pytorch","archived":false,"fork":false,"pushed_at":"2021-07-15T01:20:40.000Z","size":79,"stargazers_count":117,"open_issues_count":5,"forks_count":11,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-07-03T08:37:51.031Z","etag":null,"topics":["artificial-intelligence","deep-learning","tokenization","transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lucidrains.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-06-30T16:32:13.000Z","updated_at":"2025-05-14T08:40:48.000Z","dependencies_parsed_at":"2022-07-11T20:02:14.853Z","dependency_job_id":null,"html_url":"https://github.com/lucidrains/charformer-pytorch","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/lucidrains/charformer-pytorch","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fcharformer-pytorch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fcharformer-pytorch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fcharformer-pytorch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fcharformer-pytorch/manifests","owner_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/owners/lucidrains","download_url":"https://codeload.github.com/lucidrains/charformer-pytorch/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fcharformer-pytorch/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271397961,"owners_count":24752641,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-20T02:00:09.606Z","response_time":69,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","deep-learning","tokenization","transformer"],"created_at":"2024-10-03T02:12:42.614Z","updated_at":"2025-08-20T22:32:05.117Z","avatar_url":"https://github.com/lucidrains.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg src=\"./charformer.png\" width=\"400px\"\u003e\u003c/img\u003e\n\n## Charformer - Pytorch\n\nImplementation of the GBST (gradient-based subword tokenization) module from the \u003ca href=\"https://arxiv.org/abs/2106.12672\"\u003eCharformer paper\u003c/a\u003e, in Pytorch. 
The paper proposes a module that automatically learns subword representations, obviating the need for tokenizers in the encoder setting.\n\n\u003ca href=\"https://www.youtube.com/watch?v=debgj24BAZE\"\u003eAI Coffee Break with Letitia video\u003c/a\u003e\n\n## Install\n\n```bash\n$ pip install charformer-pytorch\n```\n\n## Usage\n\n```python\nimport torch\nfrom charformer_pytorch import GBST\n\ntokenizer = GBST(\n    num_tokens = 257,             # number of tokens, should be 256 for byte encoding (+ 1 special token for padding in this example)\n    dim = 512,                    # dimension of token and intra-block positional embedding\n    max_block_size = 4,           # maximum block size\n    downsample_factor = 4,        # the final factor by which the sequence length is reduced\n    score_consensus_attn = True   # whether to do the cheap score consensus (aka attention) as in eq. 5 in the paper\n)\n\ntokens = torch.randint(0, 257, (1, 1023)) # uneven number of tokens (1023)\nmask   = torch.ones(1, 1023).bool()\n\n# both tokens and mask will be appropriately downsampled\n\ntokens, mask = tokenizer(tokens, mask = mask) # (1, 256, 512), (1, 256)\n\n# now pass this on to your transformer\n```\n\nDeviating from the paper, you can also specify block size(s) with different offsets. This is to cover a potential use-case for genomics pre-training, where the tokenizer should be able to learn the correct frame. Simply omit `max_block_size` and pass in `blocks` as a tuple of tuples, each inner tuple with the format `(block size, offset)`. 
Offsets must be less than the block size.\n\n```python\nimport torch\nfrom charformer_pytorch import GBST\n\ntokenizer = GBST(\n    num_tokens = 4 + 1,\n    dim = 512,\n    blocks = ((3, 0), (3, 1), (3, 2)),  # block size of 3, with offsets of 0, 1, 2\n    downsample_factor = 3,\n    score_consensus_attn = True\n).cuda()\n\nbasepairs = torch.randint(0, 4, (1, 1023)).cuda()\nmask      = torch.ones(1, 1023).bool().cuda()\n\n# both basepairs and mask will be appropriately downsampled\n\nbasepairs, mask = tokenizer(basepairs, mask = mask)\n```\n\n## Citations\n\n```bibtex\n@misc{tay2021charformer,\n    title   = {Charformer: Fast Character Transformers via Gradient-based Subword Tokenization}, \n    author  = {Yi Tay and Vinh Q. Tran and Sebastian Ruder and Jai Gupta and Hyung Won Chung and Dara Bahri and Zhen Qin and Simon Baumgartner and Cong Yu and Donald Metzler},\n    year    = {2021},\n    eprint  = {2106.12672},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CL}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Fcharformer-pytorch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flucidrains%2Fcharformer-pytorch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Fcharformer-pytorch/lists"}