{"id":23364272,"url":"https://github.com/georg-jung/fastberttokenizer","last_synced_at":"2025-04-04T17:05:24.809Z","repository":{"id":194594624,"uuid":"690963846","full_name":"georg-jung/FastBertTokenizer","owner":"georg-jung","description":"Fast and memory-efficient library for WordPiece tokenization as it is used by BERT.","archived":false,"fork":false,"pushed_at":"2025-03-20T15:31:23.000Z","size":20147,"stargazers_count":47,"open_issues_count":7,"forks_count":10,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-02T01:41:21.697Z","etag":null,"topics":["ai","bert","bert-embeddings","llm","machine-learning","natural-language-processing","nlp","nlp-machine-learning","tokenization","tokens","wordpiece","wordpiece-tokenization"],"latest_commit_sha":null,"homepage":"https://fastberttokenizer.gjung.com/","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/georg-jung.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-13T08:30:08.000Z","updated_at":"2025-03-28T02:18:09.000Z","dependencies_parsed_at":"2023-09-14T07:44:57.318Z","dependency_job_id":"970f7785-b7e7-4e07-9202-592ba7275517","html_url":"https://github.com/georg-jung/FastBertTokenizer","commit_stats":{"total_commits":145,"total_committers":2,"mean_commits":72.5,"dds":0.03448275862068961,"last_synced_commit":"744b92dd4d51a634122a9a3d34d3ad7f40ecb6d6"},"previous_names":["georg-jung/fastberttokenizer"],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/georg-ju
ng%2FFastBertTokenizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/georg-jung%2FFastBertTokenizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/georg-jung%2FFastBertTokenizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/georg-jung%2FFastBertTokenizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/georg-jung","download_url":"https://codeload.github.com/georg-jung/FastBertTokenizer/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246741185,"owners_count":20826063,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","bert","bert-embeddings","llm","machine-learning","natural-language-processing","nlp","nlp-machine-learning","tokenization","tokens","wordpiece","wordpiece-tokenization"],"created_at":"2024-12-21T13:14:44.499Z","updated_at":"2025-04-04T17:05:24.788Z","avatar_url":"https://github.com/georg-jung.png","language":"C#","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\" id=\"toplogo\"\u003e\n  \u003ca href=\"https://www.nuget.org/packages/FastBertTokenizer/\"\u003e\n    \u003c!-- https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax#specifying-the-theme-an-image-is-shown-to --\u003e\n    \u003cpicture\u003e\n      \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"logo-darkmode.svg\"\u003e\n      \u003csource 
media=\"(prefers-color-scheme: light)\" srcset=\"logo.svg\"\u003e\n      \u003cimg alt=\"FastBertTokenizer Logo\" src=\"logo.svg\" width=\"100\"\u003e\n    \u003c/picture\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n# FastBertTokenizer\n\n[![NuGet version (FastBertTokenizer)](https://img.shields.io/nuget/v/FastBertTokenizer.svg?style=flat)](https://www.nuget.org/packages/FastBertTokenizer/)\n[![Docs](https://img.shields.io/badge/Docs-fastberttokenizer.gjung.com-blue)](https://fastberttokenizer.gjung.com/)\n![.NET Build](https://github.com/georg-jung/FastBertTokenizer/actions/workflows/ci.yml/badge.svg)\n[![codecov](https://codecov.io/github/georg-jung/FastBertTokenizer/graph/badge.svg?token=PEINHYEBGH)](https://codecov.io/github/georg-jung/FastBertTokenizer)\n\nA fast and memory-efficient library for WordPiece tokenization as it is used by BERT. Tokenization correctness and speed are automatically evaluated in extensive unit tests and benchmarks. Native AOT compatible and support for `netstandard2.0`.\n\n## Goals\n\n* Enabling you to run your AI workloads on .NET in production.\n* **Correctness** - Results that are equivalent to [HuggingFace Transformers' `AutoTokenizer`'s](https://huggingface.co/docs/transformers/v4.33.0/en/model_doc/auto#transformers.AutoTokenizer) in all practical cases.\n* **Speed** - Tokenization should be as fast as reasonably possible.\n* **Ease of use** - The API should be easy to understand and use.\n\n## Getting Started\n\n```bash\ndotnet new console\ndotnet add package FastBertTokenizer\n```\n\n```csharp\nusing FastBertTokenizer;\n\nvar tok = new BertTokenizer();\nawait tok.LoadFromHuggingFaceAsync(\"bert-base-uncased\");\nvar (inputIds, attentionMask, tokenTypeIds) = tok.Encode(\"Lorem ipsum dolor sit amet.\");\nConsole.WriteLine(string.Join(\", \", inputIds.ToArray()));\nvar decoded = tok.Decode(inputIds.Span);\nConsole.WriteLine(decoded);\n\n// Output:\n// 101, 19544, 2213, 12997, 17421, 2079, 10626, 4133, 2572, 3388, 1012, 102\n// 
[CLS] lorem ipsum dolor sit amet. [SEP]\n```\n\n[*example project*](src/examples/QuickStart/)\n\n## Comparison to [BERTTokenizers](https://github.com/NMZivkovic/BertTokenizers)\n\n* about 1 order of magnitude faster\n* allocates more than 1 order of magnitude less memory\n* [better whitespace handling](https://github.com/NMZivkovic/BertTokenizers/issues/24)\n* [handles unknown characters correctly](https://github.com/NMZivkovic/BertTokenizers/issues/26)\n* [does not throw if text is longer than maximum sequence length](https://github.com/NMZivkovic/BertTokenizers/issues/18)\n* handles Unicode control characters\n* handles other alphabets such as Greek and right-to-left languages\n\nNote that while [BERTTokenizers handles token type incorrectly](https://github.com/NMZivkovic/BertTokenizers/issues/18), it does support input of two pieces of text that are tokenized with a separator in between. *FastBertTokenizer* currently does not support this.\n\n## Speed / Benchmarks\n\n\u003e tl;dr: FastBertTokenizer can encode 1 GB of text in around 2 s on a typical notebook CPU from 2020.\n\nAll benchmarks were performed on a typical end user notebook, a ThinkPad T14s Gen 1:\n\n```txt\nBenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3527/23H2/2023Update/SunValley3)\nAMD Ryzen 7 PRO 4750U with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores\n.NET SDK 8.0.204\n```\n\nSimilar results can also be observed using [GitHub Actions](https://github.com/georg-jung/FastBertTokenizer/actions/workflows/benchmark.yml). Note that using shared CI runners for benchmarking has drawbacks and can lead to varying results though.\n\n### on NET 6.0 vs. 
on NET 8.0\n\n* `.NET 6.0.29 (6.0.2924.17105), X64 RyuJIT AVX2` vs `.NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2`\n* Workload: Encode up to 512 tokens from each of 15,000 articles from simple english wikipedia.\n* Results: Total tokens produced: 3,657,145; on .NET 8: ~11m tokens/s single threaded, 73m tokens/s multi threaded.\n\n| Method                       | Runtime  | Mean      | Error    | StdDev   | Ratio | Gen0       | Gen1       | Gen2     | Allocated | Alloc Ratio |\n|----------------------------- |--------- |----------:|---------:|---------:|------:|-----------:|-----------:|---------:|----------:|------------:|\n| Singlethreaded               | .NET 6.0 | 450.39 ms | 7.340 ms | 6.866 ms |  1.00 |          - |          - |        - |      2 MB |        1.00 |\n| MultithreadedMemReuseBatched | .NET 6.0 |  72.46 ms | 1.337 ms | 1.251 ms |  0.16 |   750.0000 |   250.0000 | 250.0000 |  12.75 MB |        6.39 |\n|                              |          |           |          |          |       |            |            |          |           |             |\n| Singlethreaded               | .NET 8.0 | 332.51 ms | 6.574 ms | 7.826 ms |  1.00 |          - |          - |        - |   1.99 MB |        1.00 |\n| MultithreadedMemReuseBatched | .NET 8.0 |  50.83 ms | 0.999 ms | 1.995 ms |  0.15 |   500.0000 |          - |        - |  12.75 MB |        6.40 |\n\n### vs. [SharpToken](https://github.com/dmitry-brazhenko/SharpToken)\n\n* `SharpToken v2.0.2`\n* `.NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2`\n* Workload: Fully encode 15,000 articles from simple english wikipedia. Total tokens produced by FastBertTokenizer: 5,807,949 (~9.4m tokens/s single threaded).\n\nThis isn't an apples to apples comparison as BPE (what SharpToken does) and WordPiece encoding (what FastBertTokenizer does) are different tasks/algorithms. 
Both were applied to exactly the same texts/corpus though.\n\n| Method                        | Mean       | Error    | StdDev   | Gen0      | Gen1      | Allocated |\n|------------------------------ |-----------:|---------:|---------:|----------:|----------:|----------:|\n| SharpTokenFullArticles        | 1,551.9 ms | 25.82 ms | 24.15 ms | 5000.0000 | 2000.0000 |  32.56 MB |\n| FastBertTokenizerFullArticles |   620.3 ms |  7.00 ms |  6.21 ms |         - |         - |   2.26 MB |\n\n### vs. HuggingFace [tokenizers](https://github.com/huggingface/tokenizers) (Rust)\n\n`tokenizers v0.19.1`\n\nI'm not really experienced in benchmarking Rust code, but my attempts using criterion.rs (see `src/HuggingfaceTokenizer/BenchRust`) suggest that it takes tokenizers around\n\n* batched/multi threaded: ~2 s (~2.9m tokens/s)\n* single threaded: ~10 s (~0.6m tokens/s)\n\nto produce 5,807,947 tokens from the same 15k simple english wikipedia articles. Contrary to what one might expect, this means that FastBertTokenizer, being a managed implementation, outperforms tokenizers. It should be noted though that tokenizers has a much more complete feature set while FastBertTokenizer is specifically optimized for WordPiece/Bert encoding.\n\nThe tokenizers repo states `Takes less than 20 seconds to tokenize a GB of text on a server's CPU.` As 26 MB of text takes ~2 s on my notebook CPU, 1 GB would take roughly 80 s. I think it makes sense that \"a server's CPU\" might be 4x as fast as my notebook's CPU, and thus my results seem plausible. It is however also possible that I unintentionally handicapped tokenizers somehow. Please let me know if you think so!\n\n### vs. 
[BERTTokenizers](https://github.com/NMZivkovic/BertTokenizers)\n\n* `BERTTokenizers v1.2.0`\n* `.NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2`\n* Workload: Prefixes of the contents of 15k simple english wikipedia articles, preprocessed to make them encodable by BERTTokenizers.\n\n| Method                                     | Mean       | Error    | StdDev   | Gen0        | Gen1       | Gen2      | Allocated  |\n|------------------------------------------- |-----------:|---------:|---------:|------------:|-----------:|----------:|-----------:|\n| NMZivkovic_BertTokenizers                  | 2,576.0 ms | 15.49 ms | 13.73 ms | 968000.0000 | 40000.0000 | 1000.0000 | 3430.51 MB |\n| FastBertTokenizer_SameDataAsBertTokenizers |   229.8 ms |  4.55 ms |  6.23 ms |           - |          - |         - |    1.03 MB |\n\n## Logo\n\nCreated by combining \u003chttps://icons.getbootstrap.com/icons/cursor-text/\u003e in .NET brand color with \u003chttps://icons.getbootstrap.com/icons/braces/\u003e.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgeorg-jung%2Ffastberttokenizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgeorg-jung%2Ffastberttokenizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgeorg-jung%2Ffastberttokenizer/lists"}