An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with tokenizers

A curated list of projects in awesome lists tagged with tokenizers.

https://github.com/chonkie-ai/autotiktokenizer

🧰 The AutoTokenizer that TikToken always needed -- Load any tokenizer with TikToken now! ✨

machine-learning nlp tiktoken tokenizers transformers

Last synced: 12 Apr 2025

https://github.com/sayakpaul/count-tokens-hf-datasets

This project shows how to derive the total number of training tokens in a large text dataset from 🤗 datasets with Apache Beam and Dataflow.

apache-beam dataflow hf-datasets tokenizers transformers unigram-tokenization

Last synced: 05 Sep 2025

https://github.com/sappho192/tokenizers.dotnet

[Unofficial] Simple .NET wrapper for the Hugging Face Tokenizers library

csharp dotnet huggingface library nuget rust tokenizers

Last synced: 16 Apr 2026

https://github.com/unfoldingword/string-punctuation-tokenizer

Small library that provides functions to tokenize a string into an array of words with or without punctuation

javascript nlp nlp-library scripture-open-components segmentation tokenizers

Last synced: 11 Jun 2025

https://github.com/beomi/megatronlm_dataset_autotokenizer

Megatron-LM/GPT-NeoX compatible Text Encoder with 🤗Transformers AutoTokenizer.

gpt-neox megatron-lm tokenizers transformers

Last synced: 07 May 2025

https://github.com/anush008/tokenizers

Multi-arch bindings for @huggingface/tokenizers.

huggingface tokenizers

Last synced: 14 Mar 2026

https://github.com/gweidart/rs-bpe

A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust

bpe bpe-tokenizer byte-pair-encoding byte-pair-tokenizer huggingface llm openai pypi-package python rust tiktoken tokenizers

Last synced: 28 Apr 2025

https://github.com/wenbingl/tfmtok

A C/C++ tokenizer library for transformer models

tokenization tokenizers transformers

Last synced: 01 Aug 2025

https://github.com/jawrainey/hfta

Reference implementation: run any Hugging Face tokenizer on Android (Rust).

android machine-learning on-device-ml rust tokenizers

Last synced: 15 May 2026

https://github.com/mkashirin/cattode

Lil GPT and BPE built from scratch using PyTorch.

bpe deeplearning gpt languagemodels pytorch tokenizers

Last synced: 24 Apr 2026

https://github.com/707/ml-workbench

Compare multilingual tokenizers and models for cost, context, and deployment decisions.

llm llms ml pre-training tokenization tokenizers

Last synced: 04 Apr 2026

https://github.com/helena-intel/test-prompt-generator

Create prompts with a given token length for testing LLMs and other transformers text models.

benchmarking llm llm-inference nlp tokenizers transformers

Last synced: 20 Jun 2026

https://github.com/bluryar/tokenizers.cpp

Native C++ inference-only tokenizer runtime port

cpp ggml huggingface inference tokenizers

Last synced: 28 Jun 2026

https://github.com/lepisma/tokenizers.el

Fast tokenizers for Emacs Lisp backed by Hugging Face’s Rust library

emacs-lisp rust tokenizers

Last synced: 25 Dec 2025

https://github.com/wassemgtk/supertokenizer

A high-performance tokenizer built to rival GPT-4, trained on the C4 dataset.

tokenizer tokenizer-framework tokenizers

Last synced: 01 Apr 2025

https://github.com/infinilabs/pizza-stemmers

🌍 Rust Snowball stemmers with stemming algorithms for 30+ languages, for INFINI Pizza.

fire-search infini-pizza lanaguage pizza-fire pizza-search-engine snowball snowballstemmer stemmers tokenization tokenizers

Last synced: 09 Jun 2026

https://github.com/duoan/replicateai

Recreating every milestone in Machine Learning and Artificial Intelligence

ai ai-history bert deep-learning foundation-models llama llava llm machine-learning ml qwen reproduce reproducibility tokenizers transformer

Last synced: 09 May 2026

https://github.com/sameermanan/rs-bpe

A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust

bpe bpe-tokenizer byte-pair-encoding byte-pair-tokenizer huggingface llm openai pypi-package python rust tiktoken tokenizers

Last synced: 10 May 2026

https://github.com/omkarborhade98/text_summarization

Text Summarization using NLP

nlp tokenizers transformers

Last synced: 04 Apr 2025