Projects in Awesome Lists tagged with tokenizers
A curated list of projects in awesome lists tagged with tokenizers.
https://github.com/xebia-functional/xef
Building applications with LLMs through composability, in Kotlin, Scala, ...
agents ai artificial-intelligence chatgpt-api embeddings functional-programming kotlin llm multiplatform openai scala tokenizers
Last synced: 04 Apr 2025
https://github.com/chonkie-ai/autotiktokenizer
🧰 The AutoTokenizer that TikToken always needed -- Load any tokenizer with TikToken now! ✨
machine-learning nlp tiktoken tokenizers transformers
Last synced: 12 Apr 2025
https://github.com/prismadic/magnet
the small distributed language model toolkit; fine-tune state-of-the-art LLMs anywhere, rapidly
apple-silicon claude distributed-computing distributed-systems embeddings fine-tuning finetuning-llms gemini huggingface inference-api langchain llm-training milvus mistral mlx nats nats-messaging nats-streaming sentence-splitting tokenizers
Last synced: 13 Apr 2025
https://github.com/sayakpaul/count-tokens-hf-datasets
This project shows how to derive the total number of training tokens from a large text dataset from 🤗 datasets with Apache Beam and Dataflow.
apache-beam dataflow hf-datasets tokenizers transformers unigram-tokenization
Last synced: 06 May 2025
https://github.com/megagonlabs/ginza-transformers
Use custom tokenizers in spacy-transformers
ginza natural-language-processing nlp spacy spacy-transformers sudachitra tokenizers transformers
Last synced: 12 Apr 2025
https://github.com/sappho192/tokenizers.dotnet
[Unofficial] Simple .NET wrapper of HuggingFace Tokenizers library
csharp dotnet huggingface library nuget rust tokenizers
Last synced: 12 Apr 2025
https://github.com/unfoldingword/string-punctuation-tokenizer
Small library that provides functions to tokenize a string into an array of words with or without punctuation
javascript nlp nlp-library scripture-open-components segmentation tokenizers
Last synced: 14 Apr 2025
https://github.com/beomi/megatronlm_dataset_autotokenizer
Megatron-LM/GPT-NeoX compatible Text Encoder with 🤗Transformers AutoTokenizer.
gpt-neox megatron-lm tokenizers transformers
Last synced: 07 May 2025
https://github.com/anush008/tokenizers
Multi-arch bindings for @huggingface/tokenizers.
Last synced: 23 Mar 2025
https://github.com/jeronymous/deep_learning_notebooks
Self-contained notebooks for experimenting with particular concepts in Deep Learning
artificial-intelligence artificial-neural-networks automatic-speech-recognition deep-learning deep-neural-networks machine-learning natural-language-processing speech-recognition speech-to-text tokenization tokenizer-nlp tokenizers
Last synced: 22 Apr 2025
https://github.com/gweidart/rs-bpe
A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust
bpe bpe-tokenizer byte-pair-encoding byte-pair-tokenizer huggingface llm openai pypi-package python rust tiktoken tokenizers
Last synced: 28 Apr 2025
https://github.com/wenbingl/tfmtok
A C/C++ tokenizer library for transformer models
tokenization tokenizers transformers
Last synced: 03 Apr 2025
https://github.com/mkashirin/cattode
Lil GPT and BPE built from scratch using PyTorch.
bpe deeplearning gpt languagemodels pytorch tokenizers
Last synced: 09 Apr 2025
https://github.com/wassemgtk/supertokenizer
A high-performance tokenizer built to rival GPT-4, trained on the C4 dataset.
tokenizer tokenizer-framework tokenizers
Last synced: 01 Apr 2025
https://github.com/lepisma/tokenizers.el
Fast tokenizers for Emacs Lisp backed by Hugging Face’s Rust library
Last synced: 12 Mar 2025
https://github.com/omkarborhade98/text_summarization
Text Summarization using NLP
Last synced: 04 Apr 2025
https://github.com/sameermanan/rs-bpe
A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust
bpe bpe-tokenizer byte-pair-encoding byte-pair-tokenizer huggingface llm openai pypi-package python rust tiktoken tokenizers
Last synced: 22 Mar 2025
https://github.com/matesxs/codetransformer
github gpt gpt2 model python3 pytorch tensorflow2 tokenizer tokenizers transformers
Last synced: 02 Apr 2025
https://github.com/helena-intel/test-prompt-generator
Create prompts with a given token length for testing LLMs and other transformers text models.
benchmarking llm llm-inference nlp tokenizers transformers
Last synced: 24 Feb 2025
https://github.com/infinilabs/pizza-stemmers
🌍 Rust Snowball stemmers with stemming algorithms for 30+ languages, built for INFINI Pizza.
fire-search infini-pizza lanaguage pizza-fire pizza-search-engine snowball snowballstemmer stemmers tokenization tokenizers
Last synced: 23 Feb 2025