An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with tokenizers

A curated list of projects in awesome lists tagged with tokenizers.

https://github.com/xebia-functional/xef

Building applications with LLMs through composability, in Kotlin, Scala, ...

agents ai artificial-intelligence chatgpt-api embeddings functional-programming kotlin llm multiplatform openai scala tokenizers

Last synced: 04 Apr 2025

https://github.com/chonkie-ai/autotiktokenizer

🧰 The AutoTokenizer that TikToken always needed -- Load any tokenizer with TikToken now! ✨

machine-learning nlp tiktoken tokenizers transformers

Last synced: 12 Apr 2025

https://github.com/sayakpaul/count-tokens-hf-datasets

This project shows how to derive the total number of training tokens in a large text dataset from 🤗 datasets using Apache Beam and Dataflow.

apache-beam dataflow hf-datasets tokenizers transformers unigram-tokenization

Last synced: 06 May 2025

https://github.com/sappho192/tokenizers.dotnet

[Unofficial] Simple .NET wrapper for the Hugging Face Tokenizers library

csharp dotnet huggingface library nuget rust tokenizers

Last synced: 12 Apr 2025

https://github.com/unfoldingword/string-punctuation-tokenizer

Small library that provides functions to tokenize a string into an array of words with or without punctuation

javascript nlp nlp-library scripture-open-components segmentation tokenizers

Last synced: 14 Apr 2025

https://github.com/beomi/megatronlm_dataset_autotokenizer

Megatron-LM/GPT-NeoX compatible Text Encoder with 🤗Transformers AutoTokenizer.

gpt-neox megatron-lm tokenizers transformers

Last synced: 07 May 2025

https://github.com/anush008/tokenizers

Multi-arch bindings for @huggingface/tokenizers.

huggingface tokenizers

Last synced: 23 Mar 2025

https://github.com/gweidart/rs-bpe

A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust

bpe bpe-tokenizer byte-pair-encoding byte-pair-tokenizer huggingface llm openai pypi-package python rust tiktoken tokenizers

Last synced: 28 Apr 2025

https://github.com/wenbingl/tfmtok

A C/C++ tokenizer library for transformer models

tokenization tokenizers transformers

Last synced: 03 Apr 2025

https://github.com/mkashirin/cattode

Lil GPT and BPE built from scratch using PyTorch.

bpe deeplearning gpt languagemodels pytorch tokenizers

Last synced: 09 Apr 2025

https://github.com/wassemgtk/supertokenizer

A high-performance tokenizer built to rival GPT-4, trained on the C4 dataset.

tokenizer tokenizer-framework tokenizers

Last synced: 01 Apr 2025

https://github.com/lepisma/tokenizers.el

Fast tokenizers for Emacs Lisp backed by Hugging Face's Rust library

emacs-lisp rust tokenizers

Last synced: 12 Mar 2025

https://github.com/omkarborhade98/text_summarization

Text Summarization using NLP

nlp tokenizers transformers

Last synced: 04 Apr 2025

https://github.com/sameermanan/rs-bpe

A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust

bpe bpe-tokenizer byte-pair-encoding byte-pair-tokenizer huggingface llm openai pypi-package python rust tiktoken tokenizers

Last synced: 22 Mar 2025

https://github.com/helena-intel/test-prompt-generator

Create prompts with a given token length for testing LLMs and other transformers text models.

benchmarking llm llm-inference nlp tokenizers transformers

Last synced: 24 Feb 2025

https://github.com/infinilabs/pizza-stemmers

🌍 Rust Snowball stemmers with stemming algorithms for 30+ languages, built for INFINI Pizza.

fire-search infini-pizza lanaguage pizza-fire pizza-search-engine snowball snowballstemmer stemmers tokenization tokenizers

Last synced: 23 Feb 2025