Projects in Awesome Lists tagged with tokenizer
A curated list of projects in awesome lists tagged with tokenizer.
https://github.com/theseer/tokenizer
A small library for converting tokenized PHP source code into XML (and potentially other formats)
Last synced: 11 May 2025
https://github.com/chevrotain/chevrotain
Parser Building Toolkit for JavaScript
grammars javascript lexer open-source parser-library parsing tokenizer typescript
Last synced: 12 Dec 2025
https://github.com/Chevrotain/chevrotain
Parser Building Toolkit for JavaScript
grammars javascript lexer open-source parser-library parsing tokenizer typescript
Last synced: 24 Mar 2025
https://github.com/roshan-research/hazm
Persian NLP Toolkit
dependency-parser embeddings farsi lemmatization natural-language-processing nlp normalization persian persian-nlp pos-tagging python text-processing tokenizer
Last synced: 20 Feb 2026
https://github.com/natasha/natasha
Solves basic Russian NLP tasks, API for lower level Natasha projects
embeddings morphology ner nlp python russian sentence-segmentation syntax tokenizer visualization
Last synced: 13 May 2025
https://github.com/lovit/soynlp
A Python library for Korean natural language processing. Provides word extraction, tokenization, part-of-speech tagging, and preprocessing.
korean-nlp korean-text-processing nlp postagging tokenizer word-extraction
Last synced: 17 Jan 2026
https://github.com/ikawaha/kagome
Self-contained Japanese Morphological Analyzer written in pure Go
hacktoberfest japanese japanese-language korean morphological-analysis nlp-library pos-tagging segmentation tokenizer
Last synced: 03 Mar 2026
https://github.com/no-context/moo
Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
javascript lexer regexp tokenizer
Last synced: 24 Apr 2025
https://github.com/mathewsanders/Mustard
🌭 Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it.
Last synced: 02 Aug 2025
https://github.com/wangfenjin/simple
A SQLite3 fts5 full-text search extension (tokenizer) which supports Chinese and PinYin
chinese cpp14 fts fts5 pinyin sqlite sqlite3 sqlite3-fts5 tokenizer
Last synced: 15 May 2025
https://github.com/risesoft-y9/data-labeling
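Data Labeling is a tool dedicated to processing and annotating text data. Through a streamlined, fast annotation workflow and dynamic algorithmic feedback, it lets users annotate keywords quickly and uses algorithms to keep reducing the cost and time of manual annotation. The process starts with manual annotation to build a foundation, then automatic annotation feeds back into manual annotation, and finally manual annotation corrects any deviations, greatly improving annotation accuracy and efficiency. Data Labeling depends on the open-source digital foundation platform (数字底座) for personnel and position management.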
chinese data-annotation-tools data-annotations docker elasticsearch java nacos springboot2 tokenizer tokenizer-parser vue3
Last synced: 15 May 2025
https://github.com/cbaziotis/ekphrasis
Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (English Wikipedia and Twitter, 330 million English tweets).
nlp nlp-library semeval spell-corrector spelling-correction text-processing text-segmentation tokenization tokenizer word-normalization word-segmentation
Last synced: 14 Jan 2026
https://github.com/open-korean-text/open-korean-text
Open Korean Text Processor - An Open-source Korean Text Processor
korean korean-text-processing korean-tokenizer natural-language-processing text-processing tokenizer
Last synced: 11 Jan 2026
https://github.com/smoothnlp/SmoothNLP
An NLP toolset with a focus on explainable inference
depedency-parsing nlp nlp-pipeline postagging python tokenizer
Last synced: 12 May 2025
https://github.com/jflex-de/jflex
The fast scanner generator for Java™ with full Unicode support
bazel-rules cup dfa dfa-minimization flex grammar java lexer lexer-generator lexical-analyzer maven-plugin nfa parsing regexp scanner scanner-generator tokenizer yacc
Last synced: 13 May 2025
https://github.com/alasdairforsythe/tokenmonster
Ungreedy subword tokenizer and vocabulary trainer for Python, Go & JavaScript
text-tokenization tokenisation tokenization tokenize tokenizer tokenizing vocabulary vocabulary-builder vocabulary-generator
Last synced: 16 Jan 2026
https://github.com/lindera/lindera
A multilingual morphological analysis library.
analyzer library morphological multilingual tokenizer
Last synced: 12 Mar 2026
https://github.com/niieani/gpt-tokenizer
The fastest JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT-2 / GPT-3 / GPT-4 / GPT-4o / GPT-o1. Port of OpenAI's tiktoken with additional features.
bpe decoder encoder gpt-2 gpt-3 gpt-4 gpt-4o gpt-o1 machine-learning openai tokenizer
Last synced: 14 May 2025
https://github.com/glayzzle/php-parser
🌿 NodeJS PHP Parser - extract AST or tokens
ast development javascript lexer parser php php-ast php-parser static-code-analysis tokenizer
Last synced: 14 May 2025
https://github.com/lydell/js-tokens
Tiny JavaScript tokenizer.
ecmascript javascript regex tokenizer
Last synced: 13 May 2025
https://github.com/lionsoul2014/friso
High-performance Chinese tokenizer with both GBK and UTF-8 charset support, based on the MMSEG algorithm and written in ANSI C. Fully modular, so it can be easily embedded in other programs such as MySQL, PostgreSQL, PHP, etc.
c chinese-tokenizer chinese-word-segmentation cjk-tokenizer full-text-search japanese-tokenizer korean-tokenizer php-tokenizer tokenizer
Last synced: 05 Apr 2025
https://github.com/hplt-project/sacremoses
Python port of Moses tokenizer, truecaser and normalizer
machine-translation nlp tokenizer
Last synced: 20 Feb 2026
https://github.com/leodevbro/vscode-blockman
VSCode extension to highlight nested code blocks
abstract-syntax-tree ast highlight-blocks indentation parser tokenizer vscode-api vscode-blockman vscode-extension
Last synced: 21 Feb 2026
https://github.com/polm/fugashi
A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
cython-wrapper japanese mecab nlp tokenizer
Last synced: 01 Feb 2026
https://github.com/CogComp/cogcomp-nlp
CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
big-data cogcomp data-mining dependency-parsing lemmatization lemmatizer named-entity-recognition natural-language-processing natural-language-understanding ner nlp parts-of-speech-tagging pos pos-tagging relation-extraction similarity tokenizer transliteration
Last synced: 27 Mar 2025
https://github.com/neurosnap/sentences
A multilingual command line sentence tokenizer in Golang
cli sentence-tokenizer sentences tokenizer
Last synced: 16 May 2025
https://github.com/nlpoptimize/flash-tokenizer
EFFICIENT AND OPTIMIZED TOKENIZER ENGINE FOR LLM INFERENCE SERVING
bert berttokenizer cpp cpp17 deep-learning flash huggingface nlp pybind11 python tokenizer trie wordpiece wordpiece-tokenization
Last synced: 15 May 2025
https://github.com/timtadh/lexmachine
Lex machinery for Go.
dfa go lex lexer lexical-analysis-engines lexical-analysis-framework nfa regular-expression tokenizer
Last synced: 29 Jun 2025
https://github.com/taishi-i/nagisa
A Japanese tokenizer based on recurrent neural networks
dynet japanese natural-language-processing nlp nlp-library pos-tagging sequence-labeling tokenizer word-segmentation
Last synced: 30 Apr 2025
https://github.com/zurawiki/tiktoken-rs
Ready-made tokenizer library for working with GPT and tiktoken
Last synced: 08 Apr 2026
https://github.com/daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
japanese morphological-analysis nlp rust segmentation tokenization tokenizer
Last synced: 15 May 2025
https://github.com/belladoreai/llama-tokenizer-js
JS tokenizer for LLaMA 1 and 2
javascript llama llm tokenizer
Last synced: 06 Apr 2025
https://github.com/opennmt/tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
bpe cpp icu machine-translation natural-language-processing python sentencepiece tokenization tokenizer unicode
Last synced: 08 Oct 2025
https://github.com/guillaume-be/rust-tokenizers
Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models
deep-learning rust-lang tokenizer transformer
Last synced: 15 May 2025
https://github.com/artitw/text2text
Text2Text Language Modeling Toolkit
chatbot chatgpt cross-lingual embeddings information-retrieval levenshtein-distance llama llm multi-lingual nlp question-generation rag search tf-idf tokenizer transformers translator
Last synced: 15 May 2025
https://github.com/sugarme/tokenizer
NLP tokenizers written in Go language
deep-learning golang-tokenizer nlp tokenizer
Last synced: 16 Jan 2026
https://github.com/NLPOptimize/flash-tokenizer
EFFICIENT AND OPTIMIZED TOKENIZER ENGINE FOR LLM INFERENCE SERVING
bert berttokenizer cpp cpp17 deep-learning flash huggingface nlp pybind11 python tokenizer trie wordpiece wordpiece-tokenization
Last synced: 10 Apr 2025
https://github.com/dmitry-brazhenko/SharpToken
SharpToken is a C# library for tokenizing natural language text. It's based on the tiktoken Python library and designed to be fast and accurate.
cl100kbase csharp gpt gpt-3 gpt-4 openai tokenizer
Last synced: 08 Jun 2026
https://github.com/tlaceby/guide-to-interpreters-series
Contains source code for viewers following along with my Beginners Guide To Building Interpreters series on my YouTube channel.
ast ast-parser javascript lexer programming-language tokenizer typescript
Last synced: 23 Jan 2026
https://github.com/dmitry-brazhenko/sharptoken
SharpToken is a C# library for tokenizing natural language text. It's based on the tiktoken Python library and designed to be fast and accurate.
cl100kbase csharp gpt gpt-3 gpt-4 openai tokenizer
Last synced: 10 Aug 2025
https://github.com/daac-tools/vaporetto
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer
analyzer japanese morphological-analysis nlp rust segmentation tokenization tokenizer
Last synced: 12 Apr 2025
https://github.com/bnosac/udpipe
R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
conll dependency-parser lemmatization natural-language-processing nlp pos-tagging r r-package r-pkg rcpp text-mining tokenizer udpipe
Last synced: 04 Apr 2025
https://github.com/netgen/query-translator
Query Translator is a search query translator with AST representation
ast edismax elasticsearch generator parser php query search solr tokenizer translator
Last synced: 12 Apr 2025
https://github.com/dadmatech/dadmatools
DadmaTools is a Persian NLP toolkit developed by Dadmatech Co.
chunker constituency-parser dataset-loader dependency-parser embedding-vectors embeddings lemmatizer natural-language-processing ner nlptoolkit persian persian-nlp postagger spacy tokenizer
Last synced: 25 Oct 2025
https://github.com/mck89/peast
JavaScript parser written in PHP that generates AST from your code according to ECMAScript specification
ast-generation ecmascript javascipt javascript parser parsing php syntax-tree tokenizer traverse validator
Last synced: 11 Jan 2026
https://github.com/zhenye234/xcodec
AAAI 2025: Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model
audio audio-codec codec gpt language-model music self-supervised-learning semantic sound speech speech-language-model text-to-music text-to-sound text-to-speech tokenizer vall-e
Last synced: 11 Apr 2025
https://github.com/ropensci/tokenizers
Fast, Consistent Tokenization of Natural Language Text
nlp peer-reviewed r r-package rstats text-mining tokenizer
Last synced: 24 Sep 2025
https://github.com/Dadmatech/DadmaTools
DadmaTools is a Persian NLP toolkit developed by Dadmatech Co.
chunker constituency-parser dataset-loader dependency-parser embedding-vectors embeddings lemmatizer natural-language-processing ner nlptoolkit persian persian-nlp postagger spacy tokenizer
Last synced: 09 Jul 2025
https://github.com/botisan-ai/gpt3-tokenizer
Isomorphic JavaScript/TypeScript Tokenizer for GPT-3 and Codex Models by OpenAI.
chatgpt codex gpt-3 gpt3 javascript nodejs openai tokenizer typescript
Last synced: 08 Feb 2026
https://github.com/adbar/simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
corpus-tools language-detection language-identification lemmatiser lemmatization lemmatizer low-resource-nlp morphological-analysis nlp tokenization tokenizer wordlist
Last synced: 24 Dec 2025
https://github.com/gautierdag/bpeasy
Fast bare-bones BPE for modern tokenizer training
Last synced: 06 Apr 2025
https://github.com/tsproisl/SoMaJo
A tokenizer and sentence splitter for German and English web and social media texts.
english german sentence-splitter social-media tokenizer
Last synced: 12 Mar 2026
https://github.com/howl-anderson/microtokenizer
MicroTokenizer: A lightweight yet full-featured Chinese tokenizer that helps students understand how tokenizers work, designed for educational and research purposes. Provides a practical, hands-on approach to understanding NLP concepts, featuring multiple tokenization algorithms and customizable models. Ideal for students, researchers, and NLP enthusiasts.
chinese-nlp chinese-tokenizer chinese-word-segmentation dag-network educational-project nlp-machine-learning tokenizer
Last synced: 12 Apr 2025
https://github.com/nette/tokenizer
[DISCONTINUED] Source code tokenizer
nette nette-framework php regular-expression tokenizer
Last synced: 01 Oct 2025
https://github.com/kensuke-mitsuzawa/japanesetokenizers
Aims to make JapaneseTokenizer as easy to use as possible
dictionary-extension japanese-language juman jumanpp kytea mecab mecab-neologd-dictionary nlp tokenizer
Last synced: 17 Mar 2025
https://github.com/mykolaharmash/works-for-me
Collection of developer toolkits
developer-toolkit developer-tools development-environment development-workflow devtools lexer parser tokenizer
Last synced: 07 Mar 2026
https://github.com/MagedSaeed/farasapy
A Python implementation of Farasa toolkit
arabic arabic-nlp diacritization farasa named-entity-recognition nlp postagging python-library python3 python36 stemmers tokenizer
Last synced: 07 May 2025
https://github.com/Cledev-Limited/Cledev.OpenAI
.NET 7 SDK for OpenAI with a Blazor Server playground
azureopenai blazor blazor-server chat-gpt chatgpt chatgpt-4 chatgpt-api dall-e dontnet-core dotnet gpt-3 gpt3 net7 openai openai-api sdk sdk-dotnet tokenizer whisper whisper-ai
Last synced: 13 May 2025
https://github.com/kakaobrain/kortok
The code and models for "An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks" (AACL-IJCNLP 2020)
aacl korean machine-translation natural-language-understanding tokenizer
Last synced: 24 Apr 2025
https://github.com/nooscraft/tokuin
CLI tool – estimates LLM tokens/costs and runs provider-aware load tests for OpenAI, Anthropic, OpenRouter, or custom endpoints.
llms prompt-engineering rust tokenizer
Last synced: 20 Feb 2026
https://github.com/kyegomez/mambabyte
Implementation of MambaByte from "MambaByte: Token-free Selective State Space Model" in PyTorch and Zeta
ai artificial-intelligence gpt4v machine-learning mamba megabyte ml multi-modality tokenizer
Last synced: 04 Apr 2025
https://github.com/clipperhouse/jargon
Tokenizers and lemmatizers for Go
data-science go lemmatizer nlp tokenizer
Last synced: 09 Apr 2025
https://github.com/belladoreai/llama3-tokenizer-js
JS tokenizer for LLaMA 3 and LLaMA 3.1
Last synced: 16 May 2025
https://github.com/togatoga/kanpyo
Japanese Morphological Analyzer written in Rust
japanese morphological rust tokenizer
Last synced: 30 Jan 2026
https://github.com/bevacqua/megamark
😻 Markdown with easy tokenization, a fast highlighter, and a lean HTML sanitizer
Last synced: 04 Oct 2025
https://github.com/cledev-limited/cledev.openai
.NET 7 SDK for OpenAI with a Blazor Server playground
azureopenai blazor blazor-server chat-gpt chatgpt chatgpt-4 chatgpt-api dall-e dontnet-core dotnet gpt-3 gpt3 net7 openai openai-api sdk sdk-dotnet tokenizer whisper whisper-ai
Last synced: 22 Apr 2025
https://github.com/julialang/tokenize.jl
Tokenization for Julia source code
Last synced: 06 Apr 2025
https://github.com/chriskonnertz/string-calc
PHP calculator library for mathematical terms (expressions) passed as strings
calc calculate calculator math mathematical mathematics parser php php-calculator string term tokenizer
Last synced: 06 Apr 2025
https://github.com/kyegomez/MambaByte
Implementation of MambaByte from "MambaByte: Token-free Selective State Space Model" in PyTorch and Zeta
ai artificial-intelligence gpt4v machine-learning mamba megabyte ml multi-modality tokenizer
Last synced: 20 Mar 2025
https://github.com/dluc/openai-tools
A collection of tools for working with OpenAI
gpt-3 gpt3 openai tokenization tokenizer
Last synced: 19 Apr 2025
https://github.com/clipperhouse/uax29
A tokenizer based on Unicode text segmentation (UAX #29), for Go. Split graphemes, words, sentences.
go golang nlp tokenization tokenizer uax29 unicode
Last synced: 01 Feb 2026
https://github.com/explosion/spacy-experimental
🧪 Cutting-edge experimental spaCy components and features
lemmatizer machine-learning natural-language-processing nlp spacy spacy-extension spacy-pipeline tokenizer
Last synced: 07 Apr 2025
https://github.com/yishn/chinese-tokenizer
Tokenizes Chinese texts into words.
chinese language tokenizer words
Last synced: 08 Apr 2025
https://github.com/bzick/tokenizer
Tokenizer (lexer) for golang
golang lexer parse parser tokenizer tokenizing
Last synced: 28 Apr 2025
https://github.com/alfianlosari/gptencoder
Swift BPE Encoder/Decoder for OpenAI GPT Models. A programmatic interface for tokenizing text for OpenAI ChatGPT API.
chatgpt encoder-decoder gpt gpt-3 gpt4 openai swift tokenizer
Last synced: 10 Apr 2025
https://github.com/colindembovsky/cols-agent-tasks
Colin's ALM Corner Custom Build Tasks
build coverage dacpac replace-token tag tokenizer versioning vsts vsts-extension
Last synced: 04 Apr 2025
https://github.com/tryagi/tiktoken
High-performance .NET BPE tokenizer — up to 618 MiB/s, competitive with Rust. Zero-allocation counting, multilingual cache, o200k/cl100k/r50k/p50k encodings + HuggingFace tokenizer.json support.
ai bpe cl100k-base csharp dotnet gpt4o high-performance huggingface o200k-base openai sdk tiktoken tokenizer zero-allocation
Last synced: 01 Apr 2026
https://github.com/alfianlosari/GPTEncoder
Swift BPE Encoder/Decoder for OpenAI GPT Models. A programmatic interface for tokenizing text for OpenAI ChatGPT API.
chatgpt encoder-decoder gpt gpt-3 gpt4 openai swift tokenizer
Last synced: 18 Jul 2025
https://github.com/samber/go-gpt-3-encoder
Go BPE tokenizer (Encoder+Decoder) for GPT2 and GPT3
bpe byte-pair-encoding codex decoder encoder go gpt-2 gpt-3 openai token tokenizer transformer
Last synced: 05 Apr 2025
https://github.com/venturachrisdev/djurl
Simple yet helpful library for writing Django urls in an easy, short and intuitive way.
django django-routing python regex routing tokenizer url urls web
Last synced: 08 Apr 2026
https://github.com/ikskuh/parser-toolkit
A toolkit that makes it easier to write recursive-descent parsers in Zig.
compiler compiler-frontend parser recursive-descent-parser tokenizer tokenizer-parser zig zig-package ziglang
Last synced: 02 Sep 2025
https://github.com/tangxiaolv/android-sqlite-fts5-tokenizer
SQLite3 source code with an integrated FTS5 Chinese tokenizer
fts5 fts5-chinese-tokenizer sqlite sqlite3-source tokenizer
Last synced: 25 Apr 2025
https://github.com/janlelis/wirb
Ruby Object Inspection for IRB
hacktoberfest irb ruby stdlib syntax-highlighting terminal tokenizer
Last synced: 09 Oct 2025
https://github.com/csstools/tokenizer
Tokenize CSS according to the CSS Syntax
Last synced: 28 Apr 2025
https://github.com/mideind/GreynirServer
The greynir.is Icelandic natural language processing API and website.
earley grammar icelandic icelandic-language icelandic-news-sites information-extraction natural-language-processing natural-language-queries nlp parse-forests parse-trees parser python tf-idf tokenizer
Last synced: 23 Mar 2025
https://github.com/mideind/greynirserver
The greynir.is Icelandic natural language processing API and website.
earley grammar icelandic icelandic-language icelandic-news-sites information-extraction natural-language-processing natural-language-queries nlp parse-forests parse-trees parser python tf-idf tokenizer
Last synced: 24 Sep 2025
https://github.com/kyubyong/neural_tokenizer
Tokenize English sentences using neural networks.
language neural-network tokenizer
Last synced: 24 Apr 2025
https://github.com/openshieldai/openshield
OpenShield is a new generation security layer for AI models
ai artificial-intelligence firewall golang guardian llama llm models openai openai-api owasp probllama python security security-tools tiktoken tokenizer
Last synced: 11 Jan 2026
https://github.com/voine/bert-vits2-mnn
TTS System Bert-VITS2 Android Ver, powered by alibaba-MNN engine.
android android-app bert bert-vits2 cppjieba mnn tokenizer tts tts-android tts-engines vits
Last synced: 30 Jul 2025
https://github.com/winkjs/wink-tokenizer
Multilingual tokenizer that automatically tags each token with its type
devanagari french german hindi konkani latin marathi multilingual tagging tokenization tokenizer wink
Last synced: 28 Oct 2025
https://github.com/piotrmurach/lex
Lex is an implementation of the lex tool in Ruby.
compiler lexer lexing ruby ruby-gem state-lexer tokenizer
Last synced: 12 Jun 2025
https://github.com/dnanhkhoa/python-vncorenlp
A Python wrapper for VnCoreNLP using a bidirectional communication channel.
dependency-parser named-entity-recognition ner nlp parser pos-tagger postagger python-vncorenlp tokenizer vietnamese-nlp vncorenlp word-segmentation
Last synced: 14 Apr 2025