An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with tokenizer

A curated list of projects in awesome lists tagged with tokenizer.

https://github.com/theseer/tokenizer

A small library for converting tokenized PHP source code into XML (and potentially other formats)

php tokenizer xml

Last synced: 11 May 2025

https://github.com/natasha/natasha

Solves basic Russian NLP tasks, API for lower level Natasha projects

embeddings morphology ner nlp python russian sentence-segmentation syntax tokenizer visualization

Last synced: 13 May 2025

https://github.com/dqbd/tiktokenizer

Online playground for OpenAI tokenizers

chatgpt nextjs openai t3-stack tiktoken tokenizer

Last synced: 15 May 2025

https://github.com/lovit/soynlp

A Python library for Korean natural language processing. Provides word extraction, tokenization, part-of-speech tagging, and preprocessing.

korean-nlp korean-text-processing nlp postagging tokenizer word-extraction

Last synced: 17 Jan 2026

https://github.com/ikawaha/kagome

Self-contained Japanese Morphological Analyzer written in pure Go

hacktoberfest japanese japanese-language korean morphological-analysis nlp-library pos-tagging segmentation tokenizer

Last synced: 03 Mar 2026

https://github.com/no-context/moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.

javascript lexer regexp tokenizer

Last synced: 24 Apr 2025

https://github.com/mathewsanders/Mustard

🌭 Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it.

substrings swift tokenizer

Last synced: 02 Aug 2025

https://github.com/wangfenjin/simple

A SQLite3 fts5 full-text search extension and tokenizer that supports Chinese and Pinyin

chinese cpp14 fts fts5 pinyin sqlite sqlite3 sqlite3-fts5 tokenizer

Last synced: 15 May 2025

https://github.com/risesoft-y9/data-labeling

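Data Labeling is a tool built specifically for processing and annotating text data. Through a streamlined text annotation workflow and dynamic algorithmic feedback, it lets users label keywords quickly and uses algorithms to keep reducing the cost and time of manual annotation. The process starts with manual annotation to build a foundation, then automatic annotation feeds back into manual annotation, and finally manual annotation corrects errors, substantially improving annotation accuracy and efficiency. Data Labeling depends on an open-source digital base platform for personnel and position management.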

chinese data-annotation-tools data-annotations docker elasticsearch java nacos springboot2 tokenizer tokenizer-parser vue3

Last synced: 15 May 2025

https://github.com/cbaziotis/ekphrasis

Ekphrasis is a text processing tool geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and Twitter, 330 million English tweets).

nlp nlp-library semeval spell-corrector spelling-correction text-processing text-segmentation tokenization tokenizer word-normalization word-segmentation

Last synced: 14 Jan 2026

https://github.com/smoothnlp/SmoothNLP

An NLP toolset with a focus on explainable inference

depedency-parsing nlp nlp-pipeline postagging python tokenizer

Last synced: 12 May 2025

https://github.com/alasdairforsythe/tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript

text-tokenization tokenisation tokenization tokenize tokenizer tokenizing vocabulary vocabulary-builder vocabulary-generator

Last synced: 16 Jan 2026

https://github.com/lindera/lindera

A multilingual morphological analysis library.

analyzer library morphological multilingual tokenizer

Last synced: 12 Mar 2026

https://github.com/niieani/gpt-tokenizer

The fastest JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT-2 / GPT-3 / GPT-4 / GPT-4o / GPT-o1. Port of OpenAI's tiktoken with additional features.

bpe decoder encoder gpt-2 gpt-3 gpt-4 gpt-4o gpt-o1 machine-learning openai tokenizer

Last synced: 14 May 2025

https://github.com/glayzzle/php-parser

🌿 NodeJS PHP Parser - extract AST or tokens

ast development javascript lexer parser php php-ast php-parser static-code-analysis tokenizer

Last synced: 14 May 2025

https://github.com/lydell/js-tokens

Tiny JavaScript tokenizer.

ecmascript javascript regex tokenizer

Last synced: 13 May 2025

https://github.com/lionsoul2014/friso

High-performance Chinese tokenizer with both GBK and UTF-8 charset support, based on the MMSEG algorithm and implemented in ANSI C. Fully modular and easy to embed in other programs, such as MySQL, PostgreSQL, PHP, etc.

c chinese-tokenizer chinese-word-segmentation cjk-tokenizer full-text-search japanese-tokenizer korean-tokenizer php-tokenizer tokenizer

Last synced: 05 Apr 2025

https://github.com/hplt-project/sacremoses

Python port of Moses tokenizer, truecaser and normalizer

machine-translation nlp tokenizer

Last synced: 20 Feb 2026

https://github.com/polm/fugashi

A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.

cython-wrapper japanese mecab nlp tokenizer

Last synced: 01 Feb 2026

https://github.com/CogComp/cogcomp-nlp

CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.

big-data cogcomp data-mining dependency-parsing lemmatization lemmatizer named-entity-recognition natural-language-processing natural-language-understanding ner nlp parts-of-speech-tagging pos pos-tagging relation-extraction similarity tokenizer transliteration

Last synced: 27 Mar 2025

https://github.com/neurosnap/sentences

A multilingual command line sentence tokenizer in Golang

cli sentence-tokenizer sentences tokenizer

Last synced: 16 May 2025

https://github.com/zurawiki/tiktoken-rs

Ready-made tokenizer library for working with GPT and tiktoken

bpe openai rust tokenizer

Last synced: 08 Apr 2026

https://github.com/daac-tools/vibrato

🎤 vibrato: Viterbi-based accelerated tokenizer

japanese morphological-analysis nlp rust segmentation tokenization tokenizer

Last synced: 15 May 2025

https://github.com/belladoreai/llama-tokenizer-js

JS tokenizer for LLaMA 1 and 2

javascript llama llm tokenizer

Last synced: 06 Apr 2025

https://github.com/opennmt/tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support

bpe cpp icu machine-translation natural-language-processing python sentencepiece tokenization tokenizer unicode

Last synced: 08 Oct 2025

https://github.com/guillaume-be/rust-tokenizers

Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models

deep-learning rust-lang tokenizer transformer

Last synced: 15 May 2025

https://github.com/sugarme/tokenizer

NLP tokenizers written in Go language

deep-learning golang-tokenizer nlp tokenizer

Last synced: 16 Jan 2026

https://github.com/dmitry-brazhenko/SharpToken

SharpToken is a C# library for tokenizing natural language text. It's based on the tiktoken Python library and designed to be fast and accurate.

cl100kbase csharp gpt gpt-3 gpt-4 openai tokenizer

Last synced: 08 Jun 2026

https://github.com/tlaceby/guide-to-interpreters-series

Contains source code for viewers following along with my Beginner's Guide to Building Interpreters series on my YouTube channel.

ast ast-parser javascript lexer programming-language tokenizer typescript

Last synced: 23 Jan 2026

https://github.com/daac-tools/vaporetto

🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer

analyzer japanese morphological-analysis nlp rust segmentation tokenization tokenizer

Last synced: 12 Apr 2025

https://github.com/bnosac/udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit

conll dependency-parser lemmatization natural-language-processing nlp pos-tagging r r-package r-pkg rcpp text-mining tokenizer udpipe

Last synced: 04 Apr 2025

https://github.com/netgen/query-translator

Query Translator is a search query translator with AST representation

ast edismax elasticsearch generator parser php query search solr tokenizer translator

Last synced: 12 Apr 2025

https://github.com/microsoft/tokenizer

TypeScript and .NET implementation of BPE tokenizer for OpenAI LLMs.

ai gpt llm openai tokenizer

Last synced: 15 May 2025

https://github.com/mck89/peast

JavaScript parser written in PHP that generates AST from your code according to ECMAScript specification

ast-generation ecmascript javascipt javascript parser parsing php syntax-tree tokenizer traverse validator

Last synced: 11 Jan 2026

https://github.com/zhenye234/xcodec

AAAI 2025: Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model

audio audio-codec codec gpt language-model music self-supervised-learning semantic sound speech speech-language-model text-to-music text-to-sound text-to-speech tokenizer vall-e

Last synced: 11 Apr 2025

https://github.com/ropensci/tokenizers

Fast, Consistent Tokenization of Natural Language Text

nlp peer-reviewed r r-package rstats text-mining tokenizer

Last synced: 24 Sep 2025

https://github.com/botisan-ai/gpt3-tokenizer

Isomorphic JavaScript/TypeScript Tokenizer for GPT-3 and Codex Models by OpenAI.

chatgpt codex gpt-3 gpt3 javascript nodejs openai tokenizer typescript

Last synced: 08 Feb 2026

https://github.com/untitaker/html5gum

A WHATWG-compliant HTML5 tokenizer and tag soup parser

html html5 lexer parser parsing sax tokenizer whatwg xml

Last synced: 15 May 2025

https://github.com/gautierdag/bpeasy

Fast bare-bones BPE for modern tokenizer training

bpe tokenization tokenizer

Last synced: 06 Apr 2025

https://github.com/tsproisl/SoMaJo

A tokenizer and sentence splitter for German and English web and social media texts.

english german sentence-splitter social-media tokenizer

Last synced: 12 Mar 2026

https://github.com/howl-anderson/microtokenizer

MicroTokenizer: A lightweight yet full-featured Chinese tokenizer designed for educational and research purposes, helping students understand how tokenizers work. Provides a practical, hands-on approach to understanding NLP concepts, featuring multiple tokenization algorithms and customizable models. Ideal for students, researchers, and NLP enthusiasts.

chinese-nlp chinese-tokenizer chinese-word-segmentation dag-network educational-project nlp-machine-learning tokenizer

Last synced: 12 Apr 2025

https://github.com/nette/tokenizer

[DISCONTINUED] Source code tokenizer

nette nette-framework php regular-expression tokenizer

Last synced: 01 Oct 2025

https://github.com/foonathan/lex

Replaced by foonathan/lexy

cplusplus lexer tokenizer

Last synced: 08 May 2025

https://github.com/kakaobrain/kortok

The code and models for "An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks" (AACL-IJCNLP 2020)

aacl korean machine-translation natural-language-understanding tokenizer

Last synced: 24 Apr 2025

https://github.com/nooscraft/tokuin

CLI tool – estimates LLM tokens/costs and runs provider-aware load tests for OpenAI, Anthropic, OpenRouter, or custom endpoints.

llms prompt-engineering rust tokenizer

Last synced: 20 Feb 2026

https://github.com/kyegomez/mambabyte

Implementation of MambaByte from "MambaByte: Token-free Selective State Space Model" in PyTorch and Zeta

ai artificial-intelligence gpt4v machine-learning mamba megabyte ml multi-modality tokenizer

Last synced: 04 Apr 2025

https://github.com/clipperhouse/jargon

Tokenizers and lemmatizers for Go

data-science go lemmatizer nlp tokenizer

Last synced: 09 Apr 2025

https://github.com/belladoreai/llama3-tokenizer-js

JS tokenizer for LLaMA 3 and LLaMA 3.1

llama llama3 llm tokenizer

Last synced: 16 May 2025

https://github.com/togatoga/kanpyo

Japanese Morphological Analyzer written in Rust

japanese morphological rust tokenizer

Last synced: 30 Jan 2026

https://github.com/bevacqua/megamark

😻 Markdown with easy tokenization, a fast highlighter, and a lean HTML sanitizer

markdown tokenizer

Last synced: 04 Oct 2025

https://github.com/julialang/tokenize.jl

Tokenization for Julia source code

julia lexer lexing tokenizer

Last synced: 06 Apr 2025

https://github.com/chriskonnertz/string-calc

PHP calculator library for mathematical terms (expressions) passed as strings

calc calculate calculator math mathematical mathematics parser php php-calculator string term tokenizer

Last synced: 06 Apr 2025

https://github.com/amrdeveloper/fileql

A tool that allows you to run SQL-like queries on local files instead of database files using the GitQL SDK.

database engine files gitql parser sql tokenizer

Last synced: 04 Apr 2025

https://github.com/dluc/openai-tools

A collection of tools for working with OpenAI

gpt-3 gpt3 openai tokenization tokenizer

Last synced: 19 Apr 2025

https://github.com/clipperhouse/uax29

A tokenizer based on Unicode text segmentation (UAX #29), for Go. Split graphemes, words, sentences.

go golang nlp tokenization tokenizer uax29 unicode

Last synced: 01 Feb 2026

https://github.com/yishn/chinese-tokenizer

Tokenizes Chinese texts into words.

chinese language tokenizer words

Last synced: 08 Apr 2025

https://github.com/bzick/tokenizer

Tokenizer (lexer) for golang

golang lexer parse parser tokenizer tokenizing

Last synced: 28 Apr 2025

https://github.com/alfianlosari/gptencoder

Swift BPE Encoder/Decoder for OpenAI GPT Models. A programmatic interface for tokenizing text for the OpenAI ChatGPT API.

chatgpt encoder-decoder gpt gpt-3 gpt4 openai swift tokenizer

Last synced: 10 Apr 2025

https://github.com/tryagi/tiktoken

High-performance .NET BPE tokenizer — up to 618 MiB/s, competitive with Rust. Zero-allocation counting, multilingual cache, o200k/cl100k/r50k/p50k encodings + HuggingFace tokenizer.json support.

ai bpe cl100k-base csharp dotnet gpt4o high-performance huggingface o200k-base openai sdk tiktoken tokenizer zero-allocation

Last synced: 01 Apr 2026

https://github.com/samber/go-gpt-3-encoder

Go BPE tokenizer (Encoder+Decoder) for GPT2 and GPT3

bpe byte-pair-encoding codex decoder encoder go gpt-2 gpt-3 openai token tokenizer transformer

Last synced: 05 Apr 2025

https://github.com/venturachrisdev/djurl

Simple yet helpful library for writing Django URLs in an easy, short, and intuitive way.

django django-routing python regex routing tokenizer url urls web

Last synced: 08 Apr 2026

https://github.com/ikskuh/parser-toolkit

A toolkit that makes it easier to write recursive-descent parsers in Zig.

compiler compiler-frontend parser recursive-descent-parser tokenizer tokenizer-parser zig zig-package ziglang

Last synced: 02 Sep 2025

https://github.com/janlelis/wirb

Ruby Object Inspection for IRB

hacktoberfest irb ruby stdlib syntax-highlighting terminal tokenizer

Last synced: 09 Oct 2025

https://github.com/csstools/tokenizer

Tokenize CSS according to the CSS Syntax

css tokenizer

Last synced: 28 Apr 2025

https://github.com/kyubyong/neural_tokenizer

Tokenize English sentences using neural networks.

language neural-network tokenizer

Last synced: 24 Apr 2025

https://github.com/voine/bert-vits2-mnn

TTS System Bert-VITS2 Android Ver, powered by alibaba-MNN engine.

android android-app bert bert-vits2 cppjieba mnn tokenizer tts tts-android tts-engines vits

Last synced: 30 Jul 2025

https://github.com/winkjs/wink-tokenizer

Multilingual tokenizer that automatically tags each token with its type

devanagari french german hindi konkani latin marathi multilingual tagging tokenization tokenizer wink

Last synced: 28 Oct 2025

https://github.com/lindera/lindera-tantivy

Lindera tokenizer for Tantivy.

lindera tantivy tokenizer

Last synced: 04 Apr 2025

https://github.com/piotrmurach/lex

Lex is an implementation of the lex tool in Ruby.

compiler lexer lexing ruby ruby-gem state-lexer tokenizer

Last synced: 12 Jun 2025