An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with tokenization

A curated list of projects in awesome lists tagged with tokenization.

https://github.com/nvidia/cosmos-tokenizer

A suite of image and video neural tokenizers

diffusion tokenization transformers

Last synced: 30 Oct 2025

https://github.com/lunasec-io/lunasec

LunaSec - Dependency Security Scanner that automatically notifies you about vulnerabilities like Log4Shell or node-ipc in your Pull Requests and Builds. Protect yourself in 30 seconds with the LunaTrace GitHub App: https://github.com/marketplace/lunatrace-by-lunasec/

compliance continuous-delivery cve-scanning cybersecurity dependency-analysis devsecops gdpr log4shell pci-dss sbom sbom-generator scanning scanning-tool security security-tools soc2 software-composition-analysis tokenization web-security zero-trust

Last synced: 15 May 2025

https://github.com/ravenproject/ravencoin

Ravencoin Core integration/staging tree

asset bitcoin blockchain raven ravencoin token tokenization

Last synced: 15 May 2025

https://github.com/RavenProject/Ravencoin

Ravencoin Core integration/staging tree

asset bitcoin blockchain raven ravencoin token tokenization

Last synced: 09 May 2025

https://github.com/VKCOM/YouTokenToMe

Unsupervised text tokenizer focused on computational efficiency

bpe natural-language-processing nlp tokenization word-segmentation

Last synced: 03 Apr 2025

https://github.com/vkcom/youtokentome

Unsupervised text tokenizer focused on computational efficiency

bpe natural-language-processing nlp tokenization word-segmentation

Last synced: 27 Sep 2025

https://github.com/cbaziotis/ekphrasis

Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. It performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and 330 million English tweets from Twitter).

nlp nlp-library semeval spell-corrector spelling-correction text-processing text-segmentation tokenization tokenizer word-normalization word-segmentation

Last synced: 14 Jan 2026

https://github.com/alasdairforsythe/tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & JavaScript

text-tokenization tokenisation tokenization tokenize tokenizer tokenizing vocabulary vocabulary-builder vocabulary-generator

Last synced: 16 Jan 2026

https://github.com/adobe/NLP-Cube

Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing

dependency-parser dependency-parsing embeddings information-extraction language-pipeline lemmatization machine-translation nlp-cube parse part-of-speech-tagger sentence-splitting tokenization universal-dependencies

Last synced: 27 Mar 2025

https://github.com/macmade/clangkit

ClangKit provides an Objective-C frontend to LibClang. Source tokenization, diagnostics and fix-its are currently implemented.

c c-plus-plus clang code diagnostics llvm objective-c parsing source static-analysis syntax-highlighting tokenization

Last synced: 07 Apr 2025

https://github.com/macmade/ClangKit

ClangKit provides an Objective-C frontend to LibClang. Source tokenization, diagnostics and fix-its are currently implemented.

c c-plus-plus clang code diagnostics llvm objective-c parsing source static-analysis syntax-highlighting tokenization

Last synced: 15 Mar 2025

https://github.com/daac-tools/vibrato

🎤 vibrato: Viterbi-based accelerated tokenizer

japanese morphological-analysis nlp rust segmentation tokenization tokenizer

Last synced: 15 May 2025

https://github.com/opennmt/tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support

bpe cpp icu machine-translation natural-language-processing python sentencepiece tokenization tokenizer unicode

Last synced: 08 Oct 2025

https://github.com/foundationvision/omnitokenizer

[NeurIPS 2024] OmniTokenizer: one model and one weight for image-video joint tokenization.

auto-regressive-model image-generation tokenization vae video-generation vqvae

Last synced: 07 Apr 2025

https://github.com/natasha/razdel

Rule-based token and sentence segmentation for the Russian language

nlp python russian sentence-boundary-detection sentence-segmentation tokenization

Last synced: 04 Apr 2025

https://github.com/CodeChain-io/codechain

CodeChain's official implementation in Rust.

asset blockchain digital-securities rust tokenization

Last synced: 30 Mar 2025

https://github.com/codechain-io/codechain

CodeChain's official implementation in Rust.

asset blockchain digital-securities rust tokenization

Last synced: 06 Apr 2025

https://github.com/daac-tools/vaporetto

🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer

analyzer japanese morphological-analysis nlp rust segmentation tokenization tokenizer

Last synced: 12 Apr 2025

https://github.com/milaan9/python_natural_language_processing

This repository consists of a complete guide on natural language processing (NLP) in Python where we'll learn various techniques for implementing NLP including parsing & text processing and understand how to use NLP for text feature engineering.

bag-of-words inversedocumentfrequency ipython-notebook lemmatization named-entity-recognition nlp partofspeech-tagger python4datascience python4everybody sentence-segmentation stemming stopwords termfrequency tf-idf tokenization tutor-milaan9 vocabulary-matching

Last synced: 09 Apr 2025

https://github.com/milaan9/Python_Natural_Language_Processing

This repository consists of a complete guide on natural language processing (NLP) in Python where we'll learn various techniques for implementing NLP including parsing & text processing and understand how to use NLP for text feature engineering.

bag-of-words inversedocumentfrequency ipython-notebook lemmatization named-entity-recognition nlp partofspeech-tagger python4datascience python4everybody sentence-segmentation stemming stopwords termfrequency tf-idf tokenization tutor-milaan9 vocabulary-matching

Last synced: 28 Aug 2025

https://github.com/gautierdag/bpeasy

Fast bare-bones BPE for modern tokenizer training

bpe tokenization tokenizer

Last synced: 06 Apr 2025

https://github.com/cohere-ai/magikarp

Code for the paper "Fishing for Magikarp"

large-language-models tokenization

Last synced: 05 Apr 2025

https://github.com/thudm/icetk

A unified tokenization tool for Images, Chinese and English.

tokenization transformer

Last synced: 06 Apr 2025

https://github.com/rth/vtext

Simple NLP in Rust with Python bindings

bag-of-words information-retrieval nlp tf-idf tokenization

Last synced: 06 Apr 2025

https://github.com/bminixhofer/zett

Code for Zero-Shot Tokenizer Transfer

language-model llm llms multilingual tokenization transfer-learning

Last synced: 05 Apr 2025

https://github.com/lucidrains/charformer-pytorch

Implementation of the GBST block from the Charformer paper, in PyTorch

artificial-intelligence deep-learning tokenization transformer

Last synced: 20 Aug 2025

https://github.com/mit-ccc/tweebanknlp

[LREC 2022] An off-the-shelf pre-trained Tweet NLP Toolkit (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Tweebank-NER dataset

dependency-parser lemmatization machine-learning named-entity-recognition natural-language-processing ner nlp-toolkit pos-tagging text-annotation tokenization tweet-analysis twitter-nlp

Last synced: 11 May 2025

https://github.com/clipperhouse/uax29

A tokenizer based on Unicode text segmentation (UAX #29), for Go. Split graphemes, words, sentences.

go golang nlp tokenization tokenizer uax29 unicode

Last synced: 01 Feb 2026

https://github.com/dluc/openai-tools

A collection of tools for working with OpenAI

gpt-3 gpt3 openai tokenization tokenizer

Last synced: 19 Apr 2025

https://github.com/googlecloudplatform/dlp-dataflow-deidentification

Multi Cloud Data Tokenization Solution By Using Dataflow and Cloud DLP

beam bigquery data dataflow dlp pii tokenization

Last synced: 11 Apr 2025

https://github.com/nlpcloud/nlpcloud-python

NLP Cloud serves high performance pre-trained or custom models for NER, sentiment-analysis, classification, summarization, paraphrasing, intent classification, product description and ad generation, chatbot, grammar and spelling correction, keywords and keyphrases extraction, text generation, image generation, code generation, and more...

ad-generator chatbot code-generation embeddings grammar-correction keyword-extraction language-detection machine-translation ner nlp paraphrasing question-answering semantic-similarity sentiment-analysis spelling-correction text-classification text-generation text-summarization tokenization

Last synced: 28 Jan 2026

https://github.com/ARBML/tkseem

Arabic Tokenization Library. It provides many tokenization algorithms.

arabic-nlp nlp tkseem tokenization

Last synced: 19 Mar 2025

https://github.com/pythainlp/attacut

A Fast and Accurate Neural Thai Word Segmenter

cnn hacktoberfest hactoberfest2022 nlp tokenization

Last synced: 13 Apr 2025

https://github.com/av/klmbr

klmbr - a prompt pre-processing technique to break through the barrier of entropy while generating text with LLMs

inference llm prompts tokenization

Last synced: 23 Aug 2025

https://github.com/liuzl/ling

Natural Language Processing Toolkit in Golang

corenlp lemmatization nlp normalization opencc spacy tokenization

Last synced: 30 Oct 2025

https://github.com/winkjs/wink-tokenizer

Multilingual tokenizer that automatically tags each token with its type

devanagari french german hindi konkani latin marathi multilingual tagging tokenization tokenizer wink

Last synced: 28 Oct 2025

https://github.com/cedricrupb/code_tokenize

Fast tokenization and structural analysis of any programming language

ast code-analysis language parser tokenization

Last synced: 19 Nov 2025

https://github.com/typst/unscanny

Painless string scanning.

parsing scanning tokenization

Last synced: 16 May 2025

https://github.com/nlpcloud/nlpcloud-js

NLP Cloud serves high performance pre-trained or custom models for NER, sentiment-analysis, classification, summarization, paraphrasing, intent classification, product description and ad generation, chatbot, grammar and spelling correction, keywords and keyphrases extraction, text generation, image generation, code generation, and much more...

ad-generator chatbot code-generation conversational-ai embeddings intent-classification keywords-extraction language-detection machine-translation ner nlp paraphrasing question-answering semantic-similarity sentiment-analysis text-classification text-generation text-summarization tokenization

Last synced: 28 Jan 2026

https://github.com/cashtokens/cashtokens

A proposal to enable two new primitives on Bitcoin Cash: fungible tokens and non-fungible tokens.

bitcoin bitcoin-cash bitcoin-cash-chip cashtokens cryptocurrency tokenization

Last synced: 04 Apr 2025

https://github.com/zouharvi/tokenization-scorer

Simple-to-use scoring function for arbitrarily tokenized texts.

bpe segmentation subword tokenization

Last synced: 06 Oct 2025

https://github.com/Quillhash/Real-World-Assets-RWA

This repository comprises the theoretical and technical aspects of tokenisation of real world assets.

blockchain smart-contracts tokenization web3

Last synced: 27 Apr 2025

https://github.com/anki-code/xontrib-output-search

Get identifiers, paths, URLs and words from the previous command output and use them for the next command in xonsh shell.

cli command-line console python shell terminal tmux tmux-plugin tmux-plugins tokenization tokenizer xonsh xontrib zellij

Last synced: 12 Dec 2025

https://github.com/googlecloudplatform/auto-data-tokenize

Identify and tokenize sensitive data automatically using Cloud DLP and Dataflow

cloud-migration data-governance data-loss-prevention dataflow deidentification tokenization

Last synced: 02 Jul 2025

https://github.com/GoogleCloudPlatform/auto-data-tokenize

Identify and tokenize sensitive data automatically using Cloud DLP and Dataflow

cloud-migration data-governance data-loss-prevention dataflow deidentification tokenization

Last synced: 04 Apr 2025

https://github.com/bastienbot/nlp-js-tools-french

POS tagger, lemmatizer and stemmer for the French language in JavaScript

lemmatization lemmatizer nlp postagging postgresql stemmer stemming tokenization tokenizer

Last synced: 01 Aug 2025

https://github.com/JackHCC/Chinese-Tokenization

Implementations of the Chinese word segmentation task using traditional methods (N-gram, HMM, etc.), neural network methods (CNN, LSTM, etc.), and pre-training methods (BERT, etc.).

bert-crf bilstm-crf hmm-viterbi-algorithm ngram nlp tokenization

Last synced: 12 May 2025

https://github.com/thisiscetin/textoken

Simple and customizable text tokenization gem.

nlp ruby rubynlp tokenization

Last synced: 09 Jul 2025

https://github.com/thalesgroup/ciphertrust_application_protection

Public code samples and resources for the Thales CipherTrust Application Protection products of the CipherTrust Data Security Platform

encryption key tokenization

Last synced: 16 Jul 2025

https://github.com/aboudjem/erc-3643

ERC-3643 - Raptor Version is a simple, educational look at the T-REX standard. Using Solidity and Web3, this project demystifies tokenized securities. Remember, Raptor is for learning, not production. Dive in for an accessible peek into blockchain finance!

cedefi cefi defi eip-3643 eip3643 erc-3643 erc3643 evm hardhat real-world-asset real-world-assets rwa security-token security-tokens smart-contracts solidity t-rex tokenization

Last synced: 01 Mar 2025

https://github.com/Sovichea/khmer_segmenter

A zero-dependency, high-performance Khmer word segmenter using the Viterbi algorithm. Optimized for dictionary accuracy, ultra-low memory footprint, and edge deployment.

c-language dictionary-based khmer khmer-language khmer-nlp lightweight nlp portable python tokenization viterbi-algorithm word-segmentation zero-dependency zig-build-system

Last synced: 14 Jan 2026

https://github.com/julienkay/com.doji.transformers

A Unity package to run pretrained transformer models with Unity Sentis

ai clip machine-learning sentis tokenization tokenizer transformer-models transformers unity

Last synced: 10 Apr 2025

https://github.com/dnbaker/bioseq

Tokenizers and Machine Learning Models for biological sequence data

biological-sequences machine-learning tokenization transformers

Last synced: 19 Sep 2025

https://github.com/eliben/go-sentencepiece

Go implementation of the SentencePiece tokenizer

encoding go golang language-model llm sentencepiece tokenization

Last synced: 11 Aug 2025

https://github.com/johannschopplich/tokenx

📐 GPT token estimation and context size utilities without a full tokenizer

tiktoken token-counter tokenization tokenizer

Last synced: 01 May 2025

https://github.com/ankane/youtokentome-ruby

High performance unsupervised text tokenization for Ruby

bpe byte-pair-encoding npl tokenization unsupervised-learning word-segmentation

Last synced: 16 Jul 2025

https://github.com/daac-tools/python-vaporetto

🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.

analyzer japanese morphological-analysis nlp python rust segmentation tokenization tokenizer

Last synced: 11 Oct 2025

https://github.com/bminixhofer/tokenkit

A toolkit implementing advanced methods to transfer models and model knowledge across tokenizers.

distillation jax llms machine-learning tokenization tokenizer-transfer transfer-learning

Last synced: 15 May 2025

https://github.com/yieldhabitat/yieldhabitat_

Real estate tokenization platform built on multiple blockchains enabling fractional ownership of premium properties through tokenization.

blockchain cross-chain defi ethereum fractional-ownership property-investment real-estate solana tokenization web3

Last synced: 02 Apr 2025

https://github.com/dev-protocol/niwa

🌈 Social Token Launcher.

dao tokenization web3-dapp

Last synced: 08 Jul 2025

https://github.com/vsce-toolroom/vscode-textmate-languageservice

Language APIs and support features from TextMate tokenization in Visual Studio Code.

grammar language-features syntax textmate tokenization tokenizer visual-studio-code vscode vscode-extension

Last synced: 13 Apr 2025

https://github.com/taurushq-io/private-cmtat-aztec

Private version of CMTAT security token in Noir (Aztec network DSL)

security-token smart-contracts tokenization zero-knowledge

Last synced: 23 Jan 2026

https://github.com/bnosac/tokenizers.bpe

R package for Byte Pair Encoding based on YouTokenToMe

bpe byte-pair-encoding text-mining tokenization

Last synced: 13 Jun 2025

https://github.com/jkrukowski/swift-sentencepiece

Use SentencePiece in Swift for tokenization and detokenization.

sentencepiece tokenization

Last synced: 11 Oct 2025

https://github.com/LoopscaleLabs/rwa-token

The RWA Token Program is a wrapper and extension program for Solana Token Extensions that creates a uniform approach to permissioned tokens on SVM blockchains.

real-world-assets solana solana-token tokenization

Last synced: 02 Apr 2025

https://github.com/khaledashrafh/tiny-compiler

This project is a fully functional compiler for the TINY programming language, which is a language that supports basic arithmetic, boolean, and control flow operations. The compiler can scan, parse, and run code written in the TINY language.

compiler cpp parser semantic-analyzer syntax-analyzer tiny tiny-compiler tiny-language tokenization

Last synced: 17 Oct 2025

https://github.com/kensho-technologies/pathpiece

PathPiece tokenizer

tokenization

Last synced: 10 Jun 2025

https://github.com/davzim/rtiktoken

BPE Tokenizer for OpenAI's models

bpe openai r rust tokenization

Last synced: 06 May 2025

https://github.com/eklem/words-n-numbers

Tokenizes strings of text, using regular expressions to extract arrays of words and, optionally, numbers, emojis, tags, usernames and email addresses. For Node.js and the browser. When you need more than just [a-z] regular expressions.

nlp offline-first regex tokenization tokenizer

Last synced: 02 Sep 2025

https://github.com/jparkerweb/llm-distillery

🍶 llm-distillery ⇢ use LLMs to run map-reduce summarization tasks on large documents until a target token size is met.

ai-text-reduction large-language-model llm openai-api semantic-chunking text-compression text-distillation text-processing text-summarization token-management tokenization

Last synced: 01 May 2025

https://github.com/kemingy/plane

A text processing tool that includes tag (HTML, URL, email) extraction and removal, punctuation normalization, simple segmentation, and more.

chinese-nlp data-cleaning nlp preprocess regex tokenization tokenizer

Last synced: 17 Mar 2025

https://github.com/cqb13/ti-tools

TI Tools is a CLI tool designed for converting 8xp files (used by TI-83 and TI-84 calculators) to text files and vice versa. It also supports various other features for working with 8xp files.

8xp 8xp-files texas-instruments texas-instruments-calculators ti-84 ti-basic ti-calculators tokenization

Last synced: 07 Jan 2026

https://github.com/just-krivi/ethereum-kryptonite-asset-tokenization

Crowdsale Dapp for ERC-20 Kryptonite token (fake stablecoin backed by Kryptonite mineral).

dapp erc-20 ethereum ico initial-coin-offering kyc open-zeppelin reactjs tokenization truffle

Last synced: 25 Sep 2025

https://github.com/labrijisaad/twitter-sentiment-analysis-with-python

In this project I aim to analyze the sentiment of tweets from the Sentiment140 dataset by developing a machine learning sentiment analysis model built on classifiers. The performance of these classifiers is then evaluated using accuracy and F1 scores.

accuracy-score bernoulli-naive-bayes confusion-matrix f1-score lemmatization logistic-regression machine-learning nlp roc-auc-curve sentiment-analysis sentiment140-dataset stemming support-vector-machine tokenization twitter-sentiment-analysis

Last synced: 08 Apr 2025