Projects in Awesome Lists tagged with bpe-tokenizer
A curated list of projects in awesome lists tagged with bpe-tokenizer.
https://github.com/jmaczan/bpe-tokenizer
Byte-Pair Encoding tokenizer for training large language models on huge datasets
bpe bpe-tokenizer byte-pair-encoding chunking deep-learning from-scratch large-language-models llm machine-learning python tokenizer
Last synced: 27 Dec 2024
https://github.com/gweidart/rs-bpe
A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust
bpe bpe-tokenizer byte-pair-encoding byte-pair-tokenizer huggingface llm openai pypi-package python rust tiktoken tokenizers
Last synced: 28 Apr 2025
https://github.com/willxxy/superbpe
Unofficial implementation of "SuperBPE: Space Travel for Language Models" in Rust
bpe bpe-tokenizer bytepairencoding rust rust-lang
Last synced: 10 Apr 2025
https://github.com/hyouteki/lanat
processing de LANguage NATural
bigram-language-model bio-encoding bpe-tokenizer named-entity-recognition natural-language-processing
Last synced: 21 Feb 2025
https://github.com/jmaczan/bpe.c
High performance Byte-Pair Encoding tokenizer for large language models
bpe bpe-tokenizer c clang llm tokenizer
Last synced: 18 Feb 2025
https://github.com/shivendrra/tokenizers
Self-made byte-pair-encoding tokenizer
bpe-tokenizer bytepairencoding llm tokenization tokenizer
Last synced: 31 Mar 2025
https://github.com/taabishhh/llm_preprocessing
This project implements a Byte Pair Encoding (BPE) tokenization approach along with a Word2Vec model to generate word embeddings from a text corpus. The implementation leverages Apache Hadoop for distributed processing and includes evaluation metrics for optimal dimensionality of embeddings.
apache-hadoop bpe-tokenizer deeplearning4j hadoop-mapreduce jtokkit llm logback nd4j scala scalatest word2vec
Last synced: 16 Mar 2025
https://github.com/estnafinema0/russian-jokes-generator
Transformer Models for Humorous Text Generation. Fine-tuned on Russian jokes dataset with ALiBi, RoPE, GQA, and SwiGLU. Plus a custom Byte-level BPE tokenizer.
alibi bpe-tokenizer grouped-query-attention nlp pytorch rotary-position-embedding swiglu transformer-models
Last synced: 11 Mar 2025
https://github.com/sameermanan/rs-bpe
A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust
bpe bpe-tokenizer byte-pair-encoding byte-pair-tokenizer huggingface llm openai pypi-package python rust tiktoken tokenizers
Last synced: 22 Mar 2025
https://github.com/nickscha/bpe
C89, single header, nostdlib byte pair encoding algorithm
ai bpe-tokenizer c89 neural-network nostdlib single-header
Last synced: 21 Mar 2025