An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with bpe-tokenizer

A curated list of projects in awesome lists tagged with bpe-tokenizer.

https://github.com/jmaczan/bpe-tokenizer

Byte-Pair Encoding tokenizer for training large language models on huge datasets

bpe bpe-tokenizer byte-pair-encoding chunking deep-learning from-scratch large-language-models llm machine-learning python tokenizer

Last synced: 27 Dec 2024

https://github.com/gweidart/rs-bpe

A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust

bpe bpe-tokenizer byte-pair-encoding byte-pair-tokenizer huggingface llm openai pypi-package python rust tiktoken tokenizers

Last synced: 28 Apr 2025

https://github.com/willxxy/superbpe

Unofficial implementation of "SuperBPE: Space Travel for Language Models" in Rust

bpe bpe-tokenizer bytepairencoding rust rust-lang

Last synced: 10 Apr 2025

https://github.com/jmaczan/bpe.c

High-performance Byte-Pair Encoding tokenizer for large language models

bpe bpe-tokenizer c clang llm tokenizer

Last synced: 18 Feb 2025

https://github.com/shivendrra/tokenizers

Self-made byte-pair-encoding tokenizer

bpe-tokenizer bytepairencoding llm tokenization tokenizer

Last synced: 31 Mar 2025

https://github.com/taabishhh/llm_preprocessing

This project implements a Byte Pair Encoding (BPE) tokenization approach along with a Word2Vec model to generate word embeddings from a text corpus. The implementation leverages Apache Hadoop for distributed processing and includes evaluation metrics for optimal dimensionality of embeddings.

apache-hadoop bpe-tokenizer deeplearning4j hadoop-mapreduce jtokkit llm logback nd4j scala scalatest word2vec

Last synced: 16 Mar 2025

https://github.com/estnafinema0/russian-jokes-generator

Transformer models for humorous text generation, fine-tuned on a Russian jokes dataset with ALiBi, RoPE, GQA, and SwiGLU, plus a custom byte-level BPE tokenizer.

alibi bpe-tokenizer grouped-query-attention nlp pytorch rotary-position-embedding swiglu transformer-models

Last synced: 11 Mar 2025

https://github.com/sameermanan/rs-bpe

A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust

bpe bpe-tokenizer byte-pair-encoding byte-pair-tokenizer huggingface llm openai pypi-package python rust tiktoken tokenizers

Last synced: 22 Mar 2025

https://github.com/nickscha/bpe

C89, single-header, nostdlib byte pair encoding algorithm

ai bpe-tokenizer c89 neural-network nostdlib single-header

Last synced: 21 Mar 2025