![Frame 1 (5)](https://github.com/raj-pulapakura/bpe/assets/87762282/61af30d7-1c95-4978-8c43-01c13def72ac)
# Byte Pair Encoding
Byte Pair Encoding (BPE) is a deterministic subword tokenization algorithm used by LLMs such as GPT-2. It is essentially an algorithm for converting text into integer tokens so that the text can be fed into a model.
In this repo I implement the BPE algorithm from scratch, in Python. I ran the BPE algorithm on a corpus of all of Shakespeare's works, for 10000 merges, and saved the resulting tokenizer to the `tokenizer.json` file.
The repo contains 4 files (2 scripts, 1 txt, 1 json):
- 📜 `train.py` Trains a new tokenizer, given the number of merges
- 📜 `try.py` Provides a command-line interface to tokenize new text
- 📄 `shakespeare.txt` A text file containing all of Shakespeare's works, which I trained the tokenizer on
- 💾 `tokenizer.json` The vocabulary and merge rules of the 10000-merge tokenizer I trained
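As a quick sanity check, the saved tokenizer can be inspected with the standard `json` module. This is a minimal sketch; the `"vocab"` and `"merges"` key names are assumptions for illustration (the actual schema is whatever `train.py` writes):

```
import json

# Load the trained tokenizer from disk.
with open("tokenizer.json", "r", encoding="utf-8") as f:
    tokenizer = json.load(f)

# NOTE: "vocab" and "merges" are assumed key names, for illustration only;
# check train.py for the actual schema.
vocab = tokenizer["vocab"]
merges = tokenizer["merges"]

print(f"Vocabulary size: {len(vocab)}")
print(f"Number of merge rules: {len(merges)}")
```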
## 🏎️ Test Drive

### ⬇️ Get the code
1. Clone the repo (or download as zip):
```
git clone https://github.com/raj-pulapakura/bpe.git
```
2. Navigate to the `bpe` directory:
```
cd bpe
```

### 💪 Tokenize new pieces of text
To try out the tokenizer, run the `try.py` script:
```
python try.py
```

You should see the following output:
```
==================
Byte Pair Encoding
==================
Enter your text:
```

Just type in your text, and it will be tokenized!
```
==================
Byte Pair Encoding
==================
Enter your text: Well you are quite magnificent!
Integer Tokens: [2642, 127, 231, 4043, 9925, 410, 306, 6531, 2]
Text Tokens: ['Well ', 'you ', 'are ', 'quite ', 'mag', 'ni', 'fi', 'cent', '!']
Enter your text:
```
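Under the hood, encoding amounts to splitting the text into individual characters and then applying the learned merge rules in the order they were learned. Here is a minimal sketch of that idea (not the repo's actual `try.py`; the `(left, right)` pair format for merges is an assumption):

```
def encode(text, merges):
    """Tokenize `text` by applying learned merge rules in training order.

    `merges` is assumed to be a list of (left, right) symbol pairs.
    """
    symbols = list(text)  # start from individual characters
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            # Merge every adjacent (left, right) occurrence into one symbol.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(encode("the thin thread", [("t", "h"), ("th", "e")]))
# ['the', ' ', 'th', 'i', 'n', ' ', 'th', 'r', 'e', 'a', 'd']
```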
### 🥷 Train a new tokenizer

To train a new tokenizer, run the `train.py` script, specifying the number of merges you would like to compute (default is 1000):
```
python train.py -m 2000
```

By default, the tokenizer is trained on the `shakespeare.txt` corpus already included in the directory, but you can swap this out for your own corpus, as long as it's a plain text file. If you do train on your own corpus, point to it using the `-c` flag:
```
python train.py -m 2000 -c amazon_reviews.txt
```
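For context, the flags above could be wired up with the standard `argparse` module along these lines. This is an illustrative sketch, not the repo's actual `train.py`; the long option names are assumptions:

```
import argparse

parser = argparse.ArgumentParser(description="Train a BPE tokenizer.")
parser.add_argument("-m", "--merges", type=int, default=1000,
                    help="number of merges to compute (default: 1000)")
parser.add_argument("-c", "--corpus", default="shakespeare.txt",
                    help="path to a plain-text training corpus")
args = parser.parse_args()

with open(args.corpus, "r", encoding="utf-8") as f:
    corpus = f.read()

print(f"Training a tokenizer with {args.merges} merges on {len(corpus):,} characters...")
```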
## 🧶 Algorithm Overview

Check out [my YouTube tutorial](https://www.youtube.com/watch?v=BcxJk4WQVIw), where I explain the BPE algorithm in-depth with examples.
The BPE algorithm can be summarized in the following procedure:
1. Start with a vocabulary of all the individual characters in the corpus
2. Repeat *k* times:
   - 2.1 Choose the two symbols that are most frequently adjacent in the training corpus, say "A" and "B"
   - 2.2 Add a new merged symbol "AB" to the vocabulary
   - 2.3 Replace every adjacent "A" "B" pair with the single token "AB" in the corpus
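As a concrete illustration of this procedure, here is a minimal, unoptimized training loop in plain Python (a sketch of the algorithm above, not the repo's `train.py`):

```
from collections import Counter

def train_bpe(corpus: str, num_merges: int):
    """Learn `num_merges` BPE merge rules from a corpus string."""
    symbols = list(corpus)          # 1. start from individual characters
    vocab = set(symbols)
    merges = []

    for _ in range(num_merges):     # 2. repeat k times
        # 2.1 count every adjacent pair and pick the most frequent one
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (left, right), _count = pairs.most_common(1)[0]

        # 2.2 add the merged symbol to the vocabulary
        vocab.add(left + right)
        merges.append((left, right))

        # 2.3 replace every adjacent (left, right) pair in the corpus
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged

    return vocab, merges

vocab, merges = train_bpe("low lower lowest", num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), (' ', 'low')]
```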