https://github.com/teleprint-me/byte-pair
Byte Pair Encoding (BPE) Tokenization for Natural Language Processing
https://github.com/teleprint-me/byte-pair
bpe nlp tokenizer
Last synced: 3 months ago
JSON representation
Byte Pair Encoding (BPE) Tokenization for Natural Language Processing
- Host: GitHub
- URL: https://github.com/teleprint-me/byte-pair
- Owner: teleprint-me
- License: agpl-3.0
- Created: 2024-01-27T03:37:53.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-05-06T19:47:23.000Z (about 1 year ago)
- Last Synced: 2025-01-03T10:45:49.554Z (4 months ago)
- Topics: bpe, nlp, tokenizer
- Language: Python
- Homepage:
- Size: 200 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Byte Pair Encoding
## Overview
The Byte-Pair Encoder (BPE) is a powerful tokenization method widely used in natural language processing. This Python implementation of BPE is inspired by the paper [Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/abs/1508.07909v5) and guided by [Lei Mao's educational tutorial](https://leimao.github.io/blog/Byte-Pair-Encoding/).## Features
- **Tokenization**: Efficient tokenization using Byte-Pair Encoding.
- **Vocabulary Management**: Tools for managing and analyzing vocabulary.
- **Token Pair Frequency**: Calculate token pair frequencies for subword units.## Getting Started
To get started with Byte-Pair Encoder, follow these simple steps:1. **Clone the Repository**
```sh
git clone https://github.com/teleprint-me/byte-pair.git
```2. **Install Dependencies**
```sh
virtualenv .venv
source .venv/bin/activate
pip install -r requirements.txt
```3. **Run the Code**
```sh
python -m byte_pair.encode --input_file samples/taming_shrew.md --output_file local/vocab.json --n_merges 5000
```## Usage
For comprehensive usage instructions and options, consult the documentation:```sh
python -m byte_pair.encode --help
```## Documentation
Detailed information on how to use and contribute to the project is available in the [documentation](docs).## Contributing
Contributions are welcome! If you have suggestions, bug reports, or improvements, please don't hesitate to submit issues or pull requests.## License
This project is licensed under the AGPL (GNU Affero General Public License). For detailed information, see the [LICENSE](LICENSE) file.## Acknowledgments
Special thanks to Lei Mao for the blog tutorial that inspired this implementation.## Additional Resources
- Original Paper: [A New Algorithm for Data Compression Optimization](https://arxiv.org/abs/1209.1045)
- Johns Hopkins Paper: [A Formal Perspective on Byte-Pair Encoding](https://aclanthology.org/2023.findings-acl.38.pdf)
- Amazon Research: [A Statistical Extension of Byte-Pair Encoding](https://paperswithcode.com/paper/a-statistical-extension-of-byte-pair-encoding)