Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
A Lightweight Word Piece Tokenizer
https://github.com/ztjhz/word-piece-tokenizer
- Host: GitHub
- URL: https://github.com/ztjhz/word-piece-tokenizer
- Owner: ztjhz
- License: cc0-1.0
- Created: 2022-09-24T16:21:41.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-09-27T07:02:03.000Z (over 2 years ago)
- Last Synced: 2024-09-18T15:44:18.092Z (4 months ago)
- Topics: bert, natural-language-processing, nlp, tokeniser, tokenizer, word-piece
- Language: Python
- Homepage: https://pypi.org/project/word-piece-tokenizer/
- Size: 121 KB
- Stars: 5
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
# A Lightweight Word Piece Tokenizer
[![PyPI version shields.io](https://img.shields.io/pypi/v/word-piece-tokenizer.svg)](https://pypi.org/project/word-piece-tokenizer/)
This library is a pure-Python implementation of a modified version of [Hugging Face's BERT Tokenizer](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer).
## Table of Contents
1. [Usage](#usage)
- [Installing](#installing)
- [Example](#example)
- [Running Tests](#running-tests)
1. [Making it Lightweight](#making-it-lightweight)
- [Optional Features](#optional-features)
- [Unused Features](#unused-features)
1. [Matching Algorithm](#matching-algorithm)
- [The Trie](#the-trie)

## Usage
### Installing
Install and update using [pip](https://pip.pypa.io/en/stable/getting-started/):
```shell
pip install word-piece-tokenizer
```

### Example
```python
from word_piece_tokenizer import WordPieceTokenizer
tokenizer = WordPieceTokenizer()

ids = tokenizer.tokenize('reading a storybook!')
# [101, 3752, 1037, 2466, 8654, 999, 102]

tokens = tokenizer.convert_ids_to_tokens(ids)
# ['[CLS]', 'reading', 'a', 'story', '##book', '!', '[SEP]']

tokenizer.convert_tokens_to_string(tokens)
# '[CLS] reading a storybook ! [SEP]'
```

### Running Tests
Test the tokenizer against Hugging Face's implementation:
```bash
pip install transformers
python tests/tokenizer_test.py
```
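For a quick manual spot check, you can also compare outputs against `transformers` directly. This is a minimal sketch, not the bundled test script; it assumes both tokenizers return the `[CLS]`/`[SEP]` ids, as in the example above:

```python
from transformers import BertTokenizer
from word_piece_tokenizer import WordPieceTokenizer

hf = BertTokenizer.from_pretrained('bert-base-uncased')
lite = WordPieceTokenizer()

text = 'reading a storybook!'
# hf.encode adds special tokens by default, matching lite.tokenize's output
assert lite.tokenize(text) == hf.encode(text)
```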
## Making It Lightweight
To make the tokenizer lightweight and versatile enough for use in environments such as embedded systems and browsers, it has been stripped of optional and unused features.
### Optional Features
The following features are enabled by default instead of being configurable:
| Category      | Feature                                                                                                                                                                                 |
| ------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Tokenizer     | - The tokenizer utilises the pre-trained [bert-base-uncased](https://huggingface.co/bert-base-uncased) vocab list.<br>- Basic tokenization is performed before word piece tokenization. |
| Text Cleaning | - Chinese characters are padded with whitespace.<br>- Characters are converted to lowercase.<br>- The input string is stripped of accents.                                               |
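For illustration, the text-cleaning steps above correspond roughly to the following (a minimal sketch with an invented function name, not the library's actual code):

```python
import unicodedata

def clean_text(text: str) -> str:
    # Lowercase the input
    text = text.lower()
    # Strip accents: decompose characters, then drop combining marks ('Mn')
    text = unicodedata.normalize('NFD', text)
    text = ''.join(ch for ch in text if unicodedata.category(ch) != 'Mn')
    # Pad Chinese characters with whitespace (basic CJK Unified Ideographs block)
    return ''.join(
        f' {ch} ' if 0x4E00 <= ord(ch) <= 0x9FFF else ch
        for ch in text
    )
```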
### Unused Features
The following features have been removed from the tokenizer:
- `pad_token`, `mask_token`, and special tokens
- Ability to add new tokens to the tokenizer
- Ability to never split certain strings (`never_split`)
- Unused functions such as `build_inputs_with_special_tokens`, `get_special_tokens_mask`, `get_vocab`, `save_vocabulary`, and more...
## Matching Algorithm
The tokenizer's _longest substring token matching_ algorithm is implemented with a `trie` instead of the _greedy longest-match-first_ approach.
### The Trie
The original `Trie` class has been modified to adapt to the modified _longest substring token matching_ algorithm.
Instead of a `split` function that separates the input string into substrings, the new trie implements a `getLongestMatchToken` function that returns the _token value `(int)`_ of the longest substring match and the _remaining unmatched substring `(str)`_.
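A minimal sketch of such a trie is shown below. The method name mirrors the description above, but the implementation details are illustrative assumptions, not the library's exact code:

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.token_id = None  # vocab id if a token ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, token: str, token_id: int) -> None:
        node = self.root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id

    def getLongestMatchToken(self, text: str):
        """Return (token_id, remaining) for the longest prefix of `text`
        that is a vocab token, or (None, text) if nothing matches."""
        node = self.root
        best_id, best_len = None, 0
        for i, ch in enumerate(text):
            node = node.children.get(ch)
            if node is None:
                break
            if node.token_id is not None:  # a complete token ends here
                best_id, best_len = node.token_id, i + 1
        return best_id, text[best_len:]

# Toy vocab: the longest match for 'storybooks' is 'storybook', leaving 's'
trie = Trie()
for i, tok in enumerate(['story', 'storybook']):
    trie.insert(tok, i)
print(trie.getLongestMatchToken('storybooks'))  # (1, 's')
```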