Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
A Lightweight Word Piece Tokenizer
https://github.com/ztjhz/word-piece-tokenizer
- Host: GitHub
- URL: https://github.com/ztjhz/word-piece-tokenizer
- Owner: ztjhz
- License: cc0-1.0
- Created: 2022-09-24T16:21:41.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-09-27T07:02:03.000Z (over 2 years ago)
- Last Synced: 2024-09-18T15:44:18.092Z (4 months ago)
- Topics: bert, natural-language-processing, nlp, tokeniser, tokenizer, word-piece
- Language: Python
- Homepage: https://pypi.org/project/word-piece-tokenizer/
- Size: 121 KB
- Stars: 5
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
# A Lightweight Word Piece Tokenizer
[![PyPI version shields.io](https://img.shields.io/pypi/v/word-piece-tokenizer.svg)](https://pypi.org/project/word-piece-tokenizer/)
This library is a pure-Python implementation of a modified version of [Hugging Face's BERT Tokenizer](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer).
## Table of Contents
1. [Usage](#usage)
- [Installing](#installing)
- [Example](#example)
- [Running Tests](#running-tests)
1. [Making it Lightweight](#making-it-lightweight)
- [Optional Features](#optional-features)
- [Unused Features](#unused-features)
1. [Matching Algorithm](#matching-algorithm)
- [The Trie](#the-trie)

## Usage
### Installing
Install and update using [pip](https://pip.pypa.io/en/stable/getting-started/):
```shell
pip install word-piece-tokenizer
```

### Example
```python
from word_piece_tokenizer import WordPieceTokenizer
tokenizer = WordPieceTokenizer()

ids = tokenizer.tokenize('reading a storybook!')
# [101, 3752, 1037, 2466, 8654, 999, 102]

tokens = tokenizer.convert_ids_to_tokens(ids)
# ['[CLS]', 'reading', 'a', 'story', '##book', '!', '[SEP]']

tokenizer.convert_tokens_to_string(tokens)
# '[CLS] reading a storybook ! [SEP]'
```

### Running Tests
Test the tokenizer against Hugging Face's implementation:
```bash
pip install transformers
python tests/tokenizer_test.py
```
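For a quick manual spot check, you can also compare outputs against `transformers` directly. This is a minimal sketch, not the bundled test script; it assumes both tokenizers return the `[CLS]`/`[SEP]` ids, as in the example above:

```python
from transformers import BertTokenizer
from word_piece_tokenizer import WordPieceTokenizer

hf = BertTokenizer.from_pretrained('bert-base-uncased')
lite = WordPieceTokenizer()

text = 'reading a storybook!'
# hf.encode adds special tokens by default, matching lite.tokenize's output
assert lite.tokenize(text) == hf.encode(text)
```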
## Making It Lightweight
To make the tokenizer lightweight and versatile enough for use in environments such as embedded systems and browsers, it has been stripped of optional and unused features.
### Optional Features
The following features are enabled by default instead of being configurable:
| Category      | Feature                                                                                                                                                                                 |
| ------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Tokenizer     | - The tokenizer utilises the pre-trained [bert-base-uncased](https://huggingface.co/bert-base-uncased) vocab list.<br>- Basic tokenization is performed before word piece tokenization. |
| Text Cleaning | - Chinese characters are padded with whitespace.<br>- Characters are converted to lowercase.<br>- The input string is stripped of accents.                                               |
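For illustration, the text-cleaning steps above correspond roughly to the following (a minimal sketch with an invented function name, not the library's actual code):

```python
import unicodedata

def clean_text(text: str) -> str:
    # Lowercase the input
    text = text.lower()
    # Strip accents: decompose characters, then drop combining marks ('Mn')
    text = unicodedata.normalize('NFD', text)
    text = ''.join(ch for ch in text if unicodedata.category(ch) != 'Mn')
    # Pad Chinese characters with whitespace (basic CJK Unified Ideographs block)
    return ''.join(
        f' {ch} ' if 0x4E00 <= ord(ch) <= 0x9FFF else ch
        for ch in text
    )
```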
### Unused Features
The following features have been removed from the tokenizer:
- `pad_token`, `mask_token`, and special tokens
- Ability to add new tokens to the tokenizer
- Ability to never split certain strings (`never_split`)
- Unused functions such as `build_inputs_with_special_tokens`, `get_special_tokens_mask`, `get_vocab`, `save_vocabulary`, and more...
## Matching Algorithm
The tokenizer's _longest substring token matching_ algorithm is implemented with a `trie` instead of the _greedy longest-match-first_ approach.
### The Trie
The original `Trie` class has been modified to adapt to the modified _longest substring token matching_ algorithm.
Instead of a `split` function that separates the input string into substrings, the new trie implements a `getLongestMatchToken` function that returns the _token value `(int)`_ of the longest substring match and the _remaining unmatched substring `(str)`_.
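A minimal sketch of such a trie is shown below. The method name mirrors the description above, but the implementation details are illustrative assumptions, not the library's exact code:

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.token_id = None  # vocab id if a token ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, token: str, token_id: int) -> None:
        node = self.root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id

    def getLongestMatchToken(self, text: str):
        """Return (token_id, remaining) for the longest prefix of `text`
        that is a vocab token, or (None, text) if nothing matches."""
        node = self.root
        best_id, best_len = None, 0
        for i, ch in enumerate(text):
            node = node.children.get(ch)
            if node is None:
                break
            if node.token_id is not None:  # a complete token ends here
                best_id, best_len = node.token_id, i + 1
        return best_id, text[best_len:]

# Toy vocab: the longest match for 'storybooks' is 'storybook', leaving 's'
trie = Trie()
for i, tok in enumerate(['story', 'storybook']):
    trie.insert(tok, i)
print(trie.getLongestMatchToken('storybooks'))  # (1, 's')
```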