Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mokkemeguru/meguru_tokenizer
- Host: GitHub
- URL: https://github.com/mokkemeguru/meguru_tokenizer
- Owner: MokkeMeguru
- Created: 2020-07-12T09:52:53.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2020-09-20T05:21:57.000Z (about 4 years ago)
- Last Synced: 2024-09-25T01:02:43.861Z (2 days ago)
- Topics: nlp, pytorch, tensorflow2
- Language: Python
- Homepage: https://mokkemeguru.github.io/meguru_tokenizer/
- Size: 6.14 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
README
# meguru tokenizer
# installation and initialization
```shell
pip install meguru_tokenizer
sudachipy link -t full
```

# Overview of Usage
1. Preprocess using each tokenizer,
   e.g. sentencepiece preprocess / sudachi preprocess
2. Tokenize in your code using its tokenizer
   - Basics:
     see the [official docs](https://mokkemeguru.github.io/meguru_tokenizer/index.html)
   - TensorFlow:
     see the [tutorial](./tutorials/01_tokenize_tf.ipynb)
   - TODO: PyTorch

# RealWorld Example
```python
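import pprint
from pathlib import Path

# NOTE: the module path for Vocab is an assumption; adjust it if the package layout differs
from meguru_tokenizer.vocab import Vocab
from meguru_tokenizer.whitespace_tokenizer import WhitespaceTokenizer

sentences = [
    "Hello, I don't know how to use it?",
    "Tensorflow is awesome!",
    "it is good framework.",
]

# define tokenizer and vocabulary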
tokenizer = WhitespaceTokenizer(lower=True)
vocab = Vocab()

# build vocabulary
for sentence in sentences:
    vocab.add_vocabs(tokenizer.tokenize(sentence))
vocab.build_vocab()

# set vocabulary into tokenizer to enable encoding
tokenizer.vocab = vocab

# save vocabulary information
vocab.dump_vocab(Path("vocab.txt"))
print("vocabs:")
pprint.pprint(vocab.i2w)

# tokenize
print("tokenized sentence")
pprint.pprint(tokenizer.tokenize_list(sentences))
# [['hello', ',', 'i', 'do', "n't", 'know', 'how', 'to', 'use', 'it', '?'],
#  ['tensorflow', 'is', 'awesome', '!'],
#  ['it', 'is', 'good', 'framework', '.']]

# encode
print("encoded sentence")
encodes = [tokenizer.encode(sentence) for sentence in sentences]
pprint.pprint(encodes)
# [[7, 8, 9, 10, 11, 12, 13, 14, 15, 5, 16], [17, 6, 18, 19], [5, 6, 20, 21, 22]]

# decode
print("decoded sentence")
pprint.pprint([tokenizer.decode(tokens) for tokens in encodes])
# ["hello , i do n't know how to use it ?",
#  'tensorflow is awesome !',
#  'it is good framework .']

vocab_size = len(vocab)

# restore the vocabulary from the dumped file
print("reload from dump file")
vocab = Vocab()
vocab.load_vocab(Path("vocab.txt"))
assert vocab_size == len(vocab)

tokenizer = WhitespaceTokenizer(vocab=vocab)
pprint.pprint([tokenizer.encode(sentence) for sentence in sentences])
# [[7, 8, 9, 10, 11, 12, 13, 14, 15, 5, 16], [17, 6, 18, 19], [5, 6, 20, 21, 22]]

# vocabulary with minimum frequency limitation
vocab = Vocab()
for sentence in sentences:
    vocab.add_vocabs(tokenizer.tokenize(sentence))
vocab.build_vocab(min_freq=2)
assert vocab_size != len(vocab)

# vocabulary with maximum vocabulary size
vocab = Vocab()
for sentence in sentences:
    vocab.add_vocabs(tokenizer.tokenize(sentence))
vocab.build_vocab(vocab_size=10)
assert 10 == len(vocab)
```
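
To make the effect of `min_freq` and `vocab_size` easier to picture, here is a minimal standalone sketch of frequency-based vocabulary building in plain Python. It is not the library's implementation: the helper `build_vocab`, the five special token names, and the rule that the size limit counts the special tokens are all assumptions, chosen because they are consistent with the IDs shown in the example above (the most frequent tokens `it` and `is` get IDs 5 and 6).

```python
from collections import Counter

# Hypothetical special tokens; the library's actual reserved tokens may differ.
SPECIAL_TOKENS = ["<pad>", "<s>", "</s>", "<unk>", "<mask>"]


def build_vocab(tokenized, min_freq=1, vocab_size=None):
    """Return a token -> id mapping ordered by descending frequency."""
    counts = Counter(token for tokens in tokenized for token in tokens)
    # most_common() keeps first-seen order among tokens with equal counts.
    words = [word for word, count in counts.most_common() if count >= min_freq]
    if vocab_size is not None:
        # Assumption: the size limit includes the special tokens.
        words = words[: max(vocab_size - len(SPECIAL_TOKENS), 0)]
    return {word: idx for idx, word in enumerate(SPECIAL_TOKENS + words)}


# Pre-tokenized sentences, copied from the tokenizer output shown above.
tokenized = [
    ["hello", ",", "i", "do", "n't", "know", "how", "to", "use", "it", "?"],
    ["tensorflow", "is", "awesome", "!"],
    ["it", "is", "good", "framework", "."],
]

full = build_vocab(tokenized)
frequent = build_vocab(tokenized, min_freq=2)   # keeps only "it" and "is"
capped = build_vocab(tokenized, vocab_size=10)  # 5 specials + 5 words

# Unknown tokens fall back to the <unk> id.
unk = full["<unk>"]
encoded = [[full.get(token, unk) for token in tokens] for tokens in tokenized]

print(len(full), len(frequent), len(capped))  # 23 7 10
print(encoded[1])  # [17, 6, 18, 19]
```

Under these assumptions the sketch reproduces the encoded IDs from the example, which suggests (but does not confirm) that the library orders its vocabulary in a similar way.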