https://github.com/retarfi/jptranstokenizer
Japanese Tokenizer for transformers library
- Host: GitHub
- URL: https://github.com/retarfi/jptranstokenizer
- Owner: retarfi
- License: MIT
- Created: 2022-08-24T10:35:03.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-02-02T18:08:10.000Z (10 months ago)
- Last Synced: 2024-10-14T06:37:46.081Z (about 1 month ago)
- Topics: japanese, natural-language-processing, nlp, transformer
- Language: Python
- Homepage:
- Size: 813 KB
- Stars: 5
- Watchers: 2
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff
jptranstokenizer: Japanese Tokenizer for transformers
This is a repository for a Japanese tokenizer that works with the HuggingFace Transformers library.
You can use `JapaneseTransformerTokenizer` like `transformers.BertJapaneseTokenizer`.
**Issues may be written in Japanese.**

## Documentation
Documentation is available on [Read the Docs](https://jptranstokenizer.readthedocs.io/en/latest/index.html).
## Install
```
pip install jptranstokenizer
```

## Quickstart
This example uses `jptranstokenizer.JapaneseTransformerTokenizer` with the [SentencePiece model of nlp-waseda/roberta-base-japanese](https://huggingface.co/nlp-waseda/roberta-base-japanese) and Juman++.
Before running the following steps, you need to **install pyknp and Juman++**.

```python
>>> from jptranstokenizer import JapaneseTransformerTokenizer
>>> tokenizer = JapaneseTransformerTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese")
>>> tokens = tokenizer.tokenize("外国人参政権")
# tokens: ['▁外国', '▁人', '▁参政', '▁権']
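>>> # Aside (illustrative note, not part of the original example): the "▁"
>>> # prefix is SentencePiece's word-boundary marker; joining the pieces and
>>> # replacing "▁" with a space recovers the Juman++-segmented text.
>>> "".join(tokens).replace("▁", " ").strip()
'外国 人 参政 権'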
```

Note that different dependencies are required depending on the type of tokenizer you use.
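Because the required extra packages vary by tokenizer type, a quick pre-flight check can save a confusing import error at load time. Below is a minimal stdlib-only sketch; the package-to-tokenizer mapping is an assumption based on the pyknp/Juman++ note above (plus fugashi, which MeCab-based tokenizers in the Transformers ecosystem commonly need), not an official list from this project:

```python
import importlib.util

# Assumed mapping (illustrative only): Juman++-based tokenizers need pyknp,
# as noted above; MeCab-based ones typically need fugashi.
REQUIRED = {"juman": ["pyknp"], "mecab": ["fugashi"]}

def missing_packages(tokenizer_type: str) -> list:
    """Return the required packages for this tokenizer type that are not importable."""
    return [
        pkg for pkg in REQUIRED.get(tokenizer_type, [])
        if importlib.util.find_spec(pkg) is None
    ]

if __name__ == "__main__":
    for kind in REQUIRED:
        gaps = missing_packages(kind)
        print(kind, "ok" if not gaps else f"missing: {gaps}")
```

Running the check before `from_pretrained` makes the failure mode explicit instead of surfacing as a mid-load `ImportError`.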
See also [Quickstart on Read the Docs](https://jptranstokenizer.readthedocs.io/en/latest/quickstart.html).

## Citation
**Another paper is in preparation; be sure to check here again before citing.**

### This Implementation
```
@inproceedings{Suzuki-2023-nlp,
jtitle = {{異なる単語分割システムによる日本語事前学習言語モデルの性能評価}},
title = {{Performance Evaluation of Japanese Pre-trained Language Models with Different Word Segmentation Systems}},
jauthor = {鈴木, 雅弘 and 坂地, 泰紀 and 和泉, 潔},
author = {Suzuki, Masahiro and Sakaji, Hiroki and Izumi, Kiyoshi},
jbooktitle = {言語処理学会 第29回年次大会 (NLP2023)},
booktitle = {29th Annual Meeting of the Association for Natural Language Processing (NLP)},
year = {2023},
pages = {894-898}
}
```

## Related Work
- Pretrained Japanese BERT models (including a Japanese tokenizer)
  - Author: NLP Lab. at Tohoku University
- https://github.com/cl-tohoku/bert-japanese
- SudachiTra
  - Author: Works Applications
- https://github.com/WorksApplications/SudachiTra
- UD_Japanese-GSD
  - Author: megagonlabs
- https://github.com/megagonlabs/UD_Japanese-GSD
- Juman++
  - Author: Kurohashi Lab. at Kyoto University
- https://github.com/ku-nlp/jumanpp