https://github.com/retarfi/jptranstokenizer
Japanese Tokenizer for transformers library
- Host: GitHub
- URL: https://github.com/retarfi/jptranstokenizer
- Owner: retarfi
- License: MIT
- Created: 2022-08-24T10:35:03.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-02-02T18:08:10.000Z (10 months ago)
- Last Synced: 2024-10-14T06:37:46.081Z (about 1 month ago)
- Topics: japanese, natural-language-processing, nlp, transformer
- Language: Python
- Homepage:
- Size: 813 KB
- Stars: 5
- Watchers: 2
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff
jptranstokenizer: Japanese Tokenizer for transformers
This is a repository for a Japanese tokenizer that works with the HuggingFace Transformers library.
You can use `JapaneseTransformerTokenizer` like `transformers.BertJapaneseTokenizer`.
**Issues may be written in Japanese.**

## Documentation
Documentation is available on [Read the Docs](https://jptranstokenizer.readthedocs.io/en/latest/index.html).
## Install
```
pip install jptranstokenizer
```

## Quickstart
This example uses `jptranstokenizer.JapaneseTransformerTokenizer` with the [SentencePiece model of nlp-waseda/roberta-base-japanese](https://huggingface.co/nlp-waseda/roberta-base-japanese) and Juman++.
Before running the following steps, you need to **install pyknp and Juman++**.

```python
>>> from jptranstokenizer import JapaneseTransformerTokenizer
>>> tokenizer = JapaneseTransformerTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese")
>>> tokens = tokenizer.tokenize("外国人参政権")
# tokens: ['▁外国', '▁人', '▁参政', '▁権']
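>>> # Aside (illustrative note, not part of the original example): the "▁"
>>> # prefix is SentencePiece's word-boundary marker; joining the pieces and
>>> # replacing "▁" with a space recovers the Juman++-segmented text.
>>> "".join(tokens).replace("▁", " ").strip()
'外国 人 参政 権'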
```

Note that different dependencies are required depending on the type of tokenizer you use.
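Because the required extra packages vary by tokenizer type, a quick pre-flight check can save a confusing import error at load time. Below is a minimal stdlib-only sketch; the package-to-tokenizer mapping is an assumption based on the pyknp/Juman++ note above (plus fugashi, which MeCab-based tokenizers in the Transformers ecosystem commonly need), not an official list from this project:

```python
import importlib.util

# Assumed mapping (illustrative only): Juman++-based tokenizers need pyknp,
# as noted above; MeCab-based ones typically need fugashi.
REQUIRED = {"juman": ["pyknp"], "mecab": ["fugashi"]}

def missing_packages(tokenizer_type: str) -> list:
    """Return the required packages for this tokenizer type that are not importable."""
    return [
        pkg for pkg in REQUIRED.get(tokenizer_type, [])
        if importlib.util.find_spec(pkg) is None
    ]

if __name__ == "__main__":
    for kind in REQUIRED:
        gaps = missing_packages(kind)
        print(kind, "ok" if not gaps else f"missing: {gaps}")
```

Running the check before `from_pretrained` makes the failure mode explicit instead of surfacing as a mid-load `ImportError`.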
See also [Quickstart on Read the Docs](https://jptranstokenizer.readthedocs.io/en/latest/quickstart.html).

## Citation
**Another paper is in preparation; be sure to check here again before citing.**

### This Implementation
```
@inproceedings{Suzuki-2023-nlp,
jtitle = {{異なる単語分割システムによる日本語事前学習言語モデルの性能評価}},
title = {{Performance Evaluation of Japanese Pre-trained Language Models with Different Word Segmentation Systems}},
jauthor = {鈴木, 雅弘 and 坂地, 泰紀 and 和泉, 潔},
author = {Suzuki, Masahiro and Sakaji, Hiroki and Izumi, Kiyoshi},
jbooktitle = {言語処理学会 第29回年次大会 (NLP2023)},
booktitle = {29th Annual Meeting of the Association for Natural Language Processing (NLP)},
year = {2023},
pages = {894-898}
}
```

## Related Work
- Pretrained Japanese BERT models (including a Japanese tokenizer)
  - Author: NLP Lab. at Tohoku University
- https://github.com/cl-tohoku/bert-japanese
- SudachiTra
  - Author: Works Applications
- https://github.com/WorksApplications/SudachiTra
- UD_Japanese-GSD
  - Author: megagonlabs
- https://github.com/megagonlabs/UD_Japanese-GSD
- Juman++
  - Author: Kurohashi Lab. at Kyoto University
- https://github.com/ku-nlp/jumanpp