https://github.com/sappho192/bertjapanesetokenizer
Minimal Tokenizer implementation of BertJapanese(cl-tohoku/bert-base-japanese) in C#
- Host: GitHub
- URL: https://github.com/sappho192/bertjapanesetokenizer
- Owner: sappho192
- License: MIT
- Created: 2024-01-18T02:38:21.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-09-06T16:15:52.000Z (9 months ago)
- Last Synced: 2024-10-14T03:02:56.991Z (7 months ago)
- Topics: bert, bert-japanese, csharp, dotnet, library, nuget, tokenizer
- Language: C#
- Homepage:
- Size: 45.9 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# BertJapaneseTokenizer
Minimal Tokenizer implementation of BertJapanese([cl-tohoku/bert-base-japanese](https://github.com/cl-tohoku/bert-japanese)) in C#

# Quickstart
1. Add the `BertJapaneseTokenizer` package from NuGet.
2. Download the UniDic MeCab dictionary `unidic-mecab-2.1.2_bin.zip` from https://clrd.ninjal.ac.jp/unidic_archive/cwj/2.1.2/ and unzip the archive to a directory of your choice.
3. Download the BertJapanese vocab file from Hugging Face. For example, the `vocab.txt` of bert-base-japanese-v2 is available [here](https://huggingface.co/tohoku-nlp/bert-base-japanese-v2/tree/main).
   **(Alternatively, use the `HuggingFace.GetVocabFromHub()` helper method from this package. See the example below.)**
4. Follow the example code below and you are good to go.

```CSharp
using BertJapaneseTokenizer;

// Path to the unzipped UniDic MeCab dictionary (step 2)
var dicPath = @"D:\DATASET\unidic-mecab-2.1.2_bin";

// Either point to a local vocab file (step 3) ...
//var vocabPath = @"D:\DATASET\bert-japanese\bert-base-japanese-v2\vocab.txt";
// ... or fetch the vocab file from the Hugging Face Hub
var vocabPath = await HuggingFace.GetVocabFromHub("tohoku-nlp/bert-base-japanese-v2");
var tokenizer = new BertJapaneseTokenizer.BertJapaneseTokenizer(dicPath, vocabPath);

var sentence = "打ち合わせが終わった後にご飯を食べましょう。";
//var sentence = "ご飯を食べましょう。";
//var sentence = "打ち合わせ";

// Encode the sentence into token IDs and an attention mask
(var tokenIds, var attentionMask) = tokenizer.EncodePlus(sentence);
Console.WriteLine($"Sentence: {sentence}");
Console.WriteLine($"Token IDs: {string.Join(", ", tokenIds)}");

// Decode the token IDs back into text
var decoded = tokenizer.Decode(tokenIds);
Console.WriteLine($"Decoded: {decoded}");
```

# To-do List
- [x] Implement Decode() method
- [ ] Support BPE-type vocabulary (like `cl-tohoku/bert-base-japanese-char`)