https://github.com/sappho192/edmtranslator
.NET Text translator library based on LLM models, especially EncoderDecoderModel in HuggingFace
https://github.com/sappho192/edmtranslator
dotnet encoder-decoder-model huggingface library llm
Last synced: 3 months ago
JSON representation
.NET Text translator library based on LLM models, especially EncoderDecoderModel in HuggingFace
- Host: GitHub
- URL: https://github.com/sappho192/edmtranslator
- Owner: sappho192
- License: mit
- Created: 2024-06-10T04:13:27.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-09-06T16:34:57.000Z (9 months ago)
- Last Synced: 2025-02-04T15:47:14.649Z (4 months ago)
- Topics: dotnet, encoder-decoder-model, huggingface, library, llm
- Language: C#
- Homepage:
- Size: 86.9 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
- [EDMTranslator](#edmtranslator)
- [Nuget Package list](#nuget-package-list)
- [Requirements](#requirements)
- [Supported models](#supported-models)
- [Quickstart](#quickstart)
- [Install the packages](#install-the-packages)
- [Prepare the required data](#prepare-the-required-data)
- [Japanese dictionary](#japanese-dictionary)
- [Fine-tuned translator model](#fine-tuned-translator-model)
- [Implement the driver code](#implement-the-driver-code)
- [How to build](#how-to-build)# EDMTranslator
Text translator library based on LLM models, especially EncoderDecoderModel in HuggingFace
# Nuget Package list
| Package | repo | description |
| ------------- | ------------------------------------------------------------------------------------------------------------------------------- | ------------ |
| EDMTranslator | [](https://www.nuget.org/packages/EDMTranslator/) | Main library |# Requirements
* .NET 6 or above
* **Free RAM spaces at least 3.5GB** before running the translator# Supported models
* JESCJaEnTranslator([sappho192/jesc-ja-en-translator](https://huggingface.co/sappho192/jesc-ja-en-translator)): Japanese-to-English translator based on `tohoku-nlp/bert-base-japanese-v2` and `openai-community/gpt2`, fine-tuned with JESC dataset
* FF14JaKoTranslator([sappho192/ffxiv-ja-ko-translator](https://github.com/sappho192/ffxiv-ja-ko-translator)): Japanese-to-Korean translator based on `tohoku-nlp/bert-base-japanese-v2` and `skt/kogpt2-base-v2`, fine-tuned with FF14 dataset
* AihubJaKoTranslator([sappho192/aihub-ja-ko-translator](https://huggingface.co/sappho192/aihub-ja-ko-translator)): Japanese-to-Korean translator based on `tohoku-nlp/bert-base-japanese-v2` and `skt/kogpt2-base-v2`, fine-tuned with AIHub dataset
* More to be added...# Quickstart
Following guide **supposes that you are to use JESCJaEnTranslator** mentioned above.
## Install the packages
1. From the NuGet, install `EDMTranslator` package
2. And then, install `Tokenizers.DotNet.runtime.win` package too## Prepare the required data
### Japanese dictionary
* Download unidic mecab dictionary `unidic-mecab-2.1.2_bin.zip` from https://clrd.ninjal.ac.jp/unidic_archive/cwj/2.1.2/ and unzip the archive into somewhere
### Fine-tuned translator model
* Download the translator model from [sappho192/jesc-ja-en-translator](https://huggingface.co/sappho192/jesc-ja-en-translator/blob/main/onnx_jesc-ja-en.7z) (especially `onnx_jesc-ja-en.7z`) and unzip the archive into somewhere
## Implement the driver code
Write the code like below and you are good to go 🫡
Note that you need to fix the path of `encoderDictDir` and `modelDir` correctly.```csharp
// Console application which translates Japanese sentence to English with JESCJaEnTranslatorusing EDMTranslator.Tokenization;
using EDMTranslator.Translation;// Prepare the tokenizer
var encoderVocabPath = await BertJapaneseTokenizer.HuggingFace.GetVocabFromHub("tohoku-nlp/bert-base-japanese-v2");
var hubName = "openai-community/gpt2";
var decoderVocabFilename = "tokenizer.json";
var decoderVocabPath = await Tokenizers.DotNet.HuggingFace.GetFileFromHub(hubName, decoderVocabFilename, "deps");string encoderDictDir = @"D:\DATASET\unidic-mecab-2.1.2_bin";
var tokenizer = new BertJa2GPTTokenizer(
encoderDictDir: encoderDictDir, encoderVocabPath: encoderVocabPath,
decoderVocabPath: decoderVocabPath);void TestTokenizer(ITokenizer tokenizer)
{
Console.WriteLine("--Tokenizer test--");
Console.WriteLine("[Encode]");
var sentenceJa = "打ち合わせが終わった後にご飯を食べましょう。";
Console.WriteLine($"Input: {sentenceJa}");
var (embeddingsJa, attentionMask) = tokenizer.Encode(sentenceJa);
Console.WriteLine($"Encoded: {string.Join(", ", embeddingsJa)}");Console.WriteLine("[Decode]");
// Tokens of "i was nervous before the exam, and i had a fever."
var tokens = new uint[] { 72, 373, 10927, 878, 262, 2814, 11, 290, 1312, 550, 257, 17372, 13 };
Console.WriteLine($"Input: {string.Join(", ", tokens)}");
var decoded = tokenizer.Decode(tokens);
Console.WriteLine($"Decoded: {decoded}");
}
TestTokenizer(tokenizer);// Prepare the translator
string modelDir = @"D:\MODEL\jesc-ja-en-translator\onnx"; // The folder should contains encoder_model.onnx and decoder_model_merged.onnx
var translator = new JESCJaEnTranslator(tokenizer, modelDir);
void TestTranslator(JESCJaEnTranslator translator)
{
Console.WriteLine("--Translator test--");
Translate(translator, "打ち合わせが終わった後にご飯を食べましょう。");
Translate(translator, "試験前に緊張したあまり、熱がでてしまった。");
Translate(translator, "山田は英語にかけてはクラスの誰にも負けない。");
Translate(translator, "この本によれば、最初の人工橋梁は新石器時代にさかのぼるという。");
}
TestTranslator(translator);static void Translate(JESCJaEnTranslator translator, string sentence)
{
Console.WriteLine($"SourceText: {sentence}");
string translated = translator.Translate(sentence);
Console.WriteLine($"Translated: {translated}");
}
```# How to build
1. Prepare following stuff:
1. .NET build system (`dotnet 6.0, 7.0, 8.0`)
2. PowerShell (Recommend `7.4.2` or above)
2. Run `cbuild.ps1`The build artifact will be saved in `nuget` directory.