https://github.com/zwhe99/llm-mt-eval
{DeepL, Google, WMT-Best, davinci-003, turbo, gpt-4} × {En-De, En-Cs, En-Ru, En-Zh, De-Fr, En-Ja, Uk-En, Uk-Cs, En-Hr, En-Ha, En-Is}
- Host: GitHub
- URL: https://github.com/zwhe99/llm-mt-eval
- Owner: zwhe99
- Created: 2023-06-13T10:39:07.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2023-06-18T11:01:11.000Z (almost 2 years ago)
- Last Synced: 2025-01-30T06:42:26.743Z (4 months ago)
- Language: Smalltalk
- Homepage:
- Size: 45.1 MB
- Stars: 14
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# LLM-MT-Eval
This repo evaluates
* DeepL
* Google Translate
* WMT22 Best
* text-davinci-003
* gpt-3.5-turbo-0301
* gpt-4-0314

with the following automatic metrics (a minimal scoring sketch follows the task lists below):
* COMET
* BLEURT
* BLEU
* chrF
* chrF++

on WMT22 general translation tasks:
* English<->German
* English<->Czech
* English<->Russian
* English<->Chinese
* German<->French
* English<->Japanese
* Ukrainian<->English
* Ukrainian<->Czech
* English->Croatian

and WMT21 news translation tasks:
* English<->Hausa
* English<->Icelandic
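BLEU, chrF, and chrF++ are string-based metrics that are typically computed with the `sacrebleu` package. The snippet below is a minimal sketch of corpus-level scoring, not code taken from this repo; the hypothesis and reference sentences are placeholders.

```python
# Minimal sketch: corpus-level BLEU / chrF / chrF++ with sacrebleu.
# Hypothesis and reference strings are illustrative placeholders.
from sacrebleu.metrics import BLEU, CHRF

hyps = ["Das ist ein kleiner Test.", "Hallo Welt!"]
refs = [["Dies ist ein kleiner Test.", "Hallo, Welt!"]]  # one reference stream

bleu    = BLEU().corpus_score(hyps, refs)
chrf    = CHRF().corpus_score(hyps, refs)                # chrF  (character n-grams only)
chrf_pp = CHRF(word_order=2).corpus_score(hyps, refs)    # chrF++ (adds word bigrams)

print(f"BLEU   {bleu.score:.2f}")
print(f"chrF   {chrf.score:.2f}")
print(f"chrF++ {chrf_pp.score:.2f}")
```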
### Results

#### System outputs
```
output/
|-- deepl
|-- google-cloud
|-- gpt-3.5-turbo-0301
|-- gpt-4-0314
|-- text-davinci-003
`-- wmt-winner
```
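A quick way to see what is available is to walk the tree above. This is only a sketch: it assumes one subdirectory per system, as shown, and makes no assumption about how hypothesis files are named inside each system directory.

```python
# Sketch: list the systems under output/ and count their hypothesis files.
# Assumes one subdirectory per system, as in the tree above; the file layout
# inside each system directory is not documented here.
from pathlib import Path

output_dir = Path("output")
for system_dir in sorted(p for p in output_dir.iterdir() if p.is_dir()):
    n_files = sum(1 for f in system_dir.rglob("*") if f.is_file())
    print(f"{system_dir.name}: {n_files} files")
```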
#### Full results
#### **Average performance**
**All language pairs** (except for those not supported by DeepL)
**High resource**
* En<->De, En<->Cs, En<->Ru, En<->Zh
**Medium resource**
* De<->Fr, En<->Uk, En<->Ja
**Low resource**
* Uk<->Cs, En<->Hr, En<->Ha, En<->Is
### Evaluation
```sh
# download and unpack the BLEURT-20 checkpoint
wget https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip
unzip BLEURT-20.zip
# run the evaluation script against the unpacked checkpoint
python3 evaluation/eval.py --bleurt-ckpt BLEURT-20
```
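The commands above fetch the BLEURT-20 checkpoint that the evaluation script is pointed at. For context, the snippet below is a rough sketch of how BLEURT-20 and a COMET model are commonly scored from Python with the `bleurt` and `unbabel-comet` packages; it is not taken from `evaluation/eval.py`, and the example sentences are placeholders.

```python
# Sketch of neural-metric scoring with BLEURT-20 and COMET (wmt22-comet-da).
# Not the repo's eval script; sentences below are illustrative placeholders.
from bleurt import score as bleurt_score
from comet import download_model, load_from_checkpoint

srcs = ["This is a small test."]
hyps = ["Das ist ein kleiner Test."]
refs = ["Dies ist ein kleiner Test."]

# BLEURT: reference-based, returns one score per segment.
bleurt_scorer = bleurt_score.BleurtScorer("BLEURT-20")
bleurt_scores = bleurt_scorer.score(references=refs, candidates=hyps)

# COMET: uses source + hypothesis + reference, returns segment and system scores.
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
comet_out = comet_model.predict(
    [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)],
    batch_size=8,
    gpus=0,  # set to 1 if a GPU is available
)

print("BLEURT:", sum(bleurt_scores) / len(bleurt_scores))
print("COMET: ", comet_out.system_score)
```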