https://github.com/zwhe99/llm-mt-eval

{DeepL, Google, WMT-Best, davinci-003, turbo, gpt-4} × {En-De, En-Cs, En-Ru, En-Zh, De-Fr, En-Ja, Uk-En, Uk-Cs, En-Hr, En-Ha, En-Is}

# LLM-MT-Eval

This repo evaluates

* DeepL
* Google Translate
* WMT22 Best
* text-davinci-003
* gpt-3.5-turbo-0301
* gpt-4-0314

with automatic metrics:

* COMET
* BLEURT
* BLEU
* chrF
* chrF++

on WMT22 general translation tasks:

* English<->German
* English<->Czech
* English<->Russian
* English<->Chinese
* German<->French
* English<->Japanese
* Ukrainian<->English
* Ukrainian<->Czech
* English->Croatian

and WMT21 news translation tasks:

* English<->Hausa
* English<->Icelandic
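The lists above mix bidirectional (`<->`) and one-directional (`->`) pairs. As a minimal sketch of how they expand into individual translation directions, the snippet below uses ISO 639-1 codes and a hypothetical helper `expand_directions`; the `src-tgt` naming is an assumption for illustration, not necessarily the naming this repo uses.

```python
# Language pairs as listed above; "<->" means both directions are evaluated.
WMT22_PAIRS = [
    "en<->de", "en<->cs", "en<->ru", "en<->zh",
    "de<->fr", "en<->ja", "uk<->en", "uk<->cs",
    "en->hr",
]
WMT21_PAIRS = ["en<->ha", "en<->is"]


def expand_directions(pairs):
    """Expand 'a<->b' into ['a-b', 'b-a'] and 'a->b' into ['a-b'].

    Hypothetical helper: the 'src-tgt' naming is an assumption.
    """
    directions = []
    for pair in pairs:
        # Check "<->" first, because it also contains "->".
        if "<->" in pair:
            src, tgt = pair.split("<->")
            directions += [f"{src}-{tgt}", f"{tgt}-{src}"]
        else:
            src, tgt = pair.split("->")
            directions.append(f"{src}-{tgt}")
    return directions


wmt22 = expand_directions(WMT22_PAIRS)
wmt21 = expand_directions(WMT21_PAIRS)
print(len(wmt22), len(wmt21))  # 17 4
```

Under this reading, the evaluation covers 21 translation directions in total.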

### Results

#### System outputs

```
output/
|-- deepl
|-- google-cloud
|-- gpt-3.5-turbo-0301
|-- gpt-4-0314
|-- text-davinci-003
`-- wmt-winner
```
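To check that a local checkout contains every system directory shown in the tree above, something like the following could be used. It is a sketch only: `missing_systems` is a hypothetical helper, and nothing is assumed about the files inside each directory.

```python
from pathlib import Path

# Directory names taken from the tree above.
EXPECTED_SYSTEMS = {
    "deepl", "google-cloud", "gpt-3.5-turbo-0301",
    "gpt-4-0314", "text-davinci-003", "wmt-winner",
}


def missing_systems(output_dir):
    """Return the expected system directories that are absent from output_dir."""
    root = Path(output_dir)
    if not root.is_dir():
        # Nothing exists yet, so every system is missing.
        return sorted(EXPECTED_SYSTEMS)
    present = {p.name for p in root.iterdir() if p.is_dir()}
    return sorted(EXPECTED_SYSTEMS - present)
```

Calling `missing_systems("output")` from the repository root should return an empty list when all six directories are present.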

#### Full results


[Image: main results]

#### **Average performance**

**All language pairs** (except for those not supported by DeepL)


[Image: avg_all]

**High resource**

* En<->De, En<->Cs, En<->Ru, En<->Zh


[Image: avg_high]

**Medium resource**

* De<->Fr, En<->Uk, En<->Ja


[Image: avg_mid]

**Low resource**

* Uk<->Cs, En->Hr, En<->Ha, En<->Is


[Image: avg_low]
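The grouped averages can be thought of as a mean of per-direction scores within each resource group. The sketch below assumes an unweighted mean over directions and skips directions that have no score (for example, pairs a system does not support); whether the repo averages exactly this way is an assumption, and `group_averages` is a hypothetical helper.

```python
from statistics import mean


def group_averages(scores, groups):
    """Average per-direction scores within each named group.

    scores: dict mapping a direction such as "en-de" to a metric value.
    groups: dict mapping a group name to a list of directions.
    Directions without a score are skipped; a group with no scores is omitted.
    """
    averages = {}
    for name, directions in groups.items():
        values = [scores[d] for d in directions if d in scores]
        if values:
            averages[name] = mean(values)
    return averages


# Placeholder values for illustration only; these are NOT results from this repo.
dummy_scores = {"en-de": 1.0, "de-en": 3.0, "de-fr": 2.0}
dummy_groups = {
    "high": ["en-de", "de-en"],
    "medium": ["de-fr", "fr-de"],
    "low": ["en-ha"],
}
print(group_averages(dummy_scores, dummy_groups))  # {'high': 2.0, 'medium': 2.0}
```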

### Evaluation

```sh
wget https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip
unzip BLEURT-20.zip
python3 evaluation/eval.py --bleurt-ckpt BLEURT-20
```
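For intuition about what a character n-gram metric such as chrF measures, here is a simplified, dependency-free sketch. It is not the evaluation script's implementation and may differ from sacreBLEU in details such as how n-gram orders are averaged; `simple_chrf` and its defaults (`max_n=6`, `beta=2.0`) are illustrative assumptions.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF on a 0-100 scale.

    Precision and recall are averaged over n-gram orders 1..max_n
    (orders for which either string has no n-grams are skipped), then
    combined into an F-beta score. Illustration only, not sacreBLEU.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue
        # Counter intersection keeps the minimum count of each shared n-gram.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / hyp_total)
        recalls.append(overlap / ref_total)
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

With `beta=2.0`, recall is weighted more heavily than precision. An identical hypothesis and reference score 100, and strings that share no characters score 0.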