https://github.com/jancervenka/czech-simpleqa
How well can language models answer questions in Czech?
- Host: GitHub
- URL: https://github.com/jancervenka/czech-simpleqa
- Owner: jancervenka
- License: mit
- Created: 2025-01-09T23:42:02.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-12T16:57:11.000Z (over 1 year ago)
- Last Synced: 2025-01-12T17:35:21.034Z (over 1 year ago)
- Topics: ai, artificial-intelligence, claude, evals, gpt, language-model, llm
- Language: Python
- Homepage:
- Size: 2.48 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Czech-SimpleQA
[eval-data]: https://raw.githubusercontent.com/jancervenka/czech-simpleqa/refs/heads/main/src/czech_simpleqa/czech_simpleqa.csv.gz
[simple-evals]: https://github.com/openai/simple-evals/tree/main
[simpleqa-arxiv]: https://arxiv.org/abs/2411.04368
[blogpost]: https://jancervenka.github.io/2025/01/12/czech-simpleqa.html
Problems and answers from [OpenAI's SimpleQA eval][simple-evals] translated into Czech. This work is
based on the data from [the paper][simpleqa-arxiv]:
>**Measuring short-form factuality in large language models**
>*Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, William Fedus*
>arXiv preprint arXiv:2411.04368, 2024. [https://arxiv.org/abs/2411.04368](https://arxiv.org/abs/2411.04368)
| model | SimpleQA[^1] | Czech-SimpleQA |
|---------------------------:|---------:|---------------:|
| gpt-4o-mini-2024-07-18 | 9.5 | 8.1 |
| gpt-4o-2024-11-20 | 38.8 | 31.4 |
| claude-3-5-sonnet-20240620 | 35.0 | 25.8 |
| claude-3-5-sonnet-20241022 | N/A | 31.1 |
| claude-3-5-haiku-20241022 | N/A | 9.3 |
**[There is a post on my blog with more detailed results!][blogpost]**
[^1]: As reported in the [SimpleQA README.md][simple-evals] and in [the paper][simpleqa-arxiv].
## I Just Want the Eval Data
The data file lives at `src/czech_simpleqa/czech_simpleqa.csv.gz` ([full URL][eval-data]).
Getting it with `pandas` looks like this:
```python
import pandas as pd

eval_data = pd.read_csv(
    "https://raw.githubusercontent.com/jancervenka/"
    "czech-simpleqa/refs/heads/main/src/czech_simpleqa/czech_simpleqa.csv.gz"
)
```
| problem | target | czech_problem | czech_target |
|:--------------------------------------------------------------------------:|:------------------------:|:-----------------------------------------------------------------------:|:-----------------------:|
| What was the population count in the 2011 census of the Republic of Nauru? | 10,084 | Jaký byl počet obyvatel při sčítání lidu v roce 2011 v Republice Nauru? | 10 084 |
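If you would rather not depend on `pandas`, the same file can be read with the standard library. This is a minimal sketch, assuming you have downloaded the file locally and that it has a header row with the four columns shown above; `read_eval_rows` is a hypothetical helper name, not part of the package.

```python
import csv
import gzip


def read_eval_rows(path):
    """Yield one dict per row from a local copy of czech_simpleqa.csv.gz."""
    # newline="" lets the csv module handle quoted fields correctly.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # Assumes the first row is a header naming the columns.
        yield from csv.DictReader(handle)
```

Each yielded row is a plain dict, so `row["czech_problem"]` and `row["czech_target"]` give the translated question and its expected answer.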
## I Want to Use the Python Package
The package contains everything required to run the eval end-to-end and collect the results.
You can install it with `pip` or any other Python package manager, then run the eval as a module:
```bash
# Install the package
pip install czech-simpleqa

# Run the eval
python -m czech_simpleqa.eval \
    --answering_model claude-3-5-haiku-20241022 \
    --grading_model gpt-4o \
    --output_file_path output/claude-3-5-haiku-20241022.csv \
    --max_concurrent_tasks 30
```
### CLI Arguments
- `--answering_model`: Model that will generate predicted answers to the problems in the eval.
- `--grading_model`: Model that will grade the predicted answers from the answering model.
- `--output_file_path`: Where to store the `.csv` file with the eval results.
- `--max_concurrent_tasks`: Maximum number of concurrent model calls (default 20).
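To evaluate several answering models in a row, the CLI can be driven from a short script. The sketch below only uses the four flags listed above; `build_eval_command` and `run_evals` are hypothetical helper names, and the `output/<model>.csv` naming is an assumption modelled on the example invocation.

```python
import subprocess
import sys


def build_eval_command(answering_model, grading_model, output_file_path,
                       max_concurrent_tasks=20):
    """Return the argv list for one `python -m czech_simpleqa.eval` run."""
    return [
        sys.executable, "-m", "czech_simpleqa.eval",
        "--answering_model", answering_model,
        "--grading_model", grading_model,
        "--output_file_path", output_file_path,
        "--max_concurrent_tasks", str(max_concurrent_tasks),
    ]


def run_evals(answering_models, grading_model, output_dir="output"):
    """Run the eval once per answering model, writing one CSV per model."""
    for model in answering_models:
        command = build_eval_command(
            model, grading_model, f"{output_dir}/{model}.csv"
        )
        # check=True raises CalledProcessError if the eval exits non-zero.
        subprocess.run(command, check=True)
```

Calling `run_evals(["claude-3-5-haiku-20241022", "gpt-4o-mini-2024-07-18"], "gpt-4o")` would run the two evals sequentially; concurrency within each run is still governed by `--max_concurrent_tasks`.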
### Output File Schema
| problem | target | predicted_answer | grade |
|:---------------------------------------:|:---------:|:----------------------------------------:|:-----:|
| Jaké je rozlišení Cat B15 Q v pixelech? | 480 x 800 | Cat B15 Q má rozlišení 480 x 800 pixelů. | A |
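An output file in this schema can be reduced to a per-grade percentage with a few lines of Python. This is a sketch, not the package's own scoring code: `summarize_grades` is a hypothetical helper, and the letter-to-label mapping is an assumption borrowed from the grader in OpenAI's simple-evals (only the grade `A` appears in this README), so check it against your own output.

```python
import csv
from collections import Counter

# Assumed mapping, following the simple-evals grader convention.
GRADE_LABELS = {"A": "correct", "B": "incorrect", "C": "not_attempted"}


def summarize_grades(path):
    """Return the share of each grade in an eval output CSV, in percent."""
    with open(path, encoding="utf-8", newline="") as handle:
        counts = Counter(row["grade"] for row in csv.DictReader(handle))
    total = sum(counts.values())
    if total == 0:
        return {}
    # Unknown grade letters are kept as-is rather than dropped.
    return {
        GRADE_LABELS.get(grade, grade): 100 * count / total
        for grade, count in counts.items()
    }
```

For example, `summarize_grades("output/claude-3-5-haiku-20241022.csv")` returns a dict keyed by grade label, whose `"correct"` entry should correspond to the kind of score reported in the results table if the mapping assumption holds.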
### Supported Models
Models from OpenAI and Anthropic are currently supported. Set the `OPENAI_API_KEY` and/or
`ANTHROPIC_API_KEY` environment variables for the providers of the models you use.
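A quick check before a long run can catch a missing key early. The sketch below is illustrative only: `missing_api_keys` is a hypothetical helper, and mapping model names to providers by the `gpt` and `claude` prefixes is an assumption that may not match how the package routes models.

```python
import os

# Assumed prefix-to-variable mapping, for illustration only.
PROVIDER_KEYS = {"gpt": "OPENAI_API_KEY", "claude": "ANTHROPIC_API_KEY"}


def missing_api_keys(model_names, environ=None):
    """Return the API key variables the given models need but that are unset."""
    environ = os.environ if environ is None else environ
    needed = {
        key
        for name in model_names
        for prefix, key in PROVIDER_KEYS.items()
        if name.startswith(prefix)
    }
    # Treat empty strings as unset.
    return sorted(key for key in needed if not environ.get(key))
```

With the example invocation above (a Claude answering model graded by `gpt-4o`), both variables would be required, so `missing_api_keys(["claude-3-5-haiku-20241022", "gpt-4o"])` should return an empty list before you start.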
## Model Results
Answers with their grades from all the evaluated models can be found in the `model_results/` directory.