https://github.com/freedomintelligence/freshbench
https://github.com/freedomintelligence/freshbench
Last synced: 3 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/freedomintelligence/freshbench
- Owner: FreedomIntelligence
- Created: 2024-01-11T07:06:19.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-02-13T08:54:27.000Z (over 1 year ago)
- Last Synced: 2025-11-10T00:21:04.922Z (7 months ago)
- Language: Python
- Size: 18.9 MB
- Stars: 6
- Watchers: 10
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: readme.md
Awesome Lists containing this project
README
#### about the prediction data
At `/GoodJudgeOpen_crawler`
There is a readme.md and a pipeline.sh.
You can crawl the data incrementally using playwright by yourself, then you can update the current gjo file from crawler.
```
base_path="./"
current_gjo_file_path="${base_path}test/raw_question/gjo_transformed_questions_no_output.jsonl"
gjo_raw_crawl_file_path="${base_path}GoodJudgeOpen_crawler/gjo.json"
python update_from_crawl.py \
--gjo_raw_crawl_file_path "$gjo_raw_crawl_file_path" \
--current_gjo_file_path "$current_gjo_file_path"
```
This updates the gjo questions in gjo_raw_crawl_file_path: by default `test/raw_question/gjo_transformed_questions_no_output.jsonl`
### Letting models answer and judge the choice incrementally
#### online:
fill /script/base-url.txt and /script/api-key.txt
fill the model names in `pipeline_api.sh`
`bash ./pipeline_api.sh`
#### local:
fill the model names in `pipeline_local.sh`
`bash ./pipeline_local.sh`
### answer will be at /answer_csv/.csv
### classify Nostalgia and Neophilia
specify the periods in analysis/acc_p1p2.py and analysis/acc_p1p2_T2.py, they test the two hyposis, make sure the time_intervals_dic are the same, [(-80, -60), (-20, 0)] means comparing the accuracy during 80 to 60 months before model release and accuracy during the 20 months before release.
```
time_intervals_dic = {
'[(-80, -60), (-20, 0)]':[(-80, -60), (-20, 0)],
'[(-60, -40), (-20, 0)]':[(-60, -40), (-20, 0)],
'[(-40, -20), (-20, 0)]':[(-40, -20), (-20, 0)],
}
```
```
python analysis/acc_p1p2.py
python analysis/acc_p1p2_T2.py
python analysis/classify.py
```
#### classify will generate a latex table
```
gpt4_1106 & Nostalgia *** & Nostalgia *** & Nostalgia *** \\
gpt_3.5_turbo_0613 & Balanced & Nostalgia *** & Nostalgia *** \\
```
# Citation
```
@inproceeding{zhu2024freshbench,
title={Is Your LLM Outdated? A Deep Look at Temporal Generalization},
author={Chenghao Zhu and Nuo Chen and Yufei Gao and Yunyi Zhang and Prayag Tiwari and Benyou Wang},
booktitle = {Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics},
year={2024}
}
```