{"id":20046598,"url":"https://github.com/ncsoft/offsetbias","last_synced_at":"2025-05-05T09:31:45.466Z","repository":{"id":249609020,"uuid":"824470608","full_name":"ncsoft/offsetbias","owner":"ncsoft","description":"Official implementation of \"OffsetBias: Leveraging Debiased Data for Tuning Evaluators\"","archived":false,"fork":false,"pushed_at":"2024-09-09T05:43:05.000Z","size":42,"stargazers_count":9,"open_issues_count":2,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-09-09T06:58:18.380Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ncsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-05T07:50:45.000Z","updated_at":"2024-09-09T05:43:08.000Z","dependencies_parsed_at":"2024-07-22T07:43:23.794Z","dependency_job_id":"c635b91d-329a-4cd0-8857-faf701969b44","html_url":"https://github.com/ncsoft/offsetbias","commit_stats":null,"previous_names":["ncsoft/offsetbias"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ncsoft%2Foffsetbias","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ncsoft%2Foffsetbias/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ncsoft%2Foffsetbias/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ncsoft%2Foffsetbias/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ncsoft
","download_url":"https://codeload.github.com/ncsoft/offsetbias/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224437948,"owners_count":17311109,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-13T11:24:58.664Z","updated_at":"2024-11-13T11:24:59.215Z","avatar_url":"https://github.com/ncsoft.png","language":"Python","funding_links":[],"categories":["Tools"],"sub_categories":["LLM Evaluations and Benchmarks"],"readme":"# OffsetBias: Leveraging Debiased Data for Tuning Evaluators\n\n\u003cp align=\"center\"\u003e\n        🤗 \u003ca href=\"https://huggingface.co/datasets/NCSOFT/offsetbias\"\u003eDataset\u003c/a\u003e\u0026nbsp | \u003ca href=\"https://huggingface.co/NCSOFT/Llama-3-OffsetBias-8B\"\u003eGeneration Model\u003c/a\u003e\u0026nbsp | \u003ca href=\"https://huggingface.co/NCSOFT/Llama-3-OffsetBias-RM-8B\"\u003eReward Model\u003c/a\u003e\u0026nbsp  | 📜 \u003ca href=\"https://arxiv.org/abs/2407.06551\"\u003ePaper\u003c/a\u003e\u0026nbsp\n\u003cbr\u003e\n\nOfficial implementation for paper **OffsetBias: Leveraging Debiased Data for Tuning Evaluators**. 
In the paper we present:\n- **EvalBiasBench**, a meta-evaluation benchmark for testing judge models,\n- **OffsetBias Data**, a training dataset for pairwise preference evaluation,\n- **OffsetBias Model**, a judge model trained using OffsetBias Data.\n\nThis repository contains sample code for running the **OffsetBias Model** for evaluation, the **EvalBiasBench** dataset, and an inference script for running various evaluation models on various meta-evaluation benchmarks.\n\n## Requirements\n\n```sh\npip install -r requirements.txt\n```\n\n## Evaluation Inference with OffsetBias Model\n\nOffsetBias Model works as a judge model that performs the pairwise preference evaluation task: given an *Instruction*, *Output (a)*, and *Output (b)*, it selects the output that better responds to the instruction. You can use modules from this repository for simple and quick inference. Example code is in `offsetbias_inference.py`.\n```python\nfrom module import VllmModule\n\ninstruction = \"explain like im 5\"\noutput_a = \"Scientists are studying special cells that could help treat a sickness called prostate cancer. They even tried these cells on mice and it worked!\"\noutput_b = \"Sure, I'd be happy to help explain something to you! What would you like me to explain?\"\n\nmodel_name = \"NCSOFT/Llama-3-OffsetBias-8B\"\nmodule = VllmModule(prompt_name=\"offsetbias\", model_name=model_name)\n\nconversation = module.make_conversation(\n  instruction=instruction,\n  response1=output_a,\n  response2=output_b,\n  swap=False)\n\noutput = module.generate([conversation])\nprint(output[0])\n# The model should output \"Output (b)\"\n```\n\n## Running EvalBiasBench\n\n**EvalBiasBench** is a benchmark for testing the robustness of *judge models* to evaluation scenarios containing biases. You can find the benchmark data under `data/evalbiasbench/`. 
The following shows instructions for running inference with various *judge models*, including the *OffsetBias* model, on several benchmarks, including **EvalBiasBench**.\n\n### Configuration\n\nPrepare a model configuration under `config/`. For OpenAI models, an API key is required.\n\nA configuration file, `offsetbias-8b.yaml`, looks like the following:\n```yaml\nprompt: llmbar # name of prompt file under prompt/\n\nvllm_args:\n  model_args: # args for vllm.LLM()\n    model: NCSOFT/Llama-3-OffsetBias-8B\n    dtype: float16\n  sampling_params: # args for vllm.SamplingParams()\n    temperature: 0\n    max_tokens: 20\n\nhf_args:\n  model_args: # args for AutoModelForCausalLM.from_pretrained()\n    model: NCSOFT/Llama-3-OffsetBias-8B\n    dtype: float16\n  generate_kwargs: # args for model.generate()\n    max_new_tokens: 20\n    pad_token_id: 128001\n    do_sample: false\n    temperature: 0\n\n```\n\n\n### Run Inference\n\nRunning inference will automatically create an inference result file and a score file under `result/`. 
Below are various possible commands.\n```sh\n# run offsetbias inference on BiasBench dataset\npython run_bench.py --config config/offsetbias-8b.yaml\n\n# run offsetbias with custom name on all benchmarks\npython run_bench.py --name my_inference --config config/offsetbias-8b.yaml --benchmarks llmbar,hhh,mtbench,biasbench\n\n# no inference, redo parsing on existing inference result\npython run_bench.py --name my_inference --config config/offsetbias-8b.yaml --benchmarks biasbench --parse\n\n# no inference, recalculate score\npython run_bench.py --name my_inference --score\n```\n\n# Citation\n\nIf you find our work useful, please cite our paper:\n```bibtex\n@misc{park2024offsetbias,\n      title={OffsetBias: Leveraging Debiased Data for Tuning Evaluators},\n      author={Junsoo Park and Seungyeon Jwa and Meiying Ren and Daeyoung Kim and Sanghyuk Choi},\n      year={2024},\n      eprint={2407.06551},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fncsoft%2Foffsetbias","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fncsoft%2Foffsetbias","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fncsoft%2Foffsetbias/lists"}