{"id":19542053,"url":"https://github.com/bigscience-workshop/evaluation","last_synced_at":"2025-04-26T17:31:01.881Z","repository":{"id":40521999,"uuid":"370478694","full_name":"bigscience-workshop/evaluation","owner":"bigscience-workshop","description":"Code and Data for Evaluation WG","archived":false,"fork":false,"pushed_at":"2022-05-04T03:04:06.000Z","size":130,"stargazers_count":41,"open_issues_count":50,"forks_count":24,"subscribers_count":23,"default_branch":"main","last_synced_at":"2025-04-26T06:16:56.958Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bigscience-workshop.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-05-24T20:37:20.000Z","updated_at":"2024-01-04T16:57:56.000Z","dependencies_parsed_at":"2022-08-09T22:24:35.088Z","dependency_job_id":null,"html_url":"https://github.com/bigscience-workshop/evaluation","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigscience-workshop%2Fevaluation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigscience-workshop%2Fevaluation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigscience-workshop%2Fevaluation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigscience-workshop%2Fevaluation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bigscience-workshop","download_url":"https://codeload.github.com/bigscience-workshop/ev
aluation/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251025654,"owners_count":21524840,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-11T03:13:00.058Z","updated_at":"2025-04-26T17:31:01.611Z","avatar_url":"https://github.com/bigscience-workshop.png","language":"Python","readme":"# BigScience Evaluation\nCode and data for the [BigScience Evaluation WG](https://bigscience.huggingface.co/en/#!pages/working-groups.md).\n\n## Upcoming Milestones for Contributors\n- September 1, 2021: Eval Engineering Subgroup releases toy tasks/dummy code to define the API\n- September 1, 2021: New task-based subgroups established and begin work\n- October 1, 2021: Finalize GitHub repository with all data and scripts for generating raw evaluation results\n- October 15, 2021: General meeting to discuss longer research project proposals for fall/spring\n- October 15, 2021: Form subgroup on data presentation/visualization to create the final report card\n\n## Quickstart\n\nTo benchmark a baseline GPT-2 model with the WMT and TyDiQA datasets on a GPU, run:\n\n```shell\npython3 -m evaluation.eval \\\n    --model_name_or_path gpt2 \\\n    --eval_tasks wmt tydiqa_secondary \\\n    --device cuda \\\n    --output_dir outputs\n```\n\nNote: For the toxicity dataset, you must download the data manually from Kaggle [here](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data) and pass the path to that folder via the `data_dir` argument.\n\n## Setup\n\n1. 
Create a virtual environment (one-time).\n\n   ```shell\n   python3 -m venv venv  # create a virtual environment called 'venv'\n   ```\n\n2. Activate the virtual environment.\n\n   ```shell\n   source venv/bin/activate\n   ```\n\n3. Install package requirements.\n\n   ```shell\n   python3 -m pip install -r requirements.txt\n   python3 -m pip install -r requirements-dev.txt\n   ```\n\n## Tasks\n\nThis project plans to support all datasets listed under `docs/datasets.md`. The sections below detail the task-independent inner workings of this repository.\n\n### AutoTask\n\nEvery task/dataset lives as a submodule within `evaluation.tasks`. Each of these submodules inherits from `evaluation.tasks.auto_task.AutoTask`, a base class that houses all abstract functions and holds `model`, `tokenizer`, and `task_config` as attributes.\n\n`AutoTask` makes it easy to load any dataset for a benchmark. The basic signature is\n\n```python\ntask = AutoTask.from_task_name(\n    \"task_name\", model, tokenizer, device, english_only\n)\n```\n\nAlternatively, if the model has to be recreated for each task, a task object can be created from string specifications.\n\n```python\ntask = AutoTask.from_spec(\n    \"task_name\",\n    \"model_name_or_path\",\n    \"tokenizer_name\",\n    device,\n    english_only,\n    data_dir,  # optional\n)\n```\n\n### Evaluation\n\nEvery `AutoTask` subclass has an `.evaluate()` method wherein all evaluation logic resides, i.e. loading the dataset (and the dataloader, if necessary) and computing the metrics to report. At the end of the evaluation, the metrics are stored in the `task.metrics` attribute. For more details on the full pipeline, refer to the main evaluation script, [`evaluation/eval.py`](evaluation/eval.py).\n\n## Contributing\n\nRefer to [`CONTRIBUTING.md`](CONTRIBUTING.md).
\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigscience-workshop%2Fevaluation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbigscience-workshop%2Fevaluation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigscience-workshop%2Fevaluation/lists"}