{"id":15829126,"url":"https://github.com/gersteinlab/ML-bench","last_synced_at":"2025-10-16T21:31:30.098Z","repository":{"id":207907693,"uuid":"719415913","full_name":"gersteinlab/ML-Bench","owner":"gersteinlab","description":"The Official Repo of ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code (https://arxiv.org/abs/2311.09835)","archived":false,"fork":false,"pushed_at":"2024-11-19T03:30:00.000Z","size":222297,"stargazers_count":286,"open_issues_count":0,"forks_count":8,"subscribers_count":12,"default_branch":"master","last_synced_at":"2025-01-27T06:08:58.293Z","etag":null,"topics":["code-generation","gpt-4","llm"],"latest_commit_sha":null,"homepage":"https://ml-bench.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gersteinlab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-16T05:42:31.000Z","updated_at":"2025-01-18T02:29:13.000Z","dependencies_parsed_at":"2024-06-16T09:50:35.478Z","dependency_job_id":"ae48a906-1b1d-4eb8-880e-84f8869e9dc1","html_url":"https://github.com/gersteinlab/ML-Bench","commit_stats":null,"previous_names":["gersteinlab/ml-bench"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gersteinlab%2FML-Bench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gersteinlab%2FML-Bench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gersteinlab%2FML-Bench/releases","manifests_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gersteinlab%2FML-Bench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gersteinlab","download_url":"https://codeload.github.com/gersteinlab/ML-Bench/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":236749064,"owners_count":19198617,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["code-generation","gpt-4","llm"],"created_at":"2024-10-05T11:00:38.470Z","updated_at":"2025-10-16T21:31:30.092Z","avatar_url":"https://github.com/gersteinlab.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"readme":"# ML-Bench: Benchmarking Large Language Models and Agents for End-to-End Machine Learning Workflows\n\n\n\n\n\u003cp align=\"left\"\u003e\n      \u003ca href='https://arxiv.org/abs/2311.09835'\u003e\u003cimg src='https://img.shields.io/badge/ML Bench-arXiv-d63031?logo=arxiv\u0026logoColor=white'\u003e\u003c/a\u003e\n\u003ca href=\"https://github.com/gersteinlab/ML-Bench/blob/public-release/LICENSE\" alt=\"license\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/license-Apache--2.0-%23002FA7\" /\u003e\u003c/a\u003e\n\u003ca href=\"https://img.shields.io/github/stars/gersteinlab/ML-Bench/\" alt=\"arXiv\"\u003e\n    \u003cimg src=\"https://img.shields.io/github/stars/gersteinlab/ML-Bench\" /\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\n\n\n\n\n\n![Alt text](assets/distribution.png)\n\n## Table of Contents\n- 📋 [Prerequisites](#-prerequisites)\n- 📊 [Data 
Preparation](#-data-preparation)\n- 🦙 [ML-LLM-Bench](#-ml-llm-bench)\n  - 📋 [Prerequisites](#-prerequisites-1)\n  - 🌍 [Environment Setup](#-environment-setup)\n  - 🛠️ [Usage](#%EF%B8%8F-usage)\n  - 📞 [API Calling](#-api-calling)\n  - 🔧 [Open Source Model Fine-tuning](#-open-source-model-fine-tuning)\n    - 📋 [Prerequisites](#-prerequisites-2)\n    - 🏋️ [Fine-tuning](#%EF%B8%8F-fine-tuning)\n    - 🔍 [Inference](#-inference)\n- 🤖 [ML-Agent-Bench](#-ml-agent-bench)\n  - 🌍 [Environment Setup](#-environment-setup-1)\n- 📝 [Cite Us](#-cite-us)\n- 📜 [License](#-license)\n\n\n## 📋 Prerequisites\n\n  To clone this repository with all its submodules, use the `--recurse-submodules` flag:\n\n  ```bash\n  git clone --recurse-submodules https://github.com/gersteinlab/ML-Bench.git\n  cd ML-Bench\n  ```\n\n  If you have already cloned the repository without the `--recurse-submodules` flag, you can run the following commands to fetch the submodules:\n\n  ```bash\n  git submodule update --init --recursive\n  ```\n\n  Then run\n  ```bash\n  pip install -r requirements.txt\n  ```\n\n## 📊 Data Preparation\n\nYou can load the dataset using the following code:\n\n```python\nfrom datasets import load_dataset\n\nml_bench = load_dataset(\"super-dainiu/ml-bench\")    # splits: ['full', 'quarter']\n```\n\nThe dataset contains the following columns:\n- `github_id`: The ID of the GitHub repository.\n- `github`: The URL of the GitHub repository.\n- `repo_id`: The ID of the sample within each repository.\n- `id`: The unique ID of the sample in the entire dataset.\n- `path`: The path to the corresponding folder in LLM-Bench.\n- `arguments`: The arguments specified in the user requirements.\n- `instruction`: The user instructions for the task.\n- `oracle`: The oracle contents relevant to the task.\n- `type`: The expected output type based on the oracle contents.\n- `output`: The ground truth output generated based on the oracle contents.\n- `prefix_code`: The code snippet for preparing the execution 
environment.\n\nTo run ML-LLM-Bench, you first need to post-process the dataset. You can use the following command to do so:\n\n```bash\nbash scripts/post_process/prepare.sh\n```\n\nSee [post_process](scripts/post_process/README.md) for more details.\n\n## 🦙 ML-LLM-Bench\n\n### 📋 Prerequisites\n\n   After cloning the submodules, run\n\n   `cd scripts/post_process`\n\n   `bash prepare.sh` to generate the full and quarter benchmarks as `merged_full_benchmark.jsonl` and `merged_quarter_benchmark.jsonl`.\n\n   On line 50 of `merge.py`, you can change `readme_content = fr.read()` to `readme_content = fr.read()[:100000]` to get 32k-length README contents, or to `readme_content = fr.read()[:400000]` to get 128k-length README contents.\n   \n   Under the 128k setting, preparing the training and test sets takes about 10 minutes with 10 workers. Without a token limit, preparing the whole dataset may take about 2 hours and produces a very large dataset.\n\n### 🌍 Environment Setup\n\n\n   To run the ML-LLM-Bench Docker container, you can use the following commands:\n   \n   ```bash\n   docker pull public.ecr.aws/i5g0m1f6/ml-bench\n   docker run -it -v ML_Bench:/deep_data public.ecr.aws/i5g0m1f6/ml-bench /bin/bash\n   ```\n\n   To download model weights and prepare files, you can use the following command:\n\n   ```bash\n   bash utils/download_model_weight_pics.sh\n   ```\n\n   Preparing them automatically may take about 2 hours.\n\n### 🛠️ Usage\n\n\n   Place your results in the `output/` directory, and update the `--input_path` in `exec.sh` with your path. Also, modify the log file path. \n   \n   Then run `bash utils/exec.sh`. 
You can then check the run logs in your log file, view the overall results in `output/{{MODEL_NAME}}_{{TASK}}_results_{{TIMESTAMP}}.jsonl`, and see the results for each repository in `output/{{MODEL_NAME}}_{{TASK}}_results_{{TIMESTAMP}}.jsonl`.\n   \n   \n   The JSONL files whose names start with `eval_result` and `eval_total` contain partial execution results from our paper.\n   \n  - The `output/` folder includes the model-generated outputs we used for testing.\n      \n  - The `logs/` folder stores our execution logs.\n      \n  - The `utils/temp.py` file is not meant for users; it stores the code written by models.\n      \n  - Additionally, the execution process may generate extra files that are not needed.\n\n\n### 📞 API Calling\n\nTo reproduce OpenAI's performance on this task, use the following script:\n```bash\nbash script/openai/run.sh\n```\n\nYou need to change the parameter settings in `script/openai/run.sh`:\n\n- `type`: Choose `quarter` or `full`.\n- `model`: Model name.\n- `input_file`: File path of the dataset.\n- `answer_file`: Original answer in JSON format from GPT.\n- `parsing_file`: Post-processed GPT output in JSONL format, used to obtain executable code segments.\n- `readme_type`: Choose `oracle_segment` or `readme`.\n  - `oracle_segment`: The code paragraph in the README that is most relevant to the task.\n  - `readme`: The entire text of the README in the repository where the task is located.\n- `engine_name`: Choose `gpt-35-turbo-16k` or `gpt-4-32`.\n- `n_turn`: Number of executable code candidates GPT returns (5 in the paper's experiments).\n- `openai_key`: Your OpenAI API key.\n\nPlease refer to [openai](scripts/openai/README.md) for details.\n\n### 🔧 Open Source Model Fine-tuning\n\n#### 📋 Prerequisites\nLlama-recipes provides a pip distribution for easy installation and use in other projects. Alternatively, it can be installed from source.\n\n1. 
**Install with pip**\n```\npip install --extra-index-url https://download.pytorch.org/whl/test/cu118 llama-recipes\n```\n2. **Install from source**\nTo install from source, e.g. for development, use the commands below. Llama-recipes uses hatchling as its build backend, which requires up-to-date pip and setuptools packages.\n```\ngit clone https://github.com/facebookresearch/llama-recipes\ncd llama-recipes\npip install -U pip setuptools\npip install --extra-index-url https://download.pytorch.org/whl/test/cu118 -e .\n```\n\n#### 🏋️ Fine-tuning\nThe paper defines three tasks:\n* Task 1: Given a task description + Code, generate a code snippet.\n* Task 2: Given a task description + Retrieval, generate a code snippet.\n* Task 3: Given a task description + Oracle, generate a code snippet.\n\nYou can use the following script to reproduce CodeLlama-7b's fine-tuning performance on this task:\n```bash\ntorchrun --nproc_per_node 2 finetuning.py \\\n    --use_peft \\\n    --peft_method lora \\\n    --enable_fsdp \\\n    --model_name codellama/CodeLlama-7b-Instruct-hf \\\n    --context_length 8192 \\\n    --dataset mlbench_dataset \\\n    --output_dir OUTPUT_PATH \\\n    --task TASK \\\n    --data_path DATA_PATH\n```\n\nYou need to set `OUTPUT_PATH`, `TASK`, and `DATA_PATH` accordingly.\n* `OUTPUT_PATH`: The directory to save the model.\n* `TASK`: Choose from `1`, `2` and `3`.\n* `DATA_PATH`: The directory of the dataset.\n\n#### 🔍 Inference\nYou can use the following script to reproduce CodeLlama-7b's inference performance on this task:\n```bash\npython chat_completion.py \\\n    --model_name 'codellama/CodeLlama-7b-Instruct-hf' \\\n    --peft_model PEFT_MODEL \\\n    --prompt_file PROMPT_FILE \\\n    --task TASK\n```\n\nYou need to set `PEFT_MODEL`, `PROMPT_FILE`, and `TASK` accordingly.\n* `PEFT_MODEL`: The path of the PEFT model.\n* `PROMPT_FILE`: The path of the prompt file.\n* `TASK`: Choose 
from `1`, `2` and `3`.\n\nPlease refer to [finetune](scripts/finetune/README.md) for details.\n\n## 🤖 ML-Agent-Bench\n### 🌍 Environment Setup\n\nTo run the ML-Agent-Bench Docker container, you can use the following commands:\n\n```bash\ndocker pull public.ecr.aws/i5g0m1f6/ml-bench\ndocker run -it public.ecr.aws/i5g0m1f6/ml-bench /bin/bash\n```\n\nThese commands pull the latest ML-Agent-Bench Docker image and run it in an interactive shell. The container includes all the dependencies needed to run the ML-Agent-Bench codebase.\n\nFor ML-Agent-Bench in OpenDevin, please refer to the [OpenDevin setup guide](https://github.com/OpenDevin/OpenDevin/blob/main/evaluation/ml_bench/README.md).\n\nPlease refer to [envs](envs/README.md) for details.\n\n## 📝 Cite Us\n\n```\n@article{tang2023ml,\n  title={ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code},\n  author={Tang, Xiangru and Liu, Yuliang and Cai, Zefan and Shao, Yanjun and Lu, Junjie and Zhang, Yichi and Deng, Zexuan and Hu, Helan and An, Kaikai and Huang, Ruijun and others},\n  journal={arXiv preprint arXiv:2311.09835},\n  year={2023}\n}\n```\n\n\n## 📜 License\n\nDistributed under the MIT License. See [`LICENSE`](./LICENSE) for more information.\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgersteinlab%2FML-bench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgersteinlab%2FML-bench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgersteinlab%2FML-bench/lists"}