{"id":13589660,"url":"https://github.com/FSoft-AI4Code/CodeCapybara","last_synced_at":"2025-04-08T09:33:39.269Z","repository":{"id":154492112,"uuid":"630865521","full_name":"FSoft-AI4Code/CodeCapybara","owner":"FSoft-AI4Code","description":"Open-source Self-Instruction Tuning Code LLM","archived":false,"fork":false,"pushed_at":"2023-04-26T06:58:29.000Z","size":944,"stargazers_count":170,"open_issues_count":2,"forks_count":11,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-03-28T08:36:06.656Z","etag":null,"topics":["ai4code","alpaca","codellm","instruction-tuning","llama"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FSoft-AI4Code.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-21T10:28:53.000Z","updated_at":"2025-02-08T03:11:51.000Z","dependencies_parsed_at":"2023-04-30T10:01:55.746Z","dependency_job_id":null,"html_url":"https://github.com/FSoft-AI4Code/CodeCapybara","commit_stats":{"total_commits":28,"total_committers":3,"mean_commits":9.333333333333334,"dds":0.5714285714285714,"last_synced_commit":"9bc7ab4305444cca625ceed7b1aaca987a193bec"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FSoft-AI4Code%2FCodeCapybara","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FSoft-AI4Code%2FCodeCapybara/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FSoft-AI4Code%2FCodeCapybara/rel
eases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FSoft-AI4Code%2FCodeCapybara/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FSoft-AI4Code","download_url":"https://codeload.github.com/FSoft-AI4Code/CodeCapybara/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247804584,"owners_count":20999017,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai4code","alpaca","codellm","instruction-tuning","llama"],"created_at":"2024-08-01T16:00:32.690Z","updated_at":"2025-04-08T09:33:38.878Z","avatar_url":"https://github.com/FSoft-AI4Code.png","language":"Python","funding_links":[],"categories":["Projects"],"sub_categories":[],"readme":"\u003cp align=\"center\" width=\"100%\"\u003e\n\u003ca  target=\"_blank\"\u003e\u003cimg src=\"assets/logo.png\" alt=\"Code-Capybara\" style=\"width: 80%; min-width: 500px; display: block; margin: auto;\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\n# CodeCapybara: Open Source LLaMA Model that Follow Instruction-Tuning for Code Generation.\n[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/AI4Code-Research/CodeCapybara/blob/main/LICENSE)\n[![Data License](https://img.shields.io/badge/Data%20License-CC%20By%20NC%204.0-red.svg)](https://github.com/AI4Code-Research/CodeCapybara/blob/main/DATA_LICENSE)\n[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/release/python-390/)\n[![Code style: 
black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n\nWe introduce CodeCapybara, a code-specialized, instruction-following large language model. This repo also evaluates and attempts to reproduce the reported performance of existing LLMs for code, such as LLaMA, Alpaca and CodeAlpaca, on code generation benchmarks (HumanEval and MBPP).\n\n- ***First attempt to reproduce LLaMA results*** on widely recognized code generation benchmarks.\n- CodeCapybara is fine-tuned from LLaMA 7B. Larger models will be available soon. You can find our checkpoints in the [Checkpoint Release](#checkpoint-release) section.\n- We fine-tune LLaMA in an instruction-tuning style on ***our own larger and more diverse dataset***.\n- ***Improved evaluation results on HumanEval*** in comparison to LLaMA, Alpaca and CodeAlpaca.\n- Full transparency with open source availability: ***all scripts and models are accessible to the community***.\n\nWe encourage you to contribute to CodeCapybara and help advance the field of code generation. 
\n\n## Table of Contents\n\n- [CodeCapybara: Open Source LLaMA Model that Follow Instruction-Tuning for Code Generation.](#codecapybara-open-source-llama-model-that-follow-instruction-tuning-for-code-generation)\n  - [Table of Contents](#table-of-contents)\n  - [Overview](#overview)\n    - [Data Collection](#data-collection)\n      - [Only Instruction Generation](#only-instruction-generation)\n      - [Code Alpaca](#code-alpaca)\n      - [DeepMind's Code Contests](#deepminds-code-contests)\n    - [Instruction Tuning](#instruction-tuning)\n  - [Results](#results)\n    - [HumanEval Results](#humaneval-results)\n    - [MBPP Results](#mbpp-results)\n  - [Data Release](#data-release)\n  - [Checkpoint Release](#checkpoint-release)\n  - [Installation](#installation)\n  - [Usage](#usage)\n    - [Loading model](#loading-model)\n      - [Loading CodeCapybara](#loading-codecapybara)\n      - [Loading CodeCapybara-LoRA](#loading-codecapybara-lora)\n    - [Generate](#generate)\n  - [Instruction Tuning](#instruction-tuning-1)\n  - [Benchmarking](#benchmarking)\n    - [HumanEval](#humaneval)\n    - [MBPP](#mbpp)\n  - [Reproducing LLaMA Results](#reproducing-llama-results)\n  - [Example Outputs](#example-outputs)\n  - [Future Plans](#future-plans)\n  - [Contributing](#contributing)\n  - [License](#license)\n\n## Overview\nWe follow several recent instruction-tuning techniques to collect data and train an instruction-following model that can generate executable code from natural-language descriptions.\n\nWe divide the process of training CodeCapybara into two stages:\n1. **Data Collection**: We collect data generated through OpenAI `gpt-3.5-turbo` as well as a supervised code generation dataset.\n2. **Instruction Tuning**: We fine-tune our model from MetaAI's LLaMA checkpoint with parameter-efficient fine-tuning methods.\n\n### Data Collection\nIn this stage, we follow previous works to collect instruction data. 
To ensure the quality of the code data used in the fine-tuning stage, we make some modifications to the Self-Instruct data generation procedure.\n\u003c!-- | Data source | No. samples |\n|-|-|\n|Only Instruction Generation| 20,574|\n|CodeAlpaca| 20,022 |\n|DeepMind's Code Contests| 13,328 |\n| **Total**| **53,924**| --\u003e\n\n#### Only Instruction Generation\nTo ensure the quality of the code later used as targets in the fine-tuning step, we leverage an unsupervised dataset that contains only code snippets crawled from open-source projects. We then design a prompt that asks `gpt-3.5-turbo` to generate a corresponding instruction for each code snippet. In other words, to obtain an (instruction, output) pair, we ask `gpt-3.5-turbo` to generate the instruction given the output, which is a human-written code snippet.\n\nOur unsupervised dataset contains functions that cover a wide range of programming problems in 10 programming languages: `Python, Javascript, Java, Golang, Ruby, Rust, PHP, C, C++, C#`.\n\nWe obtain our dataset through the OpenAI `gpt-3.5-turbo` API. 
Each instruction-output pair is generated through 2 rounds of API calls.\n- In the 1st round, we include a code function (i.e., the output) in the prompt and ask `gpt-3.5-turbo` to generate a corresponding instruction.\n- In the 2nd round, since the code function alone is not guaranteed to be an executable program, we include both the 1st-round instruction and the code function in a new prompt and ask the model to generate an executable program, with library imports and dependency implementations, around the given code function.\n \n- Our prompt template can be found [here](./data/prompts/prompt.py).\n- Our script for 2 rounds of data generation can be found [here](./data_generation/data_generation.py).\n\n#### [Code Alpaca](https://github.com/sahil280114/codealpaca)\nFor the second source of data, we intended to follow the [Self-Instruct](https://arxiv.org/abs/2212.10560) paper and generate diverse code problems in (Instruction-Input-Output) format from a seed dataset.\n\nWe reuse the generated instruction data from [Code Alpaca](https://github.com/sahil280114/codealpaca/blob/master/data/code_alpaca_20k.json) to reduce API cost, since their approach matches our purpose.\n\n#### [DeepMind's Code Contests](https://github.com/deepmind/code_contests)\nWe also leverage a supervised code generation dataset. Several large, high-quality code generation datasets exist, such as APPS (5,000 problems in the train split) and MBPP (500 problems in the train split).\n\nIn this version, we select the [DeepMind's Code Contests](https://github.com/deepmind/code_contests) dataset, which contains competitive programming problems with detailed descriptions and test cases. 
The train split we use to fine-tune our model contains 13,328 problems, which yields 51,766 instruction-output pairs.\n\n### Instruction Tuning\nWe tried 2 approaches to fine-tune the LLaMA-7B checkpoint on the collected data:\n- Full-parameter Fine-tuning\n- Parameter-efficient Fine-tuning with HuggingFace's PEFT\n\nPlease refer to the [Checkpoint Release](#checkpoint-release) section to access our checkpoints.\n\n## Results\n\nWe evaluate our models and reproduce other models' results on 2 benchmarks, HumanEval and MBPP. All numbers are reported in zero-shot settings.\n\n### HumanEval Results\n| Model | Base checkpoint | pass@1 | pass@10 | pass@100 |\n| - | - | - | - | - |\n| LLaMA | decapoda-research/llama-7b-hf | 10.70 | 13.29 | **13.41** |\n| LLaMA | huggyllama/llama-7b | 9.7 | 12.66 | 12.80 |\n| Alpaca-LoRA | decapoda-research/llama-7b-hf | 8.00 | 10.00 | 10.37 |\n| CodeCapybara-LoRA | decapoda-research/llama-7b-hf | 9.61 | 11.62 | 12.02 |\n| CodeCapybara | huggyllama/llama-7b | **11.10** | **13.33** | **13.41** |\n\n### MBPP Results\n\n## Data Release\nWe release our data as well as the other data sources used for training our models:\n- [Our Instruction Only Generation data](./data/raw-data/generated_data.jsonl)\n- [Code Alpaca data](https://github.com/sahil280114/codealpaca/blob/master/data/code_alpaca_20k.json)\n- [DeepMind's CodeContests](https://huggingface.co/datasets/deepmind/code_contests) hosted on HuggingFace\n\u003c!-- You can find our used datasets in the folder `data/raw-data`, namely `code_alpaca_20k.json` (from CodeAlpaca) and `generated_data.jsonl` (our own dataset). --\u003e\n\n## Checkpoint Release\nWe release our checkpoints, hosted on HuggingFace:\n- [CodeCapybara](https://huggingface.co/Fsoft-AIC/CodeCapybara) - Full-parameter Fine-tuning\n- [CodeCapybara-LoRA](https://huggingface.co/Fsoft-AIC/CodeCapybara-LoRA) - Parameter-efficient Fine-tuning\n\n## Installation\n\n```bash\nconda create -n codecapybara -y\nconda activate 
codecapybara\nconda install pip -y\npip install -r requirements.txt\n```\n\n## Usage\nLet's define a function that converts `instruction` and `input` into a single prompt to pass to `model.generate`:\n```python\ndef generate_prompt(instruction, input=None):\n\t# Templates used by Stanford Alpaca: https://github.com/tatsu-lab/stanford_alpaca\n\tif input is not None:\n\t\tprompt = f\"Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\\n\\n### Instruction:\\n{instruction}\\n\\n### Input:\\n{input}\\n\\n### Response:\"\n\telse:\n\t\tprompt = f\"Below is an instruction that describes a task. Write a response that appropriately completes the request.\\n\\n### Instruction:\\n{instruction}\\n\\n### Response:\"\n\treturn prompt\n```\n\n### Loading model\nYou can choose to load the full-parameter `CodeCapybara` or `CodeCapybara-LoRA`.\n#### Loading CodeCapybara\n\n```python\nimport sys\nimport torch\nfrom transformers import LlamaTokenizer, LlamaForCausalLM\n\ntokenizer = LlamaTokenizer.from_pretrained(\"Fsoft-AIC/CodeCapybara\")\nmodel = LlamaForCausalLM.from_pretrained(\"Fsoft-AIC/CodeCapybara\",\n\t\t\t\t\tload_in_8bit=True,\n\t\t\t\t\ttorch_dtype=torch.float16,\n\t\t\t\t\tdevice_map=\"auto\")\n\n# Set the special token ids explicitly\nmodel.config.pad_token_id = tokenizer.pad_token_id = 0\nmodel.config.bos_token_id = 1\nmodel.config.eos_token_id = 2\n\nmodel.eval()\n# Compile the model only on PyTorch 2.x and outside Windows\nif torch.__version__ \u003e= \"2\" and sys.platform != \"win32\":\n\tmodel = torch.compile(model)\n```\n\n#### Loading CodeCapybara-LoRA\n\n```python\nimport sys\nimport torch\nfrom transformers import LlamaTokenizer, LlamaForCausalLM\nfrom peft import PeftModel\n\ntokenizer = LlamaTokenizer.from_pretrained(\"decapoda-research/llama-7b-hf\")\nmodel = LlamaForCausalLM.from_pretrained(\"decapoda-research/llama-7b-hf\",\n\t\t\t\t\t load_in_8bit=True,\n\t\t\t\t\t torch_dtype=torch.float16,\n\t\t\t\t\t device_map=\"auto\")\nmodel = 
PeftModel.from_pretrained(model,\n\t\t\t\t  \"Fsoft-AIC/CodeCapybara-LoRA\",\n\t\t\t\t  torch_dtype=torch.float16)\n\nmodel.config.pad_token_id = tokenizer.pad_token_id = 0\nmodel.config.bos_token_id = 1\nmodel.config.eos_token_id = 2\n\nmodel.eval()\nif torch.__version__ \u003e= \"2\" and sys.platform != \"win32\":\n\tmodel = torch.compile(model)\n```\n### Generate\nAfter loading the model onto your device, run the following script to generate a prediction:\n```python\nfrom transformers import GenerationConfig\n\ninstruction = \"Write a Python program that prints the first 10 Fibonacci numbers\"\nprompt = generate_prompt(instruction)\n\n# Tokenize the prompt into a tensor on the same device as the model\ninput_ids = tokenizer(prompt, return_tensors=\"pt\")[\"input_ids\"].to(model.device)\n\ngeneration_config = GenerationConfig(temperature=0.1,\n\t\t\t\t     top_k=40,\n\t\t\t\t     top_p=0.75)\nwith torch.no_grad():\n\toutput_ids = model.generate(input_ids=input_ids,\n\t\t\t\t    generation_config=generation_config,\n\t\t\t\t    max_new_tokens=128)\n# generate returns a batch of sequences; decode the first (and only) one\noutput = tokenizer.decode(output_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)\nprint(output)\n```\n## Instruction Tuning\nWe support 2 settings to fine-tune LLaMA models. In the first setting, we update all the parameters using Fully Sharded Data Parallel; in the second, we currently use LoRA to adapt the models to the instruction-tuning task. You can run either setting with the command\n```bash\n    bash scripts/train.sh\n```\n\nwhich calls `main/train.py`. We also provide some arguments to customize the training process:\n- --train-batch-size: per-GPU batch size for training\n- --val-batch-size: per-GPU batch size for validation\n- --num-workers: number of workers in the DataLoader\n- --config-path: the path of the configuration file. We provide a template in the folder `configs`\n- --model-type: the setting used for fine-tuning. 
There are 2 valid values: `fine-tunning` and `lora`.\n- --use-wandb: set to 0 to disable *wandb* logging; any other value enables it.\n\nMoreover, you can edit the configuration file `configs/config.yml`, which contains some notable fields:\n- checkpoint\n  - dir: the folder containing all the checkpoints\n  - old_checkpoint: the path of the old checkpoint. If it is null, the model trains from scratch; otherwise, it continues training from this checkpoint.\n  - epochs: the number of epochs between 2 consecutive model saves.\n- epochs: number of epochs for training\n- model:\n  - hf_model: LLaMA model in HuggingFace format\n  - lora: settings for the LoRA method\n- optimizer: specifies the optimizer\n- scheduler: configures the hyperparameters for a warm-up learning-rate schedule\n- max-seq-length: maximum length of the instruction and the response.\n\n## Benchmarking\nTo evaluate checkpoints on the HumanEval or MBPP benchmark, navigate to `main/`:\n```bash\ncd main/\n```\n\nWe use nucleus sampling to sample the next token at each prediction step, which generates multiple different code outputs for each problem. The hyperparameter configuration used for our evaluation is specified in the commands below.\n\n### HumanEval\nThe first part of the command below runs model inference and generates multiple `.jsonl` files, which are saved into `path/to/prediction/directory`. 
The second part takes these predictions as input to calculate pass@k.\n```bash\n# model inference\nexport CUDA_VISIBLE_DEVICES=0,1\nN_PROCS=$(echo $CUDA_VISIBLE_DEVICES | tr \",\" \"\\n\" | wc -l)\nNUM_ITERATIONS=10\n\nfor _ in $(seq $NUM_ITERATIONS);\ndo\n    python -m torch.distributed.run --nproc_per_node ${N_PROCS} generate.py \\\n        --output_dir path/to/prediction/directory \\\n        --dataset_name 'humaneval' \\\n        --base_model 'Fsoft-AIC/CodeCapybara' \\\n        --lora_weights '' \\\n        --batch_size 1 \\\n        --num_return_sequences 20 \\\n        --load_8bit True \\\n        --temperature 0.1 \\\n        --top_p 0.75 \\\n        --top_k 40\ndone\n\n# Calculating pass@k with k=1,10,100\npython eval_humaneval.py --prediction_dir path/to/prediction/directory\n```\n\n`n = NUM_ITERATIONS * batch_size * num_return_sequences`, where `n` is the number of samples generated per problem, used to estimate `pass@k` as in the [Codex](https://arxiv.org/pdf/2107.03374.pdf) paper, and `c` is the number of those samples that pass the unit tests.\n\n$$\\text{pass@}k = \\underset{\\text{Problems}}{\\mathbb{E}}\\left[1-\\frac{\\binom{n-c}{k}}{\\binom{n}{k}}\\right]$$\n\nHere we choose `n = 200`, as in the paper, which corresponds to\n- `NUM_ITERATIONS=10`\n- `batch_size=1`\n- `num_return_sequences=20`\n\n### MBPP\nReplace `humaneval` with `mbpp`:\n```bash\n# model inference\nexport CUDA_VISIBLE_DEVICES=0,1\nN_PROCS=$(echo $CUDA_VISIBLE_DEVICES | tr \",\" \"\\n\" | wc -l)\nNUM_ITERATIONS=10\n\nfor _ in $(seq $NUM_ITERATIONS);\ndo\n    python -m torch.distributed.run --nproc_per_node ${N_PROCS} generate.py \\\n        --output_dir path/to/prediction/directory \\\n        --dataset_name 'mbpp' \\\n        --base_model 'Fsoft-AIC/CodeCapybara' \\\n        --lora_weights '' \\\n        --batch_size 1 \\\n        --num_return_sequences 20 \\\n        --load_8bit True \\\n        --temperature 0.1 \\\n        --top_p 0.75 \\\n        --top_k 40\ndone\n\n# Calculating pass@k with k=1,10,80,100\npython eval_mbpp.py --prediction_dir path/to/prediction/directory\n```\n\n## Reproducing 
LLaMA Results\nSince MetaAI released their official LLaMA checkpoints, there have been questions about, and efforts toward, reproducing the HumanEval and MBPP results reported in the [paper](https://arxiv.org/pdf/2302.13971.pdf). This repo aims to reproduce the results of LLaMA and other LLMs on widely recognized code generation benchmarks.\n\nTo evaluate a HuggingFace LLaMA checkpoint on HumanEval or MBPP, please set `--base_model` and `--dataset_name` to the corresponding model and benchmark in the [evaluation script example](#humaneval).\n\nYou can also tweak the sampling hyperparameters (`temperature`, `top_p`, `top_k`) to trade off accuracy against diversity in the predictions. Changing these hyperparameters will change the final results. The community is welcome to search for optimal hyperparameter values.\n\nWe are in the process of evaluating the official LLaMA checkpoints without converting them to HuggingFace format.\n\n## Example Outputs\n\n## Future Plans\n\n## Contributing\n\n## License\n\nFeel free to cite us\n```bibtex\n@misc{codecapybara,\n\ttitle = {CodeCapybara: Code Instruction Tuning},\n\tauthor = {},\n\tyear = {2023},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFSoft-AI4Code%2FCodeCapybara","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FFSoft-AI4Code%2FCodeCapybara","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFSoft-AI4Code%2FCodeCapybara/lists"}