{"id":13437490,"url":"https://github.com/sahil280114/codealpaca","last_synced_at":"2025-05-16T07:00:27.808Z","repository":{"id":147134657,"uuid":"617429416","full_name":"sahil280114/codealpaca","owner":"sahil280114","description":null,"archived":false,"fork":false,"pushed_at":"2023-05-12T17:41:28.000Z","size":9347,"stargazers_count":1468,"open_issues_count":17,"forks_count":111,"subscribers_count":21,"default_branch":"master","last_synced_at":"2025-04-08T16:06:56.186Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sahil280114.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-03-22T11:28:54.000Z","updated_at":"2025-04-04T04:46:55.000Z","dependencies_parsed_at":"2024-01-14T12:38:37.194Z","dependency_job_id":null,"html_url":"https://github.com/sahil280114/codealpaca","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sahil280114%2Fcodealpaca","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sahil280114%2Fcodealpaca/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sahil280114%2Fcodealpaca/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sahil280114%2Fcodealpaca/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sahil280114","download_url":"https://codeload.github.com/sahil280114/codealpaca/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"ht
tps://github.com","kind":"github","repositories_count":254485025,"owners_count":22078764,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T03:00:57.645Z","updated_at":"2025-05-16T07:00:27.694Z","avatar_url":"https://github.com/sahil280114.png","language":"Python","funding_links":[],"categories":["Statistics","Python","Learning Resources","Instruction Fine-tuning Datasets","Dataset Detail","A01_文本生成_文本对话","📚 Paper","Skill Distillation","Paper List","Projects","4. Datasets"],"sub_categories":["Datasets","Domain-specific Instruction Fine-tuning Datasets","大语言对话模型及数据","▶️ Instruction Tuning","NLP Task Specialization","2. Aligning through behaviour imitation","3.10. Factuality"],"readme":"# Code Alpaca: An Instruction-following LLaMA Model trained on code generation instructions \n[![License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/LICENSE) \n[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/release/python-390/) \n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) \n\nThis is the repo for the Code Alpaca project, which aims to build and share an instruction-following LLaMA model for code generation. This repo is fully based on [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca) and only changes the data used for training. 
The training approach is the same.\n\nThe repo contains:\n- The [20K data](#data-release) used for fine-tuning the model\n- The code for [generating the data](#data-generation-process)\n- The code for [fine-tuning the model](#fine-tuning)\n\nA demo of the model can be found at [https://code-alpaca-demo.vercel.app/](https://code-alpaca-demo.vercel.app/)\n\n## Overview\n\nThe Code Alpaca models are fine-tuned from 7B and 13B LLaMA models on 20K instruction-following examples generated by the techniques in the Self-Instruct [1] paper, with some modifications that we discuss in the next section.\nEvaluations are still to be done.\n\nThe model is not fine-tuned to be safe and harmless, so be cautious.\n\nThe current release contains the data generation procedure, dataset, and training code. Model weights aren't part of the release for now, to respect the OpenAI TOS and the LLaMA license.\n\n[1]: Self-Instruct: Aligning Language Model with Self Generated Instructions. Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi. https://arxiv.org/abs/2212.10560\n\n\n## Data Release\n[`data/code_alpaca_20k.json`](./data/code_alpaca_20k.json) contains the 20K instruction-following examples used for fine-tuning the Code Alpaca model.\nThis JSON file is a list of dictionaries; each dictionary contains the following fields:\n- `instruction`: `str`, describes the task the model should perform. Each of the 20K instructions is unique.\n- `input`: `str`, optional context or input for the task. For example, when the instruction is \"Amend the following SQL query to select distinct elements\", the input is the SQL query. Around 40% of the examples have an input.\n- `output`: `str`, the answer to the instruction as generated by `text-davinci-003`.\n\nWe used the following prompts for fine-tuning the model:\n- for examples with a non-empty input field:\n ```\n Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request.\n \n ### Instruction:\n {instruction}\n \n ### Input:\n {input}\n \n ### Response:\n ```\n- for examples with an empty input field:\n ```\n Below is an instruction that describes a task. Write a response that appropriately completes the request.\n \n ### Instruction:\n {instruction}\n \n ### Response:\n ```\n \n During inference (e.g., for the web demo), we use the user instruction with an empty input field (second option).\n\n## Data Generation Process\n\n\u003cdetails\u003e\n\u003csummary\u003e \u003cstrong\u003e Running the code \u003c/strong\u003e \u003c/summary\u003e\n\n1. Set the environment variable `OPENAI_API_KEY` to your OpenAI API key.\n2. Install the dependencies with `pip install -r requirements.txt`.\n3. Run `python -m generate_instruction generate_instruction_following_data` to generate the data.\n\n\u003c/details\u003e\nThe data generation pipeline has minor changes from [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca):\n- Modified the prompt to focus on code generation/editing/optimization tasks instead of general tasks.\n- Modified the seed tasks to only be related to code generation.\n\nThis produced an instruction-following dataset with 20K examples obtained at a much lower cost (less than $200). The repo also includes a smaller 2K-sample dataset, which was used to derisk the approach and the quality of the model.\n\n## Fine-tuning\nThe models were fine-tuned using standard Hugging Face training code and DeepSpeed with the following hyperparameters:\n\n| Hyperparameter | Value |\n|----------------|-------|\n| Learning rate  | 2e-5  |\n| Epochs         | 3     |\n| Max length     | 512   |\n| Weight decay   | 0     |\n\nGiven Hugging Face hasn't officially supported the LLaMA models, we fine-tuned LLaMA with Hugging Face's transformers library by installing it from a particular fork (i.e. 
this [PR](https://github.com/huggingface/transformers/pull/21955) to be merged).\nThe hash of the specific commit we installed was `68d640f7c368bcaaaecfc678f11908ebbd3d6176`.\n\nThe code runs on an 8xA100 80GB machine, but can also run on 8xA100 40GB or 4xA100 with a lower batch size and gradient accumulation steps. To get the GPUs, I suggest using [Lambda Labs](https://cloud.lambdalabs.com/login?redirect_to=/instances?), best pricing for the best hardware.\n\nTo reproduce the fine-tuning runs for LLaMA, first install the requirements:\n```bash\npip install -r requirements.txt\n```\nThen, install the particular fork of Hugging Face's transformers library.\n\nBelow is a command that fine-tunes LLaMA-7B with our dataset on a machine with 8 A100 80G GPUs using DeepSpeed. \n\nReplace `\u003cyour_random_port\u003e` with a port of your own, `\u003cyour_path_to_hf_converted_llama_ckpt_and_tokenizer\u003e` with the \npath to your converted checkpoint and tokenizer (following instructions in the PR), and `\u003cyour_output_dir\u003e` with where you want to store your outputs.\n\n```bash\ntorchrun --nproc_per_node=8 --master_port=\u003cyour_random_port\u003e train.py \\\n    --model_name_or_path \u003cyour_path_to_hf_converted_llama_ckpt_and_tokenizer\u003e \\\n    --data_path ./data/code_alpaca_20k.json \\\n    --fp16 True \\\n    --output_dir \u003cyour_output_dir\u003e \\\n    --num_train_epochs 3 \\\n    --per_device_train_batch_size 8 \\\n    --per_device_eval_batch_size 8 \\\n    --gradient_accumulation_steps 4 \\\n    --evaluation_strategy \"no\" \\\n    --save_strategy \"steps\" \\\n    --save_steps 500 \\\n    --save_total_limit 1 \\\n    --learning_rate 2e-5 \\\n    --weight_decay 0. 
\\\n    --warmup_ratio 0.03 \\\n    --lr_scheduler_type \"cosine\" \\\n    --logging_steps 1 \\\n    --deepspeed ds_config.json \\\n    --tf32 False\n```\n\nNote that the given training script is meant to be simple and easy to use, and is not particularly optimized.\n\nFor convenience, I have included [`convert_to_hf.py`](./convert_to_hf.py) to convert LLaMA checkpoints to Hugging Face-compatible checkpoints. (This file is taken from the Hugging Face transformers repo.)\n\n### Citation\n\nCite this repo if you want to, or don't, both are fine.\n```\n@misc{codealpaca,\n  author = {Sahil Chaudhary},\n  title = {Code Alpaca: An Instruction-following LLaMA model for code generation},\n  year = {2023},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https://github.com/sahil280114/codealpaca}},\n}\n```\n\nNaturally, you should also cite the original LLaMA paper, the Self-Instruct paper [1], and the [Stanford Alpaca repo](https://github.com/tatsu-lab/stanford_alpaca).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsahil280114%2Fcodealpaca","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsahil280114%2Fcodealpaca","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsahil280114%2Fcodealpaca/lists"}