{"id":13469210,"url":"https://github.com/bigcode-project/starcoder","last_synced_at":"2025-05-14T06:13:53.612Z","repository":{"id":161730690,"uuid":"631962458","full_name":"bigcode-project/starcoder","owner":"bigcode-project","description":"Home of StarCoder: fine-tuning \u0026 inference!","archived":false,"fork":false,"pushed_at":"2024-02-27T02:05:57.000Z","size":69,"stargazers_count":7404,"open_issues_count":101,"forks_count":528,"subscribers_count":72,"default_branch":"main","last_synced_at":"2025-04-11T01:41:47.323Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bigcode-project.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-24T12:32:21.000Z","updated_at":"2025-04-10T15:47:45.000Z","dependencies_parsed_at":"2024-01-13T18:17:49.237Z","dependency_job_id":"bfc82281-df0f-4deb-b1d1-b581440179bc","html_url":"https://github.com/bigcode-project/starcoder","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigcode-project%2Fstarcoder","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigcode-project%2Fstarcoder/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigcode-project%2Fstarcoder/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigcode-project%2Fstarcoder/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/owners/bigcode-project","download_url":"https://codeload.github.com/bigcode-project/starcoder/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254083777,"owners_count":22011902,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T15:01:29.272Z","updated_at":"2025-05-14T06:13:53.580Z","avatar_url":"https://github.com/bigcode-project.png","language":"Python","funding_links":[],"categories":["Python","\u003cspan id=\"code\"\u003eCode\u003c/span\u003e","A01_文本生成_文本对话","HarmonyOS","others","Applications","Repos","排行榜 [2025-03-18]"],"sub_categories":["\u003cspan id=\"tool\"\u003eLLM (LLM \u0026 Tool)\u003c/span\u003e","大语言对话模型及数据","Windows Manager","提示语（魔法）"],"readme":"# 💫 StarCoder\n\n[Paper](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | [Model](https://huggingface.co/bigcode/starcoder) | [Playground](https://huggingface.co/spaces/bigcode/bigcode-playground) | [VSCode](https://marketplace.visualstudio.com/items?itemName=HuggingFace.huggingface-vscode) | [Chat](https://huggingface.co/spaces/HuggingFaceH4/starchat-playground)\n\n# What is this about?\n💫 StarCoder is a language model (LM) trained on source code and natural language text. Its training data incorporates more than 80 different programming languages as well as text extracted from GitHub issues and commits and from notebooks. This repository showcases how to get an overview of this LM's capabilities.\n\n# News\n\n* **May 9, 2023:** We've fine-tuned StarCoder to act as a helpful coding assistant 💬! 
Check out the `chat/` directory for the training code and play with the model [here](https://huggingface.co/spaces/HuggingFaceH4/starchat-playground).\n\n# Disclaimer\n\nBefore you can use the model go to `hf.co/bigcode/starcoder` and accept the agreement. And make sure you are logged into the Hugging Face hub with:\n```bash\nhuggingface-cli login\n```\n\n# Table of Contents\n1. [Quickstart](#quickstart)\n    - [Installation](#installation)\n    - [Code generation with StarCoder](#code-generation)\n    - [Text-generation-inference code](#text-generation-inference)\n2. [Fine-tuning](#fine-tuning)\n    - [Step by step installation with conda](#step-by-step-installation-with-conda)\n    - [Datasets](#datasets)\n      - [Stack Exchange](#stack-exchange-se)\n    - [Merging PEFT adapter layers](#merging-peft-adapter-layers)\n3. [Evaluation](#evaluation)\n4. [Inference hardware requirements](#inference-hardware-requirements)\n\n# Quickstart\nStarCoder was trained on GitHub code, thus it can be used to perform code generation. More precisely, the model can complete the implementation of a function or infer the following characters in a line of code. 
This can be done with the help of the 🤗's [transformers](https://github.com/huggingface/transformers) library.\n\n## Installation\nFirst, we have to install all the libraries listed in `requirements.txt`\n```bash\npip install -r requirements.txt\n```\n## Code generation\nThe code generation pipeline is as follows\n\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\ncheckpoint = \"bigcode/starcoder\"\ndevice = \"cuda\" # for GPU usage or \"cpu\" for CPU usage\n\ntokenizer = AutoTokenizer.from_pretrained(checkpoint)\n# to save memory consider using fp16 or bf16 by specifying torch_dtype=torch.float16 for example\nmodel = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)\n\ninputs = tokenizer.encode(\"def print_hello_world():\", return_tensors=\"pt\").to(device)\noutputs = model.generate(inputs)\n# clean_up_tokenization_spaces=False prevents a tokenizer edge case which can result in spaces being removed around punctuation\nprint(tokenizer.decode(outputs[0], clean_up_tokenization_spaces=False))\n```\nor\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer, pipeline\ncheckpoint = \"bigcode/starcoder\"\n\nmodel = AutoModelForCausalLM.from_pretrained(checkpoint)\ntokenizer = AutoTokenizer.from_pretrained(checkpoint)\n\npipe = pipeline(\"text-generation\", model=model, tokenizer=tokenizer, device=0)\nprint( pipe(\"def hello():\") )\n```\nFor hardware requirements, check the section [Inference hardware requirements](#inference-hardware-requirements).\n\n## Text-generation-inference\n\n```bash\ndocker run -p 8080:80 -v $PWD/data:/data -e HUGGING_FACE_HUB_TOKEN=\u003cYOUR BIGCODE ENABLED TOKEN\u003e -d  ghcr.io/huggingface/text-generation-inference:latest --model-id bigcode/starcoder --max-total-tokens 8192\n```\nFor more details, see [here](https://github.com/huggingface/text-generation-inference).\n\n# Fine-tuning\n\nHere, we showcase how we can fine-tune this LM on a specific downstream task.\n\n## Step by step 
installation with conda \n\nCreate a new conda environment and activate it\n```bash\nconda create -n env\nconda activate env\n```\nInstall the `pytorch` version compatible with your version of CUDA from [here](https://pytorch.org/get-started/previous-versions/). For example, the following command works with CUDA 11.6:\n```bash\nconda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia\n```\nInstall `transformers` and `peft`\n```bash\nconda install -c huggingface transformers \npip install git+https://github.com/huggingface/peft.git\n```\nNote that you can install the latest development version of transformers from source by using\n\n```bash\npip install git+https://github.com/huggingface/transformers\n```\n\nInstall `datasets`, `accelerate` and `huggingface_hub`\n\n```bash\nconda install -c huggingface -c conda-forge datasets\nconda install -c conda-forge accelerate\nconda install -c conda-forge huggingface_hub\n```\n\nFinally, install `bitsandbytes` and `wandb`\n```bash\npip install bitsandbytes\npip install wandb\n```\nTo get the full list of arguments with descriptions, you can run the following command on any script:\n```\npython scripts/some_script.py --help\n```\nBefore you run any of the scripts, make sure you are logged in and can push to the hub:\n```bash\nhuggingface-cli login\n```\nMake sure you are logged in to `wandb`:\n```bash\nwandb login\n```\nNow that everything is set up, you can clone the repository and move into its directory.\n\n## Datasets\n💫 StarCoder can be fine-tuned to achieve multiple downstream tasks. Our interest here is to fine-tune StarCoder in order to make it follow instructions. [Instruction fine-tuning](https://arxiv.org/pdf/2109.01652.pdf) has gained a lot of attention recently as it proposes a simple framework that teaches language models to align their outputs with human needs. 
That procedure requires the availability of quality instruction datasets, which contain multiple `instruction - answer` pairs. Unfortunately, such datasets are not ubiquitous, but thanks to Hugging Face 🤗's [datasets](https://github.com/huggingface/datasets) library we have access to some good proxies. To fine-tune cheaply and efficiently, we use Hugging Face 🤗's [PEFT](https://github.com/huggingface/peft) as well as Tim Dettmers' [bitsandbytes](https://github.com/TimDettmers/bitsandbytes).\n\n\n### Stack Exchange SE\n[Stack Exchange](https://en.wikipedia.org/wiki/Stack_Exchange) is a well-known network of Q\u0026A websites on topics in diverse fields. It is a place where a user can ask a question and obtain answers from other users. Those answers are scored and ranked based on their quality. [Stack exchange instruction](https://huggingface.co/datasets/ArmelR/stack-exchange-instruction) is a dataset that was obtained by scraping the site in order to build a collection of Q\u0026A pairs. A language model can then be fine-tuned on that dataset to make it exhibit strong and diverse question-answering skills.\n\nTo execute the fine-tuning script, run the following command:\n```bash\npython finetune/finetune.py \\\n  --model_path=\"bigcode/starcoder\"\\\n  --dataset_name=\"ArmelR/stack-exchange-instruction\"\\\n  --subset=\"data/finetune\"\\\n  --split=\"train\"\\\n  --size_valid_set 10000\\\n  --streaming\\\n  --seq_length 2048\\\n  --max_steps 1000\\\n  --batch_size 1\\\n  --input_column_name=\"question\"\\\n  --output_column_name=\"response\"\\\n  --gradient_accumulation_steps 16\\\n  --learning_rate 1e-4\\\n  --lr_scheduler_type=\"cosine\"\\\n  --num_warmup_steps 100\\\n  --weight_decay 0.05\\\n  --output_dir=\"./checkpoints\"\n```\nThe SE dataset is large, so it is more manageable when using streaming. We also have to specify the split of the dataset that is used. 
For more details, check the [dataset's page](https://huggingface.co/datasets/ArmelR/stack-exchange-instruction) on 🤗. Similarly, we can modify the command to run on multiple GPUs, replacing `number_of_gpus` with the number of GPUs available:\n\n```bash\npython -m torch.distributed.launch \\\n  --nproc_per_node number_of_gpus finetune/finetune.py \\\n  --model_path=\"bigcode/starcoder\"\\\n  --dataset_name=\"ArmelR/stack-exchange-instruction\"\\\n  --subset=\"data/finetune\"\\\n  --split=\"train\"\\\n  --size_valid_set 10000\\\n  --streaming \\\n  --seq_length 2048\\\n  --max_steps 1000\\\n  --batch_size 1\\\n  --input_column_name=\"question\"\\\n  --output_column_name=\"response\"\\\n  --gradient_accumulation_steps 16\\\n  --learning_rate 1e-4\\\n  --lr_scheduler_type=\"cosine\"\\\n  --num_warmup_steps 100\\\n  --weight_decay 0.05\\\n  --output_dir=\"./checkpoints\"\n```\n## Merging PEFT adapter layers\nIf you train a model with PEFT, you'll need to merge the adapter layers with the base model if you want to run inference / evaluation. 
To do so, run:\n```bash\npython finetune/merge_peft_adapters.py --base_model_name_or_path model_to_merge --peft_model_path model_checkpoint\n\n# Push merged model to the Hub\npython finetune/merge_peft_adapters.py --base_model_name_or_path model_to_merge --peft_model_path model_checkpoint --push_to_hub\n```\nFor example:\n\n```bash\npython finetune/merge_peft_adapters.py --base_model_name_or_path bigcode/starcoder --peft_model_path checkpoints/checkpoint-1000 --push_to_hub\n```\n\n# Evaluation\nTo evaluate StarCoder and its derivatives, you can use the [BigCode-Evaluation-Harness](https://github.com/bigcode-project/bigcode-evaluation-harness) for evaluating Code LLMs.\n\n# Inference hardware requirements\nIn FP32 the model requires more than 60GB of RAM. You can load it in FP16 or BF16 in ~30GB, or in 8-bit in under 20GB of RAM, with:\n```python\n# make sure you have accelerate and bitsandbytes installed\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\ntokenizer = AutoTokenizer.from_pretrained(\"bigcode/starcoder\")\n# for fp16, replace `load_in_8bit=True` with `torch_dtype=torch.float16` (and add `import torch`)\nmodel = AutoModelForCausalLM.from_pretrained(\"bigcode/starcoder\", device_map=\"auto\", load_in_8bit=True)\nprint(f\"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB\")\n```\n```\nMemory footprint: 15939.61 MB\n```\nYou can also try [starcoder.cpp](https://github.com/bigcode-project/starcoder.cpp), a C++ implementation built on the [ggml](https://github.com/ggerganov/ggml) library.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigcode-project%2Fstarcoder","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbigcode-project%2Fstarcoder","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigcode-project%2Fstarcoder/lists"}