{"id":13723492,"url":"https://github.com/bigcode-project/starcoder2","last_synced_at":"2025-05-15T16:06:16.047Z","repository":{"id":224944497,"uuid":"729030266","full_name":"bigcode-project/starcoder2","owner":"bigcode-project","description":"Home of StarCoder2!","archived":false,"fork":false,"pushed_at":"2024-03-21T11:54:59.000Z","size":46,"stargazers_count":1897,"open_issues_count":17,"forks_count":171,"subscribers_count":18,"default_branch":"main","last_synced_at":"2025-04-07T21:13:56.169Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bigcode-project.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-12-08T08:46:25.000Z","updated_at":"2025-04-07T18:36:55.000Z","dependencies_parsed_at":null,"dependency_job_id":"970ca6b0-0204-4a39-b655-1907dcd5b613","html_url":"https://github.com/bigcode-project/starcoder2","commit_stats":null,"previous_names":["bigcode-project/starcoder2"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigcode-project%2Fstarcoder2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigcode-project%2Fstarcoder2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigcode-project%2Fstarcoder2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigcode-project%2Fstarcoder2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bigcode
-project","download_url":"https://codeload.github.com/bigcode-project/starcoder2/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254374470,"owners_count":22060611,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T01:01:42.222Z","updated_at":"2025-05-15T16:06:16.029Z","avatar_url":"https://github.com/bigcode-project.png","language":"Python","readme":"# StarCoder 2\n\n\u003cp align=\"center\"\u003e\u003ca href=\"https://huggingface.co/bigcode\"\u003e[🤗 Models \u0026 Datasets]\u003c/a\u003e | \u003ca href=\"https://arxiv.org/abs/2402.19173\"\u003e[Paper]\u003c/a\u003e\u003c/a\u003e \n\u003c/p\u003e\n\nStarCoder2 is a family of code generation models (3B, 7B, and 15B), trained on 600+ programming languages from [The Stack v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and some natural language text such as Wikipedia, Arxiv, and GitHub issues. The models use Grouped Query Attention, a context window of 16,384 tokens, with sliding window attention of 4,096 tokens. The 3B \u0026 7B models were trained on 3+ trillion tokens, while the 15B was trained on 4+ trillion tokens. For more details check out the [paper](https://drive.google.com/file/d/17iGn3c-sYNiLyRSY-A85QOzgzGnGiVI3/view).\n\n# Table of Contents\n1. [Quickstart](#quickstart)\n    - [Installation](#installation)\n    - [Model usage and memory footprint](#model-usage-and-memory-footprint)\n    - [Text-generation-inference code](#text-generation-inference)\n2. 
[Fine-tuning](#fine-tuning)\n    - [Setup](#setup)\n    - [Training](#training)\n3. [Evaluation](#evaluation)\n\n# Quickstart\nStarCoder2 models are intended for code completion; they are not instruction-tuned models, so commands like \"Write a function that computes the square root.\" do not work well.\n\n## Installation\nFirst, install all the libraries listed in `requirements.txt`:\n```bash\npip install -r requirements.txt\n# export your HF token, found here: https://huggingface.co/settings/account\nexport HF_TOKEN=xxx\n```\n\n## Model usage and memory footprint\nHere are some examples of loading the model and generating code, along with the memory footprint of the largest model, `StarCoder2-15B`. Ensure you've installed `transformers` from source (this should already be the case if you used `requirements.txt`):\n```bash\npip install git+https://github.com/huggingface/transformers.git\n```\n\n### Running the model on CPU/GPU/multi GPU\n* _Using full precision_\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\ncheckpoint = \"bigcode/starcoder2-15b\"\ndevice = \"cuda\"  # for GPU usage, or \"cpu\" for CPU usage\n\ntokenizer = AutoTokenizer.from_pretrained(checkpoint)\n# to use multiple GPUs, do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map=\"auto\")`\nmodel = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)\n\ninputs = tokenizer.encode(\"def print_hello_world():\", return_tensors=\"pt\").to(device)\noutputs = model.generate(inputs)\nprint(tokenizer.decode(outputs[0]))\n```\n\n* _Using `torch.bfloat16`_\n```python\n# pip install accelerate\nimport torch\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\n\ncheckpoint = \"bigcode/starcoder2-15b\"\ntokenizer = AutoTokenizer.from_pretrained(checkpoint)\n\n# for fp16, use `torch_dtype=torch.float16` instead\nmodel = AutoModelForCausalLM.from_pretrained(checkpoint, 
device_map=\"auto\", torch_dtype=torch.bfloat16)\n\ninputs = tokenizer.encode(\"def print_hello_world():\", return_tensors=\"pt\").to(\"cuda\")\noutputs = model.generate(inputs)\nprint(tokenizer.decode(outputs[0]))\n```\n```bash\n\u003e\u003e\u003e print(f\"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB\")\nMemory footprint: 32251.33 MB\n```\n\n### Quantized Versions through `bitsandbytes`\n* _Using 8-bit precision (int8)_\n\n```python\n# pip install bitsandbytes accelerate\nfrom transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig\n\n# to use 4bit use `load_in_4bit=True` instead\nquantization_config = BitsAndBytesConfig(load_in_8bit=True)\n\ncheckpoint = \"bigcode/starcoder2-15b_16k\"\ntokenizer = AutoTokenizer.from_pretrained(checkpoint)\nmodel = AutoModelForCausalLM.from_pretrained(\"bigcode/starcoder2-15b_16k\", quantization_config=quantization_config)\n\ninputs = tokenizer.encode(\"def print_hello_world():\", return_tensors=\"pt\").to(\"cuda\")\noutputs = model.generate(inputs)\nprint(tokenizer.decode(outputs[0]))\n```\n```bash\n\u003e\u003e\u003e print(f\"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB\")\n# load_in_8bit\nMemory footprint: 16900.18 MB\n# load_in_4bit\n\u003e\u003e\u003e print(f\"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB\")\nMemory footprint: 9224.60 MB\n```\nYou can also use `pipeline` for the generation:\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer, pipeline\ncheckpoint = \"bigcode/starcoder2-15b\"\n\nmodel = AutoModelForCausalLM.from_pretrained(checkpoint)\ntokenizer = AutoTokenizer.from_pretrained(checkpoint)\n\npipe = pipeline(\"text-generation\", model=model, tokenizer=tokenizer, device=0)\nprint( pipe(\"def hello():\") )\n```\n\n## Text-generation-inference: \n\n```bash\ndocker run -p 8080:80 -v $PWD/data:/data -e HUGGING_FACE_HUB_TOKEN=\u003cYOUR BIGCODE ENABLED TOKEN\u003e -d  ghcr.io/huggingface/text-generation-inference:latest 
--model-id bigcode/starcoder2-15b --max-total-tokens 8192\n```\nFor more details, see [here](https://github.com/huggingface/text-generation-inference).\n\n# Fine-tuning\n\nHere, we showcase how you can fine-tune StarCoder2 models. For more fine-tuning resources, you can check [StarCoder's GitHub repository](https://github.com/bigcode-project/starcoder) and [SantaCoder-Finetuning](https://github.com/loubnabnl/santacoder-finetuning).\n\n## Setup\n\nInstall `pytorch` ([see documentation](https://pytorch.org/)); for example, the following command works with CUDA 12.1:\n```bash\nconda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia\n```\n\nInstall the requirements (this installs `transformers` from source to support the StarCoder2 architecture):\n```bash\npip install -r requirements.txt\n```\n\nBefore you run any of the scripts, make sure you are logged in to `wandb` and the HuggingFace Hub so you can push the checkpoints:\n```bash\nwandb login\nhuggingface-cli login\n```\nWith setup complete, you can clone the repository and change into the corresponding directory.\n\n## Training\nTo fine-tune efficiently at low cost, we use the [PEFT](https://github.com/huggingface/peft) library for Low-Rank Adaptation (LoRA) training and [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) for 4-bit quantization. We also use the `SFTTrainer` from [TRL](https://github.com/huggingface/trl).\n\nFor this example, we will fine-tune StarCoder2-3b on the `Rust` subset of [the-stack-smol](https://huggingface.co/datasets/bigcode/the-stack-smol). This is just for illustration purposes; for a larger and cleaner dataset of Rust code, you can use [The Stack dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup).
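The core idea behind the LoRA training mentioned above can be sketched in plain NumPy. This is an illustrative sketch only, not the actual `finetune.py` or PEFT implementation; the dimensions and hyperparameters here are made up for the example:

```python
import numpy as np

# Low-Rank Adaptation (LoRA): instead of updating a full frozen weight
# matrix W (d_out x d_in), train two small matrices A (r x d_in) and
# B (d_out x r), and apply W + scale * (B @ A) at inference time.
d_out, d_in, r = 64, 64, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))    # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))                  # B starts at zero, so the delta is zero
alpha = 16
scale = alpha / r

def effective_weight(W, A, B, scale):
    """Weight actually applied at inference: W + scale * (B @ A)."""
    return W + scale * (B @ A)

# Before any training, the adapter is a no-op: effective weight equals W.
assert np.allclose(effective_weight(W, A, B, scale), W)

# Parameter savings: a full update trains d_out * d_in values,
# while the low-rank update trains only r * (d_in + d_out).
full_params = d_out * d_in
lora_params = r * (d_in + d_out)
print(full_params, lora_params)  # prints: 4096 1024
```

At rank 8 on this toy 64x64 matrix, the adapter trains 4x fewer parameters than a full update; for the large projection matrices in a real model, the savings are far greater, which is what makes 4-bit-quantized LoRA fine-tuning cheap.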
\n\nTo launch the training:\n```bash\naccelerate launch finetune.py \\\n        --model_id \"bigcode/starcoder2-3b\" \\\n        --dataset_name \"bigcode/the-stack-smol\" \\\n        --subset \"data/rust\" \\\n        --dataset_text_field \"content\" \\\n        --split \"train\" \\\n        --max_seq_length 1024 \\\n        --max_steps 10000 \\\n        --micro_batch_size 1 \\\n        --gradient_accumulation_steps 8 \\\n        --learning_rate 2e-5 \\\n        --warmup_steps 20 \\\n        --num_proc \"$(nproc)\"\n```\n\nIf you want to fine-tune on other text datasets, change the `dataset_text_field` argument to the name of the column containing the code/text you want to train on.\n\n# Evaluation\nTo evaluate StarCoder2 and its derivatives, you can use the [BigCode-Evaluation-Harness](https://github.com/bigcode-project/bigcode-evaluation-harness), a framework for evaluating Code LLMs. You can also check the [BigCode Leaderboard](https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard).\n","funding_links":[],"categories":["🔥 2024-2025 Trending Models","\u003cspan id=\"code\"\u003eCode\u003c/span\u003e","Other my awesome lists","Python","🔓 Open Source LLM Models"],"sub_categories":["🚀 Specialized Models","\u003cspan id=\"tool\"\u003eLLM (LLM \u0026 Tool)\u003c/span\u003e","Local / Self-hosted"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigcode-project%2Fstarcoder2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbigcode-project%2Fstarcoder2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigcode-project%2Fstarcoder2/lists"}