{"id":13775159,"url":"https://github.com/ibm-granite/granite-code-models","last_synced_at":"2026-02-20T17:31:23.057Z","repository":{"id":238521214,"uuid":"790926471","full_name":"ibm-granite/granite-code-models","owner":"ibm-granite","description":"Granite Code Models: A Family of Open Foundation Models for Code Intelligence","archived":false,"fork":false,"pushed_at":"2025-06-25T20:34:38.000Z","size":17928,"stargazers_count":1234,"open_issues_count":4,"forks_count":88,"subscribers_count":24,"default_branch":"main","last_synced_at":"2025-10-29T07:48:06.437Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ibm-granite.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-04-23T19:23:54.000Z","updated_at":"2025-10-27T10:00:35.000Z","dependencies_parsed_at":"2024-06-21T22:43:21.257Z","dependency_job_id":"b35011b4-8942-44eb-be38-88f24eb6061e","html_url":"https://github.com/ibm-granite/granite-code-models","commit_stats":{"total_commits":39,"total_committers":8,"mean_commits":4.875,"dds":0.5897435897435898,"last_synced_commit":"05926e7ec65dd12ef0c49cab044b4e41253bdc2b"},"previous_names":["ibm-granite/granite-code-models"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ibm-granite/granite-code-models","reposit
ory_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ibm-granite%2Fgranite-code-models","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ibm-granite%2Fgranite-code-models/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ibm-granite%2Fgranite-code-models/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ibm-granite%2Fgranite-code-models/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ibm-granite","download_url":"https://codeload.github.com/ibm-granite/granite-code-models/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ibm-granite%2Fgranite-code-models/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29658368,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-20T16:33:43.953Z","status":"ssl_error","status_checked_at":"2026-02-20T16:33:43.598Z","response_time":59,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T17:01:34.663Z","updated_at":"2026-02-20T17:31:23.020Z","avatar_url":"https://github.com/ibm-granite.png","language":null,"funding_links":[],"categories":["A01_文本生成_文本对话","Projekte","Others"],"sub_categories":["大语言对话模型及数据","🦄 LLMs"],"readme":"\u003cp align=\"center\"\u003e\n 
 \u003cimg src=\"figures/granite-code-models-3x-v4.png\" /\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  :books: \u003ca href=\"https://arxiv.org/abs/2405.04324\"\u003ePaper\u003c/a\u003e\u0026nbsp | :hugs: \u003ca href=\"https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330\"\u003eHuggingFace Collection\u003c/a\u003e\u0026nbsp | \n  :speech_balloon: \u003ca href=\"https://github.com/orgs/ibm-granite/discussions\"\u003eDiscussions Page\u003c/a\u003e\u0026nbsp\n\u003cbr\u003e\n\n---\n## Introduction to Granite Code Models\nWe introduce the Granite series of decoder-only code models for code generative tasks (e.g., fixing bugs, explaining code, documenting code), trained with code written in 116 programming languages. A comprehensive evaluation of the Granite Code model family on diverse tasks demonstrates that our models consistently reach state-of-the-art performance among available open source code LLMs.  \n\nThe key advantages of Granite Code models include:\n* All-rounder Code LLM: Granite Code models achieve competitive or state-of-the-art performance on different kinds of code-related tasks, including code generation, explanation, fixing, editing, translation, and more, demonstrating their ability to solve diverse coding tasks.\n* Trustworthy Enterprise-Grade LLM: All our models are trained on license-permissible data collected following [IBM's AI Ethics principles](https://www.ibm.com/impact/ai-ethics) and guided by IBM’s Corporate Legal team for trustworthy enterprise usage. 
We release all our Granite Code models under an [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0) for research and commercial use.\n\nThe family of **Granite Code Models** comes in two main variants:\n\n* Granite Code Base Models: base foundational models designed for code-related tasks (e.g., code repair, code explanation, code synthesis).\n* Granite Code Instruct Models: instruction following models finetuned using a combination of Git commits paired with human instructions and open source synthetically generated code instruction datasets.\n\nBoth base and instruct models are available in sizes of 3B, 8B, 20B, and 34B parameters.\n\n## Data Collection\nOur process to prepare code pretraining data involves several stages. First, we collect a combination of publicly available datasets (e.g., GitHub Code Clean, Starcoder data), public code repositories, and issues from GitHub. Second, we filter the code data collected based on the programming language in which data is written (which we determined based on file extension). Then, we also filter out data with low code quality. Third, we adopt an aggressive deduplication strategy that includes both exact and fuzzy deduplication to remove documents having (near) identical code content. Finally, we apply a HAP content filter that reduces models' likelihood of generating hateful, abusive, or profane language. We also make sure to redact Personally Identifiable Information (PII) by replacing PII content (e.g., names, email addresses, keys, passwords) with corresponding tokens (e.g., ⟨NAME⟩, ⟨EMAIL⟩, ⟨KEY⟩, ⟨PASSWORD⟩). We also scan all datasets using ClamAV to identify and remove instances of malware in the source code. 
In addition to collecting code data for model training, we curate several publicly available high-quality natural language datasets for improving the model’s proficiency in language understanding and mathematical reasoning.\n\n## Pretraining\nThe **Granite Code Base** models are trained on 3-4T tokens of code data and natural language datasets related to code. Data is tokenized via byte pair encoding (BPE), employing the same tokenizer as StarCoder. We utilize high-quality data with two phases of training as follows:\n\n* Phase 1 (code only training): During phase 1, the 3B and 8B models are trained on 4 trillion tokens of code data comprising 116 languages. The 20B parameter model is trained on 3 trillion tokens of code. The 34B model is trained on 1.4T tokens after depth upscaling, which is done on the 1.6T checkpoint of the 20B model.\n* Phase 2 (code + language training): In phase 2, we include additional high-quality publicly available data from various domains, including technical, mathematics, and web documents, to further improve the model’s performance. 
We train all our models for 500B tokens (80% code-20% language mixture) in phase 2 training.\n\n## Instruction Tuning\nGranite Code Instruct models are finetuned on the following types of instruction data: 1) code commits sourced from [CommitPackFT](https://huggingface.co/datasets/bigcode/commitpackft), 2) high-quality math datasets, specifically we used [MathInstruct](https://huggingface.co/datasets/TIGER-Lab/MathInstruct) and [MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA), 3) Code instruction datasets such as [Glaive-Code-Assistant-v3](https://huggingface.co/datasets/glaiveai/glaive-code-assistant-v3), [Self-OSS-Instruct-SC2](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k), [Glaive-Function-Calling-v2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2), [NL2SQL11](https://huggingface.co/datasets/bugdaryan/sql-create-context-instruction) and a small collection of synthetic API calling datasets, and 4) high-quality language instruction datasets such as [HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer) and an open license-filtered version of [Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus).\n\n## Evaluation Results\nWe conduct an extensive evaluation of our code models on a comprehensive list of benchmarks that includes but is not limited to HumanEvalPack, MBPP, and MBPP+. This set of benchmarks encompasses different coding tasks across commonly used programming languages (e.g., Python, JavaScript, Java, Go, C++, Rust).\n\nOur findings reveal that Granite Code models outperform strong open source models across model sizes. The figure below illustrates how `Granite-8B-Code-Base` outperforms `Mistral-7B`, `LLama-3-8B`, and other open source models in three coding tasks. 
We provide further evaluation results in our [paper](https://arxiv.org/abs/2405.04324).\n\n\u003cimg src=\"./figures/GraniteCodeFigure1.jpg\" /\u003e\n\n## How to Use our Models?\n\nTo use any of our models, pick an appropriate `model_path` from:\n1. `ibm-granite/granite-3b-code-base-2k`\n2. `ibm-granite/granite-3b-code-instruct-2k`\n3. `ibm-granite/granite-8b-code-base-4k`\n4. `ibm-granite/granite-8b-code-instruct-4k`\n5. `ibm-granite/granite-20b-code-base-8k`\n6. `ibm-granite/granite-20b-code-instruct-8k`\n7. `ibm-granite/granite-34b-code-base-8k`\n8. `ibm-granite/granite-34b-code-instruct-8k`\n\n### Inference\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\ndevice = \"cuda\" # or \"cpu\"\nmodel_path = \"ibm-granite/granite-3b-code-base-2k\" # pick any one from the list above\n\ntokenizer = AutoTokenizer.from_pretrained(model_path)\n\n# drop device_map if running on CPU\nmodel = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)\nmodel.eval()\n\n# change input text as desired\ninput_text = \"def generate():\"\n# tokenize the text\ninput_tokens = tokenizer(input_text, return_tensors=\"pt\")\n\n# transfer tokenized inputs to the device\nfor i in input_tokens:\n    input_tokens[i] = input_tokens[i].to(device)\n\n# generate output tokens\noutput = model.generate(**input_tokens)\n# decode output tokens into text\noutput = tokenizer.batch_decode(output)\n\n# loop over the batch to print, in this example the batch size is 1\nfor i in output:\n    print(i)\n```\n\n### Finetuning\nWe use [Dolomite Engine](https://github.com/IBM/dolomite-engine/) for finetuning (or instruction tuning) all our models. We provide sample scripts for finetuning `ibm-granite/granite-3b-code-base`. 
To finetune the models, simply follow these steps:\n```shell\ngit clone https://github.com/IBM/dolomite-engine/\ncd dolomite-engine\n\n# you might need to modify configs/granite-example/training.yml\nsh scripts/finetune.sh configs/granite-example/training.yml\n\n# once the model is trained, convert to HuggingFace-compatible safetensors\nsh scripts/export.sh configs/granite-example/export.yml\n```\n\n\u003e [!TIP]\n\u003e If you would like to use [padding-free transformers](https://huggingface.co/blog/mayank-mishra/padding-free-transformer) to reduce memory footprint and FLOPs during training, follow the instructions in the [Dolomite Engine README](https://github.com/IBM/dolomite-engine?tab=readme-ov-file#huggingface-compatible-custom-models).\n\n## How to Contribute to this Project?\nPlease check our [Guidelines](/CONTRIBUTING.md) and [Code of Conduct](/CODE_OF_CONDUCT.md) to contribute to our project.\n\n## Model Cards\nThe model cards for each model variant are available in their respective HuggingFace repository. Please visit our collection [here](https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330).\n\n## How to Download our Models?\nThe model of choice (granite-3b-code-base in this example) can be cloned using:\n```shell\ngit clone https://huggingface.co/ibm-granite/granite-3b-code-base-2k\n```\n\n## License \nAll Granite Code Models are distributed under the [Apache 2.0](./LICENSE) license.\n\n## Would you like to provide feedback?\nPlease let us know your comments about our family of code models by visiting our [collection](https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330). Select the repository of the model you would like to provide feedback about. Then, go to the *Community* tab, and click on *New discussion*. 
Alternatively, you can also post any questions/comments on our [GitHub Discussions page](https://github.com/orgs/ibm-granite/discussions).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fibm-granite%2Fgranite-code-models","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fibm-granite%2Fgranite-code-models","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fibm-granite%2Fgranite-code-models/lists"}