Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/salesforce/CodeGen
CodeGen is a family of open-source models for program synthesis. Trained on TPU-v4. Competitive with OpenAI Codex.
codex generativemodel languagemodel llm programsynthesis tpu-acceleration
Last synced: 5 days ago
CodeGen is a family of open-source models for program synthesis. Trained on TPU-v4. Competitive with OpenAI Codex.
- Host: GitHub
- URL: https://github.com/salesforce/CodeGen
- Owner: salesforce
- License: apache-2.0
- Created: 2022-03-28T20:48:29.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-03-17T22:00:24.000Z (8 months ago)
- Last Synced: 2024-10-29T15:35:06.910Z (6 days ago)
- Topics: codex, generativemodel, languagemodel, llm, programsynthesis, tpu-acceleration
- Language: Python
- Homepage:
- Size: 1.35 MB
- Stars: 4,918
- Watchers: 81
- Forks: 380
- Open Issues: 43
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.txt
- Code of conduct: CODE_OF_CONDUCT.md
- Codeowners: CODEOWNERS
- Security: SECURITY.md
Awesome Lists containing this project
- awesome-ai-coding - CodeGen 350M/2B/6B/16B
- awesome-coding-assistants - CodeGen
- awesome-llmops - CodeGen - A family of open-source models for program synthesis. Trained on TPU-v4. Competitive with OpenAI Codex. | ![GitHub Badge](https://img.shields.io/github/stars/salesforce/CodeGen.svg?style=flat-square) | (Code AI / Vector search)
- ai-game-devtools - CodeGen - A family of open-source models for program synthesis. Trained on TPU-v4. Competitive with OpenAI Codex. | [arXiv](https://arxiv.org/abs/2203.13474) | | Code | (Code / Tool (AI LLM))
- awesome-code-ai - Salesforce CodeGen (open-source) (Code completion LLMs)
- StarryDivineSky - salesforce/CodeGen - A family of open-source models for program synthesis, trained on TPU-v4 and competitive with OpenAI Codex. (Text generation, text dialogue / large language dialogue models and data)
- my-awesome - salesforce/CodeGen - tpu-acceleration pushed_at:2024-03 star:4.9k fork:0.4k CodeGen is a family of open-source models for program synthesis. Trained on TPU-v4. Competitive with OpenAI Codex. (Python)
README
# CodeGen
Official release for the **CodeGen1** and **CodeGen2** models (`350M`, `1B`, `3B`, `7B`, `16B`) for **Program Synthesis** by [Salesforce AI Research](https://www.salesforceairesearch.com/).
## News
**July 2023**
[**CodeGen2.5**](https://github.com/salesforce/CodeGen/tree/main/codegen25) released, outperforming 16B-parameter models with only 7B parameters.
**May 2023**
**CodeGen2.0** released with strong infill sampling capability.
**March 2022**
**CodeGen1.0** released, on par with OpenAI Codex at the time.
## Publications
[CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis](https://arxiv.org/abs/2203.13474)
[Erik Nijkamp](https://enijkamp.github.io/)\*, [Bo Pang](https://scholar.google.com/citations?user=s9fNEVEAAAAJ&hl=en)\*, [Hiroaki Hayashi](https://hiroakih.me/)\*, [Lifu Tu](https://home.ttic.edu/~lifu/), [Huan Wang](https://scholar.google.com/citations?user=7NpTttkAAAAJ&hl=en), [Yingbo Zhou](https://scholar.google.com/citations?user=H_6RQ7oAAAAJ&hl=en), [Silvio Savarese](https://scholar.google.com/citations?user=ImpbxLsAAAAJ&hl=en), and [Caiming Xiong](https://scholar.google.com/citations?user=vaSdahkAAAAJ&hl=en)
ICLR, 2023

[CodeGen2: Lessons for Training LLMs on Programming and Natural Languages](https://arxiv.org/abs/2305.02309)
[Erik Nijkamp](https://enijkamp.github.io/)\*, [Hiroaki Hayashi](https://hiroakih.me/)\*, [Caiming Xiong](https://scholar.google.com/citations?user=vaSdahkAAAAJ&hl=en), [Silvio Savarese](https://scholar.google.com/citations?user=ImpbxLsAAAAJ&hl=en), and [Yingbo Zhou](https://scholar.google.com/citations?user=H_6RQ7oAAAAJ&hl=en)
ICLR, 2023

## Usage
The models are available on the [Hugging Face Hub](https://huggingface.co/models?search=salesforce+codegen).
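Checkpoints come in several variants (for example `mono`, `multi`, and `nl` for CodeGen1). As a convenience beyond the original README, here is a minimal sketch for enumerating them programmatically, assuming the `huggingface_hub` client library is installed:

```python
# Sketch (not from the README): list CodeGen checkpoints published on the Hub.
from huggingface_hub import HfApi

for model in HfApi().list_models(author="Salesforce", search="codegen"):
    print(model.id)  # e.g. Salesforce/codegen-2B-mono, Salesforce/codegen2-7B, ...
```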
**CodeGen1.0**
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-2B-mono")
inputs = tokenizer("# this function prints hello world", return_tensors="pt")
sample = model.generate(**inputs, max_length=128)
print(tokenizer.decode(sample[0], truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"]))
```
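CodeGen1 was trained for multi-turn program synthesis, so a completion can be refined by appending a follow-up comment to the previous turn's output and generating again. A hedged sketch of that loop, reusing the decoding call above (the 350M checkpoint and the `complete` helper are illustrative choices, not from the README):

```python
# Sketch of multi-turn prompting: generate from a comment, then refine by
# extending the accepted context with a follow-up comment.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")

def complete(prompt: str, max_length: int = 128) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    sample = model.generate(**inputs, max_length=max_length)
    return tokenizer.decode(
        sample[0], truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"]
    )

# Turn 1: describe the function to synthesize.
context = complete("# write a function that returns the nth fibonacci number")
# Turn 2: append a follow-up instruction to the prior turn and generate again.
context = complete(context + "\n\n# now print the 10th fibonacci number", max_length=256)
print(context)
```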
**CodeGen2.0**

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen2-7B")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen2-7B", trust_remote_code=True, revision="main")
inputs = tokenizer("# this function prints hello world", return_tensors="pt")
sample = model.generate(**inputs, max_length=128)
print(tokenizer.decode(sample[0], truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"]))
```
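CodeGen2's headline feature is infill sampling. The sentinel-token format below follows the CodeGen2 model card (an assumption beyond this README): `<mask_1>` marks the span to fill, `<|endoftext|>` closes the visible context, and `<sep><mask_1>` asks the model to emit the missing span. A sketch:

```python
# Sketch of infill sampling with CodeGen2, per the model card's sentinel format.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen2-7B")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen2-7B", trust_remote_code=True, revision="main")

# The function body between prefix and suffix is what the model should fill in.
prefix = "def hello_world():\n    "
suffix = "    return name"
text = prefix + "<mask_1>" + suffix + "<|endoftext|>" + "<sep>" + "<mask_1>"

inputs = tokenizer(text, return_tensors="pt")
sample = model.generate(**inputs, max_length=128)
# Text generated after the prompt is the proposed infill
# (slicing by character length is approximate).
print(tokenizer.decode(sample[0])[len(text):])
```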
**CodeGen2.5**

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen25-7b-mono", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen25-7b-mono")
inputs = tokenizer("# this function prints hello world", return_tensors="pt")
sample = model.generate(**inputs, max_length=128)
print(tokenizer.decode(sample[0]))
```
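For the larger checkpoints, it is common to load the weights in half precision on a GPU and enable sampling for more varied completions. A minimal sketch under those assumptions (the device, dtype, and sampling parameters are illustrative, not prescribed by the README):

```python
# Sketch (not from the README): half-precision GPU loading plus sampled decoding.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen25-7b-mono", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Salesforce/codegen25-7b-mono",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

inputs = tokenizer("# this function prints hello world", return_tensors="pt").to(device)
sample = model.generate(**inputs, do_sample=True, temperature=0.2, top_p=0.95, max_length=128)
print(tokenizer.decode(sample[0]))
```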
## Training

The Jaxformer library for data pre-processing, training, and fine-tuning the CodeGen models can be found here:
https://github.com/salesforce/jaxformer
## Citation
If you find our code or paper useful, please cite the paper:
```bibtex
@article{nijkamp2022codegen,
  title={CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis},
  author={Nijkamp, Erik and Pang, Bo and Hayashi, Hiroaki and Tu, Lifu and Wang, Huan and Zhou, Yingbo and Savarese, Silvio and Xiong, Caiming},
  journal={ICLR},
  year={2023}
}

@article{nijkamp2023codegen2,
  title={CodeGen2: Lessons for Training LLMs on Programming and Natural Languages},
  author={Nijkamp, Erik and Hayashi, Hiroaki and Xiong, Caiming and Savarese, Silvio and Zhou, Yingbo},
  journal={ICLR},
  year={2023}
}
```