{"id":18428962,"url":"https://github.com/codefuse-ai/codefuse-cge","last_synced_at":"2025-04-07T17:32:14.096Z","repository":{"id":255702022,"uuid":"852194721","full_name":"codefuse-ai/CodeFuse-CGE","owner":"codefuse-ai","description":null,"archived":false,"fork":false,"pushed_at":"2024-12-26T02:49:18.000Z","size":95,"stargazers_count":19,"open_issues_count":2,"forks_count":3,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-03-22T21:51:07.227Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/codefuse-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-04T11:48:13.000Z","updated_at":"2025-03-18T06:51:22.000Z","dependencies_parsed_at":"2024-09-06T19:52:52.911Z","dependency_job_id":"08d408e7-d6f1-4a34-993f-d0c2a6c90ebd","html_url":"https://github.com/codefuse-ai/CodeFuse-CGE","commit_stats":null,"previous_names":["codefuse-ai/codefuse-cge"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codefuse-ai%2FCodeFuse-CGE","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codefuse-ai%2FCodeFuse-CGE/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codefuse-ai%2FCodeFuse-CGE/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codefuse-ai%2FCodeFuse-CGE/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/codefuse-ai","download_url":"https://codelo
ad.github.com/codefuse-ai/CodeFuse-CGE/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247697854,"owners_count":20981260,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T05:15:07.878Z","updated_at":"2025-04-07T17:32:14.090Z","avatar_url":"https://github.com/codefuse-ai.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## CodeFuse-CGE\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://modelscope.cn/api/v1/models/codefuse-ai/CodeFuse-QWen-14B/repo?Revision=master\u0026FilePath=LOGO.jpg\u0026View=true\" width=\"800\"/\u003e\n\u003c/p\u003e\n\nIn this project, we introduce CodeFuse-CGE (Code General Embedding), which stands out on text2code tasks for its strong ability to capture the semantic relationship between text and code.  \nThis model has the following notable features:  \n● Instruction tuning is enabled on both the query side and the code snippet side.  \n● The model obtains sentence-level and code-level representations through a cross-attention computation layer.   
\n● The model uses a smaller embedding dimension without significant degradation in performance.\n\nCodeFuse-CGE-Large Model Configuration  \nHugging Face: [codefuse-ai/CodeFuse-CGE-Large](https://huggingface.co/codefuse-ai/CodeFuse-CGE-Large)   \nBase Model: CodeQwen1.5-7B-Chat  \nModel Size: 7B  \nEmbedding Dimension: 1024  \nHidden Layers: 32  \n\nRequirements  \n```\nflash_attn==2.4.2\ntorch==2.1.0\naccelerate==0.28.0\ntransformers==4.39.2\nvllm==0.5.3\n```\n\n\nCodeFuse-CGE-Small Model Configuration  \nHugging Face: [codefuse-ai/CodeFuse-CGE-Small](https://huggingface.co/codefuse-ai/CodeFuse-CGE-Small)    \nBase Model: Phi-3.5-mini-instruct  \nModel Size: 3.8B  \nEmbedding Dimension: 1024  \nHidden Layers: 32  \n\nRequirements  \n```\nflash_attn==2.4.2\ntorch==2.1.0\naccelerate==0.28.0\ntransformers\u003e=4.43.0\n```\n\n\n## Benchmark the Performance\nWe use the MRR metric to evaluate performance on the text2code retrieval tasks AdvTest, CoSQA, and CSN.  \n\n![result](./resources/result.png)\n\n## How to Use\n\nFirst, download the model files from Hugging Face.\n\n### Transformers\n```\nimport torch\nfrom transformers import AutoTokenizer, AutoModel\n\n# Path to the locally downloaded model directory\nmodel_name_or_path = \"CodeFuse-CGE-Large\"\nmodel = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True)\ntokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True, truncation_side='right', padding_side='right')\n\n# Run on the GPU when one is available\nif torch.cuda.is_available():\n    device = 'cuda'\nelse:\n    device = 'cpu'\nmodel.to(device)\n\n# Per-language instruction strings for the query side and the code (passage) side\nprefix_dict =  {'python':{'query':'Retrieve the Python code that solves the following query:', 'passage':'Python code:'},\n                'java':{'query':'Retrieve the Java code that solves the following query:', 'passage':'Java code:'},\n                'go':{'query':'Retrieve the Go code that solves the following query:', 'passage':'Go code:'},\n                'c++':{'query':'Retrieve the C++ code that solves the following query:', 'passage':'C++ code:'},\n                
'javascript':{'query':'Retrieve the Javascript code that solves the following query:', 'passage':'Javascript code:'},\n                'php':{'query':'Retrieve the PHP code that solves the following query:', 'passage':'PHP code:'},\n                'ruby':{'query':'Retrieve the Ruby code that solves the following query:', 'passage':'Ruby code:'},\n                'default':{'query':'Retrieve the code that solves the following query:', 'passage':'Code:'}\n                }\n\ntext = [\"Writes a Boolean to the stream.\",\n        \"def writeBoolean(self, n): t = TYPE_BOOL_TRUE if n is False: t = TYPE_BOOL_FALSE self.stream.write(t)\"]\ntext[0] += prefix_dict['python']['query']\ntext[1] += prefix_dict['python']['passage']\nembed = model.encode(tokenizer, text)\nscore = embed[0] @ embed[1].T\nprint(\"score\", score)\n```\n\n### Vllm\nWe have also adapted Vllm to reduce latency during deployment.\n```\nfrom vllm import ModelRegistry\nfrom utils.vllm_codefuse_cge_large import CodeFuse_CGE_Large\nfrom vllm.model_executor.models import ModelRegistry\nfrom vllm import LLM\n\ndef always_true_is_embedding_model(model_arch: str) -\u003e bool:\n    return True\nModelRegistry.is_embedding_model = always_true_is_embedding_model\nModelRegistry.register_model(\"CodeFuse_CGE_Large\", CodeFuse_CGE_Large)\n\n\nmodel_name_or_path = \"CodeFuse-CGE-Large\"\nmodel = LLM(model=model_name_or_path, trust_remote_code=True, enforce_eager=True, enable_chunked_prefill=False)\nprefix_dict =  {'python':{'query':'Retrieve the Python code that solves the following query:', 'passage':'Python code:'},\n                'java':{'query':'Retrieve the Java code that solves the following query:', 'passage':'Java code:'},\n                'go':{'query':'Retrieve the Go code that solves the following query:', 'passage':'Go code:'},\n                'c++':{'query':'Retrieve the C++ code that solves the following query:', 'passage':'C++ code:'},\n                'javascript':{'query':'Retrieve the Javascript 
code that solves the following query:', 'passage':'Javascript code:'},\n                'php':{'query':'Retrieve the PHP code that solves the following query:', 'passage':'PHP code:'},\n                'ruby':{'query':'Retrieve the Ruby code that solves the following query:', 'passage':'Ruby code:'},\n                'default':{'query':'Retrieve the code that solves the following query:', 'passage':'Code:'}\n                }\n\ntext = [\"Return the best fit based on rsquared\",\n        \"def find_best_rsquared ( list_of_fits ) : res = sorted ( list_of_fits , key = lambda x : x . rsquared ) return res [ - 1 ]\"]\ntext[0] += prefix_dict['python']['query']\ntext[1] += prefix_dict['python']['passage']\n# Encode one text per call; see note 1 below on the batch size limit\nembed_0 = model.encode([text[0]])[0].outputs.embedding\nembed_1 = model.encode([text[1]])[0].outputs.embedding\n```\nNote:  \n1. With the vLLM adaptation, the model accepts only inputs with a batch size of 1; larger batches cause an array overflow error.  \n2. Only the CodeFuse-CGE-Large model has been adapted so far; support for the CodeFuse-CGE-Small model will be available soon.\n\n## Contact us\n\n![CodeFuse-AI](./resources/CodeFuse-AI.png)\n\n\n\n## Acknowledgement\nThanks to the authors of the open-source datasets, including CSN, AdvTest, and CoSQA.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodefuse-ai%2Fcodefuse-cge","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcodefuse-ai%2Fcodefuse-cge","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodefuse-ai%2Fcodefuse-cge/lists"}