https://github.com/unit-mesh/unit-gen
UnitGen is a data framework for code fine-tuning: it generates fine-tuning data directly from your existing codebase, covering code completion, test generation, documentation generation, and more.
- Host: GitHub
- URL: https://github.com/unit-mesh/unit-gen
- Owner: unit-mesh
- License: mpl-2.0
- Created: 2023-12-01T09:59:03.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2024-07-01T13:27:39.000Z (almost 2 years ago)
- Last Synced: 2025-04-01T07:54:21.740Z (about 1 year ago)
- Topics: data-engineering, evaluating, finetuning, llm
- Language: Kotlin
- Homepage: https://gen.unitmesh.cc/
- Size: 1.26 MB
- Stars: 54
- Watchers: 4
- Forks: 11
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
README
UnitGen
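> UnitGen is a data framework for code fine-tuning: it generates fine-tuning data directly from your codebase, covering code completion, test generation, documentation generation, and more.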
Docs: [https://gen.unitmesh.cc/](https://gen.unitmesh.cc/)
Thanks to [OpenBayes](https://openbayes.com/console/signup?r=phodal_uVxU) for providing computing resources.
Finetune Model Examples:
| name | model download (HuggingFace) | finetune Notebook | model download (OpenBayes) |
|---------------|---------------------------------------------------------------------------|--------------------------------------|-------------------------------------------------------------------------------------|
| DeepSeek 6.7B | [unit-mesh/autodev-coder](https://huggingface.co/unit-mesh/autodev-coder) | [finetune.ipynb](finetunes/deepseek) | [AutoDev Coder](https://openbayes.com/console/phodal/models/rCmer1KQSgp/9/overview) |
Language support by [Chapi](https://github.com/phodal/chapi):
- supported:
    - [x] Java
    - [x] Kotlin
- doing:
    - [x] TypeScript/JavaScript
    - [x] Rust
- future:
    - [ ] Go
    - [ ] Python
    - [ ] C/C++
    - [ ] C#
    - [ ] Scala
Features:
- Code context strategy: [Related code completion](https://gen.unitmesh.cc/instruction/related-code-completion), [Similar Code Completion](https://gen.unitmesh.cc/instruction/similar-code-completion)
- Instruction Builder type: inline, block, after block, documentation, test gen
- [Code quality](https://gen.unitmesh.cc/quality) filter and pipeline. Code smell, test smell, estimation and more.
## Architecture
Layered Architecture

Workflow

### Design Philosophy
- Unique prompt. Integrated use of fine-tuning, evaluation, and tooling.
- Code quality pipeline. Includes estimation, code complexity, bad smell, test bad smell, and more rules.
- Extensible, customizable quality thresholds. Custom rules, custom thresholds, custom quality types, and more.
### Unique Prompt
Keep the same prompt: AutoDev <-> UnitGen <-> UnitEval
#### AutoDev prompt
AutoDev prompt template example:

    Write unit test for following code.
    ${context.coc}
    ${context.framework}
    ${context.related_model}
    ```${context.language}
    ${context.selection}
    ```
#### Unit Picker prompt
Unit Picker prompt should keep the same structure as the AutoDev prompt. Prompt example:
```kotlin
Instruction(
    instruction = "Complete ${it.language} code, return rest code, no explaining",
    output = it.output,
    input = """
        |```${it.language}
        |${it.relatedCode}
        |```
        |
        |Code:
        |```${it.language}
        |${it.beforeCursor}
        |```""".trimMargin()
)
```
#### UnitGen prompt
UnitGen prompt should keep the same structure as the AutoDev prompt. Prompt example:

    Complete ${language} code, return rest code, no explaining
    ```${language}
    ${relatedCode}
    ```
    Code:
    ```${language}
    ${beforeCursor}
    ```
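The template above can be filled in with plain string assembly. The following is a minimal sketch, not UnitGen's own builder: the class `PromptSketch` and the method `render` are hypothetical names, and the blank line before `Code:` follows the Unit Picker example.

```java
final class PromptSketch {
    // Three backticks, built at runtime so this listing contains no literal code fence.
    static final String FENCE = "`".repeat(3);

    // Hypothetical helper: renders the completion prompt from its three inputs.
    static String render(String language, String relatedCode, String beforeCursor) {
        return String.join("\n",
                "Complete " + language + " code, return rest code, no explaining",
                FENCE + language,
                relatedCode,
                FENCE,
                "",
                "Code:",
                FENCE + language,
                beforeCursor,
                FENCE);
    }
}
```

Using one rendering function for fine-tuning data, evaluation, and the IDE plugin is one way to keep the prompt identical across AutoDev, UnitGen, and UnitEval.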
### Code quality pipeline

### Extensible, customizable quality thresholds
Optional quality type:
```kotlin
enum class CodeQualityType {
    BadSmell,
    TestBadSmell,
    JavaController,
    JavaRepository,
    JavaService,
}
```
Custom thresholds' config:
```kotlin
data class BsThresholds(
    val bsLongParasLength: Int = 5,
    val bsIfSwitchLength: Int = 8,
    val bsLargeLength: Int = 20,
    val bsMethodLength: Int = 30,
    val bsIfLinesLength: Int = 3,
)
```
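To illustrate how such thresholds might be applied, here is a minimal sketch. It mirrors only two of the fields above (`bsLongParasLength` and `bsMethodLength`) with the same defaults. The class names, the smell labels, and the strict `>` comparison are all assumptions, not UnitGen's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class ThresholdSketch {
    // Hypothetical mirror of two BsThresholds fields; defaults copied from the config above.
    static final class Thresholds {
        final int longParamsLength;
        final int methodLength;

        Thresholds(int longParamsLength, int methodLength) {
            this.longParamsLength = longParamsLength;
            this.methodLength = methodLength;
        }

        static Thresholds defaults() {
            return new Thresholds(5, 30);
        }
    }

    // Returns the smell labels triggered by one method; an empty list means it passes.
    // Assumption: a value must exceed the threshold (strictly greater) to count as a smell.
    static List<String> check(int parameterCount, int lineCount, Thresholds t) {
        List<String> smells = new ArrayList<>();
        if (parameterCount > t.longParamsLength) {
            smells.add("LongParameterList");
        }
        if (lineCount > t.methodLength) {
            smells.add("LongMethod");
        }
        return smells;
    }
}
```

Raising a threshold, for example `new ThresholdSketch.Thresholds(8, 50)`, lets more code through the filter; lowering it makes the generated dataset stricter.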
Custom rules:
```kotlin
val apis = apiAnalyser.toContainerServices()
val ruleset = RuleSet(
    RuleType.SQL_SMELL,
    "normal",
    UnknownColumnSizeRule(),
    LimitTableNameLengthRule()
    // more rules
)
val issues = WebApiRuleVisitor(apis).visitor(listOf(ruleset))
// if issues is not empty, the code has a bad smell
```
## Quick Start
For examples, see the [examples](https://github.com/unit-mesh/unit-gen/tree/master/examples) folder.
### Use CLI
See [config-examples](https://github.com/unit-mesh/unit-gen/tree/master/examples/config-examples/) for sample configurations.
Download the latest version from [GitHub Release](https://github.com/unit-mesh/unit-gen/releases).
#### Generate Instructions
1. Configure the project in `processor.yml`
2. Run the picker: `java -jar unit-gen.jar`
### Use Java API
See [config-example](examples/project-example/) for a sample project.
1. Add the dependencies
```groovy
dependencies {
    implementation("cc.unitmesh:unit-picker:0.1.5")
    implementation("cc.unitmesh:code-quality:0.1.5")
}
```
2. Configure the `unit-gen.yml` and `connection.yml` files
3. Write the code
```java
import java.util.ArrayList;
import java.util.List;

// Imports for the cc.unitmesh classes are omitted here.
public class App {
    public static void main(String[] args) {
        List<InstructionType> builderTypes = new ArrayList<>();
        builderTypes.add(InstructionType.RELATED_CODE_COMPLETION);

        List<CodeQualityType> codeQualityTypes = new ArrayList<>();
        codeQualityTypes.add(CodeQualityType.BadSmell);
        codeQualityTypes.add(CodeQualityType.JavaService);

        PickerOption pickerOption = new PickerOption(
                "https://github.com/unit-mesh/unit-gen-testing", "master", "java",
                ".", builderTypes, codeQualityTypes, new BuilderConfig()
        );

        SimpleCodePicker simpleCodePicker = new SimpleCodePicker(pickerOption);
        List<Instruction> output = simpleCodePicker.blockingExecute();
        // handle output in here
    }
}
```
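The example above leaves output handling open. One option is to serialize each instruction as a JSON Lines record for a fine-tuning job. The sketch below assumes that format and uses a hypothetical `Sample` class with the three fields from the Unit Picker prompt example (`instruction`, `input`, `output`). It does not use UnitGen's own `Instruction` type or serializer.

```java
import java.util.List;
import java.util.stream.Collectors;

final class JsonlSketch {
    // Hypothetical stand-in for one generated instruction.
    static final class Sample {
        final String instruction;
        final String input;
        final String output;

        Sample(String instruction, String input, String output) {
            this.instruction = instruction;
            this.input = input;
            this.output = output;
        }
    }

    // Escapes a string for use inside a JSON string literal.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Renders one sample as a single-line JSON object.
    static String toJsonLine(Sample s) {
        return "{\"instruction\":\"" + escape(s.instruction)
                + "\",\"input\":\"" + escape(s.input)
                + "\",\"output\":\"" + escape(s.output) + "\"}";
    }

    // Joins all samples into JSON Lines text, one object per line.
    static String toJsonl(List<Sample> samples) {
        return samples.stream().map(JsonlSketch::toJsonLine).collect(Collectors.joining("\n"));
    }
}
```

In a real project a JSON library would normally replace the hand-written `escape` method; it is spelled out here only to keep the sketch dependency-free.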
## Thanks to
- abstract syntax tree: [Chapi](https://github.com/phodal/chapi). Used features: parsing multiple languages into the same data structure.
- legacy system analysis: [Coca](https://github.com/phodal/coca). Inspired: Bad Smell, Test Bad Smell.
- architecture governance tool: [ArchGuard](https://github.com/archguard/archguard). Used features: Estimation, Rule Lint (API, SQL).
- code database: [CodeDB](https://github.com/archguard/codedb). Used features: code analysis pipeline.
## LICENSE
This code is distributed under the MPL 2.0 license. See `LICENSE` in this directory.