Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
[ICML'24] Magicoder: Empowering Code Generation with OSS-Instruct
https://github.com/ise-uiuc/magicoder
ai4code large-language-models llm llm4code
[ICML'24] Magicoder: Empowering Code Generation with OSS-Instruct
- Host: GitHub
- URL: https://github.com/ise-uiuc/magicoder
- Owner: ise-uiuc
- License: mit
- Created: 2023-11-10T07:35:29.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-01T17:28:52.000Z (about 2 months ago)
- Last Synced: 2024-12-14T01:01:06.162Z (8 days ago)
- Topics: ai4code, large-language-models, llm, llm4code
- Language: Python
- Homepage: https://proceedings.mlr.press/v235/wei24h.html
- Size: 2.4 MB
- Stars: 1,983
- Watchers: 25
- Forks: 166
- Open Issues: 5
Metadata Files:
- Readme: README-DEV.md
- License: LICENSE
- Citation: CITATION.cff
Awesome Lists containing this project
- StarryDivineSky - ise-uiuc/magicoder - The family of models powered by OSS-Instruct, a novel approach that uses open-source code snippets to generate low-bias, high-quality instruction data for code with LLMs. By grounding LLMs in a wealth of open-source references, OSS-Instruct produces more diverse, realistic, and controllable data, mitigating the inherent bias of synthetic instruction data. (A01_Text Generation_Text Dialogue / Large language dialogue models and data)
README
# Implementation Details of 🎩Magicoder
> [!WARNING]
> This documentation is still a work in progress. Please raise an [issue](https://github.com/ise-uiuc/magicoder/issues) if you find any errors.

## Data collection and generation
Make sure you have set up your `OPENAI_API_KEY` and, optionally, `OPENAI_BASE_URL` environment variables.
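For illustration, a minimal shell setup might look like the sketch below; the key value and base URL are placeholders, not real credentials or an endpoint required by the project.

```bash
# Placeholder values for illustration only; substitute your own API key and,
# if needed, a custom endpoint.
export OPENAI_API_KEY="sk-..."
export OPENAI_BASE_URL="https://api.openai.com/v1"  # optional
```

Then run: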
```bash
python src/magicoder/generate_data.py \
--seed_code_start_index ${START_INDEX_OF_RAW_DATA} \
--max_new_data ${MAX_DATA_TO_GENERATE} \
--data_dir python \
--tag python
```

To continue an interrupted run, use the `--continue_from` flag:
```bash
python src/magicoder/generate_data.py \
--seed_code_start_index ${START_INDEX_OF_RAW_DATA} \
--max_new_data ${MAX_DATA_TO_GENERATE} \
--data_dir python \
--continue_from ${PATH_TO_DATA_FILE}
```

## Data cleaning and decontamination
After the data collection, clean and decontaminate the data with the following command:
```bash
python src/magicoder/clean_data.py --data_files ${PATH_TO_DATA_FILE} --output_file ${CLEANING_OUTPUT_PATH}

python -m magicoder.decontamination.find_substrings \
--dataset_name "json" \
--output_file ${DECONTAM_OUTPUT_PATH} \
--output_dir ${OUTPUT_DIR} \
--columns problem solution \
--data_files ${PATH_TO_DATA_FILE}
```

You probably need to run this multiple times with different data files.
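For instance, if your generated data is spread across several files, a simple loop over them might look like the sketch below; the file names and output directory are hypothetical, used only for illustration.

```bash
# Hypothetical layout: one JSONL file per generation run.
for f in data-python-*.jsonl; do
    python -m magicoder.decontamination.find_substrings \
        --dataset_name "json" \
        --output_file "decontam-$(basename "$f")" \
        --output_dir decontaminated/ \
        --columns problem solution \
        --data_files "$f"
done
```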
## Data preprocessing
Before instruction tuning, let's reformat the data into instruction-response pairs:
```bash
python src/magicoder/preprocess_data.py \
--dataset_path json \
--data_files ${DECONTAM_OUTPUT_PATH} \
--output_file ${PREPROCESS_OUTPUT_PATH} \
--key src-instruct
```

After that, you can combine all the `.jsonl` files into one.
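A minimal way to do that, assuming each preprocessed file already ends with a trailing newline (the file names here are placeholders):

```bash
# Concatenate all preprocessed JSONL files into a single training file.
cat preprocessed-*.jsonl > combined.jsonl
```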
## Instruction tuning
Point the environment variable `CUDA_VISIBLE_DEVICES` at the GPUs you want to use.
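For example (a hypothetical two-GPU selection, not something mandated by the project):

```bash
# Example only: expose GPUs 0 and 1 to the training run.
export CUDA_VISIBLE_DEVICES=0,1
```

Then train the model with the following command to obtain Magicoder: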
```bash
accelerate launch -m magicoder.train \
--model_key $MODEL_KEY \
--use_flash_attention True \
--max_training_seq_length 1216 \
--datafile_paths \
${PATH_TO_OSS_INSTRUCT} \
--output_dir $MAGICODER_OUTPUT_DIR \
--bf16 True \
--num_train_epochs 2 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 128 \
--group_by_length False \
--ddp_find_unused_parameters False \
--logging_steps 1 \
--log_level info \
--optim adafactor \
--max_grad_norm -1 \
--warmup_steps 15 \
--learning_rate 5e-5 \
--lr_scheduler_type linear
```

To get Magicoder-S, continue the training with the following command:
```bash
accelerate launch -m magicoder.train \
--model_key $MODEL_KEY \
--model_name_or_path $MAGICODER_OUTPUT_DIR \
--use_flash_attention True \
--max_training_seq_length 1024 \
--datafile_paths \
${PATH_TO_EVOL_INSTRUCT} \
--output_dir $MAGICODER_S_OUTPUT_DIR \
--bf16 True \
--num_train_epochs 2 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 128 \
--group_by_length False \
--ddp_find_unused_parameters False \
--logging_steps 1 \
--log_level info \
--optim adafactor \
--max_grad_norm -1 \
--warmup_steps 15 \
--learning_rate 5e-5 \
--lr_scheduler_type linear
```
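As a sanity check on these hyperparameters, the effective global batch size is `per_device_train_batch_size × gradient_accumulation_steps × number_of_processes`, assuming the usual `accelerate` setup of one process per visible GPU. The snippet below is only an illustration of that arithmetic:

```bash
# Assumes one accelerate process per GPU listed in CUDA_VISIBLE_DEVICES.
NUM_GPUS=$(awk -F',' '{print NF}' <<< "$CUDA_VISIBLE_DEVICES")
echo $(( 2 * 128 * NUM_GPUS ))  # per_device_train_batch_size * gradient_accumulation_steps * GPUs
```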