{"id":25982410,"url":"https://github.com/DIRECT-BIT/METEOR","last_synced_at":"2025-03-05T09:03:35.046Z","repository":{"id":263992677,"uuid":"892023782","full_name":"DIRECT-BIT/METEOR","owner":"DIRECT-BIT","description":"METEOR: Evolutionary Journey of Large Language Models from Guidance to Self-Growth","archived":false,"fork":false,"pushed_at":"2024-11-21T11:43:49.000Z","size":0,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-11-21T12:28:04.061Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DIRECT-BIT.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-21T11:31:02.000Z","updated_at":"2024-11-21T11:43:53.000Z","dependencies_parsed_at":"2024-11-21T12:40:39.044Z","dependency_job_id":null,"html_url":"https://github.com/DIRECT-BIT/METEOR","commit_stats":null,"previous_names":["direct-bit/meteor"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DIRECT-BIT%2FMETEOR","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DIRECT-BIT%2FMETEOR/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DIRECT-BIT%2FMETEOR/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DIRECT-BIT%2FMETEOR/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DIRECT-BIT","download_url":"https://codeload
.github.com/DIRECT-BIT/METEOR/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241997422,"owners_count":20055117,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-05T09:03:30.191Z","updated_at":"2025-03-05T09:03:35.032Z","avatar_url":"https://github.com/DIRECT-BIT.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"readme":"# METEOR: Evolutionary Journey of Large Language Models from Guidance to Self-Growth\n\u003cp align=\"center\"\u003e \u003cimg src=\"images/meteor.png\" style=\"width: 50%;\" id=\"title-icon\"\u003e       \u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  📄 \u003ca href=\"https://arxiv.org/abs/2411.11933\" target=\"_blank\"\u003ePaper\u003c/a\u003e \u0026nbsp; | \u0026nbsp;\n  💻 \u003ca href=\"./code\" target=\"_blank\"\u003eCode\u003c/a\u003e \u0026nbsp; | \u0026nbsp;\n  🎰 \u003ca href=\"./code/data/raw\" target=\"_blank\"\u003eDatasets\u003c/a\u003e \u0026nbsp;\n\u003c/p\u003e\n\n\n## Contents\n- [METEOR: Evolutionary Journey of Large Language Models from Guidance to Self-Growth](#meteor-evolutionary-journey-of-large-language-models-from-guidance-to-self-growth)\n  - [**Contents**](#contents)\n  - [📝 **Introduction**](#-introduction)\n  - [📈**Main Results**](#main-results)\n  - [🔍**Methodology**](#methodology)\n  - [🔬**Quick Start**](#quick-start)\n  - [📃**License**](#license)\n  - [🌟**Citation**](#citation)\n\n\n## 📝 Introduction\n**METEOR: Evolutionary Journey of Large Language Models from Guidance to 
Self-Growth** is a **weak-to-strong evolution framework** that enables LLMs to progressively evolve from supervised guidance to autonomous enhancement. METEOR introduces a comprehensive **three-stage evolution framework** that guides models from **basic domain knowledge acquisition** through supervised learning to **autonomous capability enhancement** via progressive computational scaling.\n\n## 📈**Main Results**\n| Model | Accuracy | Completeness | Relevance | Coherence | Reliability | GPT-4 Score |\n|-------|----------|--------------|-----------|-----------|-------------|-------------|\n|LLaMA3 w/o METEOR| 21.3% | 27.8% | 25.6%     | 19.9%     | 17.8%       | 5.02        |\n|LLaMA3 w METEOR  | **78.7%** |**72.2%**|**74.4%**|**80.1%**|**82.2%**  | **9.17**    |\n| Qwen2 w/o METEOR| 31.6% | 36.5% | 39.7%     | 36.5%     | 32.9%       | 6.88        |\n| Qwen2 w METEOR  | **68.4%** |**63.5%**|**60.3%**|**63.5%**|**63.5%**  | **9.28**    |\n\nWe compared model performance on each dimension before and after applying METEOR for domain capability evolution. For **accuracy, completeness, relevance, coherence, and reliability**, we generated responses to the test set questions with both the METEOR-trained and the non-METEOR-trained LLMs. As the table shows, after METEOR evolution both [LLaMA3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) improved on all five dimensions. 
Additionally, the GPT-4 Score rose after evolution (from 5.02 to 9.17 for LLaMA3 and from 6.88 to 9.28 for Qwen2), demonstrating the effectiveness of the METEOR method.\n\n  \n## 🔍**Methodology**\n\u003cp align=\"center\"\u003e \u003cimg src=\"images/overall.jpg\" style=\"width: 75%;\" id=\"title-icon\"\u003e   \u003cbr\u003e \u003cem\u003eThe overall method of METEOR.\u003c/em\u003e    \u003c/p\u003e\n\nThe METEOR method consists of three distinct phases: **weak-to-strong data distillation, iterative training, and self-training strategies.** In each phase, the model fully utilizes its existing capabilities to strengthen its domain expertise. Each subsequent phase builds upon the advancements made in the previous one, employing different techniques to further evolve the model. \n\n## 🔬**Quick Start**\n\n#### **Download Data**\nWe use advanced computer education as the domain for validating the proposed METEOR method. To obtain high-quality domain data, we scraped Stack Overflow across four categories: Machine Learning (ML), Deep Learning (DL), Natural Language Processing (NLP), and Computer Vision (CV), totaling 10,276 entries.\n\n| Category     | ML  | DL | NLP | CV | TOTAL  |\n|--------------|-----|----|-----|----|--------|\n|Scale         |4605 |2092|1881 |1698| 10276  |\n\nYou can download the crawled domain data [here](https://github.com/DirectionAI/METEOR/tree/main/code/data/raw).\n\n#### Stage1: Weak-to-strong Domain Data Distillation\n\n**Guideline Generation:** Run the following command to obtain the guidelines for generating domain data with the LLMs:\n\n```bash\nCUDA_VISIBLE_DEVICES=0 python get_guideline.py \\\n    --model_path \"models/llama3-8b-chat\" \\\n    --data_folder \"data/raw_data\" \\\n    --out_folder \"outputs/guidelines\" \\\n    --batch_size 4\n```\n\nThe resulting *guidelines.jsonl* is written to the out_folder.\n\n**Data Distillation:** Combine domain-specific questions with their corresponding guidelines, call the GPT-4 API, 
and distill high-quality domain data.\n\n```bash\npython get_distill_data.py \\\n    --data_folder \"outputs/guidelines\" \\\n    --out_folder \"outputs/distilled_data\" \\\n    --api_key 'your-api-key'\n```\n\n**Domain Fine-tuning:** Fine-tune the model using distilled high-quality domain data to enable it to perform domain-specific tasks effectively.\n```bash\ndeepspeed --include localhost:0,1,2,3 --master_port=29902 finetune.py \\\n    --model_name_or_path \"models/llama3-8b-chat\" \\\n    --train_data_path \"outputs/distilled_data/distilled_data.jsonl\" \\\n    --output_dir \"outputs/fine_tuned_model\" \\\n    --per_device_train_batch_size 1 \\\n    --per_device_eval_batch_size 1 \\\n    --gradient_accumulation_steps 8 \\\n    --learning_rate 3e-5 \\\n    --num_train_epochs 3 \\\n    --fp16 \\\n    --deepspeed \"configs/ds_config_stage2.json\" \\\n    --overwrite_output_dir \\\n    --max_seq_length 512\n```\n\n#### Stage2: Data Refinement and Iterative Training for Domain-Specific Models\n\n**Data Refinement:** Upon receiving domain-specific data, the model employs CoT reasoning to generate answers and reasoning paths. These are evaluated by GPT-4, which provides confirmation if correct or offers suggestions for refinement if incorrect.\n```bash\npython get_iterative_training_dataset.py \\\n    --model_path \"models/fine_tuned_model\" \\\n    --data_path \"outputs/distilled_data/distilled_data.jsonl\" \\\n    --out_folder \"outputs/iterative_data\" \\\n    --api_key 'your-api-key' \\\n    --max_iterations 3 \\\n    --batch_size 4\n```\n\nAn example demonstrating how GPT-4 corrects answers and generates final data:\n\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eCase：\u003c/b\u003e\u003c/summary\u003e\nTitle: Understanding Overfitting in Machine Learning\n\nContent:\n\nCan someone explain what overfitting is in machine learning and how it can be prevented? 
I'm trying to understand why my model performs well on training data but poorly on new, unseen data.\n\n2. Model's Initial Answer\nAnswer with Reasoning:\n\nOverfitting in machine learning occurs when a model learns the training data too well, including its noise and outliers. As a result, it performs exceptionally on the training data but poorly on new, unseen data because it hasn't learned the underlying general patterns.\n\nTo prevent overfitting, you can:\n\nUse a simpler model: Reduce the complexity of the model to avoid capturing noise.\nGather more data: More training data helps the model learn more general features.\nFeature selection: Remove irrelevant features that may introduce noise.\nData augmentation: Artificially expand your dataset by modifying existing data.\nBy implementing these methods, your model should generalize better to new data.\n\n3. GPT-4 Feedback\nEvaluation:\n\nThe explanation of overfitting is generally correct; however, there are some issues:\n\nData Augmentation: While data augmentation is common in fields like image processing, it might not be applicable or effective for all types of data.\nMissing Regularization Techniques: The answer lacks mention of regularization methods such as L1/L2 regularization, dropout, or early stopping, which are standard techniques to prevent overfitting.\nHint for Improvement:\n\nInclude regularization techniques as methods to prevent overfitting.\nClarify when data augmentation is appropriate.\n4. Iterative Refinement\nRevised Answer with Reasoning:\n\nOverfitting occurs when a machine learning model learns the training data, including its noise and specific patterns, too closely. 
This leads to excellent performance on training data but poor generalization to new, unseen data.\n\nTo prevent overfitting, you can:\n\nCross-Validation: Use techniques like k-fold cross-validation to ensure the model performs well on different subsets of data.\nRegularization Techniques:\nL1/L2 Regularization: Add a penalty term to the loss function to discourage complex models.\nDropout: Randomly ignore neurons during training to prevent co-adaptation.\nEarly Stopping: Stop training when performance on a validation set begins to degrade.\nSimplify the Model: Use models with fewer parameters to reduce complexity.\nGather More Data: More data can help the model learn more general patterns.\nData Augmentation (when applicable): For datasets like images, augment data by transformations like flipping or rotation to increase diversity.\nBy applying these methods, your model is more likely to generalize well to new data and perform better on unseen datasets.\n\n5. GPT-4 Final Evaluation\nEvaluation:\n\nCorrect.\n\nThe revised answer accurately explains overfitting and includes comprehensive methods to prevent it. 
The addition of regularization techniques and clarification on data augmentation enhance the quality of the response.\n\u003c/details\u003e\n\n\n**Iterative Training:** Once the accumulated data reaches an integer multiple of $10$, use this batch of data to fine-tune the model.\n\n```bash\npython get_iterative_training_dataset.py \\\n    --model_path \"models/fine_tuned_model\" \\\n    --data_path \"outputs/distilled_data/distilled_data.jsonl\" \\\n    --out_folder \"outputs/iterative_data\" \\\n    --api_key 'your-api-key' \\\n    --max_iterations 3 \\\n    --fine_tune_script \"finetune.py\" \\\n    --fine_tune_args \"--per_device_train_batch_size 1 --gradient_accumulation_steps 8 --learning_rate 3e-5 --num_train_epochs 1 --fp16 --deepspeed configs/ds_config_stage2.json --overwrite_output_dir --max_seq_length 512\" \\\n    --fine_tune_interval 10 \\\n    --gpu_devices \"0,1,2,3\"\n```\nEnable the model to perform self-assessment through iterative training under the guidance of GPT-4.\n\n#### Stage3: Self-training\nOnce the model has acquired the ability for self-evolution, and based on the theory that higher computational complexity (FLOPs) during the reasoning process leads to better performance, we have designed a self-training method based on contrastive learning, enabling the model to autonomously enhance its own capabilities.\n\n```bash\ndeepspeed --include localhost:0 --master_port=29902 self_training.py \\\n    --model_path \"outputs/iterative_model\" \\\n    --data_path \"outputs/iterative_data/iterative_training_data.jsonl\" \\\n    --out_folder \"outputs/self_trained_model\" \\\n    --per_device_train_batch_size 1 \\\n    --gradient_accumulation_steps 8 \\\n    --learning_rate 1e-5 \\\n    --num_train_epochs 1 \\\n    --fp16 \\\n    --deepspeed \"configs/ds_config_stage2.json\" \\\n    --overwrite_output_dir \\\n    --max_seq_length 512\n```\n\n#### **Metrics**\n  We use GPT-4 as a judge to evaluate both data quality and model performance. 
When comparing the quality of data distilled with and without guidelines, GPT-4 scores the data, and higher scores indicate better quality. When evaluating the domain-specific answers generated by the model, GPT-4 scores them on five criteria: accuracy, completeness, relevance, coherence, and reliability, allowing for a comprehensive assessment of the model's domain capabilities.\n\n#### **Evaluation**\nFor accuracy, completeness, relevance, coherence, and reliability, we generated responses to the test set questions with both the METEOR-trained and the non-METEOR-trained LLMs. We use the GPT-4 API to evaluate these dimensions.\n\n```bash\npython evaluate.py \\\n    --model_paths \"models/meteor_trained_model\" \"models/baseline_model\" \\\n    --test_data_path \"data/test_data.jsonl\" \\\n    --output_dir \"outputs/evaluation\" \\\n    --api_key 'your-api-key' \\\n    --batch_size 4 \\\n    --max_eval_samples 100\n```\n\nExample evaluation output:\n\n```json\n{\n    \"question\": \"Explain the concept of overfitting in machine learning.\",\n    \"answers\": {\n        \"meteor_trained_model\": \"Overfitting occurs when a machine learning model learns the training data too well...\",\n        \"baseline_model\": \"Overfitting is when a model performs well on training data but poorly on new data...\"\n    },\n    \"evaluation\": [\n        {\n            \"Answer\": 1,\n            \"Accuracy\": 9,\n            \"Completeness\": 8,\n            \"Relevance\": 10,\n            \"Coherence\": 9,\n            \"Reliability\": 9,\n            \"Justification\": \"The answer accurately explains overfitting and provides relevant details...\"\n        },\n        {\n            \"Answer\": 2,\n            \"Accuracy\": 7,\n            \"Completeness\": 6,\n            \"Relevance\": 8,\n            \"Coherence\": 7,\n            \"Reliability\": 7,\n            \"Justification\": \"The answer gives a basic definition but lacks depth and 
specific examples...\"\n        }\n    ]\n}\n```\n\n## 📃License\n\nThe code in this repo is licensed by [Apache 2.0](./LICENSE.txt), the data on huggingface and this repo are licensed by [CC BY-NC 4.0](./DATA_LICENSE.txt), the model weights on huggingface are licensed by [GNU AGPL 3.0](./MODEL_LICENSE.txt). To use the models in this project for commercial purposes or public deployment, please sign this [document](./images/Agreement.pdf) and complete this [questionnaire](https://zg5p4zcqlu.feishu.cn/share/base/form/shrcndUD6XiEhFfwI36QmdmMjzd) to obtain authorization. We only track the commercial use but charge nothing. The service provider shall be responsible for misleading or injurious statements and adverse effects caused by the use of the models contained in this repo and their modified versions.\n\n\n## 🌟Citation\nIf you find our work useful, please consider citing the following papers:\n\n```\n@misc{li2024meteorevolutionaryjourneylarge,\n      title={METEOR: Evolutionary Journey of Large Language Models from Guidance to Self-Growth}, \n      author={Jiawei Li and Chong Feng and Yang Gao},\n      year={2024},\n      eprint={2411.11933},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      url={https://arxiv.org/abs/2411.11933}, \n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDIRECT-BIT%2FMETEOR","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FDIRECT-BIT%2FMETEOR","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDIRECT-BIT%2FMETEOR/lists"}