Awesome-LLM-Synthetic-Data
A reading list on LLM-based Synthetic Data Generation 🔥
https://github.com/wasiahmad/Awesome-LLM-Synthetic-Data
3. Application Areas
3.2. Code Generation
- **AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct**
- **How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data**
- **SelfCodeAlign: Self-Alignment for Code Generation**
- **CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning**
- **InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback**
- **WizardCoder: Empowering Code Large Language Models with Evol-Instruct**
- **WaveCoder: Widespread And Versatile Enhancement For Code Large Language Models By Instruction Tuning**
- **Magicoder: Empowering Code Generation with OSS-Instruct**
- **InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct**
- **OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement**
- **Language Models Can Teach Themselves to Program Better**
- **Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models**
- **Learning Performance-Improving Code Edits**
3.4. Alignment
- **Constitutional AI: Harmlessness from AI Feedback** - *…Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan.* arXiv 2022.
- **Refined Direct Preference Optimization with Synthetic Data for Behavioral Alignment of LLMs**
- **Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts** - *…Holder, Jakob Foerster, Tim Rocktäschel, Roberta Raileanu.* NeurIPS 2024.
- **Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models**
- **Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision**
- **SALMON: Self-Alignment with Instructable Reward Models**
3.8. Agent and Tool Use
- **Gorilla: Large Language Model Connected with Massive APIs**
- **Toolformer: Language Models Can Teach Themselves to Use Tools** - *…Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom.* NeurIPS 2023.
- **GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction**
- **ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases**
- **Voyager: An Open-Ended Embodied Agent with Large Language Models**
3.5. Reward Modeling
3.6. Long Context
3.7. Weak-to-Strong
3.9. Vision and Language
- **Visual Instruction Tuning**
- **MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models**
- **Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond**
- **G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model**
- **Enhancing Large Vision Language Models with Self-Training on Image Comprehension** - *…Wei Chang, Wei Wang.* arXiv 2024.
- **LLaVA-OneVision: Easy Visual Task Transfer**
3.1. Mathematical Reasoning
- **MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning**
- **MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs**
- **MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models**
- **Augmenting Math Word Problems via Iterative Question Composing** - *…Chih Yao.* DPFM@ICLR 2024.
- **Distilling LLMs' Decomposition Abilities into Compact Language Models**
3.3. Text-to-SQL
3.10. Factuality
2. Methods
2.1. Techniques
- **CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society**
- **Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models** - *…Qing Chen, Li Dong, Wei Lu, Zhifang Sui, Benyou Wang, Wai Lam, Furu Wei.* arXiv 2024.
- **Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling**
- **STaR: Bootstrapping Reasoning With Reasoning**
- **Symbolic Knowledge Distillation: from General Language Models to Commonsense Models**
- **Generating Training Data with Language Models: Towards Zero-Shot Language Understanding**
- **ZeroGen: Efficient Zero-shot Learning via Dataset Generation**
- **Large Language Models Can Self-Improve**
- **Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models**
- **Self-Rewarding Language Models**
- **Self-Instruct: Aligning Language Models with Self-Generated Instructions**
- **TarGEN: Targeted Data Generation with Large Language Models**
- **Automatic Instruction Evolving for Large Language Models** - *…Guang Lou, Weizhu Chen.* arXiv 2024.
- **Scaling Synthetic Data Creation with 1,000,000,000 Personas**
- **Self-playing Adversarial Language Game Enhances LLM Reasoning**
- **Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources** - *…Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli.* arXiv 2024.
- **Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation**
- **Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing**
2.2. Instruction Generation with High Quality/Complexity
- **CodecLM: Aligning Language Models with Tailored Synthetic Data** - *…Liang Li, Vincent Perot, Long T. Le, Jin Miao, Zizhao Zhang, Chen-Yu Lee, Tomas Pfister.* Findings of NAACL 2024.
- **WizardLM: Empowering Large Language Models to Follow Complex Instructions**
4. Datasets
- **Code Alpaca: An Instruction-following LLaMA Model trained on code generation instructions**
- **SynthPAI: A Synthetic Dataset for Personal Attribute Inference**
- **Synthetic-Text-To-SQL: A synthetic dataset for training language models to generate SQL queries from natural language prompts**
- **Open Artificial Knowledge**
Uncategorized
- **Comprehensive Exploration of Synthetic Data Generation: A Survey**
- **Best Practices and Lessons Learned on Synthetic Data for Language Models**
- **On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey**
- **Large Language Models for Data Annotation: A Survey**
- **Generative AI for Synthetic Data Generation: Methods, Challenges and the Future**
5. Tools
- **DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows** - *…Burch.* ACL 2024.
- **AgentInstruct: Toward Generative Teaching with Agentic Flows** - *…ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, Ahmed Awadallah.* arXiv 2024.
- **Distilabel: An AI Feedback (AIF) Framework for Building Datasets with and for LLMs**
- **Fuxion: Synthetic Data Generation and Normalization Functions using LangChain + LLMs**
6. Blogs