Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Awesome-LLM-Synthetic-Data
A reading list on LLM-based Synthetic Data Generation
https://github.com/wasiahmad/Awesome-LLM-Synthetic-Data
-
3. Application Areas
-
3.4. Alignment
- **Refined Direct Preference Optimization with Synthetic Data for Behavioral Alignment of LLMs.**
- **Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models**
- **Constitutional AI: Harmlessness from AI Feedback.** - Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan. arXiv 2022.
- **Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision**
- **SALMON: Self-Alignment with Instructable Reward Models**
-
3.5. Reward Modeling
-
3.6. Long Context
-
3.7. Weak-to-Strong
-
3.8. Agent and Tool Use
- **Toolformer: Language Models Can Teach Themselves to Use Tools** - Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom. NeurIPS 2023.
- **GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction**
- **ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases**
- **Voyager: An Open-Ended Embodied Agent with Large Language Models**
-
3.9. Vision and Language
- **Visual Instruction Tuning**
- **MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models**
- **Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond**
- **G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model**
- **Enhancing Large Vision Language Models with Self-Training on Image Comprehension** - Wei Chang, Wei Wang. arXiv 2024.
- **LLaVA-OneVision: Easy Visual Task Transfer**
-
3.1. Mathematical Reasoning
- **MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning.**
- **MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs.**
- **MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models.**
- **Augmenting Math Word Problems via Iterative Question Composing.** - Chih Yao. DPFM@ICLR 2024.
- **Distilling LLMs' Decomposition Abilities into Compact Language Models**
-
3.2. Code Generation
-
3.3. Text-to-SQL
-
3.10. Factuality
-
-
Uncategorized
- **Comprehensive Exploration of Synthetic Data Generation: A Survey.**
- **Best Practices and Lessons Learned on Synthetic Data for Language Models.**
- **On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey.**
- **Large Language Models for Data Annotation: A Survey**
- **Generative AI for Synthetic Data Generation: Methods, Challenges and the Future.**
-
-
2. Methods
-
2.1. Techniques
- **STaR: Bootstrapping Reasoning With Reasoning**
- **Symbolic Knowledge Distillation: from General Language Models to Commonsense Models**
- **Generating Training Data with Language Models: Towards Zero-Shot Language Understanding**
- **ZeroGen: Efficient Zero-shot Learning via Dataset Generation**
- **Large Language Models Can Self-Improve**
- **Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models.**
- **Self-Rewarding Language Models.**
- **Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models.** - Qing Chen, Li Dong, Wei Lu, Zhifang Sui, Benyou Wang, Wai Lam, Furu Wei. arXiv 2024.
- **Self-instruct: Aligning language models with self-generated instructions.**
- **TarGEN: Targeted Data Generation with Large Language Models**
- **Automatic Instruction Evolving for Large Language Models.** - Guang Lou, Weizhu Chen. arXiv 2024.
- **Scaling Synthetic Data Creation with 1,000,000,000 Personas.**
- **Self-playing Adversarial Language Game Enhances LLM Reasoning**
- **Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources** - Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli. arXiv 2024.
- **Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation**
- **Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing**
-
2.2. Instruction Generation with High Quality/Complexity
- **CodecLM: Aligning Language Models with Tailored Synthetic Data.** - Liang Li, Vincent Perot, Long T. Le, Jin Miao, Zizhao Zhang, Chen-Yu Lee, Tomas Pfister. Findings of NAACL 2024.
- **WizardLM: Empowering Large Language Models to Follow Complex Instructions.**
-
-
4. Datasets
-
5. Tools
-
- **DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows.** - Burch. ACL 2024.
- **AgentInstruct: Toward Generative Teaching with Agentic Flows.** - ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, Ahmed Awadallah. arXiv 2024.
- **Distilabel: An AI Feedback (AIF) Framework for Building Datasets with and for LLMs**
- **Fuxion: Synthetic Data Generation and Normalization Functions using Langchain + LLMs**
-
-
6. Blogs
Sub Categories
- 2.1. Techniques: 16
- 3.10. Factuality: 11
- 3.9. Vision and Language: 6
- 3.1. Mathematical Reasoning: 5
- Uncategorized: 5
- 3.4. Alignment: 5
- 3.8. Agent and Tool Use: 4
- 3.2. Code Generation: 3
- 3.6. Long Context: 2
- 2.2. Instruction Generation with High Quality/Complexity: 2
- 3.5. Reward Modeling: 1
- 3.3. Text-to-SQL: 1
- 3.7. Weak-to-Strong: 1