Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome-LLM-Synthetic-Data

A reading list on LLM-based synthetic data generation 🔥
https://github.com/wasiahmad/Awesome-LLM-Synthetic-Data

Last synced: 3 days ago

3. Application Areas
- 3.2. Code Generation
- 3.4. Alignment
  - **Constitutional AI: Harmlessness from AI Feedback** - Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan. arXiv 2022.
  - **Refined Direct Preference Optimization with Synthetic Data for Behavioral Alignment of LLMs**
  - **Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts** - Holder, Jakob Foerster, Tim Rocktäschel, Roberta Raileanu. NeurIPS 2024.
  - **Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models**
  - **Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision**
  - **SALMON: Self-Alignment with Instructable Reward Models**
- 3.8. Agent and Tool Use
  - **Gorilla: Large Language Model Connected with Massive APIs**
  - **Toolformer: Language Models Can Teach Themselves to Use Tools** - Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom. NeurIPS 2023.
  - **GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction**
  - **ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases**
  - **Voyager: An Open-Ended Embodied Agent with Large Language Models**
- 3.5. Reward Modeling
  - **West-of-N: Synthetic Preference Generation for Improved Reward Modeling**
- 3.6. Long Context
  - **Make Your LLM Fully Utilize the Context** - Guang Lou. arXiv 2024.
  - **From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data**
- 3.7. Weak-to-Strong
  - **Impossible Distillation for Paraphrasing and Summarization: How to Make High-quality Lemonade out of Small, Low-quality Models**
- 3.9. Vision and Language
- 3.1. Mathematical Reasoning
- 3.3. Text-to-SQL
  - **Synthesizing Text-to-SQL Data from Weak and Strong LLMs**
- 3.10. Factuality
  - **Fine-tuning Language Models for Factuality**
  - **MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents**
2. Methods
- 2.1. Techniques
  - **CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society**
  - **Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models** - Qing Chen, Li Dong, Wei Lu, Zhifang Sui, Benyou Wang, Wai Lam, Furu Wei. arXiv 2024.
  - **Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling**
  - **STaR: Bootstrapping Reasoning With Reasoning**
  - **Symbolic Knowledge Distillation: from General Language Models to Commonsense Models**
  - **Generating Training Data with Language Models: Towards Zero-Shot Language Understanding**
  - **ZeroGen: Efficient Zero-shot Learning via Dataset Generation**
  - **Large Language Models Can Self-Improve**
  - **Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models**
  - **Self-Rewarding Language Models**
  - **Self-Instruct: Aligning Language Models with Self-Generated Instructions**
  - **TarGEN: Targeted Data Generation with Large Language Models**
  - **Automatic Instruction Evolving for Large Language Models** - Guang Lou, Weizhu Chen. arXiv 2024.
  - **Scaling Synthetic Data Creation with 1,000,000,000 Personas**
  - **Self-playing Adversarial Language Game Enhances LLM Reasoning**
  - **Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources** - Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli. arXiv 2024.
  - **Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation**
  - **Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing**
- 2.2. Instruction Generation with High Quality/Complexity
  - **CodecLM: Aligning Language Models with Tailored Synthetic Data** - Liang Li, Vincent Perot, Long T. Le, Jin Miao, Zizhao Zhang, Chen-Yu Lee, Tomas Pfister. Findings of NAACL 2024.
  - **WizardLM: Empowering Large Language Models to Follow Complex Instructions**
4. Datasets
Uncategorized
- Uncategorized
5. Tools
  - **DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows** - Burch. ACL 2024.
  - **AgentInstruct: Toward Generative Teaching with Agentic Flows** - ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, Ahmed Awadallah. arXiv 2024.
  - **Distilabel: An AI Feedback (AIF) Framework for Building Datasets with and for LLMs**
  - **Fuxion: Synthetic Data Generation and Normalization Functions using Langchain + LLMs**
6. Blogs

Categories

- 3. Application Areas: 42
- 2. Methods: 20
- Uncategorized: 5
- 5. Tools: 4
- 4. Datasets: 4
- 6. Blogs: 3

Sub Categories

- 2.1. Techniques: 18
- 3.10. Factuality: 13
- 3.2. Code Generation: 13
- 3.9. Vision and Language: 6
- 3.4. Alignment: 6
- 3.8. Agent and Tool Use: 5
- 3.1. Mathematical Reasoning: 5
- Uncategorized: 5
- 3.6. Long Context: 2
- 2.2. Instruction Generation with High Quality/Complexity: 2
- 3.5. Reward Modeling: 1
- 3.3. Text-to-SQL: 1
- 3.7. Weak-to-Strong: 1

Keywords

- synthetic-dataset-generation: 1
- synthetic-data: 1
- rlhf: 1
- rlaif: 1
- python: 1
- openai: 1
- llms: 1
- huggingface: 1
- ai: 1