Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/Bolin97/awesome-instruction-selector
Paper list and datasets for the paper: A Survey on Data Selection for LLM Instruction Tuning
List: awesome-instruction-selector
- Host: GitHub
- URL: https://github.com/Bolin97/awesome-instruction-selector
- Owner: Bolin97
- Created: 2024-01-27T11:39:05.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-02-03T10:09:14.000Z (11 months ago)
- Last Synced: 2024-02-03T11:25:24.373Z (11 months ago)
- Size: 5.86 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome_ai_agents - Awesome-Instruction-Selector - Paper list and datasets for the paper - A Survey on Data Selection for LLM Instruction Tuning (Building / Datasets)
README
# awesome-instruction-selector
Paper list and datasets for the paper: **A Survey on Data Selection for LLM Instruction Tuning**
*Last updated: February 2, 2024 (UTC).*
Each entry is tagged `publisher year`, followed by links: 📄[PDF](), 🔗[Codes]() or [Data](), 💡[Report]() or [Blog]().
## Useful instruction sets
Self-Instruct: Aligning Language Models with Self-Generated Instructions. `ACL 2023` 📄[PDF](https://aclanthology.org/2023.acl-long.754.pdf), 🔗[Data](https://github.com/yizhongw/self-instruct)
Alpaca: A Strong, Replicable Instruction-Following Model. `Report 2023` 💡[Blog](https://crfm.stanford.edu/2023/03/13/alpaca.html), 🔗[Data](https://github.com/tatsu-lab/stanford_alpaca#data-release)
WizardLM: Empowering Large Language Models to Follow Complex Instructions. `arXiv 2023` 📄[PDF](https://arxiv.org/pdf/2304.12244.pdf), 🔗[Data](https://github.com/nlpxucan/WizardLM/tree/main/WizardLM)
LIMA: Less Is More for Alignment. `arXiv 2023` 📄[PDF](https://arxiv.org/pdf/2305.11206.pdf), 🔗[Data]()
Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM. `Report 2023` 💡[Blog](https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm), 🔗[Data](https://huggingface.co/datasets/databricks/databricks-dolly-15k)
Multitask Prompted Training Enables Zero-Shot Task Generalization. `ICLR 2022` 📄[PDF](https://openreview.net/pdf?id=9Vrb9D0WI4), 🔗[Data](https://github.com/bigscience-workshop/promptsource)
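For quick experimentation, several of these sets are mirrored on the Hugging Face Hub. A minimal loading sketch (the hub IDs and field names below are assumptions based on the public dataset cards; check each release page above for the canonical source):

```python
# Load two of the instruction sets above with the Hugging Face `datasets`
# library. Requires: pip install datasets
from datasets import load_dataset

# databricks-dolly-15k: ~15k human-written instruction/response pairs.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
print(dolly[0]["instruction"])
print(dolly[0]["response"])

# Alpaca: 52k instruction-following examples from the self-instruct pipeline
# (hub mirror of the GitHub release linked above).
alpaca = load_dataset("tatsu-lab/alpaca", split="train")
print(alpaca[0]["instruction"])
print(alpaca[0]["output"])
```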
## Instruction Selection Methods
### System of indicators as data selector
Instruction Mining: High-Quality Instruction Data Selection for Large Language Models. `arXiv 2023` 📄[PDF](https://arxiv.org/pdf/2307.06290.pdf)
InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4. `arXiv 2023` 📄[PDF](https://arxiv.org/pdf/2308.12067.pdf), 🔗[Codes](https://github.com/waltonfuture/InstructionGPT-4)
Dataset Quantization. `ICCV 2023` 📄[PDF](https://arxiv.org/pdf/2308.10524.pdf), 🔗[Codes](https://github.com/magic-research/Dataset_Quantization)
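A minimal, illustrative sketch of the indicator-based idea behind papers such as Instruction Mining: score every candidate with a few cheap quality indicators, combine them linearly, and keep the top-k. The indicators and weights here are placeholder assumptions, not any paper's fitted values:

```python
# Indicator-based data selection sketch: rank examples by a weighted sum of
# simple quality indicators and keep the k best.
from dataclasses import dataclass

@dataclass
class Example:
    instruction: str
    response: str

def indicators(ex: Example) -> dict:
    words = ex.response.split()
    return {
        "resp_length": len(words),                                  # richer answers tend to be longer
        "lexical_diversity": len(set(words)) / max(len(words), 1),  # penalize repetition
        "instr_length": len(ex.instruction.split()),                # proxy for task specificity
    }

def score(ex: Example, weights: dict) -> float:
    vals = indicators(ex)
    return sum(w * vals[name] for name, w in weights.items())

def select_top_k(pool: list, weights: dict, k: int) -> list:
    return sorted(pool, key=lambda ex: score(ex, weights), reverse=True)[:k]

# Illustrative weights; real indicator systems fit them against downstream loss.
weights = {"resp_length": 0.01, "lexical_diversity": 1.0, "instr_length": 0.005}
```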
### Trainable LLMs as data selector
From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning. `arXiv 2023` 📄[PDF](https://arxiv.org/pdf/2308.12032.pdf), 🔗[Codes](https://github.com/MingLiiii/Cherry_LLM)
Self-Alignment with Instruction Backtranslation. `arXiv 2023` 📄[PDF](https://arxiv.org/pdf/2308.06259.pdf)
One Shot Learning as Instruction Data Prospector for Large Language Models. `arXiv 2023` 📄[PDF](https://arxiv.org/pdf/2312.10302.pdf), 🔗[Codes](https://github.com/pldlgb/nuggets)
Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning. `arXiv 2023` 📄[PDF](https://arxiv.org/pdf/2311.08182.pdf), 🔗[Codes](https://github.com/OFA-Sys/DiverseEvol)
TeGit: Generating High-Quality Instruction-Tuning Data with Text-Grounded Task Design. `ICLR 2024` 📄[PDF](https://arxiv.org/pdf/2309.05447.pdf)
Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks. `EMNLP 2023` 📄[PDF](https://aclanthology.org/2023.emnlp-main.112.pdf), 🔗[Codes](https://github.com/PlusLabNLP/Active-IT)
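One concrete mechanism from this group is the Instruction-Following Difficulty (IFD) score of the Cherry LLM paper above: the ratio of the model's perplexity on the response with and without the instruction as context. A sketch with Hugging Face `transformers` (`gpt2` is a small stand-in; the paper scores with the model being tuned, and the prompt template here is an assumption):

```python
# IFD sketch: PPL(response | instruction) / PPL(response).
# Requires: pip install torch transformers
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in scoring model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def response_ppl(prompt: str, response: str) -> float:
    # Perplexity over response tokens only; prompt tokens are masked out.
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + response, return_tensors="pt").input_ids
    labels = ids.clone()
    labels[:, :prompt_len] = -100  # -100 = ignored by the LM loss
    return math.exp(model(ids, labels=labels).loss.item())

def ifd(instruction: str, response: str) -> float:
    # Near or above 1.0: the instruction barely lowers the response
    # perplexity, marking a difficult (and informative) example.
    return response_ppl(instruction + "\n", response) / response_ppl("", response)
```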
### Powerful LLMs like ChatGPT as data selector
AlpaGasus: Training A Better Alpaca with Fewer Data. `arXiv 2023` 📄[PDF](https://arxiv.org/pdf/2307.08701.pdf), 💡[Blog](https://lichang-chen.github.io/AlpaGasus/)
#InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models. `arXiv 2023` 📄[PDF](https://arxiv.org/pdf/2308.07074.pdf), 🔗[Codes](https://github.com/OFA-Sys/InsTag)
Rethinking the Instruction Quality: LIFT is What You Need. `arXiv 2023` 📄[PDF](https://arxiv.org/pdf/2312.11508.pdf)
What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning. `arXiv 2023` 📄[PDF](https://arxiv.org/pdf/2312.15685.pdf), 🔗[Codes](https://github.com/hkust-nlp/deita)
A Preliminary Study of the Intrinsic Relationship between Complexity and Alignment. `arXiv 2023` 📄[PDF](https://arxiv.org/pdf/2308.05696.pdf)
WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation. `arXiv 2023` 📄[PDF](https://arxiv.org/pdf/2312.14187.pdf)
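A hypothetical sketch of the LLM-as-grader recipe behind AlpaGasus and its follow-ups: ask a strong chat model to rate each instruction-response pair and keep only high-scoring ones. The grader model, prompt wording, and threshold below are illustrative assumptions, not any paper's exact setup:

```python
# LLM-as-grader filtering sketch. Requires: pip install openai
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rate(instruction: str, response: str) -> float:
    prompt = (
        "Rate the quality of this instruction-response pair from 1 to 5.\n"
        f"Instruction: {instruction}\nResponse: {response}\n"
        "Reply with only the number."
    )
    out = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder grader
        messages=[{"role": "user", "content": prompt}],
    )
    match = re.search(r"[1-5](?:\.\d+)?", out.choices[0].message.content)
    return float(match.group()) if match else 0.0

def keep_high_quality(pool: list, threshold: float = 4.5) -> list:
    return [ex for ex in pool if rate(ex["instruction"], ex["response"]) >= threshold]
```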
### Small Reward Models as data selector
MoDS: Model-oriented Data Selection for Instruction Tuning. `arXiv 2023` 📄[PDF](https://arxiv.org/pdf/2311.15653.pdf), 🔗[Codes](https://github.com/CASIA-LM/MoDS)
Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning. `arXiv 2023` 📄[PDF](https://arxiv.org/pdf/2305.09246.pdf), 🔗[Codes](https://github.com/CASIA-LM/MoDS)
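A sketch of the reward-model route in the spirit of MoDS: score every pair with a small off-the-shelf reward model and keep examples above a cutoff. The OpenAssistant checkpoint and the cutoff below are assumptions for illustration:

```python
# Reward-model filtering sketch. Requires: pip install torch transformers
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "OpenAssistant/reward-model-deberta-v3-large-v2"  # assumed public RM
tok = AutoTokenizer.from_pretrained(name)
rm = AutoModelForSequenceClassification.from_pretrained(name).eval()

@torch.no_grad()
def reward(instruction: str, response: str) -> float:
    # This reward model scores (question, answer) pairs jointly.
    inputs = tok(instruction, response, return_tensors="pt", truncation=True)
    return rm(**inputs).logits[0].item()

def filter_by_reward(pool: list, cutoff: float = 0.0) -> list:
    return [ex for ex in pool if reward(ex["instruction"], ex["response"]) >= cutoff]
```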
## Evaluation Results
![image](https://github.com/Bolin97/awesome-instruction-selector/assets/157888182/8abe93d7-2530-46af-8656-43c0d6ab3d40)
![image](https://github.com/Bolin97/awesome-instruction-selector/assets/157888182/a7d34716-20eb-4c52-844c-944d5de44324)
![image](https://github.com/Bolin97/awesome-instruction-selector/assets/157888182/a6e341b9-c87c-440f-92ea-d8c40107a0d6)