Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/koalazf99/awesome-datacentric-llm
Trending projects & awesome papers about data-centric llm studies.
List: awesome-datacentric-llm
data-centric-ai evaluation llm pre-training
- Host: GitHub
- URL: https://github.com/koalazf99/awesome-datacentric-llm
- Owner: koalazf99
- Created: 2024-06-19T18:27:19.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-12-08T20:47:53.000Z (13 days ago)
- Last Synced: 2024-12-19T21:02:12.962Z (2 days ago)
- Topics: data-centric-ai, evaluation, llm, pre-training
- Homepage:
- Size: 13.7 KB
- Stars: 31
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- ultimate-awesome - awesome-datacentric-llm - Trending projects & awesome papers about data-centric llm studies. (Other Lists / Monkey C Lists)
README
# Awesome-DataCentric-LLM
![](https://img.shields.io/github/last-commit/koalazf99/Awesome-DataCentric-LLM?color=green)
![](https://img.shields.io/badge/PRs-Welcome-red)
[![Awesome](https://awesome.re/badge.svg)](https://awesome.re)

Trending projects & awesome papers about data-centric LLM studies, including large-scale data curation, data quality assessment, evaluation, toolkits, etc.
## Papers
1. **ClueWeb22: 10 Billion Web Documents with Visual and Semantic Information**
_Arnold Overwijk, Chenyan Xiong, Xiao Liu, Cameron VandenBerg, Jamie Callan_ [[abs](https://arxiv.org/abs/2211.15848)] [Nov 2022]
1. **Scaling Data-Constrained Language Models**
_Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, Colin Raffel_ [[pdf](https://arxiv.org/abs/2305.16264)] [[code](https://github.com/huggingface/datablations)] [May 2023] ![stars](https://img.shields.io/github/stars/huggingface/datablations)
1. **A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity**
_Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, Barret Zoph, Denny Zhou, Jason Wei, Kevin Robinson, David Mimno, Daphne Ippolito_ [[pdf](https://arxiv.org/abs/2305.13169)] [May 2023]
1. **The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only**
_Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay_ [[pdf](https://arxiv.org/abs/2306.01116)] [Jun 2023]
1. **Textbooks Are All You Need**
_Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, Yuanzhi Li_ [[pdf](https://arxiv.org/abs/2306.11644)] [Jun 2023]
1. **Textbooks Are All You Need II: phi-1.5 technical report.**
_Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, Yin Tat Lee_ [[pdf](https://arxiv.org/abs/2309.05463)] [Sep 2023]
1. **What's In My Big Data?**
_Yanai Elazar, Akshita Bhagia, Ian Magnusson, Abhilasha Ravichander, Dustin Schwenk, Alane Suhr, Pete Walsh, Dirk Groeneveld, Luca Soldaini, Sameer Singh, Hanna Hajishirzi, Noah A. Smith, Jesse Dodge_ [[pdf](https://arxiv.org/abs/2310.20707)] [[code](https://github.com/allenai/wimbd)] [Oct 2023] ![stars](https://img.shields.io/github/stars/allenai/wimbd)
1. **SlimPajama-DC: Understanding Data Combinations for LLM Training**
_Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Zhengzhong Liu, Hongyi Wang, Bowen Tan, Joel Hestness, Natalia Vassilieva, Daria Soboleva, Eric Xing_ [[pdf](https://arxiv.org/abs/2309.10818)] [Sep 2023]
1. **DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining**
_Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V. Le, Tengyu Ma, Adams Wei Yu_ [[pdf](https://arxiv.org/abs/2305.10429)] [[code](https://github.com/sangmichaelxie/doremi)] [Nov 2023] ![stars](https://img.shields.io/github/stars/sangmichaelxie/doremi)
1. **Rephrasing the Web: A Recipe for Compute & Data-Efficient Language Modeling**
_Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly_ [[pdf](https://arxiv.org/abs/2401.16380)] [Jan 2024]
1. **QuRating: Selecting High-Quality Data for Training Language Models**
_Alexander Wettig, Aatmik Gupta, Saumya Malik, Danqi Chen_ [[pdf](https://arxiv.org/abs/2402.09739)] [[code](https://github.com/princeton-nlp/QuRating)] [Feb 2024] ![stars](https://img.shields.io/github/stars/princeton-nlp/QuRating)
1. **WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset**
_Jiantao Qiu, Haijun Lv, Zhenjiang Jin, Rui Wang, Wenchang Ning, Jia Yu, ChaoBin Zhang, Zhenxiang Li, Pei Chu, Yuan Qu, Jin Shi, Lindong Lu, Runyu Peng, Zhiyuan Zeng, Huanze Tang, Zhikai Lei, Jiawei Hong, Keyu Chen, Zhaoye Fei, Ruiliang Xu, Wei Li, Zhongying Tu, Lin Dahua, Yu Qiao, Hang Yan, Conghui He_ [[pdf](https://arxiv.org/abs/2402.19282)] [Feb 2024]
1. **Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research**
_Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, Kyle Lo_ (AI2) [[pdf](https://arxiv.org/abs/2402.00159)] [[code](https://github.com/allenai/dolma)] [Feb 2024] ![stars](https://img.shields.io/github/stars/allenai/dolma)
1. **Instruction-tuned Language Models are Better Knowledge Learners**
_Zhengbao Jiang, Zhiqing Sun, Weijia Shi, Pedro Rodriguez, Chunting Zhou, Graham Neubig, Xi Victoria Lin, Wen-tau Yih, Srinivasan Iyer_ [[pdf](https://arxiv.org/abs/2402.12847)] [Feb 2024]
1. **Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models**
_Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang, Yuxian Gu, Xin Cheng, Xun Wang, Si-Qing Chen, Li Dong, Wei Lu, Zhifang Sui, Benyou Wang, Wai Lam, Furu Wei_ [[pdf](https://arxiv.org/pdf/2402.13064)] [Feb 2024]
1. **How to Train Data-Efficient LLMs**
_Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H. Chi, James Caverlee, Julian McAuley, Derek Zhiyuan Cheng_ [[pdf](https://arxiv.org/abs/2402.09668)] [Feb 2024]
1. **Language models scale reliably with over-training and on downstream tasks**
_Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Luca Soldaini, Alexandros G. Dimakis, Gabriel Ilharco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt_ [[pdf](https://arxiv.org/pdf/2403.08540)] [Mar 2024]
1. **Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic**
_Sachin Goyal, Pratyush Maini, Zachary C. Lipton, Aditi Raghunathan, J. Zico Kolter_ [[pdf](https://arxiv.org/abs/2404.07177)]
1. **Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models**
_Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max Marion, Matthew L. Leavitt, Mansheej Paul_ [[pdf](https://arxiv.org/abs/2405.20541)] [May 2024]
1. **MAP-NEO: A fully open-sourced Large Language Model**
_Ge Zhang, Scott Qu, Jiaheng Liu, Chenchen Zhang, Chenghua Lin, Chou Leuang Yu, Danny Pan, Esther Cheng, Jie Liu, Qunshu Lin, Raven Yuan, Tuney Zheng, Wei Pang, Xinrun Du, Yiming Liang, Yinghao Ma, Yizhi Li, Ziyang Ma, Bill Lin, Emmanouil Benetos, Huan Yang, Junting Zhou, Kaijing Ma, Minghao Liu, Morry Niu, Noah Wang, Quehry Que, Ruibo Liu, Sine Liu, Shawn Guo, Soren Gao, Wangchunshu Zhou, Xinyue Zhang, Yizhi Zhou, Yubo Wang, Yuelin Bai, Yuhan Zhang, Yuxiang Zhang, Zenith Wang, Zhenzhu Yang, Zijian Zhao, Jiajun Zhang, Wanli Ouyang, Wenhao Huang, Wenhu Chen_ [[pdf](https://arxiv.org/abs/2405.19327)] [[code](https://github.com/multimodal-art-projection/MAP-NEO)] [May 2024] ![stars](https://img.shields.io/github/stars/multimodal-art-projection/MAP-NEO)
1. **MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models**
_Zichun Yu, Spandan Das, Chenyan Xiong_ [[pdf](https://arxiv.org/abs/2406.06046)] [[code](https://github.com/cxcscmu/MATES)] [Jun 2024] ![stars](https://img.shields.io/github/stars/cxcscmu/MATES)
1. **Does your data spark joy? Performance gains from domain upsampling at the end of training**
_Cody Blakeney, Mansheej Paul, Brett W. Larsen, Sean Owen, Jonathan Frankle_ [[pdf](https://arxiv.org/pdf/2406.03476)] [Jun 2024]
1. **DataComp-LM: In search of the next generation of training sets for language models.**
_Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardner, Maciej Kilian, Hanlin Zhang, Rulin Shao, Sarah Pratt, Sunny Sanyal, Gabriel Ilharco, Giannis Daras, Kalyani Marathe, Aaron Gokaslan, Jieyu Zhang, Khyathi Chandu, Thao Nguyen, Igor Vasiljevic, Sham Kakade, Shuran Song, Sujay Sanghavi, Fartash Faghri, Sewoong Oh, Luke Zettlemoyer, Kyle Lo, Alaaeldin El-Nouby, Hadi Pouransari, Alexander Toshev, Stephanie Wang, Dirk Groeneveld, Luca Soldaini, Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alexandros G. Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, Vaishaal Shankar_ [[pdf](https://arxiv.org/abs/2406.11794v1)] [[code](https://github.com/mlfoundations/dclm)] [Jun 2024] ![stars](https://img.shields.io/github/stars/mlfoundations/dclm)
1. **Instruction Pre-Training: Language Models are Supervised Multitask Learners**
_Daixuan Cheng, Yuxian Gu, Shaohan Huang, Junyu Bi, Minlie Huang, Furu Wei_ [[pdf](https://arxiv.org/abs/2406.14491)] [Jun 2024]
1. **Scaling Synthetic Data Creation with 1,000,000,000 Personas**
_Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, Dong Yu_ [[pdf](https://arxiv.org/pdf/2406.20094v1)] [Jun 2024]
1. **Resolving Discrepancies in Compute-Optimal Scaling of Language Models**
_Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, Yair Carmon_ [[pdf](https://arxiv.org/abs/2406.19146)] [Jun 2024]
1. **MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens**
_Anas Awadalla, Le Xue, Oscar Lo, Manli Shu, Hannah Lee, Etash Kumar Guha, Matt Jordan, Sheng Shen, Mohamed Awadalla, Silvio Savarese, Caiming Xiong, Ran Xu, Yejin Choi, Ludwig Schmidt_ [[pdf](https://arxiv.org/abs/2406.11271)] [Jun 2024]
1. **RegMix: Data Mixture as Regression for Language Model Pre-training**
_Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, Min Lin_ [[pdf](https://arxiv.org/abs/2407.01492)] [[code](https://github.com/sail-sg/regmix)] [Jul 2024] ![stars](https://img.shields.io/github/stars/sail-sg/regmix)
1. **To Code, or Not To Code? Exploring Impact of Code in Pre-training**
_Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker_ [[pdf](https://arxiv.org/abs/2408.10914)] [Aug 2024]
1. **Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale**
_Fan Zhou, Zengzhi Wang, Qian Liu, Junlong Li, Pengfei Liu_ [[pdf](https://arxiv.org/pdf/2409.17115)] [[code](https://github.com/GAIR-NLP/ProX)] [[data](https://huggingface.co/collections/gair-prox/prox-dataset-66e81c9d560911b836bb3704)] [Sep 2024] ![stars](https://img.shields.io/github/stars/GAIR-NLP/ProX)
1. **ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment**
_Elyas Obbad, Iddah Mlauzi, Brando Miranda, Rylan Schaeffer, Kamal Obbad, Suhana Bedi, Sanmi Koyejo_ [[pdf](https://arxiv.org/abs/2410.18194)] [Oct 2024]
## Projects & Blogs
1. **Language Model Evaluation Harness**
_Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, Andy Zou_ [[code](https://github.com/EleutherAI/lm-evaluation-harness)] [[report](https://arxiv.org/pdf/2405.14782)] [2023] ![stars](https://img.shields.io/github/stars/EleutherAI/lm-evaluation-harness)
1. **Cosmopedia**
_Loubna Ben Allal, Anton Lozhkov, Guilherme Penedo, Thomas Wolf, Leandro von Werra_ (HuggingFaceTB) [[code](https://github.com/huggingface/cosmopedia)] [[datasets](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia)] [Feb 2024] ![stars](https://img.shields.io/github/stars/huggingface/cosmopedia)
1. **DataTrove: large scale data processing**
_Guilherme Penedo, Hynek Kydlíček, Alessandro Cappelli, Mario Sasko, Thomas Wolf_ [[code](https://github.com/huggingface/datatrove)] [Feb 2024] ![stars](https://img.shields.io/github/stars/huggingface/datatrove)
1. **SailCraft: Data Toolkit for Sailor Language Models**
_Longxu Dou, Qian Liu, Guangtao Zeng, Jia Guo, Jiahui Zhou, Wei Lu, Min Lin_ [[code](https://github.com/sail-sg/sailcraft)] [Apr 2024] ![stars](https://img.shields.io/github/stars/sail-sg/sailcraft)
1. **🍷 FineWeb: decanting the web for the finest text data at scale**
_Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Colin Raffel, Leandro von Werra, Thomas Wolf_ (HuggingFaceFW) [[datasets](https://huggingface.co/collections/HuggingFaceFW/fineweb-datasets-662458592d61edba3d2f245d)] [[report](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1)] [[pdf](https://arxiv.org/abs/2406.17557)] [May 2024]
1. **Hugging Face Ethics and Society Newsletter 6: Building Better AI: The Importance of Data Quality**
_Avijit Ghosh and Lucie-Aimée Kaffee_ (Huggingface) [[blog](https://huggingface.co/blog/ethics-soc-6)] [Jun 2024]
1. **TxT360: A Top-Quality LLM Pre-training Dataset Requires the Perfect Blend**
_Liping Tang, Nikhil Ranjan, Omkar Pangarkar, Xuezhi Liang, Zhen Wang, Li An, Bhaskar Rao, Zhoujun Cheng, Suqi Sun, Cun Mu, Victor Miller, Yue Peng, Eric P. Xing, Zhengzhong Liu_ (LLM360) [[datasets](https://huggingface.co/datasets/LLM360/TxT360)] [[blog](https://huggingface.co/spaces/LLM360/TxT360)]
1. **Scaling FineWeb to 1000+ languages, Step 1: Finding Signals in 100s of Evaluation Tasks**
_Hynek Kydlíček, Guilherme Penedo, Clémentine Fourrier, Nathan Habib, Thomas Wolf_ (HuggingFaceFW) [[blog](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fine-tasks)] [Oct 2024]
1. **FineWeb-2: A 1000+ Language Dataset for Multilingual Language Models**
_Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Martin Jaggi, Leandro von Werra, Thomas Wolf_ (HuggingFaceFW) [[code](https://github.com/huggingface/fineweb-2)] [[datasets](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2)] [Oct 2024] ![stars](https://img.shields.io/github/stars/huggingface/fineweb-2)
## Tutorials
1. **CSE599J: Data-centric Machine Learning**
_Pang Wei Koh_ [[website](https://koh.pw/cse599j/)] [2023]