awesome-data-llm
Official Repository of "LLM × DATA" Survey Paper
https://github.com/weAIDB/awesome-data-llm
-
1 Data Processing for LLM
-
1.5 Data Mixing
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
-
1.1 Data Acquisition
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Source
- [Source
- [Source
- [Source
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Source
- [Github
- [Source
- [Source
- [Paper
- [Paper
- [Paper
- [Github
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
-
1.2 Data Deduplication
-
1.3 Data Filtering
-
1.4 Data Selection
-
1.6 Data Distillation and Synthesis
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Source
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Source
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
-
1.7 End-to-End Data Processing Pipelines
-
-
🌤 The IaaS Concept of DATA4LLM
-
2 Data Storage for LLM
-
2.1 Data Formats
-
2.3 Data Organization
-
2.5 Data Fault Tolerance
-
2.6 KV Cache
-
2.2 Data Distribution
-
2.4 Data Movement
-
-
3 Data Serving for LLM
-
3.1 Data Shuffling
-
3.2 Data Compression
-
3.3 Data Packing
-
3.4 Data Provenance
-
-
4 LLM for Data Management
-
4.1 LLM for Data Manipulation
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- IEEE Data Eng. Bull. 49(1) - 31 (2025). [[Paper](http://sites.computer.org/debull/A25mar/p3.pdf)]
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
-
4.2 LLM for Data Analysis
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Source
- [Source
- [Github
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
-
4.2 LLM for Data System Optimization
-
4.3 LLM for Data System Optimization
-
-
5 LLM as Data Analyst
-
5.1 LLM for Structured Data Analysis
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- cs.LG
- cs.LG
- cs.CL
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- cs.CL
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- cs.LG
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
-
5.2 LLM for Semi-Structured Data Analysis
-
5.3 LLM for Unstructured Data Analysis
- [Paper
- [Paper
- [Paper
- cs.CL
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- cs.CV
- cs.CV
- cs.CL
- [Paper
- cs.AI
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- cs.CV
- cs.CV
- cs.CL
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- cs.CV
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
-
5.4 LLM for Heterogeneous Data Analysis
-
-
🌟 LLM/Agent-as-Data-Analyst
-
Datasets
-
0 Data Characteristics across LLM Stages
-
Data for Pretraining
-
Data for Continual Pre-training
-
Data for Supervised Fine-Tuning (SFT)
-
Data for LLM Agents
-
Data for Reinforcement Learning (RL)
-
Data for Retrieval-Augmented Generation (RAG)
-
Data for LLM Evaluation
-
Categories
Sub Categories
5.3 LLM for Unstructured Data Analysis: 106
4.2 LLM for Data Analysis: 82
5.1 LLM for Structured Data Analysis: 70
1.6 Data Distillation and Synthesis: 69
4.1 LLM for Data Manipulation: 66
1.1 Data Acquisition: 56
1.5 Data Mixing: 51
1.3 Data Filtering: 41
4.3 LLM for Data System Optimization: 28
2.3 Data Organization: 25
1.2 Data Deduplication: 25
3.1 Data Shuffling: 24
5.2 LLM for Semi-Structured Data Analysis: 15
1.4 Data Selection: 13
2.4 Data Movement: 12
1.7 End-to-End Data Processing Pipelines: 11
2.6 KV Cache: 11
3.2 Data Compression: 10
4.2 LLM for Data System Optimization: 10
2.1 Data Formats: 9
2.2 Data Distribution: 9
5.4 LLM for Heterogeneous Data Analysis: 8
3.3 Data Packing: 8
2.5 Data Fault Tolerance: 7
3.4 Data Provenance: 6
Data for LLM Agents: 5
Data for Retrieval-Augmented Generation (RAG): 5
Data for LLM Evaluation: 3
Data for Reinforcement Learning (RL): 3
Data for Continual Pre-training: 1
Data for Supervised Fine-Tuning (SFT): 1
Data for Pretraining: 1
Keywords
ocrlite: 1
ocr: 1
db: 1
crnn: 1
chineseocr: 1
medicalgpt: 1
medical: 1
llm: 1
llama: 1
gpt: 1
dpo: 1
chatgpt: 1
webdriver: 1
selenium: 1
rust: 1
ruby: 1
python: 1
javascript: 1
java: 1
dotnet: 1
storage: 1
s3: 1
redis: 1
posix: 1
object-storage: 1
hdfs: 1
golang: 1
go: 1
filesystem: 1
distributed-systems: 1
cloud-native: 1
bigdata: 1
nosql: 1
neo4j: 1
graphdb: 1
graph-database: 1
graph: 1
database: 1
cypher: 1