awesome-data-llm
Official Repository of "LLM × DATA" Survey Paper
https://github.com/weAIDB/awesome-data-llm
-
2 Data Storage for LLM
-
2.3 Data Organization
-
2.6 KV Cache
-
2.1 Data Formats
-
2.2 Data Distribution
-
2.4 Data Movement
-
2.5 Data Fault Tolerance
-
-
4 LLM for Data Management
-
4.2 LLM for Data System Optimization
-
4.1 LLM for Data Manipulation
- IEEE Data Eng. Bull. 49(1) - 31 (2025). [[Paper](http://sites.computer.org/debull/A25mar/p3.pdf)]
-
4.2 LLM for Data Analysis
-
4.3 LLM for Data System Optimization
-
-
5 LLM as Data Analyst
-
5.1 LLM for Structured Data Analysis
-
5.2 LLM for Semi-Structured Data Analysis
-
5.3 LLM for Unstructured Data Analysis
-
5.4 LLM for Heterogeneous Data Analysis
-
-
1 Data Processing for LLM
-
1.5 Data Mixing
-
1.1 Data Acquisition
-
1.3 Data Filtering
-
1.6 Data Distillation and Synthesis
-
1.7 End-to-End Data Processing Pipelines
-
1.2 Data Deduplication
-
1.4 Data Selection
-
-
🌟 LLM/Agent-as-Data-Analyst
-
🌤 The IaaS Concept of DATA4LLM
-
Datasets
-
0 Data Characteristics across LLM Stages
-
Data for Pretraining
-
Data for Continual Pre-training
-
Data for Supervised Fine-Tuning (SFT)
-
Data for LLM Agents
-
Data for Reinforcement Learning (RL)
-
Data for Retrieval-Augmented Generation (RAG)
-
Data for LLM Evaluation
-
-
3 Data Serving for LLM
-
3.1 Data Shuffling
-
3.2 Data Compression
-
3.3 Data Packing
-
3.4 Data Provenance
-
Categories
Sub Categories
- 5.3 LLM for Unstructured Data Analysis: 89
- 4.2 LLM for Data Analysis: 82
- 5.1 LLM for Structured Data Analysis: 74
- 1.6 Data Distillation and Synthesis: 69
- 4.1 LLM for Data Manipulation: 66
- 1.1 Data Acquisition: 57
- 1.5 Data Mixing: 52
- 1.3 Data Filtering: 41
- 4.3 LLM for Data System Optimization: 28
- 2.3 Data Organization: 26
- 1.2 Data Deduplication: 25
- 3.1 Data Shuffling: 24
- 1.4 Data Selection: 13
- 2.4 Data Movement: 12
- 5.2 LLM for Semi-Structured Data Analysis: 12
- 1.7 End-to-End Data Processing Pipelines: 11
- 4.2 LLM for Data System Optimization: 11
- 2.6 KV Cache: 11
- 3.2 Data Compression: 10
- 2.1 Data Formats: 9
- 2.2 Data Distribution: 9
- 5.4 LLM for Heterogeneous Data Analysis: 8
- 3.3 Data Packing: 8
- 2.5 Data Fault Tolerance: 7
- 3.4 Data Provenance: 6
- Data for LLM Agents: 5
- Data for Retrieval-Augmented Generation (RAG): 5
- Data for LLM Evaluation: 3
- Data for Reinforcement Learning (RL): 3
- Data for Continual Pre-training: 1
- Data for Supervised Fine-Tuning (SFT): 1
- Data for Pretraining: 1
Keywords
- ocrlite: 1
- ocr: 1
- db: 1
- crnn: 1
- chineseocr: 1
- medicalgpt: 1
- medical: 1
- llm: 1
- llama: 1
- gpt: 1
- dpo: 1
- chatgpt: 1
- webdriver: 1
- selenium: 1
- rust: 1
- ruby: 1
- python: 1
- javascript: 1
- java: 1
- dotnet: 1
- storage: 1
- s3: 1
- redis: 1
- posix: 1
- object-storage: 1
- hdfs: 1
- golang: 1
- go: 1
- filesystem: 1
- distributed-systems: 1
- cloud-native: 1
- bigdata: 1
- nosql: 1
- neo4j: 1
- graphdb: 1
- graph-database: 1
- graph: 1
- database: 1
- cypher: 1