awesome-data-llm
Official Repository of "LLM × DATA" Survey Paper
https://github.com/weAIDB/awesome-data-llm
-
1 Data Processing for LLM
-
1.5 Data Mixing
-
1.1 Data Acquisition
-
1.3 Data Filtering
-
1.2 Data Deduplication
-
1.4 Data Selection
-
1.6 Data Distillation and Synthesis
-
1.7 End-to-End Data Processing Pipelines
-
-
🌤 The IaaS Concept of DATA4LLM
-
2 Data Storage for LLM
-
2.1 Data Formats
-
2.3 Data Organization
-
2.5 Data Fault Tolerance
-
2.6 KV Cache
-
2.2 Data Distribution
-
2.4 Data Movement
-
-
3 Data Serving for LLM
-
3.1 Data Shuffling
-
3.2 Data Compression
-
3.3 Data Packing
-
3.4 Data Provenance
-
-
4 LLM for Data Management
-
4.1 LLM for Data Manipulation
- IEEE Data Eng. Bull. 49(1) - 31 (2025). [[Paper](http://sites.computer.org/debull/A25mar/p3.pdf)]
-
4.2 LLM for Data Analysis
-
4.2 LLM for Data System Optimization
-
4.3 LLM for Data System Optimization
-
-
5 LLM as Data Analyst
-
5.1 LLM for Structured Data Analysis
-
5.2 LLM for Semi-Structured Data Analysis
-
5.3 LLM for Unstructured Data Analysis
-
5.4 LLM for Heterogeneous Data Analysis
-
-
🌟 LLM/Agent-as-Data-Analyst
-
Datasets
-
0 Data Characteristics across LLM Stages
-
Data for Pretraining
-
Data for Continual Pre-training
-
Data for Supervised Fine-Tuning (SFT)
-
Data for LLM Agents
-
Data for Reinforcement Learning (RL)
-
Data for Retrieval-Augmented Generation (RAG)
-
Data for LLM Evaluation
-
Categories
| Sub Categories | Count |
| --- | --- |
| 5.3 LLM for Unstructured Data Analysis | 106 |
| 4.2 LLM for Data Analysis | 101 |
| 5.1 LLM for Structured Data Analysis | 72 |
| 1.6 Data Distillation and Synthesis | 69 |
| 4.1 LLM for Data Manipulation | 63 |
| 1.1 Data Acquisition | 55 |
| 1.5 Data Mixing | 50 |
| 1.3 Data Filtering | 41 |
| 4.3 LLM for Data System Optimization | 28 |
| 2.3 Data Organization | 25 |
| 1.2 Data Deduplication | 25 |
| 3.1 Data Shuffling | 24 |
| 4.2 LLM for Data System Optimization | 19 |
| 5.2 LLM for Semi-Structured Data Analysis | 15 |
| 1.4 Data Selection | 13 |
| 2.4 Data Movement | 12 |
| 2.6 KV Cache | 11 |
| 1.7 End-to-End Data Processing Pipelines | 10 |
| 3.2 Data Compression | 10 |
| 2.1 Data Formats | 9 |
| 2.2 Data Distribution | 9 |
| 5.4 LLM for Heterogeneous Data Analysis | 8 |
| 3.3 Data Packing | 8 |
| 2.5 Data Fault Tolerance | 7 |
| 3.4 Data Provenance | 6 |
| Data for LLM Agents | 5 |
| Data for Retrieval-Augmented Generation (RAG) | 5 |
| Data for LLM Evaluation | 3 |
| Data for Reinforcement Learning (RL) | 3 |
| Data for Continual Pre-training | 1 |
| Data for Supervised Fine-Tuning (SFT) | 1 |
| Data for Pretraining | 1 |
Keywords
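| Keyword | Count |
| --- | --- |
| ocrlite | 1 |
| ocr | 1 |
| db | 1 |
| crnn | 1 |
| chineseocr | 1 |
| medicalgpt | 1 |
| medical | 1 |
| llm | 1 |
| llama | 1 |
| gpt | 1 |
| dpo | 1 |
| chatgpt | 1 |
| webdriver | 1 |
| selenium | 1 |
| rust | 1 |
| ruby | 1 |
| python | 1 |
| javascript | 1 |
| java | 1 |
| dotnet | 1 |
| storage | 1 |
| s3 | 1 |
| redis | 1 |
| posix | 1 |
| object-storage | 1 |
| hdfs | 1 |
| golang | 1 |
| go | 1 |
| filesystem | 1 |
| distributed-systems | 1 |
| cloud-native | 1 |
| bigdata | 1 |
| nosql | 1 |
| neo4j | 1 |
| graphdb | 1 |
| graph-database | 1 |
| graph | 1 |
| database | 1 |
| cypher | 1 |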