awesome-data-llm
Official Repository of "LLM × DATA" Survey Paper
https://github.com/weAIDB/awesome-data-llm
1 Data Processing for LLM
1.1 Data Acquisition
1.4 Data Selection
1.6 Data Distillation and Synthesis
1.5 Data Mixing
1.3 Data Filtering
1.7 End-to-End Data Processing Pipelines
1.2 Data Deduplication
2 Data Storage for LLM
2.3 Data Organization
2.1 Data Formats
2.2 Data Distribution
2.5 Data Fault Tolerance
2.6 KV Cache
2.4 Data Movement
Datasets
4 LLM for Data Management
4.2 LLM for Data Analysis
4.3 LLM for Data System Optimization
4.1 LLM for Data Manipulation
- IEEE Data Eng. Bull. 49(1) - 31 (2025). [[Paper](http://sites.computer.org/debull/A25mar/p3.pdf)]
4.2 LLM for Data System Optimization
0 Data Characteristics across LLM Stages
Data for Supervised Fine-Tuning (SFT)
Data for Retrieval-Augmented Generation (RAG)
Data for Continual Pre-training
Data for Reinforcement Learning (RL)
Data for LLM Agents
Data for Pretraining
Data for LLM Evaluation
5 LLM as Data Analyst
5.1 LLM for Structured Data Analysis
5.3 LLM for Unstructured Data Analysis
5.2 LLM for Semi-Structured Data Analysis
5.4 LLM for Heterogeneous Data Analysis
3 Data Serving for LLM
3.2 Data Compression
3.1 Data Shuffling
3.4 Data Provenance
3.3 Data Packing
🌟 LLM/Agent-as-Data-Analyst
🌤 The IaaS Concept of DATA4LLM
Categories
Sub Categories
- 5.3 LLM for Unstructured Data Analysis: 106
- 4.2 LLM for Data Analysis: 101
- 5.1 LLM for Structured Data Analysis: 72
- 1.6 Data Distillation and Synthesis: 69
- 4.1 LLM for Data Manipulation: 63
- 1.1 Data Acquisition: 55
- 1.5 Data Mixing: 50
- 1.3 Data Filtering: 41
- 4.3 LLM for Data System Optimization: 28
- 2.3 Data Organization: 26
- 1.2 Data Deduplication: 25
- 3.1 Data Shuffling: 24
- 4.2 LLM for Data System Optimization: 19
- 5.2 LLM for Semi-Structured Data Analysis: 15
- 1.4 Data Selection: 13
- 2.4 Data Movement: 12
- 2.6 KV Cache: 11
- 1.7 End-to-End Data Processing Pipelines: 10
- 3.2 Data Compression: 10
- 2.1 Data Formats: 9
- 2.2 Data Distribution: 9
- 5.4 LLM for Heterogeneous Data Analysis: 8
- 3.3 Data Packing: 8
- 2.5 Data Fault Tolerance: 7
- 3.4 Data Provenance: 6
- Data for LLM Agents: 5
- Data for Retrieval-Augmented Generation (RAG): 5
- Data for LLM Evaluation: 3
- Data for Reinforcement Learning (RL): 3
- Data for Continual Pre-training: 1
- Data for Supervised Fine-Tuning (SFT): 1
- Data for Pretraining: 1
Keywords
- ocrlite: 1
- ocr: 1
- db: 1
- crnn: 1
- chineseocr: 1
- medicalgpt: 1
- medical: 1
- llm: 1
- llama: 1
- gpt: 1
- dpo: 1
- chatgpt: 1
- webdriver: 1
- selenium: 1
- rust: 1
- ruby: 1
- python: 1
- javascript: 1
- java: 1
- dotnet: 1
- storage: 1
- s3: 1
- redis: 1
- posix: 1
- object-storage: 1
- hdfs: 1
- golang: 1
- go: 1
- filesystem: 1
- distributed-systems: 1
- cloud-native: 1
- bigdata: 1
- nosql: 1
- neo4j: 1
- graphdb: 1
- graph-database: 1
- graph: 1
- database: 1
- cypher: 1