awesome-data-llm
Official Repository of "LLM × DATA" Survey Paper
https://github.com/weAIDB/awesome-data-llm
- 1 Data Processing for LLM
- 1.2 Data Deduplication
- 1.5 Data Mixing
- 1.3 Data Filtering
- 1.6 Data Distillation and Synthesis
- 1.1 Data Acquisition
- 1.4 Data Selection
- 1.7 End-to-End Data Processing Pipelines
- 🌤 The IaaS Concept of DATA4LLM
- Datasets
- 0 Data Characteristics across LLM Stages
- Data for Pretraining
- Data for Reinforcement Learning (RL)
- Data for Continual Pre-training
- Data for Supervised Fine-Tuning (SFT)
- Data for LLM Agents
- Data for Retrieval-Augmented Generation (RAG)
- Data for LLM Evaluation
- 4 LLM for Data Management
- 4.3 LLM for Data System Optimization
- 4.1 LLM for Data Manipulation
- 4.2 LLM for Data Analysis
- 2 Data Storage for LLM
- 2.1 Data Formats
- 2.2 Data Distribution
- 2.3 Data Organization
- 2.4 Data Movement
- 2.5 Data Fault Tolerance
- 2.6 KV Cache
- 3 Data Serving for LLM
- 3.1 Data Shuffling
- 3.2 Data Compression
- 3.3 Data Packing
- 3.4 Data Provenance
Categories
Sub Categories
- 4.2 LLM for Data Analysis: 62
- 1.6 Data Distillation and Synthesis: 43
- 1.3 Data Filtering: 36
- 1.1 Data Acquisition: 31
- 4.1 LLM for Data Manipulation: 29
- 4.3 LLM for Data System Optimization: 28
- 1.5 Data Mixing: 23
- 1.2 Data Deduplication: 22
- 3.1 Data Shuffling: 18
- 2.3 Data Organization: 18
- 2.4 Data Movement: 12
- 1.4 Data Selection: 11
- 2.6 KV Cache: 10
- 2.2 Data Distribution: 9
- 1.7 End-to-End Data Processing Pipelines: 9
- 2.1 Data Formats: 8
- 3.2 Data Compression: 7
- 2.5 Data Fault Tolerance: 6
- 3.4 Data Provenance: 6
- Data for LLM Agents: 5
- Data for Retrieval-Augmented Generation (RAG): 5
- Data for Reinforcement Learning (RL): 3
- 3.3 Data Packing: 3
- Data for LLM Evaluation: 3
- Data for Continual Pre-training: 1
- Data for Pretraining: 1
- Data for Supervised Fine-Tuning (SFT): 1
Keywords
- nosql: 1
- neo4j: 1
- graphdb: 1
- graph-database: 1
- graph: 1
- database: 1
- cypher: 1
- storage: 1
- s3: 1
- redis: 1
- posix: 1
- object-storage: 1
- hdfs: 1
- golang: 1
- go: 1
- filesystem: 1
- distributed-systems: 1
- cloud-native: 1
- bigdata: 1
- ocrlite: 1
- ocr: 1
- db: 1
- crnn: 1
- chineseocr: 1
- webdriver: 1
- selenium: 1
- rust: 1
- ruby: 1
- python: 1
- javascript: 1
- java: 1
- dotnet: 1
- medicalgpt: 1
- medical: 1
- llm: 1
- llama: 1
- gpt: 1
- dpo: 1
- chatgpt: 1