Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/davanstrien/data-for-fine-tuning-llms
- Host: GitHub
- URL: https://github.com/davanstrien/data-for-fine-tuning-llms
- Owner: davanstrien
- Created: 2024-05-30T08:54:43.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-06-05T15:39:01.000Z (5 months ago)
- Last Synced: 2024-10-14T10:52:11.953Z (28 days ago)
- Language: Jupyter Notebook
- Size: 1.19 MB
- Stars: 74
- Watchers: 1
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Datasets for fine-tuning LLMs
This repo accompanies a session run as part of Mastering LLMs: A Conference For Developers & Data Scientists. The session focused on _some_ of the data issues related to fine-tuning LLMs.
The notebooks focus on balancing the competing requirements for fine-tuning data: sufficient diversity, high quality, and the right quantity (i.e. avoiding duplication).
![Diagram showing goals of the notebooks](goals.png)
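To make the "avoid duplication" goal concrete, here is a minimal sketch of exact deduplication after light text normalisation. It is an illustration only and not necessarily the approach used in the notebooks; the function names `normalise` and `deduplicate` and the sample prompts are hypothetical.

```python
import hashlib
import re


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def deduplicate(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each text, comparing normalised forms."""
    seen: set[str] = set()
    unique: list[str] = []
    for text in texts:
        # Hash the normalised text so the "seen" set stays small for long examples.
        digest = hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


# Hypothetical sample data: the second prompt differs only in case and spacing.
examples = [
    "Translate this sentence to French.",
    "translate this   sentence to French.",
    "Summarise the following paragraph.",
]
print(deduplicate(examples))
```

This only catches duplicates that become identical after normalisation. Near-duplicates (paraphrases, templated variations) need fuzzier techniques, which trade extra compute for better recall.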
## Slides
- [Creating, curating, and cleaning data for LLMs](https://docs.google.com/presentation/d/12n-_ivhTQQpeTKAIvmuxnUxkJ19zvtJzKBwvZn-t8rQ/edit?usp=sharing)
## Notebooks
- [01_eda_and_deduplication](01_eda_and_deduplication.ipynb)
- [02-data-checks](02-data-checks.ipynb)
- [03-synthetic-data-generation](03-synthetic-data-generation.ipynb)
## Synthetic data pipelines
- [dataset-card-summaries](dataset-card-summaries/): This folder contains a pipeline for generating a synthetic dataset of tl;dr summaries of datasets, based on their dataset cards.
### Other resources for synthetic data generation
- [awesome-synthetic-datasets](https://github.com/davanstrien/awesome-synthetic-datasets)