https://github.com/adithya-s-k/topic2dataset
Create high-quality instruct fine-tuning datasets for LLMs by just giving a topic or website
dataset-generation fine-tuning instruct-finetune llm
- Host: GitHub
- URL: https://github.com/adithya-s-k/topic2dataset
- Owner: adithya-s-k
- License: mit
- Created: 2023-07-21T17:31:38.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-05-12T08:16:38.000Z (over 1 year ago)
- Last Synced: 2025-01-12T22:11:10.482Z (9 months ago)
- Topics: dataset-generation, fine-tuning, instruct-finetune, llm
- Language: Jupyter Notebook
- Homepage:
- Size: 167 KB
- Stars: 2
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Topic2Dataset
## 👋🏼Introduction
Data collection and preprocessing can be time-consuming and resource-intensive. Topic2Dataset is an open-source project that automates dataset generation from specified topics and websites, removing manual data gathering and shortening dataset creation timelines. The project mainly aims to curate datasets for Retrieval-Augmented Generation (RAG) and LLM fine-tuning.
## 🔑Key Features
- **Effortless Dataset Generation:** Generate datasets for pre-training, fine-tuning, and preference modeling with ease.
- **Tailored Instruction and Data:** Upload a document and receive customized instructions and data for fine-tuning models.
- **AI-Powered Scraping:** Live AI agent-based scrapers ensure relevance, accuracy, and up-to-date data extraction.
- **Time-Dependent Scraping:** Keep datasets fresh and relevant with time-dependent scraping capabilities.
- **Seamless Data Export:** Export datasets to S3 and other blob storage services effortlessly for further analysis and usage.
- **Insightful Summarization:** Receive answers to questions and insights relevant to the document's content for enhanced understanding.
### Getting Started
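For orientation, instruct fine-tuning datasets like those described under Key Features are commonly stored as JSON Lines, one record per line. The sketch below is illustrative only: the `instruction` / `input` / `output` field names follow a widely used convention and are an assumption here, not Topic2Dataset's documented output schema, and `write_jsonl` / `read_jsonl` are hypothetical helpers rather than functions from this repository.

```python
import json
from pathlib import Path

# Hypothetical record layout (assumption): the schema actually produced by
# Topic2Dataset may differ.
records = [
    {
        "instruction": "Explain what a clinical trial is.",
        "input": "",
        "output": "A clinical trial is a research study that tests a medical intervention in people.",
    },
    {
        "instruction": "Summarize the passage in one sentence.",
        "input": "Retrieval-augmented generation combines a retriever with a language model.",
        "output": "RAG pairs document retrieval with text generation.",
    },
]


def write_jsonl(rows, path):
    """Write one JSON object per line and return the number of rows written."""
    with Path(path).open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(rows)


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Calling `write_jsonl(records, "dataset.jsonl")` followed by `read_jsonl("dataset.jsonl")` round-trips the records. A line-oriented format is a reasonable choice here because records can be appended or streamed without loading the whole file.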
Demo Notebook:
[Google Colab](https://colab.research.google.com/)
## Python Setup🐍
Clone the repository
```
git clone https://github.com/adithya-s-k/topic2dataset
cd topic2dataset
```
Install Required Dependencies
```
pip install -r requirements.txt
```
Quick Start dataset generation
```
python generate.py --topic "Clinical Trials" --website '["https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/", "https://clinregs.niaid.nih.gov/#"]' --time_in_hrs 10 --output local
```
### Contributing
Topic2Dataset is open source. We welcome contributions and collaboration from the community! See the project page for ways to contribute.
### License
This project is licensed under the MIT License - see the [LICENSE](./LICENSE) file for details.
### Citing
If you use Topic2Dataset in your research, please cite the following paper:
```
@article{Topic2Dataset,
title={Topic2Dataset: AI-Driven Dataset Curation for RAG and Finetuning},
author={Adithya S K},
year={2023}
}
```
## Disclaimer
The resources associated with this project, including code, data, and model weights, are restricted to academic research purposes only and cannot be used for commercial purposes. The content produced by any version of Topic2Dataset is influenced by uncontrollable variables such as randomness, so this project cannot guarantee the accuracy of the output. This project does not accept any legal liability for the content of the model output, nor does it assume responsibility for any losses incurred due to the use of associated resources and output results.