https://github.com/adithya-s-k/topic2dataset

Create high qaulity instruct finetuning datasets for LLMs by just giving a topic / website
https://github.com/adithya-s-k/topic2dataset

dataset-generation fine-tuning instruct-finetune llm

Last synced: 7 months ago
JSON representation

Create high qaulity instruct finetuning datasets for LLMs by just giving a topic / website

Host: GitHub
URL: https://github.com/adithya-s-k/topic2dataset
Owner: adithya-s-k
License: mit
Created: 2023-07-21T17:31:38.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2024-05-12T08:16:38.000Z (over 1 year ago)
Last Synced: 2025-01-12T22:11:10.482Z (9 months ago)
Topics: dataset-generation, fine-tuning, instruct-finetune, llm
Language: Jupyter Notebook
Homepage:
Size: 167 KB
Stars: 2
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# Topic2Dataset

## 👋🏼Introduction
data collection and preprocessing can be time-consuming and resource-intensive tasks. Topic2Dataset ia an open source project that helps by automating dataset generation from specified topics and websites, eliminating manual data gathering and accelerating dataset creation timelines. This project mainly aims to curate datasets for Retrieval augmented Generation (RAG) and LLM finetuning.

## 🔑Key Features
- **Effortless Dataset Generation:** Generate datasets for pre-training, fine-tuning, and preference modeling with ease.
- **Tailored Instruction and Data:** Upload a document and receive customized instructions and data for fine-tuning models.
- **AI-Powered Scraping:** Live AI agent-based scrapers ensure relevance, accuracy, and up-to-date data extraction.
- **Time-Dependent Scraping:** Keep datasets fresh and relevant with time-dependent scraping capabilities.
- **Seamless Data Export:** Export datasets to S3 and other blob storage services effortlessly for further analysis and usage.
- **Insightful Summarization:** Receive answers to questions and insights relevant to the document's content for enhanced understanding.

### Getting Started

Demo Notebook:
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/)

## Python Setup🐍
Clone the repository
```
git clone https://github.com/adithya-s-k/topic2dataset
cd topic2dataset
```
Install Required Dependencies
```
pip install -r requirements.txt
```
Quick Start dataset generation
```
python generate.py --topic Clinical Trials --website ["https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/" , "https://clinregs.niaid.nih.gov/#"] --time_in_hrs 10 --output local
```

### Contributing

Topic2Dataset is open source. We welcome contributions and collaboration from the community! See the project page for ways to contribute.

### License

This project is licensed under the MIT License - see the [LICENSE](./LICENSE) file for details.

### Citing

If you use Topic2Dataset in your research, please cite the following paper:

```
@article{Topic2Dataset,
title={Topic2Dataset: AI-Drive Dataset curation for RAG and Finetuning},
author={Adithya S K},
year={2023}
}
```

## Disclaimer
The resources, including code, data, and model weights, associated with this project are restricted for academic research purposes only and cannot be used for commercial purposes. The content produced by any version of WizardLM is influenced by uncontrollable variables such as randomness, and therefore, the accuracy of the output cannot be guaranteed by this project. This project does not accept any legal liability for the content of the model output, nor does it assume responsibility for any losses incurred due to the use of associated resources and output results.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/adithya-s-k/topic2dataset

Awesome Lists containing this project

README