https://github.com/daspartho/distillclassifier
Easily generate synthetic data for classification tasks using LLMs
classification classification-models dataset-generation distillation distillation-model distilling-the-knowledge large-language-models nlp synthetic-data synthetic-dataset-generation text-classification
- Host: GitHub
- URL: https://github.com/daspartho/distillclassifier
- Owner: daspartho
- License: mit
- Created: 2023-10-08T09:04:23.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-11-07T15:34:52.000Z (over 2 years ago)
- Last Synced: 2025-02-08T15:28:27.216Z (about 1 year ago)
- Topics: classification, classification-models, dataset-generation, distillation, distillation-model, distilling-the-knowledge, large-language-models, nlp, synthetic-data, synthetic-dataset-generation, text-classification
- Language: Python
- Homepage:
- Size: 1020 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# DistillClassifier
## About
DistillClassifier is a tool built on top of [LLM-VM](https://github.com/anarchy-ai/LLM-VM) that uses LLMs to easily generate synthetic data for classification tasks, so that an LLM's knowledge of a classification task can be distilled into much smaller, faster-to-run classification models.
This project was built for the ANARCHY October 2023 Hackathon. Check out ANARCHY on [GitHub](https://github.com/anarchy-ai) and their [website](https://anarchy.ai/welcome/why_anarchy).
## Team Members:
- [Partho Das](https://github.com/daspartho)
- [Karan Janthe](https://github.com/kmj-007)
## Setup
### Clone the project from GitHub
```bash
git clone https://github.com/daspartho/DistillClassifier
```
### `cd` into the project
```bash
cd DistillClassifier
```
### Install LLM-VM
```bash
git clone https://github.com/anarchy-ai/LLM-VM.git
cd LLM-VM
pip3 install .
cd ..
```
### Install Python dependencies
```bash
pip3 install -r requirements.txt
```
### Create a `.env` file and set your OpenAI API key (if you want to use OpenAI models) and Hugging Face Hub token (if you want to push the dataset to the Hugging Face Hub):
```bash
OPENAI_API_KEY=
HF_HUB_TOKEN=
```
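The scripts are expected to read these values from the environment at runtime. A minimal sketch of a `.env` loader is below; this is an illustration, not the project's actual code (a library such as `python-dotenv` may be used instead, and `load_env` is a hypothetical helper name):

```python
import os
from pathlib import Path

def load_env(path=".env"):
    """Parse KEY=VALUE lines from a .env file into os.environ.

    Hypothetical helper for illustration; existing environment
    variables are not overwritten.
    """
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and malformed lines.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())

load_env()
openai_key = os.getenv("OPENAI_API_KEY")  # None if never set
hf_token = os.getenv("HF_HUB_TOKEN")
```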
## Run
### You can run the tool from the command line like this:
```bash
python3 generation.py <column_info> <num_examples> [-m MODEL] [-f FILENAME] [-r REPO]
```
### Arguments:
- `column_info`: Column information as a dictionary.
- `num_examples`: Number of examples to be generated.
- `-m, --model`: (Optional) Model name. Defaults to "chat_gpt".
- `-f, --filename`: (Optional) Dataset filename. Defaults to "dataset.json".
- `-r, --repo`: (Optional) Hugging Face repo ID. Defaults to `None`.
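A command-line interface with this shape could be declared with `argparse` roughly as follows. This is a sketch of the interface, not the project's actual parser; the positional argument names are assumptions based on the usage shown:

```python
import argparse
import json

def build_parser():
    """Build a parser matching the generation.py interface described above."""
    parser = argparse.ArgumentParser(
        description="Generate synthetic classification data using an LLM."
    )
    # Positional arguments: column schema (a JSON dict) and example count.
    parser.add_argument("column_info", type=json.loads,
                        help="Column information as a JSON dictionary.")
    parser.add_argument("num_examples", type=int,
                        help="Number of examples to be generated.")
    # Optional arguments with the documented defaults.
    parser.add_argument("-m", "--model", default="chat_gpt",
                        help="Model name.")
    parser.add_argument("-f", "--filename", default="dataset.json",
                        help="Dataset filename.")
    parser.add_argument("-r", "--repo", default=None,
                        help="Hugging Face repo ID to push the dataset to.")
    return parser
```

With this parser, `build_parser().parse_args(['{"text": "..."}', "25"])` yields `column_info` as a dict and `num_examples` as an int, with the optional flags at their defaults.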
### Example:
```bash
python3 generation.py '{"text": "either spoiler or not spoiler text", "label": "if text is spoiler or not"}' 25 -m 'chat_gpt' -f 'dataset.json' -r 'spoiler_or_not'
```
### Or run the `demo.py` file directly:
```bash
python3 demo.py
```
### Example output dataset:
#### [demo_dataset.json](/demo_dataset.json)
#### [demo dataset on huggingface](https://huggingface.co/datasets/daspartho/demo_dataset)
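Once a generated dataset like the one above exists, the distillation step amounts to fitting a small, fast model on the synthetic text/label pairs. A hedged sketch using scikit-learn (not part of this project; the `"text"` and `"label"` field names follow the example `column_info` above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_student(records):
    """Fit a small TF-IDF + logistic-regression "student" classifier.

    Each record is assumed to be a dict with "text" and "label" keys,
    matching the column_info used when the dataset was generated.
    """
    texts = [r["text"] for r in records]
    labels = [r["label"] for r in records]
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(texts, labels)
    return model
```

Given `records = json.load(open("dataset.json"))`, the resulting pipeline can classify new text with `model.predict([...])` at a fraction of the cost of calling the original LLM.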
## License
MIT