https://github.com/minpeter/stanford_alpaca_regen

alpaca dataset self-instruct


Host: GitHub
URL: https://github.com/minpeter/stanford_alpaca_regen
Owner: minpeter
License: apache-2.0
Created: 2025-02-10T02:54:14.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-02-10T02:58:56.000Z (over 1 year ago)
Last Synced: 2025-07-24T11:47:15.396Z (11 months ago)
Topics: alpaca, dataset, self-instruct
Language: Python
Homepage: https://huggingface.co/datasets/minpeter/stanford-alpaca-regen-llama-3.3
Size: 426 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Stanford Alpaca Regenerate: An Instruction-following LLaMA Model

Updated to the latest OpenAI Python SDK.
The data generation code is kept mostly unchanged, while the model training code is replaced with an axolotl config.

## Create dataset

You need a `FRIENDLI_TOKEN`. See the [personal access token guide](https://friendli.ai/docs/guides/personal_access_tokens) for how to obtain one.
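As a minimal sketch, a script can fail early when the token is missing. This assumes the token is supplied through an environment variable named `FRIENDLI_TOKEN`; the helper name `require_friendli_token` is hypothetical and not part of this repository.

```python
import os


def require_friendli_token(env=None):
    """Return the token from the environment, or raise if it is missing.

    Assumption: the token is passed via the FRIENDLI_TOKEN environment
    variable. `env` is injectable so the check can be tested.
    """
    env = os.environ if env is None else env
    token = env.get("FRIENDLI_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "FRIENDLI_TOKEN is not set; export it before generating data."
        )
    return token
```

Calling `require_friendli_token()` at the top of a script surfaces a missing token before any requests are made.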

```bash
# Generate 52 new instructions -> regen.json
python -m generate_instruction generate_instruction_following_data \
--output_dir ./ \
--num_instructions_to_generate 52

# Convert regen.json to JSONL -> output.jsonl
python convert_to_jsonl.py
```
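The conversion step can be sketched roughly as follows. This is not the repository's `convert_to_jsonl.py`; it is an illustration that assumes `regen.json` holds a single JSON array of record objects, and the function name `json_array_to_jsonl` is hypothetical.

```python
import json
from pathlib import Path


def json_array_to_jsonl(src, dst):
    """Write each element of a JSON array in `src` as one line in `dst`.

    Assumption: `src` contains a top-level JSON array of objects.
    Returns the number of records written.
    """
    records = json.loads(Path(src).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    with Path(dst).open("w", encoding="utf-8") as out:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text readable in the output
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)
```

For example, `json_array_to_jsonl("regen.json", "output.jsonl")` would produce one JSON object per line, which is the layout shown in the example output.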

### Example of completed data

```
$ head -n 1 output.jsonl
{"instruction": "What can you infer from the following conversation?", "input": "John: How was your weekend?\nJane: It was great. I went to the beach with friends and had a lot of fun.", "output": "Jane had a great weekend and enjoyed her time at the beach with friends."}
```
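Before training on the file, it can help to confirm that every line parses and carries the three fields shown above. The following is a small sketch; the helper name `load_alpaca_jsonl` is hypothetical, and the required keys are taken from the example record.

```python
import json

# Field names taken from the example record above.
REQUIRED_KEYS = ("instruction", "input", "output")


def load_alpaca_jsonl(path):
    """Load a JSONL file and check that each record has the expected keys."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [k for k in REQUIRED_KEYS if k not in record]
            if missing:
                raise ValueError(f"line {line_no}: missing keys {missing}")
            records.append(record)
    return records
```

`load_alpaca_jsonl("output.jsonl")` returns a list of dictionaries, or raises `ValueError` naming the first line that lacks a field.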
