https://github.com/minpeter/stanford_alpaca_regen
- Host: GitHub
- URL: https://github.com/minpeter/stanford_alpaca_regen
- Owner: minpeter
- License: apache-2.0
- Created: 2025-02-10T02:54:14.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-02-10T02:58:56.000Z (over 1 year ago)
- Last Synced: 2025-07-24T11:47:15.396Z (11 months ago)
- Topics: alpaca, dataset, self-instruct
- Language: Python
- Homepage: https://huggingface.co/datasets/minpeter/stanford-alpaca-regen-llama-3.3
- Size: 426 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Stanford Alpaca Regenerate: An Instruction-following LLaMA Model
Updated to the latest OpenAI Python SDK.
The data generation code is mostly kept as-is; the model training part is replaced with an axolotl config.
## Create dataset
You need a `FRIENDLI_TOKEN`. See the [documentation](https://friendli.ai/docs/guides/personal_access_tokens) for how to obtain one.
```bash
# Generate 52 new instructions -> regen.json
python -m generate_instruction generate_instruction_following_data \
--output_dir ./ \
--num_instructions_to_generate 52
# Convert regen.json to JSONL -> output.jsonl
python convert_to_jsonl.py
```
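The contents of `convert_to_jsonl.py` are not shown here. As a rough sketch of what such a conversion step might look like, assuming `regen.json` holds a single top-level JSON array of records (an assumption, not confirmed by this README), the hypothetical helper `json_array_to_jsonl` below writes one JSON object per line:

```python
import json
from pathlib import Path


def json_array_to_jsonl(src: str, dst: str) -> int:
    """Hypothetical helper: write each element of a JSON array as one JSONL line."""
    # Assumption: the source file contains a top-level JSON array of objects.
    records = json.loads(Path(src).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError(f"{src} does not contain a JSON array")
    with open(dst, "w", encoding="utf-8") as out:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text readable in the output.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)
```

Calling `json_array_to_jsonl("regen.json", "output.jsonl")` would return the number of records written. Newlines inside field values stay escaped as `\n`, so each record still occupies exactly one line.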
### Example of completed data
```
$ head -n 1 output.jsonl
{"instruction": "What can you infer from the following conversation?", "input": "John: How was your weekend?\nJane: It was great. I went to the beach with friends and had a lot of fun.", "output": "Jane had a great weekend and enjoyed her time at the beach with friends."}
```
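Before using the file for training, it can help to confirm that every record carries the three fields seen in the example above. The following is a minimal sketch, not part of the repository; `load_alpaca_jsonl` and `REQUIRED_KEYS` are hypothetical names introduced for illustration.

```python
import json

# Fields seen in the example record above.
REQUIRED_KEYS = {"instruction", "input", "output"}


def load_alpaca_jsonl(path: str) -> list[dict]:
    """Hypothetical helper: read JSONL records and verify the expected fields."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            missing = REQUIRED_KEYS.difference(record)
            if missing:
                raise ValueError(f"line {line_no}: missing {sorted(missing)}")
            records.append(record)
    return records
```

For example, `load_alpaca_jsonl("output.jsonl")` returns the records as a list of dictionaries, or raises `ValueError` naming the first line that lacks a field.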