Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/orhonovich/unnatural-instructions
Last synced: 24 days ago
- Host: GitHub
- URL: https://github.com/orhonovich/unnatural-instructions
- Owner: orhonovich
- License: mit
- Created: 2022-12-19T14:01:06.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-02-23T06:26:18.000Z (over 1 year ago)
- Last Synced: 2024-01-14T05:59:31.773Z (6 months ago)
- Size: 27.6 MB
- Stars: 162
- Watchers: 7
- Forks: 9
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Lists
- awesome-chatgpt-dataset - Unnatural Instructions - creative and diverse instructions, collected with virtually no human labor. | MIT | (Dataset Detail)
- awesome-prompt-engineering - Unnatural Instruction
- Awesome-instruction-tuning - Unnatural Inst. - LM-Unnat. Inst. | T5-LM | 11B | (Datasets and Models / Modified from Traditional NLP)
- awesome-stars - orhonovich/unnatural-instructions - (Others)
- awesome-rlhf - unnatural-instructions
- awesome-instruction-dataset - (orhonovich/unnatural-instructions)|240K|EN|MT|MIX
- awesome-instruction-datasets - Dataset Link
- Awesome-LLM - https://github.com/orhonovich/unnatural-instructions
README
# Unnatural Instructions
This repository contains the Unnatural Instructions dataset, a collection of instructions automatically generated by a large language model.
See full details in the paper: "[Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor](https://arxiv.org/abs/2212.09689)"
## 🗃️ Content
The `data` folder contains two files: `core_data.jsonl`, containing the Unnatural Instructions core dataset of 68,478 instruction-input-output triplets, and `full_data.jsonl`, containing the full 240,670 Unnatural Instructions examples. The full data was constructed by expanding the core data with automatically generated instruction paraphrases.
## 📄 Format
### Core data
Each line in `core_data.jsonl` is a JSON object with two fields: `instruction`, a natural language instruction describing a task, and `instances`, an array of JSON objects, each of which contains:
- `input`: An input for the task described by the `instruction`
- `instruction_with_input`: The instruction concatenated with the `input`
- `constraints`: The task's output space constraints
- `output`: The output of executing `instruction` with the given `input`
### Full data
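Both files share the record layout listed above (an `instruction` plus an `instances` array), so a single reader can cover the common part. The following is a minimal Python sketch, not code from this repository. The function name `iter_examples` and the sample record are hypothetical, and treating `reformulations` as an optional top-level array that appears only in the full data is an assumption based on the format description here.

```python
import io
import json
from typing import Iterator, TextIO


def iter_examples(lines: TextIO) -> Iterator[dict]:
    """Yield one flat dict per instance and, if present, per reformulation."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        for inst in record["instances"]:
            yield {
                "instruction": record["instruction"],
                "input": inst["input"],
                "output": inst["output"],
                "constraints": inst.get("constraints"),
                "is_reformulation": False,
            }
        # Assumption: `reformulations` is a top-level array, absent in core data.
        for ref in record.get("reformulations", []):
            yield {
                "instruction": ref["instruction"],
                "input": ref["input"],
                "output": ref["output"],
                "constraints": None,  # not listed among the reformulation fields
                "is_reformulation": True,
            }


# Hypothetical record, invented for illustration (not taken from the dataset).
sample_record = {
    "instruction": "Add the two numbers.",
    "instances": [{
        "input": "2 3",
        "instruction_with_input": "Add the two numbers. 2 3",
        "constraints": "The output should be a number.",
        "output": "5",
    }],
    "reformulations": [{
        "instruction": "Compute the sum of the two numbers.",
        "input": "2 3",
        "instruction_with_input": "Compute the sum of the two numbers. 2 3",
        "output": "5",
    }],
}
sample_file = io.StringIO(json.dumps(sample_record) + "\n")
examples = list(iter_examples(sample_file))
for ex in examples:
    print(ex["is_reformulation"], ex["instruction"], "->", ex["output"])
```

To read the real data, pass an open file handle instead of the in-memory sample, for example `open("data/core_data.jsonl", encoding="utf-8")`.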
`full_data.jsonl` has the same structure as `core_data.jsonl`, but with one additional field: `reformulations`. `reformulations` is an array of JSON objects, each corresponding to an automatically generated paraphrase of the given instruction. Each reformulation contains the fields:
- `instruction`: A paraphrase of the original instruction
- `input`: An input for the task described by the `instruction`
- `instruction_with_input`: The paraphrased instruction concatenated with the `input`
- `output`: The output of executing `instruction` with the given `input`
## 📘 Citation
If you make use of Unnatural Instructions, please cite the following paper:
```
@misc{honovich2022unnatural,
title = {Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor},
author = {Honovich, Or and Scialom, Thomas and Levy, Omer and Schick, Timo},
url = {https://arxiv.org/abs/2212.09689},
publisher = {arXiv},
year={2022}
}
```