{"id":13625096,"url":"https://github.com/BatsResearch/bonito","last_synced_at":"2025-04-16T06:31:43.121Z","repository":{"id":224565268,"uuid":"763068976","full_name":"BatsResearch/bonito","owner":"BatsResearch","description":"A lightweight library for generating synthetic instruction tuning datasets for your data without GPT.","archived":false,"fork":false,"pushed_at":"2024-09-12T18:56:27.000Z","size":810,"stargazers_count":655,"open_issues_count":2,"forks_count":42,"subscribers_count":12,"default_branch":"main","last_synced_at":"2024-09-14T06:03:38.293Z","etag":null,"topics":["domain-adaptation","gpt","llm","synthetic-data","synthetic-dataset-generation","task-adaptation","zero-shot-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BatsResearch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-25T13:33:33.000Z","updated_at":"2024-09-12T18:56:28.000Z","dependencies_parsed_at":"2024-02-26T17:00:38.513Z","dependency_job_id":"19bc89e4-7ab5-4c00-8ec9-784e8a937fe3","html_url":"https://github.com/BatsResearch/bonito","commit_stats":null,"previous_names":["batsresearch/bonito"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BatsResearch%2Fbonito","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BatsResearch%2Fbonito/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BatsResearch%2Fbonito/releases","manifest
s_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BatsResearch%2Fbonito/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BatsResearch","download_url":"https://codeload.github.com/BatsResearch/bonito/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223700228,"owners_count":17188272,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["domain-adaptation","gpt","llm","synthetic-data","synthetic-dataset-generation","task-adaptation","zero-shot-learning"],"created_at":"2024-08-01T21:01:50.808Z","updated_at":"2025-04-16T06:31:43.106Z","avatar_url":"https://github.com/BatsResearch.png","language":"Python","funding_links":[],"categories":["llm","Python"],"sub_categories":[],"readme":"# Bonito\n\nBonito is an open-source model for conditional task generation: the task of converting unannotated text into task-specific training datasets for instruction tuning. 
This repo is a lightweight library for creating synthetic datasets with Bonito, built on top of the Hugging Face `transformers` and `vllm` libraries.\n\n- Paper: [Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation](https://arxiv.org/abs/2402.18334)\n- Model: [bonito-v1](https://huggingface.co/BatsResearch/bonito-v1)\n- Demo: [Bonito on Spaces](https://huggingface.co/spaces/nihalnayak/bonito)\n- Dataset: [ctga-v1](https://huggingface.co/datasets/BatsResearch/ctga-v1)\n- Code: To reproduce experiments in our paper, see [nayak-aclfindings24-code](https://github.com/BatsResearch/nayak-aclfindings24-code).\n\n![Bonito](https://nihalnayak.github.io/assets/img/workflow.png)\n\n## News\n- 🐠 February 2025: Uploaded `bonito-llm` to PyPI.\n- 🐡 August 2024: Released [new Bonito model](https://huggingface.co/BatsResearch/Llama-3.1-8B-bonito-v1) with Meta Llama 3.1 as the base model.\n- 🐟 June 2024: Bonito was accepted to ACL Findings 2024.\n\n## Installation\nCreate an environment and install the package using the following command:\n```bash\npip3 install bonito-llm\n```\n\n## Basic Usage\nTo generate a synthetic instruction tuning dataset with Bonito, use the following code:\n```python\nfrom bonito import Bonito\nfrom vllm import SamplingParams\nfrom datasets import load_dataset\n\n# Initialize the Bonito model\nbonito = Bonito(\"BatsResearch/bonito-v1\")\n\n# Load a dataset with unannotated text\nunannotated_text = load_dataset(\n    \"BatsResearch/bonito-experiment\",\n    \"unannotated_contract_nli\"\n)[\"train\"].select(range(10))\n\n# Generate a synthetic instruction tuning dataset\nsampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)\nsynthetic_dataset = bonito.generate_tasks(\n    unannotated_text,\n    context_col=\"input\",\n    task_type=\"nli\",\n    sampling_params=sampling_params\n)\n```\n\n## Supported Task Types\nBonito supports the following task types, listed as full name (short form): `extractive question 
answering` (`exqa`), `multiple-choice question answering` (`mcqa`), `question generation` (`qg`), `question answering without choices` (`qa`), `yes-no question answering` (`ynqa`), `coreference resolution` (`coref`), `paraphrase generation` (`paraphrase`), `paraphrase identification` (`paraphrase_id`), `sentence completion` (`sent_comp`), `sentiment` (`sentiment`), `summarization` (`summarization`), `text generation` (`text_gen`), `topic classification` (`topic_class`), `word sense disambiguation` (`wsd`), `textual entailment` (`te`), `natural language inference` (`nli`).\n\nYou can use either the full name or the short form to specify the `task_type` in `generate_tasks`.\n\n## Tutorial\nWe have created a tutorial [here](https://colab.research.google.com/drive/12OCh4OYo1vr9ZvwIWK4JwZT7rkMrYrx2?usp=sharing) on how to use a quantized version of the model on a Google Colab T4 instance. The quantized version was graciously contributed by user [alexandreteles](https://github.com/alexandreteles).\nAn additional tutorial [here](https://colab.research.google.com/drive/1XuDRVKpUUqdjrqg2-P2FIqkdAQBnqoNL?usp=sharing) shows how to try out the Bonito model on an A100 GPU on Google Colab.\n\n\n## Citation\nIf you use Bonito in your research, please cite the following paper:\n```\n@inproceedings{bonito:aclfindings24,\n  title = {Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation},\n  author = {Nayak, Nihal V. and Nan, Yiyang and Trost, Avi and Bach, Stephen H.},\n  booktitle = {Findings of the Association for Computational Linguistics: ACL 2024},\n  year = {2024}}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FBatsResearch%2Fbonito","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FBatsResearch%2Fbonito","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FBatsResearch%2Fbonito/lists"}