{"id":16819131,"url":"https://github.com/datadreamer-dev/DataDreamer","last_synced_at":"2025-11-06T09:30:26.683Z","repository":{"id":220258004,"uuid":"648788010","full_name":"datadreamer-dev/DataDreamer","owner":"datadreamer-dev","description":"DataDreamer: Prompt. Generate Synthetic Data. Train \u0026 Align Models.    🤖💤","archived":false,"fork":false,"pushed_at":"2025-01-27T19:10:47.000Z","size":828,"stargazers_count":910,"open_issues_count":3,"forks_count":47,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-01-27T19:27:49.492Z","etag":null,"topics":["alignment","deep-learning","fine-tuning","gpt","instruction-tuning","llm","llmops","llms","machine-learning","natural-language-processing","nlp","nlp-library","openai","python","pytorch","synthetic-data","synthetic-dataset-generation","transformers"],"latest_commit_sha":null,"homepage":"https://datadreamer.dev","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/datadreamer-dev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-02T20:17:04.000Z","updated_at":"2025-01-24T18:09:46.000Z","dependencies_parsed_at":"2024-03-30T20:30:53.256Z","dependency_job_id":"1f8b2fdb-44ef-48ed-8765-ac847753e888","html_url":"https://github.com/datadreamer-dev/DataDreamer","commit_stats":{"total_commits":70,"total_committers":4,"mean_commits":17.5,"dds":"0.042857142857142816","last_synced_commit":"6994b30cb4fba0d153067e8b82841967d6e33f4a"},"previous_names":["datadreamer-dev/datadreamer"],"tags_count":33,"template":false,"template_full_name":null,"repository_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datadreamer-dev%2FDataDreamer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datadreamer-dev%2FDataDreamer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datadreamer-dev%2FDataDreamer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datadreamer-dev%2FDataDreamer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/datadreamer-dev","download_url":"https://codeload.github.com/datadreamer-dev/DataDreamer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239488531,"owners_count":19647226,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alignment","deep-learning","fine-tuning","gpt","instruction-tuning","llm","llmops","llms","machine-learning","natural-language-processing","nlp","nlp-library","openai","python","pytorch","synthetic-data","synthetic-dataset-generation","transformers"],"created_at":"2024-10-13T10:52:10.021Z","updated_at":"2025-11-06T09:30:26.652Z","avatar_url":"https://github.com/datadreamer-dev.png","language":"Python","readme":"\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://datadreamer.dev\"\u003e\u003cimg src=\"https://datadreamer.dev/docs/latest/_static/logo.svg\" alt=\"DataDreamer\" style=\"max-width: 100%;\"\u003e\u003c/a\u003e\u003cbr /\u003e\n  \u003ca href=\"https://datadreamer.dev\"\u003e\u003cb\u003ehttps://datadreamer.dev\u003c/b\u003e\u003c/a\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n   
\u003cb\u003ePrompt. Generate Synthetic Data. Train \u0026 Align Models.\u003c/b\u003e\u003cbr /\u003e\u003cbr /\u003e\n  \u003ca href=\"https://github.com/datadreamer-dev/DataDreamer/actions/workflows/release.yml\"\u003e\u003cimg src=\"https://img.shields.io/github/actions/workflow/status/datadreamer-dev/DataDreamer/release.yml?logo=githubactions\u0026logoColor=white\u0026label=Tests%20%26%20Release\" alt=\"Tests \u0026 Release\" style=\"max-width: 100%;\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://codecov.io/gh/datadreamer-dev/DataDreamer\"\u003e\u003cimg src=\"https://codecov.io/gh/datadreamer-dev/DataDreamer/graph/badge.svg?token=KZB00BKWJE\"/\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/datadreamer-dev/DataDreamer/actions/workflows/tests.yml\"\u003e\u003cimg src=\"https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/57b6a8cedd26481516a1a6af510d6b24272d0a76/assets/badge/v2.json\" alt=\"Ruff\" style=\"max-width: 100%;\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://pypi.org/project/datadreamer.dev/\"\u003e\u003cimg src=\"https://badge.fury.io/py/datadreamer.dev.svg\"/\u003e\u003c/a\u003e\n  \u003ca href=\"https://datadreamer.dev/docs/\"\u003e\u003cimg src=\"https://img.shields.io/website.svg?down_color=red\u0026down_message=offline\u0026label=Documentation\u0026up_message=online\u0026url=https://datadreamer.dev/docs/\"/\u003e\u003c/a\u003e\n  \u003ca href=\"https://datadreamer.dev/docs/latest/pages/contributing.html\"\u003e\u003cimg src=\"https://img.shields.io/badge/Contributor-Guide-blue?logo=Github\u0026color=purple\"/\u003e\u003c/a\u003e\n  \u003cbr /\u003e\n  \u003ca href=\"https://github.com/datadreamer-dev/DataDreamer/blob/main/LICENSE.txt\"\u003e\u003cimg src=\"https://img.shields.io/badge/License-MIT-blue.svg\"/\u003e\u003c/a\u003e\n  \u003ca href=\"https://ajayp.app/\"\u003e\u003cimg 
src=\"https://img.shields.io/badge/NLP-NLP?labelColor=011F5b\u0026color=990000\u0026label=University%20of%20Pennsylvania\"/\u003e\u003c/a\u003e\n  \u003ca href=\"https://arxiv.org/abs/2402.10379\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2402.10379-b31b1b.svg\"/\u003e\u003c/a\u003e\n  \u003ca href=\"https://discord.gg/dwWW8wuCtK\"\u003e\u003cimg src=\"https://img.shields.io/badge/Discord-Chat-blue?logo=discord\u0026color=4338ca\u0026labelColor=black\"/\u003e\u003c/a\u003e\n\u003c/p\u003e\n\nDataDreamer is a powerful open-source Python library for prompting, synthetic data generation, and training workflows. It is designed to be simple, extremely efficient, and research-grade.\n\n\u003cdiv align=\"center\"\u003e\n  \u003ctable class=\"docutils align-default\"\u003e\n    \u003ctbody\u003e\n        \u003ctr\u003e\n          \u003ctd colspan=\"2\"\u003e\n            \u003cp align=\"center\"\u003e\u003cb\u003eInstallation\u003c/b\u003e\u003c/p\u003e \u003cpre lang=\"bash\"\u003epip3 install datadreamer.dev\u003c/pre\u003e\n          \u003c/td\u003e\n        \u003c/tr\u003e\n    \u003c/tbody\u003e\n    \u003ctbody\u003e\n        \u003ctr\u003e\n          \u003cth class=\"head\"\u003e\u003ccode\u003edemo.py\u003c/code\u003e\u003c/th\u003e\n          \u003cth class=\"head\"\u003eResult of \u003ccode\u003edemo.py\u003c/code\u003e\u003c/th\u003e\n        \u003c/tr\u003e\n    \u003c/tbody\u003e\n    \u003ctbody\u003e\n        \u003ctr\u003e\n          
\u003ctd\u003e\n\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u003cbr /\u003e\n              \u003ca href=\"https://datadreamer.dev/docs/latest/\" title=\"demo.py\"\u003e\u003cimg src=\"https://datadreamer.dev/docs/latest/_static/images/demo_code.png\" alt=\"demo.py\" /\u003e\u003c/a\u003e\n              \u003cbr /\u003e\u003cbr /\u003e\n              \u003cp align=\"center\"\u003e\n                See the \u003ca class=\"reference external\" href=\"https://datadreamer.dev/docs/latest/\" title=\"demo.py\"\u003efull demo script\u003c/a\u003e\n              \u003c/p\u003e\n              \u003cbr /\u003e\n          \u003c/td\u003e\n          
\u003ctd\u003e\n\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u003cbr /\u003e\n            \u003ca href=\"https://datadreamer.dev/docs/latest/\" title=\"Demo\"\u003e\u003cimg style=\"height: 400px;\" src=\"https://datadreamer.dev/docs/latest/_static/images/demo.svg#cachebust-2\" alt=\"Demo\" /\u003e\u003c/a\u003e\n            \u003cp align=\"center\"\u003e\n              See the \u003ca class=\"reference external\" href=\"https://huggingface.co/datasets/datadreamer-dev/abstracts_and_tweets\"\u003esynthetic dataset\u003c/a\u003e and \u003ca class=\"reference external\" href=\"https://huggingface.co/datadreamer-dev/abstracts_to_tweet_model\"\u003ethe trained model\u003c/a\u003e\n            \u003c/p\u003e\n          \u003c/td\u003e\n        \u003c/tr\u003e \n    \u003c/tbody\u003e\n    \u003ctbody\u003e\n        \u003ctr\u003e\n          \u003ctd colspan=\"2\"\u003e\n              \u003cp align=\"center\"\u003e\n                🚀 For more demonstrations and recipes see the \u003ca class=\"reference external\" href=\"https://datadreamer.dev/docs/latest/pages/get_started/quick_tour/index.html\" 
title=\"Quick Tour\"\u003e Quick Tour\u003c/a\u003e page.\n              \u003c/p\u003e\n          \u003c/td\u003e\n        \u003c/tr\u003e\n    \u003c/tbody\u003e\n  \u003c/table\u003e\n\u003c/div\u003e\n\nWith DataDreamer you can:\n\n* 💬 **Create Prompting Workflows**: Create and run multi-step, complex, prompting workflows easily with major open source or API-based LLMs.\n* 📊 **Generate Synthetic Datasets**: Generate synthetic datasets for novel tasks or augment existing datasets with LLMs.\n* ⚙️ **Train Models**: Align models. Fine-tune models. Instruction-tune models. Distill models. Train on existing data or synthetic data.\n* ... learn more about what's possible in the [Overview Guide](https://datadreamer.dev/docs/latest/pages/get_started/overview_guide.html)\n\nDataDreamer is:\n\n* 🧩 **Simple**: Simple and approachable to use with sensible defaults, yet powerful with support for bleeding edge techniques.\n* 🔬 **Research-Grade**: Built for researchers, by researchers, but accessible to all. A focus on correctness, best practices, and reproducibility.\n* 🏎️ **Efficient**: Aggressive caching and resumability built-in. Support for techniques like quantization, parameter-efficient training (LoRA), and more.\n* 🔄 **Reproducible**: Workflows built with DataDreamer are easily shareable, reproducible, and extendable.\n* 🤝 **Makes Sharing Easy**: Publishing datasets and models is simple. Automatically generate data cards and model cards with metadata. Generate a list of any citations required.\n* ... 
learn more about the [motivation and design principles behind DataDreamer](https://datadreamer.dev/docs/latest/pages/get_started/motivation_and_design.html).\n\n## Citation\n\nPlease cite the [DataDreamer paper](https://arxiv.org/abs/2402.10379):\n\n```bibtex\n@misc{patel2024datadreamer,\n      title={DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows}, \n      author={Ajay Patel and Colin Raffel and Chris Callison-Burch},\n      year={2024},\n      eprint={2402.10379},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n\n## Contact\n\nPlease reach out to us via [email (ajayp@upenn.edu)](mailto:ajayp@upenn.edu) or on [Discord](https://discord.gg/dwWW8wuCtK) if you have any questions, comments, or feedback.\n\n\u003cbr /\u003e\n\n------------------------------\n\nCopyright © 2024, [Ajay Patel](https://ajayp.app/). Released under the [MIT License](https://github.com/datadreamer-dev/DataDreamer/blob/main/LICENSE.txt).\n\nThank you to the maintainers at [Hugging Face](https://github.com/huggingface) and [LiteLLM](https://github.com/BerriAI/litellm) for accepting contributions necessary for DataDreamer and providing upstream support.\n\n------------------------------\n#### Funding Acknowledgements\n\n\u003csub\u003e\u003cb\u003eODNI, IARPA:\u003c/b\u003e This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the HIATUS Program contract #2022-22072200005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. 
Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.\u003c/sub\u003e\n","funding_links":[],"categories":["Important techniques","Python"],"sub_categories":["Libraries, code and tools"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatadreamer-dev%2FDataDreamer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatadreamer-dev%2FDataDreamer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatadreamer-dev%2FDataDreamer/lists"}