{"id":28179411,"url":"https://github.com/stackloklabs/promptwright","last_synced_at":"2025-05-16T02:07:24.216Z","repository":{"id":259600648,"uuid":"878416567","full_name":"StacklokLabs/promptwright","owner":"StacklokLabs","description":"Generate large synthetic data using an LLM","archived":false,"fork":false,"pushed_at":"2025-05-12T06:42:40.000Z","size":13900,"stargazers_count":414,"open_issues_count":4,"forks_count":32,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-05-12T07:40:22.796Z","etag":null,"topics":["ai","data-science","dataset","huggingface","huggingface-datasets","machine-learning","synthetic-data","synthetic-dataset-generation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/StacklokLabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-10-25T10:59:05.000Z","updated_at":"2025-05-12T06:41:02.000Z","dependencies_parsed_at":"2024-10-26T22:28:58.302Z","dependency_job_id":"3d472fa5-ca1a-4dad-95d2-a6be1741deb0","html_url":"https://github.com/StacklokLabs/promptwright","commit_stats":null,"previous_names":["stackloklabs/promptsmith","stackloklabs/promptweaver","stacklok/promptwright","stackloklabs/promptwright"],"tags_count":10,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StacklokLabs%2Fpromptwright","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StacklokLabs%2Fpromptwright/tags","releases_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/repositories/StacklokLabs%2Fpromptwright/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StacklokLabs%2Fpromptwright/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/StacklokLabs","download_url":"https://codeload.github.com/StacklokLabs/promptwright/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254453652,"owners_count":22073617,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","data-science","dataset","huggingface","huggingface-datasets","machine-learning","synthetic-data","synthetic-dataset-generation"],"created_at":"2025-05-16T02:07:23.476Z","updated_at":"2025-05-16T02:07:24.203Z","avatar_url":"https://github.com/StacklokLabs.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Promptwright - Synthetic Dataset Generation Library\n\n[![Tests](https://github.com/StacklokLabs/promptwright/actions/workflows/test.yml/badge.svg)](https://github.com/StacklokLabs/promptwright/actions/workflows/test.yml)\n[![Python Version](https://img.shields.io/pypi/pyversions/promptwright.svg)](https://pypi.org/project/promptwright/)\n\n![promptwright-cover](https://github.com/user-attachments/assets/0bace08b-753c-42d3-a084-5a1953101fa2)\n\n\nPromptwright is a Python library from [Stacklok](https://stacklok.com) designed\nfor generating large synthetic datasets using either a local LLM or most LLM service\nproviders (OpenAI, Anthropic, OpenRouter, etc.). 
The library offers\na flexible and easy-to-use set of interfaces that let users\ngenerate prompt-led synthetic datasets.\n\nPromptwright was inspired by [redotvideo/pluto](https://github.com/redotvideo/pluto).\nIt started as a fork but ended up largely being a rewrite.\n\n## Features\n\n- **Multiple Provider Support**: Works with most LLM service providers and local LLMs via Ollama, vLLM, etc.\n- **Configurable Instructions and Prompts**: Define custom instructions and system prompts\n- **YAML Configuration**: Define your generation tasks using YAML configuration files\n- **Command Line Interface**: Run generation tasks directly from the command line\n- **Push to Hugging Face**: Push the generated dataset to Hugging Face Hub with automatic dataset cards and tags\n- **System Message Control**: Choose whether to include system messages in the generated dataset\n\n## Getting Started\n\n### Prerequisites\n\n- Python 3.11+\n- Poetry (for dependency management)\n- (Optional) Hugging Face account and API token for dataset upload\n\n### Installation\n\n#### pip\n\nYou can install Promptwright using pip:\n\n```bash\npip install promptwright\n```\n\n#### Development Installation\n\nTo set up a development environment from source, use the following commands:\n\n```bash\n# Install Poetry if you haven't already\ncurl -sSL https://install.python-poetry.org | python3 -\n\n# Install promptwright and its dependencies\ngit clone https://github.com/StacklokLabs/promptwright.git\ncd promptwright\npoetry install\n```\n\n### Usage\n\nPromptwright offers two ways to define and run your generation tasks:\n\n#### 1. Using YAML Configuration (Recommended)\n\nCreate a YAML file defining your generation task:\n\n```yaml\nsystem_prompt: \"You are a helpful assistant. 
You provide clear and concise answers to user questions.\"\n\ntopic_tree:\n  args:\n    root_prompt: \"Capital Cities of the World.\"\n    model_system_prompt: \"\u003csystem_prompt_placeholder\u003e\"\n    tree_degree: 3\n    tree_depth: 2\n    temperature: 0.7\n    model_name: \"ollama/mistral:latest\"\n  save_as: \"basic_prompt_topictree.jsonl\"\n\ndata_engine:\n  args:\n    instructions: \"Please provide training examples with questions about capital cities.\"\n    system_prompt: \"\u003csystem_prompt_placeholder\u003e\"\n    model_name: \"ollama/mistral:latest\"\n    temperature: 0.9\n    max_retries: 2\n\ndataset:\n  creation:\n    num_steps: 5\n    batch_size: 1\n    model_name: \"ollama/mistral:latest\"\n    sys_msg: true  # Include system message in dataset (default: true)\n  save_as: \"basic_prompt_dataset.jsonl\"\n\n# Optional Hugging Face Hub configuration\nhuggingface:\n  # Repository in format \"username/dataset-name\"\n  repository: \"your-username/your-dataset-name\"\n  # Token can also be provided via HF_TOKEN environment variable or --hf-token CLI option\n  token: \"your-hf-token\"\n  # Additional tags for the dataset (optional)\n  # \"promptwright\" and \"synthetic\" tags are added automatically\n  tags:\n    - \"promptwright-generated-dataset\"\n    - \"geography\"\n```\n\nRun it using the CLI:\n\n```bash\npromptwright start config.yaml\n```\n\nThe CLI supports various options to override configuration values:\n\n```bash\n# --sys-msg controls system message inclusion (default: true)\npromptwright start config.yaml \\\n  --topic-tree-save-as output_tree.jsonl \\\n  --dataset-save-as output_dataset.jsonl \\\n  --model-name ollama/llama3 \\\n  --temperature 0.8 \\\n  --tree-degree 4 \\\n  --tree-depth 3 \\\n  --num-steps 10 \\\n  --batch-size 2 \\\n  --sys-msg true \\\n  --hf-repo username/dataset-name \\\n  --hf-token your-token \\\n  --hf-tags tag1 --hf-tags tag2\n```\n\n#### Provider Integration\n\nPromptwright uses LiteLLM to interface with LLM providers. 
You can specify the\nprovider and model with the `provider` and `model` keys in your config or code:\n\n```yaml\nprovider: \"openai\"  # LLM provider\nmodel: \"gpt-4-1106-preview\"  # Model name\n```\n\nChoose any of the providers listed [here](https://docs.litellm.ai/docs/providers/)\nand follow the same naming convention.\n\nFor example, the LiteLLM convention for Google Gemini is:\n\n```python\nfrom litellm import completion\nimport os\n\nos.environ['GEMINI_API_KEY'] = \"\"  # Set your Gemini API key here\nresponse = completion(\n    model=\"gemini/gemini-pro\",\n    messages=[{\"role\": \"user\", \"content\": \"write code for saying hi from LiteLLM\"}]\n)\n```\n\nIn Promptwright, you would specify the provider as `gemini` and the model as `gemini-pro`.\n\n```yaml\nprovider: \"gemini\"  # LLM provider\nmodel: \"gemini-pro\"  # Model name\n```\n\nFor Ollama, you would specify the provider as `ollama` and the model as `mistral:latest`,\nand so on.\n\n```yaml\nprovider: \"ollama\"  # LLM provider\nmodel: \"mistral:latest\"  # Model name\n```\n\n##### API Keys\n\nYou can set the API key for the provider in an environment variable. The variable\nshould be named `PROVIDER_API_KEY`. For example, for OpenAI, you would set the\nAPI key as `OPENAI_API_KEY`.\n\n```bash\nexport OPENAI_API_KEY=\"your-api-key\"\n```\n\nAgain, refer to the [LiteLLM documentation](https://docs.litellm.ai/docs/providers/)\nfor more information on setting up the API keys.\n\n#### Hugging Face Hub Integration\n\nPromptwright supports automatic dataset upload to the Hugging Face Hub with the following features:\n\n1. **Dataset Upload**: Upload your generated dataset directly to Hugging Face Hub\n2. **Dataset Cards**: Automatically creates and updates dataset cards\n3. **Automatic Tags**: Adds \"promptwright\" and \"synthetic\" tags automatically\n4. **Custom Tags**: Support for additional custom tags\n5. 
**Flexible Authentication**: HF token can be provided via:\n   - CLI option: `--hf-token your-token`\n   - Environment variable: `export HF_TOKEN=your-token`\n   - YAML configuration: `huggingface.token`\n\nExample using environment variable:\n```bash\nexport HF_TOKEN=your-token\npromptwright start config.yaml --hf-repo username/dataset-name\n```\n\nOr pass it in as a CLI option:\n```bash\npromptwright start config.yaml --hf-repo username/dataset-name --hf-token your-token\n```\n\n#### 2. Using Python Code\n\nYou can also create generation tasks programmatically using Python code. There\nare several examples in the `examples` directory that demonstrate this approach.\n\nExample Python usage:\n\n```python\nfrom promptwright import DataEngine, EngineArguments, TopicTree, TopicTreeArguments\n\n# Shared system prompt used by both the topic tree and the data engine\nsystem_prompt = \"You are a creative writing instructor providing writing prompts and example responses.\"\n\ntree = TopicTree(\n    args=TopicTreeArguments(\n        root_prompt=\"Creative Writing Prompts\",\n        model_system_prompt=system_prompt,\n        tree_degree=5,\n        tree_depth=4,\n        temperature=0.9,\n        model_name=\"ollama/llama3\"\n    )\n)\n\nengine = DataEngine(\n    args=EngineArguments(\n        instructions=\"Generate creative writing prompts and example responses.\",\n        system_prompt=system_prompt,\n        model_name=\"ollama/llama3\",\n        temperature=0.9,\n        max_retries=2,\n        sys_msg=True,  # Include system message in dataset (default: True)\n    )\n)\n```\n\n### Development\n\nThe project uses Poetry for dependency management. 
Here are some common development commands:\n\n```bash\n# Install dependencies including development dependencies\nmake install\n\n# Format code\nmake format\n\n# Run linting\nmake lint\n\n# Run tests\nmake test\n\n# Run security checks\nmake security\n\n# Build the package\nmake build\n\n# Run all checks and build\nmake all\n```\n\n### Prompt Output Examples\n\nWith sys_msg=true (default):\n```json\n{\n  \"messages\": [\n    {\n      \"role\": \"system\",\n      \"content\": \"You are tasked with designing an immersive virtual reality experience that transports users to a fantastical world of wonder.\"\n    },\n    {\n      \"role\": \"user\",\n      \"content\": \"Create a descriptive passage about a character discovering their hidden talents.\"\n    },\n    {\n      \"role\": \"assistant\",\n      \"content\": \"As she stared at the canvas, Emma's fingers hovered above the paintbrushes, as if hesitant to unleash the colors that had been locked within her. The strokes began with bold abandon, swirling blues and greens merging into a mesmerizing dance of light and shadow. With each passing moment, she felt herself becoming the art – her very essence seeping onto the canvas like watercolors in a spring storm. The world around her melted away, leaving only the vibrant symphony of color and creation.\"\n    }\n  ]\n}\n```\n\nWith sys_msg=false:\n```json\n{\n  \"messages\": [\n    {\n      \"role\": \"user\",\n      \"content\": \"Create a descriptive passage about a character discovering their hidden talents.\"\n    },\n    {\n      \"role\": \"assistant\",\n      \"content\": \"As she stared at the canvas, Emma's fingers hovered above the paintbrushes, as if hesitant to unleash the colors that had been locked within her. The strokes began with bold abandon, swirling blues and greens merging into a mesmerizing dance of light and shadow. 
With each passing moment, she felt herself becoming the art – her very essence seeping onto the canvas like watercolors in a spring storm. The world around her melted away, leaving only the vibrant symphony of color and creation.\"\n    }\n  ]\n}\n```\n\n\n## Unpredictable Behavior\n\nThe library is designed to generate synthetic data based on the prompts and instructions\nprovided. The quality of the generated data is dependent on the quality of the prompts\nand the model used. The library does not guarantee the quality of the generated data.\n\nLarge Language Models can sometimes generate unpredictable or inappropriate\ncontent and the authors of this library are not responsible for the content\ngenerated by the models. We recommend reviewing the generated data before using it\nin any production environment.\n\nLarge Language Models also have the potential to fail to stick with the behavior\ndefined by the prompt around JSON formatting, and may generate invalid JSON. This\nis a known issue with the underlying model and not the library. We handle these\nerrors by retrying the generation process and filtering out invalid JSON. The \nfailure rate is low, but it can happen. We report on each failure within a final\nsummary.\n\n## Contributing\n\nIf something here could be improved, please open an issue or submit a pull request.\n\n### License\n\nThis project is licensed under the Apache 2 License. See the `LICENSE` file for more details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstackloklabs%2Fpromptwright","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstackloklabs%2Fpromptwright","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstackloklabs%2Fpromptwright/lists"}