{"id":37255179,"url":"https://nvidia-nemo.github.io/DataDesigner/","last_synced_at":"2026-01-22T23:01:18.302Z","repository":{"id":325181623,"uuid":"1077662591","full_name":"NVIDIA-NeMo/DataDesigner","owner":"NVIDIA-NeMo","description":"🎨 NeMo Data Designer: A general library for generating high-quality synthetic data from scratch or based on seed data.","archived":false,"fork":false,"pushed_at":"2026-01-17T00:41:10.000Z","size":15639,"stargazers_count":632,"open_issues_count":27,"forks_count":51,"subscribers_count":5,"default_branch":"main","last_synced_at":"2026-01-17T00:44:55.592Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://nvidia-nemo.github.io/DataDesigner/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NVIDIA-NeMo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":"DCO","cla":null}},"created_at":"2025-10-16T15:01:31.000Z","updated_at":"2026-01-17T00:41:10.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/NVIDIA-NeMo/DataDesigner","commit_stats":null,"previous_names":["nvidia-nemo/datadesigner"],"tags_count":17,"template":false,"template_full_name":null,"purl":"pkg:github/NVIDIA-NeMo/DataDesigner","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-NeMo%2FDataDesigner","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-NeMo%2FDataDesigner/tags","releases_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-NeMo%2FDataDesigner/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-NeMo%2FDataDesigner/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NVIDIA-NeMo","download_url":"https://codeload.github.com/NVIDIA-NeMo/DataDesigner/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-NeMo%2FDataDesigner/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28673382,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-22T20:48:19.482Z","status":"ssl_error","status_checked_at":"2026-01-22T20:48:14.968Z","response_time":144,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-15T19:00:23.329Z","updated_at":"2026-01-22T23:01:18.297Z","avatar_url":"https://github.com/NVIDIA-NeMo.png","language":"Python","readme":"# 🎨 NeMo Data Designer\n\n[![CI](https://github.com/NVIDIA-NeMo/DataDesigner/actions/workflows/ci.yml/badge.svg)](https://github.com/NVIDIA-NeMo/DataDesigner/actions/workflows/ci.yml)\n[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n[![Python 3.10 - 
3.13](https://img.shields.io/badge/🐍_Python-3.10_|_3.11_|_3.12_|_3.13-blue.svg)](https://www.python.org/downloads/) [![NeMo Microservices](https://img.shields.io/badge/NeMo-Microservices-76b900)](https://docs.nvidia.com/nemo/microservices/latest/index.html) [![Code](https://img.shields.io/badge/Code-Documentation-8A2BE2.svg)](https://nvidia-nemo.github.io/DataDesigner/)\n\n**Generate high-quality synthetic datasets from scratch or using your own seed data.**\n\n---\n\n## Welcome!\n\nData Designer helps you create synthetic datasets that go beyond simple LLM prompting. Whether you need diverse statistical distributions, meaningful correlations between fields, or validated high-quality outputs, Data Designer provides a flexible framework for building production-grade synthetic data.\n\n## What can you do with Data Designer?\n\n- **Generate diverse data** using statistical samplers, LLMs, or existing seed datasets\n- **Control relationships** between fields with dependency-aware generation\n- **Validate quality** with built-in Python, SQL, and custom local and remote validators\n- **Score outputs** using LLM-as-a-judge for quality assessment\n- **Iterate quickly** with preview mode before full-scale generation\n\n---\n\n## Quick Start\n\n### 1. Install\n\n```bash\npip install data-designer\n```\n\nOr install from source:\n\n```bash\ngit clone https://github.com/NVIDIA-NeMo/DataDesigner.git\ncd DataDesigner\nmake install\n```\n\n### 2. Set your API key\n\nStart with one of our default model providers:\n\n- [NVIDIA Build API](https://build.nvidia.com)\n- [OpenAI](https://platform.openai.com/api-keys)\n- [OpenRouter](https://openrouter.ai)\n\nGrab your API key(s) using the above links and set one or more of the following environment variables:\n```bash\nexport NVIDIA_API_KEY=\"your-api-key-here\"\n\nexport OPENAI_API_KEY=\"your-openai-api-key-here\"\n\nexport OPENROUTER_API_KEY=\"your-openrouter-api-key-here\"\n```\n\n### 3. 
Start generating data!\n```python\nfrom data_designer.essentials import (\n    CategorySamplerParams,\n    DataDesigner,\n    DataDesignerConfigBuilder,\n    LLMTextColumnConfig,\n    PersonSamplerParams,\n    SamplerColumnConfig,\n    SamplerType,\n)\n\n# Initialize with default settings\ndata_designer = DataDesigner()\nconfig_builder = DataDesignerConfigBuilder()\n\n# Add a product category\nconfig_builder.add_column(\n    SamplerColumnConfig(\n        name=\"product_category\",\n        sampler_type=SamplerType.CATEGORY,\n        params=CategorySamplerParams(\n            values=[\"Electronics\", \"Clothing\", \"Home \u0026 Kitchen\", \"Books\"],\n        ),\n    )\n)\n\n# Generate personalized customer reviews\nconfig_builder.add_column(\n    LLMTextColumnConfig(\n        name=\"review\",\n        model_alias=\"nvidia-text\",\n        prompt=\"Write a brief product review for a {{ product_category }} item you recently purchased.\",\n    )\n)\n\n# Preview your dataset\npreview = data_designer.preview(config_builder=config_builder)\npreview.display_sample_record()\n```\n\n---\n\n## What's next?\n\n### 📚 Learn more\n\n- **[Quick Start Guide](https://nvidia-nemo.github.io/DataDesigner/latest/quick-start/)** – Detailed walkthrough with more examples\n- **[Tutorial Notebooks](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/)** – Step-by-step interactive tutorials\n- **[Column Types](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/columns/)** – Explore samplers, LLM columns, validators, and more\n- **[Validators](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/validators/)** – Learn how to validate generated data with Python, SQL, and remote validators\n- **[Model Configuration](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/model-configs/)** – Configure custom models and providers\n- **[Person Sampling](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/person_sampling/)** – Learn how to sample realistic 
person data with demographic attributes\n\n### 🔧 Configure models via CLI\n\n```bash\ndata-designer config providers # Configure model providers\ndata-designer config models    # Set up your model configurations\ndata-designer config list      # View current settings\n```\n\n### 🤝 Get involved\n\n- **[Contributing Guide](https://nvidia-nemo.github.io/DataDesigner/latest/CONTRIBUTING)** – Help improve Data Designer\n- **[GitHub Issues](https://github.com/NVIDIA-NeMo/DataDesigner/issues)** – Report bugs or make a feature request\n\n---\n\n## Telemetry\n\nData Designer collects telemetry to help us improve the library for developers. We collect:\n\n* The names of models used\n* The count of input tokens\n* The count of output tokens\n\n**No user or device information is collected.** This data is not used to track any individual user behavior. It is used only to see, in aggregate, which models are the most popular for synthetic data generation (SDG). We will share this usage data with the community.\n\nSpecifically, the model name defined in a `ModelConfig` object is what is collected. 
In the below example config:\n\n```python\nModelConfig(\n    alias=\"nv-reasoning\",\n    model=\"openai/gpt-oss-20b\",\n    provider=\"nvidia\",\n    inference_parameters=ChatCompletionInferenceParams(\n        temperature=0.3,\n        top_p=0.9,\n        max_tokens=4096,\n    ),\n)\n```\n\nThe value `openai/gpt-oss-20b` would be collected.\n\nTo disable telemetry capture, set `NEMO_TELEMETRY_ENABLED=false`.\n\n### Top Models\n\nThis chart represents the breakdown of models used for Data Designer across all synthetic data generation jobs from 12/18/2025 to 1/14/2026.\n\n![Top models used for synthetic data generation](docs/images/top-models.png)\n\n_Last updated on 1/14/2026_\n\n---\n\n## License\n\nApache License 2.0 – see [LICENSE](LICENSE) for details.\n\n---\n\n## Citation\n\nIf you use NeMo Data Designer in your research, please cite it using the following BibTeX entry:\n\n```bibtex\n@misc{nemo-data-designer,\n  author = {The NeMo Data Designer Team, NVIDIA},\n  title = {NeMo Data Designer: A framework for generating synthetic data from scratch or based on your own seed data},\n  howpublished = {\\url{https://github.com/NVIDIA-NeMo/DataDesigner}},\n  year = {2025},\n  note = {GitHub Repository},\n}\n```\n","funding_links":[],"categories":["Newly Created Repositories"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/nvidia-nemo.github.io%2FDataDesigner%2F","html_url":"https://awesome.ecosyste.ms/projects/nvidia-nemo.github.io%2FDataDesigner%2F","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/nvidia-nemo.github.io%2FDataDesigner%2F/lists"}