{"id":41574478,"url":"https://github.com/timothepearce/synda","last_synced_at":"2026-01-24T08:12:09.053Z","repository":{"id":271907624,"uuid":"914949215","full_name":"timothepearce/synda","owner":"timothepearce","description":"A CLI for generating synthetic data","archived":false,"fork":false,"pushed_at":"2025-05-14T07:30:26.000Z","size":863,"stargazers_count":42,"open_issues_count":7,"forks_count":10,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-12-05T14:36:36.742Z","etag":null,"topics":["ai","cli","llm","machine-learning","synthetic-data"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/timothepearce.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-01-10T16:34:41.000Z","updated_at":"2025-07-21T13:36:45.000Z","dependencies_parsed_at":"2025-02-01T17:22:23.777Z","dependency_job_id":"eaaeabdd-f919-49d6-a060-1b6a43b4ff18","html_url":"https://github.com/timothepearce/synda","commit_stats":null,"previous_names":["timothepearce/nebula","timothepearce/synda"],"tags_count":10,"template":false,"template_full_name":null,"purl":"pkg:github/timothepearce/synda","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timothepearce%2Fsynda","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timothepearce%2Fsynda/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timothepearce%2Fsynda/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ti
mothepearce%2Fsynda/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/timothepearce","download_url":"https://codeload.github.com/timothepearce/synda/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timothepearce%2Fsynda/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28720454,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-24T05:53:42.649Z","status":"ssl_error","status_checked_at":"2026-01-24T05:53:41.698Z","response_time":89,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","cli","llm","machine-learning","synthetic-data"],"created_at":"2026-01-24T08:12:07.444Z","updated_at":"2026-01-24T08:12:09.049Z","avatar_url":"https://github.com/timothepearce.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Synda\n\n\u003e [!WARNING]\n\u003e This project is in its very early stages of development and should not be used in production environments.\n\n\u003e [!NOTE]\n\u003e PRs are more than welcome. Check the roadmap if you want to contribute, or create a discussion to submit a use case.\n\nSynda (*synthetic data*) is a package that allows you to create synthetic data generation pipelines. 
\nIt is opinionated and fast by design, with plans to become highly configurable in the future.\n\n\n## Installation\n\nSynda requires Python 3.10 or higher.\n\nYou can install Synda using pipx:\n\n```bash\npipx install synda\n```\n\n## Usage\n\n1. Create a YAML configuration file (e.g., `config.yaml`) that defines your pipeline:\n\n```yaml\ninput:\n  type: csv\n  properties:\n    path: tests/stubs/simple_pipeline/source.csv\n    target_column: content\n    separator: \"\\t\"\n\npipeline:\n  - type: split\n    method: chunk\n    name: chunk_faq\n    parameters:\n      size: 500\n      # overlap: 20\n\n  - type: split\n    method: separator\n    name: sentence_chunk_faq\n    parameters:\n      separator: .\n      keep_separator: true\n\n  - type: generation\n    method: llm\n    parameters:\n      provider: openai\n      model: gpt-4o-mini\n      template: |\n        Ask a question regarding the sentence about the content.\n        content: {chunk_faq}\n        sentence: {sentence_chunk_faq}\n\n        Instructions :\n        1. Use english only\n        2. Keep it short\n\n        question:\n\n  - type: clean\n    method: deduplicate-tf-idf\n    parameters:\n      strategy: fuzzy\n      similarity_threshold: 0.9\n      keep: first \n\n  - type: ablation\n    method: llm-judge-binary\n    parameters:\n      provider: openai\n      model: gpt-4o-mini\n      consensus: all # any, majority\n      criteria:\n        - Is the question written in english?\n        - Is the question consistent?\n\noutput:\n  type: csv\n  properties:\n    path: tests/stubs/simple_pipeline/output.csv\n    separator: \"\\t\"\n```\n\n2. Add a model provider:\n\n```bash\nsynda provider add openai --api-key [YOUR_API_KEY]\n```\n\n3. 
Generate some synthetic data:\n\n```bash\nsynda generate config.yaml\n```\n\n## Pipeline Structure\n\nThe Synda pipeline consists of three main parts:\n\n- **Input**: Data source configuration\n- **Pipeline**: Sequence of transformation and generation steps\n- **Output**: Configuration for the generated data output\n\n### Available Pipeline Steps\n\nCurrently, Synda supports five pipeline steps (the first four appear in the example above):\n\n- **split**: Breaks down data (`method: chunk` or `method: separator`)\n- **generation**: Generates content using LLMs (`method: llm`)\n- **clean**: Removes duplicate data (`method: deduplicate-tf-idf`)\n- **ablation**: Filters data based on defined criteria (`method: llm-judge-binary`)\n- **metadata**: Adds metadata to text (`method: word-position`)\n\nMore steps will be added in future releases.\n\n## Roadmap\n\nThe following features are planned for future releases.\n\n### Core\n- [x] Implement a Proof of Concept\n- [x] Implement a common interface (Node) for input and output of each step\n- [x] Add SQLite support\n- [x] Add setter command for provider variable (openai, etc.)\n- [x] Store each execution and step in DB\n- [x] Add \"split\" -\u003e \"separator\" step\n- [x] Add named step\n- [x] Store each Node in DB\n- [x] Add \"clean\" -\u003e \"deduplicate\" step\n- [x] Allow injecting params from distant step into prompt\n- [x] Add Ollama with structured generation output\n- [x] Retry a failed run\n- [ ] Add asynchronous behaviour for any CLI\n- [ ] Add vLLM with structured generation output\n- [ ] Batch processing logic (via param.) 
for LLM steps\n- [ ] Move input into pipeline (step type: 'load')\n- [ ] Move output into pipeline (step type: 'export')\n- [ ] Allow pausing and resuming pipelines\n- [ ] Trace each synthetic data item with its history\n- [ ] Enable caching of each step's output\n- [ ] Implement a custom scriptable step for developers\n- [ ] Use Ray for large workloads\n- [ ] Add a programmatic API\n\n### Steps\n- [x] input/output: .xls format\n- [ ] input/output: Hugging Face datasets\n- [ ] chunk: Semantic chunks\n- [ ] clean: embedding deduplication\n- [ ] ablation: LLMs as a jury\n- [ ] masking: NER (GliNER)\n- [ ] masking: Regexp\n- [ ] masking: PII\n- [ ] metadata: Word position\n- [ ] metadata: Regexp\n\n### Ideas\n- [ ] translations (SeamlessM4T)\n- [ ] speech-to-text\n- [ ] text-to-speech\n- [ ] metadata extraction\n- [ ] tSNE / PCA\n- [ ] custom steps?\n\n## License\n\nThis project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftimothepearce%2Fsynda","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftimothepearce%2Fsynda","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftimothepearce%2Fsynda/lists"}