{"id":27777831,"url":"https://github.com/mirpo/datamatic","last_synced_at":"2026-04-28T08:01:30.093Z","repository":{"id":290080663,"uuid":"970231208","full_name":"mirpo/datamatic","owner":"mirpo","description":"Build multi-step AI workflows with schema-guided reasoning. Supports Ollama, LMStudio, OpenAI, OpenRouter, Gemini, and all latest models for structured generation, chaining, and data processing.","archived":false,"fork":false,"pushed_at":"2026-04-07T21:40:38.000Z","size":234,"stargazers_count":2,"open_issues_count":2,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-07T23:23:24.776Z","etag":null,"topics":["agentic-ai","ai-workflow","dataset","deepseek-r1","jsonl","llama3","llm","lmstudio","localllm","ollama","phi4","synthetic-data","synthetic-dataset-generation"],"latest_commit_sha":null,"homepage":"https://github.com/mirpo/datamatic","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mirpo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-04-21T17:30:14.000Z","updated_at":"2026-03-20T19:00:14.000Z","dependencies_parsed_at":"2025-05-19T18:27:16.277Z","dependency_job_id":"ab4724a3-f881-4e73-8a92-614a1623d237","html_url":"https://github.com/mirpo/datamatic","commit_stats":null,"previous_names":["mirpo/datamatic"],"tags_count":19,"template":false,"template_full_name":null,"purl":"pkg:github/mirpo/datamatic","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposi
tories/mirpo%2Fdatamatic","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mirpo%2Fdatamatic/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mirpo%2Fdatamatic/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mirpo%2Fdatamatic/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mirpo","download_url":"https://codeload.github.com/mirpo/datamatic/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mirpo%2Fdatamatic/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32371672,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-27T20:07:02.737Z","status":"online","status_checked_at":"2026-04-28T02:00:07.250Z","response_time":56,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agentic-ai","ai-workflow","dataset","deepseek-r1","jsonl","llama3","llm","lmstudio","localllm","ollama","phi4","synthetic-data","synthetic-dataset-generation"],"created_at":"2025-04-30T07:56:43.366Z","updated_at":"2026-04-28T08:01:30.084Z","avatar_url":"https://github.com/mirpo.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# datamatic\n\n[![Tests](https://github.com/mirpo/datamatic/actions/workflows/tests.yml/badge.svg)](https://github.com/mirpo/datamatic/actions/workflows/tests.yml)\n[![Go 
Version](https://img.shields.io/github/go-mod/go-version/mirpo/datamatic)](https://golang.org/)\n[![Release](https://img.shields.io/github/v/release/mirpo/datamatic)](https://github.com/mirpo/datamatic/releases)\n[![License](https://img.shields.io/github/license/mirpo/datamatic)](https://github.com/mirpo/datamatic/blob/main/LICENSE)\n\nBuild multi-step AI workflows with schema-guided reasoning. Works with Ollama, LMStudio, OpenAI, OpenRouter, Gemini, and all the latest models for structured generation, chaining, and data processing.\n\n## Features\n\n### AI Provider Support\n- **[Ollama](https://ollama.com/download)** - Local model inference\n- **[LM Studio](https://lmstudio.ai/download)** - Local model management\n- **[OpenAI](https://openai.com/)** - Cloud-based models\n- **[OpenRouter](https://openrouter.ai/)** - Multi-provider access\n- **[Gemini](https://deepmind.google/models/gemini/)** - Google DeepMind's multimodal LLMs\n\n### Workflow Capabilities\n- **JSON Schema Validation** - Structured output with type safety (YAML-native or JSON string formats)\n- **Text Generation** - Flexible content creation\n- **Multi-step Chaining** - Link generation steps together with template variables\n- **Schema-Guided Reasoning (SGR)** - Guide LLMs through systematic analysis using structured schemas\n- **Image Analysis** - Visual model integration\n\n### Extensibility\n- **CLI Integration** - Use any command-line tool as a step\n- **Dataset Loading** - Import from [Huggingface](https://huggingface.co/datasets)\n- **Data Transformation** - Built-in [jq](https://github.com/jqlang/jq) support\n- **Environment Variables** - Dynamic configuration with `$VAR` syntax\n- **Retry Logic** - Smart error handling and recovery\n\n## Installation\n\n### Homebrew\n\n```shell\nbrew tap mirpo/homebrew-tools\nbrew install datamatic\n```\n\n### Using Go Install\n\n```shell\ngo install github.com/mirpo/datamatic@latest\n```\n\n### From source\n\n```bash\ngit clone 
https://github.com/mirpo/datamatic.git\ncd datamatic\nmake build\n```\n\n## Use Cases\n\n- **Synthetic Data Generation** - Create training datasets for fine-tuning LLMs\n- **Document Classification** - Systematic analysis with structured reasoning\n- **SQL Query Generation** - Chain-of-thought reasoning for complex queries\n- **Multi-step Processing Pipelines** - CV analysis, data transformation, content generation\n- **Vision Workflows** - Image analysis combined with text generation\n- **Data Integration** - Combine HuggingFace datasets with LLM processing\n\n## Quick Start\n\nCreate a configuration file and run datamatic:\n\n```yaml\n# config.yaml\nversion: 1.0\nsteps:\n  - name: generate_titles\n    model: ollama:llama3.2\n    prompt: Generate a catchy news title\n    jsonSchema:\n      type: object\n      properties:\n        title:\n          type: string\n        tags:\n          type: array\n          items:\n            type: string\n      required:\n        - title\n        - tags\n      additionalProperties: false\n\n  - name: analyze_title\n    model: ollama:llama3.2\n    prompt: |\n      Analyze this news title and provide sentiment and category analysis:\n      Title: {{.generate_titles.title}}\n    jsonSchema: |\n      {\n        \"type\": \"object\",\n        \"properties\": {\n          \"sentiment\": {\"type\": \"string\", \"enum\": [\"positive\", \"negative\", \"neutral\"]},\n          \"category\": {\"type\": \"string\", \"description\": \"News category\"},\n          \"clickbait_score\": {\"type\": \"number\", \"minimum\": 0, \"maximum\": 10}\n        },\n        \"required\": [\"sentiment\", \"category\", \"clickbait_score\"]\n      }\n```\n\n```bash\n# Generate data\ndatamatic -config config.yaml\n\n# With debug output\ndatamatic -config config.yaml -verbose -log-pretty\n```\n\n**Other providers:**\n- OpenAI: `model: openai:gpt-4o-mini` + `export OPENAI_API_KEY=sk-...`\n- OpenRouter: `model: openrouter:meta-llama/llama-3.2-3b` + `export 
OPENROUTER_API_KEY=sk-...`\n- Gemini: `model: gemini:gemini-2.0-flash` + `export GEMINI_API_KEY=...`\n\n### Environment Variables\n\nConfigure your pipelines dynamically using `$VAR` syntax:\n\n```yaml\nversion: 1.0\n\nenvVars:\n  - PROVIDER\n  - MODEL\n\nsteps:\n  - name: generate\n    model: $PROVIDER:$MODEL\n    prompt: Generate a creative story\n```\n\n```bash\nPROVIDER=ollama MODEL=llama3.2 datamatic -config config.yaml\n```\n\nVariables listed in `envVars` are validated before execution (fail-fast). See [Multi-Stage Pipeline example](./examples/v1/18.%20workdir-multi-stage-pipeline/README.md) for more details.\n\n## Output Format\n\nDatamatic outputs structured data in JSONL format:\n\n```go\ntype LineEntity struct {\n\tID       string      `json:\"id\"`\n\tFormat   string      `json:\"format\"`\n\tPrompt   string      `json:\"prompt\"`\n\tResponse interface{} `json:\"response\"`\n\tValues   interface{} `json:\"values\"`\n}\n```\n\n- **Format**: `text` or `json`\n- **Response**: Generated content (text string or JSON object)\n- **Values**: Linked step values for traceability\n\n### Output Examples\n\n**Text line**:\n\n```json\n{\n  \"id\":\"38082542-f352-44d2-88e9-6d68d28dcac4\",\n  \"format\":\"text\",\n  \"prompt\":\"Generate a catchy and one unique news title. Come up with a wildly different and surprising news headline. Return only one news title per request, without any extra thinking.\",\n  \"response\":\"BREAKING: Giant Squid Found Wearing Tiny Top Hat and monocle in Remote Arctic Location\"\n}\n```\n\n**JSON line**:\n\n```json\n{\n  \"id\":\"cc437b10-63c6-443a-9b3e-a7d6c51fc0a0\",\n  \"format\":\"json\",\n  \"prompt\":\"Provide up-to-date information about a randomly selected country, including its name, population, land area, UN membership status, capital city, GDP per capita, official languages, and year of independence. 
Return the data in a structured JSON format according to the schema below.\",\n  \"response\":{\"capitalCity\":\"Bishkek\",\"gdpPerCapita\":1700,\"independenceYear\":1991,\"isUNMember\":true,\"languages\":[\"Kyr Kyrgyz\",\"Russian\"],\"name\":\"Kyrgyzstan\",\"population\":6184000,\"totalCountryArea\":199912}\n}\n```\n\nWith values from linked steps:\n\n```json\n{\n  \"id\":\"dc140355-6c41-4ce7-9127-b8145cf1a23e\",\n  \"format\":\"text\",\n  \"prompt\":\"Write nice tourist brochure about country {{.about_country.name}}, which capital is {{.about_country.capitalCity}}, area {{.about_country.totalCountryArea}}, independenceYear: {{.about_country.independenceYear}} and official languages are {{.about_country.languages}}.\",\n  \"response\":\"**Discover the Hidden Gem of Central Asia: Kyrgyzstan**\\n\\nTucked away in the heart of Central Asia, Kyrgyzstan is a land of breathtaking beauty, rich history, and warm hospitality. Our capital city, Bishkek, is a bustling metropolis surrounded by the stunning Tian Shan mountains, waiting to be explored.\\n\\n**A Brief History**\\n\\nKyrgyzstan gained its independence on August 31, 1991...\",\n  \"values\":{\".about_country.capitalCity\":{\"id\":\"cc437b10-63c6-443a-9b3e-a7d6c51fc0a0\",\"content\":\"Bishkek\"},\".about_country.independenceYear\":{\"id\":\"cc437b10-63c6-443a-9b3e-a7d6c51fc0a0\",\"content\":\"1991\"},\".about_country.languages\":{\"id\":\"cc437b10-63c6-443a-9b3e-a7d6c51fc0a0\",\"content\":\"Kyr Kyrgyz, Russian\"},\".about_country.name\":{\"id\":\"cc437b10-63c6-443a-9b3e-a7d6c51fc0a0\",\"content\":\"Kyrgyzstan\"},\".about_country.totalCountryArea\":{\"id\":\"cc437b10-63c6-443a-9b3e-a7d6c51fc0a0\",\"content\":\"199912\"}}\n}\n```\n\n## CLI Reference\n\n```bash\ndatamatic [OPTIONS]\n\nOptions:\n  -config string\n        Config file path\n  -http-timeout int\n        HTTP timeout: 0 - no timeout, if number - recommended to put high on poor hardware (default 300)\n  -log-pretty\n        Enable pretty logging, JSON when 
false (default true)\n  -output string\n        Output folder path (default \"dataset\")\n  -skip-cli-warning\n        Skip external CLI warning\n  -validate-response\n        Validate JSON response from server to match the schema (default true)\n  -verbose\n        Enable DEBUG logging level\n  -version\n        Get current version of datamatic\n```\n\n## Examples\n\n### Getting Started\n| Example | Description | Provider |\n| --- | --- | --- |\n| [Simple Text](./examples/v1/1.%20simple%20text%20generation,%20not%20linked%20steps/README.md) | Basic text generation | Ollama, LM Studio |\n| [Simple JSON](./examples/v1/2.%20simple%20json%20generation,%20not%20linked%20steps/README.md) | Basic JSON generation | Ollama, LM Studio |\n| [Linked Steps](./examples/v1/3.%20complex%20json,%20linked%20steps/README.md) | Multi-step chaining with templates | Ollama |\n\n### Data Integration \u0026 Tool Orchestration\n| Example | Description | Provider |\n| --- | --- | --- |\n| [Huggingface + jq](./examples/v1/4.%20using%20huggingface%20and%20jq%20cli/README.md) | HuggingFace datasets with jq filtering | Ollama |\n| [DuckDB Integration](./examples/v1/5.%20using%20duckdb%20to%20convert%20parquet%20huggingface%20dataset%20and%20lmstudio/README.md) | Parquet to JSONL with DuckDB | LM Studio |\n| [Git Dataset](./examples/v1/6.%20git%20dataset/README.md) | Git command dataset generation | Ollama |\n| [Fine-tuning Data](./examples/v1/7.%20fine-tuning%20dataset/README.md) | Training dataset creation | Ollama |\n| [Vision Models](./examples/v1/8.%20hugginface%20images%20and%20qwen2.5vl%20or%20gemma3/README.md) | Image analysis with vision models | Ollama, LM Studio |\n\n### Cloud Provider Examples\n| Example | Description | Provider |\n| --- | --- | --- |\n| [OpenAI](./examples/v1/9.%20openai-example/README.md) | Using OpenAI models | OpenAI |\n| [OpenRouter](./examples/v1/10.%20openrouter-example/README.md) | Multi-provider via OpenRouter | OpenRouter |\n| 
[Gemini](./examples/v1/11.%20gemini-example/README.md) | Google Gemini integration | Gemini |\n\n### Advanced Workflows \u0026 Reasoning\n| Example | Description | Provider |\n| --- | --- | --- |\n| [CV Processing Pipeline](./examples/v1/12.%20cv-processing-pipeline/README.md) | 3-step CV extraction workflow | Ollama |\n| [Retry Configuration](./examples/v1/13.%20retry%20configuration%20example/README.md) | Error handling and retry logic | Ollama |\n| [Recipe with Nested Fields](./examples/v1/14.%20recipe%20generation%20with%20nested%20fields/README.md) | Nested JSON field access | Ollama |\n| [Math Reasoning](./examples/v1/15.%20simple%20math%20reasoning/README.md) | Step-by-step math problem solving | Ollama |\n| [SQL Reasoning](./examples/v1/16.%20sql%20reasoning%20with%20checklist/README.md) | SQL generation with reasoning checklist | Ollama |\n| [Document Classification](./examples/v1/17.%20document%20classification%20with%20schema-guided%20reasoning/README.md) | Schema-guided classification workflow | Ollama |\n| [Multi-Stage Pipeline](./examples/v1/18.%20workdir-multi-stage-pipeline/README.md) | workDir control and environment variables | Ollama |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmirpo%2Fdatamatic","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmirpo%2Fdatamatic","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmirpo%2Fdatamatic/lists"}