{"id":47686296,"url":"https://github.com/databricks-solutions/lakeflow-tapworks","last_synced_at":"2026-04-02T14:51:34.134Z","repository":{"id":342144723,"uuid":"1155839452","full_name":"databricks-solutions/lakeflow-tapworks","owner":"databricks-solutions","description":null,"archived":false,"fork":false,"pushed_at":"2026-03-12T02:16:10.000Z","size":767,"stargazers_count":0,"open_issues_count":7,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-12T08:06:06.856Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/databricks-solutions.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS.txt","security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE.md","maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-02-12T00:47:21.000Z","updated_at":"2026-02-25T04:47:25.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/databricks-solutions/lakeflow-tapworks","commit_stats":null,"previous_names":["databricks-solutions/lakeflow-tapworks"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/databricks-solutions/lakeflow-tapworks","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks-solutions%2Flakeflow-tapworks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks-solutions%2Flakeflow-tapworks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databric
ks-solutions%2Flakeflow-tapworks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks-solutions%2Flakeflow-tapworks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/databricks-solutions","download_url":"https://codeload.github.com/databricks-solutions/lakeflow-tapworks/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks-solutions%2Flakeflow-tapworks/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31308446,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-02T12:59:32.332Z","status":"ssl_error","status_checked_at":"2026-04-02T12:54:48.875Z","response_time":89,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-04-02T14:51:33.292Z","updated_at":"2026-04-02T14:51:34.127Z","avatar_url":"https://github.com/databricks-solutions.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Lakeflow Tapworks\n\nAutomated Load Balancer and DAB (Databricks Asset Bundle) generation toolkit for [Databricks Managed Lakeflow Connectors](https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/) .\n\n## Problem\n\nDAB is the recommended way for deploying Lakeflow connectors, however, manually creating and maintaining DAB 
templates for Lakeflow connectors doesn't scale. Common challenges include:\n\n- **Manual table/object management** - Adding hundreds or thousands of tables to DAB templates by hand is error-prone and time-consuming\n- **Load balancing** - Distributing tables across pipelines based on size, SLAs, or performance metrics is impossible to do manually at scale\n- **Naming conventions** - Table mapping for sources with unsupported characters (e.g., SAP tables with \"/\") or enforcing naming standards can be automated\n- **DAB syntax errors** - Minor syntax mistakes (e.g., missing spaces) cause errors and can be difficult to troubleshoot\n- **Config re-use** - Existing table configurations from other tools can't be reused as input for migration\n\n## Solution\n\nTapworks reads from a simple configuration (e.g., CSV, YAML, JSON, Delta table, or any DataFrame source) and automatically generates complete DAB packages with load balancing, validation, and proper syntax while splitting the specified tables across pipelines for performance (load balancing).\n\n```\n┌─────────────────┐     ┌─────────────────────────────────┐     ┌─────────────────┐     ┌─────────────────┐\n│     INPUT       │     │           TAPWORKS              │     │     OUTPUT      │     │     DEPLOY      │\n│                 │     │                                 │     │   DAB Package   │     │                 │\n│  • Table config │────▶│  1. Validate config             │────▶│  databricks.yml │────▶│  bundle deploy  │\n│    (CSV/Delta)  │     │  2. Apply defaults/overrides    │     │  resources/     │     │                 │\n│  • Targets      │     │  3. Load balance (split tables) │     │   pipelines.yml │     │                 │\n│  • Defaults     │     │  4. 
Generate YAML               │     │   jobs.yml      │     │                 │\n│  • Overrides    │     │                                 │     │   gateways.yml  │     │                 │\n└─────────────────┘     └─────────────────────────────────┘     └─────────────────┘     └─────────────────┘\n```\n\n## How It Works\n\n1. **Define your config** - Specify at least the source/target mappings, plus any extra configuration (e.g., schedule, SCD type, gateway driver type), and the target environments. Targets let you deploy to different workspaces (e.g., dev, staging, prod).\n\n\n    **Example of a basic CSV config**\n      ```csv\n      source_schema,source_table,target_catalog,target_schema,target_table,connection_name\n      dbo,customers,bronze,sales,customers,sql_server_conn\n      ```\n\n    **Example target environments**\n    ```python\n    {\n        'dev': {'workspace_host': 'https://dev.cloud.databricks.com'},\n        'prod': {'workspace_host': 'https://prod.cloud.databricks.com', \"root_path\": \"/Shared/pipelines/prod\"},\n    }\n    ```\n\n\n2. **Run the generator** - From CLI or notebook. 
This will write the DAB templates into the specified output dir.\n\n    **Output Structure**\n\n    ```\n    output/\u003cproject_name\u003e/\n      databricks.yml\n      resources/\n        gateways.yml    # database connectors only\n        pipelines.yml\n        jobs.yml\n    ```\n\n\n\n      **CLI:**\n      ```bash\n      # Install the package first\n      pip install -e .\n\n      # List available connectors\n      tapworks --list\n\n      # Show connector requirements\n      tapworks sql_server --info\n\n      # Generate DAB files\n      tapworks sql_server --input-config tables.csv --output-dir output \\\n        --targets '{\"dev\": {\"workspace_host\": \"https://your-workspace.databricks.com\"}}'\n      ```\n\n      **Notebook / Python:**\n      ```python\n      from tapworks.core import run_pipeline_generation\n\n      result = run_pipeline_generation(\n          connector_name='sql_server',\n          input_source='tables.csv',  # or Delta table or DataFrame\n          output_dir='output',\n          targets={'dev': {'workspace_host': 'https://your-workspace.databricks.com'}},\n      )\n      ```\n\n3. **Deploy** - Use the generated DAB files with [`databricks bundle deploy`](https://docs.databricks.com/aws/en/dev-tools/cli/bundle-commands)\n\n\n## Load Balancing\n\nTapworks automatically distributes tables across pipelines and gateways. This is done according to limitations of the pipelines (e.g., maximum recommended number of tables per pipeline), and can be adjusted via user config. 
Users can group pipelines together using project, prefix, or subgroups.\n\n**Example CSV with grouping columns**\n```csv\nsource_table,target_catalog,target_schema,target_table,connection_name,prefix,subgroup\ncustomers,main,bronze,customers,sql_conn,sales,1\norders,main,bronze,orders,sql_conn,sales,1\nproducts,main,bronze,products,sql_conn,sales,2\nemployees,main,bronze,employees,sql_conn,hr,1\n```\n\n### Hierarchy\n\n**SaaS connectors** (e.g., Salesforce, GA4, ServiceNow, Workday):\n```\nProject (DAB Package)\n└── Prefix + Subgroup (logical grouping)\n    └── Pipeline(s) - auto-split if \u003e 250 tables\n```\n\n**Database connectors** (e.g., SQL Server, PostgreSQL):\n```\nProject (DAB Package)\n└── Prefix + Subgroup (logical grouping)\n    └── Gateway(s) - auto-split if \u003e 250 tables\n        └── Pipeline(s) - auto-split if \u003e 250 tables per gateway\n```\n\n### Auto-Distribution\n\nTables are automatically split based on configurable limits (default: 250 tables per pipeline/gateway):\n\n**SaaS connector example** (600 tables):\n```\n              Input: 600 tables, prefix=\"sales\", subgroup=\"01\"\n                                      │\n                    ┌─────────────────┼─────────────────┐\n                    ▼                 ▼                 ▼\n            ┌───────────────┐ ┌───────────────┐ ┌───────────────┐\n            │   Pipeline    │ │   Pipeline    │ │   Pipeline    │\n            │ sales_01_g01  │ │ sales_01_g02  │ │ sales_01_g03  │\n            │ (250 tables)  │ │ (250 tables)  │ │ (100 tables)  │\n            └───────────────┘ └───────────────┘ └───────────────┘\n```\n\n**Database connector example** (600 tables):\n```\n              Input: 600 tables, prefix=\"sales\", subgroup=\"01\"\n                                      │\n                    ┌─────────────────┴─────────────────┐\n                    ▼                                   ▼\n            ┌───────────────┐                   ┌───────────────┐\n            │    Gateway  
  │                   │    Gateway    │\n            │ sales_01_g01  │                   │ sales_01_g02  │\n            │ (500 tables)  │                   │ (100 tables)  │\n            └───────┬───────┘                   └───────┬───────┘\n                    │                                   │\n          ┌─────────┴─────────┐                         │\n          ▼                   ▼                         ▼\n   ┌───────────────┐   ┌───────────────┐        ┌───────────────┐\n   │   Pipeline    │   │   Pipeline    │        │   Pipeline    │\n   │  sales_01_    │   │  sales_01_    │        │  sales_01_    │\n   │  g01_p01      │   │  g01_p02      │        │  g02_p01      │\n   │ (250 tables)  │   │ (250 tables)  │        │ (100 tables)  │\n   └───────────────┘   └───────────────┘        └───────────────┘\n```\n\n### Manual Subgroups\n\nUse subgroups to isolate specific tables (e.g., critical or high-volume tables).\n**Note:** When using subgroups, all tables in a prefix must have explicit subgroups.\n\n```\n                    prefix=\"sales\"\n                          │\n          ┌───────────────┴───────────────┐\n          ▼                               ▼\n      subgroup=\"01\"                  subgroup=\"02\"\n    (5 critical tables)          (595 remaining tables)\n          │                               │\n          ▼                               ▼\n  ┌───────────────┐       ┌───────────────┬───────────────┬───────────────┐\n  │   Pipeline    │       │   Pipeline    │   Pipeline    │   Pipeline    │\n  │ sales_01_p01  │       │ sales_02_p01  │ sales_02_p02  │ sales_02_p03  │\n  │  (5 tables)   │       │ (250 tables)  │ (250 tables)  │  (95 tables)  │\n  └───────────────┘       └───────────────┴───────────────┴───────────────┘\n```\n\n## Resource Naming\n\nTapworks generates DAB resource names from `project_name`, `prefix`, and `subgroup`. 
Suffixes are always present for stable naming — adding more tables never renames existing resources.\n\n```\nproject_name  →  required, no default\nprefix        →  falls back to project_name if not specified\nsubgroup      →  defaults to \"01\" if not specified\nbase_group    =  {prefix}_{subgroup}\n```\n\n**Database connector** (with gateways):\n\n| Resource | Pattern | Example |\n|---|---|---|\n| Gateway (resource key) | `gateway_{base_group}_g{NN}` | `gateway_sales_01_g01` |\n| Gateway (display name) | `{base_group}_g{NN}` | `sales_01_g01` |\n| Pipeline (resource name) | `pipeline_{base_group}_g{NN}_p{NN}` | `pipeline_sales_01_g01_p01` |\n| Pipeline (display name) | `{base_group}_g{NN}_p{NN}` | `sales_01_g01_p01` |\n| Job (resource name) | `job_{base_group}_g{NN}_p{NN}` | `job_sales_01_g01_p01` |\n| Job (display name) | `{base_group}_g{NN}_p{NN}` | `sales_01_g01_p01` |\n\n**SaaS connector** (no gateways):\n\n| Resource | Pattern | Example |\n|---|---|---|\n| Pipeline (resource name) | `pipeline_{base_group}_p{NN}` | `pipeline_sales_01_p01` |\n| Pipeline (display name) | `{base_group}_p{NN}` | `sales_01_p01` |\n| Job (resource name) | `job_{base_group}_p{NN}` | `job_sales_01_p01` |\n| Job (display name) | `{base_group}_p{NN}` | `sales_01_p01` |\n\n\u003e **Important:** Prefixes must be unique per workspace. Using the same prefix across different projects deployed to the same workspace will cause resource name collisions. 
Use distinct prefixes (or distinct `project_name` values if relying on the prefix fallback) for each project.\n\n## Defaults and Overrides\n\nUsers can set configuration per object or pipeline in the input config, or apply it to a group of pipelines when calling the generator, using `default_values` and `override_config`.\n\n- **default_values** - Fill missing/empty columns with defaults (e.g., set schedule for rows that don't have one)\n- **override_config** - Force values for ALL rows, ignoring what's in the input (e.g., pause all jobs)\n\n### Simple Configuration\n\nApply the same default values to all rows using a flat dictionary:\n\n```python\ndefault_values = {\n    'schedule': '0 */6 * * *',\n    'pause_status': 'UNPAUSED',\n}\n```\n\n### Group-Based Configuration\n\nApply different values per pipeline group using nested dictionaries:\n\n```python\ndefault_values = {\n    '*': {'schedule': '0 */6 * * *'},        # Global fallback\n    'sales': {'schedule': '*/15 * * * *'},   # All sales pipelines\n    'hr': {'schedule': '0 0 * * *'},         # HR pipelines\n}\n\noverride_config = {\n    '*': {'pause_status': 'UNPAUSED'},\n    'finance': {'pause_status': 'PAUSED'},   # Pause finance for audit\n}\n```\n\n**Matching precedence** (most specific wins):\n1. `pipeline_group` (prefix_subgroup) - e.g., `'sales_2'`\n2. `prefix` - e.g., `'sales'`\n3. `project_name` - e.g., `'my_project'`\n4. 
`'*'` (global)\n\nSee [examples/features/group_based_config](./examples/features/group_based_config) (\u003ca href=\"$./examples/features/group_based_config\"\u003eDatabricks\u003c/a\u003e) for detailed examples.\n\n### CLI Examples\n\n**Inline JSON:**\n```bash\ntapworks salesforce --input-config tables.csv --output-dir output \\\n  --targets '{\"dev\": {\"workspace_host\": \"https://dev.cloud.databricks.com\"}}' \\\n  --default-values '{\"project_name\": \"sfdc_prod\", \"schedule\": \"0 */6 * * *\"}' \\\n  --override '{\"pause_status\": \"PAUSED\"}'\n```\n\n\n**Using settings file:**\n```bash\ntapworks salesforce --input-config tables.csv --output-dir output --settings settings.json\n```\n\n**Settings file (settings.json):**\n```json\n{\n  \"targets\": {\n    \"dev\": {\n      \"workspace_host\": \"https://dev.cloud.databricks.com\",\n      \"root_path\": \"/Shared/pipelines/dev\"\n    },\n    \"prod\": {\n      \"workspace_host\": \"https://prod.cloud.databricks.com\",\n      \"root_path\": \"/Shared/pipelines/prod\"\n    }\n  },\n  \"default_values\": {\n    \"project_name\": \"sfdc_prod\",\n    \"schedule\": \"0 */6 * * *\"\n  },\n  \"override_input_config\": {\n    \"pause_status\": \"PAUSED\"\n  }\n}\n```\n\n### Notebook Example\n\n```python\nfrom tapworks.core import run_pipeline_generation\n\nresult = run_pipeline_generation(\n    connector_name='salesforce',\n    input_source='config.csv',\n    output_dir='output',\n    targets={\n        'dev': {'workspace_host': 'https://dev.cloud.databricks.com'},\n        'prod': {'workspace_host': 'https://prod.cloud.databricks.com'},\n    },\n    # Fill missing values\n    default_values={\n        'project_name': 'sfdc_prod',\n        'schedule': '0 */6 * * *',\n    },\n    # Override ALL rows (e.g., pause all jobs during maintenance)\n    override_config={\n        'pause_status': 'PAUSED',\n    },\n)\n```\n\n\n\n## Documentation\n\n- [USAGE.md](./docs/USAGE.md) (\u003ca 
href=\"$./docs/USAGE.md\"\u003eDatabricks\u003c/a\u003e) - CLI and notebook usage examples for all connectors\n- [ARCHITECTURE.md](./docs/ARCHITECTURE.md) (\u003ca href=\"$./docs/ARCHITECTURE.md\"\u003eDatabricks\u003c/a\u003e) - Technical architecture and class hierarchy\n\n## License\n\n\u0026copy; 2025 Databricks, Inc. All rights reserved. The source in this notebook is provided subject to the Databricks License [https://databricks.com/db-license-source].\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabricks-solutions%2Flakeflow-tapworks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatabricks-solutions%2Flakeflow-tapworks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabricks-solutions%2Flakeflow-tapworks/lists"}