https://github.com/dagucloud/duckdb

Last synced: about 2 months ago
JSON representation
Host: GitHub
URL: https://github.com/dagucloud/duckdb
Owner: dagucloud
Created: 2026-05-18T05:33:25.000Z (2 months ago)
Default Branch: main
Last Pushed: 2026-05-18T06:13:19.000Z (2 months ago)
Last Synced: 2026-05-18T08:30:10.136Z (2 months ago)
Language: Shell
Size: 4.88 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project

README

          # Dagu DuckDB Action

Official Dagu action for running DuckDB SQL through the DuckDB CLI.

This action keeps DuckDB out of the Dagu core binary, so Dagu can remain

portable and cgo-free. The action pins the DuckDB CLI with Dagu `tools`, which

uses aqua internally.

## Usage

```yaml

type: graph

steps:

  - id: query

    action: duckdb@v1

    with:

      query: |

        SELECT 42 AS answer, 'duckdb' AS engine;

  - id: print_result

    depends: [query]

    run: printf '%s\n' '${query.outputs.result}'

```

The default output format is DuckDB JSON mode, so `result` is a JSON string:

```json

[{"answer":42,"engine":"duckdb"}]

```

## Existing DuckDB Files

Use `database` to run SQL against an existing DuckDB file:

```yaml

type: graph

steps:

  - id: query_existing_db

    action: duckdb@v1

    with:

      database: /data/analytics.duckdb

      query: |

        SELECT count(*) AS users FROM users;

```

Use `workdir` when the database path or files referenced by SQL should be

resolved relative to a directory:

```yaml

type: graph

steps:

  - id: query_project_db

    action: duckdb@v1

    with:

      workdir: /data/project

      database: analytics.duckdb

      query: |

        SELECT * FROM read_csv_auto('events.csv') LIMIT 10;

```

The database file must exist on the worker that runs the action. In distributed

shared-nothing mode, use a shared mount or an absolute path available on that

worker. `:memory:` is scoped to one action invocation, so it cannot share state

between multiple action steps.

## Multiple Operations

For tightly coupled operations, run multiple SQL statements in one action. This

keeps them in one DuckDB process and lets you use a transaction boundary:

```yaml

type: graph

steps:

  - id: update_metrics

    action: duckdb@v1

    with:

      database: /data/analytics.duckdb

      query: |

        BEGIN TRANSACTION;

        CREATE TABLE IF NOT EXISTS metrics (

          name VARCHAR,

          value INTEGER

        );

        INSERT INTO metrics VALUES ('runs', 1);

        UPDATE metrics

        SET value = value + 1

        WHERE name = 'runs';

        COMMIT;

        SELECT * FROM metrics;

```

For separate DAG visibility, use multiple action steps against the same database

file and connect them with `depends`:

```yaml

type: graph

steps:

  - id: insert_rows

    action: duckdb@v1

    with:

      database: /data/analytics.duckdb

      query: |

        INSERT INTO metrics VALUES ('jobs', 10);

  - id: update_rows

    depends: [insert_rows]

    action: duckdb@v1

    with:

      database: /data/analytics.duckdb

      query: |

        UPDATE metrics SET value = value + 5 WHERE name = 'jobs';

  - id: select_rows

    depends: [update_rows]

    action: duckdb@v1

    with:

      database: /data/analytics.duckdb

      readonly: true

      query: |

        SELECT * FROM metrics WHERE name = 'jobs';

  - id: print_result

    depends: [select_rows]

    run: printf '%s\n' '${select_rows.outputs.result}'

```

Keep write operations ordered with `depends`. Parallel writes to the same DuckDB

file can conflict because DuckDB uses file-level locking semantics.

## Artifacts and Large Results

The `result` output is for small values that need to flow through Dagu variables:

counts, IDs, status rows, or compact JSON. Do not use `${step.outputs.result}`

as a transport for large rowsets.

When the query result itself should be kept with the DAG run, stream DuckDB

stdout directly to a run artifact:

```yaml

type: graph

tools:

  - duckdb/duckdb@v1.5.2

steps:

  - id: export_rows

    run: |

      duckdb -batch -bail -no-stdin -csv /data/source.duckdb \

        -c "SELECT id, name, score FROM source_table WHERE score >= 80"

    stdout:

      artifact: exports/selected_rows.csv

```

This keeps the CSV out of Dagu output variables while making it available in the

run's Artifacts tab.

You can then load that artifact file in a later DuckDB step when the artifact

directory is readable by that step:

```yaml

type: graph

tools:

  - duckdb/duckdb@v1.5.2

steps:

  - id: export_rows

    run: |

      duckdb -batch -bail -no-stdin -csv /data/source.duckdb \

        -c "SELECT id, name, score FROM source_table WHERE score >= 80"

    stdout:

      artifact: exports/selected_rows.csv

  - id: insert_rows

    depends: [export_rows]

    action: duckdb@v1

    with:

      database: /data/target.duckdb

      query: |

        INSERT INTO target_table

        SELECT *

        FROM read_csv_auto('${DAG_RUN_ARTIFACTS_DIR}/exports/selected_rows.csv');

```

For large or typed datasets, prefer writing Parquet from SQL:

```yaml

steps:

  - id: export_parquet

    action: duckdb@v1

    with:

      database: /data/source.duckdb

      query: |

        COPY (

          SELECT id, name, score

          FROM source_table

          WHERE score >= 80

        )

        TO '${DAG_RUN_ARTIFACTS_DIR}/exports/selected_rows.parquet'

        (FORMAT parquet);

  - id: insert_parquet

    depends: [export_parquet]

    action: duckdb@v1

    with:

      database: /data/target.duckdb

      query: |

        INSERT INTO target_table

        SELECT *

        FROM read_parquet('${DAG_RUN_ARTIFACTS_DIR}/exports/selected_rows.parquet');

```

In distributed shared-nothing mode, an artifact path may be worker-local while

the run is still executing. For cross-worker data handoff, use a shared mounted

path, object storage, or keep the operation inside one DuckDB statement with

`ATTACH` / `INSERT INTO ... SELECT`.

## Inputs

| Name | Type | Required | Default | Description |

|------|------|----------|---------|-------------|

| `query` | string | Yes | | SQL passed to `duckdb -c`. |

| `database` | string | No | transient in-memory database | Database file path. Use an absolute path when the file lives outside the action workspace. |

| `workdir` | string | No | action workspace | Directory to `cd` into before running DuckDB. Use this when SQL references local files with relative paths. |

| `format` | string | No | `json` | Output format: `json`, `csv`, `table`, `markdown`, `line`, `list`, or `column`. |

| `readonly` | boolean | No | `false` | Open the database in read-only mode. |

## Outputs

| Name | Type | Description |

|------|------|-------------|

| `result` | string | Raw DuckDB stdout in the selected format. |

## Local Development

Use `source:` to call a local checkout:

```yaml

steps:

  - id: query

    action: source:file:///path/to/duckdb@local

    with:

      query: SELECT 1 AS ok;

```

Remote actions run in their own action workspace. If a query needs files from a

caller workspace, pass `workdir` and use paths that exist on the worker running

the action.
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/dagucloud/duckdb

Awesome Lists containing this project

README