{"id":14563813,"url":"https://github.com/rustyconover/duckdb-shellfs-extension","last_synced_at":"2025-05-12T17:26:00.181Z","repository":{"id":240313333,"uuid":"801514837","full_name":"rustyconover/duckdb-shellfs-extension","owner":"rustyconover","description":"DuckDB extension allowing shell commands to be used for input and output.","archived":false,"fork":false,"pushed_at":"2025-04-18T13:54:33.000Z","size":268,"stargazers_count":69,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-12T03:13:29.987Z","etag":null,"topics":["duckdb","duckdb-extension","popen","shell"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rustyconover.png","metadata":{"files":{"readme":"docs/README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-16T11:33:01.000Z","updated_at":"2025-04-18T13:54:36.000Z","dependencies_parsed_at":"2024-10-30T07:29:46.956Z","dependency_job_id":null,"html_url":"https://github.com/rustyconover/duckdb-shellfs-extension","commit_stats":null,"previous_names":["rustyconover/duckdb-shellfs-extension"],"tags_count":0,"template":false,"template_full_name":"duckdb/extension-template","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rustyconover%2Fduckdb-shellfs-extension","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rustyconover%2Fduckdb-shellfs-extension/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rustyconover%2Fduckdb-shellfs-extension/releases","manifests_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rustyconover%2Fduckdb-shellfs-extension/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rustyconover","download_url":"https://codeload.github.com/rustyconover/duckdb-shellfs-extension/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253785941,"owners_count":21964056,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["duckdb","duckdb-extension","popen","shell"],"created_at":"2024-09-07T02:05:21.089Z","updated_at":"2025-05-12T17:26:00.156Z","avatar_url":"https://github.com/rustyconover.png","language":"C++","funding_links":[],"categories":["shell","Extensions"],"sub_categories":["[Community Extensions](https://duckdb.org/community_extensions/)"],"readme":"# DuckDB Shellfs Extension\n\n![DuckDB Shellfs Extension logo](duckdb-shellfs.jpg)\n\nThe `shellfs` extension for DuckDB enables the use of Unix pipes for input and output.\n\nBy appending a pipe character `|` to a filename, DuckDB will treat it as a series of commands to execute and capture the output. 
Conversely, if you prefix a filename with `|`, DuckDB will treat it as an output pipe.\n\nWhile the examples provided are simple, in practical scenarios, you might use this feature to run another program that generates CSV, JSON, or other formats to manage complexity that DuckDB cannot handle directly.\n\nThe implementation uses `popen()` to create the pipe between processes.\n\n## Installation\n\n**`shellfs` is a [DuckDB Community Extension](https://github.com/duckdb/community-extensions).**\n\nYou can install and load it with the following SQL:\n\n```sql\ninstall shellfs from community;\nload shellfs;\n```\n\n---\n\n## Examples\n\n### Reading input from a pipe\n\n```sql\n\n-- Install the extension.\ninstall shellfs from community;\nload shellfs;\n\n-- Generate a sequence and return only the numbers that contain a 2\nSELECT * from read_csv('seq 1 100 | grep 2 |');\n┌─────────┐\n│ column0 │\n│  int64  │\n├─────────┤\n│       2 │\n│      12 │\n│      20 │\n│      21 │\n│      22 │\n└─────────┘\n\n-- Get the first two multiples of 7 between 1 and 35,\n-- demonstrating how commands can be chained together\nSELECT * from read_csv('seq 1 35 | awk \"\\$1 % 7 == 0\" | head -n 2 |');\n┌─────────┐\n│ column0 │\n│  int64  │\n├─────────┤\n│       7 │\n│      14 │\n└─────────┘\n\n-- Do some arbitrary curl\nSELECT abbreviation, unixtime from\nread_json('curl -s http://worldtimeapi.org/api/timezone/Etc/UTC  |');\n┌──────────────┬────────────┐\n│ abbreviation │  unixtime  │\n│   varchar    │   int64    │\n├──────────────┼────────────┤\n│ UTC          │ 1715983565 │\n└──────────────┴────────────┘\n```\n\n\nCreate a program to generate CSV in Python:\n\n```python\n#!/usr/bin/env python3\n\nprint(\"counter1,counter2\")\nfor i in range(10000000):\n    print(f\"{i},{i}\")\n```\n\nRun that program and determine the number of distinct values it produces:\n\n```sql\nselect count(distinct counter1)\nfrom read_csv('./test-csv.py |');\n┌──────────────────────────┐\n│ count(DISTINCT counter1) │\n│          int64      
     │\n├──────────────────────────┤\n│                 10000000 │\n└──────────────────────────┘\n```\n\nWhen a command is not found or cannot be executed, this is the result:\n\n```sql\nSELECT count(distinct column0) from read_csv('foo |');\nsh: foo: command not found\n┌─────────────────────────┐\n│ count(DISTINCT column0) │\n│          int64          │\n├─────────────────────────┤\n│                       0 │\n└─────────────────────────┘\n```\n\nNo exception is raised in this case because the `popen()` implementation starts a process with [`fork()`](https://man7.org/linux/man-pages/man2/fork.2.html) or the appropriate system call for the operating system, and that step succeeds. The failure only occurs afterwards, when the child process calls [`exec()`](https://man7.org/linux/man-pages/man3/exec.3.html), so the child process produces no output and DuckDB reads an empty result.\n\n### Writing output to a pipe\n\n```sql\n-- Write all numbers from 1 to 30 out, but then filter via grep\n-- for only lines that contain 6.\nCOPY (select * from unnest(generate_series(1, 30)))\nTO '| grep 6 \u003e numbers.csv' (FORMAT 'CSV');\n6\n16\n26\n\n-- Copy the result set to the clipboard on Mac OS X using pbcopy\nCOPY (select 'hello' as type, * from unnest(generate_series(1, 30)))\nTO '| grep 3 | pbcopy' (FORMAT 'CSV');\ntype,\"generate_series(1, 30)\"\nhello,3\nhello,13\nhello,23\nhello,30\n\n-- Write an encrypted file out via openssl\nCOPY (select 'hello' as type, * from unnest(generate_series(1, 30)))\nTO '| openssl enc -aes-256-cbc -salt -in - -out example.enc -pbkdf2 -iter 1000 -pass pass:testing12345' (FORMAT 'JSON');\n\n```\n\n## Configuration\n\nThis extension introduces a new configuration option:\n\n`ignore_sigpipe` - a boolean option that, when set to true, ignores the SIGPIPE signal. This is useful when writing to a pipe that stops reading input. 
For example:\n\n```sql\nCOPY (select 'hello' as type, * from unnest(generate_series(1, 300))) TO '| head -n 100';\n```\n\nIn this scenario, DuckDB attempts to write 300 lines to the pipe, but the `head` command only reads the first 100 lines. After `head` reads the first 100 lines and exits, it closes the pipe. The next time DuckDB tries to write to the pipe, it receives a SIGPIPE signal. By default, this causes DuckDB to exit. However, if `ignore_sigpipe` is set to true, the SIGPIPE signal is ignored, allowing DuckDB to continue without error even if the pipe is closed.\n\nYou can enable this option by setting it with the following command:\n\n```sql\nset ignore_sigpipe = true;\n```\n\n## Caveats\n\nWhen using `read_text()` or `read_blob()` the contents of the data read from a pipe is limited to 2GB in size.  This is the maximum length of a single row's value.\n\nWhen using `read_csv()` or `read_json()` the contents of the pipe can be unlimited as it is processed in a streaming fashion.\n\nA demonstration of this would be:\n\n```python\n#!/usr/bin/env python3\n\nprint(\"counter1,counter2\")\nfor i in range(10000000):\n    print(f\"{i},{i}\")\n```\n\n```sql\nselect count(distinct counter1) from read_csv('./test-csv.py |');\n┌──────────────────────────┐\n│ count(DISTINCT counter1) │\n│          int64           │\n├──────────────────────────┤\n│                 10000000 │\n└──────────────────────────┘\n```\n\nIf a `limit` clause is used you may see an error like this:\n\n```sql\nselect * from read_csv('./test-csv.py |') limit 3;\n┌──────────┬──────────┐\n│ counter1 │ counter2 │\n│  int64   │  int64   │\n├──────────┼──────────┤\n│        0 │        0 │\n│        1 │        1 │\n│        2 │        2 │\n└──────────┴──────────┘\nTraceback (most recent call last):\n  File \"/Users/rusty/Development/duckdb-shell-extension/./test-csv.py\", line 5, in \u003cmodule\u003e\n    print(f\"{i},{i}\")\nBrokenPipeError: [Errno 32] Broken pipe\nException ignored in: 
\u003c_io.TextIOWrapper name='\u003cstdout\u003e' mode='w' encoding='utf-8'\u003e\nBrokenPipeError: [Errno 32] Broken pipe\n```\n\nDuckDB continues to run, but the program that was producing output received a SIGPIPE signal because DuckDB closed the pipe after reading the necessary number of rows.  It is up to the user of DuckDB to decide whether to suppress this behavior by setting the `ignore_sigpipe` configuration parameter.\n\n## Building\n\n### Build steps\nNow to build the extension, run:\n```sh\nmake\n```\nThe main binaries that will be built are:\n```sh\n./build/release/duckdb\n./build/release/test/unittest\n./build/release/extension/shellfs/shellfs.duckdb_extension\n```\n- `duckdb` is the binary for the duckdb shell with the extension code automatically loaded.\n- `unittest` is the test runner of duckdb. Again, the extension is already linked into the binary.\n- `shellfs.duckdb_extension` is the loadable binary as it would be distributed.\n\n## Running the extension\nTo run the extension code, simply start the shell with `./build/release/duckdb`.\n\nNow we can use the features from the extension directly in DuckDB.\n\n## Running the tests\nDifferent tests can be created for DuckDB extensions. The primary way of testing DuckDB extensions should be the SQL tests in `./test/sql`. These SQL tests can be run using:\n```sh\nmake test\n```\n\n### Installing the deployed binaries\n\nTo install your extension binaries from S3, you will need to do two things. Firstly, DuckDB should be launched with the\n`allow_unsigned_extensions` option set to true. How to set this will depend on the client you're using. 
Some examples:\n\nCLI:\n```shell\nduckdb -unsigned\n```\n\nPython:\n```python\ncon = duckdb.connect(':memory:', config={'allow_unsigned_extensions' : 'true'})\n```\n\nNodeJS:\n```js\ndb = new duckdb.Database(':memory:', {\"allow_unsigned_extensions\": \"true\"});\n```\n\nSecondly, you will need to set the repository endpoint in DuckDB to the HTTP URL of your bucket plus the version of the extension\nyou want to install. To do this, run the following SQL query in DuckDB:\n```sql\nSET custom_extension_repository='bucket.s3.eu-west-1.amazonaws.com/shellfs/latest';\n```\nNote that the `/latest` path will allow you to install the latest extension version available for your current version of\nDuckDB. To install a specific version, pass that version instead.\n\nAfter running these steps, you can install and load your extension using the regular INSTALL/LOAD commands in DuckDB:\n\n```sql\nINSTALL shellfs;\nLOAD shellfs;\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frustyconover%2Fduckdb-shellfs-extension","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frustyconover%2Fduckdb-shellfs-extension","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frustyconover%2Fduckdb-shellfs-extension/lists"}