{"id":16797341,"url":"https://github.com/mistercrunch/duckstreams","last_synced_at":"2025-06-11T18:04:04.909Z","repository":{"id":255763353,"uuid":"853564004","full_name":"mistercrunch/duckstreams","owner":"mistercrunch","description":null,"archived":false,"fork":false,"pushed_at":"2024-09-06T23:47:39.000Z","size":86,"stargazers_count":28,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-17T03:45:40.247Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mistercrunch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-06T23:21:09.000Z","updated_at":"2025-02-02T11:57:35.000Z","dependencies_parsed_at":null,"dependency_job_id":"c91239b6-4e24-4125-9d59-0bd6dfdabe46","html_url":"https://github.com/mistercrunch/duckstreams","commit_stats":null,"previous_names":["mistercrunch/duckstreams"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mistercrunch/duckstreams","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mistercrunch%2Fduckstreams","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mistercrunch%2Fduckstreams/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mistercrunch%2Fduckstreams/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mistercrunch%2Fduckstreams/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mistercrunch","downloa
d_url":"https://codeload.github.com/mistercrunch/duckstreams/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mistercrunch%2Fduckstreams/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259311839,"owners_count":22838802,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-13T09:22:01.505Z","updated_at":"2025-06-11T18:04:04.891Z","avatar_url":"https://github.com/mistercrunch.png","language":null,"funding_links":[],"categories":["\u003ca name=\"Not%20Set\"\u003e\u003c/a\u003eNot Set"],"sub_categories":[],"readme":"# DuckStreams\n\n\u003cimg src=\"logo.jpg\" width=\"250\"\u003e\n\n### Description\nDuckStreams acts as a virtual database on top of your Kafka and Redpanda\nclusters, effectively providing an ephemeral SQL interface for querying\nstreaming data.\n\n## Scope \nDuckStreams turns Kafka and Redpanda topics into SQL-accessible virtual tables,\nallowing users to query streaming data in real-time. The project interfaces with\nthe schema registry to map topics to tables and supports deserializing data in\nformats like JSON, Protobuf, and Thrift. Using DuckDB, DuckStreams creates\nephemeral tables, runs queries on them, and returns results without persisting\nany data. It’s designed to be lightweight, fully in-memory, and ideal for\nquerying dynamic stream data without caching or long-term storage.\n\n## How it works\n\nFirst, the service is served as a python DBAPI-compatible driver. 
If asked for its list of tables,\nqueried through `INFORMATION_SCHEMA.TABLES`, it interfaces with your streaming cluster's\nschema registry to get a list of topics. This table is made ephemeral through DuckDB so that\nyou can apply predicates, grouping and run any SQL against it.\n\nSimilarly, when running ANY SQL statement against the database, we parse out the virtual table\nname, which should match an existing topic (or `INFORMATION_SCHEMA.TABLES`), and then simply:\n\n1. figure out the topic\n1. fire up a client + consumer, and apply the time and partition predicate\n1. deserialize the data into memory, load it into an ephemeral, in-memory DuckDB table\n1. run the SQL you submitted against this ephemeral table in DuckDB, retrieve the result set\n1. return it through a DBAPI-compatible interface\n\n## Configuration\n\n* clusters: define your clusters in a YAML file\n\n* policies:\n  * level inheritance: top-level, cluster-level or table-level\n  * parameters\n    * row_limit: limit the number of rows the consumer will read; it simply stops once the limit is reached\n    * time_range_limit: define a max time range that can be queried, can be anything from seconds to years\n    * bytes (?)\n    * cells (?)\n\n## Thoughts \u0026 questions\n\n* nesting: it's pretty common to have deeply/oddly nested schemas on the transport layer,\n  how good is DuckDB's support for complex schemas? arbitrary JSON? Should we auto-columnize\n  things as we deserialize? automagically? based on configs?\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmistercrunch%2Fduckstreams","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmistercrunch%2Fduckstreams","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmistercrunch%2Fduckstreams/lists"}