{"id":13671235,"url":"https://github.com/sutoiku/puffin","last_synced_at":"2026-02-23T05:39:35.049Z","repository":{"id":65387041,"uuid":"591304097","full_name":"sutoiku/puffin","owner":"sutoiku","description":"Serverless HTAP cloud data platform powered by Arrow × DuckDB × Iceberg","archived":false,"fork":false,"pushed_at":"2023-03-28T05:42:48.000Z","size":18685,"stargazers_count":327,"open_issues_count":1,"forks_count":12,"subscribers_count":27,"default_branch":"main","last_synced_at":"2025-04-27T14:38:52.074Z","etag":null,"topics":["arrow","duckdb","iceberg","serverless"],"latest_commit_sha":null,"homepage":"http://PuffinDB.io","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sutoiku.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null}},"created_at":"2023-01-20T12:36:48.000Z","updated_at":"2025-03-30T21:07:58.000Z","dependencies_parsed_at":"2023-06-01T01:00:17.548Z","dependency_job_id":null,"html_url":"https://github.com/sutoiku/puffin","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/sutoiku/puffin","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sutoiku%2Fpuffin","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sutoiku%2Fpuffin/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sutoiku%2Fpuffin/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sutoiku%2Fpuffin/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sutoiku","download_url":"https://codeload.github.com
/sutoiku/puffin/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sutoiku%2Fpuffin/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29738142,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-23T04:51:08.365Z","status":"ssl_error","status_checked_at":"2026-02-23T04:49:15.865Z","response_time":90,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arrow","duckdb","iceberg","serverless"],"created_at":"2024-08-02T09:01:03.698Z","updated_at":"2026-02-23T05:39:35.017Z","avatar_url":"https://github.com/sutoiku.png","language":null,"funding_links":[],"categories":["Others"],"sub_categories":[],"readme":"# PuffinDB \u003cimg src=\"https://user-images.githubusercontent.com/1074452/220389778-245dd23e-3f09-4615-bdf1-24cd1eb8b819.png\" style=\"margin-left: 0.25em\" width=\"24\"\u003e\n\nServerless [HTAP](docs/HTAP.md) cloud data platform powered by [Arrow](https://arrow.apache.org/) × [DuckDB](https://duckdb.org/) × [Iceberg](https://iceberg.apache.org/)\n\nAccelerate DuckDB with 10,000 [AWS Lambda functions](https://aws.amazon.com/lambda/) running on your own VPC\n\n**Note**: This repository only contains preliminary design documents (*Cf.* [Roadmap](ROADMAP.md))\n\n**Kickoff meetup**: [Rovinj, Croatia, March 29-31, 2023](meetup)\n\n## 
Introduction\n\u003cimg width=\"1217\" alt=\"Architecture\" src=\"https://user-images.githubusercontent.com/1074452/221385259-c8b07288-2300-40ef-a6b8-b786ce808cc4.png\"\u003e\n\nIf you are using DuckDB client-side with [any client application](docs/Clientless.md), adding the [PuffinDB extension](docs/Extension.md) will let you:\n- Distribute queries across thousands of serverless functions and a [Monostore](docs/Monostore.md)\n- Read from and write to hundreds of applications using any [Airbyte connector](https://airbyte.com/connectors)\n- Collaborate on the same [Iceberg tables](https://iceberg.apache.org/spec/) with other users\n- Write back to an Iceberg table with [ACID](https://en.wikipedia.org/wiki/ACID) transactional integrity\n- Execute [cross-database joins](docs/Query%20Proxy.md#query-delegation) (*Cf.* [Edge-Driven Data Integration](EDDI.md))\n- Translate between 19 [SQL dialects](docs/Query%20Proxy.md#dialect-translation)\n- Invoke [remote query generators](docs/Query%20Proxy.md)\n- Invoke [curl](https://curl.se/) commands\n- Execute incremental and observable [data pipelines](docs/Pipeline%20Engine.md)\n- Turn DuckDB into a next-generation [vector database](docs/Vector%20Database.md)\n- Support the [Lance](https://github.com/eto-ai/lance) file format for 100× faster random access\n- Accelerate and | or schedule the downloading of large tables to your client\n- Cache tables and run computations at the edge ([Amazon CloudFront](https://aws.amazon.com/cloudfront/) × [Lambda@Edge](https://aws.amazon.com/lambda/edge/))\n- Log queries on your data lake\n\nPuffinDB is an initiative of [STOIC](https://stoic.com/), and not [DuckDB Labs](https://duckdblabs.com/) or the [DuckDB Foundation](https://duckdb.org/foundation/).\n\nDuckDB and the DuckDB logo are trademarks of the DuckDB Foundation.\n\nPuffinDB and the PuffinDB logo are trademarks of STOIC (Sutoiku, Inc.).\n\nSTOIC is a member of the DuckDB Foundation.\n\n## Beliefs\n- Nothing beats 
[SQL](https://en.wikipedia.org/wiki/SQL) because nothing can beat [maths](https://en.wikipedia.org/wiki/Relational_algebra)\n- The [public cloud](FAQ.md#why-not-support-private-cloud-deployment) is the only truly elastic platform\n- [Arrow](https://arrow.apache.org/) × [DuckDB](https://duckdb.org/) × [Iceberg](https://iceberg.apache.org/) are game changers\n- [Edge-Driven Data Integration](EDDI.md) is the way forward\n- [Clientless](docs/Clientless.md) + [Serverless](docs/Serverless.md) = [Goodness](CLOUD.md)\n\n## Rationale\nMany excellent distributed SQL engines are available today. Why do we need yet another one?\n\n- [True serverless architecture](RATIONALE.md/#true-serverless-architecture)\n- [Future-proof architecture](RATIONALE.md/#future-proof-architecture)\n- [Designed for virtual private cloud deployment](RATIONALE.md/#designed-for-virtual-private-cloud-deployment)\n- [Designed for small to large datasets](RATIONALE.md/#designed-for-small-to-large-datasets)\n- [Designed for real-time analytics](RATIONALE.md/#designed-for-real-time-analytics)\n- [Designed for interactive analytics](RATIONALE.md/#designed-for-interactive-analytics)\n- [Designed for transformation and analytics](RATIONALE.md/#designed-for-transformation-and-analytics)\n- [Designed for analytics and transactions](RATIONALE.md/#designed-for-analytics-and-transactions)\n- [Designed for next-generation query engines](RATIONALE.md/#designed-for-next-generation-query-engines)\n- [Designed for next-generation file formats](RATIONALE.md/#designed-for-next-generation-file-formats)\n- [Designed for lakehouses](RATIONALE.md/#designed-for-lakehouses)\n- [Designed for data mesh integration](RATIONALE.md/#designed-for-data-mesh-integration)\n- [Designed for all users](RATIONALE.md/#designed-for-all-users)\n- [Designed for extensibility](RATIONALE.md/#designed-for-extensibility)\n- [Designed for embedability](RATIONALE.md/#designed-for-embedability)\n- [Optimized for machine-generated 
queries](RATIONALE.md/#optimized-for-machine-generated-queries)\n- [Scalable across large user bases](RATIONALE.md/#scalable-across-large-user-bases)\n\n## Outline\n- True [serverless architecture](docs/Architecture.md) (run [DuckDB](https://duckdb.org/) on 10,000 [Lambda functions](https://aws.amazon.com/lambda/))\n- Supporting both read and write queries ([HTAP](docs/HTAP.md))\n- Implemented in [Python](https://www.python.org/), [Rust](https://www.rust-lang.org/), and [TypeScript](https://www.typescriptlang.org/) (using [Bun](https://bun.sh/))\n- Powered by [Arrow](https://arrow.apache.org/) × [DuckDB](https://duckdb.org/) × [Iceberg](https://iceberg.apache.org/)\n- Powered by [Redis](https://redis.io/) (using [Amazon ElastiCache for Redis](https://aws.amazon.com/elasticache/redis/)) for state management\n- Accelerated by [NAT hole punching](https://github.com/spcl/tcpunch) for superfast data shuffles\n- Integrated with [Apache Iceberg](https://iceberg.apache.org/), [Apache Hudi](https://hudi.apache.org/), and [Delta Lake](https://delta.io/)\n- Deployed on [AWS](https://aws.amazon.com/) first, then [Microsoft Azure](https://azure.microsoft.com/en-us) and [Google Cloud](https://cloud.google.com/)\n- Deployed as two [AWS Lambda functions](functions/) and one [Amazon EC2](https://aws.amazon.com/ec2/) instance\n- Integrated with [Amazon Athena](https://aws.amazon.com/athena/) (for write queries on lakehouse tables)\n- Packaged as an [AWS CloudFormation](https://aws.amazon.com/cloudformation/) template (using [Terraform](https://www.terraform.io/))\n- Released as a free [AWS Marketplace](https://aws.amazon.com/marketplace) product\n- Running on your [Amazon VPC](https://aws.amazon.com/vpc/)\n- Licensed under [MIT License](https://opensource.org/licenses/MIT)\n\n## Features\n- [Distributed SQL query planner](docs/Query%20Planner.md) powered by [DuckDB](https://duckdb.org/)\n- [Distributed SQL query engine](docs/Query%20Engine.md) powered by 
[DuckDB](https://duckdb.org/)\n- Distributed SQL query execution coordinated by [Redis](https://redis.io/) (using [Amazon ElastiCache for Redis](https://aws.amazon.com/elasticache/redis/))\n- Distributed data shuffles enabled by direct Lambda-to-Lambda communication through [NAT hole punching](https://github.com/spcl/tcpunch)\n- Read queries executed by [DuckDB](https://duckdb.org/) (on [AWS Lambda](https://aws.amazon.com/lambda/))\n- Write queries against Object Store objects executed by [DuckDB](https://duckdb.org/)\n- Write queries against Lakehouse tables executed by [Amazon Athena](https://aws.amazon.com/athena/)\n- Built-in [Malloy](https://github.com/malloydata/malloy/tree/main/packages/malloy) to SQL translator\n- Built-in [PRQL](https://prql-lang.org/) to SQL translator\n- Built-in [SQL dialect converter](https://github.com/tobymao/sqlglot)\n- Built-in [SQL parser | stringifier](https://twitter.com/ghalimi/status/1625172235895046146)\n- Sub-500ms table scanning API (fetch table partitions from filter predicates) running on a standalone function\n- Advanced table metadata managed by serverless [Metastore](docs/Metastore.md)\n- Concurrent support for multiple table formats ([Apache Iceberg](https://iceberg.apache.org/), [Apache Hudi](https://hudi.apache.org/), and [Delta Lake](https://delta.io/))\n- Concurrent support for multiple Lakehouse instances\n- Native support for all Lakehouse Catalogs ([AWS Glue Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html), [Amazon DynamoDB](https://aws.amazon.com/dynamodb/), and [Amazon RDS](https://aws.amazon.com/rds/))\n- Support for authentication and authorization\n- Support for synchronous and asynchronous invocations\n- Support for cascading remote invocations with [`SELECT THROUGH`](docs/Clientless.md) syntax\n- Joins across heterogeneous tables using different table formats\n- Joins across tables managed by different Lakehouse instances\n- Small filtered partitions 
[cached](FAQ.md#how-does-partition-caching-work) on [AWS Lambda](https://aws.amazon.com/lambda/) functions\n- Query results returned as an HTTP response, serialized on Object Store, or streamed through [Apache Arrow](https://arrow.apache.org/)\n- Query results [cached](FAQ.md#how-does-query-result-caching-work) on Object Store ([Amazon S3](https://aws.amazon.com/s3/)) and CDN ([Amazon CloudFront](https://aws.amazon.com/cloudfront/))\n- [Query logs](docs/Logs.md) recorded as [JSON](https://redis.io/docs/stack/json/) values in a [Redis](https://redis.io/) cluster or on the data lake as Parquet files\n- Transparent support for all file formats supported by [DuckDB](https://duckdb.org/) and the Lakehouse\n- Transparent support for all table lifecycle features offered by the Lakehouse\n- Planned support for deployment on [AWS Fargate](https://aws.amazon.com/fargate/)\n\n## Deployment\nPuffinDB will support four [incremental deployment options](FAQ.md#why-support-so-many-deployment-options):\n- [Node.js](https://nodejs.org/en/) and [Python](https://www.python.org/) modules deeply integrated within your own tool or application\n- [AWS Lambda functions](functions/) deployed within your own cloud platform\n- [AWS CloudFormation](https://aws.amazon.com/cloudformation/) template deployed within your own [VPC](https://aws.amazon.com/vpc/)\n- [AWS Marketplace](https://aws.amazon.com/marketplace) product added to your own cloud environment\n\n## Philosophy\n- **Developer-first**: no nonsense, zero friction\n- **Lowest latency**: every millisecond counts\n- **Elastic design**: from kilobytes to petabytes\n\n## FAQ\nPlease check our [Frequently Asked Questions](FAQ.md).\n\n## Roadmap\nPlease check our [Roadmap](ROADMAP.md).\n\n## Sponsors\nThis project was initiated and is currently funded by [STOIC](https://stoic.com/).\n\nPlease check our [sponsors](SPONSORS.md) page for sponsorship opportunities.\n\n## Credits\nThis project leverages several [DuckDB](https://duckdb.org/) features 
implemented by [DuckDB Labs](https://duckdblabs.com/) and funded by [STOIC](https://stoic.com/):\n\n- Support for [Apache Arrow](https://arrow.apache.org/) streaming when using [Node.js](https://nodejs.org/en/) deployment (released)\n- Support for user-defined functions when using [Node.js](https://nodejs.org/en/) deployment (released)\n- Support for map-reduced queries with binary map results using new [`COMBINE`](https://github.com/duckdb/duckdb/pull/2998) function (released)\n- Support for import of Hive partitions (released)\n- Support for [partitioned exports](https://github.com/duckdb/duckdb/pull/5964) with `COPY ... TO ... PARTITION_BY` (released)\n- Support for SQL query parsing | stringifying through standard query API ([under development](https://twitter.com/ghalimi/status/1625172235895046146))\n- Support for [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs) (development starting soon)\n\nWe are also considering funding the following projects:\n\n- Support for `SELECT * THROUGH 'https://myPuffinDB.com/' FROM remoteTable` syntax (*Cf.* [EDDI](EDDI.md))\n- Support for `FIXED` fixed-length character strings (*Cf.* [#3](https://github.com/sutoiku/puffin/issues/3))\n- Support for `C` and `S` [`tpch-dbgen`](https://github.com/electrum/tpch-dbgen) options in `tpch` [extension](https://duckdb.org/docs/extensions/overview.html)\n\nThis project was initially inspired by this excellent [article](https://towardsdatascience.com/boost-your-cloud-data-applications-with-duckdb-and-iceberg-api-67677666fbd3) from [Alon Agmon](https://medium.com/@alon.agmon).\n\n## Discussions\nMost discussions about this project are currently taking place on the [@ghalimi](https://twitter.com/ghalimi) Twitter account.\n\nFor a lower-frequency alternative, please follow [@PuffinDB](https://twitter.com/PuffinDB).\n\n## Notes\nPuffinDB should not be confused with the [Puffin file format](https://iceberg.apache.org/puffin-spec/).\n\n*Be stoic, be kind, be cool. 
Like a puffin...*\n\nⒸ [Sutoiku, Inc.](https://stoic.com/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsutoiku%2Fpuffin","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsutoiku%2Fpuffin","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsutoiku%2Fpuffin/lists"}