{"id":34103918,"url":"https://github.com/sidequery/dlt-iceberg","last_synced_at":"2026-02-09T09:15:57.953Z","repository":{"id":328665442,"uuid":"1075107685","full_name":"sidequery/dlt-iceberg","owner":"sidequery","description":"An Iceberg destination for DLT that supports REST catalogs","archived":false,"fork":false,"pushed_at":"2026-01-27T01:19:08.000Z","size":333,"stargazers_count":6,"open_issues_count":3,"forks_count":3,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-01-27T13:26:46.879Z","etag":null,"topics":["apache-iceberg","data-engineering","datalake","dlt","dlthub","etl","iceberg"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sidequery.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-13T03:40:21.000Z","updated_at":"2026-01-27T01:19:11.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/sidequery/dlt-iceberg","commit_stats":null,"previous_names":["sidequery/dlt-iceberg"],"tags_count":8,"template":false,"template_full_name":null,"purl":"pkg:github/sidequery/dlt-iceberg","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sidequery%2Fdlt-iceberg","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sidequery%2Fdlt-iceberg/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sidequery%2Fdlt-iceberg/releases","manifests_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sidequery%2Fdlt-iceberg/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sidequery","download_url":"https://codeload.github.com/sidequery/dlt-iceberg/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sidequery%2Fdlt-iceberg/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29260426,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-09T04:11:57.159Z","status":"ssl_error","status_checked_at":"2026-02-09T04:11:56.117Z","response_time":56,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-iceberg","data-engineering","datalake","dlt","dlthub","etl","iceberg"],"created_at":"2025-12-14T17:55:34.194Z","updated_at":"2026-02-09T09:15:57.944Z","avatar_url":"https://github.com/sidequery.png","language":"Python","readme":"# dlt-iceberg\n\nA [dlt](https://dlthub.com/) destination for [Apache Iceberg](https://iceberg.apache.org/) tables using REST catalogs.\n\n## Features\n\n- **Atomic Multi-File Commits**: Multiple parquet files committed as single Iceberg snapshot per table\n- **REST Catalog Support**: Works with Nessie, Polaris, AWS Glue, Unity Catalog\n- **Credential Vending**: Most REST catalogs vend storage credentials automatically\n- **Partitioning**: Full 
support for Iceberg partition transforms via `iceberg_adapter()`\n- **Merge Strategies**: Delete-insert and upsert with hard delete support\n- **DuckDB Integration**: Query loaded data via `pipeline.dataset()`\n- **Schema Evolution**: Automatic schema updates when adding columns\n\n## Installation\n\n```bash\npip install dlt-iceberg\n```\n\nOr with uv:\n\n```bash\nuv add dlt-iceberg\n```\n\n## Quick Start\n\n```python\nimport dlt\nfrom dlt_iceberg import iceberg_rest\n\n@dlt.resource(name=\"events\", write_disposition=\"append\")\ndef generate_events():\n    yield {\"event_id\": 1, \"value\": 100}\n\npipeline = dlt.pipeline(\n    pipeline_name=\"my_pipeline\",\n    destination=iceberg_rest(\n        catalog_uri=\"https://my-catalog.example.com/api/catalog\",\n        namespace=\"analytics\",\n        warehouse=\"my_warehouse\",\n        credential=\"client-id:client-secret\",\n        oauth2_server_uri=\"https://my-catalog.example.com/oauth/tokens\",\n    ),\n)\n\npipeline.run(generate_events())\n```\n\n### Query Loaded Data\n\n```python\n# Query data via DuckDB\ndataset = pipeline.dataset()\n\n# Access as dataframe\ndf = dataset[\"events\"].df()\n\n# Run SQL queries\nresult = dataset.query(\"SELECT * FROM events WHERE value \u003e 50\").fetchall()\n\n# Get Arrow table\narrow_table = dataset[\"events\"].arrow()\n```\n\n### Merge/Upsert\n\n```python\n@dlt.resource(\n    name=\"users\",\n    write_disposition=\"merge\",\n    primary_key=\"user_id\"\n)\ndef generate_users():\n    yield {\"user_id\": 1, \"name\": \"Alice\", \"status\": \"active\"}\n\npipeline.run(generate_users())\n```\n\n## Configuration\n\n### Required Options\n\n```python\niceberg_rest(\n    catalog_uri=\"...\",    # REST catalog endpoint (or sqlite:// for local)\n    namespace=\"...\",      # Iceberg namespace (database)\n)\n```\n\n### Authentication\n\nChoose based on your catalog:\n\n| Catalog | Auth Method |\n|---------|-------------|\n| Polaris, Lakekeeper | `credential` + `oauth2_server_uri` 
|\n| Unity Catalog | `token` |\n| AWS Glue | `sigv4_enabled` + `signing_region` |\n| Local SQLite | None needed |\n\nMost REST catalogs (Polaris, Lakekeeper, etc.) **vend storage credentials automatically** via the catalog API. You typically don't need to configure S3/GCS/Azure credentials manually.\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eAdvanced Options\u003c/b\u003e\u003c/summary\u003e\n\n```python\niceberg_rest(\n    # ... required options ...\n\n    # Manual storage credentials (usually not needed with credential vending)\n    s3_endpoint=\"...\",\n    s3_access_key_id=\"...\",\n    s3_secret_access_key=\"...\",\n    s3_region=\"...\",\n\n    # Performance tuning\n    max_retries=5,               # Retry attempts for transient failures\n    retry_backoff_base=2.0,      # Exponential backoff multiplier\n    merge_batch_size=500000,     # Rows per batch for merge operations\n    strict_casting=False,        # Fail on potential data loss\n\n    # Table management\n    table_location_layout=None,  # Custom table location pattern\n    register_new_tables=False,   # Register tables found in storage\n    hard_delete_column=\"_dlt_deleted_at\",  # Column for hard deletes\n)\n```\n\n\u003c/details\u003e\n\n## Catalog Examples\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eLakekeeper (Docker)\u003c/b\u003e\u003c/summary\u003e\n\n```python\niceberg_rest(\n    catalog_uri=\"http://localhost:8282/catalog/\",\n    warehouse=\"test-warehouse\",\n    namespace=\"my_namespace\",\n    s3_endpoint=\"http://localhost:9000\",\n    s3_access_key_id=\"minioadmin\",\n    s3_secret_access_key=\"minioadmin\",\n    s3_region=\"us-east-1\",\n)\n```\n\nStart Lakekeeper + MinIO with `docker compose up -d`. 
Lakekeeper supports credential vending in production.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003ePolaris\u003c/b\u003e\u003c/summary\u003e\n\n```python\niceberg_rest(\n    catalog_uri=\"https://polaris.example.com/api/catalog\",\n    warehouse=\"my_warehouse\",\n    namespace=\"production\",\n    credential=\"client-id:client-secret\",\n    oauth2_server_uri=\"https://polaris.example.com/api/catalog/v1/oauth/tokens\",\n)\n```\n\nStorage credentials are vended automatically by the catalog.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eUnity Catalog (Databricks)\u003c/b\u003e\u003c/summary\u003e\n\n```python\niceberg_rest(\n    catalog_uri=\"https://\u003cworkspace\u003e.cloud.databricks.com/api/2.1/unity-catalog/iceberg-rest\",\n    warehouse=\"\u003ccatalog-name\u003e\",\n    namespace=\"\u003cschema-name\u003e\",\n    token=\"\u003cdatabricks-token\u003e\",\n)\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eAWS Glue\u003c/b\u003e\u003c/summary\u003e\n\n```python\niceberg_rest(\n    catalog_uri=\"https://glue.us-east-1.amazonaws.com/iceberg\",\n    warehouse=\"\u003caccount-id\u003e:s3tablescatalog/\u003cbucket\u003e\",\n    namespace=\"my_database\",\n    sigv4_enabled=True,\n    signing_region=\"us-east-1\",\n)\n```\n\nRequires AWS credentials in environment (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`).\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eLocal SQLite Catalog\u003c/b\u003e\u003c/summary\u003e\n\n```python\niceberg_rest(\n    catalog_uri=\"sqlite:///catalog.db\",\n    warehouse=\"file:///path/to/warehouse\",\n    namespace=\"my_namespace\",\n)\n```\n\nGreat for local development and testing.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eNessie (Docker)\u003c/b\u003e\u003c/summary\u003e\n\n```python\niceberg_rest(\n    catalog_uri=\"http://localhost:19120/iceberg/main\",\n    
namespace=\"my_namespace\",\n    s3_endpoint=\"http://localhost:9000\",\n    s3_access_key_id=\"minioadmin\",\n    s3_secret_access_key=\"minioadmin\",\n    s3_region=\"us-east-1\",\n)\n```\n\nStart Nessie + MinIO with `docker compose up -d` (see docker-compose.yml in repo).\n\n\u003c/details\u003e\n\n## Partitioning\n\n### Using iceberg_adapter (Recommended)\n\nThe `iceberg_adapter` function provides a clean API for configuring Iceberg partitioning:\n\n```python\nfrom dlt_iceberg import iceberg_adapter, iceberg_partition\n\n@dlt.resource(name=\"events\")\ndef events():\n    yield {\"event_date\": \"2024-01-01\", \"user_id\": 123, \"region\": \"US\"}\n\n# Single partition\nadapted = iceberg_adapter(events, partition=\"region\")\n\n# Multiple partitions with transforms\nadapted = iceberg_adapter(\n    events,\n    partition=[\n        iceberg_partition.month(\"event_date\"),\n        iceberg_partition.bucket(10, \"user_id\"),\n        \"region\",  # identity partition\n    ]\n)\n\npipeline.run(adapted)\n```\n\n### Partition Transforms\n\n```python\n# Temporal transforms (for timestamp/date columns)\niceberg_partition.year(\"created_at\")\niceberg_partition.month(\"created_at\")\niceberg_partition.day(\"created_at\")\niceberg_partition.hour(\"created_at\")\n\n# Identity (no transformation)\niceberg_partition.identity(\"region\")\n\n# Bucket (hash into N buckets)\niceberg_partition.bucket(10, \"user_id\")\n\n# Truncate (truncate to width)\niceberg_partition.truncate(4, \"email\")\n\n# Custom partition field names\niceberg_partition.month(\"created_at\", \"event_month\")\niceberg_partition.bucket(8, \"user_id\", \"user_bucket\")\n```\n\n### Using Column Hints\n\nYou can also use dlt column hints for partitioning:\n\n```python\n@dlt.resource(\n    name=\"events\",\n    columns={\n        \"event_date\": {\n            \"data_type\": \"date\",\n            \"partition\": True,\n            \"partition_transform\": \"day\",\n        },\n        \"user_id\": {\n            
\"data_type\": \"bigint\",\n            \"partition\": True,\n            \"partition_transform\": \"bucket[10]\",\n        }\n    }\n)\ndef events():\n    ...\n```\n\n## Write Dispositions\n\n### Append\n```python\nwrite_disposition=\"append\"\n```\nAdds new data without modifying existing rows.\n\n### Replace\n```python\nwrite_disposition=\"replace\"\n```\nTruncates table and inserts new data.\n\n### Merge\n\n#### Delete-Insert Strategy (Default)\n```python\n@dlt.resource(\n    write_disposition={\"disposition\": \"merge\", \"strategy\": \"delete-insert\"},\n    primary_key=\"user_id\"\n)\n```\nDeletes matching rows then inserts new data. Single atomic transaction.\n\n#### Upsert Strategy\n```python\n@dlt.resource(\n    write_disposition={\"disposition\": \"merge\", \"strategy\": \"upsert\"},\n    primary_key=\"user_id\"\n)\n```\nUpdates existing rows, inserts new rows.\n\n#### Hard Deletes\n\nMark rows for deletion by setting the `_dlt_deleted_at` column:\n\n```python\n@dlt.resource(\n    write_disposition={\"disposition\": \"merge\", \"strategy\": \"delete-insert\"},\n    primary_key=\"user_id\"\n)\ndef users_with_deletes():\n    from datetime import datetime\n    yield {\"user_id\": 1, \"name\": \"alice\", \"_dlt_deleted_at\": None}  # Keep\n    yield {\"user_id\": 2, \"name\": \"bob\", \"_dlt_deleted_at\": datetime.now()}  # Delete\n```\n\n## Development\n\n### Run Tests\n\n```bash\n# Start Docker services (for Nessie tests)\ndocker compose up -d\n\n# Run all tests\nuv run pytest tests/ -v\n\n# Run only unit tests (no Docker required)\nuv run pytest tests/ --ignore=tests/nessie -v\n\n# Run Nessie integration tests\nuv run pytest tests/nessie/ -v\n```\n\n### Project Structure\n\n```\ndlt-iceberg/\n├── src/dlt_iceberg/\n│   ├── __init__.py           # Public API\n│   ├── destination_client.py # Class-based destination (atomic commits)\n│   ├── destination.py        # Function-based destination (legacy)\n│   ├── adapter.py            # iceberg_adapter() for 
partitioning\n│   ├── sql_client.py         # DuckDB integration for dataset()\n│   ├── schema_converter.py   # dlt → Iceberg schema conversion\n│   ├── schema_casting.py     # Arrow table casting\n│   ├── schema_evolution.py   # Schema updates\n│   ├── partition_builder.py  # Partition specs\n│   └── error_handling.py     # Retry logic\n├── tests/\n│   ├── test_adapter.py       # iceberg_adapter tests\n│   ├── test_capabilities.py  # Hard delete, partition names tests\n│   ├── test_dataset.py       # DuckDB integration tests\n│   ├── test_merge_disposition.py\n│   ├── test_schema_evolution.py\n│   └── ...\n├── examples/\n│   ├── incremental_load.py   # CSV incremental loading\n│   ├── merge_load.py         # CSV merge/upsert\n│   └── data/                 # Sample CSV files\n└── docker-compose.yml        # Nessie + MinIO for testing\n```\n\n## How It Works\n\nThe class-based destination uses dlt's `JobClientBase` interface to accumulate parquet files during a load and commit them atomically in `complete_load()`:\n\n1. dlt extracts data and writes parquet files\n2. Each file is registered in module-level global state\n3. After all files complete, `complete_load()` is called\n4. All files for a table are combined and committed as a single Iceberg snapshot\n5. 
Each table gets one snapshot per load\n\nThis ensures atomic commits even though dlt creates multiple client instances.\n\n## License\n\nMIT License - see LICENSE file\n\n## Resources\n\n- [dlt Documentation](https://dlthub.com/docs)\n- [Apache Iceberg](https://iceberg.apache.org/)\n- [PyIceberg](https://py.iceberg.apache.org/)\n- [Iceberg REST Spec](https://iceberg.apache.org/rest-catalog-spec/)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsidequery%2Fdlt-iceberg","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsidequery%2Fdlt-iceberg","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsidequery%2Fdlt-iceberg/lists"}