{"id":20602025,"url":"https://github.com/datafusion-contrib/datafusion-federation","last_synced_at":"2025-05-10T00:31:25.793Z","repository":{"id":220985298,"uuid":"753057778","full_name":"datafusion-contrib/datafusion-federation","owner":"datafusion-contrib","description":"Allow DataFusion to resolve queries across remote query engines while pushing down as much compute as possible down.","archived":false,"fork":false,"pushed_at":"2025-04-25T12:54:57.000Z","size":680,"stargazers_count":127,"open_issues_count":3,"forks_count":23,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-05-05T04:46:37.166Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/datafusion-contrib.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-02-05T11:40:42.000Z","updated_at":"2025-04-29T08:21:54.000Z","dependencies_parsed_at":"2024-02-05T15:53:13.074Z","dependency_job_id":"e4cb650f-cc4a-4a1f-b632-6638092a24b6","html_url":"https://github.com/datafusion-contrib/datafusion-federation","commit_stats":null,"previous_names":["datafusion-contrib/datafusion-federation"],"tags_count":17,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datafusion-contrib%2Fdatafusion-federation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datafusion-contrib%2Fdatafusion-federation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datafu
sion-contrib%2Fdatafusion-federation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datafusion-contrib%2Fdatafusion-federation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/datafusion-contrib","download_url":"https://codeload.github.com/datafusion-contrib/datafusion-federation/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253346452,"owners_count":21894264,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-16T09:12:44.609Z","updated_at":"2025-05-10T00:31:25.250Z","avatar_url":"https://github.com/datafusion-contrib.png","language":"Rust","funding_links":[],"categories":["Rust","\u003ca name=\"Rust\"\u003e\u003c/a\u003eRust","Uncategorized"],"sub_categories":["Uncategorized"],"readme":"# DataFusion Federation\n\n[![crates.io](https://img.shields.io/crates/v/datafusion-federation.svg)](https://crates.io/crates/datafusion-federation)\n[![docs.rs](https://docs.rs/datafusion-federation/badge.svg)](https://docs.rs/datafusion-federation)\n\nDataFusion Federation allows\n[DataFusion](https://github.com/apache/arrow-datafusion) to execute (part of) a\nquery plan by a remote execution engine.\n\n                                        ┌────────────────┐\n                   ┌────────────┐       │ Remote DBMS(s) │\n    SQL Query ───\u003e │ DataFusion │  ───\u003e │  ( execution   │\n                   └────────────┘       │ happens here ) │\n                                        └────────────────┘\n\nThe goal is to allow resolving 
queries across remote query engines while\npushing down as much compute as possible to the remote database(s). This allows\nexecution to happen as close to the storage as possible. This concept is\nreferred to as 'query federation'.\n\n\u003e [!TIP]\n\u003e This repository implements the federation framework itself. If you want to\n\u003e connect to a specific database, check out the compatible providers available\n\u003e in\n\u003e [datafusion-contrib/datafusion-table-providers](https://github.com/datafusion-contrib/datafusion-table-providers/).\n\n## Usage\n\nCheck out the [examples](./datafusion-federation/examples/) to get a feel for\nhow it works.\n\nFor a complete step-by-step example of how federation works, you can check the\nexample [here](./datafusion-federation/examples/df-csv-advanced.rs). \n\n## Potential use-cases:\n\n- Querying across SQLite, MySQL, PostgreSQL, ...\n- Pushing down SQL or [Substrait](https://substrait.io/) plans.\n- DataFusion -\u003e Flight SQL -\u003e DataFusion\n- ..\n\n## Design concept\n\nSay you have a query plan as follows:\n\n                   ┌────────────┐\n                   │    Join    │\n                   └────────────┘\n                          ▲\n                  ┌───────┴────────┐\n           ┌────────────┐   ┌────────────┐\n           │   Scan A   │   │    Join    │\n           └────────────┘   └────────────┘\n                                   ▲\n                           ┌───────┴────────┐\n                    ┌────────────┐   ┌────────────┐\n                    │   Scan B   │   │   Scan C   │\n                    └────────────┘   └────────────┘\n\nDataFusion Federation will identify the largest possible sub-plans that\ncan be executed by an external database:\n\n                   ┌────────────┐      Optimizer recognizes\n                   │    Join    │      that B and C are\n                   └────────────┘      available in an\n                          ▲            external database\n           
┌──────────────┴────────┐\n           │       ┌ ─  ─ ─ ─  ─ ─ ┴ ─ ── ─ ─ ─  ─ ─┐\n    ┌────────────┐          ┌────────────┐          │\n    │   Scan A   │ │        │    Join    │\n    └────────────┘          └────────────┘          │\n                   │               ▲\n                           ┌───────┴────────┐       │\n                    ┌────────────┐   ┌────────────┐ │\n                   ││   Scan B   │   │   Scan C   │\n                    └────────────┘   └────────────┘ │\n                    ─ ── ─ ─ ── ─ ─ ─ ─  ─ ─ ─ ── ─ ┘\n\nThe sub-plans are cut out and replaced by an opaque federation node in the plan:\n\n                   ┌────────────┐\n                   │    Join    │\n                   └────────────┘    Rewritten Plan\n                          ▲\n                 ┌────────┴───────────┐\n                 │                    │\n          ┌────────────┐    ┏━━━━━━━━━━━━━━━━━━┓\n          │   Scan A   │    ┃     Scan B+C     ┃\n          └────────────┘    ┃  (TableProvider  ┃\n                            ┃ that can execute ┃\n                            ┃ sub-plan in an   ┃\n                            ┃external database)┃\n                            ┗━━━━━━━━━━━━━━━━━━┛\n\nDifferent databases may have different query languages and execution\ncapabilities. To accommodate this, we allow each 'federation provider' to\nself-determine what part of a sub-plan it will actually federate. This is done\nby letting each federation provider define its own optimizer rule. When a\nsub-plan is 'cut out' of the overall plan, it is first passed to the federation\nprovider's optimizer rule. This optimizer rule determines the part of the plan\nthat is cut out, based on the execution capabilities of the database it\nrepresents.\n\n## Implementation\n\nA remote database is represented by the `FederationProvider` trait. To identify\ntable scans that are available in the same database, they implement the\n`FederatedTableSource` trait. 
This trait allows lookup of the corresponding\n`FederationProvider`.\n\nIdentifying sub-plans to federate is done by the `FederationOptimizerRule`.\nThis rule needs to be registered in your DataFusion SessionState. One easy way\nto do this is using `default_session_state`. To do its job, the\n`FederationOptimizerRule` currently requires that all TableProviders that need\nto be federated are `FederatedTableProviderAdaptor`s. The\n`FederatedTableProviderAdaptor` also has a fallback mechanism that allows\nimplementations to fall back to a 'vanilla' TableProvider in case the\n`FederationOptimizerRule` isn't registered.\n\nThe `FederationProvider` can provide a `compute_context`. This allows it to\ndifferentiate between multiple remote execution contexts of the same type. For\nexample, two different MySQL instances, database schemas, access levels, etc. The\n`FederationProvider` also returns the `Optimizer` that allows it to\nself-determine what part of a sub-plan it can federate.\n\nThe `sql` module implements a generic `FederationProvider` for SQL execution\nengines. A specific SQL engine implements the `SQLExecutor` trait for its\nengine-specific execution. There are a number of compatible providers available\nin\n[datafusion-contrib/datafusion-table-providers](https://github.com/datafusion-contrib/datafusion-table-providers/).\n\n## Status\n\nThe project is in alpha status. Contributions welcome; land a PR = commit\naccess.\n\n- [Docs (release)](https://docs.rs/datafusion-federation)\n- [Docs (main)](https://datafusion-contrib.github.io/datafusion-federation/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatafusion-contrib%2Fdatafusion-federation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatafusion-contrib%2Fdatafusion-federation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatafusion-contrib%2Fdatafusion-federation/lists"}