https://github.com/datafusion-contrib/datafusion-distributed
Library for bringing distributed capabilities to Apache DataFusion
arrow datafusion distributed distributed-computing distributed-da distributed-systems query-e
Last synced: 5 months ago
- Host: GitHub
- URL: https://github.com/datafusion-contrib/datafusion-distributed
- Owner: datafusion-contrib
- License: apache-2.0
- Created: 2025-06-19T17:31:37.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2026-01-20T20:45:15.000Z (5 months ago)
- Last Synced: 2026-01-20T21:43:06.986Z (5 months ago)
- Topics: arrow, datafusion, distributed, distributed-computing, distributed-da, distributed-systems, query-e
- Language: Rust
- Homepage: https://datafusion-contrib.github.io/datafusion-distributed/
- Size: 1.59 MB
- Stars: 63
- Watchers: 4
- Forks: 24
- Open Issues: 52
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
# DataFusion Distributed
Library that brings distributed execution capabilities to [Apache DataFusion](https://github.com/apache/datafusion).
## What can you do with this crate?
This crate is a toolkit that extends [Apache DataFusion](https://github.com/apache/datafusion) with distributed
capabilities, providing a developer experience as close as possible to vanilla DataFusion while being unopinionated
about the networking stack used for hosting the different workers involved in a query.
Users of this library can expect to take their existing single-node DataFusion-based systems and add distributed
capabilities with minimal changes.
## Core tenets of the project
- Be as close as possible to vanilla DataFusion, providing a seamless integration with existing DataFusion systems and
a familiar API for building applications.
- Unopinionated about networking. This crate takes no position on the networking stack; users are expected to use
their own infrastructure for hosting DataFusion nodes.
- No coordinator-worker architecture. To keep infrastructure simple, any node can act as a coordinator or a worker.
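
To make the last tenet concrete, here is a minimal, in-process sketch of a cluster in which every node has the same type and takes on the coordinator or worker role per query. It is only an illustration of the idea: `Node`, `Task`, `execute` and `coordinate` are hypothetical names invented for this example, not part of this crate's API, and there is no networking or DataFusion plan involved.

```rust
/// Hypothetical stand-in for a unit of work. A real system would ship a
/// piece of a physical plan rather than raw numbers.
#[derive(Debug, Clone)]
struct Task {
    partition: Vec<i64>,
}

/// The only node type in this toy cluster. There is no separate
/// coordinator or worker struct: the role depends on which method is called.
#[derive(Debug, Clone)]
struct Node {
    id: usize,
}

impl Node {
    /// Worker role: run one task locally and return a partial result.
    fn execute(&self, task: &Task) -> i64 {
        task.partition.iter().sum()
    }

    /// Coordinator role: split the input into chunks, hand each chunk to a
    /// peer (possibly this node itself), and merge the partial results.
    fn coordinate(&self, cluster: &[Node], data: &[i64]) -> i64 {
        // Chunk size is at least 1 so that `chunks` never receives 0.
        let chunk = data.len().div_ceil(cluster.len().max(1)).max(1);
        data.chunks(chunk)
            .zip(cluster.iter().cycle())
            .map(|(part, peer)| peer.execute(&Task { partition: part.to_vec() }))
            .sum()
    }
}

fn main() {
    let cluster: Vec<Node> = (0..3).map(|id| Node { id }).collect();
    let data: Vec<i64> = (1..=10).collect();

    // Any node can coordinate; the result does not depend on which one does.
    for node in &cluster {
        let total = node.coordinate(&cluster, &data);
        println!("node {} coordinated, total = {}", node.id, total);
    }
}
```

In this sketch the same `Node` value appears both as the caller of `coordinate` and as one of the peers running `execute`, which is the property the tenet describes: no dedicated coordinator process has to be deployed.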
# Benchmarks

# Docs
The user and contributor guide can be found here:
https://datafusion-contrib.github.io/datafusion-distributed
## Getting familiar with distributed DataFusion
There are some runnable examples showcasing how to provide a localhost implementation for Distributed DataFusion in
[examples/](examples):
- [localhost_worker.rs](examples/localhost_worker.rs): code that spawns a Worker listening for physical
plans over the network.
- [localhost_run.rs](examples/localhost_run.rs): code that distributes a query across the spawned Workers and executes
it.
The integration tests also give an idea of how to use the library and what can be achieved with it:
- [tpch_validation_test.rs](tests/tpch_plans_test.rs): executes all TPCH queries and performs assertions over the
distributed plans.
- [custom_config_extension.rs](tests/custom_config_extension.rs): showcases how to propagate custom DataFusion config
extensions.
- [custom_extension_codec.rs](tests/custom_extension_codec.rs): showcases how to propagate custom physical extension
codecs.
- [distributed_aggregation.rs](tests/distributed_aggregation.rs): showcases how to manually place `ArrowFlightReadExec`
nodes in a plan and build a distributed query out of it.