https://github.com/apache/datafusion-ray

Apache DataFusion Ray

Host: GitHub
URL: https://github.com/apache/datafusion-ray
Owner: apache
License: apache-2.0
Created: 2024-09-19T19:56:04.000Z (10 months ago)
Default Branch: main
Last Pushed: 2025-04-02T17:46:15.000Z (3 months ago)
Last Synced: 2025-04-03T04:16:53.500Z (3 months ago)
Language: Python
Homepage: https://datafusion.apache.org/ray
Size: 418 KB
Stars: 180
Watchers: 24
Forks: 18
Open Issues: 18
Metadata Files:
- Readme: README.md
- Contributing: docs/contributing.md
- License: LICENSE.txt

Awesome Lists containing this project

awesome-datafusion - Apache DataFusion Ray

# DataFusion for Ray

[![Apache licensed][license-badge]][license-url]
[![Python Tests][actions-badge]][actions-url]
[![Discord chat][discord-badge]][discord-url]

[license-badge]: https://img.shields.io/badge/license-Apache%20v2-blue.svg
[license-url]: https://github.com/apache/datafusion-ray/blob/main/LICENSE.txt
[actions-badge]: https://github.com/apache/datafusion-ray/actions/workflows/main.yml/badge.svg
[actions-url]: https://github.com/apache/datafusion-ray/actions?query=branch%3Amain
[discord-badge]: https://img.shields.io/badge/Chat-Discord-purple
[discord-url]: https://discord.com/invite/Qw5gKqHxUM

## Overview

DataFusion for Ray is a distributed execution framework that enables DataFusion DataFrame and SQL queries to run on a
Ray cluster. This integration allows users to leverage Ray's dynamic scheduling capabilities while executing
queries in a distributed fashion.

## Execution Modes

DataFusion for Ray supports two execution modes:

### Streaming Execution

This mode mimics the default execution strategy of DataFusion. Each operator in the query plan starts executing
as soon as its inputs are available, leading to a more pipelined execution model.
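
The pipelined behavior can be illustrated with a minimal sketch in plain Python. It does not use DataFusion for Ray or reflect its internals; the operators `scan`, `filter_even`, and `double`, and the `events` log, are hypothetical names chosen for illustration. Each operator is a generator, so a batch flows through the whole pipeline before the next batch is read.

```python
# Illustrative only: plain Python generators, not DataFusion for Ray internals.
from typing import Iterable, Iterator, List

Batch = List[int]
events: List[str] = []  # records the order in which operators touch batches


def scan(batches: Iterable[Batch]) -> Iterator[Batch]:
    """Source operator: yields one batch at a time."""
    for i, batch in enumerate(batches):
        events.append(f"scan {i}")
        yield batch


def filter_even(batches: Iterator[Batch]) -> Iterator[Batch]:
    """Keeps even values; runs as soon as a batch arrives from its input."""
    for i, batch in enumerate(batches):
        events.append(f"filter {i}")
        yield [v for v in batch if v % 2 == 0]


def double(batches: Iterator[Batch]) -> Iterator[Batch]:
    """Projection operator: doubles every value in each batch."""
    for i, batch in enumerate(batches):
        events.append(f"project {i}")
        yield [v * 2 for v in batch]


def run_pipeline(source: Iterable[Batch]) -> List[Batch]:
    """Chains the operators and drains the pipeline batch by batch."""
    events.clear()
    return list(double(filter_even(scan(source))))


result = run_pipeline([[1, 2, 3, 4], [5, 6, 7, 8]])
```

After the run, `events` shows batch 0 passing through all three operators before batch 1 is scanned, which is the interleaving that pipelined execution implies.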

### Batch Execution

_Note: Batch Execution is not implemented yet. Tracking issue: _

In this mode, execution follows a staged model similar to Apache Spark. Each query stage runs to completion, producing
intermediate shuffle files that are persisted and used as input for the next stage.
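
Because this mode is not implemented yet, the following is only a conceptual sketch of a staged model in plain Python, not a preview of the project's design. The functions `map_stage`, `reduce_stage`, and `run_staged`, the JSON file format, and the byte-sum partitioning are all assumptions made for illustration. The first stage finishes and persists its shuffle files before the second stage reads them.

```python
# Conceptual sketch only: not DataFusion for Ray code or its planned design.
import json
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Tuple

Row = Tuple[str, int]


def map_stage(rows: List[Row], num_partitions: int, shuffle_dir: str) -> List[Path]:
    """Stage 1: partition rows by key and persist one shuffle file per partition."""
    partitions: Dict[int, List[List]] = defaultdict(list)
    for key, value in rows:
        # Deterministic stand-in for a hash function (str hashes vary per process).
        partitions[sum(key.encode()) % num_partitions].append([key, value])
    paths = []
    for p in range(num_partitions):
        path = Path(shuffle_dir) / f"shuffle-{p}.json"
        path.write_text(json.dumps(partitions[p]))
        paths.append(path)
    return paths


def reduce_stage(paths: List[Path]) -> Dict[str, int]:
    """Stage 2: reads the persisted shuffle files and sums values per key."""
    totals: Dict[str, int] = defaultdict(int)
    for path in paths:
        for key, value in json.loads(path.read_text()):
            totals[key] += value
    return dict(totals)


def run_staged(rows: List[Row], num_partitions: int = 2) -> Dict[str, int]:
    """Runs stage 1 to completion, then stage 2 over its shuffle files."""
    with tempfile.TemporaryDirectory() as shuffle_dir:
        paths = map_stage(rows, num_partitions, shuffle_dir)
        return reduce_stage(paths)


totals = run_staged([("a", 1), ("b", 2), ("a", 3), ("c", 4)])
```

Unlike the pipelined model, no reduce work starts until every shuffle file exists, which trades latency for the ability to rerun a stage from its persisted inputs.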

## Getting Started

See the [contributor guide] for instructions on building DataFusion for Ray.

Once installed, you can run queries using DataFusion's familiar API while leveraging the distributed execution
capabilities of Ray.

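```python
# from example in ./examples/http_csv.py
import ray

from datafusion_ray import DFRayContext, df_ray_runtime_env

# Start Ray with the runtime environment exported by datafusion_ray.
ray.init(runtime_env=df_ray_runtime_env)

ctx = DFRayContext()

# Register a remote CSV file as the table "aggregate_test_100".
ctx.register_csv(
    "aggregate_test_100",
    "https://github.com/apache/arrow-testing/raw/master/data/csv/aggregate_test_100.csv",
)

# Run a SQL query against the registered table and print the result.
df = ctx.sql("SELECT c1,c2,c3 FROM aggregate_test_100 LIMIT 5")
df.show()
```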

## Contributing

Contributions are welcome! Please open an issue or submit a pull request if you would like to contribute. See the
[contributor guide] for more information.

## License

DataFusion for Ray is licensed under Apache 2.0.

[contributor guide]: docs/contributing.md
