https://github.com/kwai/blaze

Blazing-fast query execution engine speaks Apache Spark language and has Arrow-DataFusion at its core.
big-data datafusion rust-lang spark

Host: GitHub
URL: https://github.com/kwai/blaze
Owner: kwai
License: apache-2.0
Created: 2021-06-28T07:29:43.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2025-04-11T08:57:30.000Z (6 months ago)
Last Synced: 2025-04-12T08:12:15.201Z (6 months ago)
Topics: big-data, datafusion, rust-lang, spark
Language: Rust
Homepage: https://blaze-project.github.io/
Size: 9.38 MB
Stars: 1,440
Watchers: 23
Forks: 151
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

README

# BLAZE

[![TPC-DS](https://github.com/blaze-init/blaze/actions/workflows/tpcds.yml/badge.svg?branch=master)](https://github.com/blaze-init/blaze/actions/workflows/tpcds.yml)
[![master-ce7-builds](https://github.com/blaze-init/blaze/actions/workflows/build-ce7-releases.yml/badge.svg?branch=master)](https://github.com/blaze-init/blaze/actions/workflows/build-ce7-releases.yml)

![dev/blaze-logo.png](./dev/blaze-logo.png)

The Blaze accelerator for Apache Spark leverages native vectorized execution to accelerate query processing. It combines
the power of the [Apache DataFusion](https://arrow.apache.org/datafusion/) library and the scale of the Spark distributed
computing framework.

Blaze takes a fully optimized physical plan from Spark, maps it into a DataFusion execution plan, and performs the native
plan computation in Spark executors.

Blaze is composed of the following high-level components:

- **Spark Extension**: hooks the whole accelerator into the Spark execution lifecycle.
- **Spark Shims**: specialized code for different versions of Spark.
- **Native Engine**: implements the native engine in Rust, including:
  - ExecutionPlan protobuf specification
  - JNI gateway
  - Customized operators, expressions, functions

Based on the inherent well-defined extensibility of DataFusion, Blaze can be easily extended to support:

- Various object stores.
- Operators.
- Simple and Aggregate functions.
- File formats.

We encourage you to [extend DataFusion](https://github.com/apache/arrow-datafusion) capabilities directly and then add
support in Blaze with simple modifications to plan-serde and extension translation.

## Build from source

To build Blaze, please follow the steps below:

1. Install Rust

The native execution library is written in Rust, so you need to install Rust (nightly) before
compiling. We recommend using [rustup](https://rustup.rs/).

2. Install Protobuf

Ensure `protoc` is available on your `PATH`. Protobuf can be installed via your Linux system package
manager (or Homebrew on macOS), or downloaded and built manually from https://github.com/protocolbuffers/protobuf/releases .

3. Install JDK+Maven

Blaze has been well tested on JDK 8 and Maven 3.5, and should work fine with higher versions.

4. Check out the source code.

```shell
git clone git@github.com:kwai/blaze.git
cd blaze
```

5. Build the project.

Specify the shims package for the Spark version you would like to run on.

The following shims are currently supported:

* spark-3.0 - for Spark 3.0.x
* spark-3.1 - for Spark 3.1.x
* spark-3.2 - for Spark 3.2.x
* spark-3.3 - for Spark 3.3.x
* spark-3.4 - for Spark 3.4.x
* spark-3.5 - for Spark 3.5.x

You can build Blaze either in pre mode for debugging or in release mode to unlock its full
potential.

```shell
SHIM=spark-3.3 # or spark-3.0/spark-3.1/spark-3.2/spark-3.3/spark-3.4/spark-3.5
MODE=release # or pre
mvn clean package -P"${SHIM}" -P"${MODE}"
```

To skip building the native library (if it has already been built; you can check for it in `native-engine/_build/${MODE}`), add `-DskipBuildNative`:

```shell
SHIM=spark-3.3 # or spark-3.0/spark-3.1/spark-3.2/spark-3.3/spark-3.4/spark-3.5
MODE=release # or pre
mvn clean package -P"${SHIM}" -P"${MODE}" -DskipBuildNative
```

After the build is finished, a fat Jar package that contains all the dependencies will be generated in the `target`
directory.
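Before running the build commands above, it can help to confirm that the required tools are on `PATH` and that `SHIM` and `MODE` hold values this README lists. The following is a minimal sketch under those assumptions; the function names are hypothetical helpers for illustration and are not part of Blaze.

```shell
#!/usr/bin/env bash
# Hypothetical pre-build checks; these helpers are illustrative, not part of Blaze.

# Shim profiles listed in this README.
SUPPORTED_SHIMS="spark-3.0 spark-3.1 spark-3.2 spark-3.3 spark-3.4 spark-3.5"

# Succeeds if $1 is one of the shim profiles listed above.
is_supported_shim() {
  local candidate="$1" shim
  for shim in $SUPPORTED_SHIMS; do
    [ "$shim" = "$candidate" ] && return 0
  done
  return 1
}

# Succeeds if $1 is a build mode named in this README.
is_supported_mode() {
  case "$1" in
    release|pre) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints each tool from the argument list that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

A possible use is to run `missing_tools rustc cargo protoc java mvn` first, and to invoke `mvn clean package` only when `is_supported_shim "$SHIM"` and `is_supported_mode "$MODE"` both succeed.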

## Build with docker

You can use the following command to build a CentOS 7 compatible release:
```shell
SHIM=spark-3.3 MODE=release ./release-docker.sh
```

## Run Spark Job with Blaze Accelerator

This section describes how to submit and configure a Spark Job with Blaze support.

1. Move the Blaze JAR package to the Spark client classpath (normally `spark-xx.xx.xx/jars/`).

2. Add the following settings to the Spark configuration in `spark-xx.xx.xx/conf/spark-defaults.conf`:

```properties
spark.blaze.enable true
spark.sql.extensions org.apache.spark.sql.blaze.BlazeSparkSessionExtension
spark.shuffle.manager org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager
spark.memory.offHeap.enabled false

# suggested executor memory configuration
spark.executor.memory 4g
spark.executor.memoryOverhead 4096
```

3. Submit a query with spark-sql, or with other tools such as spark-thriftserver:
```shell
spark-sql -f tpcds/q01.sql
```
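As an alternative to editing `spark-defaults.conf`, the same settings can be passed per job through Spark's standard `--conf key=value` option. The sketch below converts `key value` lines into such arguments. The helper name `props_to_conf_args` is hypothetical and not part of Blaze or Spark.

```shell
# Hypothetical helper (not part of Blaze or Spark): reads "key value" lines in
# the style of spark-defaults.conf from stdin and prints "--conf" and
# "key=value" on alternating lines. Blank lines and '#' comments are skipped.
props_to_conf_args() {
  while read -r key value; do
    case "$key" in
      ''|'#'*) continue ;;
    esac
    printf '%s\n' "--conf" "$key=$value"
  done
}
```

In bash, the output can be collected with `mapfile -t args < <(props_to_conf_args < blaze.properties)` and then passed as `spark-sql "${args[@]}" -f tpcds/q01.sql`, where `blaze.properties` is an assumed file holding the settings shown above.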

## Integrate with Apache Celeborn
Blaze supports Celeborn integration. Use the following configuration to enable shuffling with Celeborn:

```properties

# change the Celeborn endpoint and storage directory to the correct locations
spark.shuffle.manager org.apache.spark.sql.execution.blaze.shuffle.celeborn.BlazeCelebornShuffleManager
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.celeborn.master.endpoints localhost:9097
spark.celeborn.client.spark.shuffle.writer hash
spark.celeborn.client.push.replicate.enabled false
spark.celeborn.storage.availableTypes HDFS
spark.celeborn.storage.hdfs.dir hdfs:///home/celeborn
spark.sql.adaptive.localShuffleReader.enabled false
```

## Integrate with Apache Uniffle
Blaze supports integration with Apache Uniffle, a high-performance remote shuffle service for Apache Spark.

To enable Uniffle as the shuffle manager in Blaze, configure your Spark application with the following settings in
`spark-defaults.conf` or via Spark submit options:

```properties

spark.shuffle.manager org.apache.spark.sql.execution.blaze.shuffle.uniffle.BlazeUniffleShuffleManager
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.rss.coordinator.quorum :19999,:19999
spark.rss.enabled true
```
Notes:

* Uniffle Client Dependency: Ensure the Uniffle client library (e.g., `rss-client-spark3-shaded-0.9.2.jar` for Uniffle 0.9.2 or later) is included in your Spark application's classpath.
* Coordinator Endpoints: Replace `:19999` with the actual Uniffle coordinator address in your cluster.
* For detailed setup and advanced configuration, refer to the [Apache Uniffle Documentation](https://uniffle.apache.org/docs/client-guide).
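The classpath requirement in the notes above can be checked before submitting a job. This is a minimal sketch, assuming the client JAR is placed in the Spark `jars/` directory. The helper name `has_jar` is hypothetical.

```shell
# Hypothetical helper (not part of Blaze or Uniffle): succeeds if directory $1
# contains at least one regular file whose name matches the glob pattern $2.
has_jar() {
  local dir="$1" pattern="$2" f
  # $pattern is intentionally unquoted so the shell expands the glob.
  for f in "$dir"/$pattern; do
    [ -f "$f" ] && return 0
  done
  return 1
}
```

For example, `has_jar "$SPARK_HOME/jars" 'rss-client-spark3-shaded-*.jar'` fails when no matching client JAR is present, assuming `SPARK_HOME` points at your Spark installation.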

## Performance

Check [TPC-H Benchmark Results](./benchmark-results/tpch.md).
The latest benchmark result shows that Blaze saves more than 50% of query time on TPC-H 1 TB datasets compared with vanilla Spark 3.5.

Stay tuned and join us for more upcoming thrilling numbers.

TPC-H Query time:
![tpch-blaze400-spark351.png](./benchmark-results/tpch-blaze400-spark351.png)

We also encourage you to benchmark Blaze and share the results with us. 🤗
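When reporting your own results, the time saved can be computed as (baseline - accelerated) / baseline. The sketch below uses `awk` for the arithmetic. The function name `time_saved_pct` is hypothetical, and the inputs are whatever durations you measured, in the same unit.

```shell
# Hypothetical helper: prints the percentage of time saved, given a baseline
# duration ($1) and an accelerated duration ($2) in the same unit.
time_saved_pct() {
  awk -v base="$1" -v accel="$2" \
    'BEGIN { printf "%.1f\n", (base - accel) / base * 100 }'
}
```

With illustrative inputs, `time_saved_pct 200 80` prints `60.0`, meaning the accelerated run took 60% less time than the baseline.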

## Community

We're using [Discussions](https://github.com/blaze-init/blaze/discussions) to connect with other members
of our community. We hope that you:
- Ask questions you're wondering about.
- Share ideas.
- Engage with other community members.
- Welcome others who are open-minded. Remember that this is a community we build together 💪 .

## License

Blaze is licensed under the Apache 2.0 License. A copy of the license
[can be found here.](LICENSE.txt)
