Rust DataFusion Server
- Host: GitHub
- URL: https://github.com/sal-openlab/datafusion-server
- Owner: sal-openlab
- License: mit
- Created: 2023-09-20T12:44:31.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-05-11T07:01:52.000Z (8 months ago)
- Last Synced: 2024-05-11T11:22:52.294Z (8 months ago)
- Topics: arrow, datafusion, rust, sql
- Language: Rust
- Homepage:
- Size: 1.49 MB
- Stars: 6
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# datafusion-server crate
[![crates.io](https://img.shields.io/crates/v/datafusion-server?color=blue)](https://crates.io/crates/datafusion-server)
[![license](https://img.shields.io/github/license/sal-openlab/datafusion-server?color=blue)](./LICENSE)
[![build](https://img.shields.io/github/actions/workflow/status/sal-openlab/datafusion-server/push-trigger.yml?logo=github)](https://github.com/sal-openlab/datafusion-server/actions?query=workflow%3Apush-trigger)
[![pages](https://img.shields.io/github/actions/workflow/status/sal-openlab/datafusion-server/doc-trigger.yml?logo=github&label=docs)](https://sal-openlab.github.io/datafusion-server/)

A multi-session query server for a variety of data sources, implemented in Rust.
* Asynchronous architecture built on the [Tokio](https://tokio.rs/) ecosystem
* [Apache Arrow](https://arrow.apache.org/) with [Apache DataFusion](https://arrow.apache.org/datafusion/)
    + Supports multiple data sources with SQL queries
* Python plugin feature for data source connectors and post processors
* Horizontal scaling architecture between servers using
  the [Arrow Flight](https://arrow.apache.org/docs/format/Flight.html) gRPC feature

Please see the **[Documentation](https://sal-openlab.github.io/datafusion-server/introduction/)** for an introductory
tutorial and a full usage guide. Additionally,
the [REST API documentation](https://sal-openlab.github.io/datafusion-server/api/v1/) is available as an
OpenAPI specification. Also, refer to
the [CHANGELOG](https://github.com/sal-openlab/datafusion-server/blob/main/CHANGELOG.md) for the latest information.

## System Overview
![System Diagram](https://github.com/sal-openlab/datafusion-server/blob/main/doc/system-diagram.svg?raw=true)
## License
Licensed under the [MIT License](LICENSE).
Copyright © 2022 - 2024 SAL Ltd. - https://sal.co.jp
## Supported environments
* Linux
* BSD based Unix incl. macOS / Mac OS X
* SVR based Unix
* Windows incl. WSL2 / Cygwin

and other [LLVM](https://llvm.org/docs/GettingStarted.html#hardware) supported environments.
## Using the pre-built Docker image (currently available for the amd64 architecture only)
### Prerequisites
* Docker CE / EE v20+
### Pull container image from GitHub container registry
```sh
$ docker pull ghcr.io/sal-openlab/datafusion-server/datafusion-server:latest
```

or pull the version built without the Python plugin:
```sh
$ docker pull ghcr.io/sal-openlab/datafusion-server/datafusion-server-without-plugin:latest
```

### Executing container
```sh
$ docker run -d --rm \
    -p 4000:4000 \
    -v ./data:/var/datafusion-server/data \
    --name datafusion-server \
    ghcr.io/sal-openlab/datafusion-server/datafusion-server:latest
```

If you are only using the sample data in the container, omit `-v ./data:/var/datafusion-server/data`.
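
Once the container is running, a plain TCP connect is a quick way to confirm that something is listening on the published port. The following is a minimal sketch using only the Rust standard library. The helper name `server_is_listening` and the timeout value are illustrative assumptions and are not part of datafusion-server.

```rust
use std::net::{TcpStream, ToSocketAddrs};
use std::time::Duration;

/// Returns true if a TCP connection to `addr` (for example "127.0.0.1:4000")
/// succeeds within `timeout`.
/// Hypothetical helper for illustration only; not part of datafusion-server.
pub fn server_is_listening(addr: &str, timeout: Duration) -> bool {
    // An address that cannot be parsed or resolved counts as "not listening".
    let Ok(mut addrs) = addr.to_socket_addrs() else {
        return false;
    };
    // Try every resolved address and stop at the first successful connect.
    addrs.any(|a| TcpStream::connect_timeout(&a, timeout).is_ok())
}
```

Calling `server_is_listening("127.0.0.1:4000", Duration::from_secs(1))` after `docker run` only shows that the port accepts connections. It does not verify that the REST API itself is healthy.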
## Build the container yourself
### Prerequisites
* Docker CE / EE v20+
### Build two containers, datafusion-server and datafusion-server-without-plugin
```sh
$ cd
$ ./make-containers.sh
```

### Executing container
```sh
$ docker run -d --rm \
    -p 4000:4000 \
    -v ./bin/data:/var/datafusion-server/data \
    --name datafusion-server \
    datafusion-server:0.19.3
```

If you are only using the sample data in the container, omit `-v ./bin/data:/var/datafusion-server/data`.
## Build from source code for use in your project
### Prerequisites
* Rust Toolchain 1.76+ (Edition 2021) from https://www.rust-lang.org
* _or_ the official Rust container from https://hub.docker.com/_/rust

### How to run
```sh
$ cargo init server-executor
$ cd server-executor
```

#### Example of Cargo.toml
```toml
[package]
name = "server-executor"
version = "0.1.0"
edition = "2021"

[dependencies]
datafusion-server = "0.19.3"
clap = { version = "4.5", features = ["derive"] }
```

#### Example of src/main.rs
```rust
use std::path::PathBuf;

use clap::Parser;
use datafusion_server::settings::Settings;

#[derive(Parser)]
#[clap(author, version, about = "Arrow and other large datasets web server", long_about = None)]
struct Args {
    #[clap(
        long,
        value_parser,
        short = 'f',
        value_name = "FILE",
        help = "Configuration file",
        default_value = "./config.toml"
    )]
    config: PathBuf,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Parse the command-line arguments (`--config` / `-f`).
    let args = Args::parse();
    // Load the server settings from the configuration file.
    let settings = Settings::new_with_file(&args.config)?;
    // Start the server with the loaded settings.
    datafusion_server::execute(settings)?;
    Ok(())
}
```

For details, see [main.rs](bin/src/main.rs) and [Cargo.toml](bin/Cargo.toml).
#### Example of config.toml
```toml
# Configuration file of datafusion-server

[server]
port = 4000
flight_grpc_port = 50051
base_url = "/"
data_dir = "./data"
plugin_dir = "./plugins"

[session]
default_keep_alive = 3600 # in seconds
upload_limit_size = 20 # MB

[log]
# trace, debug, info, warn, error
level = "debug"
```

#### Debug build and run
```sh
$ cargo run
```

## datafusion-server with Python plugins feature
Requires Python interpreter v3.7+.
### How to run
#### Example of Cargo.toml
```toml
[dependencies]
datafusion-server = { version = "0.19.3", features = ["plugin"] }
```

#### Debug build and run
```sh
$ cargo run
```

### Release build with full optimization
#### Example of Cargo.toml
```toml
[profile.release]
opt-level = 'z'
strip = true
lto = "fat"
codegen-units = 1

[dependencies]
datafusion-server = { version = "0.19.3", features = ["plugin"] }
```

#### Build for release
```sh
$ cargo build --release
```

### Clean workspace
```sh
$ cargo clean
```

## Usage
### Multiple data sources with SQL query
* Many kinds of data source formats can be used (Parquet, JSON, ndJSON, CSV, ...).
* Data can be retrieved from the local file system and from external REST services.
    + Processing by JSONPath can be performed if necessary.
* Queries can be executed across multiple data sources.
    + The SQL query engine uses Arrow DataFusion.
        - See https://arrow.apache.org/datafusion/user-guide/sql/index.html for details.
* Responses can be returned in Arrow, JSON and CSV formats.

#### Example (local file)
```sh
$ curl -X "POST" "http://localhost:4000/dataframe/query" \
     -H 'Content-Type: application/json' \
     -d $'
{
  "dataSources": [
    {
      "format": "csv",
      "name": "sales",
      "location": "file:///superstore.csv",
      "options": {
        "inferSchemaRows": 100,
        "hasHeader": true
      }
    }
  ],
  "query": {
    "sql": "SELECT * FROM sales"
  },
  "response": {
    "format": "json"
  }
}'
```

#### Example (remote REST API)
```sh
$ curl -X "POST" "http://localhost:4000/dataframe/query" \
     -H 'Content-Type: application/json' \
     -H 'Accept: text/csv' \
     -d $'
{
  "dataSources": [
    {
      "format": "json",
      "name": "population",
      "location": "https://datausa.io/api/data?drilldowns=State&measures=Population",
      "options": {
        "jsonPath": "$.data[*]"
      }
    }
  ],
  "query": {
    "sql": "SELECT * FROM population WHERE \\"ID Year\\">=2020"
  }
}'
```

#### Example (Python datasource connector plugin)
```sh
$ curl -X "POST" "http://localhost:4000/dataframe/query" \
     -H 'Content-Type: application/json' \
     -H 'Accept: application/json' \
     -d $'
{
  "dataSources": [
    {
      "format": "arrow",
      "name": "example",
      "location": "excel://example-workbook.xlsx/Sheet1",
      "pluginOptions": {
        "skipRows": 2
      }
    }
  ],
  "query": {
    "sql": "SELECT * FROM example"
  }
}'
```
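
The requests above share one shape: a `dataSources` array plus a `query.sql` string, posted to `/dataframe/query`. The following is a minimal sketch of issuing the same kind of request from Rust using only the standard library. All type and function names (`DataSource`, `json_escape`, `build_query_body`, `build_http_request`, `post_query`) are hypothetical and are not part of the datafusion-server crate. The per-source `options` object is omitted for brevity, and the JSON and HTTP text are assembled by hand only to avoid extra dependencies; a real client would more likely use a JSON library and an HTTP client crate.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

/// One entry of the `dataSources` array (the `options` object is omitted here).
pub struct DataSource<'a> {
    pub format: &'a str,
    pub name: &'a str,
    pub location: &'a str,
}

/// Escapes a string so it can be embedded in a JSON string literal.
pub fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Builds the JSON body for `POST /dataframe/query`.
pub fn build_query_body(sources: &[DataSource<'_>], sql: &str) -> String {
    let entries: Vec<String> = sources
        .iter()
        .map(|s| {
            format!(
                r#"{{"format":"{}","name":"{}","location":"{}"}}"#,
                json_escape(s.format),
                json_escape(s.name),
                json_escape(s.location)
            )
        })
        .collect();
    format!(
        r#"{{"dataSources":[{}],"query":{{"sql":"{}"}}}}"#,
        entries.join(","),
        json_escape(sql)
    )
}

/// Builds a raw HTTP/1.1 request that asks for a JSON response.
pub fn build_http_request(host: &str, body: &str) -> String {
    format!(
        "POST /dataframe/query HTTP/1.1\r\nHost: {host}\r\n\
         Content-Type: application/json\r\nAccept: application/json\r\n\
         Content-Length: {len}\r\nConnection: close\r\n\r\n{body}",
        len = body.len()
    )
}

/// Sends the request to `addr` (for example "localhost:4000") and returns the
/// raw HTTP response text (status line, headers, and body).
pub fn post_query(addr: &str, body: &str) -> std::io::Result<String> {
    let mut stream = TcpStream::connect(addr)?;
    stream.write_all(build_http_request(addr, body).as_bytes())?;
    let mut response = String::new();
    // `Connection: close` lets us read until the server closes the socket.
    stream.read_to_string(&mut response)?;
    Ok(response)
}
```

Passing several `DataSource` entries to `build_query_body` registers each one under its `name`, so the SQL string can refer to all of them in a single query. Because `json_escape` handles the quoting, a statement containing quoted identifiers such as `"ID Year"` can be passed as an ordinary Rust string.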