https://github.com/arrow-udf/arrow-udf
A User-Defined Function Framework for Apache Arrow.
arrow python rust udf wasm
Last synced: 6 months ago
- Host: GitHub
- URL: https://github.com/arrow-udf/arrow-udf
- Owner: arrow-udf
- License: apache-2.0
- Created: 2023-12-13T11:45:30.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2025-03-22T16:35:09.000Z (7 months ago)
- Last Synced: 2025-04-03T04:59:57.712Z (6 months ago)
- Topics: arrow, python, rust, udf, wasm
- Language: Rust
- Homepage:
- Size: 908 KB
- Stars: 88
- Watchers: 9
- Forks: 17
- Open Issues: 11
Metadata Files:
- Readme: README.md
- License: LICENSE
# Arrow User-Defined Functions Framework
Easily create and run user-defined functions (UDF) on Apache Arrow.
You can define functions in Rust, Python, Java or JavaScript.
The functions can be executed natively, in WebAssembly, or in a [remote server].

| Language | Native | WebAssembly | Remote |
| ---------- |------------------------------------|--------------------------|---------------------------|
| Rust | [arrow-udf] | [arrow-udf-runtime/wasm] | |
| Python | [arrow-udf-runtime/python] | | [arrow-udf-remote/python] |
| JavaScript | [arrow-udf-runtime/javascript] | | |
| Java | | | [arrow-udf-remote/java] |

[remote server]: ./arrow-udf-runtime/src/remote
[arrow-udf]: ./arrow-udf
[arrow-udf-runtime/python]: ./arrow-udf-runtime/src/python
[arrow-udf-runtime/javascript]: ./arrow-udf-runtime/src/javascript
[arrow-udf-runtime/wasm]: ./arrow-udf-runtime/src/wasm
[arrow-udf-remote/python]: ./arrow-udf-remote/python
[arrow-udf-remote/java]: ./arrow-udf-remote/java

> [!NOTE]
> [arrow-udf] generates `RecordBatch` Rust functions from scalar functions. It can be used in more general contexts,
> whenever you need to work with Arrow data in Rust, and not only for user-provided code.
>
> Other crates are more focused on providing runtimes or protocols for running user-provided code.

- `arrow-udf`: You call `fn(&RecordBatch) -> RecordBatch` directly, as if you wrote it by hand.
- `arrow-udf-runtime/python`/`arrow-udf-runtime/javascript`: You first `add_function` to a `Runtime`, and then call it with the `Runtime`.
- `arrow-udf-runtime/wasm`: You first create a `Runtime` from a compiled WASM binary, then `find_function` and call it.
- `arrow-udf-runtime/remote`: You start a `Client` to call the function running in a remote `Server` process.

You can also use this library to add custom functions to DuckDB; see [arrow-udf-duckdb-example].

[arrow-udf-duckdb-example]: ./arrow-udf-duckdb-example
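To make the idea of turning a scalar function into a batch function more concrete, here is a minimal, std-only sketch. It does not use the `arrow-udf` crate or show the code it generates: `Int32Column` and `lift2` are hypothetical names, a `Vec<Option<i32>>` stands in for a nullable Arrow column, and propagating nulls is an assumed policy.

```rust
/// Scalar function: greatest common divisor via Euclid's algorithm.
fn gcd(mut a: i32, mut b: i32) -> i32 {
    while b != 0 {
        (a, b) = (b, a % b);
    }
    a
}

/// Hypothetical stand-in for a nullable Arrow Int32 column:
/// one `Option<i32>` per row, where `None` plays the role of a null slot.
type Int32Column = Vec<Option<i32>>;

/// Lift a two-argument scalar function to a whole-column function.
/// Assumption: a row is null in the output if either input is null.
fn lift2(f: impl Fn(i32, i32) -> i32, a: &[Option<i32>], b: &[Option<i32>]) -> Int32Column {
    assert_eq!(a.len(), b.len(), "columns must have the same row count");
    a.iter()
        .zip(b)
        .map(|(x, y)| match (x, y) {
            (Some(x), Some(y)) => Some(f(*x, *y)),
            _ => None,
        })
        .collect()
}

fn main() {
    let a = vec![Some(12), Some(25), None];
    let b = vec![Some(18), Some(10), Some(4)];
    // Apply the scalar `gcd` row by row across both columns.
    let out = lift2(gcd, &a, &b);
    println!("{:?}", out);
}
```

The real crates operate on Arrow arrays and `RecordBatch` values rather than plain vectors; the sketch only illustrates the row-wise lifting.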
## Extension Types
In addition to the standard types defined by Arrow, these crates also support the following data types through Arrow's [extension type](https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types). When using extension types, you need to add the `ARROW:extension:name` key to the field's metadata.
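As a rough illustration of that metadata lookup, the std-only sketch below classifies a field from a plain string map. It assumes field metadata behaves like a `HashMap<String, String>`; the `Extension` enum and `detect_extension` function are hypothetical and not part of these crates. The extension names are the defaults listed in this section's table.

```rust
use std::collections::HashMap;

/// The two extension types described in this section.
#[derive(Debug, PartialEq)]
enum Extension {
    Json,
    Decimal,
}

/// Metadata key that carries the extension name.
const EXTENSION_KEY: &str = "ARROW:extension:name";

/// Classify a field from its metadata map (hypothetical helper).
/// Returns `None` when the key is missing or the name is not recognized.
fn detect_extension(metadata: &HashMap<String, String>) -> Option<Extension> {
    match metadata.get(EXTENSION_KEY).map(String::as_str) {
        Some("arrowudf.json") => Some(Extension::Json),
        Some("arrowudf.decimal") => Some(Extension::Decimal),
        _ => None,
    }
}

fn main() {
    // A field tagged as JSON through its metadata.
    let json_meta: HashMap<String, String> =
        [(EXTENSION_KEY.to_string(), "arrowudf.json".to_string())].into();
    // A field with no extension metadata at all.
    let plain_meta: HashMap<String, String> = HashMap::new();

    println!("{:?}", detect_extension(&json_meta));
    println!("{:?}", detect_extension(&plain_meta));
}
```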
| Extension Type | Physical Type | `ARROW:extension:name` |
| -------------- | ------------------------- | ------------------------ |
| JSON | Utf8, Binary, LargeBinary | `arrowudf.json` |
| Decimal | Utf8 | `arrowudf.decimal` |

Alternatively, you can configure the extension metadata key and values to look for when converting between Arrow and extension types:
```rust
let mut js_runtime = arrow_udf_runtime::javascript::Runtime::new().unwrap();
let converter = js_runtime.converter_mut();
converter.set_arrow_extension_key("Extension");
converter.set_json_extension_name("Variant");
converter.set_decimal_extension_name("Decimal");
```

### JSON Type
The JSON type is stored in a string array in text form.
```rust
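use arrow_array::StringArray;
use arrow_schema::{DataType, Field};
// `name` is the column name, supplied by the caller.
let json_field = Field::new(name, DataType::Utf8, true)
    .with_metadata([("ARROW:extension:name".into(), "arrowudf.json".into())].into());
let json_array = StringArray::from(vec![r#"{"key": "value"}"#]);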
```

### Decimal Type
Unlike the fixed-point decimal type built into Arrow, this decimal type represents floating-point numbers with arbitrary precision or scale, that is, the [unconstrained numeric](https://www.postgresql.org/docs/current/datatype-numeric.html#DATATYPE-NUMERIC-DECIMAL) in Postgres. The decimal type is stored in a string array in text form.
```rust
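use arrow_array::StringArray;
use arrow_schema::{DataType, Field};
// `name` is the column name, supplied by the caller.
let decimal_field = Field::new(name, DataType::Utf8, true)
    .with_metadata([("ARROW:extension:name".into(), "arrowudf.decimal".into())].into());
let decimal_array = StringArray::from(vec!["0.0001", "-1.23", "0"]);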
```

## Benchmarks
We have benchmarked the performance of function calls in different environments.
You can run the benchmarks with the following command:

```sh
cargo bench --bench bench
```

Performance comparison of calling `gcd` on a chunk of 1024 rows:
```
gcd/native 1.5237 µs x1
gcd/wasm 15.547 µs x10
gcd/js(quickjs) 85.007 µs x55
gcd/python 175.29 µs x115
```

## Who is using this library?
- [RisingWave]: A Distributed SQL Database for Stream Processing.
- [Databend]: An open-source cloud data warehouse that serves as a cost-effective alternative to Snowflake.

[RisingWave]: https://github.com/risingwavelabs/risingwave
[Databend]: https://github.com/datafuselabs/databend