# hudi-rs

The native Rust implementation for Apache Hudi, with Python API bindings.

The `hudi-rs` project aims to broaden the use of [Apache Hudi](https://github.com/apache/hudi) for a diverse range of
users and projects.

| Source | Downloads | Installation Command |
|-------------------------|-----------------------------|----------------------|
| [**PyPI.org**][pypi]    | [![][pypi-badge]][pypi]     | `pip install hudi`   |
| [**Crates.io**][crates] | [![][crates-badge]][crates] | `cargo add hudi` |

[pypi]: https://pypi.org/project/hudi/
[pypi-badge]: https://img.shields.io/pypi/dm/hudi?style=flat-square&color=51AEF3
[crates]: https://crates.io/crates/hudi
[crates-badge]: https://img.shields.io/crates/d/hudi?style=flat-square&color=163669

## Usage Examples

> [!NOTE]
> These examples expect a Hudi table to exist at `/tmp/trips_table`, created by following
> the [quick start guide](https://hudi.apache.org/docs/quick-start-guide).

### Snapshot Query

A snapshot query reads the latest version of the data from the table. The table API also accepts partition filters.

#### Python

```python
from hudi import HudiTableBuilder
import pyarrow as pa

hudi_table = HudiTableBuilder.from_base_uri("/tmp/trips_table").build()
batches = hudi_table.read_snapshot(filters=[("city", "=", "san_francisco")])

# convert to PyArrow table
arrow_table = pa.Table.from_batches(batches)
result = arrow_table.select(["rider", "city", "ts", "fare"])
print(result)
```

#### Rust

```rust
use hudi::error::Result;
use hudi::table::builder::TableBuilder as HudiTableBuilder;
use arrow::compute::concat_batches;

#[tokio::main]
async fn main() -> Result<()> {
    let hudi_table = HudiTableBuilder::from_base_uri("/tmp/trips_table").build().await?;
    let batches = hudi_table.read_snapshot(&[("city", "=", "san_francisco")]).await?;

    // combine the record batches into one so columns can be accessed by name
    let batch = concat_batches(&batches[0].schema(), &batches)?;
    let columns = vec!["rider", "city", "ts", "fare"];
    for col_name in columns {
        let idx = batch.schema().index_of(col_name).unwrap();
        println!("{}: {}", col_name, batch.column(idx));
    }
    Ok(())
}
```

To run a read-optimized (RO) query on Merge-on-Read (MOR) tables, set `hoodie.read.use.read_optimized.mode` when creating the table instance.

#### Python

```python
hudi_table = (
    HudiTableBuilder
    .from_base_uri("/tmp/trips_table")
    .with_option("hoodie.read.use.read_optimized.mode", "true")
    .build()
)
```

#### Rust

```rust
let hudi_table =
    HudiTableBuilder::from_base_uri("/tmp/trips_table")
        .with_option("hoodie.read.use.read_optimized.mode", "true")
        .build()
        .await?;
```

> [!NOTE]
> Currently, reading MOR tables is limited to tables with Parquet data blocks.

### Time-Travel Query

A time-travel query reads the data from the table as of a specific timestamp. The table API also accepts partition filters.

#### Python

```python
batches = (
    hudi_table
    .read_snapshot_as_of("20241231123456789", filters=[("city", "=", "san_francisco")])
)
```

#### Rust

```rust
let batches = hudi_table
    .read_snapshot_as_of("20241231123456789", &[("city", "=", "san_francisco")])
    .await?;
```

### Incremental Query

An incremental query reads the changed data from the table for a given time range.

#### Python

```python
# read the records between t1 (exclusive) and t2 (inclusive)
batches = hudi_table.read_incremental_records(t1, t2)

# read the records after t1
batches = hudi_table.read_incremental_records(t1)
```

#### Rust

```rust
// read the records between t1 (exclusive) and t2 (inclusive)
let batches = hudi_table.read_incremental_records(t1, Some(t2)).await?;

// read the records after t1
let batches = hudi_table.read_incremental_records(t1, None).await?;
```

> [!NOTE]
> Currently, the only supported format for the timestamp arguments is the Hudi Timeline format: `yyyyMMddHHmmssSSS` or `yyyyMMddHHmmss`.
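
For illustration, here is the Rust variant again with concrete instants in that format (the instant values below are hypothetical):

```rust
// hypothetical instants in Hudi Timeline format (yyyyMMddHHmmssSSS)
let t1 = "20241231123456789";
let t2 = "20250101120000000";

// read the records between t1 (exclusive) and t2 (inclusive)
let batches = hudi_table.read_incremental_records(t1, Some(t2)).await?;
```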

## Query Engine Integration

Hudi-rs provides APIs to support integration with query engines. The sections below highlight some commonly used APIs.

### Table API

Create a Hudi table instance using its constructor or the `TableBuilder` API.

| Stage | API | Description |
|-----------------|-------------------------------------------|--------------------------------------------------------------------------------|
| Query planning | `get_file_slices()` | For snapshot query, get a list of file slices. |
| | `get_file_slices_splits()` | For snapshot query, get a list of file slices in splits. |
| | `get_file_slices_as_of()` | For time-travel query, get a list of file slices at a given time. |
| | `get_file_slices_splits_as_of()` | For time-travel query, get a list of file slices in splits at a given time. |
| | `get_file_slices_between()` | For incremental query, get a list of changed file slices within a given time range. |
| Query execution | `create_file_group_reader_with_options()` | Create a file group reader instance with the table instance's configs. |
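
To make the planning flow concrete, here is a minimal sketch of listing file slices via the `TableBuilder` API. The filter-argument shape mirrors `read_snapshot()` above, but the exact signature and return type of `get_file_slices()` are assumptions; consult the crate docs for the authoritative API.

```rust
use hudi::error::Result;
use hudi::table::builder::TableBuilder as HudiTableBuilder;

#[tokio::main]
async fn main() -> Result<()> {
    let hudi_table = HudiTableBuilder::from_base_uri("/tmp/trips_table").build().await?;

    // Planning: list the file slices a snapshot query would scan,
    // pruned by a partition filter (the argument shape is an assumption).
    let file_slices = hudi_table.get_file_slices(&[("city", "=", "san_francisco")]).await?;
    println!("planned {} file slices", file_slices.len());
    Ok(())
}
```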

### File Group API

Create a Hudi file group reader instance using its constructor or the Hudi table API `create_file_group_reader_with_options()`.

| Stage | API | Description |
|-----------------|---------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Query execution | `read_file_slice()` | Read records from a given file slice; depending on the configs, read from the base file only, or from the base file plus log files, merging records using the configured strategy. |
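
Continuing the planning sketch above inside the same `main`, query execution could then read each planned file slice. Only the API names come from the tables above; the reader construction and the `read_file_slice()` signature below are assumptions for illustration.

```rust
// Execution: create a reader that inherits the table instance's configs
// (no extra options here), then read each planned file slice.
let reader = hudi_table.create_file_group_reader_with_options(Vec::<(&str, &str)>::new())?;
for file_slice in &file_slices {
    // assumed to return an Arrow RecordBatch with base and log records merged
    let batch = reader.read_file_slice(file_slice).await?;
    println!("read {} records", batch.num_rows());
}
```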

### Apache DataFusion

Enabling the `hudi` crate's `datafusion` feature provides a [DataFusion](https://datafusion.apache.org/)
extension for querying Hudi tables.

Add the `hudi` crate with the `datafusion` feature to your application:

```shell
cargo new my_project --bin && cd my_project
cargo add tokio@1 datafusion@43
cargo add hudi --features datafusion
```

Update `src/main.rs` with the code snippet below then `cargo run`.

```rust
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::prelude::{DataFrame, SessionContext};
use hudi::HudiDataSource;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    let hudi = HudiDataSource::new_with_options(
        "/tmp/trips_table",
        [("hoodie.read.input.partitions", "5")],
    )
    .await?;
    ctx.register_table("trips_table", Arc::new(hudi))?;
    let df: DataFrame = ctx
        .sql("SELECT * FROM trips_table WHERE city = 'san_francisco'")
        .await?;
    df.show().await?;
    Ok(())
}
```

### Other Integrations

Hudi is also integrated with:

- [Daft](https://www.getdaft.io/projects/docs/en/stable/integrations/hudi/)
- [Ray](https://docs.ray.io/en/latest/data/api/doc/ray.data.read_hudi.html#ray.data.read_hudi)

### Work with Cloud Storage

Ensure cloud storage credentials are set as environment variables, e.g., `AWS_*`, `AZURE_*`, or `GOOGLE_*`;
the relevant variables are then picked up automatically. The target table's base URI is handled according
to its scheme, such as `s3://`, `az://`, or `gs://`.

Alternatively, you can pass the storage configuration as options via Table APIs.

#### Python

```python
from hudi import HudiTableBuilder

hudi_table = (
    HudiTableBuilder
    .from_base_uri("s3://bucket/trips_table")
    .with_option("aws_region", "us-west-2")
    .build()
)
```

#### Rust

```rust
use hudi::error::Result;
use hudi::table::builder::TableBuilder as HudiTableBuilder;

#[tokio::main]
async fn main() -> Result<()> {
    let hudi_table =
        HudiTableBuilder::from_base_uri("s3://bucket/trips_table")
            .with_option("aws_region", "us-west-2")
            .build()
            .await?;
    Ok(())
}
```

## Contributing

Check out the [contributing guide](./CONTRIBUTING.md) for all the details about making contributions to the project.