https://github.com/datafusion-contrib/datafusion-objectstore-s3
S3 as an ObjectStore for DataFusion
- Host: GitHub
- URL: https://github.com/datafusion-contrib/datafusion-objectstore-s3
- Owner: datafusion-contrib
- License: apache-2.0
- Created: 2022-01-04T16:14:51.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2023-03-12T05:54:55.000Z (over 3 years ago)
- Last Synced: 2025-10-12T00:58:18.000Z (9 months ago)
- Topics: datafusion, rust
- Language: Rust
- Homepage:
- Size: 73.2 KB
- Stars: 66
- Watchers: 10
- Forks: 14
- Open Issues: 12
Metadata Files:
- Readme: README.md
- License: LICENSE
# DataFusion-ObjectStore-S3
S3 as an ObjectStore for [DataFusion](https://github.com/apache/arrow-datafusion).
## Querying files on S3 with DataFusion
This crate implements the DataFusion `ObjectStore` trait for AWS S3 and other implementers of the S3 API. We use the official [AWS Rust SDK](https://github.com/awslabs/aws-sdk-rust) to interact with S3. While it is our understanding that the AWS APIs we use are relatively stable, we can make no assurances about API stability, either on AWS's part or within this crate. This crate's API is tightly coupled to DataFusion, a fast-moving project, so we will make changes in line with upstream changes.
## Examples
Examples for querying AWS and other implementors, such as MinIO, are shown below.
Load credentials from the default AWS credential provider (such as environment variables or `~/.aws/credentials`):
```rust
let s3_file_system = Arc::new(S3FileSystem::default().await);
```
`S3FileSystem::default()` is a convenience wrapper for `S3FileSystem::new(None, None, None, None, None, None)`.
Connect to an implementor of the S3 API (MinIO, in this case) using an access key and secret:
```rust
// Example credentials provided by MinIO
const MINIO_ACCESS_KEY_ID: &str = "AKIAIOSFODNN7EXAMPLE";
const MINIO_SECRET_ACCESS_KEY: &str = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY";
const PROVIDER_NAME: &str = "Static";
const MINIO_ENDPOINT: &str = "http://localhost:9000";

let s3_file_system = S3FileSystem::new(
    Some(SharedCredentialsProvider::new(Credentials::new(
        MINIO_ACCESS_KEY_ID,
        MINIO_SECRET_ACCESS_KEY,
        None, // Session token
        None, // Expiry
        PROVIDER_NAME,
    ))), // Credentials provider
    None, // Region
    Some(Endpoint::immutable(Uri::from_static(MINIO_ENDPOINT))), // Endpoint
    None, // RetryConfig
    None, // AsyncSleep
    None, // TimeoutConfig
)
.await;
```
Using DataFusion's `ListingTableConfig` we register a table into a DataFusion `ExecutionContext` so that it can be queried.
```rust
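// `s3_file_system` must be wrapped in an `Arc`, as in the first example above.
let filename = "data/alltypes_plain.snappy.parquet";

// Infer the file format and schema from the object on S3.
let config = ListingTableConfig::new(s3_file_system, filename).infer().await?;
let table = ListingTable::try_new(config)?;

let mut ctx = ExecutionContext::new();
ctx.register_table("tbl", Arc::new(table))?;

let df = ctx.sql("SELECT * FROM tbl").await?;
// `show` is async and returns a `Result`, so it must be awaited.
df.show().await?;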
```
We can also register the `S3FileSystem` directly as an `ObjectStore` on an `ExecutionContext`. This provides an idiomatic way of creating `TableProviders` that can be queried.
```rust
let mut ctx = ExecutionContext::new();

// Register the store under the `s3` URI scheme.
ctx.register_object_store(
    "s3",
    Arc::new(S3FileSystem::default().await),
);

let input_uri = "s3://parquet-testing/data/alltypes_plain.snappy.parquet";

// Look up the store by scheme; `path` is the URI with the scheme stripped.
let (object_store, path) = ctx.object_store(input_uri)?;

let config = ListingTableConfig::new(object_store, path).infer().await?;
let table_provider: Arc<dyn TableProvider> = Arc::new(ListingTable::try_new(config)?);
```
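The scheme-based lookup used above can be pictured with a small standalone sketch. It uses only the standard library, and `split_uri` and `Registry` are hypothetical names chosen for illustration; they are not DataFusion's or this crate's API, and stores are represented by plain strings rather than real `ObjectStore` implementations.

```rust
use std::collections::HashMap;

/// Splits a URI such as `s3://bucket/key` into its scheme and the remainder.
/// Returns `None` when the input contains no `://` separator.
pub fn split_uri(uri: &str) -> Option<(&str, &str)> {
    uri.split_once("://")
}

/// Hypothetical registry that maps a URI scheme to a named store.
#[derive(Default)]
pub struct Registry {
    stores: HashMap<String, String>,
}

impl Registry {
    /// Registers `store` under `scheme`, returning the store it replaced, if any.
    pub fn register(&mut self, scheme: &str, store: &str) -> Option<String> {
        self.stores.insert(scheme.to_string(), store.to_string())
    }

    /// Finds the store registered for the scheme of `uri` and returns it
    /// together with the scheme-free path.
    pub fn resolve<'a>(&self, uri: &'a str) -> Option<(&str, &'a str)> {
        let (scheme, path) = split_uri(uri)?;
        let store = self.stores.get(scheme)?;
        Some((store.as_str(), path))
    }
}
```

In this sketch, registering a store under `"s3"` and resolving `s3://parquet-testing/data/alltypes_plain.snappy.parquet` yields that store along with the path `parquet-testing/data/alltypes_plain.snappy.parquet`, while a URI with an unregistered scheme resolves to `None`.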
## Testing
Tests are run against [MinIO](https://min.io/), which provides a containerized implementation of the Amazon S3 API.
First, fetch the test data, which is tracked as a git submodule:
```bash
git submodule update --init --recursive
```
Then start the MinIO container:
```bash
docker run \
  --detach \
  --rm \
  --publish 9000:9000 \
  --publish 9001:9001 \
  --name minio \
  --volume "$(pwd)/parquet-testing:/data" \
  --env "MINIO_ROOT_USER=AKIAIOSFODNN7EXAMPLE" \
  --env "MINIO_ROOT_PASSWORD=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
  quay.io/minio/minio server /data \
  --console-address ":9001"
```
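Because the container is started detached, the tests may begin before MinIO is listening. The following is a minimal sketch of a readiness check, assuming the S3 API port is published on `localhost:9000` as in the command above. `is_reachable` is a hypothetical helper and not part of this crate; a successful TCP connection only shows that the port is open, not that MinIO is fully initialized.

```rust
use std::net::{TcpStream, ToSocketAddrs};
use std::time::Duration;

/// Returns `true` if a TCP connection to any address that `addr` resolves to
/// succeeds within `timeout`. `addr` is expected in `host:port` form,
/// for example `localhost:9000`.
pub fn is_reachable(addr: &str, timeout: Duration) -> bool {
    match addr.to_socket_addrs() {
        // Try each resolved address until one accepts the connection.
        Ok(mut addrs) => addrs.any(|a| TcpStream::connect_timeout(&a, timeout).is_ok()),
        // Malformed or unresolvable addresses count as unreachable.
        Err(_) => false,
    }
}
```

A test setup routine could poll `is_reachable("localhost:9000", Duration::from_secs(1))` in a short loop before issuing the first request.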
Once the container is running, run the tests as usual:
```bash
cargo test
```