# Amadeus


Harmonious distributed data processing & analysis in Rust




📖 Docs | 🌐 Home | 💬 Chat

## Amadeus provides:

- **Distributed streams:** like [Rayon](https://github.com/rayon-rs/rayon)'s parallel iterators, but distributed across a cluster.
- **Data connectors:** to work with CSV, JSON, Parquet, Postgres, S3 and more.
- **ETL and Data Science tooling:** focused on streaming processing & analysis.

Amadeus is a batteries-included, low-level reusable building block for the [Rust](https://www.rust-lang.org/) Distributed Computing and Big Data ecosystems.

## Principles

- **Fearless:** no data races, no unsafe, and lossless data canonicalization.
- **Make distributed computing trivial:** running distributed should be as easy and performant as running locally.
- **Data is gradually typed:** for maximum performance when the schema is known, and flexibility when it's not.
- **Simplicity:** keep interfaces and implementations as simple and reliable as possible.
- **Reliability:** minimize unhandled errors (including OOM), and only surface errors that couldn't be handled internally.
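
To make the "gradually typed" principle concrete, here is a minimal standard-library sketch of the idea. The `Value` enum, the `to_log_line` function and the sample data are hypothetical stand-ins invented for illustration; they are not Amadeus's own types or API. The same row can be inspected dynamically, or validated once and then handled as a plain struct.

```rust
use std::collections::HashMap;
use std::net::IpAddr;

/// Hypothetical dynamically typed value; a stand-in, not Amadeus's own type.
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Null,
    String(String),
    Group(HashMap<String, Value>),
}

/// Statically typed view of one row, usable when the schema is known up front.
#[derive(Clone, Debug, PartialEq)]
struct LogLine {
    uri: Option<String>,
    requestip: Option<IpAddr>,
}

/// Read an optional string field; `Err` if the field has an unexpected type.
fn opt_string(group: &HashMap<String, Value>, key: &str) -> Result<Option<String>, String> {
    match group.get(key) {
        None | Some(Value::Null) => Ok(None),
        Some(Value::String(s)) => Ok(Some(s.clone())),
        Some(other) => Err(format!("field `{key}` has unexpected type: {other:?}")),
    }
}

/// Convert a dynamic row into the typed struct, checking the schema once.
fn to_log_line(value: &Value) -> Result<LogLine, String> {
    let group = match value {
        Value::Group(g) => g,
        other => return Err(format!("expected a group, got {other:?}")),
    };
    let uri = opt_string(group, "uri")?;
    let requestip = opt_string(group, "requestip")?
        .map(|s| s.parse::<IpAddr>().map_err(|e| e.to_string()))
        .transpose()?;
    Ok(LogLine { uri, requestip })
}

/// Made-up sample row for the demonstration.
fn sample_row() -> Value {
    Value::Group(HashMap::from([
        ("uri".to_string(), Value::String("/index.html".to_string())),
        ("requestip".to_string(), Value::String("192.0.2.1".to_string())),
    ]))
}

fn main() {
    let row = sample_row();
    // Dynamic access: flexible, but every lookup is checked at run time.
    if let Value::Group(group) = &row {
        println!("dynamic uri: {:?}", group.get("uri"));
    }
    // Typed access: the schema is validated once, then fields are plain Rust.
    let typed = to_log_line(&row).expect("row matches the schema");
    println!("typed row: {typed:?}");
}
```

The typed path pays the validation cost once per row and then works with concrete fields, while the dynamic path keeps working when the schema is not known ahead of time.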

## Why Amadeus?

### Clean & Scalable applications

By design, Amadeus encourages you to write clean and reusable code that works, regardless of data scale, locally or distributed across a cluster. Write once, run at any data scale.

### Community

We aim to create a community that is welcoming and helpful to anyone who is interested! Come join us on [our Zulip chat](https://constellation.zulipchat.com/#narrow/stream/213231-amadeus) to:

* get Amadeus working for your use case;
* discuss direction for the project;
* find good issues to get started with.

### Compatibility out of the box

Amadeus has deep, pluggable integration with various file formats, databases and interfaces:

| Data format | [`Source`](https://docs.rs/amadeus/0.3/amadeus/trait.Source.html) | [`Destination`](https://docs.rs/amadeus/0.3/amadeus/trait.Destination.html) |
|---|---|---|
| CSV | ✔ | ✔ |
| JSON | ✔ | ✔ |
| XML | [👐](https://github.com/constellation-rs/amadeus/issues/15) | |
| Parquet | ✔ | [🔨](https://github.com/constellation-rs/amadeus) |
| Avro | [🔨](https://github.com/constellation-rs/amadeus) | |
| PostgreSQL | ✔ | [🔨](https://github.com/constellation-rs/amadeus) |
| HDF5 | [👐](https://github.com/constellation-rs/amadeus) | |
| Redshift | [👐](https://github.com/constellation-rs/amadeus) | |
| [CloudFront Logs](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html) | ✔ | – |
| [Common Crawl](http://commoncrawl.org/the-data/get-started/) | ✔ | – |
| S3 | ✔ | [🔨](https://github.com/constellation-rs/amadeus) |
| HDFS | [👐](https://github.com/constellation-rs/amadeus) | [👐](https://github.com/constellation-rs/amadeus) |

✔ = Working

🔨 = Work in Progress

👐 = Requested: check out the issue for how to help!

### Performance

Amadeus is routinely benchmarked and provisional results are very promising:

* A 1.5x to 17x speedup reading Parquet data compared to the official Apache Arrow [`parquet`](https://crates.io/crates/parquet) crate with [these benchmarks](https://github.com/constellation-rs/amadeus/blob/3e96dbdfb77e8f874b6479c36ab4f344ff4781e4/amadeus-parquet/src/internal/file/reader.rs#L1100-L1184).

### Runs Everywhere

Amadeus is a library that can be used on its own as a parallel threadpool, or with [**Constellation**](https://github.com/constellation-rs/constellation) as a distributed cluster.

[**Constellation**](https://github.com/constellation-rs/constellation) is a framework for process distribution and communication, and has backends for a bare cluster (Linux or macOS), a managed Kubernetes cluster, and more in the pipeline.
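
For intuition about the single-machine "parallel threadpool" mode, the following is a minimal sketch of a partitioned map/reduce using only the standard library. It is not Amadeus's implementation or API: `parallel_count` is a hypothetical helper, and it spawns fresh scoped threads rather than reusing a pool.

```rust
use std::thread;

/// Count the items matching `pred`, splitting the slice across `workers` threads.
fn parallel_count<T, F>(items: &[T], workers: usize, pred: F) -> usize
where
    T: Sync,
    F: Fn(&T) -> bool + Sync,
{
    let workers = workers.max(1);
    // Ceiling division so every item lands in exactly one chunk.
    let chunk_len = ((items.len() + workers - 1) / workers).max(1);
    let pred = &pred;
    thread::scope(|scope| {
        // "Map" step: each worker counts the matches in its own partition.
        let handles: Vec<_> = items
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || chunk.iter().filter(|item| pred(*item)).count()))
            .collect();
        // "Reduce" step: combine the partial counts into one result.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker panicked"))
            .sum::<usize>()
    })
}

fn main() {
    let uris = ["/index.html", "/about.html", "/index.html", "/logo.png"];
    let hits = parallel_count(&uris, 2, |uri| *uri == "/index.html");
    println!("hits: {hits}");
}
```

The shape is the same whether partitions go to threads or to processes on other machines: split the data, run the same closure on each partition, then merge the partial results.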

## Examples

This reads the Parquet partitions from the S3 bucket and prints the 100 URLs seen with the most distinct request IPs.

```rust
use amadeus::prelude::*;
use amadeus::data::{IpAddr, Url};
use std::error::Error;

#[derive(Data, Clone, PartialEq, Debug)]
struct LogLine {
    uri: Option<String>,
    requestip: Option<IpAddr>,
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let pool = ThreadPool::new(None, None)?;

    let rows = Parquet::new(ParquetDirectory::new(S3Directory::new_with(
        AwsRegion::UsEast1,
        "us-east-1.data-analytics",
        "cflogworkshop/optimized/cf-accesslogs/",
        AwsCredentials::Anonymous,
    )))
    .await?;

    let top_pages = rows
        .par_stream()
        .map(|row: Result<LogLine, _>| {
            let row = row.unwrap();
            (row.uri, row.requestip)
        })
        .most_distinct(&pool, 100, 0.99, 0.002, 0.0808)
        .await;

    println!("{:#?}", top_pages);
    Ok(())
}
```

Because the rows are typed, this is faster than working with dynamically typed data, and ranking URLs by the number of distinct IPs logged goes an analytics step further than a plain frequency count.
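
To make "top URLs by distinct IPs" concrete, here is an exact, single-threaded sketch using only the standard library. `top_by_distinct` and the sample rows are hypothetical and only illustrate what is being ranked; Amadeus's `most_distinct` takes additional numeric parameters that have no counterpart in this exact version.

```rust
use std::collections::{HashMap, HashSet};
use std::net::IpAddr;

/// Exact top-`n` keys, ranked by how many distinct values were seen with each.
fn top_by_distinct(rows: &[(String, IpAddr)], n: usize) -> Vec<(String, usize)> {
    let mut seen: HashMap<&str, HashSet<IpAddr>> = HashMap::new();
    for (uri, ip) in rows {
        // A set per URI means repeat visits from one IP count only once.
        seen.entry(uri.as_str()).or_default().insert(*ip);
    }
    let mut ranked: Vec<(String, usize)> = seen
        .into_iter()
        .map(|(uri, ips)| (uri.to_string(), ips.len()))
        .collect();
    // Distinct count descending, then URI ascending so ties are deterministic.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked.truncate(n);
    ranked
}

/// Made-up (uri, request IP) pairs for the demonstration.
fn sample_rows() -> Vec<(String, IpAddr)> {
    [
        ("/index.html", "192.0.2.1"),
        ("/index.html", "192.0.2.1"),
        ("/index.html", "192.0.2.2"),
        ("/about.html", "192.0.2.1"),
        ("/about.html", "192.0.2.2"),
        ("/about.html", "192.0.2.3"),
    ]
    .iter()
    .map(|(uri, ip)| (uri.to_string(), ip.parse().expect("valid IP literal")))
    .collect()
}

fn main() {
    for (uri, distinct_ips) in top_by_distinct(&sample_rows(), 100) {
        println!("{distinct_ips:>3} distinct IPs  {uri}");
    }
}
```

Both sample URLs receive three requests, but `/about.html` ranks first because its requests come from three different addresses.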

The following example reads the same data as dynamically typed values and prints the 100 most frequently occurring URLs.

```rust
use amadeus::prelude::*;
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let pool = ThreadPool::new(None, None)?;

    let rows = Parquet::new(ParquetDirectory::new(S3Directory::new_with(
        AwsRegion::UsEast1,
        "us-east-1.data-analytics",
        "cflogworkshop/optimized/cf-accesslogs/",
        AwsCredentials::Anonymous,
    )))
    .await?;

    let top_pages = rows
        .par_stream()
        .map(|row: Result<Value, _>| {
            let row = row.ok()?.into_group().ok()?;
            row.get("uri")?.clone().into_url().ok()
        })
        .filter(|row| row.is_some())
        .map(Option::unwrap)
        .most_frequent(&pool, 100, 0.99, 0.002)
        .await;

    println!("{:#?}", top_pages);
    Ok(())
}
```
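
The closure above leans on `?` inside a function returning `Option` to skip rows that fail to parse or lack a `uri` field. The sketch below shows the same idiom with plain standard-library iterators, followed by an exact frequency count. The `Row` alias, `extract_uri`, `top_n` and the sample data are hypothetical names for illustration and are not part of Amadeus.

```rust
use std::collections::HashMap;

/// Hypothetical dynamic row: field name to raw text (not an Amadeus type).
type Row = HashMap<String, String>;

/// Pull the `uri` field out of a row that may have failed to parse.
/// Each `?` returns `None` early, so bad rows are skipped rather than fatal.
fn extract_uri(row: Result<Row, String>) -> Option<String> {
    let row = row.ok()?;
    Some(row.get("uri")?.clone())
}

/// Exact top-`n` most frequent items, with ties broken alphabetically.
fn top_n(items: impl Iterator<Item = String>, n: usize) -> Vec<(String, usize)> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for item in items {
        *counts.entry(item).or_insert(0) += 1;
    }
    let mut ranked: Vec<(String, usize)> = counts.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked.truncate(n);
    ranked
}

/// Build a successfully parsed row holding only a `uri` field.
fn ok_row(uri: &str) -> Result<Row, String> {
    Ok(HashMap::from([("uri".to_string(), uri.to_string())]))
}

/// Made-up input: good rows, one parse failure, one row without `uri`.
fn sample_rows() -> Vec<Result<Row, String>> {
    vec![
        ok_row("/index.html"),
        Err("corrupt row".to_string()),
        ok_row("/about.html"),
        ok_row("/index.html"),
        Ok(HashMap::new()),
    ]
}

fn main() {
    // `filter_map` drops the `None`s and unwraps the `Some`s in one step.
    let uris = sample_rows().into_iter().filter_map(extract_uri);
    for (uri, count) in top_n(uris, 100) {
        println!("{count:>3}  {uri}");
    }
}
```

On standard iterators, `filter_map` expresses in one call what the example above spells out as `.map(..)`, `.filter(|row| row.is_some())` and `.map(Option::unwrap)`.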

What about loading this data into Postgres? Once the Postgres destination is implemented, this will create and populate a table called "accesslogs".

```rust,ignore
use amadeus::prelude::*;
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let pool = ThreadPool::new(None, None)?;

    let rows = Parquet::new(ParquetDirectory::new(S3Directory::new_with(
        AwsRegion::UsEast1,
        "us-east-1.data-analytics",
        "cflogworkshop/optimized/cf-accesslogs/",
        AwsCredentials::Anonymous,
    )))
    .await?;

    // Note: this isn't yet implemented!
    rows.par_stream()
        .pipe(Postgres::new("127.0.0.1", PostgresTable::new("accesslogs")));

    Ok(())
}
```

## Running Distributed

Operations can run on a parallel threadpool or on a distributed process pool.

Amadeus uses the [**Constellation**](https://github.com/constellation-rs/constellation) framework for process distribution and communication. Constellation has backends for a bare cluster (Linux or macOS), and a managed Kubernetes cluster.

```rust
use amadeus::dist::prelude::*;
use amadeus::data::{IpAddr, Url};
use constellation::*;
use std::error::Error;

#[derive(Data, Clone, PartialEq, Debug)]
struct LogLine {
    uri: Option<String>,
    requestip: Option<IpAddr>,
}

fn main() -> Result<(), Box<dyn Error>> {
    init(Resources::default());

    // #[tokio::main] isn't supported yet, so the Runtime must be set up explicitly
    tokio::runtime::Builder::new()
        .threaded_scheduler()
        .enable_all()
        .build()
        .unwrap()
        .block_on(async {
            let pool = ProcessPool::new(None, None, None, Resources::default())?;

            let rows = Parquet::new(ParquetDirectory::new(S3Directory::new_with(
                AwsRegion::UsEast1,
                "us-east-1.data-analytics",
                "cflogworkshop/optimized/cf-accesslogs/",
                AwsCredentials::Anonymous,
            )))
            .await?;

            let top_pages = rows
                .dist_stream()
                .map(FnMut!(|row: Result<LogLine, _>| {
                    let row = row.unwrap();
                    (row.uri, row.requestip)
                }))
                .most_distinct(&pool, 100, 0.99, 0.002, 0.0808)
                .await;

            println!("{:#?}", top_pages);
            Ok(())
        })
}
```

## Getting started

todo

### Examples

Take a look at the various [examples](examples).

## Contribution

Amadeus is an open source project! If you'd like to contribute, check out the list of [“good first issues”](https://github.com/constellation-rs/amadeus/contribute). These are all (or should be) issues that are suitable for getting started, and they generally include a detailed set of instructions for what to do. Please ask questions and ping us on [our Zulip chat](https://constellation.zulipchat.com/#narrow/stream/213231-amadeus) if anything is unclear!

## License
Licensed under Apache License, Version 2.0, ([LICENSE.txt](LICENSE.txt) or
http://www.apache.org/licenses/LICENSE-2.0).

Unless you explicitly state otherwise, any contribution intentionally submitted
for inclusion in the work by you, as defined in the Apache-2.0 license, shall be
licensed as above, without any additional terms or conditions.