{"id":13648646,"url":"https://github.com/constellation-rs/amadeus","last_synced_at":"2025-04-08T09:10:57.626Z","repository":{"id":43457928,"uuid":"153476162","full_name":"constellation-rs/amadeus","owner":"constellation-rs","description":"Harmonious distributed data analysis in Rust.","archived":false,"fork":false,"pushed_at":"2021-07-23T05:42:46.000Z","size":2575,"stargazers_count":474,"open_issues_count":40,"forks_count":26,"subscribers_count":19,"default_branch":"master","last_synced_at":"2024-12-06T22:42:51.142Z","etag":null,"topics":["data-analysis","data-processing","distributed-computing","parallel-computing","rust","stream-processing"],"latest_commit_sha":null,"homepage":"https://constellation.rs/amadeus","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/constellation-rs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-10-17T15:00:37.000Z","updated_at":"2024-11-23T05:32:30.000Z","dependencies_parsed_at":"2022-07-30T07:37:59.148Z","dependency_job_id":null,"html_url":"https://github.com/constellation-rs/amadeus","commit_stats":null,"previous_names":[],"tags_count":24,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/constellation-rs%2Famadeus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/constellation-rs%2Famadeus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/constellation-rs%2Famadeus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/constellation-rs%2Famadeus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/owners/constellation-rs","download_url":"https://codeload.github.com/constellation-rs/amadeus/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247809964,"owners_count":20999816,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","data-processing","distributed-computing","parallel-computing","rust","stream-processing"],"created_at":"2024-08-02T01:04:25.490Z","updated_at":"2025-04-08T09:10:57.549Z","avatar_url":"https://github.com/constellation-rs.png","language":"Rust","funding_links":[],"categories":["Rust"],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n    \u003cimg alt=\"Amadeus\" src=\"https://raw.githubusercontent.com/constellation-rs/amadeus/master/logo.svg?sanitize=true\" width=\"450\" /\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    Harmonious distributed data processing \u0026 analysis in Rust\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://crates.io/crates/amadeus\"\u003e\u003cimg src=\"https://img.shields.io/crates/v/amadeus.svg?maxAge=86400\" alt=\"Crates.io\" /\u003e\u003c/a\u003e\n    \u003ca href=\"LICENSE.txt\"\u003e\u003cimg src=\"https://img.shields.io/crates/l/amadeus.svg?maxAge=2592000\" alt=\"Apache-2.0 licensed\" /\u003e\u003c/a\u003e\n    \u003ca href=\"https://dev.azure.com/alecmocatta/amadeus/_build?definitionId=26\"\u003e\u003cimg src=\"https://dev.azure.com/alecmocatta/amadeus/_apis/build/status/tests?branchName=master\" alt=\"Build Status\" /\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp 
align=\"center\"\u003e\n    \u003ca href=\"https://docs.rs/amadeus\"\u003e📖 Docs\u003c/a\u003e | \u003ca href=\"https://constellation.rs/amadeus\"\u003e🌐 Home\u003c/a\u003e | \u003ca href=\"https://constellation.zulipchat.com/#narrow/stream/213231-amadeus\"\u003e💬 Chat\u003c/a\u003e\n\u003c/p\u003e\n\n## Amadeus provides:\n\n- **Distributed streams:** like [Rayon](https://github.com/rayon-rs/rayon)'s parallel iterators, but distributed across a cluster.\n- **Data connectors:** to work with CSV, JSON, Parquet, Postgres, S3 and more.\n- **ETL and Data Science tooling:** focused on streaming processing \u0026 analysis.\n\nAmadeus is a batteries-included, low-level reusable building block for the [Rust](https://www.rust-lang.org/) Distributed Computing and Big Data ecosystems.\n\n## Principles\n\n- **Fearless:** no data races, no unsafe, and lossless data canonicalization.\n- **Make distributed computing trivial:** running distributed should be as easy and performant as running locally.\n- **Data is gradually typed:** for maximum performance when the schema is known, and flexibility when it's not.\n- **Simplicity:** keep interfaces and implementations as simple and reliable as possible.\n- **Reliability:** minimize unhandled errors (including OOM), and only surface errors that couldn't be handled internally.\n\n## Why Amadeus?\n\n### Clean \u0026 Scalable applications\n\nBy design, Amadeus encourages you to write clean and reusable code that works, regardless of data scale, locally or distributed across a cluster. Write once, run at any data scale.\n\n### Community\n\nWe aim to create a community that is welcoming and helpful to anyone that is interested! 
Come join us on [our Zulip chat](https://constellation.zulipchat.com/#narrow/stream/213231-amadeus) to:\n\n * get Amadeus working for your use case;\n * discuss direction for the project;\n * find good issues to get started with.\n\n### Compatibility out of the box\n\nAmadeus has deep, pluggable integration with various file formats, databases and interfaces:\n\n| Data format | [`Source`](https://docs.rs/amadeus/0.3/amadeus/trait.Source.html) | [`Destination`](https://docs.rs/amadeus/0.3/amadeus/trait.Destination.html) |\n|---|---|---|\n| CSV | ✔ | ✔ |\n| JSON | ✔ | ✔ |\n| XML | [👐](https://github.com/constellation-rs/amadeus/issues/15) |  |\n| Parquet | ✔ | [🔨](https://github.com/constellation-rs/amadeus) |\n| Avro | [🔨](https://github.com/constellation-rs/amadeus) |  |\n| PostgreSQL | ✔ | [🔨](https://github.com/constellation-rs/amadeus) |\n| HDF5 | [👐](https://github.com/constellation-rs/amadeus) |  |\n| Redshift | [👐](https://github.com/constellation-rs/amadeus) |  |\n| [CloudFront Logs](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html) | ✔ | – |\n| [Common Crawl](http://commoncrawl.org/the-data/get-started/) | ✔ | – |\n| S3 | ✔ | [🔨](https://github.com/constellation-rs/amadeus) |\n| HDFS | [👐](https://github.com/constellation-rs/amadeus) | [👐](https://github.com/constellation-rs/amadeus) |\n\n✔ = Working\u003cbr/\u003e\n🔨 = Work in Progress\u003cbr/\u003e\n👐 = Requested: check out the issue for how to help!\n\n### Performance\n\nAmadeus is routinely benchmarked and provisional results are very promising:\n\n * A 1.5x to 17x speedup reading Parquet data compared to the official Apache Arrow [`parquet`](https://crates.io/crates/parquet) crate with [these benchmarks](https://github.com/constellation-rs/amadeus/blob/3e96dbdfb77e8f874b6479c36ab4f344ff4781e4/amadeus-parquet/src/internal/file/reader.rs#L1100-L1184).\n\n### Runs Everywhere\n\nAmadeus is a library that can be used on its own as a parallel threadpool, or with 
[**Constellation**](https://github.com/constellation-rs/constellation) as a distributed cluster.\n\n[**Constellation**](https://github.com/constellation-rs/constellation) is a framework for process distribution and communication, and has backends for a bare cluster (Linux or macOS), a managed Kubernetes cluster, and more in the pipeline.\n\n## Examples\n\nThis will read the Parquet partitions from the S3 bucket, and print the 100 most frequently occurring URLs.\n\n```rust\nuse amadeus::prelude::*;\nuse amadeus::data::{IpAddr, Url};\nuse std::error::Error;\n\n#[derive(Data, Clone, PartialEq, Debug)]\nstruct LogLine {\n    uri: Option\u003cString\u003e,\n    requestip: Option\u003cIpAddr\u003e,\n}\n\n#[tokio::main]\nasync fn main() -\u003e Result\u003c(), Box\u003cdyn Error\u003e\u003e {\n    let pool = ThreadPool::new(None, None)?;\n\n    let rows = Parquet::new(ParquetDirectory::new(S3Directory::new_with(\n        AwsRegion::UsEast1,\n        \"us-east-1.data-analytics\",\n        \"cflogworkshop/optimized/cf-accesslogs/\",\n        AwsCredentials::Anonymous,\n    )))\n    .await?;\n\n    let top_pages = rows\n        .par_stream()\n        .map(|row: Result\u003cLogLine, _\u003e| {\n            let row = row.unwrap();\n            (row.uri, row.requestip)\n        })\n        .most_distinct(\u0026pool, 100, 0.99, 0.002, 0.0808)\n        .await;\n\n    println!(\"{:#?}\", top_pages);\n    Ok(())\n}\n```\n\nThis example is statically typed, so it is faster, and it goes an analytics step further: it prints the top 100 URLs by number of distinct IPs logged.\n\n\u003cdetails\u003e\n\u003csummary\u003eSee the same example but with data dynamically typed.\u003c/summary\u003e\n\n```rust\nuse amadeus::prelude::*;\nuse std::error::Error;\n\n#[tokio::main]\nasync fn main() -\u003e Result\u003c(), Box\u003cdyn Error\u003e\u003e {\n    let pool = ThreadPool::new(None, None)?;\n\n    let rows = Parquet::new(ParquetDirectory::new(S3Directory::new_with(\n        AwsRegion::UsEast1,\n        
\"us-east-1.data-analytics\",\n        \"cflogworkshop/optimized/cf-accesslogs/\",\n        AwsCredentials::Anonymous,\n    )))\n    .await?;\n\n    let top_pages = rows\n        .par_stream()\n        .map(|row: Result\u003cValue, _\u003e| {\n            let row = row.ok()?.into_group().ok()?;\n            row.get(\"uri\")?.clone().into_url().ok()\n        })\n        .filter(|row| row.is_some())\n        .map(Option::unwrap)\n        .most_frequent(\u0026pool, 100, 0.99, 0.002)\n        .await;\n\n    println!(\"{:#?}\", top_pages);\n    Ok(())\n}\n```\n\n\u003c/details\u003e\n\nWhat about loading this data into Postgres? This will create and populate a table called \"accesslogs\".\n\n```rust,ignore\nuse amadeus::prelude::*;\nuse std::error::Error;\n\n#[tokio::main]\nasync fn main() -\u003e Result\u003c(), Box\u003cdyn Error\u003e\u003e {\n    let pool = ThreadPool::new(None, None)?;\n\n    let rows = Parquet::new(ParquetDirectory::new(S3Directory::new_with(\n        AwsRegion::UsEast1,\n        \"us-east-1.data-analytics\",\n        \"cflogworkshop/optimized/cf-accesslogs/\",\n        AwsCredentials::Anonymous,\n    )))\n    .await?;\n\n    // Note: this isn't yet implemented!\n    rows.par_stream()\n        .pipe(Postgres::new(\"127.0.0.1\", PostgresTable::new(\"accesslogs\")));\n\n    Ok(())\n}\n```\n\n## Running Distributed\n\nOperations can run on a parallel threadpool or on a distributed process pool.\n\nAmadeus uses the [**Constellation**](https://github.com/constellation-rs/constellation) framework for process distribution and communication. 
Constellation has backends for a bare cluster (Linux or macOS), and a managed Kubernetes cluster.\n\n```rust\nuse amadeus::dist::prelude::*;\nuse amadeus::data::{IpAddr, Url};\nuse constellation::*;\nuse std::error::Error;\n\n#[derive(Data, Clone, PartialEq, Debug)]\nstruct LogLine {\n    uri: Option\u003cString\u003e,\n    requestip: Option\u003cIpAddr\u003e,\n}\n\nfn main() -\u003e Result\u003c(), Box\u003cdyn Error\u003e\u003e {\n    init(Resources::default());\n\n    // #[tokio::main] isn't supported yet so unfortunately setting up the Runtime must be done explicitly\n    tokio::runtime::Builder::new()\n        .threaded_scheduler()\n        .enable_all()\n        .build()\n        .unwrap()\n        .block_on(async {\n            let pool = ProcessPool::new(None, None, None, Resources::default())?;\n\n            let rows = Parquet::new(ParquetDirectory::new(S3Directory::new_with(\n                AwsRegion::UsEast1,\n                \"us-east-1.data-analytics\",\n                \"cflogworkshop/optimized/cf-accesslogs/\",\n                AwsCredentials::Anonymous,\n            )))\n            .await?;\n\n            let top_pages = rows\n                .dist_stream()\n                .map(FnMut!(|row: Result\u003cLogLine, _\u003e| {\n                    let row = row.unwrap();\n                    (row.uri, row.requestip)\n                }))\n                .most_distinct(\u0026pool, 100, 0.99, 0.002, 0.0808)\n                .await;\n\n            println!(\"{:#?}\", top_pages);\n            Ok(())\n        })\n}\n```\n\n## Getting started\n\ntodo\n\n### Examples\n\nTake a look at the various [examples](examples).\n\n## Contribution\n\nAmadeus is an open source project! If you'd like to contribute, check out the list of [“good first issues”](https://github.com/constellation-rs/amadeus/contribute). These are all (or should be) issues that are suitable for getting started, and they generally include a detailed set of instructions for what to do. 
Please ask questions and ping us on [our Zulip chat](https://constellation.zulipchat.com/#narrow/stream/213231-amadeus) if anything is unclear!\n\n## License\nLicensed under Apache License, Version 2.0, ([LICENSE.txt](LICENSE.txt) or\nhttp://www.apache.org/licenses/LICENSE-2.0).\n\nUnless you explicitly state otherwise, any contribution intentionally submitted\nfor inclusion in the work by you, as defined in the Apache-2.0 license, shall be\nlicensed as above, without any additional terms or conditions.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fconstellation-rs%2Famadeus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fconstellation-rs%2Famadeus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fconstellation-rs%2Famadeus/lists"}