{"id":8346907,"url":"https://github.com/apache/datafusion-ballista","last_synced_at":"2025-12-12T13:02:59.885Z","repository":{"id":37007937,"uuid":"494107715","full_name":"apache/datafusion-ballista","owner":"apache","description":"Apache DataFusion Ballista Distributed Query Engine","archived":false,"fork":false,"pushed_at":"2025-05-05T16:56:33.000Z","size":21480,"stargazers_count":1750,"open_issues_count":127,"forks_count":217,"subscribers_count":50,"default_branch":"main","last_synced_at":"2025-05-14T01:09:12.833Z","etag":null,"topics":["arrow","big-data","dataframe","distributed","olap","python","query-engine","rust","sql"],"latest_commit_sha":null,"homepage":"https://datafusion.apache.org/ballista","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/apache.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-05-19T14:32:27.000Z","updated_at":"2025-05-13T22:08:47.000Z","dependencies_parsed_at":"2023-11-30T19:24:27.696Z","dependency_job_id":"c16cd2b8-2ef1-4da8-8def-60175f6f1af4","html_url":"https://github.com/apache/datafusion-ballista","commit_stats":{"total_commits":4274,"total_committers":364,"mean_commits":"11.741758241758241","dds":0.9087505849321479,"last_synced_commit":"e5cce8a2d7fc1ffa6e234ea7b1c58d284f4eeb13"},"previous_names":["apache/datafusion-ballista","apache/arrow-ballista"],"tags_count":25,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fdatafusion-b
allista","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fdatafusion-ballista/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fdatafusion-ballista/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fdatafusion-ballista/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/apache","download_url":"https://codeload.github.com/apache/datafusion-ballista/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254059520,"owners_count":22007771,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arrow","big-data","dataframe","distributed","olap","python","query-engine","rust","sql"],"created_at":"2024-04-21T20:01:47.011Z","updated_at":"2025-12-12T13:02:59.878Z","avatar_url":"https://github.com/apache.png","language":"Rust","readme":"\u003c!---\n  Licensed to the Apache Software Foundation (ASF) under one\n  or more contributor license agreements.  See the NOTICE file\n  distributed with this work for additional information\n  regarding copyright ownership.  The ASF licenses this file\n  to you under the Apache License, Version 2.0 (the\n  \"License\"); you may not use this file except in compliance\n  with the License.  
You may obtain a copy of the License at\n\n    http://www.apache.org/licenses/LICENSE-2.0\n\n  Unless required by applicable law or agreed to in writing,\n  software distributed under the License is distributed on an\n  \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n  KIND, either express or implied.  See the License for the\n  specific language governing permissions and limitations\n  under the License.\n--\u003e\n\n# Ballista: Making DataFusion Applications Distributed\n\n[![Apache licensed][license-badge]][license-url]\n\n[license-badge]: https://img.shields.io/badge/license-Apache%20v2-blue.svg\n[license-url]: https://github.com/apache/datafusion-ballista/blob/main/LICENSE.txt\n\n\u003cimg src=\"docs/source/_static/images/ballista-logo.png\" width=\"512\" alt=\"logo\"/\u003e\n\nBallista is a distributed query execution engine that extends [Apache DataFusion](https://github.com/apache/datafusion) by executing workloads in parallel across multiple nodes in a distributed environment.\n\nAn existing DataFusion application:\n\n```rust\nuse datafusion::prelude::*;\n\n#[tokio::main]\nasync fn main() -\u003e datafusion::error::Result\u003c()\u003e {\n  let ctx = SessionContext::new();\n\n  // register the table\n  ctx.register_csv(\"example\", \"tests/data/example.csv\", CsvReadOptions::new())\n      .await?;\n\n  // create a plan to run a SQL query\n  let df = ctx\n      .sql(\"SELECT a, MIN(b) FROM example WHERE a \u003c= b GROUP BY a LIMIT 100\")\n      .await?;\n\n  // execute and print results\n  df.show().await?;\n  Ok(())\n}\n```\n\ncan be distributed with only a few lines of code changed:\n\n\u003e [!IMPORTANT]  \n\u003e There is a gap between DataFusion and Ballista, which may cause incompatibilities. 
The community is actively working to close the gap.\n\n```rust\nuse ballista::prelude::*;\nuse datafusion::prelude::*;\n\n#[tokio::main]\nasync fn main() -\u003e datafusion::error::Result\u003c()\u003e {\n    // create SessionContext with ballista support\n    // standalone context will start all required\n    // ballista infrastructure in the background as well\n    let ctx = SessionContext::standalone().await?;\n\n    // everything else remains the same\n\n    // register the table\n    ctx.register_csv(\"example\", \"tests/data/example.csv\", CsvReadOptions::new())\n        .await?;\n\n    // create a plan to run a SQL query\n    let df = ctx\n        .sql(\"SELECT a, MIN(b) FROM example WHERE a \u003c= b GROUP BY a LIMIT 100\")\n        .await?;\n\n    // execute and print results\n    df.show().await?;\n    Ok(())\n}\n```\n\nFor documentation and more examples, please refer to the [Ballista User Guide][user-guide].\n\n## Architecture\n\nA Ballista cluster consists of one or more scheduler processes and one or more executor processes. These processes\ncan be run as native binaries and are also available as Docker images, which can be easily deployed with\n[Docker Compose](https://datafusion.apache.org/ballista/user-guide/deployment/docker-compose.html) or\n[Kubernetes](https://datafusion.apache.org/ballista/user-guide/deployment/kubernetes.html).\n\nThe following diagram shows the interaction between clients and the scheduler for submitting jobs, and the interaction\nbetween the executor(s) and the scheduler for fetching tasks and reporting task status.\n\n![Ballista Cluster Diagram](docs/source/contributors-guide/ballista_architecture.excalidraw.svg)\n\nSee the [architecture guide](docs/source/contributors-guide/architecture.md) for more details.\n\n## Performance\n\nWe run some simple benchmarks comparing Ballista with Apache Spark to track progress with performance optimizations.\nThese benchmarks are derived from TPC-H and are not official TPC-H benchmarks. 
These results are from running individual\nqueries at scale factor 100 (100 GB) on a single node with a single executor and 8 concurrent tasks.\n\n### Overall Speedup\n\nThe overall speedup is 2.9x.\n\n![benchmarks](docs/source/_static/images/tpch_allqueries.png)\n\n### Per Query Comparison\n\n![benchmarks](docs/source/_static/images/tpch_queries_compare.png)\n\n### Relative Speedup\n\n![benchmarks](docs/source/_static/images/tpch_queries_speedup_rel.png)\n\n### Absolute Speedup\n\n![benchmarks](docs/source/_static/images/tpch_queries_speedup_abs.png)\n\n## Getting Started\n\nThe easiest way to get started is to run one of the standalone or distributed [examples](./examples/README.md). After\nthat, refer to the [Getting Started Guide](ballista/client/README.md).\n\n## Project Status\n\nBallista supports a wide range of SQL, including CTEs, joins, and subqueries, and can execute complex queries at scale.\nHowever, there is still a gap between DataFusion and Ballista, which we aim to close in the near future.\n\nRefer to the [DataFusion SQL Reference](https://datafusion.apache.org/user-guide/sql/index.html) for more\ninformation on supported SQL.\n\n## Contribution Guide\n\nPlease see the [Contribution Guide](CONTRIBUTING.md) for information about contributing to Ballista.\n\n[user-guide]: https://datafusion.apache.org/ballista/\n","funding_links":[],"categories":["Rust","其他_大数据","\u003ca name=\"Rust\"\u003e\u003c/a\u003eRust"],"sub_categories":["资源传输下载"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapache%2Fdatafusion-ballista","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fapache%2Fdatafusion-ballista","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapache%2Fdatafusion-ballista/lists"}