{"id":25049211,"url":"https://github.com/milenkovicm/ballista_python","last_synced_at":"2026-04-29T18:32:31.910Z","repository":{"id":275914968,"uuid":"922229851","full_name":"milenkovicm/ballista_python","owner":"milenkovicm","description":"Ballista cluster pyarrow udf support ","archived":false,"fork":false,"pushed_at":"2026-04-22T20:53:03.000Z","size":341,"stargazers_count":2,"open_issues_count":4,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2026-04-22T22:30:17.451Z","etag":null,"topics":["arrow","ballista","datafusion","distributed","pyarrow","pyo3","python","rust","rust-lang","udf"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/milenkovicm.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-01-25T16:58:08.000Z","updated_at":"2026-04-22T20:53:03.000Z","dependencies_parsed_at":"2025-02-05T09:26:51.939Z","dependency_job_id":"032f1426-c7a2-4cf8-8e27-dea7770ff4ab","html_url":"https://github.com/milenkovicm/ballista_python","commit_stats":null,"previous_names":["milenkovicm/ballista_python"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/milenkovicm/ballista_python","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/milenkovicm%2Fballista_python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/milenkovicm%2Fballista_python/tags","releases_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/milenkovicm%2Fballista_python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/milenkovicm%2Fballista_python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/milenkovicm","download_url":"https://codeload.github.com/milenkovicm/ballista_python/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/milenkovicm%2Fballista_python/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32439179,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-29T18:12:22.909Z","status":"ssl_error","status_checked_at":"2026-04-29T18:11:33.322Z","response_time":110,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arrow","ballista","datafusion","distributed","pyarrow","pyo3","python","rust","rust-lang","udf"],"created_at":"2025-02-06T08:16:53.644Z","updated_at":"2026-04-29T18:32:31.904Z","avatar_url":"https://github.com/milenkovicm.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Ballista (Datafusion) Python UDF Support\n\nMake [Datafusion Ballista](https://github.com/apache/datafusion-ballista) support [Datafusion Python](http://github.com/apache/datafusion-python) and shipping pyarrow UDFs to remote task 
contexts.\n\n\u003e\n\u003e [!IMPORTANT]\n\u003e\n\u003e This is just a showcase project and it is not meant to be maintained.\n\u003e\n\nThis project tests the validity of [datafusion-python/1003](https://github.com/apache/datafusion-python/pull/1003).\n\n\u003e\n\u003e [!NOTE]\n\u003e\n\u003e This project is part of the \"Extending DataFusion Ballista\" showcase series:\n\u003e\n\u003e - [DataFusion Ballista Python UDF Support](https://github.com/milenkovicm/ballista_python)\n\u003e - [DataFusion Ballista Read Support For Delta Table](https://github.com/milenkovicm/ballista_delta)\n\u003e - [Extending DataFusion Ballista](https://github.com/milenkovicm/ballista_extensions)\n\u003e\n\n![architecture](architecture.excalidraw.svg)\n\n## Environment Setup\n\n```bash\npyenv local 3.12\npython3 -m venv .venv\nsource .venv/bin/activate\npip3 install -r requirements.txt\n```\n\nStart the [scheduler](examples/scheduler.rs) and an [executor](examples/executor.rs).\n\n## Datafusion Python Ballista Integration\n\nA [patched branch](https://github.com/milenkovicm/datafusion-python/tree/poc_ballista_support) of datafusion-python is needed.\n\nThe following simple script executes on a Ballista cluster:\n\n```python\nfrom datafusion import SessionContext, udf, functions as f\nimport pyarrow.compute as pc\nimport pyarrow\n\n# a SessionContext created with a url connects to the ballista cluster\nctx = SessionContext(url = \"df://localhost:50050\")\n\nconversation_rate_multiplier = 0.62137119\n\n# arrow udf definition\ndef to_miles(km_data):\n    return pc.multiply(km_data, conversation_rate_multiplier)\n\n# datafusion udf definition\nto_miles_udf = udf(to_miles, [pyarrow.float64()], pyarrow.float64(), \"stable\")\n\n# `df` is assumed to be a DataFrame already loaded through `ctx` (loading is not shown here)\n# it's incorrect to convert passenger_count to miles\ndf = df.select(to_miles_udf(f.col(\"passenger_count\")), f.col(\"passenger_count\"))\n\n# show data\ndf.show()\n```\n\nNote: if the notebook complains about `cloudpickle`, please `!pip install` it; I did not have time to find out how to 
specify it as a dependency.\n\n## Run Datafusion Python\n\nThe [Rust client](examples/client.rs) can wrap and execute a Python script:\n\n```rust\nlet ctx = SessionContext::remote_with_state(\"df://localhost:50050\", state).await?;\n\nlet code = r#\"\nimport pyarrow.compute as pc\n\nconversation_rate_multiplier = 0.62137119\n\ndef to_miles(km_data):\n    return pc.multiply(km_data, conversation_rate_multiplier)\n\"#;\n\nlet udf = PythonUDF::from_code(\"to_miles\", code).expect(\"udf created\");\nlet udf = ScalarUDF::from(udf);\n\nctx.read_parquet(\"./data/alltypes.parquet\", ParquetReadOptions::default())\n    .await?\n    .select(vec![udf.call(vec![lit(1.0) * col(\"id\")])])?\n    .show()\n    .await?;\n```\n\nThis should produce:\n\n```text\n+------------+------------------------------+\n| double_col | to_miles(?table?.double_col) |\n+------------+------------------------------+\n| 0.0        | 0.0                          |\n| 10.1       | 6.275849019                  |\n| 0.0        | 0.0                          |\n| 10.1       | 6.275849019                  |\n| 0.0        | 0.0                          |\n| 10.1       | 6.275849019                  |\n| 0.0        | 0.0                          |\n| 10.1       | 6.275849019                  |\n+------------+------------------------------+\n```\n\n## Defining SQL Function\n\n```rust\nlet config = SessionConfig::new_with_ballista()\n    .with_ballista_logical_extension_codec(Arc::new(PyLogicalCodec::default()))\n    .with_target_partitions(4);\n\nlet state = SessionStateBuilder::new()\n    .with_config(config)\n    .with_default_features()\n    .build();\n\nlet ctx = SessionContext::remote_with_state(\"df://localhost:50050\", state)\n    .await?\n    .with_function_factory(Arc::new(PythonFunctionFactory::default()));\n\nlet sql = r#\"\nCREATE FUNCTION to_miles(DOUBLE)\nRETURNS DOUBLE\nLANGUAGE PYTHON\nAS '\nimport pyarrow.compute as pc\n\nconversation_rate_multiplier = 0.62137119\n\ndef to_miles(km_data):\n    
return pc.multiply(km_data, conversation_rate_multiplier)\n'\n\"#;\n\nctx.sql(sql).await?.show().await?;\n\nctx.register_parquet(\"t\", \"./data/alltypes.parquet\", ParquetReadOptions::default())\n    .await?;\n\nctx.sql(\"select double_col, to_miles(double_col) from t\")\n    .await?\n    .show()\n    .await?;\n```\n\n## Implementation Internals\n\nThe project creates custom logical (`PyLogicalCodec`) and physical (`PyPhysicalCodec`) codecs, which handle serialization and deserialization of Python functions using the [cloudpickle](https://github.com/cloudpipe/cloudpickle) library.\n\nThe custom codecs are registered when the `SessionContext` is created:\n\n```rust\nlet config = SessionConfig::new_with_ballista()\n    .with_ballista_logical_extension_codec(Arc::new(PyLogicalCodec::default()))\n    .with_target_partitions(4);\n\nlet state = SessionStateBuilder::new()\n    .with_config(config)\n    .with_default_features()\n    .build();\n\nlet ctx = SessionContext::remote_with_state(\"df://localhost:50050\", state).await?;\n```\n\nA custom `FunctionFactory` provider, `PythonFunctionFactory`, has been implemented to support `CREATE FUNCTION` statements.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmilenkovicm%2Fballista_python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmilenkovicm%2Fballista_python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmilenkovicm%2Fballista_python/lists"}