{"id":16811254,"url":"https://github.com/kszucs/firebolt","last_synced_at":"2025-03-17T10:43:03.289Z","repository":{"id":249570232,"uuid":"824245844","full_name":"kszucs/firebolt","owner":"kszucs","description":"Arrow implementation in Mojo","archived":false,"fork":false,"pushed_at":"2024-07-21T21:49:37.000Z","size":22,"stargazers_count":17,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-09T11:03:44.170Z","etag":null,"topics":["apache-arrow","mojo-lang"],"latest_commit_sha":null,"homepage":"","language":"Mojo","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kszucs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-04T17:11:10.000Z","updated_at":"2025-02-21T03:48:01.000Z","dependencies_parsed_at":"2024-07-21T23:09:22.966Z","dependency_job_id":null,"html_url":"https://github.com/kszucs/firebolt","commit_stats":null,"previous_names":["kszucs/firebolt"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kszucs%2Ffirebolt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kszucs%2Ffirebolt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kszucs%2Ffirebolt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kszucs%2Ffirebolt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kszucs","download_url":"https://codeload.github.com/kszucs/firebolt/tar.gz/refs/heads/main","h
ost":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244019570,"owners_count":20384788,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-arrow","mojo-lang"],"created_at":"2024-10-13T10:18:10.138Z","updated_at":"2025-03-17T10:43:03.266Z","avatar_url":"https://github.com/kszucs.png","language":"Mojo","funding_links":[],"categories":[],"sub_categories":[],"readme":"# In-progress implementation of Apache Arrow in Mojo\n\nThe initial motivation for this project was to learn the Mojo programming language, and the best way to learn is by doing. Since I've been involved in the Apache Arrow project for a while, I thought it would be a good idea to implement the Arrow specification in Mojo.\n\nThe implementation is far from being complete or usable in practice, but I prefer to share it in its early stage so others can join the effort.\n\n### What is Arrow?\n\n[Apache Arrow](https://arrow.apache.org) is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs.\n\n### What is Mojo?\n\n[Mojo](https://www.modular.com/mojo) is a promising new programming language built on top of MLIR, providing the expressiveness of Python with the performance of systems programming languages.\n\n### Why Arrow in Mojo?\n\nI find the Mojo language really promising and Arrow should be a first-class citizen in Mojo's ecosystem. 
Since the language itself is still in its early stages, under heavy development, this Arrow implementation is still in an experimental phase.\n\n## Currently implemented abstractions\n\n- `Buffer` providing the memory management for contiguous memory regions.\n- `DataType` for defining the `Arrow` data types.\n- `ArrayData` as the common layout for all `Arrow` arrays.\n- Typed array views for primitive, string and nested arrow arrays providing more convenient and efficient access to the underlying `ArrayData`.\n- [Arrow C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html) to exchange arrow data between other implementations in a zero-copy manner, but only one direction is implemented for now.\n\n## Examples\n\n### Creating a primitive array\n\n```mojo\nfrom firebolt.arrays import array, StringArray, ListArray, Int64Array\nfrom firebolt.dtypes import int8, bool_, list_\n\nvar a = array[int8](1, 2, 3, 4)\nvar b = array[bool_](True, False, True)\n```\n\n### Creating a string array\n\n```mojo\nvar s = StringArray()\ns.unsafe_append(\"hello\")\ns.unsafe_append(\"world\")\n```\n\nMore convenient APIs are planned to be added in the future.\n\n### Creating a list array\n\n```mojo\nvar ints = Int64Array()\nvar lists = ListArray(ints)\n\nints.append(1)\nints.append(2)\nints.append(3)\nlists.unsafe_append(True)\nassert_equal(len(lists), 1)\nassert_equal(lists.data.dtype, list_(int64))\n```\n\n### Zero-copy access of a PyArrow array in Mojo\n\nFor more details see the [Arrow C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html).\n\n```mojo\nvar pa = Python.import_module(\"pyarrow\")\nvar pyarr = pa.array(\n   [1, 2, 3, 4, 5], mask=[False, False, False, False, True]\n)\n\nvar c_array = CArrowArray.from_pyarrow(pyarr)\nvar c_schema = CArrowSchema.from_pyarrow(pyarr.type)\n\nvar dtype = c_schema.to_dtype()\nassert_equal(dtype, int64)\nassert_equal(c_array.length, 5)\nassert_equal(c_array.null_count, 1)\nassert_equal(c_array.offset, 
0)\nassert_equal(c_array.n_buffers, 2)\nassert_equal(c_array.n_children, 0)\n\nvar data = c_array.to_array(dtype)\nvar array = data.as_int64()\nassert_equal(array.bitmap[].size, 64)\nassert_equal(array.is_valid(0), True)\nassert_equal(array.is_valid(1), True)\nassert_equal(array.is_valid(2), True)\nassert_equal(array.is_valid(3), True)\nassert_equal(array.is_valid(4), False)\nassert_equal(array.unsafe_get(0), 1)\nassert_equal(array.unsafe_get(1), 2)\nassert_equal(array.unsafe_get(2), 3)\nassert_equal(array.unsafe_get(3), 4)\nassert_equal(array.unsafe_get(4), 0)\n\narray.unsafe_set(0, 10)\nassert_equal(array.unsafe_get(0), 10)\nassert_equal(str(pyarr), \"[\\n  10,\\n  2,\\n  3,\\n  4,\\n  null\\n]\")\n```\n\n## Rough edges and limitations\n\nSo far the implementation has focused on providing a solid foundation for further development, not on memory efficiency, performance or completeness.\n\nA couple of notable limitations:\n\n1. The chosen abstractions may not be ideal, but:\n   - Mojo lacks support for dynamic dispatch at the moment\n   - variant elements must be copyable\n   - references and lifetimes are not hardened yet\n   - expressing nested data types is not straightforward\n\n   For these reasons, polymorphism is achieved by defining a common layout for type hierarchies and providing specialized views for each child type. This approach seems to work well for nested `DataType` and `Array` types and the implementation can be continued while `Mojo` gains the necessary features to rethink these abstractions.\n\n2. The `C Data Interface` doesn't call the release callbacks yet and only consuming Arrow data is implemented for now because a `Mojo` callback cannot be passed to a `C` function yet. As Mojo matures, this limitation will certainly be addressed.\n\n3. Conformance against the `Arrow` specification is tested by reading Arrow data from the Python implementation `PyArrow`, since `Mojo` can already call Python functions. 
If the project manages to evolve further, it should be wired into the Arrow integration testing suite, but first that requires a `JSON` library in `Mojo`.\n\n4. Only boolean, numeric, string, list and struct datatypes are supported for now since these cover most of the implementation complexity. Support for the rest of the Arrow data types can be added incrementally.\n\n5. A convenient API hasn't been designed yet; preferably that should be tackled once the implementation is more mature.\n\n6. No `ChunkedArray`s, `RecordBatch`es, `Table`s are implemented yet, but soon they will be.\n\n7. No CI has been set up yet, but it is going to be in focus really soon.\n\n## Development\n\nI shared the implementation in its current state so others can join the effort.\nIf the project manages to evolve, ideally it should be donated to the upstream Apache Arrow project.\n\nGiven an existing Mojo installation the tests can be run with:\n\n```bash\ncd firebolt\nmojo test firebolt -I .\n```\n\nTested with nightly `Mojo`:\n\n```bash\n$ mojo --version\nmojo 2024.7.1805 (0a697965)\n```\n\n## References\n\n- [Another effort to implement Arrow in Mojo](https://github.com/mojo-data/arrow.mojo)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkszucs%2Ffirebolt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkszucs%2Ffirebolt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkszucs%2Ffirebolt/lists"}