{"id":16707193,"url":"https://github.com/expandingman/arrow.jl","last_synced_at":"2025-10-29T22:20:08.172Z","repository":{"id":54426997,"uuid":"117885248","full_name":"ExpandingMan/Arrow.jl","owner":"ExpandingMan","description":"DEPRECATED in favor of [JuliaData/Arrow.jl](https://github.com/JuliaData/Arrow.jl)","archived":false,"fork":false,"pushed_at":"2020-11-16T19:22:08.000Z","size":253,"stargazers_count":55,"open_issues_count":26,"forks_count":9,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-06-13T11:06:26.908Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Julia","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ExpandingMan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-01-17T19:51:57.000Z","updated_at":"2025-06-06T08:41:10.000Z","dependencies_parsed_at":"2022-08-13T15:20:24.675Z","dependency_job_id":null,"html_url":"https://github.com/ExpandingMan/Arrow.jl","commit_stats":null,"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"purl":"pkg:github/ExpandingMan/Arrow.jl","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ExpandingMan%2FArrow.jl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ExpandingMan%2FArrow.jl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ExpandingMan%2FArrow.jl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ExpandingMan%2FArrow.jl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ExpandingMan","download_url":"https://codeload.github.com/ExpandingMan/Arrow.jl/ta
r.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ExpandingMan%2FArrow.jl/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259634348,"owners_count":22887697,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-12T19:37:54.746Z","updated_at":"2025-10-29T22:20:08.109Z","avatar_url":"https://github.com/ExpandingMan.png","language":"Julia","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ⛔Deprecation Notice ⛔\nThis package is deprecated in favor of\n[JuliaData/Arrow.jl](https://github.com/JuliaData/Arrow.jl).\n\nAs of writing, this package is still used by Feather.jl which reads and writes legacy\nfeather v1 files.  However, as Feather v1 is deprecated in favor of Feather v2 which is\nimplemented by JuliaData/Arrow.jl, it is unlikely that Feather.jl will be maintained in\nthe future.  We recommend using either Feather.jl or pyarrow to convert your data to the\nlatest feather format.\n\n# Arrow\n\n[![Build Status](https://travis-ci.org/ExpandingMan/Arrow.jl.svg?branch=master)](https://travis-ci.org/ExpandingMan/Arrow.jl)\n[![codecov.io](http://codecov.io/github/ExpandingMan/Arrow.jl/coverage.svg?branch=master)](http://codecov.io/github/ExpandingMan/Arrow.jl?branch=master)\n\nThis is a pure Julia implementation of the [Apache Arrow](https://arrow.apache.org) data standard.  This package provides Julia `AbstractVector` objects for\nreferencing data that conforms to the Arrow standard.  
This allows users to seamlessly interface Arrow formatted data with a great deal of existing Julia code.\n\nPlease see this [document](https://arrow.apache.org/docs/memory_layout.html) for a description of the Arrow memory layout.\n\n***WARNING*** As of right now this package uses Julia `Ptr` (pointer) objects and \"unsafe\" methods.  This is for performance reasons.  It should in principle be\npossible to make this package completely safe with little to no loss in performance, but we are waiting on some performance improvements in `Base`.  While\nArrow.jl has been tested and should be safe with proper usage, it is up to the user to make sure that their Arrow.jl objects reference the appropriate locations\nin data.  If the user, for example, uses an Arrow.jl object to reference data past the end of an array, the resulting program will segfault!\n\n\n## Installation\nJust do\n```julia\nimport Pkg; Pkg.add(\"Arrow\")\n```\nArrow only has `CategoricalArrays` as a dependency (and `Missings` on 0.6).\n\n\n## `ArrowVector` Objects\nThe `Arrow` package exposes several `ArrowVector{J} \u003c: AbstractVector{J}` objects.  These provide an interface to arrow formatted data as well as providing\nmethods to convert Julia objects to the Arrow data format.  The simplest of these is\n```julia\nPrimitive{J} \u003c: ArrowVector{J}\n```\nThis object maintains a reference to a data buffer (a `Vector{UInt8}`) and describes a contiguous subset of it.  It will automatically convert the underlying\ndata to the type `J` on demand.  The `Primitive` type can only describe bits type elements (i.e. types for which `isbits` is true, in particular not strings).  
In the\nfollowing example we create a `Primitive` to address a subset of a buffer\n```julia\ndata = [0, 2, 3, 5, 7, 0] # this will be the underlying data from which we create our buffer\nbuff = reinterpret(UInt8, data) # in this simple case the Arrow format and Julia's in-memory format coincide\np = Primitive{Int}(buff, 9, 4) # arguments are: buffer, start location, length\n\np[1] # returns 2\np[2:3] # returns the (non-arrow) Vector [3,5]\np[:] # returns the (non-arrow) Vector [2,3,5,7]\n\np[2] = 999 # assignment is supported for AbstractPrimitive types. this change is reflected in buff and data\n\n\nq = Primitive([2,3,5,7]) # if we didn't already have a buffer we needed to reference, we can create one like this\nq = arrowformat([2,3,5,7]) # the arrowformat function automatically determines the appropriate ArrowVector for the provided array\nrawvalues(q) # this returns the created buffer as a Vector{UInt8}\n```\nHere we see that indexing an `ArrowVector` returns ordinary Julia arrays containing the data stored in the Arrow buffer.  All other `ArrowVector` objects are\nbuilt out of combinations of `Primitive`s.\n\nEnter `?Primitive` in the REPL for a full list of constructors.\n\n***Note:*** In what follows we show explicit methods for constructing each `ArrowVector` type from a raw data buffer.  This can become a bit confusing where\nthere are many sub-buffer locations to keep track of, so it is strongly suggested that you make use of the `Locate` interface described in the next section.\n\n### The `NullablePrimitive` Type\nThe Arrow format also supports arrays with bits type elements that may be null.  For these we provide the `NullablePrimitive{J} \u003c: AbstractVector{Union{J,Missing}}` type.  Under the hood the\n`NullablePrimitive` type is a pair of `Primitive`s: one references a `Primitive{UInt8}` bit mask describing which elements of the `NullablePrimitive` are null and the\nother references the underlying data.  
In the following example we create a `NullablePrimitive` from an existing buffer\n```julia\nbuff = [[0x0d]; reinterpret(UInt8, [2.0, 3.0, 5.0, 7.0])]  # bits(0x0d) == \"00001101\"\np = NullablePrimitive{Float64}(buff, 1, 2, 4) # arguments are: buffer, bitmask location, values location, length\n\np[1] # returns 2.0\np[2] # returns missing\np[1:4] # returns [2.0, missing, 5.0, 7.0]\n\np[2] = 3.0  # assignment also supported for NullablePrimitive, the change will be reflected in buff\n\n\nq = NullablePrimitive([2.0,missing,5.0,7.0]) # if we didn't already have a buffer we needed to reference, we can create one\n# the above will create separate buffers for the bit mask and values. to create a contiguous buffer containing all we can do\nq = NullablePrimitive(Array, [2.0,missing,5.0,7.0])\nq = arrowformat([2.0,missing,5.0,7.0]) # you can also use arrowformat to automatically determine the ArrowVector type\nrawvalues(bitmask(q)) # returns [0x0d]\n```\n\nEnter `?NullablePrimitive` in the REPL for a full list of constructors.\n\n### The `List` Type\nThe underlying data format for arbitrary length objects such as strings is more complicated, so these objects require a dedicated type.  For these we provide\n`List{J} \u003c: AbstractVector{J}`.  As well as containing the values contained by strings, these objects contain \"offsets\" for describing how long each string\nshould be.  The arrow format suggests that these offsets are `Int32`s and that there are `length(l)+1` of them.  
For example\n```julia\noffs = reinterpret(UInt8, Int32[0,3,5,7])\nvals = convert(Vector{UInt8}, \"abcdefg\")\nbuff = [offs; vals]\n# type parameters: List return type, offsets type (must be \u003c:Integer)\nl = List{String,Int32}(buff, 1, length(offs)+1, 3, UInt8, length(vals)) # arguments are: buffer, offsets location, values location, length of List, value type, values length\n\n# alternatively we can construct the values separately\nv = Primitive{UInt8}(buff, length(offs)+1, length(vals))\nl = List{String,Int32}(buff, 1, 3, v) # arguments are: buffer, offset location, length, values primitive\n\n# or you can create each piece individually\no = Primitive{Int32}(buff, 1, 4)  # note that the Int32 type is required for offsets by the arrow format\nv = Primitive{UInt8}(buff, length(offs)+1, length(vals))\nl = List{String}(o, v)\n\nl[1] # returns \"abc\"\nl[2] # returns \"de\"\nl[3] # returns \"fg\"\nl[1:3] # returns a normal Vector{String} (copies data!)\n\nl[1] = \"a\"  # ERROR: assignments are not currently supported for list types\n\n\nm = List([\"abc\", \"de\", \"fg\"]) # just as in the other cases, you can create your own data\nm = List(Array, [\"abc\", \"de\", \"fg\"]) # you can also require it all to be in a contiguous buffer\nm = arrowformat([\"abc\", \"de\", \"fg\"]) # as always arrowformat automatically determines the ArrowVector type\nrawvalues(offsets(m)) # returns reinterpret(UInt8, [0,3,5,7])\nrawvalues(values(m)) # returns convert(Vector{UInt8}, \"abcdefg\")\n```\nNote that `List{J}` and `NullableList{J}` use the constructor `J(::AbstractVector{C})` where `C` is the values type (in the above example `UInt8`)\n\n***WARNING:*** Currently the values of the offsets themselves are not bounds-checked for performance reasons.  This means you have to be extra sure that your\noffsets are properly constructed.  
It is recommended that you always use `arrowformat`, `List`, or `offsets` to construct offsets, this should not be done\nmanually.\n\nEnter `?List` in the REPL for a full list of constructors.\n\n\n### The `NullableList` Type\nNext we have the `NullableList{J} \u003c: AbstractVector{Union{J,Missing}}` type.  `NullableList` is to `List` as `NullablePrimitive` is to `Primitive`.  In addition\nto offsets and values, it also contains a bit mask describing which elements are null.  By now you can probably predict what the example will look like\n```julia\nbmask = [0x05] # bits(0x05) == \"00000101\"\noffs = reinterpret(UInt8, Int32[0,3,5,7])\nvals = convert(Vector{UInt8}, \"abcdefg\")\nbuff = [bmask; offs; vals]\nl = NullableList{String,Int32}(buff, 1, 2, length(offs)+2, 3, UInt8, length(vals))\n# arguments above are: buffer, bit mask location, offsets location, values location, list length, values type, values length\n\n# again you can also provide each piece separately\nb = Primitive{UInt8}(buff, 1, 1)  # required to have eltype UInt8\no = Primitive{Int32}(buff, 2, 4)  # required to have eltype Int32\nv = Primitive{UInt8}(buff, length(offs)+2, length(vals))\nl = NullableList{String,Int32}(b, o, v)\n\nl[1] # returns \"abc\"\nl[2] # returns missing\nl[3] # returns \"fg\"\n\nl[2] = \"de\"  # ERROR assignments not currently supported for list types\n\n\n# you can also create lists of Primitives, though this may involve copying\nl = NullableList{Primitive{UInt8},Int32}(b, o, v)\n\n\n# by now all the ways of creating this from our own data should be familiar\nm = NullableList([\"abc\", missing, \"fg\"])\nm = NullableList(Array, [\"abc\", missing, \"fg\"])\nm = arrowformat([\"abc\", missing, \"fg\"])\n```\n\nEnter `?NullableList` in the REPL for a full list of constructors.\n\n\n### The `DictEncoding` Type\nThe arrow format also supports dictionary encoding of arrays.  
What this means is simply that instead of one array, there are two, a \"short\" array containing a\nfew values, and a \"long\" array which contains pointers to those values (required by the Arrow standard to be `Int32`).  This provides a way of compressing\narrays in which a relatively small number of values are repeated in large numbers.  Arrow.jl uses the Julia package\n[CategoricalArrays.jl](https://github.com/JuliaData/CategoricalArrays.jl) to support this functionality.  `CategoricalArray`s will be dictionary encoded by\ndefault when converted to Arrow array objects.  One aspect of this that may seem confusing is that references are required to be 0-based indices, which is\ncontrary to the Julia 1-based approach we've used for everything else.  In practice this shouldn't matter much: references do not need to be constructed\nmanually.  See the following\n```julia\n# in most real cases these would be constructed from data in one of the ways described above\nrefs = Primitive{Int32}([0, 1, 2, 0, 1, 3])\nvals = List([\"fire\", \"walk\", \"with\", \"me\"])\nA = DictEncoding(refs, vals)\n\nA[1] # returns \"fire\"\nA[5] # returns \"walk\"\nA[[1,2,3,6]] # returns [\"fire\", \"walk\", \"with\", \"me\"]\n\n\n# you can also create your own from Julia data\nB = DictEncoding([\"fire\", \"walk\", \"with\", \"me\"])  # in this case there is no benefit to DictEncoding over List\n# arrowformat will automatically convert any CategoricalArray object to an Arrow formatted DictEncoding\nB = arrowformat(categorical([\"fire\", \"walk\", \"with\", \"me\"]))\n```\nNote that indexing a `DictEncoding{T}` object will return objects of type `T` or `Vector{T}`.  The only exception is when indexing with a `:`, `A[:]`, in which\ncase a `CategoricalArray` will be returned (equivalently, this can be done with `categorical(A)`).  In order to retrieve slices as `CategoricalArray`, one should\nuse the `categorical` function, e.g. 
`categorical(A, slice)`.\n\n### The `BitPrimitive` and `NullableBitPrimitive` Types\nBecause the Arrow format specifies that `Bool`s should be stored as single bits, a special type is required to store Arrow formatted `Bool` data.  These are\nanalogous to the Julia `BitVector` object.  Note that there is nothing stopping you from serializing Julia `Bool` (which are 8-bit), but these will not in\ngeneral be readable outside of Julia.  `arrowformat` will automatically convert `AbstractVector{Bool}` and `AbstractVector{Union{Bool,Missing}}` to\n`BitPrimitive` and `NullableBitPrimitive` respectively.  These types also provide the usual constructors as seen for the other types above.\n\n## Serializing Julia Data\nNothing is stopping you from storing Julia bits-type data that is not necessarily specified by the Arrow format.  For example, a `Primitive{Complex128}` will\nwork just as expected.  `ArrowVector` objects were deliberately designed so that the way they construct their output depends *only* on their type parameter.\nWhile `arrowformat` will pick the appropriate `ArrowVector` for Arrow formatting data, there are no \"hidden conversions\" happening under the hood: the type\nparameter of your `ArrowVector` object is what you get.  You can therefore serialize any type for which `isbits` is true.  In principle you can also serialize\nmore complicated types using `List`s.  The only caveat is that any type not explicitly described in the Arrow standard will not in general be readable outside\nof Julia.\n\n\n## The `Locate` Interface\nGiven a `Vector{UInt8}` locating your data objects can be rather pedantic, and the last thing you want to do is point your `ArrowVector` objects to the wrong\nmemory locations, as this will lead to scary undefined behavior.  Arrow provides an interface that will make this significantly easier provided your metadata is\nsufficiently well organized (which it always should be).  
This interface will also check to make sure the locations you specify have proper alignment (still\ndoes not guarantee correctness!!).  The idea here is to create Julia `struct`s which somehow represent the metadata of the various objects you want to access.\nIn the following, assume you have defined\n```julia\nstruct ObjMetadata\n    # whatever is needed to locate objects and determine their types goes here. You can also use type parameters if you want\nend\n```\nYou are not limited to only having one such object, you can have arbitrarily many.  Once you define the appropriate methods (described below), all you need to\ndo is call\n```julia\nlocate(data, T, obj)\n# data is the data buffer; T is the return type of the container being constructed; obj is the ObjMetadata\n```\nThis will automatically create the `ArrowVector` object that represents your data.\nThe type parameter you provide specifies the return type, for example you might construct a `List` with `locate(data, String, obj)` or a `NullableBitPrimitive`\nwith `locate(data, Union{Bool,Missing}, obj)`.\n\n### Minimal Interface\nThe minimal way of implementing the locate interface requires defining some of the following methods\n```julia\nLocate.length(obj::ObjMetadata) = # the length (in number of elements) of the ArrowVector\nLocate.values(obj::ObjMetadata) = # location of values (i.e. return value data; char data for Lists)\nLocate.valueslength(obj::ObjMetadata) = # the length of the values sub-buffer (in number of elements); not needed for Primitives, only Lists\nLocate.bitmask(obj::ObjMetadata) = # location of null bitmask\nLocate.offsets(obj::ObjMetadata) = # location of offsets buffer\n```\nOf course, you may only need to define a subset of these.  For example, if all you want are `Primitive`s, you need only define `Locate.length` and\n`Locate.values`.  
If you never need lists, you needn't define `Locate.valueslength` or `Locate.offsets`.\n\n### Overriding Defaults\nThe above interface may not be adequate for all purposes.  For example, if you only define the methods listed above, the offsets type will always default to\n`Int32` (the Arrow format standard).  Furthermore, the type of `ArrowVector` will be determined by the desired return value (i.e. `Primitive` for bits-types,\n`List` for strings, `NullablePrimitive` for `Union{T,Missing}` where `T` is a bits-type, etc.)  To override these defaults you can use more detailed methods:\n```julia\n# value data can be specified by defining the Locate.Values methods\n# T is the value data type, but you may not need it because the overall container return type will override it\nLocate.Values{T}(obj::ObjMetadata) = Locate.Values{T}(Locate.values(obj), Locate.valueslength(obj))\n\n# you need a slightly different Values constructor for List values\n# here T is the List return type so you can use it if you need to, but you may not\nLocate.Values(::Type{T}, obj::ObjMetadata) where T = Locate.Values{UInt8}(Locate.values(obj), Locate.valueslength(obj))\n\n# there's not really a reason to define Locate.Bitmask if you defined Locate.bitmask, but it's there\nLocate.Bitmask(obj::ObjMetadata) = Locate.Bitmask(Locate.bitmask(obj))\n\n# you can use Locate.Offsets to override the default offset type of Int32\nLocate.Offsets(obj::ObjMetadata) = Locate.Offsets{Int64}(Locate.offsets(obj))\n\n# as we described, you can also override the default container types, but this is not recommended\n# it may be useful for custom types, but remember these won't in general be usable outside of Julia\nLocate.containertype(::Type{CustomType}, obj::ObjMetadata) = NullablePrimitive # returned value should have no type paramters\n```\nIn the above we showed constructors receiving the arguments they *would* receive if you *only* defined the `Locate` methods listed in the previous section, but\nof course you 
can make these constructors do anything you want, as long as they return the proper type as an output.\n\n\n## Writing Data\nWriting is somewhat simpler than reading as Arrow will figure out how to convert ordinary Julia data to Arrow formatted data for you.  In addition to\n`arrowformat` the other two most important functions for writing data will be `rawpadded` and `writepadded`.  `rawpadded` takes a `Primitive` as argument and\nreturns a properly Arrow padded `Vector{UInt8}` appropriate for writing the data directly to an Arrow formatted buffer.  `writepadded` will write the properly\npadded array to an `IO` object.\n```julia\nA = NullableList(data)\nwritepadded(io, A, bitmask, offsets, values)  # write bitmask, offsets then values of A, all contiguously, all properly padded\n\nB = DictEncoding(data)\nwritepadded(io, B, references)  # writes references\nwritepadded(io, levels(B), offsets, bitmask, values)  # writes the NullableList in a different order than above\n```\n\nThe following table shows which sub-buffers are relevant for which `ArrowVector`s.  All sub-buffers can be accessed as `Primitive`s simply by doing, for example\n`bitmask(l)` where `l isa ArrowVector{Union{T,Missing}} where T` returns the primitive representing the null bit mask.\n\n|        | `values` | `bitmask` | `offsets` |\n| --- | --- | --- | --- |\n| `Primitive` | 1 | 0 | 0 |\n| `NullablePrimitive` | 1 | 1 | 0 |\n| `List` | 1 | 0 | 1 |\n| `NullableList` | 1 | 1 | 1 |\n| `BitPrimitive` | 1 | 0 | 0 |\n| `NullableBitPrimitive` | 1 | 1 | 0 |\n\n`DictEncoding` is a bit more complicated as it can contain any of the other types, but its references and data pool can be accessed with `references` and `pool`\nrespectively.\n\n## DateTime\nArrow.jl provides Arrow formatted date-time objects that have Julia equivalents.  These are `Arrow.Datestamp=\u003eDates.Date`, `Arrow.Timestamp=\u003eDates.DateTime` and\n`Arrow.TimeOfDay=\u003eDates.Time`.  
The `arrowformat` function will automatically convert objects of the Julia `Dates` types to the appropriate Arrow format.  When\nconstructing the various `ArrowVector` objects, this conversion must be specified explicitly, e.g. with `Primitive{TimeOfDay}(v)` where `v::Vector{Dates.Time}`.\nThere is nothing stopping you from serializing the Julia `Dates` objects, but they are not in general readable outside of Julia.  The units in which `DateTime`\nand `TimeOfDay` are stored can be specified with `Dates.TimePeriod`s.  For example, to store a `DateTime` with resolution of seconds, one should do\n`convert(Timestamp{Dates.Second}, t)` where `t::DateTime`.\n\n## Working Example\nFor a working (but as of this writing still in-development) example of a package built with Arrow.jl see [this](https://github.com/ExpandingMan/Feather.jl/tree/arrow1) fork of Feather.jl.\n\n## TODO\nA lot of work still to be done:\n- Performance pass: performance seems ok according to basic sanity checks, but the code has neither been optimized nor thoroughly benchmarked.\n- Extensive unit tests needed: hopefully I'll get to more of this soon.\n- Support Arrow Structs.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexpandingman%2Farrow.jl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fexpandingman%2Farrow.jl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexpandingman%2Farrow.jl/lists"}