https://github.com/xiaodaigh/diban.jl

Dìbǎn (地板) is a Parquet reader and writer

Host: GitHub
URL: https://github.com/xiaodaigh/diban.jl
Owner: xiaodaigh
License: mit
Created: 2020-05-01T01:42:05.000Z (about 5 years ago)
Default Branch: master
Last Pushed: 2020-08-22T04:46:12.000Z (over 4 years ago)
Last Synced: 2025-01-21T10:08:26.509Z (4 months ago)
Language: Julia
Homepage:
Size: 97.7 KB
Stars: 3
Watchers: 4
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Please use Parquet.jl instead. This is a purely development library. All algorithms will be contributed back to Parquet.jl

## Diban.jl (Dìbǎn 地板)

The `write_parquet` and `read_parquet` functions write and read Parquet files.
They are EXTREMELY slow at the moment, but they work on newer Parquet files
that Parquet.jl can't handle yet.

The intention is to contribute these functions back to Parquet.jl so as not to
fragment the community's efforts. But the process is likely to be slow. Therefore,
I am making Dìbǎn available while Parquet.jl is being worked on.

## Installation

You need a particular branch of Parquet.jl and the master branch of Diban.jl.

```julia
# add the latest version of Dìbǎn
]add https://github.com/xiaodaigh/Diban.jl
```

## Usage

### Write

Diban supports `Int32`, `Int64`, `Float32`, `Float64`, `Bool` (including `BitArray`), and `String` vectors, and their `Union` with `Missing`.

```julia
using Diban
using DataFrames

tbl = DataFrame(
    int32 = Int32[-1, 0, 1],
    int64 = Int64[-10, 0, 10],
    float32 = Float32[-0.5, 0, 0.5],
    float64 = Float64[-0.5, 0, 0.5],
    bool = [true, false, true],
    string = ["abc", "def", "ghi"],
    int32m = Union{Missing, Int32}[-1, missing, 1],
    int64m = Union{Missing, Int64}[-10, missing, 10],
    float32m = Union{Missing, Float32}[-0.5, missing, 0.5],
    float64m = Union{Missing, Float64}[-0.5, missing, 0.5],
    boolm = Union{Missing, Bool}[true, missing, false],
    stringm = Union{Missing, String}["abc", missing, "ghi"],
)

path = "c:/scratch/tmp.parquet"

write_parquet(path, tbl)

a = read_parquet(path)
```
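Since only the element types listed above are supported, it can help to check a table before writing it. The following is a minimal sketch, not part of Diban: `SUPPORTED_ELTYPES`, `is_supported`, and `unsupported_columns` are hypothetical names, and it assumes the list of types above is exhaustive. It uses only Base Julia and DataFrames.

```julia
using DataFrames

# Element types listed as supported in this README
# (assumption: the list is exhaustive).
const SUPPORTED_ELTYPES = (Int32, Int64, Float32, Float64, Bool, String)

# Hypothetical helper, not part of Diban: true if the column's element type,
# with any `Missing` stripped, is one of the supported types.
# A `BitArray` passes because its element type is `Bool`.
is_supported(col::AbstractVector) = nonmissingtype(eltype(col)) in SUPPORTED_ELTYPES

# Names of the columns that would need converting before writing.
unsupported_columns(tbl) = [name for name in names(tbl) if !is_supported(tbl[!, name])]

tbl = DataFrame(a = Int32[1, 2], b = [1.0, missing], c = [:x, :y])

# Only column `c` is flagged, because `Symbol` is not in the supported list.
unsupported_columns(tbl)
```

Columns flagged this way could be converted first, for example with `string.(tbl.c)`, before the table is passed to `write_parquet`.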

### Read

```julia
using Diban

# read every column (multithreading has some known bugs; see Notes & Bugs)
read_parquet(path)

# read only columns `col1` and `col2`
read_parquet(path, ["col1", "col2"])
```

### Notes & Bugs?

Currently, only non-nested columns are supported.

There are some bugs with multithreading, so you may want to use `multithreaded=false`:

```julia
read_parquet(path, multithreaded=false)
```

## TODO

* Add support for CategoricalArrays
