# Volcanito.jl

Volcanito is an attempt to start standardizing the user-facing API that tables
expose in Julia. Because that task is too ambitious for one person writing code
in spurts every few months, the project is starting with something less
ambitious:

* Standardize on a set of user-facing macros that define primitive operations
  on tables:
    * `@select`
    * `@where`
    * `@group_by`
    * `@aggregate_vector`
    * `@order_by`
    * `@limit`
    * `@inner_join`
    * `@left_join`
    * `@right_join`
    * `@outer_join`
* Lower those user-facing macros to objects that lazily represent those
  operations and can be used to build a simplified logical plan:
    * `Select`
    * `Where`
    * `GroupBy`
    * `AggregateVector`
    * `OrderBy`
    * `Limit`
    * `Join`
* Define a basic implementation of how to carry out the logical plan in terms
  of primitive operations on DataFrames from
  [DataFrames.jl](https://github.com/JuliaData/DataFrames.jl).

For more details, see [docs/architecture.md](https://github.com/johnmyleswhite/Volcanito.jl/blob/master/docs/architecture.md).
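
For a quick sense of how these pieces fit together, here is a minimal sketch
(assuming, as in the examples later in this README, that plan nodes can be
passed to the macros wherever a DataFrame is accepted) that chains several
macros into one lazy plan and only executes it at the end:

```
import Pkg
Pkg.activate(".")

import DataFrames: DataFrame

import Volcanito: @select, @where, @limit, materialize

df = DataFrame(
    a = rand(100),
    b = rand(100),
)

# Each macro call returns a lazy plan node; nothing is computed yet.
plan = @limit(@select(@where(df, a > b), a, b, d = a + b), 5)

# Only here is the plan executed against the underlying DataFrame.
materialize(plan)
```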

# Goals

Volcanito is a project that I started to explore a few areas in the Julia data
tools design space:

* *Laziness*: How much value can cross-operation optimizations provide if data
tools have access to a full query plan created by lazy wrappers? How many
optimization opportunities does the current eager evaluation strategy leave
on the table?
* *Row-Wise Semantics*: Are there substantial challenges to using row-wise
semantics everywhere even if DataFrames are stored as columns? Where is
usability increased and where is it decreased by moving to a system in which
all operations are described in terms of arbitrary Julia expressions over
tuples?
* *Syntactical Optimizations*: How many opportunities for optimization depend
upon having access to the source syntax of an expression? For example, can
we support arbitrary join predicates, but use source syntax to optimize
equijoins?
* *Generic Fallbacks*: How much of the data tooling can be handled generically
in a way that new data formats can plug into trivially? Can we have generic
definitions of nested for loop joins and hash joins that work on any source
of tuples? (A rough sketch of such a generic join follows this list.)
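
As a concrete illustration of that last point, here is a minimal, hypothetical
sketch (not Volcanito's actual implementation) of a nested-loop inner join
that works on any iterables of `NamedTuple`s:

```
# A generic nested-loop inner join over any iterables of NamedTuples: keep the
# merged pair of rows whenever the join predicate holds.
function nested_loop_join(predicate, lhs, rhs)
    [merge(l, r) for l in lhs for r in rhs if predicate(l, r)]
end

lhs = [(id = 1, x = 1.0), (id = 2, x = 2.0)]
rhs = [(id = 1, y = -1.0), (id = 3, y = -3.0)]

nested_loop_join((l, r) -> l.id == r.id, lhs, rhs)
```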

# Example Usage

```
import Pkg
Pkg.activate(".")

import DataFrames: DataFrame

import Statistics: mean

import Volcanito:
    @select,
    @where,
    @group_by,
    @aggregate_vector,
    @order_by,
    @limit,
    @inner_join

df = DataFrame(
    a = rand(10_000),
    b = rand(10_000),
    c = rand(Bool, 10_000),
)

@select(df, a, b, d = a + b)

@where(df, a > b)

@aggregate_vector(
    @group_by(df, !c),
    m_a = mean(a),
    m_b = mean(b),
    n_a = length(a),
    n_b = length(b),
)

@order_by(df, a + b)

@limit(df, 10)

@inner_join(
    a = df,
    b = @aggregate_vector(
        @group_by(df, c),
        m_a = mean(a),
        m_b = mean(b),
        n_a = length(a),
        n_b = length(b),
    ),
    a.c == b.c,
)

@aggregate_vector(df, m = mean(a))
```

To make it easier to understand how things work, the examples above all exploit
the fact that Volcanito's user-facing macros construct `LogicalNode` objects
that automatically materialize the result of a query whenever `Base.show` is
called. This makes it seem as if the user-facing macros operate eagerly, but
the truth is that they operate lazily and produce `LogicalNode` objects rather
than DataFrames. If you want to transform a `LogicalNode` object into a full
DataFrame, you should explicitly call `Volcanito.materialize`.

```
import Pkg
Pkg.activate(".")

import DataFrames: DataFrame

import Volcanito:
    @select,
    materialize

df = DataFrame(
    a = rand(10_000),
    b = rand(10_000),
    c = rand(Bool, 10_000),
)

plan = @select(df, a, b, d = a + b)

typeof(plan)

df = materialize(plan)

typeof(df)
```

# Expression Rewrites

To simplify working with data, the macros apply several rewrite passes that
automate tedious transformations users would otherwise have to perform manually.

## Automatic Three-Valued Logic

Three-valued logic works even with short-circuiting Boolean operators:

```
import Pkg
Pkg.activate(".")

import DataFrames: DataFrame

import Volcanito: @where

df = DataFrame(
    a = [missing, 0.25, 0.5, 0.75],
    b = [missing, 0.75, 0.5, 0.25],
)

function f(x)
    println("Calling f(x) on x = $x")
    x + 1
end

@where(df, f(a) > 1.5 && f(b) >= 1.25)
```
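
For contrast, here is what plain Julia does with the same kind of predicate
outside of these macros: comparing against `missing` yields `missing`, and
short-circuiting on a `missing` value throws an error, which is what the
automatic rewrite spares you from handling by hand.

```
# Plain Julia, outside Volcanito's macros: comparisons with missing propagate
# missing, and `&&` cannot short-circuit on a missing value.
a = missing
b = 0.75

a > 1.5                     # evaluates to missing
# (a > 1.5) && (b >= 1.25)  # throws TypeError: non-boolean (Missing) used in boolean context
```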

## Local Variable Interpolation/Splicing

Local scalar variables can be interpolated/spliced into expressions with `$`:

```
import Pkg
Pkg.activate(".")

import DataFrames: DataFrame

import Volcanito: @where

df = DataFrame(
    a = [missing, 0.25, 0.5, 0.75],
    b = [missing, 0.75, 0.5, 0.25],
)

let x = 0.5
    @where(df, a >= $x)
end
```
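
The same mechanism works inside functions. Below is a small sketch using a
hypothetical helper (the function and its `threshold` argument are
illustrative, not part of Volcanito) that splices a function argument into the
predicate:

```
import Pkg
Pkg.activate(".")

import DataFrames: DataFrame

import Volcanito: @where

df = DataFrame(
    a = [missing, 0.25, 0.5, 0.75],
    b = [missing, 0.75, 0.5, 0.25],
)

# Hypothetical helper: `threshold` is a local variable spliced into the
# expression with `$`, not a column name.
function rows_at_least(df, threshold)
    @where(df, a >= $threshold)
end

rows_at_least(df, 0.5)
```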

## Backtick Syntax for Expressing Arbitrary Column Names

As in SQL, Volcanito allows backticks to be used to indicate that an otherwise
invalid identifier is a column name. This can be used when column names are
derived from an expression without an alias:

```
import Pkg
Pkg.activate(".")

import DataFrames: DataFrame

import Volcanito: @select, @aggregate_vector

import Statistics: mean

df = DataFrame(
    a = rand(10_000),
    b = rand(10_000),
)

@select(
    @aggregate_vector(df, mean(a)),
    `mean(a)` + 1,
)
```

This trick means that the normal Julia backtick syntax for generating a `Cmd`
object is not available inside expressions passed to these macros: use the
`@cmd` string macro instead to achieve the same effect.
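
For reference, Base's `@cmd` string macro builds the same kind of `Cmd` object
that backtick literals normally would:

```
# Base's @cmd macro constructs a Cmd without backtick literal syntax.
cmd = @cmd "echo hello"
run(cmd)
```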

## Backtick Syntax + Interpolation for Expressing Dynamic Column Names

One challenge with metaprogramming approaches like the one Volcanito employs is
that it can be difficult to use these techniques in functions in which the
column names to be computed against are not known statically. To address this,
Volcanito further co-opts backtick syntax and combines it with interpolation
syntax to make it possible to indicate that column names are dynamic and only
known at runtime. An example of using this capability in a function is shown below:

```
import Pkg
Pkg.activate(".")

import DataFrames: DataFrame

import Volcanito: @select, materialize

df = DataFrame(
    a = rand(10_000),
    b = rand(10_000),
)

function add_columns(df, x, y)
    @select(df, new_col = `$x` + `$y`)
end

add_columns(df, :a, :b)

isequal(
    materialize(@select(df, new_col = a + b)),
    materialize(add_columns(df, :a, :b)),
)
```