Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/RCHowell/Sift

Sift is a basic, Relational Algebra based query engine built on top of Apache Arrow. It draws inspiration from Andy Grove's KQuery.
https://github.com/RCHowell/Sift

query relation-algebra sql

Last synced: about 2 months ago
JSON representation

Sift is a basic, Relational Algebra based query engine built on top of Apache Arrow. It draws inspiration from Andy Grove's KQuery.

Awesome Lists containing this project

README

        

## Preface

I built this as an exercise while studying [Database Systems: The Complete Book](http://infolab.stanford.edu/~ullman/dscb.html) (DSCB) by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom. I also wanted to experiment with Apache Arrow, and I found Andy Grove's [KQuery](https://github.com/andygrove/how-query-engines-work); much of this work is modelled after his engine, and I have left notes where I use some of his constructs. This exercise was more about studying the execution of queries, so little effort was put into the parser and planner. There are currently no plan optimizations, and the language is simply syntactic sugar over the operators of Relation Algebra discussed in DSCB.

## Operations
- Scan
- Selection
- Projection
- Limit
- Grouping/Aggregation
- Distinct
- Sort (TODO)
- Join / Union / Difference / Intersection (TODO)

## Language

> Full details in *sift.lang/README.md*

The purpose of the Sift language is to have a query language that maps near 1:1 to operators of the extended relational algebra discussed in section 5.2 of Garcia-Molina et. al. It is literally an inversion of the query expression tree using the F# (and Elixir) pipe operator to simplify writing nested transformations.

Limitations in the language come from my inability to dedicate time to the parser. Right now, I'm more interested in learning about parser generators. The purpose of the hand-written lexer and parser was to learn some basics.

A query is formed with a relation production followed by transformations. All type data is provided by the **Schema** of a data **Source** which is registered to the query execution environment. The full BNF is at the bottom.

### Shell Example

![](https://i.imgur.com/1RGvkLm.png)

![](https://i.imgur.com/s2yIvwl.png)

### Relation Productions

Let *R(A, B, C)* and *S(B, C, D)* be two relations. Here are some example relation productions, including subqueries.
```
# simple scan
'R'

# joins
'R' JOIN 'S'
'R' OUTER JOIN 'S'
'R' JOIN 'S' ON A = D

# equivalent to the previous join
'R' X 'S' |> SELECT A = D

# project tuples to same domain prior to union
('R' |> PROJECT B, C) UNION ('S' |> PROJECT B, C)

# Let T(X, Y) and V(X, Y) be two relations
'T' X 'V' # cross
'T' U 'V' # union
'T' \ 'V' # difference
'T' & 'V' # intersection
```

### Examples

```
Q: Select all titles produced by Paramount between 1979 and 1982

'Movies'
|> SELECT (1979 <= Year && Year <= 1982) && Studio = 'Paramount'
|> PROJECT Title
```

```
Q: Get the average, min, and max heights of all players by age and position

'Players'
|> PROJECT Height, Age
|> GROUP AVG(Height) -> Avg, MIN(Height) -> Shortest, MAX(Height) -> Tallest BY Age, Position
```

## Execution

> Do 'gradle run --console plain' to run the interactive query shell

### Sample Data

The sample data is a collection of some fuzzy friends.

```
┌─────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│Name │Age │Gender │Weight │Type │Breed │
├─────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│Ramona │2.00 │F │8.00 │Cat │Mini Coon │
├─────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│Mochi │2.00 │F │45.00 │Dog │Samoyed │
├─────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│Cali │7.00 │F │30.00 │Dog │Vizsla │
├─────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│Gretchen │13.00 │F │50.00 │Dog │English │
│ │ │ │ │ │Bulldog │
├─────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│Cooper │6.00 │M │30.00 │Dog │Beagle │
├─────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│Eleanor │5.00 │F │24.00 │Dog │Cocker │
│ │ │ │ │ │Spaniel │
├─────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│Huckleberry │7.00 │M │20.00 │Cat │Medium Coon │
├─────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│Madman Mochi │3.00 │M │14.00 │Cat │Unknown │
└─────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘
```

### Selection

> You can see I have a bug in the precedence of parsing, but I don't care much about the parser

```
'Pets' |> SELECT (Type = 'Dog') && (Gender = 'F')

┌─────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│Name │Age │Gender │Weight │Type │Breed │
├─────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│Mochi │2.00 │F │45.00 │Dog │Samoyed │
├─────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│Cali │7.00 │F │30.00 │Dog │Vizsla │
├─────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│Gretchen │13.00 │F │50.00 │Dog │English │
│ │ │ │ │ │Bulldog │
├─────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│Eleanor │5.00 │F │24.00 │Dog │Cocker │
│ │ │ │ │ │Spaniel │
└─────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘
```

### Projection

```
'Pets'
|> SELECT Type = 'Cat'
|> PROJECT Name + ' is a ' + Breed + ' kitty cat' -> Greeting

┌──────────────────────────────────────────────────────────────────────────────┐
│Greeting │
├──────────────────────────────────────────────────────────────────────────────┤
│Ramona is a Mini Coon kitty cat │
├──────────────────────────────────────────────────────────────────────────────┤
│Huckleberry is a Medium Coon kitty cat │
├──────────────────────────────────────────────────────────────────────────────┤
│Madman Mochi is a Unknown kitty cat │
└──────────────────────────────────────────────────────────────────────────────┘
```

### Aggregations

```
'Pets' |> GROUP MAX(Weight) -> Thiccest BY Type

┌───────────────────────────────────────┬──────────────────────────────────────┐
│Type │Thiccest │
├───────────────────────────────────────┼──────────────────────────────────────┤
│Cat │20.00 │
├───────────────────────────────────────┼──────────────────────────────────────┤
│Dog │50.00 │
└───────────────────────────────────────┴──────────────────────────────────────┘
```

---

## SiftQL BNF

```
# Tokens
::= [A-Za-z\-_]+ # operators, relation and field identifiers
::= '[A-Za-z0-9\s]+'
::= [0-9]+(.[0-9]+)?
::= (TRUE|FALSE|UNKOWN)
::= NULL

::=

::=
|
|
|
|
|

::= '' # quoted identifier
| ( ) # sub-query

::= (AS )? (OUTER|LEFT|RIGHT)? JOIN (AS )? (ON )?
::= (X|CROSS)
::= (U|UNION)
::= (\|DIFF)
::= (&|INTERSECT)

::= (|> )*
::=
|
|
| (BY )? (ASC|DESC)
| LIMIT
| DISTINC

::= SELECT

::= PROJECT
::=
| ->

::= GROUP (BY )?
::= ->
::= \#()

::=
|
| ( )
::= # field reference
| \#() # functions
|
::= (|||)
```

## Shell

Try Graal