Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/zero-one-group/geni
A Clojure dataframe library that runs on Spark
big-data clojure clojure-library clojure-repl data-engineering data-science dataframe distributed-computing high-performance-computing machine-learning parallel-computing spark
Last synced: 1 day ago
A Clojure dataframe library that runs on Spark
- Host: GitHub
- URL: https://github.com/zero-one-group/geni
- Owner: zero-one-group
- License: apache-2.0
- Created: 2020-04-18T10:46:14.000Z (over 4 years ago)
- Default Branch: develop
- Last Pushed: 2023-11-28T17:22:38.000Z (about 1 year ago)
- Last Synced: 2024-11-29T08:58:35.693Z (13 days ago)
- Topics: big-data, clojure, clojure-library, clojure-repl, data-engineering, data-science, dataframe, distributed-computing, high-performance-computing, machine-learning, parallel-computing, spark
- Language: Clojure
- Homepage:
- Size: 1.86 MB
- Stars: 286
- Watchers: 13
- Forks: 28
- Open Issues: 17
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- awesome-articles - Geni - Clojure Dataframes
README
Geni (*/gɜni/* or "gurney" without the r) is a [Clojure](https://clojure.org/) dataframe library that runs on [Apache Spark](https://spark.apache.org/). The name means "fire" in Javanese.
[![CI](https://github.com/zero-one-group/geni/actions/workflows/continuous-integration.yml/badge.svg?branch=develop)](https://github.com/zero-one-group/geni/actions)
[![Code Coverage](https://codecov.io/gh/zero-one-group/geni/branch/develop/graph/badge.svg)](https://codecov.io/gh/zero-one-group/geni)
[![Clojars Project](https://img.shields.io/clojars/v/zero.one/geni.svg)](http://clojars.org/zero.one/geni)
[![License](https://img.shields.io/github/license/zero-one-group/geni.svg)](LICENSE)
## Overview
Geni provides an idiomatic Spark interface for Clojure without the hassle of Java or Scala interop. Geni uses Clojure's `->` threading macro as the main way to compose Spark's `Dataset` and `Column` operations in place of the usual method chaining in Scala. It also provides a greater degree of dynamism by allowing args of mixed types such as columns, strings and keywords in a single function invocation. See the docs section on [Geni semantics](docs/semantics.md) for more details.
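As a minimal sketch of this threading style (reusing the `dataframe` bound in the Basic Examples section below, and only functions that appear there), note how keyword and string column references mix freely in a single call:
```clojure
;; A hedged sketch: `dataframe` is the housing dataset loaded in the
;; Basic Examples section below; keyword and string column names mix freely.
(require '[zero-one.geni.core :as g])
(-> dataframe
    (g/select :ocean_proximity "median_house_value")
    (g/limit 5)
    g/show)
```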
## Resources
Docs:
- A Simple Performance Benchmark
- Code of Conduct
- Contributing Guide
- Creating Spark Schemas
- Examples
- Design Goals
- Geni Semantics
- Manual Dataset Creation
- Optional XGBoost Support
- Pandas, NumPy and Other Idioms
- Using Dataproc
- Using Kubernetes
- Where's The Spark Session
- Why?
- Working with SQL Maps
- Collecting Data from Spark Datasets
Cookbook:
- Getting Started with Clojure, Geni and Spark
- Reading and Writing Datasets
- Selecting Rows and Columns
- Grouping and Aggregating
- Combining Datasets with Joins and Unions
- String Operations
- Cleaning up Messy Data
- Timestamps and Dates
- Window Functions
- Reading from and Writing to SQL Databases
- Avoiding Repeated Computations with Caching
- Basic ML Pipelines
- Customer Segmentation with NMF
[![cljdoc](https://cljdoc.org/badge/zero.one/geni)](https://cljdoc.org/d/zero.one/geni/CURRENT)
[![slack](https://badgen.net/badge/-/clojurians%2Fgeni?icon=slack&label)](https://clojurians.slack.com/messages/geni/)
[![zulip](https://img.shields.io/badge/zulip-clojurians%2Fgeni-brightgreen.svg)](https://clojurians.zulipchat.com/#narrow/stream/256615-geni)
## Basic Examples
All examples below use the Statlib California housing prices data available for free on [Kaggle](https://www.kaggle.com/camnugent/california-housing-prices).
Spark SQL API for data wrangling:
```clojure
(require '[zero-one.geni.core :as g])
(def dataframe (g/read-parquet! "test/resources/housing.parquet"))
(g/count dataframe)
=> 5000
(g/print-schema dataframe)
; root
; |-- longitude: double (nullable = true)
; |-- latitude: double (nullable = true)
; |-- housing_median_age: double (nullable = true)
; |-- total_rooms: double (nullable = true)
; |-- total_bedrooms: double (nullable = true)
; |-- population: double (nullable = true)
; |-- households: double (nullable = true)
; |-- median_income: double (nullable = true)
; |-- median_house_value: double (nullable = true)
; |-- ocean_proximity: string (nullable = true)
(-> dataframe (g/limit 5) g/show)
; +---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
; |longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
; +---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
; |-122.23 |37.88 |41.0 |880.0 |129.0 |322.0 |126.0 |8.3252 |452600.0 |NEAR BAY |
; |-122.22 |37.86 |21.0 |7099.0 |1106.0 |2401.0 |1138.0 |8.3014 |358500.0 |NEAR BAY |
; |-122.24 |37.85 |52.0 |1467.0 |190.0 |496.0 |177.0 |7.2574 |352100.0 |NEAR BAY |
; |-122.25 |37.85 |52.0 |1274.0 |235.0 |558.0 |219.0 |5.6431 |341300.0 |NEAR BAY |
; |-122.25 |37.85 |52.0 |1627.0 |280.0 |565.0 |259.0 |3.8462 |342200.0 |NEAR BAY |
; +---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
(-> dataframe (g/describe :housing_median_age :total_rooms :population) g/show)
; +-------+------------------+------------------+-----------------+
; |summary|housing_median_age|total_rooms |population |
; +-------+------------------+------------------+-----------------+
; |count |5000 |5000 |5000 |
; |mean |30.9842 |2393.2132 |1334.9684 |
; |stddev |12.969656616832669|1812.4457510408017|954.0206427949117|
; |min |1.0 |1000.0 |100.0 |
; |max |9.0 |999.0 |999.0 |
; +-------+------------------+------------------+-----------------+
(-> dataframe
(g/group-by :ocean_proximity)
(g/agg {:count (g/count "*")
:mean-rooms (g/mean :total_rooms)
:distinct-lat (g/count-distinct (g/int :latitude))})
(g/order-by (g/desc :count))
g/show)
; +---------------+-----+------------------+------------+
; |ocean_proximity|count|mean-rooms |distinct-lat|
; +---------------+-----+------------------+------------+
; |INLAND |1823 |2358.181020296215 |10 |
; |<1H OCEAN |1783 |2467.5361749859785|7 |
; |NEAR BAY |1287 |2368.72027972028 |2 |
; |NEAR OCEAN |107 |2046.1869158878505|2 |
; +---------------+-----+------------------+------------+
(-> dataframe
(g/select {:ocean :ocean_proximity
:house (g/struct {:rooms (g/struct :total_rooms :total_bedrooms)
:age :housing_median_age})
:coord (g/struct {:lat :latitude :long :longitude})})
(g/limit 3)
g/collect)
=> ({:ocean "NEAR BAY",
:house {:rooms {:total_rooms 880.0, :total_bedrooms 129.0},
:age 41.0},
:coord {:lat 37.88, :long -122.23}}
{:ocean "NEAR BAY",
:house {:rooms {:total_rooms 7099.0, :total_bedrooms 1106.0},
:age 21.0},
:coord {:lat 37.86, :long -122.22}}
{:ocean "NEAR BAY",
:house {:rooms {:total_rooms 1467.0, :total_bedrooms 190.0},
:age 52.0},
:coord {:lat 37.85, :long -122.24}})
```
Spark ML example translated from [Spark's programming guide](https://spark.apache.org/docs/latest/ml-pipeline.html):
```clojure
(require '[zero-one.geni.core :as g])
(require '[zero-one.geni.ml :as ml])
(def training-set
(g/table->dataset
[[0 "a b c d e spark" 1.0]
[1 "b d" 0.0]
[2 "spark f g h" 1.0]
[3 "hadoop mapreduce" 0.0]]
[:id :text :label]))
(def pipeline
(ml/pipeline
(ml/tokenizer {:input-col :text
:output-col :words})
(ml/hashing-tf {:num-features 1000
:input-col :words
:output-col :features})
(ml/logistic-regression {:max-iter 10
:reg-param 0.001})))
(def model (ml/fit training-set pipeline))
(def test-set
(g/table->dataset
[[4 "spark i j k"]
[5 "l m n"]
[6 "spark hadoop spark"]
[7 "apache hadoop"]]
[:id :text]))
(-> test-set
(ml/transform model)
(g/select :id :text :probability :prediction)
g/show)
;; +---+------------------+----------------------------------------+----------+
;; |id |text |probability |prediction|
;; +---+------------------+----------------------------------------+----------+
;; |4 |spark i j k |[0.1596407738787411,0.8403592261212589] |1.0 |
;; |5 |l m n |[0.8378325685476612,0.16216743145233883]|0.0 |
;; |6 |spark hadoop spark|[0.0692663313297627,0.9307336686702373] |1.0 |
;; |7 |apache hadoop |[0.9821575333444208,0.01784246665557917]|0.0 |
;; +---+------------------+----------------------------------------+----------+
```
More detailed examples can be found [here](examples/README.md).
## Quick Start
### Install Geni
Install the `geni` script to `/usr/local/bin` with:
```bash
wget https://raw.githubusercontent.com/zero-one-group/geni/develop/scripts/geni
chmod a+x geni
sudo mv geni /usr/local/bin/
```
The `geni` command downloads the latest Geni uberjar to `~/.geni/geni-repl-uberjar.jar` and runs it with `java -jar`.
### Uberjar
Download the latest Geni REPL uberjar from the [release](https://github.com/zero-one-group/geni/releases) page. Run the uberjar as follows:
```bash
java -jar <path-to-geni-uberjar>
```
The uberjar app prints the default `SparkSession` instance, starts an nREPL server with an `.nrepl-port` file for easy text-editor connection, and steps into a Clojure REPL (REPL-y).
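Once inside the REPL, Geni's core namespace is available. A minimal sketch of a first interaction, using only functions that appear in the Basic Examples above:
```clojure
;; A minimal sanity check from the Geni REPL: build a tiny in-memory
;; dataset and print it.
(require '[zero-one.geni.core :as g])
(-> (g/table->dataset [[1 "a"] [2 "b"]] [:id :label])
    g/show)
```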
### Leiningen Template
Use [Leiningen](http://leiningen.org/) to create a [template](https://github.com/zero-one-group/geni-template) of a Geni project:
```bash
lein new geni
```
`cd` into the project directory and run `lein run`. The templated app runs a Spark ML example and then steps into a Clojure REPL (REPL-y) with an `.nrepl-port` file.
### Screencast Demos
## Installation
Add the following to your `project.clj` dependencies:
[![Clojars Project](https://clojars.org/zero.one/geni/latest-version.svg)](http://clojars.org/zero.one/geni)
You also need to add Spark as provided dependencies. For instance, include the following key-value pair in the `:profiles` map:
```clojure
:provided
{:dependencies [;; Spark
[org.apache.spark/spark-avro_2.12 "3.3.3"]
[org.apache.spark/spark-core_2.12 "3.3.3"]
[org.apache.spark/spark-hive_2.12 "3.3.3"]
[org.apache.spark/spark-mllib_2.12 "3.3.3"]
[org.apache.spark/spark-sql_2.12 "3.3.3"]
[org.apache.spark/spark-streaming_2.12 "3.3.3"]
; Arrow
[org.apache.arrow/arrow-memory-netty "4.0.0"]
[org.apache.arrow/arrow-memory-core "4.0.0"]
[org.apache.arrow/arrow-vector "4.0.0"
:exclusions [commons-codec com.fasterxml.jackson.core/jackson-databind]]
;; Databases
[mysql/mysql-connector-java "8.0.25"]
[org.postgresql/postgresql "42.2.20"]
[org.xerial/sqlite-jdbc "3.34.0"]
;; Optional: Spark XGBoost
[ml.dmlc/xgboost4j-spark_2.12 "1.2.0"]
[ml.dmlc/xgboost4j_2.12 "1.2.0"]]}
```
You may also need to install `libatlas3-base` and `libopenblas-base` to use a native BLAS, and `libgomp1` to train XGBoost4J models. When the optional dependencies are not present, the vars for the corresponding functions (such as `ml/xgboost-classifier`) are left unbound.
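A hedged sketch of guarding against the unbound case (the parameter key below is an illustrative assumption, not verified against the XGBoost API):
```clojure
;; Only construct an XGBoost stage when the optional jars are on the classpath;
;; otherwise the `ml/xgboost-classifier` var exists but is left unbound.
(require '[zero-one.geni.ml :as ml])
(when (bound? #'ml/xgboost-classifier)
  (ml/xgboost-classifier {:max-depth 2})) ; :max-depth is an assumed option key
```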
## License
Copyright 2020 Zero One Group.
Geni is licensed under Apache License v2.0, see [LICENSE](LICENSE).
## Mentions
Some parts of the project have been taken from or inspired by:
* [finagle-clojure](https://github.com/finagle/finagle-clojure) for Scala interop functions.
* [LispCast](https://lispcast.com/) for [exponential backoff](https://lispcast.com/exponential-backoff/).
* Reddit users [/u/borkdude](https://old.reddit.com/user/borkdude) and [/u/czan](https://old.reddit.com/user/czan) for [with-dynamic-import](src/zero_one/geni/utils.clj).
* StackOverflow user [whocaresanyway's answer](https://stackoverflow.com/questions/1696693/clojure-how-to-find-out-the-arity-of-function-at-runtime) for `arg-count`.
* [Julia Evans'](https://jvns.ca/) [Pandas Cookbook](https://github.com/jvns/pandas-cookbook) for its syllabus.
* Reddit user [/u/joinr](https://old.reddit.com/user/joinr) for helping with [unit-testing the REPL](test/zero_one/geni/main_test.clj).
* [Sparkling](https://github.com/gorillalabs/sparkling), [sparkplug](https://github.com/amperity/sparkplug) and [Gabriel Borges](https://github.com/borgesgabriel) for helping with the RDD function serialisation.
* [Chris Nuernberger](https://github.com/cnuernber) and [Tomasz Sulej](https://github.com/tsulej) for helping with [tech.ml.dataset](https://github.com/techascent/tech.ml.dataset) and [tablecloth](https://github.com/scicloj/tablecloth).
* [Ubuntu](https://ubuntu.com/community/code-of-conduct), [Django](https://www.djangoproject.com/conduct/) and [Conjure](https://github.com/Olical/conjure/blob/master/.github/CODE_OF_CONDUCT.md) for their codes of conduct.
* [FZF](https://github.com/junegunn/fzf) for their issue template.