https://github.com/netzwerg/paleo

Immutable Java 8 data frames with typed columns (including primitives)

Host: GitHub
URL: https://github.com/netzwerg/paleo
Owner: netzwerg
License: apache-2.0
Created: 2015-09-08T07:54:06.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2021-12-14T15:19:07.000Z (over 4 years ago)
Last Synced: 2025-03-30T06:01:58.652Z (12 months ago)
Language: Java
Homepage:
Size: 1.07 MB
Stars: 71
Watchers: 8
Forks: 16
Open Issues: 6
Metadata Files:
- Readme: README.adoc
- License: LICENSE

# NO LONGER MAINTAINED – USE AT YOUR OWN RISK!

# Paleo image:https://travis-ci.org/netzwerg/paleo.svg?branch=master["Build Status", link="https://travis-ci.org/netzwerg/paleo"]

:latest-release-version: 0.14.0

Immutable Java 8 data frames with typed columns.

A data frame is composed of `0..n` named columns, which all contain the same number of row values. Each column has a fixed data type, which allows for type-safe value access. The following column types are supported out-of-the-box:

* **Int**: Primitive `int` values

* **Long**: Primitive `long` values

* **Double**: Primitive `double` values

* **Boolean**: Primitive `boolean` values

* **String**: `java.lang.String` values

* **Timestamp**: `java.time.Instant` values

* **Category**: Categorical `String` values (aka factors)

Columns can be created via simple factory methods, through a fluent builder API, or from text files.
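
To illustrate why a dedicated Category type can pay off next to plain String, here is a minimal sketch in plain JDK Java that dictionary-encodes repeated strings as small `int` codes. The class `CategoryEncodingSketch` and its methods are hypothetical names chosen for this example; they are not part of Paleo, and Paleo's actual storage layout may differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of factor-style storage, not Paleo's implementation.
final class CategoryEncodingSketch {

    private final List<String> categories = new ArrayList<>(); // distinct values, in first-seen order
    private final int[] codes;                                 // one int code per row

    CategoryEncodingSketch(String... values) {
        Map<String, Integer> codeByCategory = new HashMap<>();
        codes = new int[values.length];
        for (int row = 0; row < values.length; row++) {
            Integer code = codeByCategory.get(values[row]);
            if (code == null) {
                // First occurrence: register a new category
                code = categories.size();
                codeByCategory.put(values[row], code);
                categories.add(values[row]);
            }
            codes[row] = code;
        }
    }

    // Decode the row's int code back to its category label
    String getValueAt(int rowIndex) {
        return categories.get(codes[rowIndex]);
    }

    List<String> getCategories() {
        return categories;
    }
}
```

With many rows and few distinct values, each row costs one `int` instead of one string reference per repeated label, and the set of categories is available without scanning the rows.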

# Hello Paleo

The `paleo-core` module provides all classes to identify, create, and structure typed columns: 

[source,java]

----

// Type-safe column identifiers

final StringColumnId NAME = StringColumnId.of("Name");

final CategoryColumnId COLOR = CategoryColumnId.of("Color");

final DoubleColumnId SERVING_SIZE = DoubleColumnId.of("Serving Size (g)");

// Convenient column creation

StringColumn nameColumn = StringColumn.ofAll(NAME, "Banana", "Blueberry", "Lemon", "Apple");

CategoryColumn colorColumn = CategoryColumn.ofAll(COLOR, "Yellow", "Blue", "Yellow", "Green");

DoubleColumn servingSizeColumn = DoubleColumn.ofAll(SERVING_SIZE, 118, 148, 83, 182);

// Grouping columns into a data frame

DataFrame dataFrame = DataFrame.ofAll(nameColumn, colorColumn, servingSizeColumn);

// Typed random access to individual values (based on rowIndex / columnId)

String lemon = dataFrame.getValueAt(2, NAME);

double appleServingSize = dataFrame.getValueAt(3, SERVING_SIZE);

// Typed stream-based access to all values

DoubleStream servingSizes = servingSizeColumn.valueStream();

double maxServingSize = servingSizes.summaryStatistics().getMax();

// Smart column implementations

Set<String> colors = colorColumn.getCategories();

----
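
The typed `getValueAt(rowIndex, columnId)` calls above return a `String` or a `double` without casts at the call site. One way this kind of API can be built is to let the identifier carry the value type as a generic parameter. The following is a rough sketch under that assumption; `TypedId` and `TinyFrame` are hypothetical names, the frame is mutable and uses boxed values for brevity, and none of this is Paleo's actual implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of typed identifiers, not Paleo's classes.
final class TypedId<T> {
    final String name;
    final Class<T> valueType;

    TypedId(String name, Class<T> valueType) {
        this.name = name;
        this.valueType = valueType;
    }
}

final class TinyFrame {
    // Columns are stored untyped; the identifier restores the type on access
    private final Map<String, List<?>> columnsByName = new HashMap<>();

    // Mutable for brevity; an immutable frame would return a new instance
    <T> TinyFrame with(TypedId<T> id, List<T> values) {
        columnsByName.put(id.name, values);
        return this;
    }

    // The identifier's type parameter decides the static return type
    <T> T getValueAt(int rowIndex, TypedId<T> id) {
        Object value = columnsByName.get(id.name).get(rowIndex);
        return id.valueType.cast(value);
    }
}
```

Because the compiler infers `T` from the identifier, passing a `TypedId<String>` yields a `String`, and assigning the result to an incompatible type fails at compile time rather than at run time.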

# Parsing From Text / File

The `paleo-io` module parses data frames from tab-delimited or comma-separated text representations. The structure of the data frame (i.e. the names and types of its columns) can be defined in one of two ways:

## Header Rows

In its simplest format, the tab-delimited text representation directly contains column names and types in a header. The first row specifies the column names and the second row specifies the column types; the actual data starts on the third row:

----

1 Name    Color

2 String  Category

3 Banana  Yellow

...

n Apple   Green

----

The contents can then be parsed via `Parser.tsv(Reader in)` or `Parser.csv(Reader in)`, for example:

[source,java]

----

final String EXAMPLE =

            "Name\tColor\tServing Size (g)\n" +

            "String\tCategory\tDouble\n" +

            "Banana\tYellow\t118\n" +

            "Blueberry\tBlue\t148\n" +

            "Lemon\tYellow\t83\n" +

            "Apple\tGreen\t182";

DataFrame dataFrame = Parser.tsv(new StringReader(EXAMPLE));

----
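
To make the two-header-row layout concrete, here is a minimal plain-JDK sketch that splits such a text into column names, column types, and data rows. `HeaderRowsSketch` is a hypothetical class written for this illustration; it is not how `Parser.tsv` is implemented and it does no type conversion.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the header-row layout, not Paleo's parser.
final class HeaderRowsSketch {

    final String[] names;      // first row: column names
    final String[] types;      // second row: column types
    final List<String[]> rows; // third row onwards: raw cell values

    HeaderRowsSketch(String tsv) {
        String[] lines = tsv.split("\n");
        if (lines.length < 2) {
            throw new IllegalArgumentException("Expected a name row and a type row");
        }
        names = lines[0].split("\t");
        types = lines[1].split("\t");
        if (names.length != types.length) {
            throw new IllegalArgumentException("Every column needs a name and a type");
        }
        rows = new ArrayList<>();
        for (int i = 2; i < lines.length; i++) {
            // Limit -1 keeps trailing empty cells
            rows.add(lines[i].split("\t", -1));
        }
    }
}
```

A real parser would additionally map each type name to a column builder and convert the raw cells accordingly.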

## External JSON Schema

Generally it is advisable to separate the structural information from the actual data. Paleo therefore supports the definition of an external JSON schema. The format is inspired by the http://dataprotocols.org/json-table-schema[JSON Table Schema]:

[source,json]

----

{

  "title": "Example Schema",

  "dataFileName": "data.txt",

  "charsetName": "ISO-8859-1", // <1>

  "fields": [

    {

      "name": "Name",

      "type": "String"

    },

    {

      "name": "Color",

      "type": "Category"

    },

    {

      "name": "Serving Size",

      "type": "Double",

      "metaData": { "unit": "g" }

    },

    {

      "name": "Exemplary Date",

      "type": "Timestamp",

      "format": "yyyyMMddHHmmss"

    }

  ]

}

----

<1> Optionally specify an encoding
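
The `format` of the Timestamp field in the schema above looks like a `java.time` pattern. As a sketch of how such a pattern could turn a text cell into a `java.time.Instant`, consider the following. `TimestampFormatSketch` is a hypothetical helper, and interpreting the zone-less text as UTC is an assumption made here; the source does not state which zone Paleo applies.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration, not Paleo's implementation.
final class TimestampFormatSketch {

    static Instant parse(String text, String pattern) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern);
        // The pattern carries no zone, so parse to a local date-time first
        LocalDateTime dateTime = LocalDateTime.parse(text, formatter);
        // Assumption: treat the value as UTC
        return dateTime.toInstant(ZoneOffset.UTC);
    }
}
```

For example, `TimestampFormatSketch.parse("20150908075406", "yyyyMMddHHmmss")` yields the instant `2015-09-08T07:54:06Z` under the UTC assumption.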

Dedicated parsing methods allow you to first parse the schema from JSON and then use it to create a `DataFrame`. A given base directory is used to load the actual data (i.e. to resolve the location of the configured `dataFileName`):

[source,java]

----

Schema schema = Schema.parseJson(new StringReader(EXAMPLE_SCHEMA));

DataFrame dataFrame = Parser.tsv(schema, baseDir);

----
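
The role of the base directory can be sketched with plain JDK calls: the schema only names the data file, and the directory supplies its location. `SchemaLocationSketch` below is a hypothetical helper, the use of `java.nio.file.Path` for the base directory is an assumption, and so is the UTF-8 fallback when no `charsetName` is given; Paleo's actual behavior may differ.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration, not Paleo's implementation.
final class SchemaLocationSketch {

    // Resolve the schema's file name against the base directory
    static Path resolveDataFile(Path baseDir, String dataFileName) {
        return baseDir.resolve(dataFileName);
    }

    // Assumption: fall back to UTF-8 when the schema omits "charsetName"
    static Charset charsetOrDefault(String charsetName) {
        return charsetName == null ? StandardCharsets.UTF_8 : Charset.forName(charsetName);
    }

    // Open the data file with the resolved location and encoding
    static BufferedReader openData(Path baseDir, String dataFileName, String charsetName) {
        try {
            return Files.newBufferedReader(
                    resolveDataFile(baseDir, dataFileName), charsetOrDefault(charsetName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping the file name relative in the schema means the schema and its data file can be moved together without editing either.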

## Working With Parsed Data Frames

Once a `DataFrame` instance has been parsed, its data can be accessed through a type-safe API:

[source,java]

----

final String EXAMPLE =

            "Name\tColor\tServing Size (g)\n" +

            "String\tCategory\tDouble\n" +

            "Banana\tYellow\t118\n" +

            "Blueberry\tBlue\t148\n" +

            "Lemon\tYellow\t83\n" +

            "Apple\tGreen\t182";

DataFrame dataFrame = Parser.tsv(new StringReader(EXAMPLE));

// Lookup typed identifiers by column index

final StringColumnId NAME = dataFrame.getColumnId(0, ColumnType.STRING);

final CategoryColumnId COLOR = dataFrame.getColumnId(1, ColumnType.CATEGORY);

final DoubleColumnId SERVING_SIZE = dataFrame.getColumnId(2, ColumnType.DOUBLE);

// Use identifier to access columns & values

StringColumn nameColumn = dataFrame.getColumn(NAME);

IndexedSeq<String> nameValues = nameColumn.getValues();

// ... or access individual values via row index / column id 

String yellow = dataFrame.getValueAt(2, COLOR);

----

# Usage

All modules are available via https://bintray.com/netzwerg/maven/paleo/view[Bintray/JCenter].

## Repository Configuration

Gradle:

[source,groovy]

----

repositories {

    jcenter()

}

----

Maven `settings.xml`:

[source,xml]

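----
<repository>
    <snapshots>
      <enabled>false</enabled>
    </snapshots>
    <id>central</id>
    <name>bintray</name>
    <url>http://jcenter.bintray.com</url>
</repository>
----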

## Using the `paleo-core` module

Gradle:

[source,groovy]

[subs="attributes"]

----

compile 'ch.netzwerg:paleo-core:{latest-release-version}'

----

Maven:

[source,xml]

[subs="specialcharacters,attributes"]

----

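<dependency>
    <groupId>ch.netzwerg</groupId>
    <artifactId>paleo-core</artifactId>
    <version>{latest-release-version}</version>
    <type>jar</type>
</dependency>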

----

## Using the `paleo-io` module

Optional (requires `paleo-core`)

Gradle:

[source,groovy]

[subs="attributes"]

----

compile 'ch.netzwerg:paleo-io:{latest-release-version}'

----

Maven:

[source,xml]

[subs="specialcharacters,attributes"]

----

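<dependency>
    <groupId>ch.netzwerg</groupId>
    <artifactId>paleo-io</artifactId>
    <version>{latest-release-version}</version>
    <type>jar</type>
</dependency>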

----

# Vavr

Paleo makes extensive use of the http://www.vavr.io/[Vavr library]. Vavr provides awesome collection classes which offer functionality way beyond the standard JDK. Working with the Vavr classes is highly recommended, but it is always possible to back out and convert to JDK standards (e.g. with `toJavaList()`).

# Factory-Methods vs. Builders

Paleo tries to make the best compromise between parsing speed, index-based value lookup, and memory usage. That's why it offers two ways to create columns: Static factory methods allow for convenient construction if all values are already available. Individual column builders should be used if columns are constructed via successive value addition. Please be aware that the builders are not thread-safe.
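
The trade-off between the two creation styles can be sketched with a primitive-array-backed container in plain JDK Java. `DoubleValuesSketch` and its nested `Builder` are hypothetical names for this illustration and do not mirror Paleo's classes; the growth strategy shown is an assumption.

```java
import java.util.Arrays;

// Hypothetical illustration of factory vs. builder, not Paleo's implementation.
final class DoubleValuesSketch {

    private final double[] values; // never modified after construction

    private DoubleValuesSketch(double[] values) {
        this.values = values;
    }

    // Factory: all values are known up front, so one defensive copy is enough
    static DoubleValuesSketch ofAll(double... values) {
        return new DoubleValuesSketch(values.clone());
    }

    static Builder builder() {
        return new Builder();
    }

    double getValueAt(int rowIndex) {
        return values[rowIndex];
    }

    int getRowCount() {
        return values.length;
    }

    // Builder: values arrive one by one, e.g. while parsing a file.
    // The unsynchronized buffer and size make it unsafe to share across threads.
    static final class Builder {
        private double[] buffer = new double[8];
        private int size = 0;

        Builder add(double value) {
            if (size == buffer.length) {
                buffer = Arrays.copyOf(buffer, size * 2); // grow geometrically
            }
            buffer[size++] = value;
            return this;
        }

        DoubleValuesSketch build() {
            // Trim to the exact size so the finished column wastes no memory
            return new DoubleValuesSketch(Arrays.copyOf(buffer, size));
        }
    }
}
```

The factory avoids any intermediate buffer, while the builder pays for occasional re-allocation in exchange for not needing to know the row count in advance.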

# Why The Name?

The backing data structures are all about **raw** values and **primitive** types, which somehow reminded me of the paleo diet.

# Contributions

Pull requests are very welcome.

Please note that by submitting a pull request, you agree to license your contribution under the "Apache License Version 2.0".
