https://github.com/milpol/gorilla4j

Implementation of time series compression method based on the Facebook Gorilla paper
https://github.com/milpol/gorilla4j

gorilla java timeseries timeseries-data

Last synced: 6 months ago
JSON representation

Implementation of time series compression method based on the Facebook Gorilla paper

Host: GitHub
URL: https://github.com/milpol/gorilla4j
Owner: milpol
License: mit
Created: 2019-06-09T17:20:41.000Z (about 7 years ago)
Default Branch: master
Last Pushed: 2021-01-06T22:43:14.000Z (over 5 years ago)
Last Synced: 2025-08-04T13:38:26.350Z (11 months ago)
Topics: gorilla, java, timeseries, timeseries-data
Language: Java
Homepage:
Size: 47.9 KB
Stars: 13
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          ![Travis CI](https://api.travis-ci.org/milpol/gorilla4j.svg) ![Maven Central](https://img.shields.io/maven-central/v/com.jarslab.ts/gorilla4j.svg)

# What is all about

It is all about storing data in a efficient way. 

Stop! *two* things. First: it is not about any data, but a very special kind: time series. 

Sounds scary but all in all it is just a value (numerical) in time (epoch). 

Second: *but they said that storage is cheap!* Well, so the bubble gum, it is just a buck. Million packs do the million bucks though.

Also, what *they* don't say that we store enormous load of data which we write once and read ~~once~~ never.

## Give me the numbers

As mentioned, we are considering here a time series data (value in time). Let's say we want to store stock price valuation of single company, single day, sampled every 10 second.

8 hours gives 2880 samples, sample is a time (Java long, 8 bytes) and a value (Java double, 8 bytes). Math is simple:

`8 * 60 * 6 * 16 = 46080B = 45KB` 

Phew. That's nothing you'll say. Sure, the bubble gum is just a buck, blah, blah...

How about Gorilla format, can it do any better?

From ad-hoc test:

`~8465B ~= 8,3KB`

(We could compare that to JSON format... but it would not make any sense.)

Just to be clear: we are talking about exact same data, no rounding or data losses, but...

Well, in wise algorithms there is almost always *but*, the one here is how the data is distributed.

# But how?

All answers and technical guts can be found in great paper from the Facebook engineers [Gorilla: A Fast, Scalable, In-Memory Time Series Database](http://www.vldb.org/pvldb/vol8/p1816-teller.pdf)

# Usage

## Maven coords

```xml

  com.jarslab.ts

  gorilla4j

  0.4

```

## Examples

### Building basic Gorilla block

```java

TSG tsg = new TSG(1546300800, new OutBitSet());

tsg.put(1546300800, 4.0);

tsg.put(1546300860, 4.1);

tsg.put(1546300920, 4.2);

tsg.close(); // at this point no more points are accepted

```

### Dump block and re-create

```java

TSG tsg = new TSG(1546300800, new OutBitSet());

tsg.put(1546300800, 4.2);

byte[] tsgBytes = tsg.toBytes();

TSG recreatedTsg = TSG.fromBytes(tsgBytes); // block is still open and can accept points

```

### Extract iterator from block

```java

TSG tsg = new TSG(1546300800, new OutBitSet());

tsg.put(1546300800, 4.2);

Iterator tsgIterator = tsg.toIterator(); // iterator works on copied bytes, tsg accepts points

```

### Open block in iterator

```java

TSG tsg = new TSG(1546300800, new OutBitSet());

tsg.put(1546300800, 4.2);

tsg.close();

byte[] tsgBytes = tsg.getDataBytes();

Iterator tsgIterator = new TSGIterator(new InBitSet(tsgBytes));

```

# Other Java implementation?

Please check excellent [Michael Burman](https://github.com/burmanm) implementation: [gorilla-tsc](https://github.com/burmanm/gorilla-tsc). 

# Changelog

## 0.4

* Bump test libs

## 0.2

* Use `long` for time values (start and current).

* Move `DataPoint` to abstraction

* Add JavaDocs

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/milpol/gorilla4j

Awesome Lists containing this project

README