https://github.com/cldellow/manu

Mostly archived, not updated.

Host: GitHub
URL: https://github.com/cldellow/manu
Owner: cldellow
License: epl-1.0
Created: 2018-01-08T21:28:21.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2018-08-12T15:41:57.000Z (almost 8 years ago)
Last Synced: 2025-08-23T11:05:14.083Z (10 months ago)
Language: Java
Size: 1.02 MB
Stars: 3
Watchers: 1
Forks: 1
Open Issues: 4
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

README

# Manu: "Mostly archived, not updated"

[![Build Status](https://travis-ci.org/cldellow/manu.svg?branch=master)](https://travis-ci.org/cldellow/manu)
[![codecov](https://codecov.io/gh/cldellow/manu/branch/master/graph/badge.svg)](https://codecov.io/gh/cldellow/manu)
[![Maven Central](https://img.shields.io/maven-central/v/com.cldellow/manu.svg)](https://mvnrepository.com/artifact/com.cldellow/manu)

A time series storage format for integers and floats, using efficient delta encodings from [FastPFOR](https://github.com/lemire/JavaFastPFOR).

Examples: pageviews by article in Wikipedia, stock open/close/high/low prices, weather temperatures.
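As a rough illustration of why delta encoding suits series like these, the sketch below stores each value as its difference from the previous one. Neighbouring datapoints tend to be close, so the deltas are small numbers that an integer codec can pack into few bits. This is a hand-rolled example, not the FastPFOR codec and not manu-format's on-disk layout; the class name `DeltaSketch` and the sample pageview counts are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of manu-format or JavaFastPFOR.
final class DeltaSketch {
    // Replace each value with its difference from the previous value.
    // The first value is stored as a delta from zero.
    static int[] encode(int[] values) {
        int[] deltas = new int[values.length];
        int previous = 0;
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - previous;
            previous = values[i];
        }
        return deltas;
    }

    // Rebuild the original values by keeping a running sum of the deltas.
    static int[] decode(int[] deltas) {
        int[] values = new int[deltas.length];
        int running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            values[i] = running;
        }
        return values;
    }

    public static void main(String[] args) {
        int[] pageviews = {1000, 1010, 1005, 1020}; // made-up sample data
        int[] deltas = encode(pageviews);
        System.out.println(Arrays.toString(deltas)); // [1000, 10, -5, 15]
        System.out.println(Arrays.equals(decode(deltas), pageviews)); // true
    }
}
```

After the first entry, every delta here fits in a handful of bits, whereas the raw values each need at least ten.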

## Components
- [manu-format](format), a library for maintaining the data on disk
- [manu-cli](cli), a command-line tool for ingesting data into the format
- [manu-serve](serve), a web server to expose the data over REST

## Design criteria
### Priorities
- Cheap
  - I'm doing this to drive a hobby project; my dream would be to host a variety of datasets for $10/month.
  - A Fermi estimate suggests Wikipedia pageviews has 100B datapoints over the last 10 years. This implies that storage costs will dominate.
- Doesn’t need to be always-on
  - This sort of follows from cheap -- the ability to load subsets of data, or to run on spot instances will be a useful tool to cut costs.
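
To make the storage argument concrete, here is a back-of-the-envelope calculation in Java. The 100B datapoint count comes from the estimate above; the bytes-per-datapoint figures (4 for a raw 32-bit integer, 1 for a hypothetical compressed encoding) are assumptions chosen for illustration, not measurements of manu.

```java
// Hypothetical illustration only; the per-datapoint sizes are assumptions.
final class StorageEstimate {
    // Size in gigabytes (10^9 bytes) for a given number of datapoints.
    static double gigabytes(long datapoints, double bytesPerDatapoint) {
        return datapoints * bytesPerDatapoint / 1e9;
    }

    public static void main(String[] args) {
        long datapoints = 100_000_000_000L; // 100B, from the Fermi estimate

        // Raw 32-bit integers: 4 bytes each.
        System.out.println(gigabytes(datapoints, 4.0)); // 400.0

        // Assumed compressed size of about 1 byte per datapoint.
        System.out.println(gigabytes(datapoints, 1.0)); // 100.0
    }
}
```

Even under the optimistic assumption, the dataset runs to hundreds of gigabytes or close to it, which is why the bytes spent per datapoint matter so much on a small budget.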

### Non-priorities
- Concurrent / fast writes
  - These can happen offline.
- Fast reads
  - The Pareto principle will likely apply to queries: 1% of keys will get 99% of reads. We can use Varnish or similar to cache at the application level.

### Assumptions
- Dense datasets
  - Keys: if we see a key once, we expect to see it again.
  - Values: if key X has a datapoint at T1, we expect most other keys will as well.
- Correlated values
  - Value for key X at T1 is likely related to value at T2.
- Some datasets can be lossy
  - Wikipedia pageviews, e.g., are likely insensitive to precision so long as the trend is generally correct.
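
One way to picture the lossy assumption is rounding each count to a fixed number of significant digits, which keeps the trend while discarding low-order detail. This README does not say how manu implements lossiness, so the sketch below is an assumed scheme; `LossySketch` and `quantize` are hypothetical names.

```java
// Hypothetical illustration only; not manu's actual lossy encoding.
final class LossySketch {
    // Round a non-negative count to `digits` significant decimal digits.
    static long quantize(long value, int digits) {
        if (value < 0 || digits < 1) {
            throw new IllegalArgumentException("expects value >= 0 and digits >= 1");
        }
        // scale is the power of ten covering the digits we discard.
        long scale = 1;
        int length = Long.toString(value).length();
        for (int i = digits; i < length; i++) {
            scale *= 10;
        }
        // Integer round-half-up to the nearest multiple of scale.
        return (value + scale / 2) / scale * scale;
    }

    public static void main(String[] args) {
        System.out.println(quantize(12345, 2)); // 12000
        System.out.println(quantize(987, 2));   // 990
        System.out.println(quantize(42, 2));    // 42
    }
}
```

With two significant digits the relative error stays within about 5%, which is usually invisible on a trend chart.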

## Obligatory

![Manu](https://www.smbc-comics.com/comics/1429540032-20150420.png)

Credit: [Our Greatest Asset](https://www.smbc-comics.com/comic/our-greatest-asset), Saturday Morning Breakfast Cereal
