https://github.com/jeromer/klines

Fetch, normalise, validate, and aggregate Binance OHLCV klines into clean Parquet datasets.
Host: GitHub
URL: https://github.com/jeromer/klines
Owner: jeromer
Created: 2026-05-03T17:29:53.000Z (about 2 months ago)
Default Branch: main
Last Pushed: 2026-05-13T17:15:01.000Z (about 2 months ago)
Last Synced: 2026-05-13T19:19:09.205Z (about 2 months ago)
Language: Python
Size: 79.1 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# klines

Fetch, normalise, validate, and aggregate Binance OHLCV klines into clean Parquet datasets.

## What it does

Two ready-to-run scripts and five library modules, no trading logic, no config globals:

| Type | Name | What it provides |
|---|---|---|
| script | `bin/fetch_data.py` | Download klines from Binance and save as Parquet |
| script | `bin/build_datasets.py` | Validate and aggregate H1 or M15 data into H1/H4/D1/W1/M1/Q1 |
| lib | `download` | Async batch HTTP fetch from Binance REST API |
| lib | `normalise` | Convert raw Binance JSON rows to typed OHLCV DataFrame |
| lib | `store` | Save and load DataFrames as Parquet files |
| lib | `validate` | Deduplicate, gap-fill, and sanity-check H1 or M15 data |
| lib | `aggregate` | Resample M15→H1 and H1→H4/D1/W1/M1/Q1 |

## Install

```bash
pip install "klines @ git+https://github.com/jeromer/klines.git"
```

Or with [uv](https://docs.astral.sh/uv/):

```bash
uv add "klines @ git+https://github.com/jeromer/klines.git"
```

## Scripts

The scripts in `bin/` are executable and self-contained. Two ways to run them:

**From a clone** (no install required, needs deps on `$PYTHONPATH`):

```bash
git clone https://github.com/jeromer/klines
cd klines
uv sync
./bin/fetch_data.py --symbols BTCUSDT,ETHUSDT
./bin/build_datasets.py --symbols BTCUSDT,ETHUSDT
```

**As installed CLI commands** (after `pip install` / `uv add`):

```bash
binance-fetch --symbols BTCUSDT,ETHUSDT
binance-build --symbols BTCUSDT,ETHUSDT
```

Both forms accept identical flags.

### `bin/fetch_data.py` — download klines

```
./bin/fetch_data.py [--symbols SYMBOL[,SYMBOL...]]
                    [--market spot|futures]
                    [--interval m15|h1|h4|d]
                    [--start YYYY-MM-DD]
                    [--end YYYY-MM-DD]
                    [--output-dir DIR]
                    [--workers N]
                    [--progress|--no-progress]
```

Downloads H1 klines (default) for one or more symbols. Resumes from the last stored timestamp if a Parquet file already exists.

```bash
# fetch BTC + ETH hourly from 2020 to today → data/raw/
./bin/fetch_data.py --symbols BTCUSDT,ETHUSDT --start 2020-01-01

# fetch 15m futures data into a custom dir
./bin/fetch_data.py --symbols BTCUSDT --market futures --interval m15 --output-dir /tmp/raw
```

Defaults: `--market spot`, `--interval h1`, `--start 2017-01-01`, `--output-dir ./data/raw`, `--workers `.
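
The resume behaviour described for `fetch_data.py` comes down to choosing a start timestamp per symbol. The following is a minimal stdlib sketch, assuming the same "last stored timestamp plus one millisecond" convention used in the library example under Option B. `resume_start_ms` and `to_ms` are hypothetical helpers for illustration, not part of `klines`.

```python
from datetime import datetime, timezone
from typing import Optional


def to_ms(ts: datetime) -> int:
    """Convert an aware datetime to Unix milliseconds."""
    return int(ts.timestamp() * 1000)


def resume_start_ms(last_stored: Optional[datetime], default_start: str) -> int:
    """Hypothetical helper: pick the first millisecond to request.

    last_stored   -- open time of the newest candle already on disk, or None
    default_start -- YYYY-MM-DD string used when nothing is stored yet
    """
    if last_stored is not None:
        # One millisecond past the last stored candle, so it is not re-fetched.
        return to_ms(last_stored) + 1
    start = datetime.strptime(default_start, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return to_ms(start)


# No file yet: start from the default date.
fresh = resume_start_ms(None, "2017-01-01")

# File exists and ends at 2024-01-01 12:00 UTC: start just after it.
resumed = resume_start_ms(datetime(2024, 1, 1, 12, tzinfo=timezone.utc), "2017-01-01")
```

Whether the script itself uses exactly this offset is not stated in this README, so treat the `+ 1` as an assumption.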

### `bin/build_datasets.py` — validate and aggregate

```
./bin/build_datasets.py [--symbols SYMBOL[,SYMBOL...]]
                        [--source-interval h1|m15]
                        [--raw-dir DIR]
                        [--output-dir DIR]
```

Reads `{symbol}_{SOURCE}.parquet` from `--raw-dir`, validates it, and writes H1/H4/D1/W1/M1/Q1 Parquet files to `--output-dir`. With `--source-interval m15`, H1 is first derived from M15, and the higher timeframes are then aggregated from H1.

```bash
# build from H1 source (default)
./bin/build_datasets.py --symbols BTCUSDT,ETHUSDT

# build from M15 source: derives H1, H4, D1, W1, M1, Q1
./bin/build_datasets.py --symbols BTCUSDT,ETHUSDT --source-interval m15

# read from and write to custom directories
./bin/build_datasets.py --symbols BTCUSDT --raw-dir /tmp/raw --output-dir /tmp/processed
```

Defaults: `--source-interval h1`, `--raw-dir ./data/raw`, `--output-dir ./data/processed`.
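
To make the M15-to-H1 step concrete, here is a pure-Python sketch of the usual OHLCV aggregation rule: first open, highest high, lowest low, last close, summed volume. The `Bar` dataclass and `aggregate_to_h1` are hypothetical names for illustration. The library works on DataFrames and its actual implementation is not shown in this README.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Bar:
    open_time: datetime  # UTC open time of the candle
    open: float
    high: float
    low: float
    close: float
    volume: float


def aggregate_to_h1(m15_bars: list[Bar]) -> list[Bar]:
    """Group 15-minute bars into hourly bars (illustrative only)."""
    buckets: dict[datetime, list[Bar]] = {}
    for bar in sorted(m15_bars, key=lambda b: b.open_time):
        # Truncate the open time to the start of its hour.
        hour = bar.open_time.replace(minute=0, second=0, microsecond=0)
        buckets.setdefault(hour, []).append(bar)

    hourly = []
    for hour, bars in buckets.items():  # dicts preserve insertion order
        hourly.append(Bar(
            open_time=hour,
            open=bars[0].open,                   # first open in the hour
            high=max(b.high for b in bars),      # highest high
            low=min(b.low for b in bars),        # lowest low
            close=bars[-1].close,                # last close in the hour
            volume=sum(b.volume for b in bars),  # total volume
        ))
    return hourly


def at(minute: int) -> datetime:
    return datetime(2024, 1, 1, 0, minute, tzinfo=timezone.utc)


m15 = [
    Bar(at(0), 100.0, 105.0, 99.0, 104.0, 1.0),
    Bar(at(15), 104.0, 110.0, 103.0, 108.0, 2.0),
    Bar(at(30), 108.0, 109.0, 101.0, 102.0, 3.0),
    Bar(at(45), 102.0, 106.0, 100.0, 105.0, 4.0),
]
h1 = aggregate_to_h1(m15)
```

The same rule applies when building H4 or D1 from H1; only the bucket boundary changes.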

### Full pipeline

```bash
# H1 source
./bin/fetch_data.py --symbols BTCUSDT,ETHUSDT && ./bin/build_datasets.py --symbols BTCUSDT,ETHUSDT

# M15 source: higher resolution, derives H1 and all higher timeframes
./bin/fetch_data.py --symbols BTCUSDT,ETHUSDT --interval m15 && \
./bin/build_datasets.py --symbols BTCUSDT,ETHUSDT --source-interval m15
```

## Embedding

Use this when your project has its own config and wants to call the pipeline programmatically rather than shelling out to the scripts.

### Option A — call `main()` with your defaults

Both scripts expose `main(defaults={...})`. Keys in `defaults` set argument defaults; any CLI flag passed at runtime still overrides them. Your project never needs to touch `sys.argv`.

```python
from bin.fetch_data import main as fetch_main
from bin.build_datasets import main as build_main

SYMBOLS = ["BTCUSDT", "ETHUSDT", "SOLUSDT"]
RAW_DIR = "/data/raw"
PROCESSED_DIR = "/data/processed"

# equivalent to: ./bin/fetch_data.py --symbols BTCUSDT,ETHUSDT,SOLUSDT --output-dir /data/raw
fetch_main(defaults={
    "symbols": SYMBOLS,
    "start": "2017-01-01",
    "output_dir": RAW_DIR,
})

# equivalent to: ./bin/build_datasets.py --symbols ... --raw-dir /data/raw --output-dir /data/processed
build_main(defaults={
    "symbols": SYMBOLS,
    "raw_dir": RAW_DIR,
    "output_dir": PROCESSED_DIR,
})
```
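
For readers curious how a `main(defaults={...})` entry point can let CLI flags override caller-supplied defaults, one common approach is `argparse.ArgumentParser.set_defaults`. The sketch below is an assumption about the pattern, not the code in `bin/`. The parser contents and the extra `argv` parameter are hypothetical and exist only so the example can be run without touching `sys.argv`.

```python
import argparse
from typing import Any, Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="illustrative fetch CLI")
    # Comma-separated symbols on the command line become a list.
    parser.add_argument("--symbols", type=lambda s: s.split(","), default=[])
    parser.add_argument("--start", default="2017-01-01")
    parser.add_argument("--output-dir", dest="output_dir", default="./data/raw")
    return parser


def main(defaults: Optional[dict[str, Any]] = None,
         argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = build_parser()
    if defaults:
        # Caller-supplied values replace the built-in defaults ...
        parser.set_defaults(**defaults)
    # ... but anything passed explicitly on the command line still wins.
    return parser.parse_args(argv)


# Embedded use: no flags, so the caller's defaults apply.
embedded = main(defaults={"symbols": ["BTCUSDT"], "output_dir": "/data/raw"}, argv=[])

# A flag given at runtime overrides the caller's default.
overridden = main(defaults={"symbols": ["BTCUSDT"]}, argv=["--symbols", "ETHUSDT,SOLUSDT"])
```

With this pattern the keys in `defaults` must match the argument destinations (`output_dir`, not `output-dir`), which agrees with the key names used in the example above.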

### Option B — call library functions directly

Use this when you need finer control: custom progress reporting, in-memory pipelines, partial steps, or integration with an async event loop.

```python
import asyncio
from pathlib import Path

import pandas as pd

from klines.download import KlineRequest, fetch_all
from klines.normalise import normalise_klines
from klines.store import load_parquet, save_parquet
from klines.validate import validate_h1
from klines.aggregate import aggregate_h4, aggregate_daily

RAW_DIR = Path("/data/raw")
PROCESSED_DIR = Path("/data/processed")
SYMBOLS = ["BTCUSDT", "ETHUSDT"]
START = "2017-01-01"


async def fetch(symbols: list[str]) -> None:
    end_ms = int(pd.Timestamp.now(tz="UTC").timestamp() * 1000)
    requests = []
    for symbol in symbols:
        path = RAW_DIR / f"{symbol}_H1.parquet"
        if path.exists():
            start_ms = int(load_parquet(path).index[-1].timestamp() * 1000) + 1
        else:
            start_ms = int(pd.Timestamp(START, tz="UTC").timestamp() * 1000)
        requests.append(KlineRequest(symbol, "1h", start_ms, end_ms))

    raw = await fetch_all(requests, max_workers=4)

    for symbol, raw_df in raw.items():
        new_df = normalise_klines(raw_df)
        path = RAW_DIR / f"{symbol}_H1.parquet"
        if path.exists():
            old = load_parquet(path)
            new_df = pd.concat([old, new_df]).sort_index()
            new_df = new_df[~new_df.index.duplicated(keep="last")]
        save_parquet(new_df, path)


def build(symbols: list[str]) -> None:
    for symbol in symbols:
        h1 = validate_h1(load_parquet(RAW_DIR / f"{symbol}_H1.parquet"))
        save_parquet(aggregate_h4(h1),    PROCESSED_DIR / f"{symbol}_H4.parquet")
        save_parquet(aggregate_daily(h1), PROCESSED_DIR / f"{symbol}_D1.parquet")


asyncio.run(fetch(SYMBOLS))
build(SYMBOLS)
```

## API reference

### `download`

```python
import asyncio

import pandas as pd

from klines.download import KlineRequest, fetch_all, SPOT_URL, FUTURES_URL, MAX_BARS_PER_REQUEST

# SPOT_URL    = "https://api.binance.com/api/v3/klines"
# FUTURES_URL = "https://fapi.binance.com/fapi/v1/klines"
# MAX_BARS_PER_REQUEST = 1000  (Binance hard limit; fetch_all batches automatically)

req = KlineRequest(
    symbol="BTCUSDT",
    interval="1h",       # 15m | 1h | 4h | 1d
    start_ms=...,        # Unix ms
    end_ms=...,          # Unix ms
    url=SPOT_URL,        # default
)

requests = [req]         # fetch_all takes a list of KlineRequest objects

result: dict[str, pd.DataFrame] = asyncio.run(
    fetch_all(requests, max_workers=4, on_progress=None)
)
# result maps each symbol to its raw DataFrame
# on_progress: Callable[[symbol: str, done: int, total: int], None]
```
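
Since each request returns at most `MAX_BARS_PER_REQUEST` candles, a long date range has to be split into consecutive windows. The sketch below shows one way to compute those windows. It is an assumption about the general technique, not a description of how `fetch_all` batches internally, and `batch_windows` and `INTERVAL_MS` are hypothetical names.

```python
MAX_BARS_PER_REQUEST = 1000  # value quoted in this README

# Candle length in milliseconds for the intervals listed above.
INTERVAL_MS = {"15m": 900_000, "1h": 3_600_000, "4h": 14_400_000, "1d": 86_400_000}


def batch_windows(start_ms: int, end_ms: int, interval: str,
                  max_bars: int = MAX_BARS_PER_REQUEST) -> list[tuple[int, int]]:
    """Split [start_ms, end_ms] into windows holding at most max_bars candles."""
    step = INTERVAL_MS[interval] * max_bars
    windows = []
    cursor = start_ms
    while cursor <= end_ms:
        # Inclusive window end, clipped to the requested range.
        window_end = min(cursor + step - 1, end_ms)
        windows.append((cursor, window_end))
        cursor = window_end + 1
    return windows


# 2,500 hourly candles need three requests: 1000 + 1000 + 500.
windows = batch_windows(0, 2_500 * 3_600_000 - 1, "1h")
```

Each window could then be sent as its own request, with the results concatenated per symbol.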

### `normalise`

```python
from klines.normalise import normalise_klines

df = normalise_klines(raw_df)

# Input:  raw DataFrame from fetch_all (12 Binance columns)
# Output: UTC DatetimeIndex, columns [open, high, low, close, volume] float64
#         Handles Binance's ms→μs timestamp switch at 2025-01-01
#         Deduplicates automatically
```
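
The README only states that `normalise_klines` handles the ms→μs switch, not how. One common way to cope with mixed units is to decide by magnitude, as in this stdlib sketch. The threshold and the `to_utc` helper are assumptions for illustration and may differ from what the library does.

```python
from datetime import datetime, timezone

# Assumption: any epoch value at or above this is in microseconds.
# As milliseconds, 1e14 would fall around the year 5138; as microseconds
# it falls in 1973, so real candle timestamps land clearly on one side.
MICROSECOND_THRESHOLD = 100_000_000_000_000


def to_utc(open_time: int) -> datetime:
    """Convert an open time given in ms or μs to an aware UTC datetime."""
    if open_time >= MICROSECOND_THRESHOLD:
        open_time //= 1000  # microseconds -> milliseconds
    seconds, millis = divmod(open_time, 1000)
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.replace(microsecond=millis * 1000)


from_ms = to_utc(1_735_689_600_000)      # 2025-01-01 00:00 UTC in milliseconds
from_us = to_utc(1_735_689_600_000_000)  # the same instant in microseconds
```

Both calls yield the same datetime, which is the property a normaliser needs before building a single UTC index.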

### `store`

```python
from pathlib import Path

from klines.store import save_parquet, load_parquet

save_parquet(df, Path("data/BTCUSDT_H1.parquet"))   # creates parent dirs
df = load_parquet(Path("data/BTCUSDT_H1.parquet"))  # UTC index preserved
```

### `validate`

```python
from klines.validate import validate_h1, validate_m15

df = validate_h1(df)   # for H1 source data
df = validate_m15(df)  # for M15 source data

# Both:
# - Drop duplicate timestamps (keeps last)
# - Forward-fill gaps with zero-volume candles (O=H=L=C=prev close)
# - Raise ValueError on OHLC sanity violations
# - Drop the last candle if its period hasn't closed
```
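
The first three rules listed above can be illustrated without pandas. This sketch works on plain tuples and assumes timestamps are aligned to the candle interval. The specific sanity rules (high is the maximum, low is the minimum, volume is non-negative) are an assumption, and `validate` and `check_sanity` here are hypothetical stand-ins, not the library functions.

```python
from datetime import datetime, timedelta, timezone

# One candle as (open, high, low, close, volume).
Candle = tuple[float, float, float, float, float]


def check_sanity(candle: Candle) -> None:
    """Raise ValueError if the candle breaks the assumed OHLC rules."""
    open_, high, low, close, volume = candle
    if high < max(open_, close, low) or low > min(open_, close) or volume < 0:
        raise ValueError(f"OHLC sanity violation: {candle}")


def validate(rows: list[tuple[datetime, Candle]], step: timedelta) -> dict[datetime, Candle]:
    """Deduplicate (keep last), sanity-check, and gap-fill aligned candles."""
    # sorted() is stable and dict() keeps the last value per key,
    # so a duplicated timestamp ends up with its last occurrence.
    by_time = dict(sorted(rows, key=lambda row: row[0]))
    if not by_time:
        return {}
    for candle in by_time.values():
        check_sanity(candle)

    filled: dict[datetime, Candle] = {}
    cursor, last = min(by_time), max(by_time)
    prev_close = by_time[cursor][3]
    while cursor <= last:
        if cursor in by_time:
            filled[cursor] = by_time[cursor]
            prev_close = by_time[cursor][3]
        else:
            # Gap: flat zero-volume candle at the previous close.
            filled[cursor] = (prev_close, prev_close, prev_close, prev_close, 0.0)
        cursor += step
    return filled


def at(hour: int) -> datetime:
    return datetime(2024, 1, 1, hour, tzinfo=timezone.utc)


rows = [
    (at(0), (100.0, 101.0, 99.0, 100.5, 5.0)),
    (at(0), (100.0, 102.0, 99.0, 101.0, 6.0)),   # duplicate, this one is kept
    (at(3), (101.0, 103.0, 100.0, 102.0, 4.0)),  # hours 1 and 2 are missing
]
clean = validate(rows, timedelta(hours=1))
```

The fourth rule, dropping a candle whose period has not closed, would additionally compare the last open time plus `step` against the current time.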

Individual functions:

```python
from klines.validate import (
    check_no_gaps,       # raises ValueError if any gap found
    check_no_duplicates, # raises ValueError if duplicate timestamps found
    check_ohlc_sanity,   # raises ValueError on high ... (rest truncated in the source)
)
```