{"id":50565603,"url":"https://github.com/jeromer/klines","last_synced_at":"2026-06-04T14:30:28.762Z","repository":{"id":355902883,"uuid":"1228171307","full_name":"jeromer/klines","owner":"jeromer","description":"Fetch, normalise, validate, and aggregate Binance OHLCV klines into clean Parquet datasets.","archived":false,"fork":false,"pushed_at":"2026-05-13T17:15:01.000Z","size":81,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-13T19:19:09.205Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jeromer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-03T17:29:53.000Z","updated_at":"2026-05-13T17:15:06.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/jeromer/klines","commit_stats":null,"previous_names":["jeromer/klines"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jeromer/klines","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jeromer%2Fklines","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jeromer%2Fklines/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jeromer%2Fklines/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jeromer%2Fklines/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jeromer","download_url":"https://codeload.github.com/jeromer/klines/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jeromer%2Fklines/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33910136,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-04T02:00:06.755Z","response_time":64,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-04T14:30:27.818Z","updated_at":"2026-06-04T14:30:28.750Z","avatar_url":"https://github.com/jeromer.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# klines\n\nFetch, normalise, validate, and aggregate Binance OHLCV klines into clean Parquet datasets.\n\n## What it does\n\nTwo ready-to-run scripts and five library modules, no trading logic, no config globals:\n\n| | Name | What it provides |\n|---|---|---|\n| script | `bin/fetch_data.py` | Download klines from Binance and save as Parquet |\n| script | `bin/build_datasets.py` | Validate and aggregate H1 or M15 data into H1/H4/D1/W1/M1/Q1 |\n| lib | `download` | Async batch HTTP fetch from Binance REST API |\n| lib | `normalise` | Convert raw Binance JSON rows to typed OHLCV DataFrame |\n| lib | `store` | Save and load DataFrames as Parquet files |\n| lib | `validate` | Deduplicate, gap-fill, and sanity-check H1 or M15 data |\n| lib | `aggregate` | Resample M15→H1 and H1→H4/D1/W1/M1/Q1 |\n\n## Install\n\n```bash\npip install \"klines @ git+https://github.com/jeromer/klines.git\"\n```\n\nOr with [uv](https://docs.astral.sh/uv/):\n\n```bash\nuv add \"klines @ git+https://github.com/jeromer/klines.git\"\n```\n\n## Scripts\n\nThe scripts in `bin/` are executable and self-contained. Two ways to run them:\n\n**From a clone** (no install required, needs deps on `$PYTHONPATH`):\n```bash\ngit clone https://github.com/jeromer/klines\ncd klines\nuv sync\n./bin/fetch_data.py --symbols BTCUSDT,ETHUSDT\n./bin/build_datasets.py --symbols BTCUSDT,ETHUSDT\n```\n\n**As installed CLI commands** (after `pip install` / `uv add`):\n```bash\nbinance-fetch --symbols BTCUSDT,ETHUSDT\nbinance-build --symbols BTCUSDT,ETHUSDT\n```\n\nBoth forms accept identical flags.\n\n### `bin/fetch_data.py` — download klines\n\n```\n./bin/fetch_data.py [--symbols SYMBOL[,SYMBOL...]]\n                    [--market spot|futures]\n                    [--interval m15|h1|h4|d]\n                    [--start YYYY-MM-DD]\n                    [--end YYYY-MM-DD]\n                    [--output-dir DIR]\n                    [--workers N]\n                    [--progress|--no-progress]\n```\n\nDownloads H1 klines (default) for one or more symbols. Resumes from the last stored timestamp if a Parquet file already exists.\n\n```bash\n# fetch BTC + ETH hourly from 2020 to today → data/raw/\n./bin/fetch_data.py --symbols BTCUSDT,ETHUSDT --start 2020-01-01\n\n# fetch 15m futures data into a custom dir\n./bin/fetch_data.py --symbols BTCUSDT --market futures --interval m15 --output-dir /tmp/raw\n```\n\nDefaults: `--market spot`, `--interval h1`, `--start 2017-01-01`, `--output-dir ./data/raw`, `--workers \u003cCPU count\u003e`.\n\n### `bin/build_datasets.py` — validate and aggregate\n\n```\n./bin/build_datasets.py [--symbols SYMBOL[,SYMBOL...]]\n                        [--source-interval h1|m15]\n                        [--raw-dir DIR]\n                        [--output-dir DIR]\n```\n\nReads `{symbol}_{SOURCE}.parquet` from `--raw-dir`, validates, and writes H1/H4/D1/W1/M1/Q1 Parquet files to `--output-dir`. When `--source-interval m15`, H1 is derived from M15 before aggregating higher timeframes.\n\n```bash\n# build from H1 source (default)\n./bin/build_datasets.py --symbols BTCUSDT,ETHUSDT\n\n# build from M15 source — derives H1, H4, D1, W1, M1, Q1\n./bin/build_datasets.py --symbols BTCUSDT,ETHUSDT --source-interval m15\n\n./bin/build_datasets.py --symbols BTCUSDT --raw-dir /tmp/raw --output-dir /tmp/processed\n```\n\nDefaults: `--source-interval h1`, `--raw-dir ./data/raw`, `--output-dir ./data/processed`.\n\n### Full pipeline\n\n```bash\n# H1 source\n./bin/fetch_data.py --symbols BTCUSDT,ETHUSDT \u0026\u0026 ./bin/build_datasets.py --symbols BTCUSDT,ETHUSDT\n\n# M15 source — higher resolution, derives H1 and all higher timeframes\n./bin/fetch_data.py --symbols BTCUSDT,ETHUSDT --interval m15 \u0026\u0026 \\\n./bin/build_datasets.py --symbols BTCUSDT,ETHUSDT --source-interval m15\n```\n\n## Embedding\n\nUse this when your project has its own config and wants to call the pipeline programmatically rather than shelling out to the scripts.\n\n### Option A — call `main()` with your defaults\n\nBoth scripts expose `main(defaults={...})`. Keys in `defaults` set argument defaults; any CLI flag passed at runtime still overrides them. Your project never needs to touch `sys.argv`.\n\n```python\nfrom bin.fetch_data import main as fetch_main\nfrom bin.build_datasets import main as build_main\n\nSYMBOLS = [\"BTCUSDT\", \"ETHUSDT\", \"SOLUSDT\"]\nRAW_DIR = \"/data/raw\"\nPROCESSED_DIR = \"/data/processed\"\n\n# equivalent to: ./bin/fetch_data.py --symbols BTCUSDT,ETHUSDT,SOLUSDT --output-dir /data/raw\nfetch_main(defaults={\n    \"symbols\": SYMBOLS,\n    \"start\": \"2017-01-01\",\n    \"output_dir\": RAW_DIR,\n})\n\n# equivalent to: ./bin/build_datasets.py --symbols ... --raw-dir /data/raw --output-dir /data/processed\nbuild_main(defaults={\n    \"symbols\": SYMBOLS,\n    \"raw_dir\": RAW_DIR,\n    \"output_dir\": PROCESSED_DIR,\n})\n```\n\n### Option B — call library functions directly\n\nUse this when you need finer control: custom progress reporting, in-memory pipelines, partial steps, or integration with an async event loop.\n\n```python\nimport asyncio\nfrom pathlib import Path\n\nimport pandas as pd\n\nfrom klines.download import KlineRequest, fetch_all\nfrom klines.normalise import normalise_klines\nfrom klines.store import load_parquet, save_parquet\nfrom klines.validate import validate_h1\nfrom klines.aggregate import aggregate_h4, aggregate_daily\n\nRAW_DIR = Path(\"/data/raw\")\nPROCESSED_DIR = Path(\"/data/processed\")\nSYMBOLS = [\"BTCUSDT\", \"ETHUSDT\"]\nSTART = \"2017-01-01\"\n\n\nasync def fetch(symbols: list[str]) -\u003e None:\n    end_ms = int(pd.Timestamp.now(tz=\"UTC\").timestamp() * 1000)\n    requests = []\n    for symbol in symbols:\n        path = RAW_DIR / f\"{symbol}_H1.parquet\"\n        if path.exists():\n            start_ms = int(load_parquet(path).index[-1].timestamp() * 1000) + 1\n        else:\n            start_ms = int(pd.Timestamp(START, tz=\"UTC\").timestamp() * 1000)\n        requests.append(KlineRequest(symbol, \"1h\", start_ms, end_ms))\n\n    raw = await fetch_all(requests, max_workers=4)\n\n    for symbol, raw_df in raw.items():\n        new_df = normalise_klines(raw_df)\n        path = RAW_DIR / f\"{symbol}_H1.parquet\"\n        if path.exists():\n            old = load_parquet(path)\n            new_df = pd.concat([old, new_df]).sort_index()\n            new_df = new_df[~new_df.index.duplicated(keep=\"last\")]\n        save_parquet(new_df, path)\n\n\ndef build(symbols: list[str]) -\u003e None:\n    for symbol in symbols:\n        h1 = validate_h1(load_parquet(RAW_DIR / f\"{symbol}_H1.parquet\"))\n        save_parquet(aggregate_h4(h1),    PROCESSED_DIR / f\"{symbol}_H4.parquet\")\n        save_parquet(aggregate_daily(h1), PROCESSED_DIR / f\"{symbol}_D1.parquet\")\n\n\nasyncio.run(fetch(SYMBOLS))\nbuild(SYMBOLS)\n```\n\n## API reference\n\n### `download`\n\n```python\nfrom klines.download import KlineRequest, fetch_all, SPOT_URL, FUTURES_URL, MAX_BARS_PER_REQUEST\n\n# SPOT_URL    = \"https://api.binance.com/api/v3/klines\"\n# FUTURES_URL = \"https://fapi.binance.com/fapi/v1/klines\"\n# MAX_BARS_PER_REQUEST = 1000  (Binance hard limit; fetch_all batches automatically)\n\nreq = KlineRequest(\n    symbol=\"BTCUSDT\",\n    interval=\"1h\",       # 15m | 1h | 4h | 1d\n    start_ms=...,        # Unix ms\n    end_ms=...,          # Unix ms\n    url=SPOT_URL,        # default\n)\n\nresult: dict[str, pd.DataFrame] = asyncio.run(\n    fetch_all(requests, max_workers=4, on_progress=None)\n)\n# on_progress: Callable[[symbol: str, done: int, total: int], None]\n```\n\n### `normalise`\n\n```python\nfrom klines.normalise import normalise_klines\n\ndf = normalise_klines(raw_df)\n# Input:  raw DataFrame from fetch_all (12 Binance columns)\n# Output: UTC DatetimeIndex, columns [open, high, low, close, volume] float64\n#         Handles Binance's ms→μs timestamp switch at 2025-01-01\n#         Deduplicates automatically\n```\n\n### `store`\n\n```python\nfrom klines.store import save_parquet, load_parquet\n\nsave_parquet(df, Path(\"data/BTCUSDT_H1.parquet\"))   # creates parent dirs\ndf = load_parquet(Path(\"data/BTCUSDT_H1.parquet\"))  # UTC index preserved\n```\n\n### `validate`\n\n```python\nfrom klines.validate import validate_h1, validate_m15\n\ndf = validate_h1(df)   # for H1 source data\ndf = validate_m15(df)  # for M15 source data\n# Both:\n# - Drop duplicate timestamps (keeps last)\n# - Forward-fill gaps with zero-volume candles (O=H=L=C=prev close)\n# - Raise ValueError on OHLC sanity violations\n# - Drop the last candle if its period hasn't closed\n```\n\nIndividual functions:\n\n```python\nfrom klines.validate import (\n    check_no_gaps,       # raises ValueError if any gap found\n    check_no_duplicates, # raises ValueError if duplicate timestamps found\n    check_ohlc_sanity,   # raises ValueError on high\u003clow, negative volume, etc.\n    fill_gaps,           # forward-fill missing bars (freq param: \"1h\" or \"15min\")\n    drop_partial_candle, # remove last bar if its period hasn't closed\n)\n```\n\n### `aggregate`\n\n```python\nfrom klines.aggregate import (\n    aggregate_h1,         # M15 → H1 (requires all 4 bars per hour)\n    aggregate_h4,         # H1 → 4h bars, UTC-anchored to 2020-01-01\n    aggregate_daily,      # H1 → daily bars, midnight UTC\n    aggregate_weekly,     # H1 → weekly bars, Monday 00:00 UTC\n    aggregate_monthly,    # H1 → monthly bars, 1st of month UTC\n    aggregate_quarterly,  # H1 → quarterly bars, Jan/Apr/Jul/Oct 1st UTC\n)\n```\n\nIncomplete periods at the tail are dropped (e.g. a partial week with fewer than 144 H1 bars).\n\n## Development\n\n```bash\ngit clone https://github.com/jeromer/klines\ncd klines\nuv sync --extra dev\nuv run pytest tests/\nuv run ruff check .\n```\n\n## Requirements\n\n- Python ≥ 3.12\n- pandas ≥ 2.2\n- aiohttp ≥ 3.9\n- pyarrow ≥ 15\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjeromer%2Fklines","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjeromer%2Fklines","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjeromer%2Fklines/lists"}