{"id":40699305,"url":"https://github.com/maris-development/beacon","last_synced_at":"2026-01-21T12:00:29.090Z","repository":{"id":281740845,"uuid":"920629743","full_name":"maris-development/beacon","owner":"maris-development","description":"A high-performance climate 🌍 data lake supporting subsetting for zarr, netcdf, parquet, arrow ipc, csv and bbf","archived":false,"fork":false,"pushed_at":"2026-01-19T16:47:21.000Z","size":11303,"stargazers_count":8,"open_issues_count":21,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2026-01-19T22:03:29.649Z","etag":null,"topics":["blue-cloud2026","data-access","data-lake","docker","open-science","rest-api"],"latest_commit_sha":null,"homepage":"https://maris-development.github.io/beacon/","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/maris-development.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-01-22T13:53:22.000Z","updated_at":"2026-01-16T09:53:07.000Z","dependencies_parsed_at":"2025-04-18T17:42:34.672Z","dependency_job_id":"6b268f3b-5f6c-4825-b239-8c49f71f8270","html_url":"https://github.com/maris-development/beacon","commit_stats":null,"previous_names":["maris-development/beacon"],"tags_count":12,"template":false,"template_full_name":null,"purl":"pkg:github/maris-development/beacon","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maris-development%2Fbeacon","tags_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/repositories/maris-development%2Fbeacon/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maris-development%2Fbeacon/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maris-development%2Fbeacon/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/maris-development","download_url":"https://codeload.github.com/maris-development/beacon/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maris-development%2Fbeacon/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28632781,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-21T04:47:28.174Z","status":"ssl_error","status_checked_at":"2026-01-21T04:47:22.943Z","response_time":86,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["blue-cloud2026","data-access","data-lake","docker","open-science","rest-api"],"created_at":"2026-01-21T12:00:18.238Z","updated_at":"2026-01-21T12:00:29.059Z","avatar_url":"https://github.com/maris-development.png","language":"Rust","readme":"# Beacon ARCO Data lake 
Platform\n\n[![Release](https://img.shields.io/github/v/release/maris-development/beacon?style=for-the-badge\u0026label=Release\u0026color=success)](https://github.com/maris-development/beacon/releases)\n[![Docker Image](https://img.shields.io/badge/Docker-2CA5E0?style=for-the-badge\u0026logo=docker\u0026logoColor=white)](https://github.com/maris-development/beacon/pkgs/container/beacon)\n[![Docs](https://img.shields.io/github/actions/workflow/status/maris-development/beacon/pages.yml?style=for-the-badge\u0026label=Docs)](https://maris-development.github.io/beacon/)\n[![Chat on Slack](https://img.shields.io/badge/Slack-4A154B?style=for-the-badge\u0026logo=slack\u0026logoColor=white)](https://beacontechnic-wwa5548.slack.com/join/shared_invite/zt-2dp1vv56r-tj_KFac0sAKNuAgUKPPDRg)\n\n## Contents\n\n- [What is Beacon?](#what-is-beacon)\n- [Documentation](#documentation)\n- [Getting started](#getting-started)\n- [Install Beacon with Docker Compose](#install-beacon-with-docker-compose)\n- [Data Lake model](#data-lake-model)\n- [Querying Beacon](#querying-beacon)\n- [Contributing](#contributing)\n- [Troubleshooting](#troubleshooting)\n\n## What is Beacon?\n\nBeacon is a lightweight, high-performance ARCO data lake platform for discovering, reading, transforming, and serving scientific array and tabular datasets. It focuses on interoperability with Arrow and DataFusion, and supports common scientific storage formats (Parquet, NetCDF, Zarr, ODV, CSV, and others). 
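In practice, every interaction boils down to posting a small JSON document to the HTTP API. A minimal sketch of such a request body follows — the table name `observations` and the column names are hypothetical placeholders, not defaults shipped with Beacon:

```python
import json

# Sketch of a request body for Beacon's SQL-over-HTTP endpoint.
# The table and column names below are hypothetical placeholders.
payload = {
    "sql": "SELECT TEMP, LONGITUDE, LATITUDE FROM observations",
    "output": {"format": "parquet"},  # also: ipc, csv, netcdf, geojson, odv
}
body = json.dumps(payload)
print(body)
```

POST a body like this to `/api/query` with `Content-Type: application/json`; the response arrives in the requested `output.format`.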
Beacon is designed for:\n\n- Data scientists and engineers who need fast, programmatic access to large gridded or tabular datasets stored locally or in object stores (S3-compatible systems).\n- Developers building data services that require efficient columnar reads, pushdown statistics, and integration with DataFusion execution plans.\n\nKey capabilities:\n\n- Format adapters: read and expose data as Arrow arrays from Parquet, NetCDF, Zarr, ODV, CSV, etc.\n- Pushdown \u0026 partitioning: compute lightweight statistics and partition datasets for efficient query planning.\n- Object store friendly: works with local files and S3-like object stores using the `object_store` abstraction.\n- HTTP API: optional Axum-based service to expose query endpoints and metadata.\n- SQL support: execute SQL queries (via DataFusion) against registered formats and datasets. Beacon integrates DataFusion's SQL engine so callers can run SQL directly through the API.\n- JSON query DSL: structured query format for building queries programmatically or via UIs.\n\n## Documentation\n\n- Landing page: https://maris-development.github.io/beacon/\n- Installation (Docker-focused quick start): https://maris-development.github.io/beacon/docs/1.4.0-install/\n- Query \u0026 data lake reference: https://maris-development.github.io/beacon/docs/1.4.0/query-docs/data-lake.html\n\n## Getting started\n\nTo get started with Beacon, clone the beacon-example repository, which contains an example setup for both local and S3, along with example queries and scripts:\n\n```powershell\ngit clone https://github.com/maris-development/beacon-example.git\n```\n\nFollow the instructions in the `beacon-example/README.md` to set up datasets, run the Beacon API server, and execute example queries.\n\n## Install Beacon with Docker Compose\n\nThe official installation guide walks through a Docker-based deployment. The short version:\n\n1. 
Install Docker (desktop or server) and create a `docker-compose.yml` similar to the example below.\n2. Adjust the environment variables for your admin credentials, memory budget, default table, log level, host, and port.\n3. Mount your dataset and table directories so the container can persist indexed data.\n4. Run `docker compose up -d` and open http://localhost:8080/swagger/ to confirm the service is live.\n\n```yaml\nversion: \"3.8\"\n\nservices:\n  beacon:\n    image: ghcr.io/maris-development/beacon:latest\n    container_name: beacon\n    restart: unless-stopped\n    ports:\n      - \"8080:8080\"\n    environment:\n      - BEACON_ADMIN_USERNAME=admin\n      - BEACON_ADMIN_PASSWORD=securepassword\n      - BEACON_VM_MEMORY_SIZE=4096\n      - BEACON_DEFAULT_TABLE=default\n      - BEACON_LOG_LEVEL=INFO\n      - BEACON_HOST=0.0.0.0\n      - BEACON_PORT=8080\n    volumes:\n      - ./data/datasets:/beacon/data/datasets\n      - ./data/tables:/beacon/data/tables\n```\n\nSee the full installation chapter for troubleshooting tips, hardware recommendations, and additional deployment patterns (single Docker container, MinIO-backed storage, etc.).\n\n## Data Lake model\n\nBeacon organizes storage into datasets (raw files) and data tables (named collections):\n\n- **Datasets** — individual NetCDF, Zarr, Parquet, CSV, ODV ASCII, or Arrow IPC assets that can be queried directly via SQL or JSON APIs. You can register many files at once using glob patterns such as `*.nc` or `*/zarr.json` to sweep directories.\n- **Data tables** — logical tables that group one or more datasets under a single name (e.g., `2020_2022` created from `2020.nc`, `2021.nc`, `2022.nc`). 
Tables provide a higher-level schema, so analysts can query an entire collection of datasets as if it were a single table.\n\nThe [data lake guide](https://maris-development.github.io/beacon/docs/1.4.0/query-docs/data-lake.html) contains a deeper architectural explanation plus links to tutorials for SQL and JSON queries, language SDKs, and the Beacon Studio web UI.\n\n### ND broadcasting (dataset harmonization)\n\nWhen Beacon needs to combine or harmonize n-dimensional variables coming from different files (e.g. NetCDF/Zarr cubes with slightly different shapes), it can apply broadcasting to align arrays on compatible dimensions.\n\n- Beacon broadcasts by matching dimension names (xarray-style): input dims must be a subset of target dims and order must be preserved.\n\nFor details and examples, see the harmonization docs:\n- https://maris-development.github.io/beacon/docs/1.4.0-install/data-lake/datasets-harmonization/\n\nThe underlying Arrow encoding and broadcasting implementation lives in the `beacon-nd-arrow` crate.\n\n## Querying Beacon\n\n### SQL endpoint\n\nUse the `/api/query` REST endpoint to run ANSI SQL over registered tables or directly over datasets:\n\n```bash\ncurl -X POST http://localhost:8080/api/query \\\n  -H 'Content-Type: application/json' \\\n  --output results.parquet \\\n  --data-binary @- \u003c\u003c'JSON'\n{\n  \"sql\": \"SELECT TEMP, PSAL, LONGITUDE, LATITUDE FROM observations WHERE time \u003e TIMESTAMP '2020-01-01'\",\n  \"output\": {\"format\": \"parquet\"}\n}\nJSON\n```\n\n- Works for both local paths and S3/object-store URIs registered in your configuration.\n- Output formats include Arrow IPC, Parquet, CSV, NetCDF, GeoParquet, GeoJSON, and ODV—set via `output.format`.\n\n### SQL dataset helpers\n\nRead individual collections without pre-registering tables using helper functions documented in the [SQL guide](https://maris-development.github.io/beacon/docs/1.4.0/query-docs/querying/sql.html):\n\n- 
`read_zarr(['dataset.zarr/zarr.json'], ['LONGITUDE','LATITUDE'])` — optional second argument pre-fetches columns to enable pushdown filtering (recommended for large cubes).\n- `read_netcdf(['dataset.nc'])` — exposes NetCDF variables and attributes directly as columns.\n- `read_parquet(['dataset.parquet'])` — benefits from predicate pushdown automatically.\n\nExample with spatial pushdown:\n\n```bash\ncurl -X POST http://localhost:8080/api/query \\\n  -H 'Content-Type: application/json' \\\n  --output subset.parquet \\\n  --data-binary @- \u003c\u003c'JSON'\n{\n  \"sql\": \"SELECT TEMP, PSAL, LONGITUDE, LATITUDE FROM read_zarr(['datasets.zarr/zarr.json'], ['LONGITUDE', 'LATITUDE']) WHERE LONGITUDE \u003e 10 AND LATITUDE \u003c 50\",\n  \"output\": {\"format\": \"parquet\"}\n}\nJSON\n```\n\n### Python SDK\n\nUse the official [beacon-py](https://maris-development.github.io/beacon-py/latest/) client when you prefer fluent builders and direct access to DataFrames/GeoDataFrames.\n\n```bash\npip install beacon-api\n```\n\n```python\nfrom beacon_api import Client\n\nclient = Client(\n    \"https://beacon.example.com\",\n    jwt_token=\"\u003coptional bearer token\u003e\",\n)\n\nclient.check_status()\n\ntables = client.list_tables()\nstations = tables[\"default\"]\n\ndf = (\n    stations\n    .query()\n    .add_select_columns([\n        (\"LONGITUDE\", None),\n        (\"LATITUDE\", None),\n        (\"JULD\", None),\n        (\"TEMP\", \"temperature_c\"),\n    ])\n    .add_range_filter(\"JULD\", \"2024-01-01T00:00:00\", \"2024-12-31T23:59:59\")\n    .to_pandas_dataframe()\n)\n```\n\n- `Client.list_tables()` and `Client.list_datasets()` expose metadata/schemas before you build queries.\n- The fluent builder covers selects, filters (min/max, geospatial, distinct), ordering, and export helpers such as `to_geo_pandas_dataframe()` or `to_parquet()`.\n- Prefer `client.sql_query(\"SELECT ...\")` if you already have SQL strings and want the SDK to manage authentication + 
retries.\n\n### JSON query DSL\n\nThe JSON DSL is useful for UI integrations or when you want structured column definitions, filtering, and output control. The request body follows the schema documented in [Querying with JSON](https://maris-development.github.io/beacon/docs/1.4.0/query-docs/querying/json.html):\n\n```bash\ncurl -X POST http://localhost:8080/api/query \\\n  -H 'Content-Type: application/json' \\\n  --data-binary @- \u003c\u003c'JSON'\n{\n  \"query_parameters\": [\n    {\"column_name\": \"TEMP\", \"alias\": \"temperature\"},\n    {\"column_name\": \"PSAL\", \"alias\": \"salinity\"},\n    {\"column_name\": \"TIME\"},\n    {\"column_name\": \"LONGITUDE\"},\n    {\"column_name\": \"LATITUDE\"}\n  ],\n  \"filters\": [\n    {\"for_query_parameter\": \"temperature\", \"min\": -2, \"max\": 35},\n    {\"for_query_parameter\": \"salinity\", \"min\": 30, \"max\": 42},\n    {\"and\": [\n      {\"for_query_parameter\": \"LONGITUDE\", \"min\": -20, \"max\": 20},\n      {\"for_query_parameter\": \"LATITUDE\", \"min\": 40, \"max\": 65}\n    ]}\n  ],\n  \"from\": {\n    \"netcdf\": {\"paths\": [\"data/2020.nc\", \"data/2021.nc\"]}\n  },\n  \"output\": {\"format\": \"csv\"}\n}\nJSON\n```\n\n- Discover available columns via `GET /api/query/available-columns` (requires the same auth token, if any).\n- Filters support min/max ranges, equality, polygon bounds, time windows, null filtering, and logical `and`/`or` composition.\n- Set `distinct` to deduplicate values or supply GeoJSON/GeoParquet metadata when you need geospatial outputs.\n\n### Output formats\n\nReturn results in the format that best matches your downstream tooling by adjusting `output.format`:\n\n- `parquet` / `geoparquet`\n- `ipc` (Arrow IPC streaming)\n- `csv`\n- `netcdf`\n- `geojson`\n- `odv` (with support for key columns, quality flags, and metadata columns as described in the JSON guide)\n\n## Support\n\nFor questions, issues, or feature requests, please open an issue on the GitHub repository: 
https://github.com/maris-development/beacon/issues.\n\nWe also have a dedicated Slack channel for discussions: https://beacontechnic-wwa5548.slack.com/join/shared_invite/zt-2dp1vv56r-tj_KFac0sAKNuAgUKPPDRg\n\n## Workspace overview\n\nLocation: repository root (this README)\n\nKey workspace members (see `Cargo.toml`):\n\n- `beacon-api` — HTTP API server exposing query endpoints and OpenAPI/Swagger UI. The API supports submitting SQL queries (DataFusion SQL) and returns Arrow/JSON results.\n- `beacon-core` — Core runtime types and orchestration used by services; ties together query planning and execution helpers.\n- `beacon-common` — Shared utilities and small helpers used across crates.\n- `beacon-config` — Configuration and environment handling.\n- `beacon-formats` — File format adapters (Parquet, CSV, Arrow, NetCDF, Zarr, GeoParquet).\n- `beacon-arrow-netcdf` — Arrow/NetCDF integration (reader/writer utilities).\n- `beacon-arrow-zarr` — Arrow/Zarr integration (reader/writer utilities).\n- `beacon-arrow-odv` — Arrow/ODV ASCII integration.\n- `beacon-binary-format` — Beacon Binary Format (BBF) for efficient storage of multi-million-dataset collections. (Will also become an exchange format in future releases.)\n- `beacon-data-lake` — Utilities for working with object stores, dataset discovery, and table management.\n- `beacon-functions` — User-defined functions and helpers used in query execution.\n- `beacon-planner` — Query planner and planning utilities that build execution plans.\n- `beacon-query` — Query parsing and translation to planner structures.\n\nNote: the workspace `Cargo.toml` references `beacon-arrow-zarr` and other crates; not all referenced crates may be present locally in this checkout. 
If you see build errors about missing workspace members, check whether the missing crate exists in a separate repository or submodule.\n\n## Per-crate quick descriptions\n\nThese are short summaries to help contributors quickly find where to work:\n\n- `beacon-api/` — An Axum-based HTTP server that exposes Beacon's query interface and metadata endpoints. Integrates with `beacon-core` and registers DataFusion file formats and resolvers.\n\n- `beacon-core/` — Core runtime crate: session/environment scaffolding, runtime utilities, and glue between the API and execution components.\n\n- `beacon-common/` — Small helpers, error types, and utilities (serialization helpers, common types, and small abstractions used across the workspace).\n\n- `beacon-formats/` — Implements DataFusion FileFormat adapters for a range of formats. Notable submodule: `zarr` implements async discovery of Zarr v3 groups and integrates with `zarrs` + `zarrs_object_store` to create partitioned file groups and compute pushdown statistics.\n\n- `beacon-arrow-netcdf/`, `beacon-arrow-odv/` — Adapter crates that expose NetCDF and ODV data as Arrow arrays and schemas.\n\n- `beacon-arrow-zarr/` — Adapter crate that exposes Zarr v3 datasets as Arrow arrays and schemas. Uses `zarrs` and `zarrs_object_store` for low-level Zarr access.\n\n- `beacon-data-lake/` — Utilities for managing datasets on object stores and local file systems, plus object discovery and scanning helpers.\n\n- `beacon-query/` — Parsing and translation of text queries into planner nodes used by `beacon-planner` and `beacon-core`.\n\n- `beacon-planner/` — Planner that converts parsed queries into DataFusion execution plans and coordinates pushdowns and function dispatch.\n\nThere are additional crates and examples in the repo for demos, Python bindings (`beacon-py`), and studio tooling (`beacon-studio`). 
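As a footnote to the adapter crates: the dimension-name broadcasting rule described in the harmonization section (input dims must be an ordered subset of the target dims) can be sketched in NumPy terms. This illustrates the rule only, with hypothetical function and dimension names; the real `beacon-nd-arrow` implementation operates on Arrow arrays:

```python
import numpy as np

def broadcast_by_name(arr, in_dims, target_dims, target_shape):
    """Broadcast `arr` (dims named by `in_dims`) onto `target_dims`.

    Mirrors the xarray-style rule: input dims must be a subset of the
    target dims, and their relative order must be preserved.
    """
    # Ordered-subset check: each input dim must appear in the target,
    # in the same order (the iterator is consumed left to right).
    it = iter(target_dims)
    if not all(d in it for d in in_dims):
        raise ValueError(f"{in_dims} is not an ordered subset of {target_dims}")
    # Keep each input axis where its name occurs; insert size-1 axes elsewhere.
    interim = [arr.shape[in_dims.index(d)] if d in in_dims else 1
               for d in target_dims]
    return np.broadcast_to(arr.reshape(interim), target_shape)

# A 1-D 'time' series broadcast onto a (time, lat, lon) cube:
temp = np.arange(3.0)
cube = broadcast_by_name(temp, ("time",), ("time", "lat", "lon"), (3, 2, 4))
print(cube.shape)  # (3, 2, 4)
```

Reordered input dims (e.g. `("lon", "lat")` against `("time", "lat", "lon")`) fail the ordered-subset check, matching the documented constraint that order must be preserved.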
Browse the workspace directories for more details.\n\n## Building\n\nRequirements:\n\n- Rust toolchain: the repository includes a `rust-toolchain` file pinning the Rust version. Use `rustup` to install the correct toolchain.\n- Cargo (comes with Rust toolchain).\n\nBuild the whole workspace (from repo root):\n\n```powershell\ncargo build --workspace\n```\n\nBuild just one crate (faster):\n\n```powershell\ncargo build -p beacon-formats\n```\n\nNotes:\n\n- The first build will download and compile dependencies, including any git dependencies referenced in crate manifests (for example `nd-arrow-array`).\n\n## Testing\n\nRun all tests in the workspace:\n\n```powershell\ncargo test --workspace\n```\n\nRun tests for a single crate:\n\n```powershell\ncargo test -p beacon-formats\n```\n\nSome tests require access to `test_files/` directories (local object store) and may perform async IO. Use `-- --nocapture` to see printed debug output when running individual tests.\n\n## Linting and formatting\n\nYou can run Clippy and rustfmt for code quality checks:\n\n```powershell\ncargo clippy --workspace -- -D warnings\ncargo fmt --all\n```\n\n## Development tips\n\n- Use `cargo test -p \u003ccrate\u003e -- --nocapture` when debugging tests that print logs.\n- The project uses DataFusion and Arrow heavily; when changing format adapters (e.g., Zarr), update unit tests in `beacon-formats` and consider adding small integration tests that use `object_store::local::LocalFileSystem`.\n\n## Contributing\n\n1. Fork the project and create a feature branch.\n2. Run and add tests for any functional change.\n3. 
Keep changes small and focused — run `cargo test -p \u003ccrate\u003e` locally before opening a PR.\n\n## Troubleshooting\n\n- Long compile times: use incremental builds and build individual crates when working on a small change.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaris-development%2Fbeacon","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaris-development%2Fbeacon","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaris-development%2Fbeacon/lists"}