{"id":49987843,"url":"https://github.com/spiraldb/raincloud","last_synced_at":"2026-05-19T01:06:58.281Z","repository":{"id":356368511,"uuid":"1230157784","full_name":"spiraldb/raincloud","owner":"spiraldb","description":"🌧️ Public Data to Parquet and Vortex Pipeline","archived":false,"fork":false,"pushed_at":"2026-05-17T16:38:06.000Z","size":2528,"stargazers_count":2,"open_issues_count":4,"forks_count":0,"subscribers_count":1,"default_branch":"develop","last_synced_at":"2026-05-17T17:46:00.837Z","etag":null,"topics":["data-pipeline","datasets","parquet","python","vortex"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/spiraldb.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-05-05T18:26:33.000Z","updated_at":"2026-05-17T16:29:05.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/spiraldb/raincloud","commit_stats":null,"previous_names":["spiraldb/raincloud"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/spiraldb/raincloud","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spiraldb%2Fraincloud","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spiraldb%2Fraincloud/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spiraldb%2Fraincloud/releases","manifes
ts_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spiraldb%2Fraincloud/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/spiraldb","download_url":"https://codeload.github.com/spiraldb/raincloud/tar.gz/refs/heads/develop","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spiraldb%2Fraincloud/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33197524,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-18T09:27:30.708Z","status":"ssl_error","status_checked_at":"2026-05-18T09:27:28.300Z","response_time":71,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-pipeline","datasets","parquet","python","vortex"],"created_at":"2026-05-19T01:06:56.607Z","updated_at":"2026-05-19T01:06:58.243Z","avatar_url":"https://github.com/spiraldb.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🌧️ Raincloud\n\n[![CI](https://github.com/spiraldb/raincloud/actions/workflows/ci.yml/badge.svg)](https://github.com/spiraldb/raincloud/actions/workflows/ci.yml)\n[![Latest 
release](https://img.shields.io/github/v/release/spiraldb/raincloud)](https://github.com/spiraldb/raincloud/releases)\n[![License](https://img.shields.io/github/license/spiraldb/raincloud)](LICENSE)\n[![Cite](https://img.shields.io/badge/cite-CITATION.cff-blue)](CITATION.cff)\n\nA reproducible pipeline for building a curated catalog of public datasets as analytics-ready Parquet and Vortex files.\n\nRaincloud is a reproducible baseline of public datasets in modern columnar formats, curated from research papers and existing community efforts. The project's motivation comes from file-format research, where consistent test corpora are needed to compare encoding, compression, and layout choices on real-world inputs. Beyond file-format research, we see broader value in providing a community-curated set of real-world data, as we expect it to be useful for other tasks such as analytical benchmarking and model evaluation.\n\n\u003e ⚠ **Third-party data.** Raincloud fetches data from URLs declared in `sources.json`. Those bytes come from upstream sources, not from us — we don't audit, host, or redistribute them. See [`DISCLAIMER.md`](DISCLAIMER.md) for the AS IS posture, content / license / supply-chain disclaimers, and the dataset-removal channel.\n\nThe repo is driven by a single manifest — `sources.json` — which declares:\n\n1. **Where to fetch data from.** Each entry names an upstream source (HTTP, Kaggle, Hugging Face) and the URLs to pull.\n2. **How to transform it.** Each entry names a parser (`csv`, `parquet`, `jsonl`, `xml`, `pbf`, `custom`) and a transform handler that converts the raw bytes into one or more typed Arrow tables.\n3. **Where it lands.** Transformed tables are written to `outputs/v{schema_version}/\u003cslug\u003e/parquet/\u003cslug\u003e.parquet`, with the optional Vortex sibling at `outputs/v{schema_version}/\u003cslug\u003e/vortex/\u003cslug\u003e.vortex`. The per-format subdirectory leaves room for additional artifact tiers (e.g. 
`parquet-hydrated/`) without filename collisions.\n\nNothing downstream of `sources.json` is hand-maintained; `docs/datasets.md` and `docs/handlers.md` are derived artefacts regenerated after each build. Column-level / type-coverage / vortex-skip / hydration-candidate views are queryable via `list_datasets` flags and the TUI rather than markdown.\n\n## Getting started\n\n**Browse the catalog at a glance** — sortable columns, parquet/vortex presence per slug, no builds required:\n\n```bash\nuv sync --extra tui --inexact\npython -m scripts.pipeline.browse\n```\n\nA read-only Textual TUI over `sources.json`. Click any column header to sort; right pane shows description, license, fetch URL, and on-disk state for the highlighted slug. Press `q` to quit.\n\n**Tell Raincloud which dataset you want; get back a Parquet + Vortex file on disk.**\n\n```bash\nuv sync --inexact\npython -m scripts.pipeline.status --fast --missing-only   # read-only env check\npython -m scripts.pipeline.build countries-of-the-world\n```\n\n```\noutputs/v1/countries-of-the-world/parquet/countries-of-the-world.parquet\noutputs/v1/countries-of-the-world/vortex/countries-of-the-world.vortex\n```\n\nThe command runs every pipeline stage — fetch, extract, parse, transform, write, validate, convert — and leaves both a Parquet file and its converted Vortex sibling under per-format subdirectories of `outputs/v1/\u003cslug\u003e/`.\n\n### Discover\n\nRun `python -m scripts.pipeline.browse` and click the **Encoding** preset. The left panel filters by domain, size, shape traits, license, or fetch type; click a slug to see its description, on-disk state, and per-column profile (run `python -m scripts.pipeline.profile \u003cslug\u003e` first to populate it).\n\nHeadless? 
The same axes are flags on `list_datasets`:\n\n```bash\npython -m scripts.pipeline.list_datasets --view encoding --long\npython -m scripts.pipeline.list_datasets --tag geospatial\npython -m scripts.pipeline.list_datasets --trait has_nested --size m\npython -m scripts.pipeline.list_datasets --inspect clickbench-hits\npython -m scripts.pipeline.list_datasets --tags-help\n```\n\nThe curated-picks header at the top of [`docs/v1/datasets.md`](docs/v1/datasets.md) groups slugs into two editorial tiers (Encoding, Stress) drawn from each spec's `showcase` field — pick any slug and pass it to `build` the same way. Examples spanning the size range:\n\n```bash\npython -m scripts.pipeline.build uci-seeds                  # 210 rows, ~200 ms\npython -m scripts.pipeline.build clickbench-hits            # 100 M rows, ~10 GB parquet\n```\n\n### Upstream-specific extras\n\n157 of 249 manifest entries fetch from direct HTTPS endpoints and need no additional setup. The rest:\n\n```bash\n# Kaggle-hosted (33 slugs). One-time credential setup:\nuv sync --extra kaggle --inexact\nmkdir -p ~/.kaggle \u0026\u0026 mv /path/to/kaggle.json ~/.kaggle/ \u0026\u0026 chmod 600 ~/.kaggle/kaggle.json\n\n# Hugging Face-hosted (59 slugs):\nuv sync --extra huggingface --inexact\n\n# Everything:\nuv sync --extra all --inexact\n```\n\nThe `--inexact` flag matters: by default `uv sync` removes any extras you installed previously. Pass it on every `uv sync` so the tui / kaggle / huggingface / dev extras accumulate instead of overwriting each other.\n\n## For AI coding agents\n\nIf you're an AI coding agent landing in this repo:\n\n1. Read [`AGENTS.md`](AGENTS.md) (auto-loaded from `CLAUDE.md → AGENTS.md`) for the invariants and architecture.\n2. Run `python -m scripts.pipeline.status --fast --missing-only` to verify the env, then `python -m scripts.pipeline.validate_manifest` to confirm `sources.json` is well-formed. Both are sub-second and side-effect-free.\n3. 
Run `pytest` (after `uv sync --extra dev --inexact`) for a regression net before any non-trivial change to the manifest, schema, or handler registry.\n4. For catalog questions (\"which slugs use handler X\", \"what's CC0-licensed\"), use `python -m scripts.pipeline.list_datasets` rather than grepping `sources.json` or scrolling [`docs/v1/datasets.md`](docs/v1/datasets.md).\n5. Copy-pasteable templates for new manifest entries and streaming handlers live in [`examples/`](examples/).\n6. Harnesses that follow the [Agent Skills](https://agentskills.io) standard get 16 invokable skills under [`.agents/skills/`](.agents/skills/) (the `.claude → .agents` symlink means Claude Code sees the same files). Tracked safe-default permissions in [`.agents/settings.json`](.agents/settings.json) — see [`.agents/README.md`](.agents/README.md) for the full layout.\n\n## Repository layout\n\n```\nsources.json                    # the manifest — one DatasetSpec per dataset\nsources.schema.md               # human-friendly schema reference\nsources.schema.json             # machine-readable JSON Schema (Draft 2020-12)\nAGENTS.md                       # invariants + first-contact guide for AI coding agents (CLAUDE.md → AGENTS.md)\nSKILLS.md                       # narrative playbooks\nHYDRATING.md                    # hand-maintained hydration policy / philosophy\nDISCLAIMER.md                   # AS IS posture, content/license disclaimers, dataset-removal reporting\nscripts/\n  pipeline/\n    build.py                    # orchestrator — ties the 7 stages together\n    fetch.py                    # stage 1: download raw bytes\n    custom_fetch.py             # named custom-fetch helpers (fetch.type = \"custom\")\n    extract.py                  # stage 2: unpack archives into _workdir/\n    parse.py                    # stage 3: read raw files into Arrow tables\n    transform.py                # stage 4: dispatch to named handler\n    write.py                    # stage 5: emit parquet\n  
  validate.py                 # stage 6: assert rows / schema_hash\n    convert.py                  # stage 7 (optional): emit sibling .vortex per spec's convert.vortex flag\n    hydrate.py                  # stage 8 (optional, opt-in): dereference URL columns into parquet-hydrated/\n    docs.py                     # regenerate docs/datasets.md + handlers.md (other catalog views live in list_datasets / TUI)\n    tighten_variant.py          # in-place JSON → VARIANT pass\n    validate_manifest.py        # static checks on sources.json (schema + cross-checks)\n    list_datasets.py            # filter/list slugs by handler / license / tag / size / etc.\n    status.py                   # per-slug filesystem state report\n    browse.py                   # interactive Textual TUI over sources.json (requires --extra tui)\n    spec.py                     # manifest loader, path helpers, duckdb_connect\n    handlers/                   # named transform handlers\ntests/                          # pytest smoke suite (manifest, schema, handler registry, examples)\nexamples/                       # copy-pasteable templates (minimal_spec.json, streaming_handler.py.tmpl)\n.agents/                        # tracked agent allow-list (settings.json) + 16 invokable skills (.claude → .agents)\noutputs/\n  raw_downloads/\u003cslug\u003e/         # stage 1 output — unversioned, cached\n  v{schema_version}/\u003cslug\u003e/     # stage 5 output — version-scoped\ndocs/\n  datasets.md                   # auto-generated index (one row per dataset)\n  handlers.md                   # auto-generated registry view (purpose, streaming, extra deps, usage)\n  v{schema_version}/            # tracked canonical snapshot of datasets.md + handlers.md\n_workdir/\u003cslug\u003e/                # stage 2 scratch space (gitignored)\n```\n\n## Running the pipeline\n\n```bash\n# Build a single dataset\npython -m scripts.pipeline.build \u003cslug\u003e\n\n# Build several datasets at once\npython -m 
scripts.pipeline.build uci-iris uci-seeds uci-wine-quality\n\n# Build every dataset in the manifest\npython -m scripts.pipeline.build --all\n\n# Loosen validation (warn instead of error on row-count drift)\npython -m scripts.pipeline.build \u003cslug\u003e --loose\n\n# Regenerate the derived docs after a build\npython -m scripts.pipeline.docs            # both files (datasets.md + handlers.md)\npython -m scripts.pipeline.docs datasets   # just datasets.md\npython -m scripts.pipeline.docs handlers   # just handlers.md\n\n# Catalog views that used to be markdown live as list_datasets flags now\npython -m scripts.pipeline.list_datasets --columns [\u003cslug\u003e...] [--column-grep PATTERN]\npython -m scripts.pipeline.list_datasets --coverage [--source parquet|vortex]\npython -m scripts.pipeline.list_datasets --no-vortex --json    # vortex-skip slugs + reasons\npython -m scripts.pipeline.list_datasets --hydrate --long      # hydration candidates\n\n# Post-process: promote JSON-annotated string columns to VARIANT in-place\npython -m scripts.pipeline.tighten_variant            # every built parquet\npython -m scripts.pipeline.tighten_variant \u003cslug\u003e...  # specific slugs\n\n# Emit a sibling .vortex for every spec that opts in via convert.vortex: true\npython -m scripts.pipeline.convert                    # respects per-spec flag\npython -m scripts.pipeline.convert \u003cslug\u003e...\n\n# Optional stage 8: dereference URL columns into parquet-hydrated/\u003cslug\u003e.parquet.\n# Off by default — only for slugs that opted in via the `hydrate` block.\n# Safety filter ON by default; bypass requires --unsafe-allow-all-domains\n# AND --i-accept-the-risk. 
See HYDRATING.md for the full discussion.\npython -m scripts.pipeline.hydrate \u003cslug\u003e             # one slug\npython -m scripts.pipeline.hydrate \u003cslug\u003e --limit 100 # first N rows (recommended for first run)\npython -m scripts.pipeline.hydrate --all              # every spec with hydrate\n\n# Read-only inspection / triage\npython -m scripts.pipeline.status --fast --missing-only       # filesystem state across the manifest\npython -m scripts.pipeline.validate_manifest                  # static checks on sources.json (schema + cross-checks)\npython -m scripts.pipeline.list_datasets --handler uci_default # filter the catalog without grepping JSON\npython -m scripts.pipeline.list_datasets --handler tighten_types --long\npython -m scripts.pipeline.list_datasets --grep '\\bgeo' --long\npython -m scripts.pipeline.browse                             # interactive TUI (requires --extra tui)\n\n# Run the test suite (sub-second, no fetch / no build)\nuv sync --extra dev --inexact \u0026\u0026 pytest\n```\n\nEach stage is independently invokable — e.g. `python -m scripts.pipeline.fetch \u003cslug\u003e` to download raw bytes without running the rest. Stages are idempotent: fetch skips when `expected_bytes`/`expected_sha256` already matches on disk, write skips when the output parquet is already current.\n\n### DuckDB resource limits\n\nEvery DuckDB connection in the pipeline goes through `scripts.pipeline.spec.duckdb_connect`, which reads these optional env vars before opening:\n\n| Env var | Effect |\n|---|---|\n| `RAINCLOUD_DUCKDB_MEMORY_LIMIT` | Caps DuckDB's working set (`memory_limit` setting). E.g. `8GB`, `512MB`. DuckDB spills to disk once the limit is hit. Unset = DuckDB's default (~80% of system RAM), which can be a problem on shared hosts or CI runners. |\n| `RAINCLOUD_DUCKDB_THREADS` | Caps the thread pool. Integer. |\n| `RAINCLOUD_DUCKDB_TEMP_DIRECTORY` | Where DuckDB spills intermediate batches. Default = system tempdir. 
Point at a larger volume if the system tempdir runs out of room on a big build. |\n\nExample:\n\n```bash\nRAINCLOUD_DUCKDB_MEMORY_LIMIT=8GB \\\nRAINCLOUD_DUCKDB_TEMP_DIRECTORY=/mnt/scratch/duckdb-tmp \\\n  python -m scripts.pipeline.build jsonbench-bluesky-100m\n```\n\nPersistent DuckDB databases are opened with `storage_compatibility_version=v1.5.0` automatically (required for VARIANT columns).\n\n## The manifest (`sources.json`)\n\nSee [`sources.schema.md`](sources.schema.md) for the human-friendly reference and [`sources.schema.json`](sources.schema.json) for the machine-readable JSON Schema (Draft 2020-12). After editing the manifest, run\n\n```bash\npython -m scripts.pipeline.validate_manifest\n```\n\nto catch typo'd handler names, slug collisions, and shape errors in under a second before paying for a fetch.\n\nA minimal entry looks like:\n\n```jsonc\n{\n  \"slug\": \"clickbench-hits\",\n  \"short_name\": \"ClickBench Hits\",\n  \"license\": { \"spdx\": \"Apache-2.0\", \"source_url\": \"...\" },\n  \"fetch\":     { \"type\": \"http\", \"urls\": [\"https://datasets.clickhouse.com/hits_compatible/hits.parquet\"] },\n  \"extract\":   { \"type\": \"passthrough\" },\n  \"parse\":     { \"reader\": \"parquet\" },\n  \"transform\": { \"handler\": \"identity\" },\n  \"write\":     { \"output\": \"clickbench-hits.parquet\", \"compression\": \"zstd\" },\n  \"expect\":    { \"rows\": 99997497 }\n}\n```\n\nCurrent counts:\n\n- **Fetch types in use:** `http`, `kaggle`, `huggingface`.\n- **Parse readers in use:** `csv`, `parquet`, `jsonl`, `xml`, `pbf`, `custom`.\n- **Schema version:** 1 — outputs land in `outputs/v1/`.\n\n## Output layout\n\nEach dataset produces exactly one parquet per output slug (some handlers split one source into multiple outputs — `glove_split` → 3, `osm_pbf_split` → 3, `stack_exchange_split` → N). 
Within each slug directory, artefacts live in per-format subdirectories so additional tiers (Vortex sibling, future hydrated copies, partitioned variants) can coexist without filename collisions:\n\n```\noutputs/v1/clickbench-hits/parquet/clickbench-hits.parquet\noutputs/v1/clickbench-hits/vortex/clickbench-hits.vortex\noutputs/v1/glove-6b-50d/parquet/glove-6b-50d.parquet\noutputs/v1/glove-6b-50d/vortex/glove-6b-50d.vortex\noutputs/v1/osm-germany-nodes/parquet/osm-germany-nodes.parquet\n...\n```\n\nRaw downloads are cached separately and *not* version-scoped, since the same upstream bytes can feed any schema_version:\n\n```\noutputs/raw_downloads/clickbench-hits/hits.parquet\noutputs/raw_downloads/glove-6b-50d/glove.6B.zip   # hardlinked into sibling slugs\noutputs/raw_downloads/osm-germany-nodes/germany-latest.osm.pbf\n```\n\nSibling slugs sharing the same upstream URL (GloVe 50d/100d/200d; OSM Germany nodes/ways/relations) are deduped via hardlink during fetch.\n\n## Parquet type coverage\n\nThe manifest is curated to exercise a broad range of Parquet logical and nested types, including:\n\n- `VARIANT` — `countries-of-the-world` (227 country JSON blobs), `jsonbench-bluesky-100m` (100M Bluesky firehose records).\n- GeoParquet 1.1 with WKB geometry — `osm-germany-{nodes,ways,relations}`.\n- `fixed_size_list\u003cfloat32, N\u003e` — GloVe embeddings, dbpedia 1536-dim OpenAI embeddings.\n- `list\u003c...\u003e` with tightened element types — e.g. 
`list\u003cuint32\u003e` for Hacker News kids/parts.\n- Nested `struct` / `list\u003cstruct\u003e` / `map\u003cstring, int64\u003e` — Wikipedia Structured Contents.\n- Timestamp precision narrowing (`ns → ms` where every value is a whole second).\n- UUID and JSON logical-type annotations on string columns.\n- `DECIMAL(P, S)` where every double value round-trips losslessly through the chosen precision.\n\nType-tightening is idempotent — `tighten_types` can be re-run against any parquet in `outputs/v*/` without regressing the widths.\n\n## Derived docs\n\n- [`docs/datasets.md`](docs/datasets.md) — one row per dataset with short/full name, description, source URL, data kind, license, row count, row group count, and file size. Regenerate: `python -m scripts.pipeline.docs datasets`.\n- [`docs/handlers.md`](docs/handlers.md) — one row per registered transform handler with its one-line purpose, streaming flag, **format-specific deps it imports** (e.g. `pandas`, `openpyxl`, `pyreadstat`, `osmium`, `zstandard`, `unlzw3` — pyarrow / numpy / duckdb suppressed as core), manifest-spec usage count, and example slugs. Useful when picking a handler for a new dataset, finding precedent for a given upstream shape, or knowing which extras a new manifest entry will pull in. Regenerate: `python -m scripts.pipeline.docs handlers`.\n\nBoth are machine-generated; do not hand-edit. `python -m scripts.pipeline.docs` with no args refreshes both.\n\n[`HYDRATING.md`](HYDRATING.md) is **hand-maintained** policy / philosophy for the optional hydrate stage — preamble only, no auto-generated per-slug list.\n\nOther catalog views (column index, type coverage, vortex-skip list, hydration candidates) used to be auto-generated markdown. They moved out of the markdown layer because the multi-megabyte indexes were unscannable as a reading experience and duplicated state already queryable. 
They're now flags on `list_datasets`:\n\n```bash\npython -m scripts.pipeline.list_datasets --columns [\u003cslug\u003e...] [--column-grep PATTERN]\npython -m scripts.pipeline.list_datasets --coverage [--source parquet|vortex]\npython -m scripts.pipeline.list_datasets --no-vortex --json    # vortex-skip slugs + reasons\npython -m scripts.pipeline.list_datasets --hydrate --long      # hydration candidates\n```\n\nOr use `python -m scripts.pipeline.browse` for an interactive view.\n\n## Contributing\n\nBug reports, feature requests, and PRs are welcome. See\n[`CONTRIBUTING.md`](CONTRIBUTING.md) for dev-environment setup, the pre-PR\ncheck sequence, and pointers into [`SKILLS.md`](SKILLS.md) for the most common\nchange types (new dataset, new handler). Notable changes land in\n[`CHANGELOG.md`](CHANGELOG.md).\n\n## Security\n\nPlease report vulnerabilities privately rather than via a public issue — see\n[`SECURITY.md`](SECURITY.md) for the disclosure channel and timelines.\n\n## Disclaimers\n\n[`DISCLAIMER.md`](DISCLAIMER.md) covers Raincloud's posture on third-party\ndatasets: AS IS warranty disclaimer, content and association disclaimer\n(any fetched file may contain questionable or offensive material — we\ndon't audit upstream content), license diligence, supply-chain risk, and\nthe process for requesting that a dataset be removed from `sources.json`.\n\n## License\n\nRaincloud is licensed under the [Apache License 2.0](LICENSE). 
Each dataset\ndeclared in `sources.json` carries its own upstream license under the\n`license.spdx` field; those licenses govern redistribution of any Parquet /\nVortex artefact built against that upstream and are independent of the\nlicense covering the pipeline code itself.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspiraldb%2Fraincloud","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fspiraldb%2Fraincloud","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspiraldb%2Fraincloud/lists"}