https://github.com/platob/yggdrasil

Host: GitHub
URL: https://github.com/platob/yggdrasil
Owner: Platob
License: apache-2.0
Created: 2025-11-29T09:47:43.000Z (7 months ago)
Default Branch: main
Last Pushed: 2026-05-30T21:51:52.000Z (about 1 month ago)
Last Synced: 2026-05-30T22:03:54.521Z (about 1 month ago)
Topics: arrow, data, databricks, pandas, polars, spark, sql
Language: Python
Homepage: https://platob.github.io/Yggdrasil/
Size: 17 MB
Stars: 3
Watchers: 0
Forks: 1
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
- Agents: AGENTS.md
Awesome Lists containing this project

README

# Yggdrasil

**Schema-aware data interchange for Python.** One conversion registry that moves values cleanly between Python types, dataclasses, Arrow, Polars, pandas, Spark, Databricks, and the wire — without losing schema, nullability, or metadata along the way.

| Package | What it is | Where it lives |
|---|---|---|
| `ygg` (PyPI) / `yggdrasil` (import) | Pure-Python core: cast registry, Arrow schema, engine bridges, IO/HTTP, Databricks, FastAPI | [`python/`](python/) |
| Power Query connector | Excel `.pq` and Power BI `.mez` connectors that call the FastAPI service | [`powerquery/`](powerquery/) |

📚 **Docs site:** https://platob.github.io/Yggdrasil/

---

## Install

```bash
pip install ygg                   # core
pip install "ygg[data]"           # + pandas, numpy, sqlglot
pip install "ygg[bigdata]"        # + pyspark, delta-spark
pip install "ygg[databricks]"     # + databricks-sdk
pip install "ygg[api]"            # + fastapi, uvicorn, pydantic
pip install "ygg[http]"           # + urllib3, xxhash
pip install "ygg[pickle]"         # + cloudpickle, dill, zstandard, blake3
pip install "ygg[mongo]"          # + mongoengine
pip install "ygg[postgres]"       # + psycopg, adbc-driver-postgresql
pip install "ygg[kafka]"          # + confluent-kafka
pip install "ygg[delta]"          # + deltalake
```

The only hard runtime deps are `pyarrow>=20` and `polars>=1.3`. Everything else is opt-in.

---

## 60-second tour

### Cast anything into anything

```python
from yggdrasil.data.cast.registry import convert

convert("42", int)              # 42
convert("true", bool)           # True
convert("2024-01-15", "date")   # datetime.date(2024, 1, 15)
```

### Dict → typed dataclass (forgiving on input, strict on meaning)

```python
from dataclasses import dataclass

from yggdrasil.data.cast.registry import convert


@dataclass
class Order:
    id: int
    amount: float
    paid: bool = False


convert({"id": "7", "amount": "99.50", "paid": "yes"}, Order)
# Order(id=7, amount=99.5, paid=True)
```
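To make the idea of "forgiving on input, strict on meaning" concrete, here is a minimal standard-library sketch of field-by-field coercion. It is an illustration only and not how `convert` is implemented: `coerce_scalar`, `coerce_dataclass`, and the accepted boolean spellings are hypothetical names and choices made for this example.

```python
from dataclasses import MISSING, dataclass, fields

# Hypothetical boolean spellings; the library's accepted set may differ.
_TRUE = {"true", "yes", "y", "1", "on"}
_FALSE = {"false", "no", "n", "0", "off"}


def coerce_scalar(value, target):
    """Coerce one value to a simple target type such as int, float, or bool."""
    if target is bool:
        if isinstance(value, bool):
            return value
        text = str(value).strip().lower()
        if text in _TRUE:
            return True
        if text in _FALSE:
            return False
        # Strict on meaning: an unknown spelling is an error, not False.
        raise ValueError(f"cannot interpret {value!r} as bool")
    return target(value)


def coerce_dataclass(payload, cls):
    """Build `cls` from a dict, coercing each field to its annotated type.

    Assumes annotations are real types (not strings). Extra keys in
    `payload` are ignored in this sketch.
    """
    kwargs = {}
    for f in fields(cls):
        if f.name in payload:
            kwargs[f.name] = coerce_scalar(payload[f.name], f.type)
        elif f.default is MISSING and f.default_factory is MISSING:
            raise KeyError(f"missing required field {f.name!r}")
    return cls(**kwargs)


@dataclass
class Order:
    id: int
    amount: float
    paid: bool = False


order = coerce_dataclass({"id": "7", "amount": "99.50", "paid": "yes"}, Order)
print(order)  # Order(id=7, amount=99.5, paid=True)
```

Raising on an unrecognized boolean spelling, instead of falling back to `bool(value)`, is the design choice that keeps a string like `"no"` from silently becoming `True`.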

### Arrow schema as the contract surface

```python
import yggdrasil.arrow as pa
from yggdrasil.arrow.cast import cast_arrow_tabular
from yggdrasil.data.cast.options import CastOptions

raw = pa.table({"id": ["1", "2"], "score": ["9.1", "8.7"]})

target = pa.schema([
    pa.field("id",    pa.int64(),   nullable=False),
    pa.field("score", pa.float64(), nullable=False),
])

out = cast_arrow_tabular(raw, CastOptions(target_field=target))
print(out.schema)
```

### Cross-engine in one move

```python
from yggdrasil.databricks import DatabricksClient

stmt = DatabricksClient().sql.execute("SELECT * FROM main.default.orders LIMIT 100")

stmt.to_arrow_table()   # pyarrow.Table
stmt.to_pandas()        # pandas.DataFrame
stmt.to_polars()        # polars.DataFrame
stmt.to_spark()         # pyspark.sql.DataFrame
stmt.to_pylist()        # list[dict]
```

---

## What you get

- **One conversion registry.** Register a converter once, dispatch from anywhere. Dispatch order: exact match → identity → `Any` wildcard → MRO fallback → one-hop composition.

- **Arrow schema as the contract.** Field names, order, nullability, metadata, nested structure, and timezone intent are preserved across boundaries.

- **Engines bridge into Arrow.** Polars, pandas, and Spark each register their converters on import, for example `from yggdrasil.polars.cast import cast_polars_dataframe`.

- **Production HTTP stack.** `HTTPSession`, prepared requests, batch dispatch, typed response → Arrow/pandas/Polars/Spark.

- **Databricks toolkit.** `DatabricksClient` covers SQL, Unity Catalog, Compute, DBFS/Volumes, Secrets, IAM, Genie, Spark Connect.

- **Optional dep guards.** Base installs stay light. `from yggdrasil.polars.lib import polars` is the safe import.
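
The dispatch order listed in the first bullet can be sketched with a small standard-library registry. This is a simplified model under stated assumptions, not the library's code: the `Registry` class, its `register`, `find`, and `convert` methods, and the `(source, target)` dictionary key are all hypothetical names chosen for the example.

```python
from typing import Any, Callable


class Registry:
    """Toy converter registry keyed by (source type, target type)."""

    def __init__(self):
        self._converters: dict[tuple[Any, Any], Callable] = {}

    def register(self, source, target):
        def decorator(fn):
            self._converters[(source, target)] = fn
            return fn
        return decorator

    def find(self, source, target):
        # 1. Exact match on (source, target).
        fn = self._converters.get((source, target))
        if fn is not None:
            return fn
        # 2. Identity: the value already has the target type.
        if source is target:
            return lambda value: value
        # 3. `Any` wildcard registered for this target.
        fn = self._converters.get((Any, target))
        if fn is not None:
            return fn
        # 4. MRO fallback: a converter registered for a base class of source.
        for base in source.__mro__[1:]:
            fn = self._converters.get((base, target))
            if fn is not None:
                return fn
        # 5. One-hop composition: source -> mid -> target.
        for (src, mid), first in self._converters.items():
            if src is source:
                second = self._converters.get((mid, target))
                if second is not None:
                    return lambda value, f=first, s=second: s(f(value))
        raise LookupError(f"no converter from {source!r} to {target!r}")

    def convert(self, value, target):
        return self.find(type(value), target)(value)


registry = Registry()
registry.register(str, int)(int)
registry.register(int, float)(float)
registry.register(Any, str)(str)

print(registry.convert("42", int))     # exact match   -> 42
print(registry.convert(7, int))        # identity      -> 7
print(registry.convert(3.5, str))      # Any wildcard  -> 3.5
print(registry.convert(True, float))   # MRO (bool is an int) -> 1.0
print(registry.convert("42", float))   # one hop: str -> int -> float -> 42.0
```

Limiting composition to a single intermediate hop keeps lookups cheap and the chosen path predictable; a real registry would likely also cache resolved paths.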

---

## Performance

The cast registry, schema layer, and engine bridges are all tuned for the hot path — type checks, equality, hash, projection, and same-shape merges live in the nanosecond range so per-batch overhead stays negligible.

Run the benchmark sweep locally:

```bash
cd python
python benchmarks/run_all.py --repeat 5
```

Benches are organized to mirror the source tree:

- [`benchmarks/data/`](python/benchmarks/data) — `Field`, `DataType`, cast registry, equality, merge, options

- [`benchmarks/dataclasses/`](python/benchmarks/dataclasses) — `ExpiringDict`, `WaitingConfig`, pickle helpers

- [`benchmarks/concurrent/`](python/benchmarks/concurrent) — `Job`, `JobResult`, `JobPoolExecutor`, `ThreadJob`

- [`benchmarks/io/`](python/benchmarks/io) — `URL`, `Headers`, `BytesIO`, `Memory`, paths, primitive + nested leaves

- [`benchmarks/databricks/`](python/benchmarks/databricks) — Databricks-specific code paths (live)

---

## Use cases at a glance

| You want to… | Reach for |
|---|---|
| Normalize dicts/JSON into typed dataclasses | `convert(payload, MyDataclass)` |
| Pin a downstream Arrow schema | `cast_arrow_tabular(t, CastOptions(target_field=schema))` |
| Convert Polars ↔ Arrow ↔ pandas ↔ Spark | `yggdrasil.{polars,pandas,spark}.cast` |
| Fan out HTTP requests with retries | `HTTPSession().send_many(reqs, SendManyConfig(...))` |
| Run SQL on Databricks and get a DataFrame | `DatabricksClient().sql.execute(q).to_polars()` |
| Read/write DBFS or Volume files | `DatabricksClient().dbfs_path("...").write_text(...)` |
| Type-check job widget params | `MyConfig.from_environment()` (subclass `NotebookConfig`) |
| Talk to Databricks from Excel/Power BI | Power Query connector via FastAPI service |

---

## Repository guide

- [`python/`](python/) — `ygg` source, tests, MkDocs site.

  - [`python/README.md`](python/README.md) — package guide with progressive examples (scalars → schema → engines → HTTP → Databricks).

  - [`python/docs/`](python/docs/) — published documentation source (https://platob.github.io/Yggdrasil/).

- [`powerquery/`](powerquery/) — Excel `.pq` and Power BI `.mez` connectors over the FastAPI service.

- [`AGENTS.md`](AGENTS.md) — house style, error-message tone, comment voice, API ergonomics.

- [`CLAUDE.md`](CLAUDE.md) — agent-facing notes for AI contributors.

---

## Develop locally

```bash
git clone https://github.com/Platob/Yggdrasil.git
cd Yggdrasil/python
uv venv --seed .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
uv pip install -e ".[dev]"                     # core + dev tooling (quoted so zsh does not glob the brackets)
```

```bash
cd python
pytest                          # full suite
pytest tests/test_yggdrasil/test_io/test_url.py   # one file
ruff check
black .
mkdocs serve                    # docs at http://127.0.0.1:8000
```

Databricks live-integration tests are gated by the `integration` marker and skipped unless `DATABRICKS_HOST` is set.
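
For readers unfamiliar with this kind of gate, the sketch below shows one common way to wire it up in a `conftest.py`. It is an assumption about the general pattern, not a copy of this repository's configuration: `integration_enabled` is a hypothetical helper, while `pytest_collection_modifyitems` and `pytest.mark.skip` are standard pytest hooks and markers.

```python
import os


def integration_enabled(environ=None) -> bool:
    """Return True when a non-empty DATABRICKS_HOST is configured."""
    env = os.environ if environ is None else environ
    return bool(env.get("DATABRICKS_HOST"))


def pytest_collection_modifyitems(config, items):
    """Skip every test marked `integration` when no host is configured."""
    if integration_enabled():
        return
    # Imported lazily so the helper above can be reused outside a pytest run.
    import pytest

    skip = pytest.mark.skip(reason="DATABRICKS_HOST is not set")
    for item in items:
        if "integration" in item.keywords:
            item.add_marker(skip)
```

With a hook like this, a test decorated with `@pytest.mark.integration` is reported as skipped on machines without credentials and runs normally once `DATABRICKS_HOST` is exported.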

---

## Release pipeline

The version in [`python/pyproject.toml`](python/pyproject.toml) is the single source of truth.

| Workflow | Builds | Triggers |
|---|---|---|
| [`publish.yml`](.github/workflows/publish.yml) | `ygg` sdist + pure-Python wheel → PyPI, then tags `vX.Y.Z` and cuts a GitHub Release | push to `main` touching `python/src/**`, `pyproject.toml`, README, LICENSE, or the workflow itself |
| [`docs.yml`](.github/workflows/docs.yml) | MkDocs Material site → GitHub Pages (https://platob.github.io/Yggdrasil/) | push to `main` touching `python/docs/**`, `python/src/**`, `mkdocs.yml`, or the workflow itself |

Do not push to `main` from an agent session — develop on a branch and open a PR.

---

## License

[Apache-2.0](LICENSE).