https://github.com/platob/yggdrasil
- Host: GitHub
- URL: https://github.com/platob/yggdrasil
- Owner: Platob
- License: apache-2.0
- Created: 2025-11-29T09:47:43.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2026-05-30T21:51:52.000Z (20 days ago)
- Last Synced: 2026-05-30T22:03:54.521Z (20 days ago)
- Topics: arrow, data, databricks, pandas, polars, spark, sql
- Language: Python
- Homepage: https://platob.github.io/Yggdrasil/
- Size: 17 MB
- Stars: 3
- Watchers: 0
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
- Agents: AGENTS.md
README
# Yggdrasil
**Schema-aware data interchange for Python.** One conversion registry that moves values cleanly between Python types, dataclasses, Arrow, Polars, pandas, Spark, Databricks, and the wire, without losing schema, nullability, or metadata along the way.
| Package | What it is | Where it lives |
|---|---|---|
| `ygg` (PyPI) / `yggdrasil` (import) | Pure-Python core: cast registry, Arrow schema, engine bridges, IO/HTTP, Databricks, FastAPI | [`python/`](python/) |
| Power Query connector | Excel `.pq` and Power BI `.mez` connectors that call the FastAPI service | [`powerquery/`](powerquery/) |
**Docs site:** https://platob.github.io/Yggdrasil/
---
## Install
```bash
pip install ygg # core
pip install "ygg[data]" # + pandas, numpy, sqlglot
pip install "ygg[bigdata]" # + pyspark, delta-spark
pip install "ygg[databricks]" # + databricks-sdk
pip install "ygg[api]" # + fastapi, uvicorn, pydantic
pip install "ygg[http]" # + urllib3, xxhash
pip install "ygg[pickle]" # + cloudpickle, dill, zstandard, blake3
pip install "ygg[mongo]" # + mongoengine
pip install "ygg[postgres]" # + psycopg, adbc-driver-postgresql
pip install "ygg[kafka]" # + confluent-kafka
pip install "ygg[delta]" # + deltalake
```
The only hard runtime deps are `pyarrow>=20` and `polars>=1.3`. Everything else is opt-in.
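
One common way to keep a base install light is to defer optional imports and fail with an actionable message only when the feature is actually used. The block below is a minimal stdlib sketch of that pattern, not the guard that `ygg` ships; `optional_import` and its `extra` argument are hypothetical names introduced for illustration.

```python
import importlib
from types import ModuleType


def optional_import(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, or raise with an install hint.

    Hypothetical helper for illustration; not part of the ygg API.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f'Install it with: pip install "ygg[{extra}]"'
        ) from exc


# A stdlib module always resolves, so this call succeeds.
json = optional_import("json", extra="data")
print(json.dumps({"ok": True}))

# A missing module produces the install hint instead of a bare ImportError.
try:
    optional_import("surely_not_installed_module_xyz", extra="data")
except ImportError as err:
    print(err)
```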
---
## 60-second tour
### Cast anything into anything
```python
from yggdrasil.data.cast.registry import convert
convert("42", int) # 42
convert("true", bool) # True
convert("2024-01-15", "date") # datetime.date(2024, 1, 15)
```
### Dict → typed dataclass (forgiving on input, strict on meaning)
```python
from dataclasses import dataclass
from yggdrasil.data.cast.registry import convert
@dataclass
class Order:
    id: int
    amount: float
    paid: bool = False
convert({"id": "7", "amount": "99.50", "paid": "yes"}, Order)
# Order(id=7, amount=99.5, paid=True)
```
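
To make "forgiving on input, strict on meaning" concrete, here is a minimal stdlib sketch that coerces a dict into a dataclass field by field. It is not how `convert` is implemented; `coerce_to_dataclass`, `_coerce`, and the accepted boolean spellings are assumptions made for illustration.

```python
from dataclasses import MISSING, dataclass, fields
from typing import Any, get_type_hints

# Assumed spellings; the real library may accept a different set.
_TRUE = {"true", "yes", "y", "1"}
_FALSE = {"false", "no", "n", "0"}


def _coerce(value: Any, target: type) -> Any:
    """Coerce one value to a simple target type (int, float, bool, str)."""
    if target is bool:
        if isinstance(value, bool):
            return value
        text = str(value).strip().lower()
        if text in _TRUE:
            return True
        if text in _FALSE:
            return False
        # Strict on meaning: refuse values that are not clearly boolean.
        raise ValueError(f"cannot interpret {value!r} as bool")
    return target(value)


def coerce_to_dataclass(payload: dict, cls: type) -> Any:
    """Build `cls` from `payload`, coercing each field to its annotation."""
    hints = get_type_hints(cls)
    kwargs = {}
    for f in fields(cls):
        if f.name in payload:
            kwargs[f.name] = _coerce(payload[f.name], hints[f.name])
        elif f.default is MISSING and f.default_factory is MISSING:
            raise KeyError(f"missing required field {f.name!r}")
    return cls(**kwargs)


@dataclass
class Order:
    id: int
    amount: float
    paid: bool = False


order = coerce_to_dataclass({"id": "7", "amount": "99.50", "paid": "yes"}, Order)
print(order)  # Order(id=7, amount=99.5, paid=True)
```

The sketch accepts loose string input but raises on an ambiguous value such as `"maybe"` for a boolean field, instead of silently falling back to Python truthiness.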
### Arrow schema as the contract surface
```python
import yggdrasil.arrow as pa
from yggdrasil.arrow.cast import cast_arrow_tabular
from yggdrasil.data.cast.options import CastOptions
raw = pa.table({"id": ["1", "2"], "score": ["9.1", "8.7"]})
target = pa.schema([
pa.field("id", pa.int64(), nullable=False),
pa.field("score", pa.float64(), nullable=False),
])
out = cast_arrow_tabular(raw, CastOptions(target_field=target))
print(out.schema)
```
### Cross-engine in one move
```python
from yggdrasil.databricks import DatabricksClient
stmt = DatabricksClient().sql.execute("SELECT * FROM main.default.orders LIMIT 100")
stmt.to_arrow_table() # pyarrow.Table
stmt.to_pandas() # pandas.DataFrame
stmt.to_polars() # polars.DataFrame
stmt.to_spark() # pyspark.sql.DataFrame
stmt.to_pylist() # list[dict]
```
---
## What you get
- **One conversion registry.** Register a converter once, dispatch from anywhere. Order: exact match → identity → `Any` wildcard → MRO fallback → one-hop composition.
- **Arrow schema as the contract.** Field names, order, nullability, metadata, nested structure, timezone intent are preserved across boundaries.
- **Engines bridge into Arrow.** Polars, pandas, and Spark each register on import, for example `from yggdrasil.polars.cast import cast_polars_dataframe`.
- **Production HTTP stack.** `HTTPSession`, prepared requests, batch dispatch, typed response → Arrow/pandas/Polars/Spark.
- **Databricks toolkit.** `DatabricksClient` covers SQL, Unity Catalog, Compute, DBFS/Volumes, Secrets, IAM, Genie, Spark Connect.
- **Optional dep guards.** Base installs stay light. `from yggdrasil.polars.lib import polars` is the safe import.
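
The dispatch order listed for the conversion registry can be sketched with a small lookup table keyed by `(source, target)` type pairs. This is a toy model under stated assumptions, not the registry in `yggdrasil.data.cast.registry`; the `Registry` class and its `register`, `find`, and `convert` methods are hypothetical.

```python
from typing import Any, Callable, Optional

Converter = Callable[[Any], Any]


class Registry:
    """Toy converter registry illustrating a five-step dispatch order."""

    def __init__(self) -> None:
        self._table = {}  # maps (source, target) -> converter

    def register(self, source: Any, target: type):
        def decorator(fn: Converter) -> Converter:
            self._table[(source, target)] = fn
            return fn
        return decorator

    def find(self, source: type, target: type) -> Optional[Converter]:
        # 1. Exact (source, target) match.
        fn = self._table.get((source, target))
        if fn is not None:
            return fn
        # 2. Identity: the value already has the target type.
        if source is target:
            return lambda value: value
        # 3. `Any` wildcard registered for this target.
        fn = self._table.get((Any, target))
        if fn is not None:
            return fn
        # 4. MRO fallback: try converters registered for base classes.
        for base in source.__mro__[1:]:
            fn = self._table.get((base, target))
            if fn is not None:
                return fn
        # 5. One-hop composition: source -> mid -> target.
        for (src, mid), first in self._table.items():
            if src is source:
                second = self._table.get((mid, target))
                if second is not None:
                    return lambda value, a=first, b=second: b(a(value))
        return None

    def convert(self, value: Any, target: type) -> Any:
        fn = self.find(type(value), target)
        if fn is None:
            raise TypeError(f"no converter from {type(value).__name__} to {target.__name__}")
        return fn(value)


registry = Registry()


@registry.register(str, int)
def str_to_int(value: str) -> int:
    return int(value.strip())


@registry.register(int, float)
def int_to_float(value: int) -> float:
    return float(value)


@registry.register(Any, str)
def any_to_str(value: Any) -> str:
    return str(value)


print(registry.convert("42", int))     # exact match: 42
print(registry.convert(5, int))        # identity: 5
print(registry.convert(3.5, str))      # Any wildcard: 3.5
print(registry.convert(True, float))   # MRO fallback (bool -> int): 1.0
print(registry.convert(" 7 ", float))  # one-hop (str -> int -> float): 7.0
```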
---
## Performance
The cast registry, schema layer, and engine bridges are all tuned for the hot path: type checks, equality, hash, projection, and same-shape merges live in the nanosecond range, so per-batch overhead stays negligible.
Run the benchmark sweep locally:
```bash
cd python
python benchmarks/run_all.py --repeat 5
```
Benches are organized to mirror the source tree:
- [`benchmarks/data/`](python/benchmarks/data): `Field`, `DataType`, cast registry, equality, merge, options
- [`benchmarks/dataclasses/`](python/benchmarks/dataclasses): `ExpiringDict`, `WaitingConfig`, pickle helpers
- [`benchmarks/concurrent/`](python/benchmarks/concurrent): `Job`, `JobResult`, `JobPoolExecutor`, `ThreadJob`
- [`benchmarks/io/`](python/benchmarks/io): `URL`, `Headers`, `BytesIO`, `Memory`, paths, primitive + nested leaves
- [`benchmarks/databricks/`](python/benchmarks/databricks): Databricks-specific code paths (live)
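
For a rough sense of what "nanosecond range" means per call, the stdlib `timeit` module can time a small operation such as a dict lookup keyed by a type pair. This is a generic sketch, not one of the repository's benchmarks; `per_call_ns` and the contents of `table` are hypothetical, and results vary by machine.

```python
import timeit


def per_call_ns(func, number: int = 200_000, repeat: int = 5) -> float:
    """Return the best observed time per call, in nanoseconds."""
    timings = timeit.repeat(func, number=number, repeat=repeat)
    # The minimum is the least noisy estimate of the per-call cost.
    return min(timings) / number * 1e9


# Hypothetical dispatch table keyed by (source, target) type pairs.
table = {(str, int): int, (int, float): float}

lookup_ns = per_call_ns(lambda: table[(str, int)])
print(f"dict lookup: {lookup_ns:.0f} ns per call")
```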
---
## Use cases at a glance
| You want to… | Reach for |
|---|---|
| Normalize dicts/JSON into typed dataclasses | `convert(payload, MyDataclass)` |
| Pin a downstream Arrow schema | `cast_arrow_tabular(t, CastOptions(target_field=schema))` |
| Convert Polars ↔ Arrow ↔ pandas ↔ Spark | `yggdrasil.{polars,pandas,spark}.cast` |
| Fan out HTTP requests with retries | `HTTPSession().send_many(reqs, SendManyConfig(...))` |
| Run SQL on Databricks and get a DataFrame | `DatabricksClient().sql.execute(q).to_polars()` |
| Read/write DBFS or Volume files | `DatabricksClient().dbfs_path("...").write_text(...)` |
| Type-check job widget params | `MyConfig.from_environment()` (subclass `NotebookConfig`) |
| Talk to Databricks from Excel/Power BI | Power Query connector via FastAPI service |
---
## Repository guide
- [`python/`](python/): `ygg` source, tests, MkDocs site.
- [`python/README.md`](python/README.md): package guide with progressive examples (scalars → schema → engines → HTTP → Databricks).
- [`python/docs/`](python/docs/): published documentation source (https://platob.github.io/Yggdrasil/).
- [`powerquery/`](powerquery/): Excel `.pq` and Power BI `.mez` connectors over the FastAPI service.
- [`AGENTS.md`](AGENTS.md): house style, error-message tone, comment voice, API ergonomics.
- [`CLAUDE.md`](CLAUDE.md): agent-facing notes for AI contributors.
---
## Develop locally
```bash
git clone https://github.com/Platob/Yggdrasil.git
cd Yggdrasil/python
uv venv --seed .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
uv pip install -e ".[dev]" # core + dev tooling (quoted so zsh does not expand the brackets)
```
```bash
cd python
pytest # full suite
pytest tests/test_yggdrasil/test_io/test_url.py # one file
ruff check
black .
mkdocs serve # docs at http://127.0.0.1:8000
```
Databricks live-integration tests are gated by the `integration` marker and skipped unless `DATABRICKS_HOST` is set.
---
## Release pipeline
The version in [`python/pyproject.toml`](python/pyproject.toml) is the single source of truth.
| Workflow | Builds | Triggers |
|---|---|---|
| [`publish.yml`](.github/workflows/publish.yml) | `ygg` sdist + pure-Python wheel → PyPI, then tags `vX.Y.Z` and cuts a GitHub Release | push to `main` touching `python/src/**`, `pyproject.toml`, README, LICENSE, or workflow itself |
| [`docs.yml`](.github/workflows/docs.yml) | MkDocs Material site → GitHub Pages (https://platob.github.io/Yggdrasil/) | push to `main` touching `python/docs/**`, `python/src/**`, `mkdocs.yml`, or workflow itself |
Do not push to `main` from an agent session; develop on a branch and open a PR.
---
## License
[Apache-2.0](LICENSE).