https://github.com/lmmx/polars-schema-index

A Polars plugin for flattening nested data
https://github.com/lmmx/polars-schema-index

Last synced: 7 months ago
JSON representation

A Polars plugin for flattening nested data

Host: GitHub
URL: https://github.com/lmmx/polars-schema-index
Owner: lmmx
License: mit
Created: 2025-02-07T01:52:04.000Z (9 months ago)
Default Branch: master
Last Pushed: 2025-03-24T22:04:17.000Z (8 months ago)
Last Synced: 2025-03-24T23:19:40.582Z (8 months ago)
Language: Python
Homepage: https://pypi.org/project/polars-schema-index/
Size: 34.2 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-polars - polars-schema-index - Polars plugin for flattening nested data by [@lmmx](https://github.com/lmmx). (Libraries/Packages/Scripts / Polars plugins)

README

          # polars-schema-index

**A Polars plugin for flattening nested columns with stable numeric indexing.**

`polars-schema-index` provides a systematic way to explode/unnest nested Polars DataFrames (does not yet support LazyFrames) without overwriting columns that share the same name. It achieves this by:

- Attaching a custom `schema_index` namespace to your DataFrame.

- Renaming columns that do not end in digits with a numbered suffix.

- Iteratively flattening `Struct` columns (and optionally exploding `list[struct]` columns first), so every nested field becomes a separate top-level column.

## Installation

```bash

pip install polars-schema-index[polars]

```

On older CPUs run:

```python

pip install polars-schema-index[polars-lts-cpu]

```

## Usage

```python

import polars as pl

from polars_schema_index import flatten_nested_data

# Example: flatten a deeply nested JSON structure

df = pl.read_ndjson(

    source=b'''{

        "body": [

            {

                "type": "If",

                "test": {

                    "type": "Compare",

                    "left": {

                        "type": "Name",

                        "id": "x",

                        "ctx": { "type": "Load" }

                    },

                    "ops": [{ "type": "IsNot" }],

                    "comparators": [{ "type": "Constant", "value": null }]

                },

                "body": [{ "type": "Pass" }],

                "orelse": []

            }

        ],

        "type_ignores": []

    }

    '''.replace(b"\n", b"")

)

flattened = flatten_nested_data(df)

print(flattened)

```

This gives a DataFrame with all nested fields expanded into uniquely suffixed, monotonically

increasing numbered columns:

```python

┌────────────────┬────────┬────────────┬─────────┬───┬─────────┬──────────┬──────────┬─────────┐

│ type_ignores_1 ┆ type_2 ┆ orelse_5   ┆ type_6  ┆ … ┆ type_14 ┆ type_15  ┆ value_16 ┆ type_17 │

│ ---            ┆ ---    ┆ ---        ┆ ---     ┆   ┆ ---     ┆ ---      ┆ ---      ┆ ---     │

│ list[null]     ┆ str    ┆ list[null] ┆ str     ┆   ┆ str     ┆ str      ┆ null     ┆ str     │

╞════════════════╪════════╪════════════╪═════════╪═══╪═════════╪══════════╪══════════╪═════════╡

│ []             ┆ If     ┆ []         ┆ Compare ┆ … ┆ IsNot   ┆ Constant ┆ null     ┆ Load    │

└────────────────┴────────┴────────────┴─────────┴───┴─────────┴──────────┴──────────┴─────────┘

```

### What It Solves

- **No more silent overwrites** of common keys (like `"type"`) when unnesting.

- **Stable numeric suffixes** for each column, so even if you run multiple flatten passes, names remain unique.

- **Optional exploding of list-of-struct columns** before flattening them.

### Key Functions

1. **`flatten_nested_data(df, explode_lists=True, max_passes=1000)`**

   Iteratively flattens all `Struct` columns in a DataFrame or LazyFrame, and explodes any `list[struct]` columns (if `explode_lists=True`). Continues until no `Struct` columns remain (or `max_passes` is reached).

2. **`df.schema_index.append_unnest_relabel(df, column=...)`**

   Moves one column to the end via `.permute`, unnest it, then relabel newly created columns with numeric suffixes.

### Note

- **Column Renaming**: The library appends numeric suffixes to *all columns* that lack them, even if they are already scalar columns. That ensures flattening never creates collisions, but it does mean your top-level columns will also gain suffixes.

- **LazyFrame Support**: By default, the plugin is registered for `DataFrame`. If you want to use this on LazyFrames, you can register a similar namespace for `LazyFrame` or manually attach the plugin’s logic. I may end up supporting both.

## Contributing

1. **Issues & Discussions**: Please open a GitHub issue for bugs, feature requests, or questions.

2. **Pull Requests**: PRs are welcome! Add tests under `tests/`, update the docs, and ensure you run `pytest` locally.

## License

This project is licensed under the MIT License.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/lmmx/polars-schema-index

Awesome Lists containing this project

README