https://github.com/prakulhiremath/flashback

⏪ Git for DataFrames. Time-travel debugging, exact temporal lineage, and feature evolution tracking for Pandas and Polars.
data-lineage data-versioning mlops polars time-travel
Host: GitHub
URL: https://github.com/prakulhiremath/flashback
Owner: prakulhiremath
License: mit
Created: 2026-05-28T06:30:18.000Z (about 2 months ago)
Default Branch: main
Last Pushed: 2026-06-07T16:59:00.000Z (about 2 months ago)
Last Synced: 2026-06-07T18:26:38.775Z (about 2 months ago)
Topics: data-lineage, data-versioning, mlops, polars, time-travel
Language: Python
Homepage: http://aliensonearth.in/flashback/
Size: 97.7 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff

README

# ⚡ flashback

> **Git for Datasets** — time-travel debugging and transformation lineage tracking for pandas & Polars.

[![CI](https://github.com/prakulhiremath/flashback/actions/workflows/ci.yml/badge.svg)](https://github.com/prakulhiremath/flashback/actions)
[![PyPI](https://img.shields.io/pypi/v/flashback-df.svg)](https://pypi.org/project/flashback-df)
[![Python](https://img.shields.io/pypi/pyversions/flashback-df.svg)](https://pypi.org/project/flashback-df)
[![Coverage](https://codecov.io/gh/prakulhiremath/flashback/branch/main/graph/badge.svg?token=XXXX)](https://codecov.io/gh/prakulhiremath/flashback)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.20440635.svg)](https://doi.org/10.5281/zenodo.20440635)
[![Medium](https://img.shields.io/badge/Medium-Read%20the%20Story-12100E?style=flat&logo=medium&logoColor=white)](https://medium.com/@prakulhiremath/the-6-hour-training-job-mystery-why-flashback-changes-everything-for-data-engineers-e81290fbdc84)
[![PyPI Downloads](https://static.pepy.tech/personalized-badge/flashback-df?period=total&units=INTERNATIONAL_SYSTEM&left_color=BLACK&right_color=GREEN&left_text=downloads)](https://pepy.tech/projects/flashback-df)

```
📂 load  ──▶  🔍 filter  ──▶  ➕ with_columns  ──▶  ⏪ lag  ──▶  HEAD
                  │
              (before-lag)  ◀── fb.checkout("before-lag")
```

---

## Why this exists

Every ML researcher has asked: **"Why did my metric change?"** Too often, nobody can answer.

You ran a 6-hour training job, the Sharpe ratio dropped from 1.4 to 0.9, and somewhere between the raw tick data and the feature matrix a silent transformation introduced look-ahead bias. You have no idea where.

**DVC is too heavy.** It versions entire files and brings S3 backends, CI pipelines, and YAML configs with it. You don't want to learn a new orchestration system; you want to know what happened to column `price_lag1` between step 3 and step 7.

**Git doesn't understand columns.** `git diff` on a Parquet file is binary noise. It cannot tell you "this `.filter()` removed 412 rows" or "this `.with_columns()` introduced a null in 3% of rows."

**flashback fixes this.**

It wraps your DataFrame in a thin proxy that records every transformation as a node in an in-memory Directed Acyclic Graph (DAG). Each node is identified by a deterministic SHA-256 hash of its parent node IDs, the operation and its arguments, and the schema, giving you:

- **Instant time-travel**: `fb.checkout("before-lag")` returns the exact frame at that checkpoint with no I/O unless you ask for it.
- **Structural diffing**: `frame.diff(other)` shows you exactly which rows were added or removed between any two checkpoints.
- **Beautiful lineage views**: `fb.visualize()` renders a `rich`-powered git-log-style tree in your terminal, or an SVG graph in Jupyter.
- **Reproducibility**: identical transformations applied to identical data always produce the same node ID, because the ID is computed only from the inputs listed above.
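
The recording idea can be sketched without flashback itself. The toy classes below are an illustration under stated assumptions, not the library's implementation: `TrackedRows`, `Node`, and the two-argument `filter` are hypothetical names, and a plain list of dicts stands in for a DataFrame. Each operation returns a new wrapper whose node points at its parent, which is what makes walking back through the history possible.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One recorded step in the lineage graph (hypothetical structure)."""
    op: str
    kwargs: dict
    parent: "Node | None" = None


@dataclass
class TrackedRows:
    """Toy stand-in for a tracked frame: rows plus the node that produced them."""
    rows: list
    node: Node = field(default_factory=lambda: Node("load", {}))

    def filter(self, column, predicate):
        # Apply the operation, then record it as a child of the current node.
        kept = [r for r in self.rows if predicate(r[column])]
        return TrackedRows(kept, Node("filter", {"column": column}, self.node))

    def history(self):
        # Walk parent links back to the root, then reverse to get root -> HEAD.
        chain, node = [], self.node
        while node is not None:
            chain.append(node.op)
            node = node.parent
        return chain[::-1]


frame = TrackedRows([{"price": 10.0}, {"price": -1.0}, {"price": 3.5}])
clean = frame.filter("price", lambda p: p > 0)
print(clean.history())  # ['load', 'filter']
print(len(clean.rows))  # 2
```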

---

## Install

```bash
pip install flashback-df

# or, if you use uv (recommended):
uv add flashback-df
```

**Requirements:** Python ≥ 3.10, Polars ≥ 0.20, pandas ≥ 2.0.

---

## Quickstart

```python
import flashback as fb

# ── 1. Load any source (pick one) ───────────────────────────────────────────
df = fb.load("trades.parquet")          # Parquet
df = fb.load("prices.csv")             # CSV
df = fb.load(my_polars_df)             # existing Polars DataFrame
df = fb.load(my_pandas_df)             # existing Pandas DataFrame

# ── 2. Transform: every step is recorded automatically ──────────────────────
df = df.filter(fb.col("price") > 0)
df = df.with_columns(
    (fb.col("price") * fb.col("volume")).alias("notional")
)

# Tag a checkpoint before the next risky operation.
df = df.tag("before-lag")
df = df.lag("price", 1)               # sugar for shift(1) + tracking
df = df.rolling_mean("notional", 5)

# ── 3. Time-travel ──────────────────────────────────────────────────────────
df_clean = fb.checkout("before-lag")  # ← instant; no disk I/O

# ── 4. See what broke your Sharpe ratio ─────────────────────────────────────
fb.visualize()
```

Terminal output:

```
╭─ flashback lineage  •  4 commits  •  HEAD → rolling_mean ──────────────────╮
│                                                                             │
│  📂 LOAD  5,000 rows × 4 cols  [14:03:01]                                  │
│  │                                                                          │
│  ├─ 🔍 filter  arg_0=...col("price")...  4,823 rows × 4 cols  #a1b2c3d4   │
│  │                                                                          │
│  ├─ ➕ with_columns  arg_0=...alias("notional")  4,823 rows × 5  #e5f6a7  │
│  │                                                                          │
│  ├─ ⏪ lag  column='price'  n=1  4,823 rows × 6  [before-lag]  #b8c9d0    │
│  │                                                                          │
│  └─ 📈 rolling_mean  window=5  4,823 rows × 7 ● HEAD  #01e2f3a4           │
│                                                                             │
╰─────────────────────────────────────────────────────────────────────────────╯
```

---

## API Reference

### `fb.load(source, *, label=None, track=True)`

Load a DataFrame from a file path, Polars DataFrame, or Pandas DataFrame and begin tracking its lineage.

| Param | Type | Description |
|-------|------|-------------|
| `source` | `str \| pl.DataFrame \| pd.DataFrame \| FlashbackFrame` | Data source |
| `label` | `str \| None` | Human-readable root label (default: filename stem or `"root"`) |
| `track` | `bool` | Register with the global registry (default: `True`) |

**Supported formats:** `.parquet`, `.csv`, `.json`, `.ndjson`, `.ipc`, `.arrow`
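
One plausible way to route a path to a reader is a suffix lookup. The table below is a sketch under assumptions, not flashback's actual dispatch: `READERS` and `reader_name` are hypothetical names, and the values are the names of Polars' top-level readers (`read_parquet`, `read_csv`, `read_json`, `read_ndjson`, `read_ipc`), with `.arrow` treated as Arrow IPC. It returns the reader name only, so it runs without Polars installed.

```python
from pathlib import Path

# Hypothetical suffix -> Polars reader-name table; the real dispatch may differ.
READERS = {
    ".parquet": "read_parquet",
    ".csv": "read_csv",
    ".json": "read_json",
    ".ndjson": "read_ndjson",
    ".ipc": "read_ipc",
    ".arrow": "read_ipc",
}


def reader_name(path: str) -> str:
    """Return the name of the Polars reader that matches the file suffix."""
    suffix = Path(path).suffix.lower()
    try:
        return READERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported format: {suffix!r}") from None


print(reader_name("trades.parquet"))  # read_parquet
print(reader_name("prices.CSV"))      # read_csv
```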

---

### `fb.col(name)`

Alias for `polars.col`. Use inside transform chains for IDE-friendly imports:

```python
df = df.filter(fb.col("price") > 0)
```

---

### `fb.commit(frame, label, *, message="")`

Tag the current state of `frame` with a human-readable label, analogous to `git tag`.

```python
df = fb.commit(df, "before-normalise", message="Raw features, no scaling")
```

Or use the method form:

```python
df = df.tag("before-normalise", message="Raw features, no scaling")
```

---

### `fb.checkout(label, *, frame=None)`

Time-travel to a named checkpoint. Returns a new `FlashbackFrame` at that exact state, fully materialised.

```python
df_original = fb.checkout("before-normalise")
```

If `frame` is provided, searches only that frame's lineage. Otherwise, searches the global registry.
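
The commit/checkout pair behaves like a label-to-snapshot lookup. A minimal sketch of that idea follows, assuming an in-memory dictionary and plain Python lists in place of frames; `CheckpointRegistry` is a hypothetical name and says nothing about how flashback stores or materialises its checkpoints.

```python
class CheckpointRegistry:
    """Toy label -> snapshot store (hypothetical; not flashback's registry)."""

    def __init__(self):
        self._snapshots = {}

    def commit(self, label, rows):
        # Store a copy so later edits to the caller's list do not leak in.
        self._snapshots[label] = list(rows)

    def checkout(self, label):
        if label not in self._snapshots:
            raise KeyError(f"no checkpoint named {label!r}")
        # Return a copy so the stored snapshot itself stays untouched.
        return list(self._snapshots[label])


registry = CheckpointRegistry()
prices = [101.0, 102.5, 99.8]
registry.commit("before-normalise", prices)

prices = [p / max(prices) for p in prices]  # a later, destructive step
original = registry.checkout("before-normalise")
print(original)  # [101.0, 102.5, 99.8]
```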

---

### `fb.visualize(frame=None, *, style="tree", max_width=120)`

Render the transformation lineage.

- `style="tree"`: rich tree with icons, timestamps, shapes, node IDs.
- `style="dag"`: compact ASCII graph (`git log --graph` style).
- In Jupyter, automatically falls back to an SVG/HTML widget.

---

### `FlashbackFrame.lag(column, n=1, *, alias=None)`

Shift `column` by `n` periods with a tracked checkpoint.

```python
df = df.lag("price", 1)                    # → price_lag1
df = df.lag("price", 3, alias="price_t3")  # → price_t3
```
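
The direction of the shift is what separates a lag from look-ahead bias, so it is worth spelling out. The pure-Python sketch below assumes the usual convention (row *t* receives the value from row *t − n*, and the first *n* rows have no history); the standalone `lag` function is hypothetical and only illustrates the semantics, not flashback's implementation.

```python
def lag(values, n=1):
    """Return values shifted n steps later; the first n slots have no history."""
    if n < 0:
        raise ValueError("n must be non-negative; a negative shift reads the future")
    if n == 0:
        return list(values)
    # Pad the front with None and drop the last n values.
    return [None] * min(n, len(values)) + list(values[:-n])


prices = [10.0, 11.0, 12.0, 13.0]
print(lag(prices, 1))  # [None, 10.0, 11.0, 12.0]
print(lag(prices, 3))  # [None, None, None, 10.0]
```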

---

### `FlashbackFrame.rolling_mean(column, window, *, alias=None, min_periods=None)`

Rolling mean over `window` periods with lineage tracking.

```python
df = df.rolling_mean("notional", 20)  # → notional_rmean20
```
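
To make the `window` and `min_periods` parameters concrete, here is a trailing rolling mean in plain Python. It is a sketch under one assumption that the README does not confirm: when `min_periods` is `None`, a full window is required before a value is emitted. The standalone function is illustrative only.

```python
def rolling_mean(values, window, min_periods=None):
    """Trailing rolling mean; emits None until min_periods values are available."""
    if min_periods is None:
        min_periods = window  # assumption: default to requiring a full window
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk) if len(chunk) >= min_periods else None)
    return out


print(rolling_mean([1.0, 2.0, 3.0, 4.0], 2))                 # [None, 1.5, 2.5, 3.5]
print(rolling_mean([1.0, 2.0, 3.0, 4.0], 2, min_periods=1))  # [1.0, 1.5, 2.5, 3.5]
```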

---

### `FlashbackFrame.diff(other)`

Structural diff between two frames. Returns a Polars DataFrame with a `_diff` column of `"added"` / `"removed"`.

```python
import polars as pl

delta = df_now.diff(df_old)
print(delta.filter(pl.col("_diff") == "removed"))
```
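
A row-level diff of this shape can be approximated with multiset subtraction. The sketch below assumes whole-row equality and that rows on the left-hand side only count as `"added"`; `diff_rows` is a hypothetical helper working on lists of dicts, not the algorithm flashback uses.

```python
from collections import Counter


def diff_rows(new_rows, old_rows):
    """Tag rows present on only one side as 'added' or 'removed' (multiset semantics)."""
    new_counts = Counter(tuple(sorted(r.items())) for r in new_rows)
    old_counts = Counter(tuple(sorted(r.items())) for r in old_rows)
    delta = []
    # Counter subtraction keeps only the positive surplus on each side.
    for key, count in (new_counts - old_counts).items():
        delta += [{**dict(key), "_diff": "added"}] * count
    for key, count in (old_counts - new_counts).items():
        delta += [{**dict(key), "_diff": "removed"}] * count
    return delta


old = [{"id": 1, "price": 10.0}, {"id": 2, "price": -1.0}]
new = [{"id": 1, "price": 10.0}, {"id": 3, "price": 7.5}]
for row in diff_rows(new, old):
    print(row)
# {'id': 3, 'price': 7.5, '_diff': 'added'}
# {'id': 2, 'price': -1.0, '_diff': 'removed'}
```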

---

### `FlashbackFrame.history()`

Return the full transformation chain as a list of dicts (root → HEAD):

```python
for step in df.history():
    print(step["op_name"], step["shape"], step["label"])
```

---

## Persistence

Lineage graphs can be saved to and loaded from disk:

```python
from flashback.storage import Storage

store = Storage(".flashback")  # or Storage.from_cwd()
store.save(df, frame_id="experiment-001")

# Later, in another session:
df = store.load("experiment-001")
```

The `.flashback/` directory layout:

```
.flashback/
├── config.json
├── graphs/
│   └── experiment-001.json   # serialised DAG
└── cache/
    └── .parquet     # materialised node snapshots
```
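
As a rough picture of what a serialised DAG under `graphs/` could look like, the sketch below writes a node list to `graphs/<frame_id>.json` and reads it back. The payload shape, the `save_graph`/`load_graph` helpers, and the use of list positions as parent references are all assumptions made for illustration; the on-disk format flashback actually uses is not documented here.

```python
import json
import tempfile
from pathlib import Path


def save_graph(root: Path, frame_id: str, nodes: list) -> Path:
    """Write a list of node dicts to <root>/graphs/<frame_id>.json."""
    graphs = root / "graphs"
    graphs.mkdir(parents=True, exist_ok=True)
    path = graphs / f"{frame_id}.json"
    path.write_text(json.dumps({"frame_id": frame_id, "nodes": nodes}, indent=2))
    return path


def load_graph(root: Path, frame_id: str) -> list:
    """Read the node list back from disk."""
    payload = json.loads((root / "graphs" / f"{frame_id}.json").read_text())
    return payload["nodes"]


nodes = [
    {"op": "load", "parents": []},
    {"op": "filter", "parents": [0]},  # parents refer to positions in this list
]
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / ".flashback"
    saved_path = save_graph(root, "experiment-001", nodes)
    restored = load_graph(root, "experiment-001")

print(restored == nodes)  # True
```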

---

## How it works

```
┌──────────────────────────────────────────────────────────┐
│  FlashbackFrame                                          │
│                                                          │
│  ┌──────────────┐    intercept    ┌───────────────────┐  │
│  │  Polars API  │ ─────────────▶ │   LineageDAG      │  │
│  │  .filter()   │                │                   │  │
│  │  .sort()     │  record node   │  root ──▶ filter  │  │
│  │  .join()     │ ◀──────────── │         ──▶ sort  │  │
│  └──────────────┘                │         ──▶ join  │  │
│         │                        └───────────────────┘  │
│         ▼                                               │
│  polars.DataFrame  (unchanged; Polars still optimises)  │
└──────────────────────────────────────────────────────────┘
```

**Node identity** is a SHA-256 digest, truncated to 20 hexadecimal characters, of:

```json
{
  "parents": [""],
  "op": "filter",
  "kwargs": {"arg_0": "[(col(\"price\")) > (0)]"},
  "schema": {"id": "Int64", "price": "Float64", ...}
}
```

This means:

- Identical pipelines on identical data always hash to the same node → instant cache hits.
- Changing *any* argument or parent state produces a *different* hash → no silent collisions.
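
The hashing scheme can be reproduced with the standard library. The sketch below assumes the payload is serialised as canonical JSON (sorted keys, fixed separators) and that the ID is the first 20 hex characters of the digest; both details, and the `node_id` helper itself, are assumptions rather than flashback's confirmed behaviour. Twenty hex characters are 80 bits, so collisions are extremely unlikely rather than strictly impossible.

```python
import hashlib
import json


def node_id(parents, op, kwargs, schema, length=20):
    """Hash a canonical JSON payload and keep the first `length` hex characters."""
    payload = {"parents": parents, "op": op, "kwargs": kwargs, "schema": schema}
    # sort_keys and fixed separators make the encoding independent of dict order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


schema = {"id": "Int64", "price": "Float64"}
a = node_id(["root"], "filter", {"arg_0": 'col("price") > 0'}, schema)
b = node_id(["root"], "filter", {"arg_0": 'col("price") > 0'}, schema)
c = node_id(["root"], "filter", {"arg_0": 'col("price") > 1'}, schema)

print(len(a))  # 20
print(a == b)  # True: same inputs, same ID
print(a == c)  # False: one argument changed
```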

---

## Development

```bash
git clone https://github.com/prakulhiremath/flashback
cd flashback
pip install -e ".[dev]"

# Lint
ruff check flashback tests
ruff format --check flashback tests

# Type-check
mypy flashback

# Test with coverage
pytest
```

The CI matrix runs across **Ubuntu × macOS × Windows** and **Python 3.10 – 3.13** with a hard 90% coverage threshold.

---

## Roadmap

- [ ] **Branching**: `fb.branch("experiment-A")` for parallel pipeline exploration
- [ ] **Merge**: reconcile two branches at the DAG level
- [ ] **Remote storage**: push/pull lineage graphs to S3 / GCS
- [ ] **Streaming Polars**: track lazy plans before `.collect()`
- [ ] **Notebook integration**: `%load_ext flashback` magic with live DAG sidebar
- [ ] **Export to DVC**: generate `.dvc` stage files from a flashback DAG

---

## License

MIT — see [LICENSE](LICENSE).

---



  Built with Polars · Rich · NetworkX