https://github.com/jolovicdev/cashet
Cache Python function results like git objects. Content-addressable, pipeline-friendly, and CLI-inspectable. Run once, reuse forever.
https://github.com/jolovicdev/cashet
cache cli-tool compute-cache content-addressable dag deduplication function-cache hashing memoization pickle pythn sqlite
Last synced: about 1 month ago
JSON representation
Cache Python function results like git objects. Content-addressable, pipeline-friendly, and CLI-inspectable. Run once, reuse forever.
- Host: GitHub
- URL: https://github.com/jolovicdev/cashet
- Owner: jolovicdev
- License: mit
- Created: 2026-04-11T19:48:00.000Z (about 2 months ago)
- Default Branch: master
- Last Pushed: 2026-04-19T20:27:22.000Z (about 1 month ago)
- Last Synced: 2026-04-19T23:38:38.359Z (about 1 month ago)
- Topics: cache, cli-tool, compute-cache, content-addressable, dag, deduplication, function-cache, hashing, memoization, pickle, pythn, sqlite
- Language: Python
- Homepage: https://pypi.org/project/cashet/
- Size: 78.1 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
- Agents: AGENTS.md
Awesome Lists containing this project
README
cashet
Content-addressable compute cache with git semantics
Run a function once. Get the same result instantly every time after that.
Install · Quick Start · Why · Use Cases · CLI · API · How It Works
---
## Install
**Global CLI tool** (recommended):
```bash
uv tool install cashet
# or
pipx install cashet
```
Then use the CLI anywhere:
```bash
cashet --help
```
**In a project** (library + CLI):
```bash
uv add cashet
# or
pip install cashet
```
This installs `cashet` as both an importable Python library (`from cashet import Client`) and a project-local CLI (`uv run cashet`).
**Develop / contribute:**
```bash
git clone https://github.com/jolovicdev/cashet.git
cd cashet
uv sync
uv run pytest
```
## Quick Start
```python
from cashet import Client
client = Client() # creates .cashet/ in current directory
def expensive_transform(data, scale=1.0):
# imagine this takes 10 minutes
return [x * scale for x in data]
# First call: runs the function
ref = client.submit(expensive_transform, [1, 2, 3], scale=2.0)
print(ref.load()) # [2.0, 4.0, 6.0]
# Second call with same args: instant — returns cached result
ref2 = client.submit(expensive_transform, [1, 2, 3], scale=2.0)
print(ref2.load()) # [2.0, 4.0, 6.0] — no re-computation
```
You can also use `Client` as a context manager to ensure the store connection is closed cleanly:
```python
with Client() as client:
ref = client.submit(expensive_transform, [1, 2, 3], scale=2.0)
print(ref.load())
```
Chain tasks into a pipeline where each step's output feeds into the next:
```python
from cashet import Client
client = Client()
def load_dataset(path):
return list(range(100))
def normalize(data):
max_val = max(data)
return [x / max_val for x in data]
def train_model(data, lr=0.01):
return {"loss": 0.05, "lr": lr, "samples": len(data)}
# Step 1: load
raw = client.submit(load_dataset, "data/train.csv")
# Step 2: normalize (receives raw output as input)
normalized = client.submit(normalize, raw)
# Step 3: train (receives normalized output)
model = client.submit(train_model, normalized, lr=0.001)
print(model.load()) # {'loss': 0.05, 'lr': 0.001, 'samples': 100}
```
Re-run the script — everything returns instantly from cache. Change one argument and only that step (and downstream) re-runs.
## Why
You already have caches (`functools.lru_cache`, `joblib.Memory`). Here's what's different:
| | lru_cache | joblib.Memory | **cashet** |
|---|---|---|---|
| AST-normalized hashing | No | No | Yes (comments/formatting don't break cache) |
| DAG resolution (chain outputs) | No | No | Yes |
| Content-addressable storage | No | No | Yes (like git blobs) |
| CLI to inspect history | No | No | Yes |
| Diff two runs | No | No | Yes |
| Garbage collection / eviction | No | No | Yes |
| Pluggable serialization | No | No | Yes |
| Explicit cache opt-out | No | Partial | Yes |
| Pluggable store / executor | No | No | Yes |
| Persists across restarts | No | Yes | Yes |
The core idea: **hash the function's AST-normalized source + arguments = unique cache key**. Comments, docstrings, and formatting changes don't invalidate the cache — only semantic changes do. Same function + same args = same result, stored immutably on disk. The result is a git-like blob you can inspect, diff, and chain.
## Use Cases
### 1. ML Experiment Tracking Without the Bloat
You run 200 hyperparameter sweeps overnight. Half crash. You fix a bug and re-run. Without cashet, you re-process the dataset 200 times. With cashet:
```python
from cashet import Client, TaskError, TaskRef
client = Client()
def preprocess(dataset_path, image_size):
# 45 minutes of image resizing
...
def train(data, learning_rate, dropout):
...
# Batch submit with topological ordering
# TaskRef(0) refers to the first task's output
results = client.submit_many([
(preprocess, ("s3://my-bucket/images", 224)),
(train, (TaskRef(0), 0.01, 0.2)),
(train, (TaskRef(0), 0.01, 0.5)),
(train, (TaskRef(0), 0.001, 0.2)),
(train, (TaskRef(0), 0.001, 0.5)),
(train, (TaskRef(0), 0.0001, 0.2)),
(train, (TaskRef(0), 0.0001, 0.5)),
])
```
`preprocess` runs **once** — all 6 training jobs reuse its cached output. Re-run the script tomorrow and even the training results come from cache (same function + same args = instant).
### 2. Data Pipeline Debugging
Your ETL pipeline fails at step 5. You fix a typo. Now you need to re-run steps 5-7 but steps 1-4 are unchanged and expensive:
```python
from cashet import Client
client = Client()
raw = client.submit(load_s3, "s3://logs/2024-05-01/")
clean = client.submit(remove_pii, raw)
enriched = client.submit(join_crm, clean, "select * from users")
report = client.submit(generate_report, enriched)
```
Fix the `join_crm` function and re-run the script. Steps 1-2 return instantly from cache. Only step 3 onward re-executes. This works because cashet tracks which function produced which output — changing a function's source code changes its hash, invalidating downstream cache entries.
### 3. Reproducible Notebook Results
`cashet` is designed to work in Jupyter notebooks and IPython sessions. Share a result with a colleague and they can verify exactly how it was produced:
```python
# your notebook
ref = client.submit(generate_forecast, date="2024-01-01", model="v3")
print(f"Result hash: {ref.hash}")
```
```bash
# their terminal — inspect provenance
cashet show
# Output:
# Hash: a3b4c5d6...
# Function: generate_forecast
# Source: def generate_forecast(date, model): ...
# Args: (('2024-01-01',), {'model': 'v3'})
# Created: 2024-05-01T10:32:17
# Retrieve the actual result
cashet get -o forecast.csv
```
### 4. Incremental Computation
Process a large dataset in chunks. Already-processed chunks return instantly:
```python
from cashet import Client
client = Client()
def process_chunk(chunk_id, source_file):
# expensive per-chunk processing
...
results = []
for chunk_id in range(100):
ref = client.submit(process_chunk, chunk_id, "huge_file.parquet")
results.append(ref)
```
First run processes all 100 chunks. Second run (even after restarting Python) returns all 100 results instantly. Add a new chunk? Only that one runs.
## CLI
```bash
# Show commit history
cashet log
# Filter by function name
cashet log --func "preprocess"
# Filter by tag
cashet log --tag env=prod --tag experiment=run-1
# Show full commit details (source code, args, error)
cashet show
# Retrieve a result (pretty-prints strings/dicts/lists)
cashet get
# Write a result to file
cashet get -o output.bin
# Compare two commits
cashet diff
# Show lineage of a result (same function+args over time)
cashet history
# Delete a specific commit
cashet rm
# Evict old cache entries and orphaned blobs
cashet gc --older-than 30
# Evict oldest entries until under a size limit
cashet gc --max-size 1GB
# Clear everything (alias for gc --older-than 0)
cashet clear
# Storage statistics (includes disk size)
cashet stats
```
## API
### `Client`
```python
from cashet import Client
client = Client(
store_dir=".cashet", # where to store blobs + metadata (SQLiteStore)
# falls back to $CASHET_DIR env var if set
store=None, # or inject any Store implementation
executor=None, # or inject any Executor implementation
serializer=None, # defaults to PickleSerializer
max_workers=1, # max parallelism for submit_many (default: 1, sequential)
)
```
### Pluggable Backends
Everything is protocol-based. Swap the store, executor, or serializer without touching your task code:
```python
from pathlib import Path
from cashet import Client, Store, Executor, Serializer
from cashet.store import SQLiteStore
from cashet.executor import LocalExecutor
# These are equivalent (the defaults):
client = Client(store_dir=".cashet")
# Explicit injection:
client = Client(
store=SQLiteStore(Path(".cashet")),
executor=LocalExecutor(),
)
```
**Store protocol** — implement this to use RocksDB, Redis, S3, or anything else:
```python
from cashet.protocols import Store
class RedisStore:
def put_blob(self, data: bytes) -> ObjectRef: ...
def get_blob(self, ref: ObjectRef) -> bytes: ...
def put_commit(self, commit: Commit) -> None: ...
def get_commit(self, hash: str) -> Commit | None: ...
def find_by_fingerprint(self, fingerprint: str) -> Commit | None: ...
def find_running_by_fingerprint(self, fingerprint: str) -> Commit | None: ...
def list_commits(self, ...) -> list[Commit]: ...
def get_history(self, hash: str) -> list[Commit]: ...
def stats(self) -> dict[str, int]: ...
def evict(self, older_than: datetime) -> int: ...
def delete_commit(self, hash: str) -> bool: ...
def close(self) -> None: ...
client = Client(store=RedisStore("redis://localhost"))
# Everything else works identically
```
**Executor protocol** — implement this for distributed execution (Celery, Kafka, RQ):
```python
from cashet.protocols import Executor
class CeleryExecutor:
def submit(self, func, args, kwargs, task_def, store, serializer):
# Push to Celery, poll for result
...
client = Client(
store=RedisStore("redis://localhost"),
executor=CeleryExecutor(),
)
```
**Serializer protocol** — already covered below.
### `client.submit(func, *args, **kwargs) -> ResultRef`
Submit a function for execution. Returns a `ResultRef` — a lazy handle to the result.
```python
ref = client.submit(my_func, arg1, arg2, key="value")
ref.hash # content hash of the result blob
ref.commit_hash # commit hash (use this for show/history/rm/get)
ref.size # size in bytes
ref.load() # deserialize and return the result
```
If the same function + same arguments have been submitted before, returns the cached result **without re-executing**.
### `client.clear()`
Remove all cache entries and orphaned blobs. Equivalent to `client.gc(timedelta(days=0))`.
```python
client.clear()
```
### `client.submit_many(tasks) -> list[ResultRef]`
Submit a batch of tasks with automatic topological ordering. Use `TaskRef(index)` to wire outputs between tasks in the batch.
```python
from cashet import TaskRef
refs = client.submit_many([
step1_func,
(step2_func, (TaskRef(0),)),
(step3_func, (TaskRef(1), "extra_arg")),
], max_workers=4) # run independent tasks in parallel
```
This enables parallel fan-out and ensures each task only runs after its dependencies.
**Opt out of caching:**
```python
# Per-call
ref = client.submit(non_deterministic_func, _cache=False)
# Per-function via decorator
@client.task(cache=False)
def random_score():
return random.random()
```
**Force re-execution (skip cache, always run):**
```python
# Per-call
ref = client.submit(my_func, arg, _force=True)
# Per-function via decorator
@client.task(force=True)
def always_rerun():
...
```
**Tag commits:**
```python
# Per-call
ref = client.submit(train, data, lr=0.01, _tags={"experiment": "v1"})
# Per-function via decorator
@client.task(tags={"team": "ml"})
def preprocess(raw):
...
```
Tags are not part of the cache key — they are metadata for organization and filtering.
**Retry flaky operations:**
```python
# Per-call
ref = client.submit(fetch_api, url, _retries=3)
# Per-function via decorator
@client.task(retries=3)
def fetch_api(url):
...
```
Retries wait briefly between attempts. When retries are exhausted, `client.submit` raises `TaskError` with the original traceback included in the message.
**Task timeouts:**
```python
# Per-call (seconds)
ref = client.submit(slow_func, _timeout=30)
# Per-function via decorator
@client.task(timeout=30)
def slow_func():
...
```
Timeouts can be combined with retries — a timed-out attempt counts as a failure and will be retried.
### `@client.task`
Register a function with cashet metadata and make it directly callable:
```python
@client.task
def my_func(x):
return x * 2
ref = my_func(5) # Returns ResultRef, same as client.submit(my_func, 5)
ref.load() # 10
@client.task(cache=False, name="custom_task_name", tags={"env": "prod"})
def other_func(x):
return x + 1
```
`client.submit(my_func, 5)` still works identically.
### `client.log()`, `client.show()`, `client.get()`, `client.diff()`, `client.history()`, `client.rm()`, `client.gc()`
```python
# List commits
commits = client.log(func_name="preprocess", limit=10)
# Filter by status
commits = client.log(status="failed")
# Filter by tags
commits = client.log(tags={"experiment": "v1"})
# Get commit details
commit = client.show(hash)
commit.task_def.func_source # the source code
commit.task_def.args_snapshot # the serialized args
commit.parent_hash # previous commit for same func+args
commit.created_at
# Load a result by commit hash
result = client.get(hash)
# Diff two commits
diff = client.diff(hash_a, hash_b)
# {'func_changed': True, 'args_changed': False, 'output_changed': True, ...}
# Get lineage (all runs of same func+args)
history = client.history(hash)
# Evict old entries (default: 30 days)
evicted = client.gc()
# Evict entries older than 7 days
from datetime import timedelta
evicted = client.gc(older_than=timedelta(days=7))
# Evict oldest entries until under size limit
evicted = client.gc(max_size_bytes=1024 * 1024 * 1024) # 1GB
# Storage stats
stats = client.stats()
# {
# 'total_commits': 42,
# 'completed_commits': 40,
# 'stored_objects': 38, # blob_objects + inline_objects
# 'disk_bytes': 10485760, # blob_bytes + inline_bytes
# 'blob_objects': 35,
# 'blob_bytes': 9437184,
# 'inline_objects': 3,
# 'inline_bytes': 1048576,
# }
```
### Jupyter & Notebook Support
`cashet` works seamlessly in Jupyter notebooks, IPython, and the Python REPL. It uses a tiered source-resolution strategy:
1. **`inspect.getsource()`** — for normal `.py` files
2. **`dill.source.getsource()`** — for interactive sessions with live history
3. **`dis.Bytecode` fallback** — for any live function, even after a kernel restart
This means you can define functions in a notebook cell, rerun the cell with changes, and `cashet` will correctly invalidate the cache based on the new code.
```python
# In a notebook cell
client = Client()
def preprocess(data):
return [x * 2 for x in data]
ref = client.submit(preprocess, [1, 2, 3])
```
Change the cell body and rerun — the cache invalidates automatically.
### Thread Safety
`cashet` is safe to use from multiple threads and processes sharing the same store directory. Concurrent submissions of the same uncached task are deduplicated: the function executes **exactly once** and all callers receive the same cached result. This works across `multiprocessing.Process`, `ProcessPoolExecutor`, and multiple independent Python interpreters.
> **Note:** Cross-process dedup uses a 5-minute timeout by default. If a process dies while running a task, its claim is automatically reclaimed after that timeout so other workers are not blocked forever. You can adjust this via `LocalExecutor(running_ttl=...)`:
>
> ```python
> from datetime import timedelta
> from cashet.executor import LocalExecutor
>
> client = Client(executor=LocalExecutor(running_ttl=timedelta(minutes=10)))
> ```
```python
import threading
def worker():
c = Client() # separate Client instance, same store
c.submit(expensive_func, arg)
threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
t.start()
for t in threads:
t.join()
# expensive_func ran only once
```
### `ResultRef`
A lazy reference to a stored result. Pass it as an argument to chain tasks:
```python
step1 = client.submit(func_a, input_data)
step2 = client.submit(func_b, step1) # step1 auto-resolves to its output
```
### Custom Serialization
```python
from cashet import Client, PickleSerializer, SafePickleSerializer, JsonSerializer
# Default: pickle (handles arbitrary Python objects)
client = Client(serializer=PickleSerializer())
# Safe pickle: restricts deserialization to an allowlist of known types
client = Client(serializer=SafePickleSerializer())
# Allow custom classes through the allowlist
client = Client(serializer=SafePickleSerializer(extra_classes=[MyClass]))
# For JSON-safe data (dicts, lists, primitives)
client = Client(serializer=JsonSerializer())
# Or implement the Serializer protocol
from cashet.hashing import Serializer
class MySerializer:
def dumps(self, obj) -> bytes:
...
def loads(self, data: bytes):
...
```
## How It Works
```
client.submit(func, arg1, arg2)
│
▼
┌─────────────────┐
│ Hash function │ SHA256(AST-normalized source + dep versions + referenced user helpers)
│ Hash arguments │ SHA256(canonical repr of args/kwargs)
└────────┬────────┘
│
▼
┌─────────────────┐
│ Fingerprint │ func_hash:args_hash
│ cache lookup │ ← Store protocol (SQLiteStore, RedisStore, ...)
└────────┬────────┘
│
┌─────┴─────┐
│ │
CACHED MISS
│ │
▼ ▼
Return ref ← Executor protocol (LocalExecutor, CeleryExecutor, ...)
Execute function
Store result as blob → Store protocol
Record commit with parent lineage
Return ref
```
**Architecture (protocol-based):**
| Protocol | Default | Implement for |
|---|---|---|
| `Store` | `SQLiteStore` | RocksDB, Redis, S3, Postgres |
| `Executor` | `LocalExecutor` | Celery, Kafka, RQ, subprocess |
| `Serializer` | `PickleSerializer` | JSON, MessagePack, custom formats |
**Storage layout** (in `.cashet/`):
```
.cashet/
├── objects/ # content-addressable blobs (like git objects)
│ ├── a3/
│ │ └── b4c5d6... # compressed result blob
│ └── e7/
│ └── f8g9h0...
└── meta.db # SQLite: commits, fingerprints, provenance, inline_objects
```
**Small objects** (<1KB) are stored inline in `meta.db` instead of the filesystem. This reduces inode overhead for caches with many tiny results. Larger objects are stored as compressed blobs in `objects/` as usual.
**Key design decisions:**
- **Closure variables are not hashed** and emit a `ClosureWarning` if present. Function identity is source code, not runtime state. If you need cache invalidation based on a value, pass it as an explicit argument.
- **Referenced user-defined helper functions are hashed recursively.** Change an imported helper in your own code and the caller's cache invalidates correctly. Builtin and third-party library functions are skipped.
- **Blobs are deduplicated by content hash.** Identical results share one blob on disk.
- **Source is hashed as an AST.** Comments, docstrings, and whitespace changes don't invalidate the cache.
- **Non-cached tasks get unique commit hashes** (timestamp salt) so they always re-execute but still record lineage.
- **Parent tracking:** Each commit records the hash of the previous commit for the same function+args, forming a history chain you can traverse.
## Project Status
**Beta.** The core (hashing, DAG resolution, fingerprint dedup) is stable. The defaults work reliably for single-machine and multiprocess workflows. The protocol layer (`Store`, `Executor`, `Serializer`) is ready for alternative backends — implementing a Redis store or Celery executor is a single-file job.
Built-in: `SQLiteStore` + `LocalExecutor` + `PickleSerializer`.
Not yet built: Redis, RocksDB, S3 stores; Celery/Kafka executors. PRs welcome.
## License
MIT