https://github.com/radekdymacz/lakesync
Offline-first sync engine for browser apps — tracks column-level changes locally and lands them as Parquet in an Iceberg lakehouse
bun crdt iceberg lakehouse offline-first sync typescript
- Host: GitHub
- URL: https://github.com/radekdymacz/lakesync
- Owner: radekdymacz
- License: apache-2.0
- Created: 2026-02-06T13:08:23.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2026-02-20T07:48:52.000Z (4 months ago)
- Last Synced: 2026-02-20T08:42:21.449Z (4 months ago)
- Topics: bun, crdt, iceberg, lakehouse, offline-first, sync, typescript
- Language: TypeScript
- Homepage: https://radekdymacz.github.io/lakesync/
- Size: 1.88 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 10
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Codeowners: .github/CODEOWNERS
- Security: SECURITY.md
# LakeSync
[CI](https://github.com/radekdymacz/lakesync/actions/workflows/ci.yml) · [License: Apache-2.0](https://opensource.org/licenses/Apache-2.0)
**Local-first sync. Any backend.**
LakeSync is an open-source sync engine for local-first TypeScript apps. Your data lives in SQLite on the device, syncs through a lightweight gateway, and flushes to the backend of your choice — **Postgres for small data, BigQuery for analytics, S3/R2 via Apache Iceberg for large data**. Source connectors pull data from external systems like Jira and Salesforce into the same pipeline. Same client code either way.
**[Documentation](https://radekdymacz.github.io/lakesync)** · **[Getting Started](https://radekdymacz.github.io/lakesync/docs/getting-started)** · **[Architecture](https://radekdymacz.github.io/lakesync/docs/architecture)**
## Why LakeSync?
Most sync engines lock you into a single backend. LakeSync's `LakeAdapter` interface decouples sync from storage — swap backends without changing client code.
| | Traditional Sync | Data Lake | LakeSync |
|---|---|---|---|
| Offline-first | Yes | No | **Yes** |
| Column-level conflict resolution | Rarely | N/A | **Yes** |
| Pluggable backend | No | No | **Yes** |
| Small data (Postgres/MySQL) | Yes | No | **Yes** |
| Analytics (BigQuery) | Sometimes | Sometimes | **Yes** |
| Large data (Iceberg/S3/R2) | No | Yes | **Yes** |
| Time-travel queries | No | Yes | **Yes** |
| Real-time WebSocket sync | Sometimes | No | **Yes** |
| Source connectors (Jira, Salesforce, ...) | No | No | **Yes** |
| Self-hosted or edge (CF Workers) | Sometimes | No | **Yes** |
## How It Works
```mermaid
sequenceDiagram
participant App as Your App
participant DB as Local SQLite
participant GW as Gateway (CF DO / Self-Hosted)
participant Backend as Any Backend
App->>DB: INSERT / UPDATE / DELETE
Note over App,DB: Zero latency — local write
DB-->>GW: Push column-level deltas (HTTP or WebSocket)
GW-->>DB: ACK + pull remote changes
Note over GW: Deltas merge via HLC + LWW
GW->>Backend: Batch flush
Note over Backend: Postgres? BigQuery? R2?<br/>You choose the adapter.
```
1. **Mutations write to local SQLite** with no network round trip
2. **Column-level deltas push** to a lightweight gateway (HTTP or WebSocket)
3. **Gateway merges** via Hybrid Logical Clocks — concurrent edits to different columns are both preserved
4. **Batch flush** to whatever backend you choose via the adapter interface
5. **Source connectors poll** external systems (Jira, Salesforce, ...) and ingest data through the same gateway
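The second step above depends on knowing which columns a mutation actually changed. A minimal sketch of that idea, assuming rows are plain objects; `extractDelta` is a hypothetical helper, not part of LakeSync's API:

```typescript
type Row = Record<string, unknown>;

// Return only the columns whose values differ between the old and new row.
// For an insert (no previous row), every column counts as changed.
function extractDelta(before: Row | undefined, after: Row): Row {
  const changed: Row = {};
  for (const [column, value] of Object.entries(after)) {
    if (before === undefined || !Object.is(before[column], value)) {
      changed[column] = value;
    }
  }
  return changed;
}

const delta = extractDelta(
  { title: "Buy milk", completed: 0 },
  { title: "Buy milk", completed: 1 },
);
// `delta` holds only the `completed` column, so only that column is pushed.
```

Sending only the changed columns is what lets the gateway keep concurrent edits to different columns of the same row.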
## Right-Size Your Backend
### Small data — use what you know
```
Client SQLite → Gateway → Postgres / MySQL / RDS
```
Familiar tooling, standard SQL queries, simple operational model. The `PostgresAdapter` or `MySQLAdapter` flushes deltas directly to your database.
### Large data — scale to the lake
```
Client SQLite → Gateway → Apache Iceberg (S3/R2)
```
Capacity is bounded by object storage rather than by a database server. Operational and analytical workloads read the same tables, so you can query them with Spark, DuckDB, Athena, or Trino without an ETL step.
### Mix both — route by table
```typescript
const adapter = new CompositeAdapter({
routes: [
{ tables: ["users", "settings"], adapter: postgresAdapter },
{ tables: ["events", "telemetry"], adapter: icebergAdapter },
],
defaultAdapter: postgresAdapter,
});
```
The `CompositeAdapter` routes deltas to different backends by table name. When your data outgrows one backend, `migrateAdapter()` moves it to another — idempotent and safe to re-run.
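Routing by table name can be pictured as a lookup followed by a group-by. The sketch below is illustrative only: `groupByAdapter`, `Delta`, and `Route` are hypothetical names, and the real `CompositeAdapter` internals may differ.

```typescript
interface Delta {
  table: string;
  rowId: string;
}

interface Route<A> {
  tables: string[];
  adapter: A;
}

// Group a batch of deltas by the adapter responsible for each table.
// Tables without an explicit route fall back to the default adapter.
function groupByAdapter<A>(
  deltas: Delta[],
  routes: Route<A>[],
  defaultAdapter: A,
): Map<A, Delta[]> {
  const tableToAdapter = new Map<string, A>();
  for (const route of routes) {
    for (const table of route.tables) tableToAdapter.set(table, route.adapter);
  }
  const groups = new Map<A, Delta[]>();
  for (const delta of deltas) {
    const adapter = tableToAdapter.get(delta.table) ?? defaultAdapter;
    const group = groups.get(adapter) ?? [];
    group.push(delta);
    groups.set(adapter, group);
  }
  return groups;
}
```

Each group can then be flushed to its own backend independently.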
### Replicate — fan out writes
```typescript
const adapter = new FanOutAdapter({
primary: postgresAdapter, // sync — fast operational reads/writes
secondaries: [bigqueryAdapter], // async — best-effort analytics replica
});
```
The `FanOutAdapter` writes to a primary adapter synchronously and replicates to secondaries in the background. Secondary failures never block the write path.
### Materialise — queryable destination tables
Adapters that implement the `Materialisable` interface automatically create queryable destination tables after each flush. Every synced column is derived from the `TableSchema`, plus a `props JSONB` column for consumer-extensible metadata and a `synced_at` timestamp. Tombstoned rows are deleted. The `PostgresAdapter` supports this out of the box.
### Tier — age-based lifecycle
```typescript
const adapter = new LifecycleAdapter({
hot: { adapter: postgresAdapter, maxAgeMs: 30 * 24 * 60 * 60 * 1000 }, // 30 days
cold: { adapter: bigqueryAdapter },
});
```
Recent data stays in the hot tier for fast queries. Older data is served from the cold tier. Call `migrateToTier()` on a schedule to move aged-out deltas.
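The age check behind tiering is a comparison against `maxAgeMs`. A rough sketch, assuming each delta carries a wall-clock timestamp in milliseconds; `splitByAge` and `AgedDelta` are hypothetical and not part of the adapter API:

```typescript
interface AgedDelta {
  id: string;
  wallMs: number; // wall-clock time of the write, in milliseconds
}

// Partition deltas into those still within the hot window and those that aged out.
function splitByAge(
  deltas: AgedDelta[],
  nowMs: number,
  maxAgeMs: number,
): { hot: AgedDelta[]; cold: AgedDelta[] } {
  const hot: AgedDelta[] = [];
  const cold: AgedDelta[] = [];
  for (const delta of deltas) {
    (nowMs - delta.wallMs > maxAgeMs ? cold : hot).push(delta);
  }
  return { hot, cold };
}
```

A scheduled migration job would move the `cold` partition to the cold adapter.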
## Column-Level Conflict Resolution
Traditional sync engines resolve conflicts at the row level — if two users edit the same row, one wins. LakeSync resolves at the **column level** using Last-Write-Wins with Hybrid Logical Clocks:
```mermaid
sequenceDiagram
participant A as Alice
participant GW as Gateway
participant B as Bob
Note over A,B: Both editing the same todo
A->>A: title = "Buy oat milk"
B->>B: status = "done"
A->>GW: push delta (title, HLC=100)
B->>GW: push delta (status, HLC=101)
Note over GW: Column-level merge<br/>title ← Alice (HLC 100)<br/>status ← Bob (HLC 101)
GW->>A: pull → status = "done"
GW->>B: pull → title = "Buy oat milk"
Note over A,B: Both changes preserved ✓
```
Both changes are preserved because they touch different columns. The HLC timestamp determines the winner only when two clients modify the _same_ column.
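The merge rule can be sketched in a few lines. This is an illustration under stated assumptions, not LakeSync's implementation: HLCs are simplified to plain numbers (the project uses bigints), and the `clientId` tie-break for equal timestamps is an assumption.

```typescript
interface ColumnDelta {
  column: string;
  value: unknown;
  hlc: number; // simplified; a real HLC would be a 64-bit value
  clientId: string;
}

type RowState = Map<string, ColumnDelta>;

// Last-Write-Wins per column: an incoming delta replaces the stored one
// only if its HLC is higher (ties broken by clientId so replicas agree).
function mergeColumn(row: RowState, incoming: ColumnDelta): RowState {
  const current = row.get(incoming.column);
  const wins =
    current === undefined ||
    incoming.hlc > current.hlc ||
    (incoming.hlc === current.hlc && incoming.clientId > current.clientId);
  if (wins) row.set(incoming.column, incoming);
  return row;
}

const row: RowState = new Map();
mergeColumn(row, { column: "title", value: "Buy oat milk", hlc: 100, clientId: "alice" });
mergeColumn(row, { column: "status", value: "done", hlc: 101, clientId: "bob" });
// Different columns: both edits are kept.
mergeColumn(row, { column: "title", value: "Buy soy milk", hlc: 102, clientId: "bob" });
// Same column, higher HLC: the newer title wins.
mergeColumn(row, { column: "status", value: "open", hlc: 99, clientId: "alice" });
// Same column, lower HLC: the stale write is ignored.
```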
## Offline-First
The full dataset lives in local SQLite. Edits queue in a persistent IndexedDB outbox that survives page refreshes and browser crashes. When connectivity returns, the outbox drains automatically.
```mermaid
sequenceDiagram
participant App as App (offline)
participant DB as Local SQLite
participant Q as Outbox (IndexedDB)
participant GW as Gateway
App->>DB: Edit 1
DB-->>Q: Delta queued
App->>DB: Edit 2
DB-->>Q: Delta queued
App->>DB: Edit 3
DB-->>Q: Delta queued
Note over App,Q: Fully functional offline
Note over Q,GW: ← Connection restored →
Q->>GW: Push all 3 deltas
GW-->>Q: ACK
GW->>App: Pull remote changes
Note over App,GW: Caught up ✓
```
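The outbox behaviour in the diagram boils down to "remove only what the gateway acknowledged". A minimal in-memory sketch of that contract follows; the real outbox persists to IndexedDB, and the `Outbox` class and its method names here are hypothetical.

```typescript
interface QueuedDelta {
  id: string;
  table: string;
  rowId: string;
  columns: Record<string, unknown>;
}

class Outbox {
  private pending: QueuedDelta[] = [];

  enqueue(delta: QueuedDelta): void {
    this.pending.push(delta);
  }

  // Read the next batch without removing it, so a failed push loses nothing.
  peekBatch(limit: number): QueuedDelta[] {
    return this.pending.slice(0, limit);
  }

  // Drop entries only after the gateway has acknowledged them.
  ack(ids: string[]): void {
    const acked = new Set(ids);
    this.pending = this.pending.filter((delta) => !acked.has(delta.id));
  }

  get size(): number {
    return this.pending.length;
  }
}

const outbox = new Outbox();
outbox.enqueue({ id: "d1", table: "todos", rowId: "row-1", columns: { title: "Edit 1" } });
outbox.enqueue({ id: "d2", table: "todos", rowId: "row-1", columns: { title: "Edit 2" } });
outbox.enqueue({ id: "d3", table: "todos", rowId: "row-2", columns: { title: "Edit 3" } });

const batch = outbox.peekBatch(2);
// ...push `batch` to the gateway; on ACK:
outbox.ack(batch.map((delta) => delta.id));
```

Separating `peekBatch` from `ack` means a crash between push and acknowledgement re-sends the batch rather than dropping it, which is safe as long as pushes are idempotent.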
## Sync Rules
Declarative bucket-based filtering with JWT claim references. The gateway evaluates rules at pull time — clients never download data they shouldn't see. Supports operators: `eq`, `neq`, `in`, `gt`, `lt`, `gte`, `lte`.
```json
{
"buckets": [{
"name": "user-data",
"filters": [
{ "column": "user_id", "op": "eq", "value": "jwt:sub" },
{ "column": "priority", "op": "gte", "value": "3" }
],
"tables": ["todos", "preferences"]
}]
}
```
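To make the evaluation concrete, here is a rough sketch of how a gateway might apply such filters to a row at pull time. It is not LakeSync's evaluator: the function names are hypothetical, and comparing the ordering operators numerically is an assumption.

```typescript
type Op = "eq" | "neq" | "in" | "gt" | "lt" | "gte" | "lte";

interface Filter {
  column: string;
  op: Op;
  value: string;
}

type Claims = Record<string, string | string[]>;
type Row = Record<string, unknown>;

// Resolve "jwt:<claim>" references against the token's claims; literals pass through.
function resolveValue(value: string, claims: Claims): string | string[] | undefined {
  return value.startsWith("jwt:") ? claims[value.slice(4)] : value;
}

function matches(row: Row, filter: Filter, claims: Claims): boolean {
  const expected = resolveValue(filter.value, claims);
  if (expected === undefined) return false; // missing claim: deny
  const actual = row[filter.column];
  if (filter.op === "in") {
    const allowed = Array.isArray(expected) ? expected : [expected];
    return allowed.includes(String(actual));
  }
  if (Array.isArray(expected)) return false;
  if (filter.op === "eq") return String(actual) === expected;
  if (filter.op === "neq") return String(actual) !== expected;
  // Ordering operators compare numerically in this sketch (an assumption).
  const a = Number(actual);
  const b = Number(expected);
  if (Number.isNaN(a) || Number.isNaN(b)) return false;
  if (filter.op === "gt") return a > b;
  if (filter.op === "lt") return a < b;
  if (filter.op === "gte") return a >= b;
  return a <= b;
}

// A row is visible only if every filter in the bucket matches.
function rowVisible(row: Row, filters: Filter[], claims: Claims): boolean {
  return filters.every((filter) => matches(row, filter, claims));
}
```

With the bucket above and a token whose `sub` is `u1`, a todo with `user_id: "u1"` and `priority: 5` would be visible, while one owned by another user would not.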
## Actions
Actions are imperative operations dispatched through the gateway to external systems. Connectors register `ActionHandler`s that declare supported action types, enabling frontend discovery. The gateway handles idempotency (via `actionId` and `idempotencyKey`), validation, and routing.
```typescript
// Discover available actions
const discovery = gateway.describeActions();
// → { connectors: { "slack": [{ actionType: "send_message", ... }] } }
// Execute an action
const result = await gateway.handleAction({
clientId: "client-1",
actions: [{
actionId: "abc-123",
clientId: "client-1",
hlc: hlc.now(),
connector: "slack",
actionType: "send_message",
params: { channel: "#general", text: "Hello" },
}],
});
```
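Idempotent dispatch can be sketched as a result cache keyed by `idempotencyKey`, falling back to `actionId`. The class below is a simplified, synchronous illustration with hypothetical names; the gateway's actual `ActionDispatcher` is not shown in this README and likely handles asynchronous handlers.

```typescript
interface Action {
  actionId: string;
  connector: string;
  actionType: string;
  params: Record<string, unknown>;
  idempotencyKey?: string;
}

type ActionResult =
  | { ok: true; actionId: string; output: unknown }
  | { ok: false; actionId: string; error: string };

type Handler = (params: Record<string, unknown>) => unknown;

class IdempotentDispatcher {
  private readonly handlers = new Map<string, Handler>();
  private readonly completed = new Map<string, ActionResult>();

  register(connector: string, actionType: string, handler: Handler): void {
    this.handlers.set(`${connector}:${actionType}`, handler);
  }

  dispatch(action: Action): ActionResult {
    const key = action.idempotencyKey ?? action.actionId;
    const previous = this.completed.get(key);
    if (previous !== undefined) return previous; // replay: do not run the handler again

    const handler = this.handlers.get(`${action.connector}:${action.actionType}`);
    if (handler === undefined) {
      return { ok: false, actionId: action.actionId, error: "unsupported action" };
    }
    const result: ActionResult = {
      ok: true,
      actionId: action.actionId,
      output: handler(action.params),
    };
    this.completed.set(key, result);
    return result;
  }
}

const dispatcher = new IdempotentDispatcher();
let sent = 0;
dispatcher.register("slack", "send_message", (params) => {
  sent += 1;
  return { delivered: params["text"] };
});

const action: Action = {
  actionId: "abc-123",
  connector: "slack",
  actionType: "send_message",
  params: { channel: "#general", text: "Hello" },
};
dispatcher.dispatch(action);
dispatcher.dispatch(action); // retried after a dropped response: handler runs once
```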
## React Hooks
`lakesync/react` provides reactive hooks that wire directly into the sync coordinator.
```tsx
import { LakeSyncProvider, useQuery, useMutation, useAction, useSyncStatus } from "lakesync/react";
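// `coordinator` is a SyncCoordinator (see Quick Start); the prop name is an assumption.
function App() {
  return (
    <LakeSyncProvider coordinator={coordinator}>
      <TodoList />
    </LakeSyncProvider>
  );
}

function TodoList() {
  const { rows } = useQuery("SELECT * FROM todos ORDER BY created_at DESC");
  const { insert, update, remove } = useMutation();
  const { execute } = useAction();
  const { isOnline, isSyncing } = useSyncStatus();

  return (
    <ul>
      {rows.map(todo => (
        // Clicking a todo marks it completed; the write is local and syncs later.
        <li key={todo.id} onClick={() => update("todos", todo.id, { completed: 1 })}>
          {todo.title}
        </li>
      ))}
    </ul>
  );
}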
```
Queries re-run automatically when affected tables change — `useQuery` extracts table names from SQL and only re-renders when relevant deltas arrive, not on every sync. Actions dispatch imperative operations to connectors. `useActionDiscovery()` enables dynamic UI based on registered action handlers.
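One simple way to decide whether a query is affected by incoming deltas is to pull table names out of the SQL text. The regex below is a deliberately naive sketch (it ignores subqueries, CTEs, and quoted names containing spaces) and is not how `useQuery` is necessarily implemented:

```typescript
// Collect identifiers that follow FROM / JOIN / INTO / UPDATE.
function extractTables(sql: string): Set<string> {
  const tables = new Set<string>();
  const pattern = /\b(?:from|join|into|update)\s+["`]?([A-Za-z_][A-Za-z0-9_]*)["`]?/gi;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(sql)) !== null) {
    tables.add(match[1].toLowerCase());
  }
  return tables;
}

// Re-run a query only when a changed table appears in it.
function shouldRerun(sql: string, changedTables: string[]): boolean {
  const used = extractTables(sql);
  return changedTables.some((table) => used.has(table.toLowerCase()));
}
```

With this check, a delta on `preferences` would leave a `SELECT * FROM todos` query untouched.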
## Quick Start
### Install
```bash
npm install lakesync
```
### Sync in 10 Lines
```typescript
import { LocalDB, SyncCoordinator, HttpTransport } from "lakesync/client";
const db = await LocalDB.open({ name: "my-app", backend: "idb" });
const transport = new HttpTransport({
baseUrl: "https://your-gateway.workers.dev",
gatewayId: "my-gateway",
token: "your-jwt-token",
});
const coordinator = new SyncCoordinator(db, transport);
coordinator.startAutoSync();
// Track mutations — deltas are extracted and queued automatically
await coordinator.tracker.insert("todos", "row-1", {
title: "Buy milk",
completed: 0,
});
```
### Run Locally
```bash
git clone https://github.com/radekdymacz/lakesync.git
cd lakesync
bun install
bun run build
bun run test
```
### Deploy the Gateway
**Cloudflare Workers (edge):**
```bash
cd apps/gateway-worker
wrangler r2 bucket create lakesync-data # once
wrangler deploy
```
**Self-hosted (Node.js / Bun):**
```typescript
import { GatewayServer } from "@lakesync/gateway-server";
import { PostgresAdapter } from "@lakesync/adapter";
const adapter = new PostgresAdapter({ connectionString: "postgres://..." });
const server = new GatewayServer({
port: 3000,
gatewayId: "my-gateway",
adapter,
jwtSecret: "your-secret",
persistence: "sqlite", // survive restarts
});
server.start();
```
Or use Docker:
```bash
cd packages/gateway-server
docker compose up
```
See the [Todo App](apps/examples/todo-app/) for a complete working example.
## Architecture
```mermaid
graph TB
subgraph "Browser"
UI[Application UI]
SC[SyncCoordinator]
ST[SyncTracker]
DB[(LocalDB<br/>sql.js + IDB)]
Q[(IDB Queue)]
end
subgraph "Gateway (pick one)"
subgraph "Edge"
CF[CF Workers + DO]
end
subgraph "Self-Hosted"
SH[GatewayServer<br/>Node.js / Bun]
end
end
subgraph "Source Connectors"
JIRA[Jira]
SF[Salesforce]
end
subgraph "Backends"
PG[(Postgres / MySQL)]
BQ[(BigQuery)]
R2[(R2 / S3)]
PQ[Parquet / Iceberg]
end
subgraph "Adapters"
COMP[CompositeAdapter<br/>route by table]
FAN[FanOutAdapter<br/>replicate writes]
LIFE[LifecycleAdapter<br/>hot / cold tiers]
end
subgraph "Analytics"
DDB[DuckDB-WASM]
end
UI --> SC
SC --> ST
ST --> DB
ST --> Q
SC -->|HTTP / WS| CF
SC -->|HTTP / WS| SH
JIRA -->|poll| SH
SF -->|poll| SH
CF --> COMP
SH --> COMP
COMP --> PG
COMP --> R2
FAN --> PG
FAN --> BQ
LIFE --> PG
LIFE --> BQ
R2 --> PQ
DDB -->|query| PQ
```
### Key Design Decisions
- **Pluggable adapters** — `LakeAdapter` (object storage) and `DatabaseAdapter` (SQL) interfaces. Swap backends at the gateway level.
- **HLC timestamps** (branded bigints) — 48-bit wall clock + 16-bit counter, monotonic ordering across distributed clients without coordination
- **Deterministic delta IDs** — SHA-256 hash of `(clientId, hlc, table, rowId, columns)` enables idempotent push
- **DeltaBuffer** — atomic `BufferSnapshot` pattern (append log + row index) swapped atomically on each mutation, giving O(1) conflict checks and O(n) flush with no intermediate inconsistent state
- **`Result<T, E>`** everywhere: no exceptions cross API boundaries; all errors are typed and composable
- **Adapter composition** — `CompositeAdapter` (route by table), `FanOutAdapter` (replicate writes), `LifecycleAdapter` (hot/cold tiers). All implement `DatabaseAdapter` so they nest freely.
- **Table sharding** — split a tenant's traffic across multiple Durable Objects by table name. The shard router fans out pushes and merges pull results automatically.
- **Adapter-sourced pull** — clients can pull directly from named source adapters (e.g. a BigQuery dataset) via the gateway, with sync rules filtering applied.
- **Real-time WebSocket sync** — `WebSocketTransport` maintains a persistent connection to the gateway server, with server-initiated broadcast of new deltas to connected clients. Binary protobuf framing with auto-reconnect and exponential backoff.
- **Source connectors** — `BaseSourcePoller` provides a memory-managed ingestion pipeline. Connectors (Jira, Salesforce) poll external APIs and push deltas into the gateway with backpressure-aware streaming. Dynamic registration via `createPoller()` factory — import a connector package and it auto-registers.
- **Source polling** — two strategies for change detection: cursor-based (e.g. Jira's `updated` field) and diff-based (snapshot comparison for APIs without cursor support). Both use memory-bounded accumulation with configurable chunk size and memory budget.
- **Actions** — imperative operations dispatched through the gateway to connector-registered `ActionHandler`s. Supports idempotency, discovery, and validation. Decoupled from sync — actions go to external systems, deltas come back.
- **Materialise** — opt-in `Materialisable` interface for adapters that can project deltas into queryable destination tables. Auto-invoked after flush. Hybrid column model: synced columns + `props JSONB` + `synced_at`.
- **SyncEngine extraction** — pure sync operations (`push`, `pull`, `syncOnce`) are extracted into a `SyncEngine` class. `syncOnce()` is an explicit pull-then-push transaction — ordering is structural, not a convention. `SyncCoordinator` composes the engine with scheduling and lifecycle.
- **Gateway decomposition** — `SyncGateway` is a thin facade composing `DeltaBuffer` (atomic snapshot pattern), `ActionDispatcher` (action routing + idempotency), `SchemaManager` (validation), and `FlushCoordinator` (fire-and-forget queue publish). `GatewayServer` uses a middleware pipeline with data-driven route dispatch.
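The HLC layout described in the list (48-bit wall clock plus 16-bit counter in one bigint) can be sketched as follows. The function names are hypothetical and the tick rule is a simplified local-only version; a full HLC also folds in timestamps received from other nodes.

```typescript
const COUNTER_BITS = BigInt(16);
const COUNTER_MASK = BigInt(0xffff);

// Pack wall-clock milliseconds into the high 48 bits and the counter into the low 16.
function packHlc(wallMs: number, counter: number): bigint {
  return (BigInt(wallMs) << COUNTER_BITS) | (BigInt(counter) & COUNTER_MASK);
}

function unpackHlc(hlc: bigint): { wallMs: number; counter: number } {
  return {
    wallMs: Number(hlc >> COUNTER_BITS),
    counter: Number(hlc & COUNTER_MASK),
  };
}

// Issue the next local timestamp, always strictly greater than the last one.
function nextHlc(last: bigint, nowMs: number): bigint {
  const { wallMs } = unpackHlc(last);
  if (nowMs > wallMs) return packHlc(nowMs, 0);
  // Clock stalled or moved backwards: keep the wall part and bump the counter.
  return last + BigInt(1);
}
```

Because the wall clock occupies the high bits, comparing two packed values as integers orders them by time first and by counter second.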
## Packages
| Package | Description |
|---------|-------------|
| [`@lakesync/core`](packages/core) | HLC timestamps, delta types, LWW conflict resolution, sync rules, Result type |
| [`@lakesync/client`](packages/client) | Client SDK: SyncEngine, SyncCoordinator, SyncTracker, LocalDB, transports, queues, initial sync |
| [`@lakesync/gateway`](packages/gateway) | Sync gateway: SyncGateway facade, DeltaBuffer, ActionDispatcher, SchemaManager, adapter-sourced pull |
| [`@lakesync/gateway-server`](packages/gateway-server) | Self-hosted gateway server (Node.js / Bun) with middleware pipeline, SQLite persistence, JWT auth, and WebSocket support |
| [`@lakesync/adapter`](packages/adapter) | Storage adapters: MinIO/S3, Postgres, MySQL, BigQuery, Composite, FanOut, Lifecycle, migration tooling |
| [`@lakesync/proto`](packages/proto) | Protobuf codec for the wire protocol |
| [`@lakesync/parquet`](packages/parquet) | Parquet read/write via parquet-wasm |
| [`@lakesync/catalogue`](packages/catalogue) | Iceberg REST catalogue client (Nessie-compatible) |
| [`@lakesync/compactor`](packages/compactor) | Parquet compaction, equality deletes, checkpoint generation |
| [`@lakesync/analyst`](packages/analyst) | Time-travel queries + analytics via DuckDB-WASM |
| [`@lakesync/react`](packages/react) | React bindings: reactive hooks for sync, queries, and mutations |
| [`@lakesync/connector-jira`](packages/connector-jira) | Jira Cloud source connector — polls issues, comments, and projects into the gateway |
| [`@lakesync/connector-salesforce`](packages/connector-salesforce) | Salesforce CRM source connector — polls accounts, contacts, opportunities, and leads |
| [`lakesync`](packages/lakesync) | Unified package with subpath exports for all of the above |
| App | Description |
|-----|-------------|
| [`gateway-worker`](apps/gateway-worker) | Cloudflare Workers: Durable Object gateway, R2 storage, JWT auth, sync rules, table sharding |
| [`todo-app`](apps/examples/todo-app) | Reference implementation: offline-first todo list with column-level sync |
| [`docs`](apps/docs) | Documentation site (Fumadocs + Next.js) |
## Backend Support
| Backend | Adapter | Status |
|---------|---------|--------|
| Cloudflare R2 | `LakeAdapter` (MinIO-compatible) | Production-ready |
| AWS S3 | `LakeAdapter` (MinIO-compatible) | Production-ready |
| MinIO | `LakeAdapter` | Production-ready |
| PostgreSQL | `DatabaseAdapter` + `Materialisable` | Implemented |
| MySQL | `DatabaseAdapter` (MySQLAdapter) | Implemented |
| BigQuery | `DatabaseAdapter` (BigQueryAdapter) | Implemented |
| Composite (route by table) | `CompositeAdapter` | Implemented |
| Fan-out (replicate writes) | `FanOutAdapter` | Implemented |
| Lifecycle (hot/cold tiers) | `LifecycleAdapter` | Implemented |
| Jira Cloud | `connector-jira` (source connector) | Implemented |
| Salesforce CRM | `connector-salesforce` (source connector) | Implemented |
## Status
Experimental, but real. All planned phases are implemented and tested: core sync engine, conflict resolution, client SDK, Cloudflare Workers gateway, self-hosted gateway server with WebSocket support, React bindings, compaction, checkpoint generation, sync rules (with extended comparison operators), initial sync, database adapters (Postgres, MySQL, BigQuery), composite routing, fan-out replication, lifecycle tiering, table sharding, adapter-sourced pull, source connectors (Jira, Salesforce) with memory-managed ingestion, imperative actions with discovery and idempotency, and the materialise protocol for queryable destination tables. API is not yet stable — expect breaking changes.
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup and contribution guidelines.
## Licence
Licensed under the [Apache Licence 2.0](LICENSE).