https://github.com/henryqingmo/kubekv
Go implementation of the Raft consensus algorithm with simulated RPC, leader election, and fault-tolerant log replication.
concensus etcd fault-tolerance go k8s kubernetes raft raft-consensus-algorithm
- Host: GitHub
- URL: https://github.com/henryqingmo/kubekv
- Owner: henryqingmo
- Created: 2026-04-03T23:46:06.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-05-07T06:21:23.000Z (about 1 month ago)
- Last Synced: 2026-05-07T08:24:25.211Z (about 1 month ago)
- Topics: concensus, etcd, fault-tolerance, go, k8s, kubernetes, raft, raft-consensus-algorithm
- Language: Go
- Homepage:
- Size: 8.56 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# KubeKV
A fault-tolerant, linearizable key-value store built on a from-scratch Raft consensus implementation in Go. Long-term goal: a minimal etcd — the kind of thing Kubernetes uses for cluster state.
## What it does
- **Raft consensus** — leader election, log replication, and term-based safety
- **Log compaction** — snapshotting with `InstallSnapshot` so logs don't grow unbounded
- **Linearizable KV** — versioned `Put`/`Get` operations; stale reads are rejected
- **Partition tolerance** — correctly handles split-brain, heals on reconnect, checks term + leadership before committing
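One plausible reading of "checks term + leadership before committing" is the commit rule from §5.4.2 of the Raft paper: a leader advances its commit index only to an entry that is stored on a majority and was created in the leader's current term. The sketch below illustrates that rule in isolation. The function name `nextCommitIndex` and its parameters are hypothetical and are not taken from KubeKV's source.

```go
package main

import (
	"fmt"
	"sort"
)

// nextCommitIndex returns the commit index a leader may advance to.
// matchIndex holds, for every server including the leader itself, the highest
// log index known to be stored on that server. termAt reports the term of the
// entry at a given index. All names here are illustrative assumptions.
func nextCommitIndex(matchIndex []int, termAt func(int) int, currentTerm, commitIndex int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Ints(sorted)
	// In ascending order, the value at position (n-1)/2 is the highest index
	// that a strict majority of servers has reached.
	majority := sorted[(len(sorted)-1)/2]
	// Only an entry from the current term may be committed by counting
	// replicas; older entries become committed indirectly once it is.
	if majority > commitIndex && termAt(majority) == currentTerm {
		return majority
	}
	return commitIndex
}

func main() {
	terms := []int{0, 1, 1, 2} // terms[i] is the term of log entry i; index 0 is unused
	termAt := func(i int) int { return terms[i] }
	fmt.Println(nextCommitIndex([]int{3, 3, 1}, termAt, 2, 0))
	fmt.Println(nextCommitIndex([]int{2, 2, 1}, termAt, 2, 0))
}
```

In the second call a majority holds entry 2, but that entry is from term 1, so the leader in term 2 leaves the commit index unchanged.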
## Architecture
```mermaid
flowchart TD
subgraph client["Client Layer"]
C["Clerk\nclient.go"]
end
subgraph kv["KV Layer — kvraft1/"]
KVS["KVServer\nserver.go"]
RSM["RSM reader loop\nrsm/rsm.go"]
Store["In-memory KV store\nversioned map[key]→{val, version}"]
end
subgraph raft["Raft Layer — raft1/"]
RL["Raft leader\nraft.go"]
RF1["Raft follower"]
RF2["Raft follower"]
P["Persister\nraft state + snapshot"]
end
C -->|"Get / Put RPC"| KVS
KVS -->|"Submit(op)\nblocks until committed"| RSM
RSM -->|"Start(cmd)\nstartCh → immediate AppendEntries"| RL
RL -->|"AppendEntries RPC"| RF1
RL -->|"AppendEntries RPC"| RF2
RL -.->|"InstallSnapshot RPC\nlagging follower"| RF1
RL -->|"applyCh ApplyMsg\ncommitted index + command"| RSM
RSM -->|"DoOp()"| Store
RSM -.->|"Snapshot()\nwhen log ≥ 90% maxraftstate"| RL
RL -->|"persist()"| P
```
Every write goes through the Raft log. A read blocks until the server has confirmed that it is still the leader, so a deposed leader cannot serve stale data. The RSM layer owns the `applyCh` drain loop: it executes committed ops, wakes blocked `Submit()` callers via a per-index channel, and triggers a snapshot once the log reaches 90% of `maxraftstate`. At that point the KV layer serializes its state, Raft discards the covered log prefix, and followers that have fallen behind the discarded prefix receive the snapshot via `InstallSnapshot`.
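The per-index wake-up can be sketched roughly as follows. This is a toy model, not KubeKV's code: the `RSM` and `ApplyMsg` types are simplified stand-ins, `nextIdx` replaces the index that a real `Start()` call would return, and the "Raft" here commits every command immediately. A real implementation would also have to handle the case where leadership is lost and a different command is applied at the awaited index.

```go
package main

import (
	"fmt"
	"sync"
)

// ApplyMsg is a simplified stand-in for what Raft delivers on applyCh.
type ApplyMsg struct {
	Index   int
	Command string
}

// RSM is a toy replicated-state-machine wrapper (illustrative only).
type RSM struct {
	mu      sync.Mutex
	waiters map[int]chan string // log index -> channel of the blocked Submit caller
	applyCh chan ApplyMsg
	nextIdx int // stand-in for the index a real Start() would return
}

func NewRSM() *RSM {
	r := &RSM{
		waiters: make(map[int]chan string),
		applyCh: make(chan ApplyMsg, 16),
		nextIdx: 1,
	}
	go r.reader()
	return r
}

// reader drains applyCh, "executes" each committed op, and wakes the
// Submit caller registered for that log index, if there is one.
func (r *RSM) reader() {
	for msg := range r.applyCh {
		result := "applied:" + msg.Command
		r.mu.Lock()
		ch, ok := r.waiters[msg.Index]
		delete(r.waiters, msg.Index)
		r.mu.Unlock()
		if ok {
			ch <- result // buffered with capacity 1, so this never blocks
		}
	}
}

// Submit registers a waiter before handing the command over, then blocks
// until the reader loop reports that the command was applied.
func (r *RSM) Submit(cmd string) string {
	r.mu.Lock()
	idx := r.nextIdx
	r.nextIdx++
	ch := make(chan string, 1)
	r.waiters[idx] = ch
	r.mu.Unlock()

	// Stand-in for Raft: pretend the entry is committed right away.
	r.applyCh <- ApplyMsg{Index: idx, Command: cmd}
	return <-ch
}

func main() {
	r := NewRSM()
	fmt.Println(r.Submit("put x=1"))
}
```

Registering the waiter before the command is handed off avoids a race in which the reader loop applies the entry before anyone is listening for it.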
## Key design decisions
**Versioned puts over CAS:** every key carries a monotonic version. `Put(key, val, version)` is rejected if the supplied version does not match the stored one, which gives clients optimistic concurrency without distributed locks. This is close in spirit to etcd, where a transaction can compare a key's revision before writing.
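A minimal sketch of a versioned store along these lines is shown below. The `Store` type, the error names, and the convention that version `0` means "create only if the key is absent" are assumptions made for illustration; the README does not specify how KubeKV handles missing keys.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical error values; KubeKV's actual error type may differ.
var (
	ErrNoKey   = errors.New("no such key")
	ErrVersion = errors.New("version mismatch")
)

type entry struct {
	val     string
	version uint64
}

// Store is a toy single-node versioned map (no Raft, no locking).
type Store struct {
	data map[string]entry
}

func NewStore() *Store { return &Store{data: make(map[string]entry)} }

// Get returns the value and its current version.
func (s *Store) Get(key string) (string, uint64, error) {
	e, ok := s.data[key]
	if !ok {
		return "", 0, ErrNoKey
	}
	return e.val, e.version, nil
}

// Put writes val only if version matches the stored version.
// Assumed convention: version 0 creates a key that does not exist yet.
func (s *Store) Put(key, val string, version uint64) error {
	e, ok := s.data[key]
	if !ok {
		if version != 0 {
			return ErrNoKey
		}
		s.data[key] = entry{val: val, version: 1}
		return nil
	}
	if e.version != version {
		return ErrVersion
	}
	s.data[key] = entry{val: val, version: version + 1}
	return nil
}

func main() {
	s := NewStore()
	fmt.Println(s.Put("a", "1", 0)) // creates the key at version 1
	fmt.Println(s.Put("a", "2", 0)) // rejected: stored version is 1, not 0
	_, v, _ := s.Get("a")
	fmt.Println(s.Put("a", "2", v)) // accepted: caller read the latest version
}
```

Two clients that read the same version and both try to write will see exactly one `Put` succeed; the loser re-reads and retries.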
**Snapshot-driven log truncation:** when the KV layer signals a snapshot, Raft replaces the log prefix with `(lastIncludedIndex, lastIncludedTerm)` and ships the snapshot to lagging followers via `InstallSnapshot`. etcd relies on the same idea to bring far-behind or newly added members up to date without replaying the full history.
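The index arithmetic behind prefix truncation is easy to get wrong, so here is a small sketch of one common layout. It assumes a sentinel entry at position 0 that carries `lastIncludedTerm`; the `Log` type and its methods are hypothetical and may not match how `raft.go` stores its log.

```go
package main

import "fmt"

type Entry struct {
	Term    int
	Command string
}

// Log stores entries by absolute index. entries[0] is a sentinel whose Term
// is lastIncludedTerm; entries[i] has absolute index lastIncludedIndex+i.
type Log struct {
	lastIncludedIndex int
	entries           []Entry
}

// LastIndex returns the absolute index of the newest entry.
func (l *Log) LastIndex() int { return l.lastIncludedIndex + len(l.entries) - 1 }

// Term returns the term at an absolute index, or false if that index has
// been compacted away or does not exist yet.
func (l *Log) Term(index int) (int, bool) {
	pos := index - l.lastIncludedIndex
	if pos < 0 || pos >= len(l.entries) {
		return 0, false
	}
	return l.entries[pos].Term, true
}

// Compact discards every entry up to and including index. The entry at
// index becomes the new sentinel and keeps only its term.
func (l *Log) Compact(index int) bool {
	pos := index - l.lastIncludedIndex
	if pos <= 0 || pos >= len(l.entries) {
		return false // already compacted, or beyond the end of the log
	}
	// Copy into a fresh slice so the discarded prefix can be garbage collected.
	kept := make([]Entry, len(l.entries)-pos)
	copy(kept, l.entries[pos:])
	kept[0].Command = ""
	l.entries = kept
	l.lastIncludedIndex = index
	return true
}

func main() {
	l := &Log{entries: []Entry{{}, {1, "a"}, {1, "b"}, {2, "c"}}}
	l.Compact(2)
	fmt.Println(l.lastIncludedIndex, l.LastIndex(), len(l.entries))
}
```

A follower whose next needed index is at or below `lastIncludedIndex` can no longer be served from the log, which is the case where the leader would fall back to `InstallSnapshot`.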
**Fast path on `Start()`:** `Start()` signals a dedicated channel that immediately triggers `AppendEntries` to all peers, rather than waiting for the next heartbeat tick. This cuts commit latency under load, because a new entry no longer waits up to one heartbeat interval before replication begins.
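One way to build such a fast path is a capacity-1 channel with a non-blocking send, so that a burst of `Start()` calls collapses into a single wake-up. The sketch below is an assumption about the shape of the mechanism, not a copy of KubeKV's loop; `Leader`, `replicator`, and the `sent` channel (which merely records why a round was triggered) are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// Leader holds only what this sketch needs (illustrative, not KubeKV's struct).
type Leader struct {
	startCh chan struct{} // capacity 1: bursts of Start() coalesce into one signal
	sent    chan string   // records why each replication round was triggered
}

// Start signals the replicator without ever blocking the caller.
func (l *Leader) Start() {
	select {
	case l.startCh <- struct{}{}:
	default: // a wake-up is already pending; nothing more to do
	}
}

// replicator sends a round of AppendEntries (simulated here) either when
// Start() signals or when the heartbeat ticker fires, whichever comes first.
func (l *Leader) replicator(heartbeat time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(heartbeat)
	defer ticker.Stop()
	for {
		select {
		case <-l.startCh:
			l.sent <- "start"
			ticker.Reset(heartbeat) // the round just sent also serves as a heartbeat
		case <-ticker.C:
			l.sent <- "heartbeat"
		case <-stop:
			return
		}
	}
}

// firstTrigger runs the loop with a very long heartbeat and reports what
// caused the first replication round.
func firstTrigger() string {
	l := &Leader{startCh: make(chan struct{}, 1), sent: make(chan string, 4)}
	stop := make(chan struct{})
	defer close(stop)
	go l.replicator(time.Hour, stop)
	l.Start()
	select {
	case reason := <-l.sent:
		return reason
	case <-time.After(2 * time.Second):
		return "timeout"
	}
}

func main() {
	fmt.Println(firstTrigger())
}
```

Resetting the ticker after a fast-path round is an optional refinement: it avoids sending a redundant heartbeat right after entries were already shipped.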
## Running
```bash
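cd src/raft1
# -run takes a regular expression, so each pattern below selects
# every test whose name contains it.
go test -run TestBasicAgree
go test -run TestSnapshotInstall
go test -run TestConcurrentClients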
```
## Where this is headed
The immediate target is a standalone gRPC server that speaks a subset of the etcd v3 API — enough to swap in as a Kubernetes datastore for a small cluster. That requires:
- [ ] gRPC transport replacing the in-process RPC shim
- [ ] Persistent WAL on disk (currently in-memory via `Persister`)
- [ ] Watch streams (key-range notifications on commit)
- [ ] Membership changes (Raft joint consensus, §6 of the paper)
- [ ] Read index / lease-based reads for lower-latency queries
## References
- [In Search of an Understandable Consensus Algorithm](https://raft.github.io/raft.pdf) — Ongaro & Ousterhout
- [etcd internals](https://etcd.io/docs/v3.5/learning/design-learner/) — membership and storage design