{"id":50748849,"url":"https://github.com/fndome/sws","last_synced_at":"2026-06-10T23:30:47.060Z","repository":{"id":355883430,"uuid":"1227893973","full_name":"fndome/sws","owner":"fndome","description":"io_uring based Single Worker Server in Zig","archived":false,"fork":false,"pushed_at":"2026-06-02T05:53:19.000Z","size":987,"stargazers_count":0,"open_issues_count":3,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-02T06:10:52.230Z","etag":null,"topics":["fiber","http","io-uring","ws","zig"],"latest_commit_sha":null,"homepage":"","language":"Zig","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fndome.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-03T10:02:02.000Z","updated_at":"2026-06-02T05:53:19.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/fndome/sws","commit_stats":null,"previous_names":["fndome/sws"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/fndome/sws","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fndome%2Fsws","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fndome%2Fsws/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fndome%2Fsws/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fndome%2Fsws/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fndome","dow
nload_url":"https://codeload.github.com/fndome/sws/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fndome%2Fsws/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34175887,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-10T02:00:07.152Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fiber","http","io-uring","ws","zig"],"created_at":"2026-06-10T23:30:46.383Z","updated_at":"2026-06-10T23:30:47.048Z","avatar_url":"https://github.com/fndome.png","language":"Zig","funding_links":[],"categories":[],"sub_categories":[],"readme":"# sws — Single Worker Server\n\n[中文文档](README_CN.md)\n\n`io_uring` based Single Worker Server (HTTP + WebSocket) on Linux, in Zig 0.16.0.\n\n## Project Goal\n\n`sws` is not just a `req/s` demo. It is a small Linux-only network runtime built\naround Zig, `io_uring`, fibers, explicit buffer ownership, and one IO-thread\nevent loop. 
The immediate goal is to make the HTTP/WebSocket/DNS/client paths\ncorrect, measurable, and easy to audit before chasing larger benchmark numbers.\n\nCurrent scope:\n\n- HTTP/1.1 server: `GET`, `POST`, `PUT`, `PATCH`, `DELETE`, JSON/text/html\n  responses, request body helpers, middleware, keep-alive boundaries.\n- WebSocket: HTTP/1.1 upgrade, frame parse/write, ping/pong/close handling.\n- DNS and outbound HTTP client: async UDP DNS, small TTL cache, keep-alive\n  connection reuse.\n- Linux + `io_uring` only. TLS/HTTPS/WSS via pure-Zig tls.zig library, bundled in `lib/`.\n  Enable with `-Denable-tls=true`.\n\nPerformance numbers should be read together with the benchmark mode. The local\nself-test is a correctness smoke test: client and server share one machine and\nthe default benchmark is only `50 x 100` keep-alive requests. Use\n`-Doptimize=ReleaseFast` and explicit benchmark environment variables before\ncomparing throughput:\n\n```bash\nzig build -Doptimize=ReleaseFast\nSWS_BENCH_CONNS=500 SWS_BENCH_REQS_PER_CONN=1000 ./zig-out/bin/im-bench\n```\n\n```\nIO thread (io_uring Ring A + fiber):\n  ├── accept/read/write CQE → fiber → handler → respond\n  ├── drain user SubmitQueues\n  ├── drain Next.go() ringbuffer tasks\n  ├── drain DeferredResponse / InvokeQueue → respond\n  ├── drainTick (DNS tick + invoke.drain + tick_hooks)\n  └── TTL incremental scan (StackPool live list)\n\nWorker pool (optional, offload CPU/GPU/blocking I/O):\n  └── Next.submit() → worker thread → compute → InvokeQueue → IO thread drains\n```\n\nHandlers run as **fibers on the IO thread** by default.\n- `Next.go()` — fiber on IO thread, zero thread switch. Use for DB io_uring, async I/O.\n- `Next.submit()` — worker pool. Use **only for CPU-intensive computation** that would block.\n\n## Concurrency Model (Must Read Before Code Review)\n\nsws is a **single-threaded** system with explicit handoff points. This is the\nsingle most important fact about the codebase. 
Internalizing it prevents an\nentire class of false bug reports.\n\n### The One Rule\n\n```\nIO thread owns everything. Worker threads own nothing except their own stack.\n\nIO thread ──[submit]──→ mutex queue ──→ worker pops task\nWorker    ──[invoke]──→ CAS list    ──→ IO thread drains next tick\n              ↑                           ↑\n         one-way handoff             one-way handoff\n```\n\nThere is **no shared mutable state** between the IO thread and worker threads.\nThey communicate only through two unidirectional handoff queues.\n\n### Code Review Checklist\n\n- **Do NOT add atomics.** `@atomicStore`, `@cmpxchgStrong`, `@atomicLoad` have\n  no place in IO-thread-only data paths. They don't protect anything (there is\n  no concurrent access) and actively mislead future readers into thinking\n  multi-threaded access exists. Use plain `field = value` / `if field != 0`.\n\n- **Do NOT add mutexes** to IO-thread data structures (StackSlot, Connection,\n  BufferPool, LargeBufferPool, DnsResolver, WsServer). They are accessed by\n  exactly one thread.\n\n- **WorkerPool internals** (`stack_freelist`, `stack_pool`) are shared among\n  workers. With the default `initPool4NextSubmit(1)`, there is exactly one\n  worker — no concurrency. The race only exists with `n \u003e 1`.\n\n- **The `Next.go()` ringbuffer** (`SubmitQueue`) is IO-thread push, IO-thread\n  pop (`drainNextTasks`). Single-threaded despite the \"SPSC\" name.\n\n- **`shared_fiber_active`** is read and written only by the IO thread. No\n  atomic needed. The per-task-stack wrappers (`httpTaskCleanup`,\n  `wsTaskCleanup`) do not touch it.\n\n- **When auditing code**, start by verifying which execution context each\n  piece of data lives in. If both ends are in the IO thread, any concern\n  about \"thread safety\" is a false alarm. If a worker thread touches it,\n  trace the handoff — is it through `submit()` (mutex) or `invoke.push()`\n  (CAS)? 
If neither, it's a bug.\n\n### Common Mistakes in Past Audits\n\n| Mistake | Why Wrong |\n|---------|-----------|\n| \"`shared_fiber_active` should be atomic\" | IO thread only. No other thread reads or writes it |\n| \"`LargeBufferPool.freelist_top` needs a lock\" | IO thread only. Worker never touches this pool |\n| \"`ensureWriteBuf` races with `submitWrite`\" | Both run on IO thread, sequentially |\n| \"`ConnState` transitions need atomics\" | IO thread only. State changes happen in event loop order |\n\n## Critical Usage Warning\n\n**Never perform filesystem reads or writes through the kernel block layer in\nhandler code.** The IO thread's io_uring event loop runs on a single thread.\nAny operation that blocks the calling thread will stall the entire server,\nincluding all active connections.\n\n### Storage Backends You Must NOT Use via File I/O\n\nThese backends route I/O through the kernel block layer and will block the IO\nthread, even when mounted as a local path:\n\n- **FUSE** — any filesystem mounted via FUSE (s3fs, gcsfuse, etc.)\n- **Longhorn v1** — kernel iSCSI initiator → engine → replica; synchronous\n  replication quorum inside the kernel I/O path\n- **Ceph RBD (kernel)** — kernel block device waits for OSD acknowledgements\n- Any network-attached block device mounted through the standard kernel\n  filesystem stack (NFS, iSCSI, DRBD with synchronous mode)\n\n### Storage Backends That Are Safe\n\n- **local_pv** — directly attached NVMe/SSD with low-latency page cache writes\n- **SPDK-based user-space storage** — storage engines that bypass the kernel\n  block layer entirely using polled-mode NVMe drivers and vhost-user shared\n  memory. 
Examples: **OpenEBS Mayastor**, Longhorn v2 (SPDK backend).\n\nSPDK storage is safe because the I/O path never enters the kernel — data moves\nDMA-direct from NVMe to user-space ring buffers, and the polled-mode driver\nnever blocks the calling thread.\n\n### For Remote Object Storage\n\nUse **non-blocking network sockets at the io_uring level** — issue `OP_SEND` /\n`OP_RECV` to the remote API endpoint directly:\n\n```\nhandler → OP_SEND/OP_RECV → S3/OSS/MinIO HTTP API\n           ↑ io_uring native, non-blocking\n```\n\nDo NOT mount S3/OSS via FUSE and read/write files.\n\n## Requirements\n\n- Linux 5.1+ (io_uring)\n- Zig 0.16.0\n\n## Quick Start\n\n```bash\ngit clone https://github.com/fndome/sws\ncd sws\nzig build run\n```\n\n## Use as a Library\n\n```zig\nconst sws = @import(\"sws\");\n\npub fn main() !void {\n    var server = try sws.AsyncServer.init(alloc, io, \"0.0.0.0:9090\", null, 0);\n    defer server.deinit();\n\n    server.GET(\"/hello\", myHandler);\n    try server.run();\n}\n```\n\n## Architecture\n\n### Source Layout (refactored)\n\n```\nsrc/http/\n├── async_server.zig   (526)  facade — init/deinit + public API forwarding\n├── event_loop.zig     (215)  run / dispatchCqes / drain* / TTL\n├── http_routing.zig   (310)  use / GET/POST / processBodyRequest + fiber dispatch\n├── http_response.zig  (163)  respond / respondJson / respondZeroCopy\n├── http_fiber.zig     (182)  HttpTaskCtx + httpTaskExec/Cleanup/Complete\n├── http_body.zig      (110)  submitBodyRead / onBodyChunk / onStreamRead\n├── ws_handler.zig     (381)  tryWsUpgrade / onWsFrame / sendWsFrame / write queue\n├── ws_fiber.zig       ( 50)  WsTaskCtx + wsTaskExec/Cleanup/Complete\n├── tcp_accept.zig     (114)  onAcceptComplete / allocFixedIndex\n├── tcp_read.zig       (367)  submitRead / onReadComplete (header parse + body route)\n├── tcp_write.zig      (128)  submitWrite / onWriteComplete\n├── connection_mgr.zig ( 82)  closeConn / getConn / nextUserData\n├── hook_system.zig    ( 48)  
DeferredNode / addHook* / sendDeferredResponse\n├── connection.zig     ( 51)  Connection type\n├── context.zig        (118)  Context type\n├── types.zig          (  5)  Middleware / Handler types\n├── http_helpers.zig   ( 87)  request parsing utilities\n└── middleware_store.zig( 28)  MiddlewareStore\n\nsrc/client/\n├── http_client.zig    (1132) HttpClient — dedicated-thread, fiber-driven HTTP client\n├── ring.zig           ( 154) RingB — io_uring ring + DNS + TinyCache + InvokeQueue\n├── tiny_cache.zig     ( 267) per-host keep-alive connection pool\n├── dns.zig            ( 184) c-ares async DNS adapter\n└── README.md                 → [Why sws ships its own io_uring HTTP client](src/client/README.md)\n```\n\nExtracted from a 2725-line God Object in 5 sessions. Each module ≤381 lines, single responsibility. `async_server.zig` is now 526 lines of pure struct definition + init/deinit + forwarding shell.\n\n### Single IO thread + fiber\n\nThe entire event loop runs on **one IO thread**. Handlers execute as **fibers** (user-space coroutines) on the same thread.\n\n```\nIO thread (single):\n  io_uring.submit_and_wait(1)\n    → CQE dispatch (via StackPool sticker)\n    → fiber → handler → ctx.text/json/html\n    → drainPendingResumes (fiber resume queue)\n    → drainNextTasks (Next.go ringbuffer tasks)\n    → drainTick (DNS tick + invoke.drain + tick_hooks)\n    → TTL scan (StackPool live list, incremental)\n    → loop\n```\n\nNo background threads unless you call `server.initPool4NextSubmit(n)`.\n\n### StackPool — O(1) connection pool\n\nConnections are stored in a **pre-allocated array** (not a hash map). 
O(1) acquire/release via freelist.\n\n```\nStackPool\u003cStackSlot, 1_048_576\u003e\n  ├── slots: [1M]StackSlot — contiguous, cache-line-aligned\n  ├── freelist: [1M]u32 — O(1) pop/push\n  ├── live: []u32 — active slot indices (TTL scan source)\n  └── warmup() — touch all pages to eliminate cold-start faults\n```\n\n#### StackSlot (384 bytes, 5 cache lines)\n\nEach connection slot is split across independent cache lines for contention-free hot-path access:\n\n```\nline1 ( 64B): fd, gen_id, state, write_offset, req_count — CQE dispatch (hottest)\nline2 ( 64B): conn_id, last_active_ms, active_list_pos — TTL scanning\nline3 ( 64B): fiber_context, large_buf_ptr — async anchors, Worker Pool, oversized body\nline4 (128B): writev_in_flight, response_buf, write_iovs, ws_write_queue — write path (low frequency)\nline5 ( 64B): sentinel (0x53574153) + workspace union — HTTP/WS/Compute view\n```\n\n**Ghost event defense:** `user_data = (gen_id \u003c\u003c 32) | idx`. After close, gen_id is zeroed. Any in-flight CQE arriving after close fails the gen_id match and is silently discarded.\n\n**Workspace switching:** The `line5.ws` union switches between `HttpWork`, `WsWork`, and `ComputeWork` views depending on connection state — no heap allocation for protocol parsing state.\n\n### Ring A + Dedicated Thread for Outbound\n\n**Ring A** (built-in): the main server's `io_uring` ring — accept, connection read/write, DNS, invoke.\n\n**Outbound rings** (Ring B, HTTP client): each runs on its own dedicated OS thread with its own `io_uring` ring. The IO thread is never interrupted for outbound I/O. 
See [src/client/README.md](src/client/README.md) for why the HTTP client is built-in.\n\n```\nRing A (main server, IO thread):\n  ├── accept / read / write / close\n  ├── io_registry (client callbacks)\n  ├── dns_resolver (async UDP DNS)\n  └── rs.invoke (cross-thread push → IO thread callback)\n\nRing B (HTTP client, dedicated thread):\n  ├── ring.submit_and_wait(1)\n  ├── tick → dns.tick + invoke.drain + copy_cqes + dispatch\n  ├── IORegistry\n  ├── DnsResolver\n  ├── InvokeQueue\n  └── TinyCache (per-host keep-alive pool)\n```\n\n### Init\n\n```zig\nvar server = try AsyncServer.init(alloc, io, \"0.0.0.0:9090\", app_ctx, fiber_stack_size_kb);\n//                                                                    ↑ 0 = 256KB\n```\n\nFirst handler/middleware registration calls `ensureNext()` → creates `Next` (ringbuffer) + `setDefault()`.\n\nInternally, `AsyncServer.init()` creates:\n- `pool`: StackPool — O(1) contiguous connection array\n- `large_pool`: LargeBufferPool(64) — 64 × 1MB blocks for oversized requests (\u003e32KB)\n- `rs`: RingShared — single ring shared resource (ring + registry + invoke)\n- `io_registry`: IORegistry — outbound client connection registry\n- `dns_resolver`: DnsResolver — async UDP DNS with TTL cache\n\nTo add the built-in HTTP client:\n\n```zig\n// RingB with 1s built-in TinyCache TTL:\nvar ring_b = try sws.HttpRing.init(alloc, io, server.ring.fd, 1000);\ndefer ring_b.deinit();\n\n// HttpClient auto-uses RingB's TinyCache — keep-alive, zero-config\nvar http_client = try sws.HttpClient.init(alloc, \u0026ring_b);\ntry http_client.start(); // spawn dedicated ring thread\ndefer http_client.deinit();\n```\n\n### Handler — Synchronous (on IO thread)\n\n```zig\nfn hello(allocator: Allocator, ctx: *Context) anyerror!void {\n    ctx.text(200, \"hello\");\n}\n```\n\n### Handler — `Next.go` (fiber, IO thread, no thread switch)\n\nFor async I/O (DB io_uring, HTTP client):\n\n```zig\nconst Ctx = struct { allocator: Allocator, resp: 
*DeferredResponse };\n\nfn exec(c: *Ctx, complete: *const fn (?*anyopaque, []const u8) void) void {\n    defer c.allocator.destroy(c);\n    defer c.allocator.destroy(c.resp);\n    c.resp.json(200, \"[{\\\"id\\\":1}]\");\n    complete(c, \"\");\n}\n\nfn myHandler(allocator: Allocator, ctx: *Context) anyerror!void {\n    const s: *AsyncServer = @ptrCast(@alignCast(ctx.server.?));\n    const resp = try allocator.create(DeferredResponse);\n    resp.* = .{ .server = s, .conn_id = ctx.conn_id, .allocator = allocator };\n    ctx.deferred = true;\n    Next.go(Ctx, .{ .allocator = allocator, .resp = resp }, exec);\n}\n```\n\n### Handler — `Next.submit` (worker pool, thread switch)\n\nFor offload work (crypto, compression, LLM/GPU inference, blocking I/O):\n\n```zig\nconst Ctx = struct { allocator: Allocator, resp: *DeferredResponse };\n\nfn exec(c: *Ctx, complete: *const fn (?*anyopaque, []const u8) void) void {\n    defer c.allocator.destroy(c);\n    defer c.allocator.destroy(c.resp);\n    // Offload work here (CPU/GPU/blocking I/O)...\n    c.resp.json(200, \"{\\\"done\\\": true}\");\n    complete(c, \"\");\n}\n\nfn myHandler(allocator: Allocator, ctx: *Context) anyerror!void {\n    const s: *AsyncServer = @ptrCast(@alignCast(ctx.server.?));\n    const resp = try allocator.create(DeferredResponse);\n    resp.* = .{ .server = s, .conn_id = ctx.conn_id, .allocator = allocator };\n    ctx.deferred = true;\n    Next.submit(Ctx, .{ .allocator = allocator, .resp = resp }, exec);\n}\n```\n\n### Worker pool (for Next.submit)\n\n```zig\ntry server.initPool4NextSubmit(1); // 1 worker thread (recommended)\n```\n\n**Recommendations:**\n- `1` — default, sufficient for crypto, compression\n- `N/2` (e.g. 
4 on 8-core) — sustained LLM/GPU inference or blocking I/O\n\n### DeferredResponse\n\nSends HTTP response from any thread (CAS-based lock-free):\n\n```zig\nresp.json(200, \"{\\\"ok\\\":true}\");\nresp.text(200, \"plain\");\n```\n\n### Deferred Hooks, Tick Hooks\n\nExecute custom logic before each deferred response is sent, on the IO thread.\nEssential for MMORPG / real-time use cases (update game state, leaderboard, broadcast):\n\n```zig\nfn updateGameState(server: *AsyncServer, node: *DeferredNode) void {\n    const world: *GameWorld = @ptrCast(@alignCast(server.app_ctx.?));\n    world.update(node.body);\n}\n\ntry server.addHookDeferred(updateGameState);\n```\n\n**Rules:**\n- Hooks run in registration order on the IO thread — safe for IO-thread-exclusive data\n- `node.body` is valid during hook execution; do NOT free it\n- Do NOT store `node` pointer — the node is destroyed after the hook returns\n- Must not panic (log errors instead)\n\n#### Room Auto-Battle Example\n\nRooms with countdown → auto-battle for hundreds of players. Two hooks cooperate:\n`addHookTick` checks deadlines every loop iteration (no deferred node needed);\n`addHookDeferred` processes incoming player commands.\nBattle CPU work offloaded via `Next.submit`. 
Zero locks — all state on IO thread.\n\n```zig\nconst Room = struct {\n    id: u64,\n    state: enum { waiting, fighting, settle },\n    deadline: i64,                  // monotonic timestamp\n    teams: [2]std.ArrayList(*Player),\n};\n\nconst Player = struct { id: u64, hp: u32, atk: u32 };\n\nconst BattleCtx = struct {\n    blue_team: []PlayerSnapshot,\n    red_team:  []PlayerSnapshot,\n};\n\nconst PlayerSnapshot = struct { hp: u32, atk: u32 };\n```\n\n```zig\nfn roomTick(server: *AsyncServer) void {\n    const app: *GameApp = @ptrCast(@alignCast(server.app_ctx.?));\n    for (app.rooms.items) |*room| {\n        if (room.state == .waiting and server.monotonic_ms() \u003e= room.deadline) {\n            room.state = .fighting;\n            startBattle(server, room);\n        }\n    }\n}\n\nfn roomCommand(server: *AsyncServer, node: *DeferredNode) void {\n    const app: *GameApp = @ptrCast(@alignCast(server.app_ctx.?));\n    app.processCommand(node.body);  // join / ready / action\n}\n\nfn startBattle(server: *AsyncServer, room: *Room) void {\n    const ctx = server.allocator.create(BattleCtx) catch return;\n    ctx.blue_team = snapshotTeam(\u0026room.teams[0], server.allocator) catch return;\n    ctx.red_team  = snapshotTeam(\u0026room.teams[1], server.allocator) catch return;\n    Next.submit(BattleCtx, ctx, doBattle);\n}\n\nfn doBattle(ctx: *BattleCtx, complete: *const fn (?*anyopaque, []const u8) void) void {\n    const result = simulateCombat(ctx.blue_team, ctx.red_team);\n    var buf: [4096]u8 = undefined;\n    const json = result.toJson(\u0026buf);\n    server.sendDeferredResponse(room_id, 200, .json, json);\n    _ = complete;\n}\n\ntry server.addHookTick(roomTick);        // tick: fires every IO loop\ntry server.addHookDeferred(roomCommand); // deferred: fires per-player command\n```\n\n### Next.go / Next.submit\n\n```zig\nNext.go(Ctx, ctx, exec);       // fiber on IO thread (io_uring I/O)\nNext.submit(Ctx, ctx, exec);   // worker pool (offload 
work)\n```\n\nBoth are static. `Next.go` works out of the box (auto `setDefault` on first route). `Next.submit` requires `server.initPool4NextSubmit(n)`.\n\n#### GPU / Heavy Compute\n\nGPU compute uses `Next.submit` — worker thread calls CUDA / CANN / Vulkan runtime.\nio_uring direct dispatch for GPU is blocked on Linux kernel drivers (missing\n`IORING_OP_URING_CMD` for compute queues, NVIDIA / Huawei not yet shipped).\n\nOnce drivers add it, `IORegistry` handles GPU with zero code changes —\nsame `register(id, ptr, on_cqe)` → submit SQE → dispatch CQE pattern.\n\n**Current: fiber + worker pool**\n\nWorker pool always supports fiber. GPU task calls `Fiber.workerYield(poll, ctx)`\nafter submitting a kernel, freeing the worker thread to process other tasks while\nthe GPU runs. The worker tick polls parked fibers and resumes when the kernel completes.\n\n```zig\n// CPU task — no yield, runs to completion\nNext.submit(CpuCtx, ctx, struct {\n    fn exec(c: *CpuCtx, complete: ...) void {\n        const result = heavyCompute(c.input);\n        complete(c, result);\n    }\n}.exec);\n\n// GPU task — MUST call workerYield after submitting kernel\n//                                 ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓\nNext.submit(GpuCtx, ctx, struct {\n    fn exec(c: *GpuCtx, complete: ...) 
void {\n        cudaLaunchKernel(kernel, stream, args);\n        Fiber.workerYield(            // ← THIS LINE makes it a GPU task\n            struct { fn poll(s: *anyopaque) bool {\n                return cuStreamQuery(@ptrCast(@alignCast(s))) == CUDA_SUCCESS;\n            }}.poll,\n            @ptrCast(stream),\n        );\n        // resume point — GPU done\n        complete(c, output);\n    }\n}.exec);\n```\n\n**The only difference between CPU and GPU:** GPU tasks call `Fiber.workerYield`.\nWithout it, the worker thread blocks synchronously until the kernel completes,\ndefeating fiber multiplexing.\n\n\u003e ⚠️ **GPU tasks MUST use `Next.submit`, never `Next.go`.**\n\u003e\n\u003e `Next.go` runs on the IO thread. Two failure modes:\n\u003e - **Without `workerYield`:** `cuStreamSynchronize` blocks the IO thread —\n\u003e   io_uring CQE processing stops, entire server freezes.\n\u003e - **With `workerYield`:** fiber yields correctly, IO thread stays alive — but\n\u003e   the fiber never wakes up. The IO thread has no poll tick; it only responds to\n\u003e   io_uring CQEs. GPU kernels don't produce CQEs, so the IO thread never learns\n\u003e   the kernel finished.\n\u003e\n\u003e Worker threads have a built-in poll tick (`while poll_fn() try resume`) which\n\u003e is why GPU works there: `workerYield` → park → tick → poll → resume.\n\n**IMPORTANT: GPU uses `initPool4NextSubmit(1)`.**\nGPU drivers are async internally — one worker + fiber can submit N streams\nand poll for completion. No extra thread pool needed. 
io_uring not yet\nsupported for GPU compute (kernel driver gap).\n\n### RingShared\n\n`RingShared` is the materialization of a single io_uring ring + single thread — injected into server and any outbound client, all equal.\n\n```zig\nconst rs = server.rs;  // { ring, registry, invoke, io_tid }\n// Any client is injected equally:\nvar client = try RingSharedClient.init(alloc, rs, ...);\nvar http   = try HttpClient.init(alloc, ring_b, cache);\n```\n\n- `rs.ringPtr()` / `rs.registryPtr()` — IO-thread assertion guard (non-IO thread access → @panic)\n- `rs.invoke.push()` — any-thread-safe CAS callback (worker → IO thread)\n\n### RingSharedClient\n\nio_uring-driven outbound TCP client. Glue layer for integrating NATS / Redis / HTTP client\nlibraries into sws's IO thread — no separate runtime, no locks.\n\n```zig\nconst RingSharedClient = @import(\"sws\").RingSharedClient;\n\nfn onData(ctx: ?*anyopaque, data: []u8) void {\n    const nats: *NatsClient = @ptrCast(@alignCast(ctx));\n    nats.feed(data);\n}\n\nfn onClose(ctx: ?*anyopaque) void {\n    const nats: *NatsClient = @ptrCast(@alignCast(ctx));\n    nats.discard();\n}\n\n// In main(), before server.run():\nvar cs = try RingSharedClient.init(allocator, server.rs, onData, onClose, nats_ctx);\ndefer cs.deinit();\ntry cs.connect(\"127.0.0.1\", 4222);\n\n// Send data (queued, submitted via io_uring)\ntry cs.write(\"PUB subject 5\\r\\nhello\\r\\n\");\ncs.close();  // graceful\n```\n\n- All I/O on sws IO thread — `onData` / `onClose` run in the same context as hooks\n- `write()` queues data; pending writes auto-flushed as io_uring CQEs arrive\n- Protocol layer (NATS / Redis / HTTP) only needs `feed([]u8)` and `write([]const u8)`\n- Multiple clients per server; user_data uses a dedicated high bit to avoid collisions\n\n### TinyCache (built into RingB)\n\nSingle-entry TTL connection cache for outbound protocols. **Owned by RingB** — all\nlifecycle (init, tick, evict, deinit) is managed automatically. 
Users get connection\nreuse for free with `HttpClient`.\n\n- Same host:port connections auto-reused within TTL window\n- Expired entries auto-evicted by `RingB.tick()` each event loop iteration\n- Connect phase allows retries; read/write phase forbids retries (kernel TCP stack guarantees SQE-level writes)\n\n### Pipe\n\nAdapts RingSharedClient's push model to a pull model (`reader.read` / `writer.write`).\nEnables synchronous-protocol libraries (pgz, myzql) to run directly on the IO thread\nvia fiber yield/resume — no worker threads, no locks.\n\n```zig\n// In main(), after AsyncServer.init() and before server.run():\nconst Pipe = @import(\"sws\").Pipe;\nconst RingSharedClient = @import(\"sws\").RingSharedClient;\n\nfn onData(ctx: ?*anyopaque, data: []u8) void {\n    const p: *Pipe = @ptrCast(@alignCast(ctx));\n    p.feed(data) catch {};\n}\n\nfn onClose(ctx: ?*anyopaque) void {\n    const p: *Pipe = @ptrCast(@alignCast(ctx));\n    p.reset();\n}\n\nvar cs = try RingSharedClient.init(allocator, server.rs, onData, onClose, \u0026pipe);\nvar pipe = try Pipe.init(allocator, cs);\ndefer pipe.deinit();\n\ntry cs.connect(\"localhost\", 5432);\n// ... 
wait for connect (yield) ...\n\n// Any protocol lib with anytype reader/writer works:\n// var conn = try pgz.Connection.init(allocator, pipe.reader(), pipe.writer());\n// var result = try conn.query(\"SELECT 1\", struct { u8 });\n```\n\n- `feed(data)` pushes bytes from ClientStream → read buffer, resumes waiting fiber\n- `reader.read()` blocks the fiber (via yield) until data arrives — looks synchronous to caller\n- `writer.write()` queues into buffer; `flushWrite()` sends via ClientStream\n- `reset()` clears buffers on disconnect/reconnect\n- Requires protocol library to accept `anytype` reader/writer (pgz needs 1-line patch on `WriteBuffer.send`)\n\n### LargeBufferPool\n\nFor oversized requests (Content-Length \u003e 32KB) that can't fit in the 256KB shared fiber stack.\nPre-allocated 1MB blocks with O(1) freelist acquire/release. Each block carries an atomic\n**IDLE/BUSY state** — release is idempotent via CAS, preventing double-free from io_uring\nkernel retries or TTL-close vs. CQE-collision paths.\n\n```zig\nconst LargeBufferPool = @import(\"sws\").LargeBufferPool;\n\n// 64 blocks × 1MB = 64MB — built into AsyncServer by default\n// Usage in oversized body path:\nconst buf = self.large_pool.acquire() orelse return error.OutOfLargeBuffers;\n// io_uring READ CQE writes directly to buf.ptr\n// ... process body ...\nself.large_pool.release(buf);\n```\n\n### IO_QUANTUM — Next task fairness\n\n`drainNextTasks` is capped at 64 tasks per event loop iteration (`IO_QUANTUM`). This prevents\ndepth-first starvation: when a handler's `Next.go()` spawns new tasks, they don't preempt\nthe remaining ReadyQueue entries or CQE harvesting. P99 tail latency stays uniform under load.\n\n### HttpRing + HttpClient (Ring B)\n\nIndependent io_uring Ring B for outbound HTTP client. Shares the kernel io-wq thread pool\nvia `IORING_SETUP_ATTACH_WQ`. 
**TinyCache is built into RingB** — same host:port connections\nare automatically reused within the TTL window and evicted by `RingB.tick()`.\n\n```zig\nconst sws = @import(\"sws\");\n\n// Ring B init (attached to server's Ring A io-wq, 1s cache TTL):\nvar ring_b = try sws.HttpRing.init(allocator, io, server.ring.fd, 1000);\ndefer ring_b.deinit();\n\n// HttpClient — cache is automatically managed by RingB:\nvar http_client = try sws.HttpClient.init(allocator, \u0026ring_b);\ntry http_client.start(); // spawn dedicated thread\ndefer http_client.deinit();\n\n// Use from handler:\nconst resp = try http_client.get(\"http://api.example.com/data\");\ndefer resp.deinit();\n\n// POST with body:\nconst resp2 = try http_client.post(\"http://api.example.com/submit\", \"{\\\"key\\\":\\\"val\\\"}\");\n```\n\n#### c-ares async DNS (optional)\n\nBuilt-in `DnsResolver` covers basic needs (A record + TTL cache). For truncated UDP (TC bit → TCP retry) or SRV records, switch to c-ares:\n\n```bash\nsudo apt install libc-ares-dev\n```\n\nAdd to `build.zig`:\n```zig\nexe.linkSystemLibrary(\"cares\");\n```\n\nSwitch DNS backend:\n```zig\nconst HttpCaresDns = sws.HttpCaresDns;\n// ring.dns = HttpCaresDns.init(alloc, ring.rs);\n```\n\n### Fiber\n\nBuilt-in fiber (x86_64 and ARM64 Linux). 
All handler fibers share a **single pre-allocated stack buffer** (stored in `AsyncServer.shared_fiber_stack`, default 256KB) — sequential execution, no per-request stack allocation, zero contention.\n\n\u003e ⚠️ **Do NOT use `std.Io.async()` / `future.await()` in handlers.**\n\u003e\n\u003e Zig's `Future` is a **thread-based** design, not fiber-based:\n\u003e - `async()` → `std.Thread.spawn` + queued to OS thread pool (`Threaded.zig:2112`)\n\u003e - `await()` → `Thread.futexWait` — blocks the **OS thread** (`Threaded.zig:2436`)\n\u003e\n\u003e On the IO thread, blocking means:\n\u003e - io_uring CQE processing stops — no new connections, no reads, no writes\n\u003e - The entire server stalls for the duration of the work\n\u003e\n\u003e ### Why not patch the vtable to support Future on fibers?\n\u003e\n\u003e `future.await()` requires the caller's **stack frame to persist** across suspension:\n\u003e ```\n\u003e var future = io.async(work, .{data});\n\u003e const result = future.await(io);   // fiber yields here — stack must survive\n\u003e ctx.json(200, result);              // resumes here — expects data still intact\n\u003e ```\n\u003e\n\u003e SWS uses a **shared stack** (one 256KB buffer, all fibers reuse it). When a fiber\n\u003e yields in `await()`, the next fiber's execution overwrites that same memory. 
The\n\u003e resumed fiber's stack frame is corrupted.\n\u003e\n\u003e Switching to per-fiber stacks would fix this, but at a steep memory cost:\n\u003e\n\u003e | Concurrent requests | Per-fiber stack | Shared stack |\n\u003e |---|---:|---:|\n\u003e | 1K | 16 MB | 256 KB |\n\u003e | 20K | 320 MB | 256 KB |\n\u003e | 200K | 3.2 GB | 256 KB |\n\u003e | 1M | 16 GB | 256 KB |\n\u003e\n\u003e *(per-fiber stack at 16KB — the practical minimum for HTTP handlers)*\n\u003e\n\u003e At a typical production load of 200K concurrent requests, shared stack saves ~3GB.\n\u003e This directly translates to lower memory pressure and better operational stability.\n\u003e\n\u003e This is the fundamental tradeoff: **Future API semantics vs. 1M-connection memory model**.\n\u003e SWS chooses the latter. All async is done via `Next.go`/`Next.submit` with callbacks\n\u003e instead of `await`-style suspension.\n\u003e - Fibers are cooperative; OS threads are preemptive. This breaks the fiber model.\n\u003e\n\u003e | Zig pattern | SWS replacement |\n\u003e |---|---|\n\u003e | `io.async(cpuWork)` + `future.await(io)` | `Next.submit(Ctx, ctx, exec)` + `DeferredResponse` |\n\u003e | `io.async(ioWork)` + `future.await(io)` | `Next.go(Ctx, ctx, exec)` (fiber on IO thread) |\n\u003e\n\u003e **Pattern**:\n\u003e ```zig\n\u003e // ❌ Don't do this in handler — blocks IO thread:\n\u003e // var future = io.async(heavyWork, .{data});\n\u003e // const result = future.await(io);\n\u003e\n\u003e // ✅ Do this instead — IO thread never blocks:\n\u003e fn myHandler(allocator: Allocator, ctx: *Context) anyerror!void {\n\u003e     ctx.deferred = true;\n\u003e     const resp = try allocator.create(DeferredResponse);\n\u003e     resp.* = .{ .server = server, .conn_id = ctx.conn_id, .allocator = allocator };\n\u003e     Next.submit(Ctx, .{ .resp = resp, .data = data }, exec);\n\u003e }\n\u003e ```\n\u003e\n\u003e See `Next.submit` section above for the full exec/complete callback API.\n\n### Routing / Middleware / 
WebSocket / Context\n\nSee `example/` and `src/example.zig`.\n\n## Memory Model (1M connections target)\n\n| Component | Size | Notes |\n|-----------|------|-------|\n| StackSlot (per connection) | 384 bytes | 5 cache-line-aligned sub-structures |\n| StackPool (1M slots) | ~384 MB | contiguous, warmup-touched |\n| Connection hashmap (1M entries) | ~160 MB | AutoHashMap(u64, Connection) |\n| Freelist + live list | ~8 MB | 2 × [1M]u32, O(1) acquire/release |\n| Read buffer (idle) | 0 bytes | io_uring provided buffers, returned on idle |\n| Slab for io_uring reads | 64 MB | 16384 × 4KB blocks, kernel-recycled |\n| Tiered write pool | dynamic | 8 size classes (512B–64KB), freelist-recycled |\n| Shared fiber stack | 256 KB | All fibers share one pre-allocated stack |\n| LargeBufferPool | 64 MB | 64 × 1MB blocks for oversized requests |\n| **1M idle connections** | **~680 MB** | No per-thread stack overhead |\n\nLike [greatws](https://github.com/antlabs/greatws), idle connections consume zero buffer memory.\n\n### Cache-line layout rationale\n\nThe 384-byte StackSlot is split across independent cache lines:\n\n- **line1 (64B):** fd, gen_id, state, write_offset — only this line is touched during CQE dispatch\n- **line2 (64B):** conn_id, last_active_ms, active_list_pos — only touched during TTL scanning\n- **line3 (64B):** fiber_context, large_buf_ptr — async anchors (Worker Pool / oversized bodies)\n- **line4 (128B):** writev_in_flight, response_buf, write_iovs, WS queue — write path, not in the hot path\n- **line5 (64B):** sentinel + workspace union — protocol parser scratch, zero extra allocation\n\nThe IO loop's hottest path (CQE dispatch → slot lookup) only touches line1. TTL scanning only touches line2. No cache-line ping-pong between unrelated operations.\n\n### WebSocket payload copying\n\nWS handlers may offload frame data asynchronously, so frame payloads must remain valid after the handler returns. 
**WS frame payloads are always duped — never zero-copy.**\n\n**Performance impact (100B text frame):**\n\n| Operation | Cost | Notes |\n|-----------|------|-------|\n| memcpy(100B) | ~10ns | Copy frame payload |\n| GeneralPurposeAllocator alloc/free | ~100ns | One alloc+free per frame |\n\n**~110ns overhead per frame**. 1M connections, 1% active, 10 msg/s each = 100K msg/s:\n- CPU: 100K × 110ns = **11ms/s = 1.1% of one core**\n\n## TLS / HTTPS / WSS\n\nTLS is powered by the pure-Zig [tls.zig](lib/tls.zig) library (TLS 1.3 server). Enable at build time:\n\n```bash\nzig build -Denable-tls=true\n```\n\n### Server TLS\n\n```zig\nvar server = try sws.AsyncServer.init(alloc, io, \"0.0.0.0:9443\", null, 64,\n    .{ .cert_path = \"/etc/ssl/fullchain.pem\", .key_path = \"/etc/ssl/privkey.pem\" }\n);\n// Pass null to disable TLS\n```\n\n### Client TLS\n\n```zig\nvar client = try sws.HttpClient.init(alloc, \u0026ring_b);\ntry client.enableTls();\nconst resp = try client.get(\"https://api.example.com/data\");\n```\n\n**Certificate formats:** PEM (PKCS#8 private key), ECDSA P-256/P-384, RSA 2048/3072/4096. Let's Encrypt compatible.\n\n## Config\n\n| key | default | description |\n|-----|---------|-------------|\n| `fiber_stack_size_kb` | 256 | fiber stack size (KB). 0 = 256 |\n| `io_cpu` | null | pin IO thread to CPU core |\n| `idle_timeout_ms` | 30000 | close idle connections |\n| `write_timeout_ms` | 5000 | close stuck-write connections |\n| `buffer_size` | 4096 | io_uring buffer block size |\n| `buffer_pool_size` | 16384 | number of buffer blocks |\n| `max_fixed_files` | 65535 | registered fixed-file slots (beyond this uses plain-fd I/O) |\n\n## invokeOnIoThread\n\nCross-thread safe callback to IO thread. 
Under the hood it uses `rs.invoke` (a lock-free, CAS-based linked list), which is drained automatically in `drainTick`.\n\n```zig\nserver.invokeOnIoThread(MyCtx, ctx, struct {\n    fn run(allocator: Allocator, c: *MyCtx) void {\n        // Runs on IO thread — safe to access ring/registry\n        c.client.write(\"PUB ...\");\n        allocator.free(c.data);\n    }\n}.run);\n```\n\n## Advanced: io_uring-native DB Pool\n\nWire your DB driver's TCP fd into io_uring directly:\n\n```\nhandler (fiber on IO thread):\n  └── db.query(sql)\n        └── io_uring write(fd, query) → CQE → io_uring read(fd) → CQE → parse\n              → ctx.json(200, result)\n```\n\nFor connection pooling, maintain a pool of connected TCP fds in a ring buffer. The handler pops an fd, issues `write(sql)` + `read()` via io_uring, parses the result, and pushes the fd back.\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffndome%2Fsws","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffndome%2Fsws","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffndome%2Fsws/lists"}