{"id":30236907,"url":"https://github.com/rustworthy/auto-batching-proxy","last_synced_at":"2025-08-15T01:10:27.989Z","repository":{"id":309582297,"uuid":"1036546410","full_name":"rustworthy/auto-batching-proxy","owner":"rustworthy","description":"Auto-batching inference requests for better resource utilization. Axum-powered REST API wrapping text-embeddings-inference service","archived":false,"fork":false,"pushed_at":"2025-08-12T16:40:37.000Z","size":40,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-12T18:32:49.936Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rustworthy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE-APACHE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-08-12T08:29:25.000Z","updated_at":"2025-08-12T16:40:41.000Z","dependencies_parsed_at":"2025-08-12T18:32:55.847Z","dependency_job_id":null,"html_url":"https://github.com/rustworthy/auto-batching-proxy","commit_stats":null,"previous_names":["rustworthy/auto-batching-proxy"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/rustworthy/auto-batching-proxy","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rustworthy%2Fauto-batching-proxy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rustworthy%2Fauto-batching-proxy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rustworthy%2Fauto-batching-proxy/releases","ma
nifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rustworthy%2Fauto-batching-proxy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rustworthy","download_url":"https://codeload.github.com/rustworthy/auto-batching-proxy/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rustworthy%2Fauto-batching-proxy/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270507037,"owners_count":24596903,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-14T02:00:10.309Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-08-15T01:10:22.766Z","updated_at":"2025-08-15T01:10:27.979Z","avatar_url":"https://github.com/rustworthy.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Auto-batching proxy for inference requests\n\n## Task Description\n\nThe task was to create an auto-batching proxy service that would serve as a wrapper\nover another inference service (see `Makefile` for details on how we are launching that\nservice for development purposes). Internally, that batching proxy will 🥁 batch\nindividual embedding requests, while for the end-user the API is the same, as if\nthey were the only client of the inference service. 
This batching makes requests\nto the upstream service more efficient (and helps reduce costs).\n\nIn my hometown in the 1990s-2000s, there used to be drivers hanging around the railway\nstation who - if you missed your train or just did not bother to buy a ticket - would\noffer you a ride to another town, but a shared ride. They would gather (batch)\na few fellas like myself and then start the ride. But there were rules - they\ncould not take more than N people (depending on the vehicle size) and the person\nwho came first could not wait for too long (like no longer than an hour normally).\n\nIn a similar fashion, in our batching service here we have `MAX_BATCH_SIZE` and\n`MAX_WAIT_TIME` (in millis) parameters configurable via the environment (see other configurable\noptions in `.env.example`).\n\n## Solution\n\n### Stack\n\nOur REST API wrapper is powered by the axum web framework, which is our framework\nof choice. We have not added OpenAPI definitions to this project, but if we need\nto, we will integrate the `utoipa` crate. All other crates we depend on are pretty\nstandard.\n\n### Interface\n\nOur REST API wrapper currently provides a single\nendpoint `POST /embed` and so trying it out is super straightforward:\n\n```console\ncurl 127.0.0.1:8081/embed -X POST -d '{\"inputs\":[\"What is Vector Search?\", \"Hello, world!\"]}' -H 'Content-Type: application/json'\n```\n\nNotice how it looks exactly the same (but for the port number) as if you were querying\nthe [upstream service][4] directly:\n\n```console\ncurl 127.0.0.1:8080/embed -X POST -d '{\"inputs\":[\"What is Vector Search?\", \"Hello, world!\"]}' -H 'Content-Type: application/json'\n```\n\n### Handler\n\nOn the app's start-up, we are launching a task with the web-server and a dedicated\ntask for our inference service worker - which is effectively an actor responsible\nfor inference and hiding the implementation details from the axum handler to separate\nconcerns. 
Instances of the handler (threads processing end-users' requests) are\ncommunicating with that worker using channels and messages. Once a handler receives\na request, it forwards it to the worker and awaits the worker's response (embeddings or\nan error) via a oneshot channel, and once it gets the response, it sends its\nJSON representation to the end-user.\n\n### Worker\n\nThe worker just listens for messages from the axum handlers. The worker keeps\nsome state: it has a message queue with a capacity as per `MAX_BATCH_SIZE`\nand a timeout as per `MAX_WAIT_TIME` - whichever limit is hit first makes the worker\nsend the batch to the upstream service. If an error is received from the\nupstream inference service, it gets \"broadcast\" to the handlers. If the embeddings\nare received, the worker will make sure not to send the entire array to each\nhandler, but rather only the segment that corresponds to the handler's inputs. We\nare relying here on the fact that the upstream service returns an array of embeddings\nin which the embedding at index N is the result for the query at index N in the\ninputs container in our request.\n\nTo give a concrete example, imagine the batch size is set to `2`, and the first\nrequest contains inputs array `[\"hello\", \"world\"]` while the second request has\nonly one item `[\"bye\"]` - the worker will flatten these two into one array and\nsend it to the upstream service as `[\"hello\", \"world\", \"bye\"]`. The response our worker\ngets will have the following shape:\n`[[-0.045, ... 
, -0.123144], [0.412, ..., -0.412], [0.1241, ..., 0.123]]`.\nThe worker still \"remembers\" at that point that it needs to send `2` embeddings\nto the first handler and `1` embedding to the second handler instance.\n\nAlso - replaying the example above - if the batch size is set to `2`, the worker\nhas received a message from one handler `[\"hello\", \"world\"]` and the time-out\n(configured via `MAX_WAIT_TIME`) is reached, the worker will\nsend `[\"hello\", \"world\"]` to the upstream service.\n\nEach time the batch gets \"flushed\", the timeout gets unset and the queue gets\nemptied.\n\n## Demo\n\nNB: make sure you have [`GNU Make`][2] and [`docker`][3] installed.\n\nPopulate your very own local `.env` file with:\n\n```console\nmake dotenv\n```\n\nYou can now launch the auto-batching proxy together with the inference service\nwith a single command:\n\n```console\ndocker compose up --build\n```\n\nThe command above will build our proxy app, launch the upstream inference service\nfirst, make sure it is ready, and then launch the proxy app. 
The initial image build\ntakes some time plus the model needs some warm-up, so the \"cold\" start can take\nup to a few minutes.\n\nIf you tweak the `MAX_WAIT_TIME` and `MAX_BATCH_SIZE` parameters in your `.env`\nfile, make sure to restart the containers.\n\n## Benchmarking\n\nWe've set `MAX_WAIT_TIME` to `1000` (1 second) and `MAX_BATCH_SIZE` to `8`\n(the upstream service's text embedding router batch cap), and set `RUST_LOG`\nto \"auto_batching_proxy=error,axum=error\".\n\nWe then launched the services as described [above](#demo) and used the [`oha`][5]\nutility to generate some load.\n\n### With proxy\n\nThe command used (see the `load` target in [`Makefile`](./Makefile)):\n\n```console\noha -c 200 -z 30s --latency-correction -m POST -d '{\"inputs\":[\"What is Vector Search?\", \"Hello, world!\"]}' -H 'Content-Type: application/json' http://localhost:8081/embed\n```\n\nWhich gave the following results:\n\n```\n  Success rate: 100.00%\n  Total:        30.0039 sec\n  Slowest:      2.0037 sec\n  Fastest:      0.2129 sec\n  Average:      1.6135 sec\n  Requests/sec: 126.6503\n\n  Total data:   62.68 MiB\n  Size/request: 17.83 KiB\n  Size/sec:     2.09 MiB\n```\n\n### Without proxy\n\nWe've used the same utility on the same hardware and the same max batch size and max wait,\nbut specified the upstream service's port in the command for direct communication.\nThe command used (note the port number and see how we are mapping to this host port\nin our [`compose`](./compose.yaml) and also take a look at the `load/noproxy`\ntarget in [`Makefile`](./Makefile)):\n\n```console\noha -c 200 -z 30s --latency-correction -m POST -d '{\"inputs\":[\"What is Vector Search?\", \"Hello, world!\"]}' -H 'Content-Type: application/json' http://localhost:8080/embed\n```\n\n```\n  Success rate: 100.00%\n  Total:        30.0047 sec\n  Slowest:      2.1063 sec\n  Fastest:      0.0452 sec\n  Average:      1.6371 sec\n  Requests/sec: 124.8803\n\n  Total data:   64.19 MiB\n  Size/request: 18.53 KiB\n  Size/sec:     
2.14 MiB\n```\n\n### Observations\n\nEach report above is an example from a single test run. In general - upon a few\nload test runs - we observe very similar requests-per-second figures with and without the proxy.\nThe slowest requests are also pretty close to each other, while the fastest request\nwithout the proxy is 2.5x faster (~30-100ms vs ~100-200ms), i.e. our wrapper _does_\nintroduce some overhead. Presumably, we are compensating for this with gains\nelsewhere - in resource savings on the upstream service side and reduced costs\nfor each individual user.\n\nSubscribing to debug and trace events and writing those to stdout slows our\napplication down (~20% bandwidth reduction), so we ended up testing at the error\nlog level.\n\nWe also tried loading our auto-batching proxy with `MAX_BATCH_SIZE` set to `1`\n(and all other parameters the same), which gave us results close to those without\nthe proxy. Here are stats from one of the runs:\n\n```\n  Success rate: 100.00%\n  Total:        30.0061 sec\n  Slowest:      2.0051 sec\n  Fastest:      0.0672 sec\n  Average:      1.5677 sec\n  Requests/sec: 129.8068\n\n  Total data:   65.14 MiB\n  Size/request: 18.05 KiB\n  Size/sec:     2.17 MiB\n```\n\nWhich checks out: with the current implementation, the 8th client in the proxied\nscenario with 8 messages per batch will wait till the preceding 7 clients get\ntheir slices of the upstream inference service response. 
We could play around with this\nand try to improve the implementation to reduce the proxy overhead.\n\n## Dev Setup\n\nMake sure you have [`cargo`][1], [`GNU Make`][2], and [`docker`][3] installed,\nand hit:\n\n```console\nmake setup\n```\n\nYou should now be able to start the back-end in watch mode with:\n\n```console\nmake watch\n```\n\nYou can send requests with:\n\n```console\ncurl 127.0.0.1:8081/embed -X POST -d '{\"inputs\":[\"What is Vector Search?\", \"Hello, world!\"]}' -H 'Content-Type: application/json'\n```\n\nYou can also tweak configurations in the generated `.env` file (it gets populated\nvia `make setup`); the dev-server will restart automatically (if you are using\nthe `make watch` command as described above).\n\n\u003c!-- -------------------------------- LINKS -------------------------------- --\u003e\n[1]: https://doc.rust-lang.org/cargo/getting-started/installation.html\n[2]: https://www.gnu.org/software/make/\n[3]: https://docs.docker.com/engine/install/\n[4]: https://github.com/huggingface/text-embeddings-inference\n[5]: https://github.com/hatoo/oha?tab=readme-ov-file#installation\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frustworthy%2Fauto-batching-proxy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frustworthy%2Fauto-batching-proxy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frustworthy%2Fauto-batching-proxy/lists"}