https://github.com/sitetester/auto-batching-proxy

It will automatically batch inference requests from multiple independent users together in a single batch request for efficiency, so that for users the interface looks like individual requests, but internally it is handled as a batch request

proxy rest rocket tei text-embeddings-inference

Host: GitHub
URL: https://github.com/sitetester/auto-batching-proxy
Owner: sitetester
Created: 2025-08-23T14:52:34.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-08-28T15:28:59.000Z (11 months ago)
Last Synced: 2025-09-01T00:23:30.142Z (11 months ago)
Topics: proxy, rest, rocket, tei, text-embeddings-inference
Language: Rust
Homepage:
Size: 18 MB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Auto Batching Proxy

It automatically batches inference requests from independent users into a single batch request for efficiency. To each user the interface looks like an individual request, but internally the requests are handled as a batch.
Essentially, it provides a REST API wrapper around an inference service such as https://github.com/huggingface/text-embeddings-inference
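
The idea of serving individual callers from one batched call can be sketched as follows. This is a minimal illustration, not the project's actual code: `PendingRequest`, `merge_requests` and `split_embeddings` are hypothetical names, and it assumes the backend returns exactly one embedding per input, in input order.

```rust
/// Hypothetical pending request: the inputs one caller sent in a single call.
struct PendingRequest {
    inputs: Vec<String>,
}

/// Flatten the inputs of all pending requests into one batch, remembering
/// how many inputs each caller contributed.
fn merge_requests(requests: &[PendingRequest]) -> (Vec<String>, Vec<usize>) {
    let mut batch = Vec::new();
    let mut counts = Vec::with_capacity(requests.len());
    for request in requests {
        counts.push(request.inputs.len());
        batch.extend(request.inputs.iter().cloned());
    }
    (batch, counts)
}

/// Split the batched embeddings back into one group per caller, in the same
/// order the requests were merged. Returns `None` if the number of
/// embeddings does not match the number of inputs that were sent.
fn split_embeddings(
    embeddings: Vec<Vec<f32>>,
    counts: &[usize],
) -> Option<Vec<Vec<Vec<f32>>>> {
    if embeddings.len() != counts.iter().sum::<usize>() {
        return None;
    }
    let mut remaining = embeddings.into_iter();
    Some(
        counts
            .iter()
            .map(|&n| remaining.by_ref().take(n).collect())
            .collect(),
    )
}
```

Each caller then receives only its own group of embeddings, so the batching stays invisible from the outside.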

The proxy server is configured with the following parameters:

- _Max Wait Time_: the maximum time a user request can wait for other requests to accumulate into a batch
- _Max Batch Size_: the maximum number of requests that can be accumulated in a batch
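
How the two parameters interact can be sketched with a standard-library channel. This is only an illustration under assumptions: `collect_batch` is a hypothetical helper, the wait timer is assumed to start when the first request of a batch arrives, and the actual proxy (built on Rocket) may well use async primitives instead.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Block until a first item arrives, then keep collecting until either
/// `max_batch_size` items are gathered or `max_wait_time` has elapsed
/// since the first item. Returns an empty vector if the channel is closed.
fn collect_batch<T>(
    rx: &Receiver<T>,
    max_batch_size: usize,
    max_wait_time: Duration,
) -> Vec<T> {
    let mut batch = Vec::new();
    // Wait indefinitely for the first request; the timer starts with it.
    match rx.recv() {
        Ok(first) => batch.push(first),
        Err(_) => return batch,
    }
    let deadline = Instant::now() + max_wait_time;
    while batch.len() < max_batch_size {
        // Time left before the first request has waited long enough.
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(item) => batch.push(item),
            // Deadline reached or all senders gone: flush what we have.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    batch
}
```

Under this scheme a full batch is sent immediately, while a lone request is delayed by at most the configured wait time.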

## Setup Inference Service
First, try running the inference service in a container with `--model-id nomic-ai/nomic-embed-text-v1.5`:
```
docker run --rm -it -p 8080:80 --pull always \
ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
--model-id nomic-ai/nomic-embed-text-v1.5
```
If it fails to start, try one of the alternatives. Currently, the code works with
`--model-id sentence-transformers/all-MiniLM-L6-v2` and `--model-id sentence-transformers/all-mpnet-base-v2`.
Check [/screenshots](./screenshots) for some of the models that were tried.

```
docker run --rm -it -p 8080:80 --pull always \
ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
--model-id sentence-transformers/all-MiniLM-L6-v2
```
Note: [Backend does not support a batch size > 8](screenshots/run_model_status/max_batch_size.png),
but the proxy respects this limit (as well as the max inputs limit, which is 32 for `all-MiniLM-L6-v2`) and will not send requests larger than the supported batch size.
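
One way a proxy could stay within such a backend limit is to split an oversized batch into smaller sub-batches. The sketch below is an assumption for illustration only: `split_for_backend` and `backend_max` are hypothetical names, and the project may instead cap the batch size up front.

```rust
/// Split a batch of inputs into sub-batches no larger than `backend_max`,
/// a hypothetical name for the limit reported by the inference backend.
fn split_for_backend(inputs: &[String], backend_max: usize) -> Vec<Vec<String>> {
    // `chunks` panics on a chunk size of zero, so treat 0 as 1.
    inputs
        .chunks(backend_max.max(1))
        .map(|chunk| chunk.to_vec())
        .collect()
}
```

With a limit of 8, a batch of 20 inputs would be sent as three backend calls of 8, 8 and 4 inputs.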

## Setup Proxy Service
- either run `cargo run` at the root of the project, which launches Rocket [with default configuration params](./screenshots/cargo/cargo_run.png)
- or run it [with custom params](./screenshots/cargo/cargo_run_with_params.png), for example
```
RUST_LOG=INFO cargo run -- --max-batch-size 50 --max-wait-time-ms 3000
```
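
For illustration, the two flags from the command above could be read into a config struct roughly like this. It is a hand-rolled, standard-library-only sketch: `ProxyConfig` and `parse_config` are hypothetical names, the defaults are passed in by the caller because the project's real defaults are not stated here, and the project itself may use an argument-parsing crate instead.

```rust
/// Hypothetical configuration holding the two proxy parameters.
#[derive(Debug, PartialEq)]
struct ProxyConfig {
    max_batch_size: usize,
    max_wait_time_ms: u64,
}

/// Parse `--max-batch-size N` and `--max-wait-time-ms N` pairs from `args`,
/// starting from caller-supplied defaults. Unknown flags, missing values and
/// non-numeric values are reported as errors.
fn parse_config(args: &[&str], defaults: ProxyConfig) -> Result<ProxyConfig, String> {
    let mut config = defaults;
    let mut iter = args.iter();
    while let Some(flag) = iter.next() {
        let value = iter
            .next()
            .ok_or_else(|| format!("missing value for {flag}"))?;
        match *flag {
            "--max-batch-size" => {
                config.max_batch_size = value.parse().map_err(|e| format!("{flag}: {e}"))?;
            }
            "--max-wait-time-ms" => {
                config.max_wait_time_ms = value.parse().map_err(|e| format!("{flag}: {e}"))?;
            }
            other => return Err(format!("unknown flag {other}")),
        }
    }
    Ok(config)
}
```

Rejecting unknown flags and malformed numbers at startup keeps a typo in the command line from silently falling back to a default.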

**[Unit tests](https://doc.rust-lang.org/book/ch11-03-test-organization.html#unit-tests)**
Relevant unit tests are provided inside the source files under `/src`.

**[Integration tests](https://doc.rust-lang.org/book/ch11-03-test-organization.html#integration-tests)**
Check the `/tests` folder; the code is covered by tests for various scenarios.

Run all tests via `cargo test`. Currently, the tests are verified to pass against
`--model-id sentence-transformers/all-MiniLM-L6-v2` and `--model-id sentence-transformers/all-mpnet-base-v2`.
They also explain how and why each part of the code was written, and for which use case.

Use the following simple curl commands for quick testing:
- for inference
```
curl -X POST http://localhost:8080/embed \
-H "Content-Type: application/json" \
-d '{"inputs": ["Hello world"]}'
```
- for proxy
```
curl -X POST http://localhost:3000/embed \
-H "Content-Type: application/json" \
-d '{"inputs": ["Hello", "World"]}'
```
To verify that the proxy handles multiple concurrent requests:
```
cd scripts
./proxy_concurrent_calls.sh
```

**Benchmark test results:**
The following output is taken from
```
$ RUSTFLAGS="-A dead_code" cargo test test_compare_single_input_inference_service_vs_auto_batching_proxy_with_x_separate_requests -- --nocapture
```
- `-A` is short for `--allow` (it suppresses the warnings generated for functions reported as unused, even though they are actually used in tests)
- [--nocapture](https://doc.rust-lang.org/cargo/commands/cargo-test.html#display-options) shows the output printed by the tests instead of capturing it

[Full output:](./screenshots/timing_summary_full.png)
![timing_summary.png](screenshots/timing_summary.png)
