https://github.com/epwalsh/batched-fn
# batched-fn


Rust server plugin for deploying deep learning models with batched prediction.








Deep learning models are usually implemented to make efficient use of a GPU by batching inputs together
in "mini-batches". However, applications serving these models often receive requests one-by-one.
So a conventional single- or multi-threaded server will under-utilize the GPU and suffer latency that grows
linearly with the volume of requests.

`batched-fn` is a drop-in solution for deep learning webservers that queues individual requests and provides them as a batch
to your model. It can be added to any application with minimal refactoring simply by inserting the [`batched_fn`](https://docs.rs/batched-fn/latest/batched_fn/macro.batched_fn.html)
macro into the function that runs requests through the model.

## Features

- 🚀 Easy to use: drop the `batched_fn!` macro into existing code.
- 🔥 Lightweight and fast: queue system implemented on top of the blazingly fast [flume crate](https://github.com/zesterer/flume).
- 🙌 Easy to tune: simply adjust [`max_delay`](https://docs.rs/batched-fn/latest/batched_fn/macro.batched_fn.html#config) and [`max_batch_size`](https://docs.rs/batched-fn/latest/batched_fn/macro.batched_fn.html#config).
- 🛑 [Back pressure](https://medium.com/@jayphelps/backpressure-explained-the-flow-of-data-through-software-2350b3e77ce7) mechanism included:
just set [`channel_cap`](https://docs.rs/batched-fn/latest/batched_fn/macro.batched_fn.html#config) and handle
[`Error::Full`](https://docs.rs/batched-fn/latest/batched_fn/enum.Error.html#variant.Full) by returning a 503 from your webserver.
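The back-pressure idea can be illustrated with a plain bounded channel from the standard library. This is a sketch of the concept only, not the crate's internals, and `try_enqueue` is a hypothetical helper:

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

// Hypothetical helper: try to enqueue a request, where `false` means the
// queue is at capacity and the caller should shed load (e.g. respond 503).
fn try_enqueue(tx: &SyncSender<u32>, input: u32) -> bool {
    match tx.try_send(input) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => false,
    }
}

fn main() {
    // A bounded channel of capacity 2 plays the role of `channel_cap`.
    let (tx, _rx) = sync_channel::<u32>(2);
    assert!(try_enqueue(&tx, 1));
    assert!(try_enqueue(&tx, 2));
    // A third request finds the queue full -- the analogue of `Error::Full`.
    assert!(!try_enqueue(&tx, 3));
}
```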

## Examples

Suppose you have a model API that looks like this:

```rust
// `Batch` could be anything that implements the `batched_fn::Batch` trait.
type Batch<T> = Vec<T>;

#[derive(Debug)]
struct Input {
    // ...
}

#[derive(Debug)]
struct Output {
    // ...
}

struct Model {
    // ...
}

impl Model {
    fn predict(&self, batch: Batch<Input>) -> Batch<Output> {
        // ...
    }

    fn load() -> Self {
        // ...
    }
}
```

Without `batched-fn` a webserver route would need to call `Model::predict` on each
individual input, resulting in a bottleneck from under-utilizing the GPU:

```rust
use once_cell::sync::Lazy;

static MODEL: Lazy<Model> = Lazy::new(Model::load);

fn predict_for_http_request(input: Input) -> Output {
    let mut batched_input = Batch::with_capacity(1);
    batched_input.push(input);
    MODEL.predict(batched_input).pop().unwrap()
}
```

But by dropping the [`batched_fn`](https://docs.rs/batched-fn/latest/batched_fn/macro.batched_fn.html) macro into your code you automatically get batched
inference behind the scenes without changing the one-to-one relationship between inputs and
outputs:

```rust
async fn predict_for_http_request(input: Input) -> Output {
    let batch_predict = batched_fn! {
        handler = |batch: Batch<Input>, model: &Model| -> Batch<Output> {
            model.predict(batch)
        };
        config = {
            max_batch_size: 16,
            max_delay: 50,
        };
        context = {
            model: Model::load(),
        };
    };
    batch_predict(input).await.unwrap()
}
```

❗️ *Note that the `predict_for_http_request` function now has to be `async`.*

Here we set the [`max_batch_size`](https://docs.rs/batched-fn/latest/batched_fn/macro.batched_fn.html#config) to 16 and [`max_delay`](https://docs.rs/batched-fn/latest/batched_fn/macro.batched_fn.html#config)
to 50 milliseconds. This means the batched function will wait at most 50 milliseconds after receiving a single
input to fill a batch of 16. If 15 more inputs do not arrive within those 50 milliseconds,
the partial batch will be run as-is.
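This wait-for-a-batch behavior can be sketched with standard-library channels. The snippet below illustrates the semantics only; it is not the crate's actual implementation, and `collect_batch` is a hypothetical helper:

```rust
use std::sync::mpsc::{channel, Receiver};
use std::time::{Duration, Instant};

// Collect up to `max_batch_size` inputs, waiting at most `max_delay`
// after the first input arrives; whatever has accumulated by then is
// the batch.
fn collect_batch(rx: &Receiver<u32>, max_batch_size: usize, max_delay: Duration) -> Vec<u32> {
    let mut batch = vec![rx.recv().expect("all senders dropped")];
    let deadline = Instant::now() + max_delay;
    while batch.len() < max_batch_size {
        let now = Instant::now();
        if now >= deadline {
            break;
        }
        match rx.recv_timeout(deadline - now) {
            Ok(input) => batch.push(input),
            // Timed out: run the partial batch as-is.
            Err(_) => break,
        }
    }
    batch
}

fn main() {
    let (tx, rx) = channel();
    // Only 3 inputs arrive, so with `max_batch_size` = 16 the 50 ms
    // timer expires and a partial batch of 3 is returned.
    for i in 0..3 {
        tx.send(i).unwrap();
    }
    let batch = collect_batch(&rx, 16, Duration::from_millis(50));
    assert_eq!(batch, vec![0, 1, 2]);
}
```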

## Tuning max batch size and max delay

The optimal batch size and delay will depend on the specifics of your use case, such as how large a batch you can fit in memory
(typically on the order of 8, 16, 32, or 64 for a deep learning model) and how long a delay you can afford.
In general you want to set `max_batch_size` as high as you can, assuming the total processing time for `N` examples is minimized
with a batch size of `N`, and keep `max_delay` small relative to the time it takes your
handler function to process a batch.
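To see why larger batches usually pay off, consider a simple cost model (the numbers here are assumed for illustration, not measurements) in which a forward pass has a fixed per-call overhead plus a per-example cost:

```rust
// Illustrative cost model: one forward pass over a batch of `batch_size`
// examples costs `fixed_ms + batch_size * cost_ms` milliseconds, so the
// amortized per-example time falls as the batch grows.
fn per_example_ms(batch_size: u64, fixed_ms: f64, cost_ms: f64) -> f64 {
    (fixed_ms + batch_size as f64 * cost_ms) / batch_size as f64
}

fn main() {
    // With an assumed 8 ms of fixed overhead and 1 ms per example:
    assert_eq!(per_example_ms(1, 8.0, 1.0), 9.0); // batch of 1
    assert_eq!(per_example_ms(16, 8.0, 1.0), 1.5); // batch of 16
    assert_eq!(per_example_ms(64, 8.0, 1.0), 1.125); // batch of 64
}
```

Under this model the overhead is amortized quickly, which is why batch sizes beyond some point (here, past a few dozen) yield diminishing returns while still costing memory and delay.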

## Implementation details

When the `batched_fn` macro is invoked it spawns a new thread where the
[`handler`](https://docs.rs/batched-fn/latest/batched_fn/macro.batched_fn.html#handler) will
run. Within that thread, every object specified in the [`context`](https://docs.rs/batched-fn/latest/batched_fn/macro.batched_fn.html#context)
is initialized and then passed by reference to the handler each time it is called.

The object returned by the macro is just a closure that sends a single input and a callback
through an asynchronous channel to the handler thread. When the handler finishes
running a batch, it invokes the callback corresponding to each input with that input's output,
which triggers the closure to wake up and return the output.
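A minimal sketch of this request/callback round trip, using standard-library channels in place of flume, a fixed batch size of 2, and a stand-in `predict` that just doubles its inputs (all assumptions for the sake of the sketch):

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

type Input = u32;
type Output = u32;

// Stand-in "model": doubles every input in the batch.
fn predict(batch: Vec<Input>) -> Vec<Output> {
    batch.into_iter().map(|x| x * 2).collect()
}

fn main() {
    // Each request carries its input plus a one-shot "callback" sender on
    // which the handler thread returns that input's output.
    let (tx, rx) = channel::<(Input, Sender<Output>)>();

    // Stand-in for the handler thread the macro spawns; batch size is
    // fixed at 2 to keep the sketch short.
    let handler = thread::spawn(move || {
        while let Ok(first) = rx.recv() {
            let mut requests = vec![first];
            if let Ok(next) = rx.recv() {
                requests.push(next);
            }
            let (inputs, callbacks): (Vec<_>, Vec<_>) = requests.into_iter().unzip();
            // Run the whole batch, then route each output back to its caller.
            for (output, callback) in predict(inputs).into_iter().zip(callbacks) {
                callback.send(output).unwrap();
            }
        }
    });

    // Two "web requests" arrive; each gets its own output back.
    let (cb_a, out_a) = channel();
    let (cb_b, out_b) = channel();
    tx.send((21, cb_a)).unwrap();
    tx.send((4, cb_b)).unwrap();
    assert_eq!(out_a.recv().unwrap(), 42);
    assert_eq!(out_b.recv().unwrap(), 8);

    // Dropping the request sender lets the handler thread exit.
    drop(tx);
    handler.join().unwrap();
}
```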