https://github.com/leozqin/hops

# HOPS
Heterogenous Ollama Proxy Server (styled as HOPS or `hops`) is a load-balancing reverse proxy server that enables you to address a fleet of diverse, heterogeneous Ollama instances as a single one.

The benefit of this approach is that you can scale inference throughput using any cheap consumer-grade hardware: you don't need to route inference to different fleets based on model requirements, nor acquire many copies of the same expensive GPU.

Instead, groups of like-minded individuals can pool their compute and serve inference toward a shared goal, or small institutions can build clusters from GPUs they already have on hand.

HOPS is therefore a means of scaling inference throughput horizontally, not vertically. Simply provision each Ollama instance with the models it can safely run, and HOPS will do the rest!

# How It Works / Feature Set

When you request a model inference, HOPS transparently proxies the request to a server that has the model pulled. If more than one instance supports the model, HOPS will distribute requests among all the instances that support that model.

Currently, load balancing is done by random selection, but planned strategies include round-robin and a memory-aware dynamic mode (prioritizing Ollama instances that are likely to already have the model loaded in memory, at the cost of additional metadata queries).
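To make the routing concrete, here is a minimal sketch of model-aware random selection. The registry structure and host addresses are illustrative assumptions, not HOPS internals; a real implementation would populate the registry by querying each instance's `GET /api/tags` endpoint.

```python
import random

# Hypothetical registry mapping each backend Ollama host to the models it
# has pulled. A real proxy would build this by asking each host for its tags.
HOSTS = {
    "http://10.0.0.1:11434": {"llama3.2", "nomic-embed-text"},
    "http://10.0.0.2:11434": {"llama3.2"},
    "http://10.0.0.3:11434": {"mistral"},
}

def pick_host(model: str) -> str:
    """Return a backend that has `model` pulled, chosen at random when
    more than one instance supports it."""
    candidates = [host for host, models in HOSTS.items() if model in models]
    if not candidates:
        raise LookupError(f"no known host serves model {model!r}")
    return random.choice(candidates)

print(pick_host("llama3.2"))  # one of the first two hosts, chosen at random
```

Round-robin would replace `random.choice` with a rotating per-model index, while the memory-aware mode would additionally weight candidates by whether the model is already resident in memory.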

Naturally, we'll also need to build a facility for retries and for suspending unavailable hosts from the pool.

# API Coverage

Currently, HOPS is known to be functional with `v0.5.4` of Ollama, and the following Ollama REST endpoints are transparently implemented in HOPS:
1. `POST /api/generate` (single response and streaming)
2. `POST /api/chat` (single response and streaming)
3. `POST /api/embed`
4. `GET /api/tags` - returns the superset of models available across all known hosts
5. `POST /api/show` - returns the first instance of the specified model that the cluster supports
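Because the proxying is transparent, a client talks to HOPS exactly as it would to a single Ollama server. In the sketch below, the HOPS address (`localhost:8000`) and the model name are assumptions for illustration, not documented defaults:

```python
import requests

# Send the request to HOPS rather than to an individual Ollama instance;
# HOPS forwards it to a backend that has the requested model pulled.
resp = requests.post(
    "http://localhost:8000/api/chat",  # assumed HOPS address
    json={
        "model": "llama3.2",  # any model pulled somewhere in the fleet
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "stream": False,
    },
)
print(resp.json()["message"]["content"])
```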