# ovai - ollama-vertex-ai
HTTP proxy for accessing [Vertex AI] with the REST API interface of [ollama]. Optionally forwarding requests for other models to `ollama`. Written in [Go].
## Synopsis
Get embeddings for a text:
```
❯ curl localhost:22434/api/embed -d '{
"model": "text-embedding-005",
"input": "Half-orc is the best race for a barbarian."
}'{ "embeddings": [[0.05424513295292854, -0.023687424138188362, ...]] }
```

## Setup
1. Download an archive with the executable for your hardware and operating system from [GitHub Releases].
2. Download a JSON file with your Google service account key from the Google Cloud Console and save it to the current directory under the name `google-account.json`.
3. Optionally create a file `model-defaults.json` in the current directory to change the [default model parameters].
4. Run the server:

```
❯ ovai
Listening on http://localhost:22434 ...
```
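You can check that the server is up by calling the `/api/ping` endpoint described below:

```
❯ curl -f localhost:22434/api/ping -X HEAD
```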
### Configuring

The following properties from `google-account.json` are used:
```jsonc
{
"project_id": "...",
"private_key_id": "...",
"private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
"client_email": "...",
"scope": "https://www.googleapis.com/auth/cloud-platform", // optional, can be missing
"auth_uri": "https://www.googleapis.com/oauth2/v4/token" // optional, can be missing
}
```

Set the environment variable `PORT` to override the default port 22434.
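For example, a hypothetical invocation on a different port:

```
❯ PORT=8080 ovai
Listening on http://localhost:8080 ...
```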
Set the environment variable `DEBUG` to one or more strings separated by commas to customise logging on `stderr`. The default value is `ovai` when run on the command line and `ovai:srv` inside the Docker container.
| `DEBUG` value | What will be logged |
|:--------------|:-----------------------------------------------------------------|
| `ovai` | important information about the bodies of requests and responses |
| `ovai:srv` | methods and URLs of requests and status codes of responses |
| `ovai:net` | requests forwarded to Vertex AI and received responses |
| `ovai,ovai:*` | all information above |

Set the environment variable `OLLAMA_ORIGIN` to the origin of the `ollama` service to enable forwarding to `ollama`. If the requested model name doesn't start with `gemini`, `multimodalembedding`, `textembedding` or `text-embedding`, the request will be forwarded to the `ollama` service. This allows using `ovai` as a single service with the `ollama` interface, recognising both Vertex AI and `ollama` models.
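For example, a hypothetical launch with verbose logging and forwarding enabled, assuming `ollama` listens on its default port 11434:

```
❯ DEBUG=ovai,ovai:* OLLAMA_ORIGIN=http://localhost:11434 ovai
Listening on http://localhost:22434 ...
```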
Set the environment variable `NETWORK` to enforce IPv4 or IPv6. The default behaviour is to depend on the [Happy Eyeballs] implementation in Go and in the underlying OS. Valid values:
| `NETWORK` value | What will be used |
|:----------------|:---------------------------------------------|
| `IPV4` | enforce the network connection via IPV4 only |
| `IPV6` | enforce the network connection via IPV6 only |
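For example, a hypothetical launch enforcing IPv4:

```
❯ NETWORK=IPV4 ovai
```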
### Docker

For example, run a container for testing purposes with verbose logging, deleted on exit, exposing the port 22434:
```
docker run --rm -it -p 22434:22434 -e DEBUG=ovai,ovai:* \
  -v ${PWD}/google-account.json:/usr/src/app/google-account.json \
  ghcr.io/prantlf/ovai
```

For example, run a container named `ovai` in the background with custom defaults, forwarding to `ollama`, exposing the port 22434:
```
docker run --rm -dt -p 22434:22434 --name ovai \
  --add-host host.docker.internal:host-gateway \
  -e OLLAMA_ORIGIN=http://host.docker.internal:11434 \
  -v ${PWD}/google-account.json:/usr/src/app/google-account.json \
  -v ${PWD}/model-defaults.json:/usr/src/app/model-defaults.json \
  prantlf/ovai
```

And the same task as above, only using Docker Compose to make it easier (place [docker-compose.yml], or [docker-compose-ollama.yml] if you want to use `ollama` too, in the current directory):
```
docker-compose up -d --wait
docker-compose -f docker-compose-ollama.yml up -d --wait
```

The image is available as both `ghcr.io/prantlf/ovai` (GitHub) and `prantlf/ovai` (DockerHub).
### Building
Make sure that you have installed [Go] 1.22.3 or newer.
```
git clone https://github.com/prantlf/ovai.git
cd ovai
make
```

Executing `./ovai`, `make docker-start` or `make docker-up` requires the `google-account.json` file in the current directory, unless you only proxy calls to `ollama` (which needs the `OLLAMA_ORIGIN` environment variable).
## API
See the original [REST API documentation] for details about the interface. See also the [lifecycle of the Vertex AI models].
### Embeddings
Creates embedding vectors from the specified input. See the available [embedding models].
```
❯ curl localhost:22434/api/embed -d '{
"model": "textembedding-gecko@003",
"input": ["Half-orc is the best race for a barbarian."]
}'{ "embeddings": [[0.05424513295292854, -0.023687424138188362, ...]] }
```

The returned vector of floats has 768 dimensions.
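The `input` property accepts a single string as well as an array of strings; with an array, a request along these lines should return one vector per input (a hypothetical sketch):

```
❯ curl localhost:22434/api/embed -d '{
  "model": "text-embedding-005",
  "input": ["Half-orc is the best race for a barbarian.", "Elves make the best wizards."]
}'
{ "embeddings": [[...], [...]] }
```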
The older `/api/embeddings` request remains supported for compatibility:
```
❯ curl localhost:22434/api/embeddings -d '{
"model": "textembedding-gecko@003",
"prompt": "Half-orc is the best race for a barbarian."
}'{ "embedding": [0.05424513295292854, -0.023687424138188362, ...] }
```
### Text

Generates text using the specified prompt. See the available [gemini text and chat models].
```
❯ curl localhost:22434/api/generate -d '{
"model": "gemini-1.5-flash-002",
"prompt": "Describe guilds from Dungeons and Dragons.",
"images": [],
"stream": false
}'
{
"model": "gemini-1.5-flash-002",
"created_at": "2024-05-10T14:10:54.885Z",
"response": "Guilds serve as organizations that bring together individuals with ...",
"done": true,
"total_duration": 13884049373,
"load_duration": 0,
"prompt_eval_count": 7,
"prompt_eval_duration: 3471012343,
"eval_count: 557,
"eval_duration: 10413037030
}
```

The property `stream` defaults to `true`. The property `options` is optional with the following defaults:
```
"options": {
"num_predict": 8192,
"temperature": 1,
"top_p": 0.95,
"top_k": 40
}
```
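Durations in the response appear to follow the `ollama` convention of nanoseconds; in the example above, `eval_count` 557 tokens over `eval_duration` ≈ 10.4 s comes to roughly 53 tokens per second. A hypothetical request overriding some of the defaults:

```
❯ curl localhost:22434/api/generate -d '{
  "model": "gemini-1.5-flash-002",
  "prompt": "Describe guilds from Dungeons and Dragons.",
  "stream": false,
  "options": {
    "temperature": 0.2,
    "num_predict": 1024
  }
}'
```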
### Chat

Replies to a chat with the specified message history. See the available [gemini text and chat models].
```
❯ curl localhost:22434/api/chat -d '{
"model": "gemini-1.5-pro",
"messages": [
{
"role": "system",
"content": "You are an expert on Dungeons and Dragons."
},
{
"role": "user",
"content": "What race is the best for a barbarian?",
"images": []
}
],
"stream": false
}'
{
"model": "gemini-1.5-pro",
"created_at": "2024-05-06T23:32:05.219Z",
"message": {
"role": "assistant",
"content": "Half-Orcs are a strong and resilient race, making them ideal for barbarians. ..."
},
"done": true,
"total_duration": 2325524053,
"load_duration": 0,
"prompt_eval_count": 9,
"prompt_eval_duration: 581381013,
"eval_count: 292,
"eval_duration: 1744143040
}
```

The property `stream` defaults to `true`. The property `options` is optional with the following defaults:
```
"options": {
"num_predict": 8192,
"temperature": 1,
"top_p": 0.95,
"top_k": 40
}
```
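With `stream` left at its default `true`, the response arrives as a sequence of JSON lines in the `ollama` streaming format; a hypothetical streamed chat might look like this:

```
❯ curl localhost:22434/api/chat -d '{
  "model": "gemini-1.5-pro",
  "messages": [
    { "role": "user", "content": "What race is the best for a barbarian?" }
  ]
}'
{"model":"gemini-1.5-pro","created_at":"...","message":{"role":"assistant","content":"Half-Orcs"},"done":false}
{"model":"gemini-1.5-pro","created_at":"...","message":{"role":"assistant","content":" are a strong"},"done":false}
...
{"model":"gemini-1.5-pro","created_at":"...","message":{"role":"assistant","content":""},"done":true}
```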
### Tags

Lists available models.
```
❯ curl localhost:22434/api/tags
{
"models": [
{
"name": "moondream:latest",
"model": "moondream:latest",
"modified_at": "2024-06-02T16:39:32.532400236+02:00",
"size": 1738451197,
"digest": "55fc3abd386771e5b5d1bbcc732f3c3f4df6e9f9f08f1131f9cc27ba2d1eec5b",
"details": {
"parent_model": "",
"format": "gguf",
"family": "phi2",
"families": [
"phi2",
"clip"
],
"parameter_size": "1B",
"quantization_level": "Q4_0"
},
"expires_at": "0001-01-01T00:00:00Z"
}
]
}
```
### Show

Shows information about a model.
```
❯ curl localhost:22434/api/show -d '{"name":"moondream"}'
{
"license": "....",
"modelfile": "...",
"parameters": "temperature 0\nstop \"\u003c|endoftext|\u003e\"\nstop \"Question:\"",
"template": "{{ if .Prompt }} Question: {{ .Prompt }}\n\n{{ end }} Answer: {{ .Response }}\n\n",
"details": {
"parent_model": "",
"format": "gguf",
"family": "phi2",
"families": [
"phi2",
"clip"
],
"parameter_size": "1B",
"quantization_level": "Q4_0"
}
}
```
### Ping

Checks that the server is running.
```
❯ curl -f localhost:22434/api/ping -X HEAD
```
### Shutdown

Gracefully shuts down the HTTP server and exits the process.
```
❯ curl localhost:22434/api/shutdown -X POST
```

## Models
### Vertex AI
Recognised models for embeddings: textembedding-gecko@001, textembedding-gecko@002, textembedding-gecko@003, textembedding-gecko-multilingual@001, text-multilingual-embedding-002, text-embedding-004, text-embedding-005, multimodalembedding@001.
Recognised models for content generation and chat: gemini-2.0-flash-exp, gemini-1.5-flash-001, gemini-1.5-flash-002, gemini-1.5-flash-8b-001, gemini-1.5-pro-001, gemini-1.5-pro-002, gemini-1.0-pro-vision-001, gemini-1.0-pro-001, gemini-1.0-pro-002.
### Ollama
Small models usable on machines with less memory and no AI accelerator:
| Name | Size |
|:-----------------|-------:|
| gemma2:2b | 1.6 GB |
| granite3.1-dense:2b | 1.5 GB |
| granite3.1-moe:1b | 1.4 GB |
| granite3.1-moe:3b | 2.0 GB |
| granite-embedding:30m | 63 MB |
| granite-embedding:278m | 563 MB |
| internlm2:1.8b | 1.1 GB |
| llama3.2:1b | 1.3 GB |
| llama3.2:3b | 2.0 GB |
| llava-phi3 | 2.9 GB |
| moondream | 1.7 GB |
| nomic-embed-text | 274 MB |
| orca-mini | 2.0 GB |
| phi | 1.6 GB |
| phi3 | 2.2 GB |
| qwen2.5:0.5b | 397 MB |
| qwen2.5:1.5b | 986 MB |
| smollm | 990 MB |
| smollm:135m | 91 MB |
| smollm:360m | 229 MB |
| snowflake-arctic-embed2 | 1.2 GB |
| stablelm-zephyr | 1.6 GB |
| stablelm2 | 982 MB |
| tinyllama | 637 MB |
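Because a name like `tinyllama` doesn't match any of the Vertex AI prefixes, a request for it will be forwarded to `ollama`; a hypothetical example, assuming `OLLAMA_ORIGIN` is set and the model has been pulled:

```
❯ curl localhost:22434/api/generate -d '{
  "model": "tinyllama",
  "prompt": "Describe guilds from Dungeons and Dragons.",
  "stream": false
}'
```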
#### gemma2

Google Gemma 2 is a high-performing and efficient model available in three sizes: 2B, 9B, and 27B.

#### granite3.1-dense

The IBM Granite 2B and 8B models are text-only dense LLMs trained on over 12 trillion tokens of data, demonstrating significant improvements over their predecessors in performance and speed in IBM's initial testing.

#### granite3.1-moe

The IBM Granite 1B and 3B models are long-context mixture of experts (MoE) Granite models from IBM designed for low-latency usage.

#### granite-embedding

The IBM Granite Embedding 30M and 278M models are text-only dense biencoder embedding models, with 30M available in English only and 278M serving multilingual use cases.

#### internlm2

InternLM2.5 is a 7B parameter model tailored for practical scenarios with outstanding reasoning capability.

#### llama3.2

Meta's Llama 3.2 goes small with 1B and 3B models.

#### llava-phi3

A new small LLaVA model fine-tuned from Phi 3 Mini.

#### moondream

moondream2 is a small vision language model designed to run efficiently on edge devices.

#### nemotron-mini

A commercial-friendly small language model by NVIDIA optimized for roleplay, RAG QA, and function calling.

#### nomic-embed-text

A high-performing open embedding model with a large token context window.

#### nuextract

A 3.8B model fine-tuned on a private high-quality synthetic dataset for information extraction, based on Phi-3.

#### orca-mini

A general-purpose model ranging from 3 billion parameters to 70 billion, suitable for entry-level hardware.

#### phi

Phi-2: a 2.7B language model by Microsoft Research that demonstrates outstanding reasoning and language understanding capabilities.

#### phi3

Phi-3 is a family of lightweight 3B (Mini) and 14B (Medium) state-of-the-art open models by Microsoft.

#### qwen

Qwen 1.5 is a series of large language models by Alibaba Cloud spanning from 0.5B to 110B parameters.

#### qwen2

Qwen2 is a new series of large language models from the Alibaba group.

#### qwen2.5

Qwen2.5 models are pretrained on Alibaba's latest large-scale dataset, encompassing up to 18 trillion tokens. The model supports a context of up to 128K tokens and has multilingual support.

#### smollm

🪐 A family of small models with 135M, 360M, and 1.7B parameters, trained on a new high-quality dataset.

#### snowflake-arctic-embed2

Snowflake's frontier embedding model. Arctic Embed 2.0 adds multilingual support without sacrificing English performance or scalability.

#### stablelm-zephyr

A lightweight chat model allowing accurate and responsive output without requiring high-end hardware.

#### stablelm2

Stable LM 2 is a state-of-the-art 1.6B and 12B parameter language model trained on multilingual data in English, Spanish, German, Italian, French, Portuguese, and Dutch.

#### tinydolphin

An experimental 1.1B parameter model trained on the new Dolphin 2.8 dataset by Eric Hartford and based on TinyLlama.

#### tinyllama

The TinyLlama project is an open endeavor to train a compact 1.1B Llama model on 3 trillion tokens.

## Contributing
In lieu of a formal styleguide, take care to maintain the existing coding style. Lint and test your code.
## License
Copyright (C) 2024-2025 Ferdinand Prantl
Licensed under the [MIT License].
[MIT License]: http://en.wikipedia.org/wiki/MIT_License
[Vertex AI]: https://cloud.google.com/vertex-ai
[ollama]: https://ollama.com
[GitHub Releases]: https://github.com/prantlf/ovai/releases/
[Go]: https://go.dev
[default model parameters]: ./model-defaults.json
[Happy Eyeballs]: https://en.wikipedia.org/wiki/Happy_Eyeballs
[docker-compose.yml]: ./docker-compose.yml
[docker-compose-ollama.yml]: ./docker-compose-ollama.yml
[REST API documentation]: https://github.com/ollama/ollama/blob/main/docs/api.md
[lifecycle of the Vertex AI models]: https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versioning
[embedding models]: https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text-embeddings#model_versions
[gemini text and chat models]: https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/gemini#model_versions