# LLM Server
LLM Server is a Ruby Rack API that hosts the `llama.cpp` binary in memory(1) and provides an endpoint for text completion using the configured Language Model (LLM).

(1) The server now introduces an `interactive` configuration key. By default this value is set to `true`. I have found this mode works well with models like Llama, Open Llama, and Vicuna.
Other models, like Orca, tend to hallucinate; turning off `interactive` mode and loading the model on each request works well for Orca, especially for the smaller 3B model, which responds
very fast.

## Overview
LLM Server serves as a convenient wrapper for the [llama.cpp](https://github.com/ggerganov/llama.cpp) binary,
allowing you to interact with it through a simple API. It exposes a single endpoint that accepts text input and returns the completion generated by the Language Model.

The `llama.cpp` process is kept in memory to provide a better experience. Use any Language Model supported by `llama.cpp`.
Please look at the Configuration section below to set up your model.

https://github.com/mariochavez/llm_server/assets/59967/c4ea73ab-a06b-409a-b4c2-c27cc0556579

## Prerequisites
To use LLM Server, ensure that you have the following components installed:

- Ruby (version 3.2.2 or higher)
- A `llama.cpp` binary. The [llama.cpp](https://github.com/ggerganov/llama.cpp) repository has instructions to build the binary (see the sketch after this list)
- A Language Model (LLM) compatible with the `llama.cpp` binary. [Hugging Face](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads) is a place to look for a model
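
A common way to obtain the binary is to build it from source in a sibling directory; the sketch below is only an outline and the llama.cpp README remains the authoritative reference:

```bash
# Build llama.cpp next to this project so the default ../llama.cpp/main path in
# config/config.yml resolves; see the llama.cpp README for platform-specific flags
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
```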

## Getting Started

Follow these steps to set up and run the LLM Server:

1. Clone the LLM Server repository:

```bash
$ git clone https://github.com/mariochavez/llm_server.git
```

2. Change to the project directory:

```bash
$ cd llm_server
```

3. Install the required dependencies:

```bash
$ bundle install
```

4. Copy the file `config/config.yml.sample` to `config/config.yml`. The sample file is a template to configure your models. See below for more information.

5. Start the server:

```bash
$ bin/server
```

This will start the server on the default port (9292). Export a `PORT` variable before starting the server to use a different port. The Puma server starts in single mode with one thread to
protect the `llama.cpp` process from parallel inference. The Puma server enqueues requests and serves them first in, first out.
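
For example, to run the server on a different port (the port number here is only an illustration):

```bash
export PORT=8080   # any free port; 9292 is the default
bin/server
```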

## Configuration
Before looking into server configuration, remember that you need at least one Large Language Model compatible with `llama.cpp`.

Place your models inside the `./models` folder.

Update the configuration file to better fit your model.

```yaml
current_model: "vic-13b-1.3"
llama_bin: "../llama.cpp/main"
models_path: "./models"

models:
  "orca-3b":
    model: "orca-mini-3b.ggmlv3.q4_0.bin"
    interactive: false
    strip_before: "respuesta: "
    parameters: >
      -n 2048 -c 2048 --top_k 40 --temp 0.1 --repeat_penalty 1.2 -t 6 -ngl 1
    timeout: 90
  "vic-13b-1.3":
    model: "vicuna-13b-v1.3.0.ggmlv3.q4_0.bin"
    suffix: "Asistente:"
    reverse_prompt: "Usuario:"
    parameters: >
      -n 2048 -c 2048 --top_k 10000 --temp 0 --repeat_penalty 1.2 -t 4 -ngl 1
    timeout: 90
```

The `models` key allows you to configure one or more models for the server. Note that the server is not going to use all of them at the same time; only the model referenced by `current_model` is loaded.

To configure a model, use a unique, helpful name, e.g. open-llama-7b. Then add the following parameters:
- `model`: The file name of the model.
- `suffix`: String appended as a suffix to the prompt. This is required for interactive mode.
- `reverse_prompt`: Halts generation at this prompt and returns control, in interactive mode. This is required for interactive mode.
- `interactive`: Tells the server how to load the model. When `true`, the model is loaded in interactive mode and kept in memory. When `false`, the model is loaded on each request, which works fine for small models. By default, this value is `true`.
- `strip_before`: When running the model in non-interactive mode, use this to strip any unwanted text from the response.
- `parameters`: The parameters passed to the `llama.cpp` process to load and run your model. It is important that the model is executed as interactive to take advantage of being in memory all the time. See the `llama.cpp` documentation to learn what other parameters to pass to the process.
- `timeout`: How much time, in seconds, the server waits for the model to produce a response before it assumes the model did not respond.

The first three keys tell the server how to start the Large Language Model process.
- `current_model`: The key of a model defined under the `models` key. This is the model to be executed by the server.
- `llama_bin`: Points to the `llama.cpp` binary, relative to the server path.
- `models_path`: The path where models are saved, relative to the server path.
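
Putting the keys together, a minimal configuration for a single model might look like the sketch below. The model name, file name, and prompt markers are illustrative placeholders, not a real download:

```yaml
current_model: "open-llama-7b"
llama_bin: "../llama.cpp/main"
models_path: "./models"

models:
  "open-llama-7b":
    model: "open-llama-7b.ggmlv3.q4_0.bin"  # example file name, placed in ./models
    suffix: "Assistant:"                    # appended to the prompt in interactive mode
    reverse_prompt: "User:"                 # returns control to the server at this marker
    parameters: >
      -n 2048 -c 2048 --temp 0.2 -t 4
    timeout: 90
```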

## API Documentation
The API is simple: you send a JSON object as the payload and receive a JSON object as the response. You can include the `Accept` and `Content-Type` headers in every request with a value of `application/json`, or you can omit them and the server will assume that value for both.

If your request has different values for `Accept` or `Content-Type`, you will receive a status code `406 - Not Acceptable`.

Requesting an endpoint that is not available will produce a `404 - Not Found` response. In case of trouble with the Large Language Model, you receive a `503 - Service Unavailable` status code.
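
For instance, assuming the server is running locally on the default port, a request with a non-JSON `Accept` header should be rejected with `406 - Not Acceptable` (the endpoint itself is documented in the next section):

```bash
# Illustrative request: the mismatched Accept header should trigger a 406 response
curl -i -X POST \
  -H "Accept: text/html" \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Who created Ruby language?"}' \
  http://localhost:9292/completion
```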

### Text Completion

Endpoint: `POST /completion`

Request Body:
The request body should contain a JSON object with the following key:

- `prompt`: The input text for which completion is requested.

Example request body:

```json
{
"prompt": "Who created Ruby language?"
}
```

Response:
The response will be a JSON object containing the completion generated by the LLM and the used model.

Example response body:

```json
{
"model": "vicuna-13b-v1.3.0.ggmlv3.q4_0.bin",
"response": "The Ruby programming language was created by Yukihiro Matsumoto in the late 1990s. He wanted to create a simple, intuitive and dynamic language that could be used for various purposes such as web development, scripting and data analysis."
}
```

## Examples
Here's an example using `curl` to make a completion request:

```bash
curl -X POST -H "Content-Type: application/json" -d '{"prompt":"Who created Ruby language?"}' http://localhost:9292/completion
```

The response will be:

```json
{
"model": "vicuna-13b-v1.3.0.ggmlv3.q4_0.bin",
"response": "The Ruby programming language was created by Yukihiro Matsumoto in the late 1990s. He wanted to create a simple, intuitive and dynamic language that could be used for various purposes such as web development, scripting and data analysis."
}
```

Feel free to modify the request body and experiment with different input texts or to provide a more complex prompt for the model.

### The client

There is a gem [llm_client](https://rubygems.org/gems/llm_client) that you can use to interact with the LLM Server.
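
Install it like any other gem, either directly or through your Gemfile:

```bash
gem install llm_client
# or, in a Gemfile:
# gem "llm_client"
```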

Here is an example of how to use the gem.

```ruby
result = LlmClient.completion("Who is the creator of Ruby language?")

if result.success?
  puts "Completions generated successfully"
  response = result.success
  puts "Status: #{response.status}"
  puts "Body: #{response.body}"
  puts "Headers: #{response.headers}"
  calculated_response = response.body[:response]
  puts "Calculated Response: #{calculated_response}"
else
  puts "Failed to generate completions"
  error = result.failure
  puts "Error: #{error}"
end
```

## Contributing

Bug reports and pull requests are welcome on GitHub at [https://github.com/mariochavez/llm_server](https://github.com/mariochavez/llm_server).
This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/mariochavez/llm_server/blob/main/CODE_OF_CONDUCT.md).

## License

The gem is available as open source under the terms of the [MIT License](https://github.com/mariochavez/llm_server/blob/main/LICENSE.txt).

## Code of Conduct

Everyone interacting in the LLM Server project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/mariochavez/llm_server/blob/main/CODE_OF_CONDUCT.md).

## Conclusion
LLM Server provides a simple way to interact with the `llama.cpp` binary and leverage the power of your configured Language Model. You can integrate this server into your applications to facilitate text completion tasks.