# llama-cpp-server-python

**Bootstrap a [server from llama-cpp](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) in a few lines of python.**

```python
from openai import OpenAI
from llama_cpp_server_python import Server
repo = "Qwen/Qwen2-0.5B-Instruct-GGUF"
filename = "qwen2-0_5b-instruct-q4_0.gguf"
with Server.from_huggingface(repo=repo, filename=filename) as server:
    client = OpenAI(base_url=server.base_url)
    # interact with the client
```
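
For example, the "interact with the client" step might look like the following sketch, continuing inside the `with` block above. The request uses the standard OpenAI chat-completions API; the `model` value is assumed to be only a label here, since the server hosts the single downloaded model:

```python
# `client` is the OpenAI client created above, pointed at the local llama.cpp server.
response = client.chat.completions.create(
    model="qwen2-0_5b-instruct",  # assumed to be a label only; the server loads one model
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(response.choices[0].message.content)
```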

For more control, you can download the model and binary separately,
and pass in other parameters:

```python
from openai import OpenAI

from llama_cpp_server_python import Server, download_binary, download_model

# Same model as above.
repo = "Qwen/Qwen2-0.5B-Instruct-GGUF"
filename = "qwen2-0_5b-instruct-q4_0.gguf"
binary_path = "path/to/llama-server"
model_path = "path/to/model.gguf"

download_binary(binary_path)
download_model(dest=model_path, repo=repo, filename=filename)

server = Server(binary_path=binary_path, model_path=model_path, port=6000, ctx_size=1024)
server.start()
client = OpenAI(base_url=server.base_url)
# interact with the client
server.stop()  # or use a context manager as above
```

For details on the API, read the source code.

## Install

This currently only works on Linux and macOS. File an issue if you want a pointer on
what needs to happen to make Windows work.

For now, install directly from source:

`python -m pip install git+https://github.com/NickCrews/llama-cpp-server-python@00cc5ece8783848139d41fb7f9c5e5c9b7a62686`

I recommend using a static SHA for stability, but you could also do `@main` to be lazy.

## Motivation

I had a few requirements:

- use a local LLM (free)
- support batched inference (I was doing bulk processing, e.g. with pandas)
- support structured output (i.e. constrain the output to valid JSON)

I found https://github.com/abetlen/llama-cpp-python, but as of this writing
[it did not support batched inference](https://github.com/abetlen/llama-cpp-python/issues/771)
or structured output.

However, the [server from the upstream llama.cpp](https://github.com/ggerganov/llama.cpp/tree/master/examples/server)
project supports all of these requirements:
see the `--cont-batching` argument at server startup,
and the `json_schema` param of the `/completion` endpoint.

So I wanted a quick and easy way to

- download and install the server binary
- download some model weights from the Hugging Face Hub
- get a server running and then use its `http://localhost:8080` URL in a client.

This is NOT a client. You can either use an OpenAI client library as above,
or send HTTP POST requests directly.
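
For example, here is a minimal sketch of a raw request to the server's `/completion` endpoint using the `json_schema` parameter mentioned above. It assumes the default `http://localhost:8080` address and the third-party `requests` library; the exact payload shape can vary between llama.cpp server versions:

```python
import requests

# Constrain the completion to valid JSON matching this schema.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
resp = requests.post(
    "http://localhost:8080/completion",  # or server.base_url + "/completion"
    json={
        "prompt": "Invent a fictional person and describe them as JSON.",
        "json_schema": schema,
        "n_predict": 128,
    },
)
resp.raise_for_status()
print(resp.json()["content"])  # the schema-constrained generated text
```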

## License

MIT