https://github.com/toogle/mlx-dev-server
A server to run MLX models locally, optimized for code completion
Last synced: 4 months ago
- Host: GitHub
- URL: https://github.com/toogle/mlx-dev-server
- Owner: toogle
- License: mit
- Created: 2025-02-23T14:21:53.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-10-31T20:41:03.000Z (7 months ago)
- Last Synced: 2025-10-31T20:47:43.067Z (7 months ago)
- Language: Python
- Size: 72.3 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# MLX Dev Server
[Installation](#installation) | [Usage](#usage) | [Examples](#examples)
A simple solution to run LLMs locally on Macs with Apple Silicon. Optimized for code completion tasks with DeepSeek, Qwen and other models.

## Features
- 🚀 **Fast**: uses [Apple MLX](https://github.com/ml-explore/mlx) to run models on the GPU using unified memory
- 💪 **Efficient**: cancels generation when the client disconnects (see [Motivation](#motivation) for why this matters for code completion)
- 🧩 **Compatible**: provides an OpenAI-like API that integrates easily with existing applications (see [Examples](#examples))
- 💾 **Memory Efficient**: unloads models that have not been used for a while (see `--keep-alive` in [Usage](#usage))
- 🔗 **Reliable**: test coverage is 97%
## Motivation
While [Ollama](https://github.com/ollama/ollama) works well for many tasks, it can be less responsive for code completion because of how it handles prompt processing.
Code completion requires fast processing of large inputs (1k+ tokens) followed by short outputs (typically under 100 tokens). Most completions are also cancelled: developers often pause for a moment, then continue typing, which discards the pending completion. Ollama processes the entire prompt before it acts on a cancellation, which can cause delays.
MLX Dev Server addresses this by cancelling both prompt processing and generation as soon as the client disconnects, which keeps code completion consistently responsive.
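
The idea of cancelling during prompt processing can be illustrated with a small simulation. This is a minimal sketch, not the server's actual code: `process_prompt`, the `on_chunk` hook, and the use of `threading.Event` as the disconnect signal are all hypothetical names chosen for illustration. The prompt is handled in fixed-size chunks, and the cancellation flag is checked between chunks, so a disconnect stops the work early instead of after the whole prompt.

```python
import threading


def process_prompt(tokens, step_size, cancelled, on_chunk=None):
    """Process `tokens` in chunks of `step_size`, stopping early if cancelled.

    Returns the number of tokens processed. Hypothetical helper for illustration.
    """
    processed = 0
    for start in range(0, len(tokens), step_size):
        if cancelled.is_set():
            break  # the client went away: skip the remaining chunks
        chunk = tokens[start:start + step_size]
        if on_chunk is not None:
            on_chunk(chunk)  # a real server would run the model on this chunk
        processed += len(chunk)
    return processed


cancelled = threading.Event()
prompt = list(range(1000))  # stand-in for a 1000-token prompt
chunks_seen = []


def simulate_disconnect(chunk):
    # Pretend the client disconnects while the third chunk is being processed.
    chunks_seen.append(chunk)
    if len(chunks_seen) == 3:
        cancelled.set()


processed = process_prompt(prompt, 128, cancelled, on_chunk=simulate_disconnect)
print(processed)  # 384: three chunks of 128 tokens, then the loop stops
```

With a smaller step size the flag is checked more often, so a cancellation takes effect sooner, at the cost of more per-chunk overhead.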
## Installation
```bash
pip install mlx-dev-server
```
## Usage
Simply run `mlx_dev_server`.
Available command line arguments:
- `-p, --port`: Port to listen on (default is `8080`)
- `-k, --keep-alive`: Time in seconds to keep models loaded in memory (default is `300`)
- `-m, --max-loaded-models`: Maximum number of models to keep loaded (default is `2`)
- `--host`: Host to listen on (default is `localhost`)
- `--max-tokens`: Maximum tokens to generate if not specified (default is `4096`)
- `--max-kv-size`: Maximum size of the key-value cache (default is `4096`)
- `--prefill-step-size`: Step size for prompt processing (default is `128`)
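
To make the interplay of `--keep-alive` and `--max-loaded-models` concrete, here is a minimal sketch of one way such a policy could work. It is an assumption for illustration only: the `ModelCache` class, its `loader` and `clock` parameters, and the least-recently-used eviction order are hypothetical and may differ from what the server actually does.

```python
import time
from collections import OrderedDict


class ModelCache:
    """Hypothetical cache: evicts idle models and caps how many stay loaded."""

    def __init__(self, loader, keep_alive=300.0, max_loaded_models=2,
                 clock=time.monotonic):
        self._loader = loader
        self._keep_alive = keep_alive
        self._max_loaded = max_loaded_models
        self._clock = clock
        self._models = OrderedDict()  # name -> (model, last_used), oldest first

    def evict_expired(self):
        """Unload every model idle for longer than `keep_alive` seconds."""
        now = self._clock()
        expired = [name for name, (_, used) in self._models.items()
                   if now - used > self._keep_alive]
        for name in expired:
            del self._models[name]

    def get(self, name):
        """Return the model, loading it if needed, and mark it as just used."""
        self.evict_expired()
        if name in self._models:
            model, _ = self._models.pop(name)
        else:
            model = self._loader(name)
        self._models[name] = (model, self._clock())  # most recent goes last
        while len(self._models) > self._max_loaded:
            self._models.popitem(last=False)  # drop the least recently used
        return model

    def loaded(self):
        return list(self._models)


# Usage with a fake clock and a fake loader, so nothing is actually loaded.
now = [0.0]
cache = ModelCache(loader=lambda name: f"<{name}>", clock=lambda: now[0])
for name in ("model-a", "model-b", "model-c"):
    cache.get(name)
print(cache.loaded())  # ['model-b', 'model-c']: model-a exceeded the limit of 2

now[0] = 301.0  # more than 300 seconds without any request
cache.evict_expired()
print(cache.loaded())  # []
```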
## Examples
### VSCode
Install the [llm-vscode](https://marketplace.visualstudio.com/items?itemName=HuggingFace.huggingface-vscode) extension, then add the following to `settings.json`:
```json
{
  "llm.backend": "openai",
  "llm.url": "http://localhost:8080",
  "llm.modelId": "mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx",
  "llm.configTemplate": "Custom",
  "llm.requestBody": {
    "parameters": {
      "temperature": 0.2,
      "top_p": 0.95,
      "max_tokens": 60
    }
  },
  "llm.fillInTheMiddle.prefix": "<|fim▁begin|>",
  "llm.fillInTheMiddle.middle": "<|fim▁end|>",
  "llm.fillInTheMiddle.suffix": "<|fim▁hole|>",
  "llm.tokenizer": {
    "repository": "mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx"
  },
  "llm.contextWindow": 1024
}
```
> [!NOTE]
> This configuration limits the number of generated tokens to 60.
> The limit keeps responses fast when the model decides to generate a multi-line code snippet.
### Neovim
Add the following spec to your [lazy.nvim](https://github.com/folke/lazy.nvim) configuration to enable the [llm.nvim](https://github.com/huggingface/llm.nvim) plugin:
```lua
{
  'huggingface/llm.nvim',
  opts = {
    backend = 'openai',
    url = 'http://localhost:8080',
    model = 'mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx',
    request_body = {
      temperature = 0.2,
      top_p = 0.95,
      max_tokens = 60
    },
    fim = {
      prefix = '<|fim▁begin|>',
      middle = '<|fim▁end|>',
      suffix = '<|fim▁hole|>'
    },
    tokenizer = {
      repository = 'mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx'
    },
    context_window = 1024
  }
}
```
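
Both editor configurations above map the plugin's `middle` setting to the `fim▁end` token and `suffix` to `fim▁hole`, which can look reversed at first. The sketch below shows why, assuming the plugins assemble the prompt as prefix token, text before the cursor, suffix token, text after the cursor, middle token. That ordering is an assumption about the plugins, and `build_fim_prompt` is a hypothetical helper, not part of either plugin or of this server.

```python
# Token strings copied from the configurations above.
FIM_PREFIX = "<|fim▁begin|>"   # the plugins' "prefix" setting
FIM_SUFFIX = "<|fim▁hole|>"    # the plugins' "suffix" setting
FIM_MIDDLE = "<|fim▁end|>"     # the plugins' "middle" setting


def build_fim_prompt(before_cursor: str, after_cursor: str) -> str:
    """Assemble a fill-in-the-middle prompt (assumed ordering, see above)."""
    return f"{FIM_PREFIX}{before_cursor}{FIM_SUFFIX}{after_cursor}{FIM_MIDDLE}"


before = "def add(a, b):\n    return "
after = "\n\nprint(add(1, 2))\n"
prompt = build_fim_prompt(before, after)
print(prompt)
```

Under that assumption the hole marker sits where the cursor is and the end marker closes the prompt, so the model's output is the text that belongs at the cursor.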
### OpenAI Python API library
```python
from openai import OpenAI

# Point the client at the local server instead of the OpenAI API.
client = OpenAI(
    base_url='http://localhost:8080/v1',
    api_key='mlx-dev-server',  # any value works; the client requires one
)

response = client.chat.completions.create(
    model='mlx-community/Mistral-Nemo-Instruct-2407-8bit',
    messages=[{
        'role': 'user',
        'content': 'say hello',
    }],
)
print(response.choices[0].message.content)
```