An open API service indexing awesome lists of open source software.

https://github.com/lukehinds/fastllm

FastLLM - Rust based LLM Inference API
https://github.com/lukehinds/fastllm

inference llama llm mistral rust security speed

Last synced: about 1 year ago
JSON representation

FastLLM - Rust based LLM Inference API

Awesome Lists containing this project

README

          

# FastLLM

A Rust inference server providing OpenAI-compatible APIs for local LLM deployment. Run language models directly from HuggingFace with native MacOS Metal, CUDA and CPU support.

## Experimental

This is a work in progress and the API is not yet stable!

## Key Features
- Native acceleration on Metal (Apple Silicon) and CUDA
- Direct model loading from HuggingFace Hub
- Run various architectures like Mistral, Qwen, TinyLlama
- Generate embeddings using models like all-MiniLM-L6-v2

## Design Principles

FastLLM adheres to the following core design principles:

**Simple and Modular**
- Clean, well-documented code structure
- Modular architecture for easy model integration
- Trait-based design for flexible model implementations
- Automatic architecture detection from model configs

**Zero Config**
- Sensible defaults for all features and optimizations
- Automatic hardware detection and optimization
- Smart fallbacks when optimal settings aren't available

**Easy to Extend**
- Clear separation of concerns
- Minimal boilerplate for adding new models
- Comprehensive test coverage and examples
- Detailed documentation for model integration

The goal is to make it as straightforward as possible to add new models while maintaining high performance by default.

## Supported Models

| Model Family | Supported Architectures | Example Models |
|------------|-----------------|-------------|
| **Llama** | LlamaForCausalLM | • TinyLlama-1.1B-Chat
• Any Llama2 derivative |
| **Mistral** | MistralForCausalLM | • Mistral-7B and derivatives
• Mixtral-8x7B |
| **Qwen** | Qwen2ForCausalLM
Qwen2_5_VLForConditionalGeneration | • Qwen2
• Qwen2.5 |
| **BERT** | BertModel
RobertaModel
DebertaModel | • all-MiniLM-L6-v2
• Any BERT/RoBERTa/DeBERTa model |

## Quick Start

### Prerequisites

- Rust toolchain ([install from rustup.rs](https://rustup.rs))
- HuggingFace token (for gated models)

### Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/fastllm.git
cd fastllm

# Optional: Set HuggingFace token for gated models
export HF_TOKEN="your_token_here"

# Build the project (MacOS Metal)
cargo build --release --features "metal"

# Build the project (Linux CUDA)
cargo build --release --features "cuda"

# Build the project (CPU)
cargo build --release
```

### Running the Server

```bash
# Start with default settings
./target/release/fastllm

# Or specify a model directly
./target/release/fastllm --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

## 🔧 Configuration

FastLLM can be configured through multiple methods (in order of precedence):

1. **Command Line Arguments**
```bash
./target/release/fastllm --model mistralai/Mistral-7B-v0.1
```

2. **Environment Variables**
```bash
export FASTLLM_SERVER__HOST=0.0.0.0
export FASTLLM_SERVER__PORT=8080
export FASTLLM_MODEL__MODEL_ID=your-model-id
```

3. **Configuration File** (`config.json`)
```json
{
"server": {
"host": "127.0.0.1",
"port": 3000
},
"model": {
"model_id": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"revision": "main"
}
}
```

## 🔌 API Examples

### Chat Completion

```bash
curl http://localhost:3000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
],
"temperature": 0.7,
"stream": true
}'
```

### Text Embeddings

```bash
curl http://localhost:3000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "sentence-transformers/all-MiniLM-L6-v2",
"input": "The food was delicious and the service was excellent."
}'
```

## 🗺️ Roadmap

- [ ] Support for more architectures (DeepSeek, Phi, etc.)
- [ ] Comprehensive benchmarking suite
- [ ] Model management API (/v1/models)
- [ ] Improved caching and optimization
- [ ] Multi-GPU Inference

## Contributing

Contributions are welcome! Feel free to:
- Open issues for bugs or feature requests
- Submit pull requests
- Share benchmarks and performance reports

## License

Apache 2.0 - See [LICENSE](LICENSE) for details.

---