https://github.com/lukehinds/fastllm
FastLLM - Rust based LLM Inference API
inference llama llm mistral rust security speed
- Host: GitHub
- URL: https://github.com/lukehinds/fastllm
- Owner: lukehinds
- License: apache-2.0
- Created: 2025-02-01T21:52:56.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-31T13:55:39.000Z (over 1 year ago)
- Last Synced: 2025-04-15T22:56:34.873Z (about 1 year ago)
- Topics: inference, llama, llm, mistral, rust, security, speed
- Language: Rust
- Homepage:
- Size: 151 KB
- Stars: 12
- Watchers: 1
- Forks: 1
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# FastLLM
A Rust inference server providing OpenAI-compatible APIs for local LLM deployment. Run language models directly from HuggingFace with native macOS Metal, CUDA, and CPU support.
## Experimental
This is a work in progress and the API is not yet stable!
## Key Features
- Native acceleration on Metal (Apple Silicon) and CUDA
- Direct model loading from HuggingFace Hub
- Run various architectures like Mistral, Qwen, TinyLlama
- Generate embeddings using models like all-MiniLM-L6-v2
## Design Principles
FastLLM adheres to the following core design principles:
**Simple and Modular**
- Clean, well-documented code structure
- Modular architecture for easy model integration
- Trait-based design for flexible model implementations
- Automatic architecture detection from model configs
**Zero Config**
- Sensible defaults for all features and optimizations
- Automatic hardware detection and optimization
- Smart fallbacks when optimal settings aren't available
**Easy to Extend**
- Clear separation of concerns
- Minimal boilerplate for adding new models
- Comprehensive test coverage and examples
- Detailed documentation for model integration
The goal is to make it as straightforward as possible to add new models while maintaining high performance by default.
## Supported Models
| Model Family | Supported Architectures | Example Models |
|------------|-----------------|-------------|
| **Llama** | LlamaForCausalLM | • TinyLlama-1.1B-Chat<br>• Any Llama2 derivative |
| **Mistral** | MistralForCausalLM | • Mistral-7B and derivatives<br>• Mixtral-8x7B |
| **Qwen** | Qwen2ForCausalLM<br>Qwen2_5_VLForConditionalGeneration | • Qwen2<br>• Qwen2.5 |
| **BERT** | BertModel<br>RobertaModel<br>DebertaModel | • all-MiniLM-L6-v2<br>• Any BERT/RoBERTa/DeBERTa model |
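
To make the trait-based design and automatic architecture detection more concrete, here is a minimal sketch of how the architecture names in the table above could be mapped to model backends. The `Model` trait, the backend structs, and `detect_model` are hypothetical names chosen for illustration; they are not taken from the FastLLM codebase. The sketch assumes the architecture string comes from the `architectures` field of a model's HuggingFace `config.json`.

```rust
// Hypothetical sketch: none of these names come from the FastLLM source.

/// Common interface that every model backend would implement.
trait Model {
    /// Short family name, e.g. "llama".
    fn family(&self) -> &'static str;

    /// Whether this backend produces embeddings instead of generated text.
    /// Defaults to `false`, so text-generation backends need no extra code.
    fn is_embedding_model(&self) -> bool {
        false
    }
}

struct Llama;
struct Mistral;
struct Qwen;
struct Bert;

impl Model for Llama {
    fn family(&self) -> &'static str { "llama" }
}
impl Model for Mistral {
    fn family(&self) -> &'static str { "mistral" }
}
impl Model for Qwen {
    fn family(&self) -> &'static str { "qwen" }
}
impl Model for Bert {
    fn family(&self) -> &'static str { "bert" }
    fn is_embedding_model(&self) -> bool { true }
}

/// Map an architecture name (as listed in the table above) to a backend.
/// Returns `None` for architectures this sketch does not know about.
fn detect_model(architecture: &str) -> Option<Box<dyn Model>> {
    match architecture {
        "LlamaForCausalLM" => Some(Box::new(Llama)),
        "MistralForCausalLM" => Some(Box::new(Mistral)),
        "Qwen2ForCausalLM" | "Qwen2_5_VLForConditionalGeneration" => Some(Box::new(Qwen)),
        "BertModel" | "RobertaModel" | "DebertaModel" => Some(Box::new(Bert)),
        _ => None,
    }
}
```

With this shape, supporting a new family would mean adding one struct, one `impl Model`, and one `match` arm, which is the kind of low-boilerplate extension the design principles describe.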
## Quick Start
### Prerequisites
- Rust toolchain ([install from rustup.rs](https://rustup.rs))
- HuggingFace token (for gated models)
### Installation
```bash
# Clone the repository
git clone https://github.com/lukehinds/fastllm.git
cd fastllm
# Optional: Set HuggingFace token for gated models
export HF_TOKEN="your_token_here"
# Build the project (macOS Metal)
cargo build --release --features "metal"
# Build the project (Linux CUDA)
cargo build --release --features "cuda"
# Build the project (CPU)
cargo build --release
```
### Running the Server
```bash
# Start with default settings
./target/release/fastllm
# Or specify a model directly
./target/release/fastllm --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
```
## 🔧 Configuration
FastLLM can be configured through multiple methods (in order of precedence):
1. **Command Line Arguments**
```bash
./target/release/fastllm --model mistralai/Mistral-7B-v0.1
```
2. **Environment Variables**
```bash
export FASTLLM_SERVER__HOST=0.0.0.0
export FASTLLM_SERVER__PORT=8080
export FASTLLM_MODEL__MODEL_ID=your-model-id
```
3. **Configuration File** (`config.json`)
```json
{
  "server": {
    "host": "127.0.0.1",
    "port": 3000
  },
  "model": {
    "model_id": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "revision": "main"
  }
}
```
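
The precedence rule can be sketched as a small resolution function. This is an illustration only: it assumes the list above runs from highest to lowest precedence, and `resolve_model_id`, its parameters, and the fallback default are hypothetical. Only the environment variable name `FASTLLM_MODEL__MODEL_ID` comes from the examples above.

```rust
use std::collections::HashMap;

/// Pick the model ID from the first source that provides one.
/// Assumed order: CLI argument, environment variable, config file, default.
/// The environment is passed in as a map so the logic is easy to test.
fn resolve_model_id(
    cli_model: Option<&str>,
    env: &HashMap<String, String>,
    file_model: Option<&str>,
    default_model: &str, // hypothetical built-in fallback
) -> String {
    cli_model
        .map(str::to_owned)
        // Consult the environment only when no CLI value was given.
        .or_else(|| env.get("FASTLLM_MODEL__MODEL_ID").cloned())
        // Fall back to the value read from config.json, if any.
        .or_else(|| file_model.map(str::to_owned))
        .unwrap_or_else(|| default_model.to_owned())
}
```

In a real binary the `env` map could be built with `std::env::vars().collect()`, and the same chain would apply to the host and port settings.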
## 🔌 API Examples
### Chat Completion
```bash
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "stream": true
  }'
```
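
With `"stream": true`, a client has to split the response into individual chunks. The sketch below assumes FastLLM follows the OpenAI streaming convention, in which each chunk arrives as a server-sent-events line of the form `data: <json>` and the stream ends with `data: [DONE]`; this README does not confirm that, so treat it as an assumption. The function name `sse_payloads` is hypothetical.

```rust
/// Extract the JSON payloads from a server-sent-events response body.
/// Assumes OpenAI-style framing: `data: <json>` lines, ended by `data: [DONE]`.
fn sse_payloads(body: &str) -> Vec<&str> {
    let mut payloads = Vec::new();
    for line in body.lines() {
        // Blank separator lines and non-data fields are skipped.
        if let Some(data) = line.strip_prefix("data:") {
            let data = data.trim();
            if data == "[DONE]" {
                break; // end-of-stream sentinel: ignore anything after it
            }
            if !data.is_empty() {
                payloads.push(data);
            }
        }
    }
    payloads
}
```

Each returned string can then be handed to a JSON parser to read the generated text fragment from the chunk.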
### Text Embeddings
```bash
curl http://localhost:3000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sentence-transformers/all-MiniLM-L6-v2",
    "input": "The food was delicious and the service was excellent."
  }'
```
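
A common next step with embeddings is comparing two of them. As a minimal sketch, assuming the vectors have already been extracted from the response into `f32` slices (the response layout itself is not documented here), cosine similarity can be computed as follows. The function name `cosine_similarity` is chosen for illustration.

```rust
/// Cosine similarity between two embedding vectors.
/// Returns `None` if the lengths differ or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    // Dot product of the two vectors.
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    // Euclidean norms of each vector.
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

Values near 1 indicate vectors pointing in nearly the same direction, and values near 0 indicate unrelated directions.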
## 🗺️ Roadmap
- [ ] Support for more architectures (DeepSeek, Phi, etc.)
- [ ] Comprehensive benchmarking suite
- [ ] Model management API (/v1/models)
- [ ] Improved caching and optimization
- [ ] Multi-GPU Inference
## Contributing
Contributions are welcome! Feel free to:
- Open issues for bugs or feature requests
- Submit pull requests
- Share benchmarks and performance reports
## License
Apache 2.0 - See [LICENSE](LICENSE) for details.
---