https://github.com/docusealco/rllama
Ruby FFI bindings for llama.cpp to run open-source LLMs such as GPT-OSS, Qwen 3, Gemma 3, and Llama 3 locally with Ruby.
- Host: GitHub
- URL: https://github.com/docusealco/rllama
- Owner: docusealco
- Created: 2025-10-05T08:02:20.000Z (5 months ago)
- Default Branch: master
- Last Pushed: 2025-10-07T13:37:13.000Z (5 months ago)
- Last Synced: 2025-10-12T16:11:31.111Z (4 months ago)
- Topics: ai, embeddings, ffi, gguf, inference, llamacpp, llm, ruby
- Language: Ruby
- Homepage: https://www.docuseal.com/blog/run-open-source-llms-locally-with-ruby
- Size: 40 KB
- Stars: 66
- Watchers: 0
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md

# Rllama
Ruby bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) to run open-source language models locally. Run models like GPT-OSS, Qwen 3, Gemma 3, Llama 3, and many others directly in your Ruby application code.
## Installation
Add this line to your application's Gemfile:
```ruby
gem 'rllama'
```
And then execute:
```bash
bundle install
```
Or install it yourself as:
```bash
gem install rllama
```
## CLI Chat
The `rllama` command-line utility provides an interactive chat interface for conversing with language models. After installing the gem, you can start chatting immediately:
```bash
rllama
```
When you run `rllama` without arguments, it will display:
- **Downloaded models**: Any models you've already downloaded to `~/.rllama/models/`
- **Popular models**: A curated list of popular models available for download, including:
  - Gemma 3 1B
  - Llama 3.2 3B
  - Phi-4
  - Qwen3 30B
  - GPT-OSS
Simply enter the number of the model you want to use. If you select a model that hasn't been downloaded yet, it will be automatically downloaded from Hugging Face.
You can also specify a model path or URL directly:
```bash
rllama path/to/your/model.gguf
```
```bash
rllama https://huggingface.co/microsoft/phi-4-gguf/resolve/main/phi-4-Q3_K_S.gguf
```
Once the model has loaded, you can start chatting.
## Usage
### Text Generation
Generate text completions using local language models:
```ruby
require 'rllama'

# Load a model
model = Rllama.load_model('lmstudio-community/gemma-3-1B-it-QAT-GGUF/gemma-3-1B-it-QAT-Q4_0.gguf')

# Generate text
result = model.generate('What is the capital of France?')
puts result.text
# => "The capital of France is Paris."

# Access generation statistics
puts "Tokens generated: #{result.stats[:tokens_generated]}"
puts "Tokens per second: #{result.stats[:tps]}"
puts "Duration: #{result.stats[:duration]} seconds"

# Don't forget to close the model when done
model.close
```
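Because the model wraps native llama.cpp state, it is worth guaranteeing the `close` call even when generation raises. A minimal sketch using only the calls shown above, with `begin`/`ensure` handling cleanup:
```ruby
require 'rllama'

model = Rllama.load_model('lmstudio-community/gemma-3-1B-it-QAT-GGUF/gemma-3-1B-it-QAT-Q4_0.gguf')

begin
  result = model.generate('Summarize Ruby in one sentence.')
  puts result.text
ensure
  # Release the native llama.cpp resources even if generation raises
  model.close
end
```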
#### Generation parameters
Tune generation behavior with sampling parameters:
```ruby
result = model.generate(
  'Write a short poem about Ruby programming',
  max_tokens: 2024,
  temperature: 0.8,
  top_k: 40,
  top_p: 0.95,
  min_p: 0.05
)
```
#### Streaming generation
Stream generated text token-by-token:
```ruby
model.generate('Explain quantum computing') do |token|
  print token
end
```
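Streaming composes with the sampling parameters above; assuming both can be passed on the same `generate` call, a small sketch that prints tokens live while collecting the full reply:
```ruby
# Accumulate streamed tokens while echoing them to the terminal
buffer = +''

model.generate('Explain quantum computing', max_tokens: 256) do |token|
  print token
  buffer << token
end

puts
puts "Received #{buffer.length} characters"
```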
#### System prompt
Include a system prompt to guide model behavior:
```ruby
result = model.generate(
  'What are best practices for Ruby development?',
  system: 'You are an expert Ruby developer with 10 years of experience.'
)
```
#### Messages list
Pass multiple messages with roles for more complex interactions:
```ruby
result = model.generate([
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'What is the capital of France?' },
  { role: 'assistant', content: 'The capital of France is Paris.' },
  { role: 'user', content: 'What is its population?' }
])

puts result.text
```
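One way to continue a multi-turn exchange without a dedicated context object is to append each reply to the array yourself; a sketch built on the same `generate` call:
```ruby
messages = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'What is the capital of France?' }
]

result = model.generate(messages)
puts result.text

# Append the assistant's reply and the next user turn to keep the history
messages << { role: 'assistant', content: result.text }
messages << { role: 'user', content: 'What is its population?' }

followup = model.generate(messages)
puts followup.text
```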
### Chat
For ongoing conversations, use a context object that maintains the conversation history:
```ruby
# Initialize a chat context
context = model.init_context

# Send messages and maintain conversation history
response1 = context.message('What is the capital of France?')
puts response1.text
# => "The capital of France is Paris."

response2 = context.message('What is the population of that city?')
puts response2.text
# => "Paris has a population of approximately 2.1 million people..."

response3 = context.message('What was my first message?')
puts response3.text
# => "Your first message was asking about the capital of France."

# The context remembers all previous messages in the conversation

# Close the context when done
context.close
```
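Putting it together, a minimal terminal chat loop takes only a few lines; a sketch assuming nothing beyond the `init_context`, `message`, and `close` calls shown above:
```ruby
require 'rllama'

model = Rllama.load_model('lmstudio-community/gemma-3-1B-it-QAT-GGUF/gemma-3-1B-it-QAT-Q4_0.gguf')
context = model.init_context

begin
  loop do
    print '> '
    input = $stdin.gets&.strip
    break if input.nil? || input.empty?

    puts context.message(input).text
  end
ensure
  # Free the context and model even if the loop is interrupted
  context.close
  model.close
end
```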
### Embeddings
Generate vector embeddings for text using embedding models:
```ruby
require 'rllama'

# Load an embedding model
model = Rllama.load_model('lmstudio-community/embeddinggemma-300m-qat-GGUF/embeddinggemma-300m-qat-Q4_0.gguf')

# Generate an embedding for a single text
embedding = model.embed('Hello, world!')
puts embedding.length
# => 768 (depending on your model)

# Generate embeddings for multiple sentences
embeddings = model.embed([
  'roses are red',
  'violets are blue',
  'sugar is sweet'
])
puts embeddings.length
# => 3
puts embeddings[0].length
# => 768

model.close
```
#### Vector parameters
By default, embedding vectors are normalized. You can disable normalization with `normalize: false`:
```ruby
# Generate unnormalized embeddings
embedding = model.embed('Sample text', normalize: false)
```
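Because embeddings are normalized by default, cosine similarity reduces to a plain dot product, which is enough for a small semantic-search sketch (the `dot` helper here is illustrative, not part of the gem):
```ruby
# Dot product of two equal-length float arrays
def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

docs = ['Ruby is a dynamic language', 'Paris is the capital of France']
doc_embeddings = model.embed(docs) # normalized by default

query = model.embed('Which city is the French capital?')

# With normalized vectors, the dot product equals cosine similarity
scores = doc_embeddings.map { |e| dot(query, e) }
best_score, best_index = scores.each_with_index.max_by { |score, _| score }
puts "Best match: #{docs[best_index]} (#{best_score.round(3)})"
```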
## Finding Models
You can download models in GGUF format from various sources:
- [Hugging Face](https://huggingface.co/models?library=gguf) - Search for models in the GGUF format
## License
MIT
## Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/docusealco/rllama.