Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.

kieranklaassen/leva
LLM Evaluation Framework for Rails apps, to be used with production data.
https://github.com/kieranklaassen/leva
Topics: llm, llm-evaluation, ruby-on-rails
- Host: GitHub
- URL: https://github.com/kieranklaassen/leva
- Owner: kieranklaassen
- License: MIT
- Created: 2024-08-13T17:41:18.000Z
- Default Branch: main
- Last Pushed: 2024-09-20T02:43:33.000Z
- Last Synced: 2024-09-22T23:31:56.600Z
- Topics: llm, llm-evaluation, ruby-on-rails
- Language: HTML
- Size: 248 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 2
- Metadata Files:
  - Readme: README.md
  - License: MIT-LICENSE
# Leva - Flexible Evaluation Framework for Language Models
Leva is a Ruby on Rails framework for evaluating Large Language Models (LLMs) against datasets built from your production ActiveRecord models. It provides a flexible structure for creating experiments, managing datasets, and implementing custom evaluation logic on production data, with security in mind.
![Leva Workbench](https://github.com/user-attachments/assets/ee487941-e11b-4c2a-983b-771ef27dd73c)
![Leva screenshot](https://github.com/user-attachments/assets/f9986a12-731b-4747-9f86-5ac6fffd5cbc)

## Installation
Add this line to your application's Gemfile:
```ruby
gem 'leva'
```

And then execute:
```bash
bundle install
```

Add the migrations to your database:
```bash
rails leva:install:migrations
rails db:migrate
```

## Usage
### 1. Setting up Datasets
First, create a dataset and add any ActiveRecord records you want to evaluate against. To make your models compatible with Leva, include the `Leva::Recordable` concern in your model:
```ruby
class TextContent < ApplicationRecord
  include Leva::Recordable

  # @return [String] The ground truth label for the record
  def ground_truth
    expected_label
  end

  # @return [Hash] A hash of attributes to be displayed in the dataset records index
  def index_attributes
    {
      text: text,
      expected_label: expected_label,
      created_at: created_at.strftime('%Y-%m-%d %H:%M:%S')
    }
  end

  # @return [Hash] A hash of attributes to be displayed in the dataset record show view
  def show_attributes
    {
      text: text,
      expected_label: expected_label,
      created_at: created_at.strftime('%Y-%m-%d %H:%M:%S')
    }
  end

  # @return [Hash] A hash of attributes to be passed to the LLM as context
  def to_llm_context
    {
      text: text,
      expected_label: expected_label,
      created_at: created_at.strftime('%Y-%m-%d %H:%M:%S')
    }
  end
end

dataset = Leva::Dataset.create(name: "Sentiment Analysis Dataset")
dataset.add_record TextContent.create(text: "I love this product!", expected_label: "Positive")
dataset.add_record TextContent.create(text: "Terrible experience", expected_label: "Negative")
dataset.add_record TextContent.create(text: "It's ok", expected_label: "Neutral")
```
### 2. Implementing Runs
Create a run class to handle the execution of your inference logic:
```bash
rails generate leva:runner sentiment
```

```ruby
class SentimentRun < Leva::BaseRun
  def execute(record)
    # Your model execution logic here
    # This could involve calling an API, running a local model, etc.
    # Return the model's output
  end
end
```
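How you fill in `execute` depends entirely on your stack. As a minimal, hypothetical sketch (assuming the `ruby-openai` gem and an `OPENAI_API_KEY` environment variable, neither of which Leva itself requires), a run that classifies sentiment via OpenAI's chat API could look like:

```ruby
# Hypothetical implementation using the ruby-openai gem (not a Leva dependency).
require "openai"

class SentimentRun < Leva::BaseRun
  def execute(record)
    client = OpenAI::Client.new(access_token: ENV.fetch("OPENAI_API_KEY"))

    # Ask the model for a one-word sentiment label for the record's text.
    response = client.chat(
      parameters: {
        model: "gpt-4o-mini",
        messages: [
          { role: "system", content: "Classify the sentiment as Positive, Negative, or Neutral." },
          { role: "user", content: record.text }
        ],
        temperature: 0.0
      }
    )
    response.dig("choices", 0, "message", "content").to_s.strip
  end
end
```

Whatever `execute` returns is treated as the prediction that each eval receives.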
### 3. Implementing Evals

Create one or more eval classes to evaluate the model's output:
```bash
rails generate leva:eval sentiment_accuracy
```

```ruby
class SentimentAccuracyEval < Leva::BaseEval
  def evaluate(prediction, record)
    score = prediction == record.expected_label ? 1.0 : 0.0
    [score, record.expected_label]
  end
end

class SentimentF1Eval < Leva::BaseEval
  def evaluate(prediction, record)
    # Calculate F1 score
    # ...
    [f1_score, record.expected_label]
  end
end
```
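Judging from the examples above, `evaluate` returns a two-element array: the score and the ground truth it was compared against. As a further illustration (hypothetical, not part of the gem), a case-insensitive exact-match eval would follow the same shape:

```ruby
# Hypothetical eval following the same [score, ground_truth] return shape.
class CaseInsensitiveMatchEval < Leva::BaseEval
  def evaluate(prediction, record)
    expected = record.expected_label
    score = prediction.to_s.casecmp?(expected) ? 1.0 : 0.0
    [score, expected]
  end
end
```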
### 4. Running Experiments

You can run experiments with different runs and evals:
```ruby
experiment = Leva::Experiment.create!(name: "Sentiment Analysis", dataset: dataset)

run = SentimentRun.new
evals = [SentimentAccuracyEval.new, SentimentF1Eval.new]

Leva.run_evaluation(experiment: experiment, run: run, evals: evals)
```

### 5. Using Prompts
You can also use prompts with your runs:
```ruby
prompt = Leva::Prompt.create!(
  name: "Sentiment Analysis",
  version: 1,
  system_prompt: "You are an expert at analyzing text and returning the sentiment.",
  user_prompt: "Please analyze the following text and return the sentiment as Positive, Negative, or Neutral.\n\n{{TEXT}}",
  metadata: { model: "gpt-4", temperature: 0.5 }
)

experiment = Leva::Experiment.create!(
  name: "Sentiment Analysis with LLM",
  dataset: dataset,
  prompt: prompt
)

run = SentimentRun.new
evals = [SentimentAccuracyEval.new, SentimentF1Eval.new]

Leva.run_evaluation(experiment: experiment, run: run, evals: evals)
```
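This README doesn't show how the `{{TEXT}}` placeholder gets filled in; one plausible approach (an assumption, not Leva's documented behavior) is simple token substitution from a record's `to_llm_context` hash:

```ruby
# Hypothetical helper: fill {{PLACEHOLDER}} tokens from a context hash.
# Leva's actual interpolation mechanism is not documented in this README.
def render_template(template, context)
  template.gsub(/\{\{(\w+)\}\}/) { context.fetch($1.downcase.to_sym, "") }
end

template = "Analyze the following text and return the sentiment.\n\n{{TEXT}}"
puts render_template(template, text: "I love this product!")
```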
### 6. Analyzing Results

After the experiments are complete, analyze the results:
```ruby
experiment.evaluation_results.group_by(&:evaluator_class).each do |evaluator_class, results|
  average_score = results.sum(&:score) / results.size.to_f
  puts "#{evaluator_class} Average Score: #{average_score}"
end
```

## Configuration
Ensure you set up any required API keys or other configurations in your Rails credentials or environment variables.
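For example, assuming an OpenAI-backed run like the sketch in section 2 (the credential names below are assumptions, not something Leva prescribes), you could read the key from encrypted credentials with an environment-variable fallback:

```ruby
# Hypothetical lookup; adjust the credential path to match your app.
openai_api_key = Rails.application.credentials.dig(:openai, :api_key) || ENV["OPENAI_API_KEY"]
```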
## Leva's Components
### Classes
- `Leva`: Handles the process of running experiments.
- `Leva::BaseRun`: Base class for run implementations.
- `Leva::BaseEval`: Base class for eval implementations.

### Models
- `Leva::Dataset`: Represents a collection of data to be evaluated.
- `Leva::DatasetRecord`: Represents individual records within a dataset.
- `Leva::Experiment`: Represents a single run of an evaluation on a dataset.
- `Leva::RunnerResult`: Stores the results of each run execution.
- `Leva::EvaluationResult`: Stores the results of each evaluation.
- `Leva::Prompt`: Represents a prompt for an LLM.
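Because these are ordinary ActiveRecord models, stored results can also be inspected directly in the Rails console. A small sketch, reusing the `evaluation_results` association and attributes shown in section 6:

```ruby
# Inspect stored scores per evaluator for one experiment.
experiment = Leva::Experiment.find_by!(name: "Sentiment Analysis")
experiment.evaluation_results.each do |result|
  puts "#{result.evaluator_class}: #{result.score}"
end
```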
## Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/kieranklaassen/leva.
## License
The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
## Roadmap
- [ ] Parallelize evaluation