{"id":49760842,"url":"https://github.com/homemade-software-inc/completion-kit","last_synced_at":"2026-07-09T07:00:27.994Z","repository":{"id":351835614,"uuid":"1176488032","full_name":"homemade-software-inc/completion-kit","owner":"homemade-software-inc","description":"Your prompts need tests too. Run prompts against real datasets, score outputs with LLM judges, version everything, and compare runs to see what got better.","archived":false,"fork":false,"pushed_at":"2026-07-04T17:18:24.000Z","size":56583,"stargazers_count":1,"open_issues_count":9,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-07-04T19:06:14.920Z","etag":null,"topics":["anthropic","evaluation-framework","evaluation-metrics","llm","llm-as-judge","llm-eval","llm-evaluation","llm-evaluation-framework","llm-evaluation-metrics","llmops","mcp","ollama","openai","prompt-engineering","prompt-testing","rails","rails-engine","ruby","ruby-on-rails"],"latest_commit_sha":null,"homepage":"https://completionkit.com","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/homemade-software-inc.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-09T04:19:29.000Z","updated_at":"2026-07-04T17:18:26.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/homemade-software-inc/completion-kit","commit_stats":null,"previous_names":["homemade-software-inc/completion-kit"],"tags_count":93,"template":false,"template_full_name":null,"purl":"pkg:github/homemade-software-inc/completion-kit","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/homemade-software-inc%2Fcompletion-kit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/homemade-software-inc%2Fcompletion-kit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/homemade-software-inc%2Fcompletion-kit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/homemade-software-inc%2Fcompletion-kit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/homemade-software-inc","download_url":"https://codeload.github.com/homemade-software-inc/completion-kit/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/homemade-software-inc%2Fcompletion-kit/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":35288823,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-07-09T02:00:07.329Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["anthropic","evaluation-framework","evaluation-metrics","llm","llm-as-judge","llm-eval","llm-evaluation","llm-evaluation-framework","llm-evaluation-metrics","llmops","mcp","ollama","openai","prompt-engineering","prompt-testing","rails","rails-engine","ruby","ruby-on-rails"],"created_at":"2026-05-11T06:16:54.410Z","updated_at":"2026-07-09T07:00:27.977Z","avatar_url":"https://github.com/homemade-software-inc.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/homemade-software-inc/completion-kit/main/docs/logo.png\" alt=\"CompletionKit\" width=\"360\" /\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://badge.fury.io/rb/completion-kit\"\u003e\u003cimg src=\"https://badge.fury.io/rb/completion-kit.svg\" alt=\"Gem Version\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/homemade-software-inc/completion-kit/actions/workflows/ci.yml\"\u003e\u003cimg src=\"https://github.com/homemade-software-inc/completion-kit/actions/workflows/ci.yml/badge.svg\" alt=\"CI\" /\u003e\u003c/a\u003e\n  \u003cimg src=\"https://img.shields.io/badge/coverage-100%25-brightgreen\" alt=\"coverage\" /\u003e\n\u003c/p\u003e\n\nYour prompts need tests too.\n\nRun every prompt against real data. Score each output with an LLM judge against criteria you define. Change anything: the prompt, the model, the temperature, the dataset. Re-run and see exactly what got better and what broke. When the scores tell you something's off, CompletionKit suggests an improved prompt based on the judge's actual feedback on your runs. You inspect the diff, apply it as a new version, and verify the improvement.\n\nIt's the difference between \"this prompt seems to work\" and \"this prompt scores 4.3 out of 5 across 200 inputs, up from 3.8 last version.\"\n\n**[Start on completionkit.com →](https://completionkit.com)** | **[RubyGems](https://rubygems.org/gems/completion-kit)**\n\n\u003e **Just want to use it?** [CompletionKit Cloud](https://completionkit.com) is the same engine, fully hosted — zero install, no Rails ops, plans at [completionkit.com/pricing](https://completionkit.com/pricing).\n\n![The CompletionKit dashboard — workspace totals, run activity over the last 14 days, the worst-scoring metric, version-over-version score changes, and recent runs](https://raw.githubusercontent.com/homemade-software-inc/completion-kit/main/docs/screenshots/dashboard.png)\n\n## Three ways to run it\n\nSame engine, same UI, same REST API and MCP server — pick the deployment that fits. The first two are stack-agnostic: you run CompletionKit as a product and talk to it over HTTP and MCP, whatever language your own app is written in. The third is for teams already building on Rails.\n\n### 1. Hosted — [completionkit.com](https://completionkit.com) (recommended)\n\nThe fastest path. Sign up and you're running on the same engine you'd self-host, without touching a Rails app. No `db:migrate`, no Puma, no Solid Queue, no provider key management — multi-tenant workspaces, your team logs in, you go. Plans at [completionkit.com/pricing](https://completionkit.com/pricing).\n\n### 2. Self-hosted — the bundled standalone app\n\nRun it on your own infra as a self-contained product. There's nothing to integrate and no Ruby to write — once it's up, you drive everything through the web UI, the REST API, and the MCP server, from whatever stack your own app is built in. It needs Postgres and a host that can run the app (Fly, Render, Heroku, Docker, …).\n\n```bash\ngit clone https://github.com/homemade-software-inc/completion-kit.git\ncd completion-kit/standalone\nbundle install\nbin/rails completion_kit:install:migrations\nbin/rails db:migrate\n```\n\nRun **both** a web server and a Solid Queue worker. In two terminals:\n\n```bash\nbin/rails server\n```\n\n```bash\nbin/jobs\n```\n\nOr with [foreman](https://github.com/ddollar/foreman) in one terminal: `foreman start -f Procfile.dev`.\n\nVisit `http://localhost:3000`. Add a provider credential (Settings), create a prompt, upload a CSV dataset, and run it. See [Deploying self-hosted](#deploying-self-hosted) for the production-env setup.\n\n### 3. Rails engine — mount into your existing Rails app\n\n```ruby\ngem \"completion-kit\"\n```\n\n```bash\nbin/rails generate completion_kit:install\nbin/rails db:migrate\n```\n\nThe engine mounts at `/completion_kit`. Generate / judge flows enqueue Active Job jobs (`CompletionKit::GenerateRowJob`, `CompletionKit::JudgeReviewJob`, `CompletionKit::RunCompletionCheckJob`), so your host app needs an Active Job adapter that actually processes them — Solid Queue, Sidekiq, GoodJob, etc. The `:async` adapter is **not** suitable for production: it runs jobs in the web Puma's thread pool with no durability and no retry, and a long LLM call will block request handling.\n\n**Host-app layout integration.** If your host app overrides the engine layout (e.g. `layout \"application\"` on engine controllers, or rendering engine views inside your own shell), include both the engine's stylesheet and JavaScript in that layout:\n\n```erb\n\u003c%= stylesheet_link_tag \"completion_kit/application\", media: \"all\" %\u003e\n\u003c%= javascript_include_tag \"completion_kit/application\", defer: true %\u003e\n```\n\nWithout the JavaScript include, in-page behaviours silently fail: live tag-breadcrumb updates, relative-time ticking, CSV row hover-expand, model-refresh progress, focus-first-error, and local-time formatting.\n\n## Providers\n\nCompletionKit discovers available models from each provider's API automatically.\n\n| Provider | Env vars | What it covers |\n|----------|----------|----------------|\n| **OpenAI** | `OPENAI_API_KEY` | GPT-5, GPT-4.1, GPT-4o, etc. |\n| **Anthropic** | `ANTHROPIC_API_KEY` | Claude Opus, Sonnet, Haiku |\n| **Ollama / local endpoint** | `OLLAMA_API_ENDPOINT` (default: `http://localhost:11434/v1`) | Any model you've `ollama pull`-ed, or any OpenAI-compatible local server (vLLM, LM Studio, llama.cpp) |\n| **OpenRouter** | `OPENROUTER_API_KEY` | 100+ models from 30+ providers through one API key |\n\nSet these as environment variables or configure them in the generated initializer. You can also add provider credentials through the web UI under Settings.\n\n### Encryption\n\nProvider API keys are encrypted at rest using [Active Record encryption](https://guides.rubyonrails.org/active_record_encryption.html). You need three encryption keys configured before the app will boot in production.\n\nGenerate them:\n\n```bash\nbin/rails db:encryption:init\n```\n\nThen set them as environment variables:\n\n```bash\nCOMPLETION_KIT_ENCRYPTION_PRIMARY_KEY=\u003cgenerated value\u003e\nCOMPLETION_KIT_ENCRYPTION_DETERMINISTIC_KEY=\u003cgenerated value\u003e\nCOMPLETION_KIT_ENCRYPTION_KEY_DERIVATION_SALT=\u003cgenerated value\u003e\n```\n\nOr add them to `config/credentials.yml.enc` under `active_record_encryption`. In development, the standalone app uses built-in fallback values so you can skip this step locally.\n\n## Authentication\n\nCompletionKit requires authentication in any deployed environment. In development and test, routes are open by default (with a log warning); every other environment returns 403 until auth is configured.\n\n### Basic Auth (recommended for simple setups)\n\n```ruby\nCompletionKit.configure do |c|\n  c.username = \"admin\"\n  c.password = ENV[\"COMPLETION_KIT_PASSWORD\"]\nend\n```\n\n### Custom Auth (Devise, etc.)\n\n```ruby\nCompletionKit.configure do |c|\n  c.auth_strategy = -\u003e(controller) { controller.authenticate_user! }\nend\n```\n\nOnly one mode can be active.\n\n## Rate limiting\n\nThe REST API, the MCP endpoint, and the web UI are rate limited per IP, per minute. The defaults are generous; tune them in the initializer:\n\n```ruby\nCompletionKit.configure do |c|\n  c.api_rate_limit = 120  # REST API + MCP, requests per minute (default 120)\n  c.web_rate_limit = 300  # web UI, requests per minute (default 300)\nend\n```\n\nLimiting uses `Rails.cache`. A shared cache store (Solid Cache, Redis) throttles accurately across multiple app instances; a per-process store still throttles each instance independently.\n\n## How it works\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/homemade-software-inc/completion-kit/main/docs/diagrams/workflow.png\" alt=\"CompletionKit workflow: a prompt and a dataset feed a run against a model, an LLM judge scores each output on your rubric, low scores drive an AI-suggested rewrite, and the new prompt version re-runs so you can compare\" width=\"820\" /\u003e\n\u003c/p\u003e\n\nIt's a loop. Each pass leaves you with a score you can compare against the last one.\n\n1. **Create a prompt** with `{{variable}}` placeholders\n2. **Upload a dataset.** A CSV where column headers match the variable names.\n3. **Run it** against a model and score outputs with an LLM judge against criteria you define.\n4. **Iterate.** Change the prompt, the model, the temperature, or the dataset and re-run. CompletionKit versions your prompts so you can always compare against previous results.\n5. **Get suggestions.** When scores drop, ask CompletionKit for an AI-generated improvement. The suggestion is based on the judge's actual per-response feedback, not generic prompt-engineering advice. Inspect the diff and apply it as a new version.\n\n## Concepts\n\n- **Prompt.** A versioned template with `{{variable}}` placeholders. Editing a prompt that's already been run creates a new version, so earlier results stay reproducible.\n- **Dataset.** A CSV of real inputs. Each row becomes one test case.\n- **Run.** One execution of a prompt against a dataset. Captures every input (model, temperature, metrics) and stores all outputs and scores.\n- **Response.** The model's output for one dataset row, with reviews attached.\n- **Metric.** An evaluation dimension with a name, instruction, evaluation steps, and a 1-5 star scoring scale. The LLM judge uses this to score each response.\n- **Metric Group.** A reusable group of metrics you can apply to a run as a set.\n- **Tag.** A domain label you can attach to prompts, runs, metrics, and datasets. Auto-assigned from a 10-color palette. Filter any index page by tag (`?tag[]=...`).\n- **Provider Credential.** An API key for a model provider. Encrypted at rest, never returned through the API.\n\n## REST API\n\nEvery resource is accessible via a bearer-token JSON API:\n\n```ruby\nCompletionKit.configure { |c| c.api_token = ENV[\"COMPLETION_KIT_API_TOKEN\"] }\n```\n\n```bash\ncurl -H \"Authorization: Bearer $TOKEN\" \\\n  http://localhost:3000/completion_kit/api/v1/prompts\n```\n\nVisit `/completion_kit/api_reference` in your running app for per-endpoint docs with copy-to-clipboard curl examples.\n\n## MCP server\n\nCompletionKit runs a [Model Context Protocol](https://modelcontextprotocol.io) server at `/completion_kit/mcp`, exposing every resource as tools that MCP-aware clients (Claude Code, Cursor, etc.) can drive directly:\n\n```json\n{\n  \"mcpServers\": {\n    \"completion-kit\": {\n      \"url\": \"https://your-app.com/completion_kit/mcp\",\n      \"headers\": { \"Authorization\": \"Bearer YOUR_TOKEN\" }\n    }\n  }\n}\n```\n\nThe in-app API reference page has install snippets you can copy straight into your MCP client config.\n\n## Deploying self-hosted\n\nAny Rails-friendly host works (Fly, Heroku, Render, Docker, etc.). Point it at a Postgres instance via `DATABASE_URL`, set your provider env vars, and run `cd standalone \u0026\u0026 bin/rails db:migrate` on each deploy.\n\n| Variable | Purpose | Default |\n|----------|---------|---------|\n| `COMPLETION_KIT_API_TOKEN` | Bearer token for REST API and MCP | (none, API disabled) |\n| `COMPLETION_KIT_USERNAME` | Web UI login username | `admin` |\n| `COMPLETION_KIT_PASSWORD` | Web UI login password | (none, open in dev) |\n\nYou also need the three `COMPLETION_KIT_ENCRYPTION_*` keys from the [Encryption](#encryption) section above.\n\nWhen the gem ships a new migration, install it locally and commit before pushing:\n\n```bash\ncd standalone\nbin/rails completion_kit:install:migrations\nbin/rails db:migrate\ngit add db/migrate/ \u0026\u0026 git commit -m \"install new engine migration\"\n```\n\n### Docker\n\nThe standalone app ships a `Dockerfile`, so you can self-host it without a Ruby toolchain on the host. Build with the **repository root** as the context — the app depends on the engine source alongside it:\n\n```bash\ndocker build -f standalone/Dockerfile -t completion-kit .\n```\n\nCompletionKit needs a Rails secret (`SECRET_KEY_BASE`) and three Active Record encryption keys. With Docker there's no Rails toolchain on the host to run `bin/rails db:encryption:init`, so generate them with `openssl`. Generate them **once** and keep them stable — if the encryption keys change, provider credentials already stored in the database can no longer be decrypted. Write everything to an env file:\n\n```bash\ncat \u003e completion-kit.env \u003c\u003cEOF\nDATABASE_URL=postgres://user:pass@host/completionkit\nSECRET_KEY_BASE=$(openssl rand -hex 64)\nCOMPLETION_KIT_ENCRYPTION_PRIMARY_KEY=$(openssl rand -hex 32)\nCOMPLETION_KIT_ENCRYPTION_DETERMINISTIC_KEY=$(openssl rand -hex 32)\nCOMPLETION_KIT_ENCRYPTION_KEY_DERIVATION_SALT=$(openssl rand -hex 32)\nEOF\n```\n\n`openssl rand` runs as the file is written, so each line gets a real random value. Keep `completion-kit.env` out of version control and back it up somewhere safe.\n\nRun the web process and a job worker from the same image, both pointed at that file:\n\n```bash\ndocker run -d -p 3000:3000 --env-file completion-kit.env completion-kit\ndocker run -d --env-file completion-kit.env completion-kit ./bin/jobs\n```\n\nBoth processes must share the same `SECRET_KEY_BASE` and encryption keys — the single env file guarantees that. The web container runs `db:prepare` on boot, so migrations apply on first start and on every deploy.\n\n## Multi-tenant host apps (advanced)\n\nFor hosts that mount CompletionKit in a multi-tenant app, two optional hooks scope engine records per tenant without forking the engine:\n\n```ruby\nCompletionKit.configure do |config|\n  config.tenant_scope = -\u003e {\n    org = Current.organization\u0026.id\n    org ? where(organization_id: org) : where(\"1=0\")\n  }\n  config.tenant_scope_columns = [:organization_id]\nend\n```\n\n`tenant_scope` runs as each engine model's `default_scope` (use `unscoped` to bypass). `tenant_scope_columns` is appended to every engine uniqueness validation. Adding the tenant columns and composite unique indexes lives in your host migrations. Both defaults (`nil`, `[]`) are no-ops.\n\nOne hook lets a host apply run-history retention everywhere the engine lists, counts, or traverses runs, without overriding controllers or views:\n\n```ruby\nCompletionKit.configure do |config|\n  config.runs_display_scope = -\u003e { where(created_at: 90.days.ago..) }\n  config.runs_display_footer_partial = \"runs/retention_notice\"\nend\n```\n\n`runs_display_scope` is a callable evaluated against a `Run` relation, in the same bare-`where` style as `tenant_scope` (it runs via `instance_exec`, so write it zero-arg with the relation as `self`, like a Rails `scope` lambda). It must return a relation: a callable that returns `nil` or anything non-chainable raises when a list renders. The engine applies it through `Run.display_scoped` at every run list and count it owns (the runs index, prompt and dataset show pages, the compare picker, new-run tag defaults, the v1 API index and its `X-Total-Count`, the MCP `runs_list` tool, the dashboard and API reference recent-runs panels, the dashboard \"Runs\" stat-card count, and provider-credential usage stats) and through `Run.visible_run_ids` for child records that traverse runs (the metric trust-panel sample and the agreement examples shown on a metric page). Use it for list-only retention rather than a global `default_scope`, which would null `Run` associations everywhere they are traversed.\n\nDeliberately exempt, because they must still see every run: id-addressed single-run lookups (`runs#show`, the MCP `runs_get` tool, the v1 API show), delete-confirmation cascade counts, the auto-generated run-name counter, the dashboard's stat-card gate (it keys off the unscoped run total so a long-tenured workspace whose runs have all aged past retention still sees its activity cards rather than a blank dashboard), and the judge few-shot seeding that learns from corrected examples even on hidden runs.\n\n`runs_display_footer_partial` names a partial rendered below the runs list on the dashboard, the index, and the prompt and dataset show pages; it receives the shown runs as a `runs` local. Use it for a notice like \"older runs are hidden, upgrade to see them\" — your host owns the retention rule in `runs_display_scope`, so it computes the hidden count itself. Both default to `nil` (no-ops), leaving standalone behaviour unchanged.\n\n## Contributing\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, testing, and pull request guidelines.\n\n## License\n\nCompletionKit 0.3.0 and later are licensed under the [Business Source License 1.1](LICENSE). You may use CompletionKit freely for any purpose, including production, except to offer it (or a derivative) to third parties as a hosted or managed service whose primary value is CompletionKit itself. Three years after each release, that version automatically re-licenses to GPL-3.\n\nCompletionKit 0.2.x and earlier remain available under the [MIT License](https://github.com/homemade-software-inc/completion-kit/blob/v0.2.0/MIT-LICENSE).\n\nFor alternative licensing, contact hello@homemade.software.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhomemade-software-inc%2Fcompletion-kit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhomemade-software-inc%2Fcompletion-kit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhomemade-software-inc%2Fcompletion-kit/lists"}