https://github.com/whittle/kosher
Kosher updates BDD for the AI era.
- Host: GitHub
- URL: https://github.com/whittle/kosher
- Owner: whittle
- Created: 2026-02-11T19:33:46.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-02-12T05:32:20.000Z (3 months ago)
- Last Synced: 2026-02-12T14:13:11.490Z (3 months ago)
- Topics: bdd, behavior-driven-development, inference, llm, mcp, mcp-server, slm, test-automation, testing, testing-framework, testing-tools, user-stories
- Language: Python
- Homepage:
- Size: 68.4 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Kosher
Kosher is a Behavior-Driven Development (BDD) tool that reads Gherkin feature
files and executes user stories against web applications using an AI inference
engine and Playwright browser automation.
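For example, given a feature file along the lines of the one embedded below, Kosher walks the scenario step by step and drives a real browser. The feature wording here is a generic illustration, not a file from this repository:

```python
# A hypothetical feature file, embedded as a string for illustration.
# The step wording is invented; it is not taken from this repository.
LOGIN_FEATURE = """\
Feature: Login
  Scenario: Successful login
    Given I am on the login page
    When I type "user@example.com" into the email field
    And I type "hunter2" into the password field
    And I click the "Sign in" button
    Then I should see the dashboard
    And I should see the text "Welcome"
"""

# One way to pull out the executable steps: keep lines that start with
# a Gherkin step keyword.
KEYWORDS = ("Given", "When", "Then", "And", "But")
steps = [line.strip() for line in LOGIN_FEATURE.splitlines()
         if line.strip().startswith(KEYWORDS)]
print(steps)  # six steps, each sent to the model in turn
```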
## Requirements
Requires Ollama and the playwright-mcp server running locally. Tested with Python 3.14.
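Once Ollama is up (see Getting Started below), you can sanity-check it directly against its local REST API. The sketch below uses Ollama's public `/api/chat` endpoint and the model pulled in the next section; the system prompt wording is invented, not the repo's:

```python
import json
import urllib.request

# One inference round-trip against a local Ollama instance. This uses
# Ollama's standard REST API (default port 11434), not code from Kosher.
payload = {
    "model": "qwen2.5-coder:14b-instruct-q4_K_M",
    "messages": [
        {"role": "system", "content": "Translate Gherkin steps into browser tool calls."},
        {"role": "user", "content": "Given I am on the login page"},
    ],
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())
print(reply["message"]["content"])  # the model's proposed tool call, as text
```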
## Getting Started
```bash
# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# 2. Pull the model
ollama pull qwen2.5-coder:14b-instruct-q4_K_M
# 3. Start Playwright MCP server (separate terminal)
npx @playwright/mcp@latest
# 4. Clone and set up project
git clone https://github.com/whittle/kosher
cd kosher
python -m venv venv
source venv/bin/activate
pip install -e .
# 5. Run proof of concept
python poc/main.py
```
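For reference, driving the Playwright MCP server from Python looks roughly like the sketch below, which uses the official `mcp` SDK. This is not the repo's code; the tool names (`browser_navigate`, `browser_snapshot`) are the ones the reliability notes below refer to:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch the Playwright MCP server as a subprocess over stdio.
    server = StdioServerParameters(command="npx", args=["@playwright/mcp@latest"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Navigate, then take an accessibility snapshot; the snapshot
            # text contains the element refs (e1, e2, ...) the model uses.
            await session.call_tool("browser_navigate", {"url": "https://example.com"})
            snapshot = await session.call_tool("browser_snapshot", {})
            print(snapshot.content[0].text)

asyncio.run(main())
```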
## Reliability Summary: qwen2.5-coder:14b for Gherkin → Playwright
**Success Rate:** ~90% on a 6-step login flow (20 runs)
**What Works Well:**
- Correctly interprets Given/When/Then semantics
- Maps steps to appropriate browser tools (navigate, click, type, snapshot, wait_for)
- Extracts element refs from snapshots and uses them correctly (most of the time)
- Learns new patterns from system prompt updates (adopted `browser_wait_for` after instruction; an example of that kind of prompt rule follows this list)
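The rule that taught the wait pattern would have been something of this shape; the exact wording in the repo may differ:

```python
# Hypothetical system-prompt rule of the kind that taught the model to
# use browser_wait_for; illustrative wording only.
WAIT_RULE = (
    "After any navigation or click that changes the page, emit "
    '{"tool": "browser_wait_for", "args": {"text": "<expected text>"}} '
    "and wait for the result before taking the next snapshot."
)
```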
**Failure Modes:**
1. Placeholder refs - sometimes outputs a placeholder instead of an actual ref like `e5`, suggesting the model "knows" what to do but doesn't execute it properly
2. Skipped tool execution - occasionally outputs JSON + "DONE" in one response without waiting for tool results
3. No native tool calling - always outputs JSON in text content, which must be parsed with `parse_tool_call_from_text()` (a sketch follows this list)
4. Instruction drift - multi-step instructions (navigate → snapshot → confirm) are sometimes only partially followed
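The function name `parse_tool_call_from_text()` is from the repo; the body below is a plausible reconstruction of the kind of extraction involved, not the actual implementation:

```python
import json
import re

def parse_tool_call_from_text(text: str) -> dict | None:
    """Extract the first JSON object from the model's free-text reply.

    A guessed reconstruction: the real implementation in the repo may
    differ. Assumes tool calls look like
    {"tool": "browser_click", "args": {"ref": "e5"}}.
    """
    # Grab the outermost {...} span; models often wrap it in prose.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```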
**Implications:**
- Viable for a PoC and demos
- Production use would need retry logic, validation, and possibly a more capable model (a minimal retry sketch follows this list)
- System prompt engineering is effective for teaching new patterns
- The ~10% failure rate comes from LLM output variability and is not fixable by prompting alone
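As a starting point for that retry logic, a wrapper of roughly this shape (all names hypothetical, not the repo's API) could re-ask the model until it produces a well-formed tool call:

```python
from typing import Callable, Optional

def run_step_with_retry(ask_model: Callable[[str], Optional[dict]],
                        step: str, max_attempts: int = 3) -> dict:
    """Re-ask the model until it yields a well-formed tool call.

    `ask_model` is a hypothetical callable: it sends one Gherkin step
    to the model and returns the parsed tool call, or None if the reply
    could not be parsed.
    """
    for _ in range(max_attempts):
        call = ask_model(step)
        # Basic validation: a real tool name and an args dict, guarding
        # against the placeholder-ref and skipped-execution failure modes.
        if isinstance(call, dict) and call.get("tool") and isinstance(call.get("args"), dict):
            return call
    raise RuntimeError(f"No valid tool call after {max_attempts} attempts: {step!r}")
```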