{"id":47308798,"url":"https://github.com/devrev/devrev-search-bench","last_synced_at":"2026-03-17T09:50:33.902Z","repository":{"id":343698088,"uuid":"1154784888","full_name":"devrev/devrev-search-bench","owner":"devrev","description":"Semantic search over DevRev knowledge base using OpenAI embeddings and FAISS","archived":false,"fork":false,"pushed_at":"2026-03-11T11:35:26.000Z","size":68,"stargazers_count":5,"open_issues_count":2,"forks_count":3,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-11T17:56:21.169Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/devrev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-10T19:19:34.000Z","updated_at":"2026-03-11T14:58:18.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/devrev/devrev-search-bench","commit_stats":null,"previous_names":["devrev/devrev-search-bench"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/devrev/devrev-search-bench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devrev%2Fdevrev-search-bench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devrev%2Fdevrev-search-bench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devrev%2Fdevrev-search-bench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devrev%2Fdevrev-search-bench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/devrev","download_url":"https://codeload.github.com/devrev/devrev-search-bench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devrev%2Fdevrev-search-bench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30621070,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-17T08:10:05.930Z","status":"ssl_error","status_checked_at":"2026-03-17T08:10:04.972Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-03-17T09:50:33.320Z","updated_at":"2026-03-17T09:50:33.894Z","avatar_url":"https://github.com/devrev.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DevRev Search — Semantic Search over DevRev Knowledge Base\n\nSemantic search system for the [DevRev Search](https://huggingface.co/datasets/devrev/search) dataset. Embeds ~65K knowledge base articles using either OpenAI `text-embedding-3-small` or Ollama `qwen3-embedding:0.6b`, indexes them with FAISS, and retrieves relevant documents for test queries.\n\n## Quick Start\n\n### 1. Clone \u0026 Install\n\n```bash\ngit clone https://github.com/\u003cyour-username\u003e/devrev-search.git\ncd devrev-search\npython -m venv .venv\nsource .venv/bin/activate\npip install -r requirements.txt\n```\n\n### 2. Choose Embedding Provider\n\nThe notebook supports two providers via `EMBEDDING_PROVIDER` in Section 5:\n- `openai` (default)\n- `ollama` (local open-source model)\n\n#### Option A: OpenAI\n\n```bash\nexport OPENAI_API_KEY=\"your-openai-api-key\"\n```\n\n#### Option B: Ollama (local)\n\n```bash\n# Install Ollama first: https://ollama.com/download\nollama pull qwen3-embedding:0.6b\n```\n\nThen set `EMBEDDING_PROVIDER = \"ollama\"` in the notebook config cell.\n\n### 3. Run the Notebook\n\nOpen `devrev_search.ipynb` in Jupyter and run cells sequentially:\n\n```bash\njupyter notebook devrev_search.ipynb\n```\n\n## Project Structure\n\n```\ndevrev-search/\n├── devrev_search.ipynb      # Main notebook: embed, index, search, evaluate\n├── download_datasets.py     # Standalone script to download datasets as parquet\n├── requirements.txt         # Python dependencies\n├── test_queries_results.json # Search results for test queries\n└── README.md\n```\n\n## What the Notebook Does\n\n| Section | Description                                                                           |\n| ------- | ------------------------------------------------------------------------------------- |\n| **1–4** | Load \u0026 explore the 3 dataset splits (annotated queries, test queries, knowledge base) |\n| **5**   | Generate embeddings (OpenAI or Ollama) and build a FAISS index                         |\n| **6**   | Interactive search — query the knowledge base                                         |\n| **7**   | Run evaluation on all test queries and save results in annotated-queries format       |\n| **8**   | Load a previously saved index (skip re-embedding)                                     |\n\n## Dataset\n\nThe [`devrev/search`](https://huggingface.co/datasets/devrev/search) dataset from Hugging Face contains:\n\n- **`knowledge_base`** — ~65K article chunks from DevRev support docs\n- **`annotated_queries`** — Queries paired with golden retrievals (train)\n- **`test_queries`** — Held-out queries for evaluation\n\n## Output Format\n\nResults are saved in the same format as `annotated_queries`:\n\n```json\n{\n  \"query_id\": \"a97f93d2-...\",\n  \"query\": \"end customer organization name not appearing...\",\n  \"retrievals\": [\n    {\n      \"id\": \"ART-1234_KNOWLEDGE_NODE-5\",\n      \"text\": \"...\",\n      \"title\": \"...\"\n    }\n  ]\n}\n```\n\n## Cost Estimate\n\nIf using OpenAI (`text-embedding-3-small`), embedding ~65K documents costs approximately **$0.50–$1.00** (at $0.02 per 1M tokens).  \nIf using Ollama (`qwen3-embedding:0.6b`), there is no API cost (runs locally).\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdevrev%2Fdevrev-search-bench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdevrev%2Fdevrev-search-bench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdevrev%2Fdevrev-search-bench/lists"}