https://github.com/devrev/devrev-search-bench
Semantic search over DevRev knowledge base using OpenAI embeddings and FAISS
- Host: GitHub
- URL: https://github.com/devrev/devrev-search-bench
- Owner: devrev
- Created: 2026-02-10T19:19:34.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2026-03-11T11:35:26.000Z (19 days ago)
- Last Synced: 2026-03-11T17:56:21.169Z (19 days ago)
- Language: Jupyter Notebook
- Size: 66.4 KB
- Stars: 5
- Watchers: 0
- Forks: 3
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Codeowners: .github/CODEOWNERS
# DevRev Search — Semantic Search over DevRev Knowledge Base
Semantic search system for the [DevRev Search](https://huggingface.co/datasets/devrev/search) dataset. Embeds ~65K knowledge base articles using either OpenAI `text-embedding-3-small` or Ollama `qwen3-embedding:0.6b`, indexes them with FAISS, and retrieves relevant documents for test queries.
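The pipeline reduces to three steps: embed the documents, index the vectors, and embed each query to score it against the index. A minimal sketch of that idea using brute-force cosine similarity with NumPy (the repo uses FAISS, which accelerates exactly this inner-product search; the tiny 3-d vectors here stand in for real embeddings):

```python
import numpy as np

def normalize(v):
    # L2-normalize so dot product equals cosine similarity
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy "document embeddings" standing in for OpenAI/Ollama outputs
doc_embeddings = normalize(np.array([
    [0.9, 0.1, 0.0],   # doc 0
    [0.0, 1.0, 0.0],   # doc 1
    [0.1, 0.1, 0.9],   # doc 2
]))

def search(query_embedding, k=2):
    # Same scoring a flat inner-product index performs over normalized vectors
    scores = doc_embeddings @ normalize(query_embedding)
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]

results = search(np.array([1.0, 0.0, 0.1]))
print(results)  # doc 0 ranks first: it points in nearly the same direction
```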
## Quick Start
### 1. Clone & Install
```bash
git clone https://github.com/devrev/devrev-search-bench.git
cd devrev-search-bench
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
### 2. Choose Embedding Provider
The notebook supports two providers via `EMBEDDING_PROVIDER` in Section 5:
- `openai` (default)
- `ollama` (local open-source model)
#### Option A: OpenAI
```bash
export OPENAI_API_KEY="your-openai-api-key"
```
#### Option B: Ollama (local)
```bash
# Install Ollama first: https://ollama.com/download
ollama pull qwen3-embedding:0.6b
```
Then set `EMBEDDING_PROVIDER = "ollama"` in the notebook config cell.
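One way such a config flag can work (an illustrative sketch only; the `embed_with_*` functions below are stubs, not the notebook's actual code) is a simple dispatch table keyed on the provider name:

```python
# Illustrative sketch of dispatching on an EMBEDDING_PROVIDER flag.
EMBEDDING_PROVIDER = "ollama"  # or "openai"

def embed_with_openai(texts):
    # Stub: a real implementation would call the OpenAI embeddings API
    # with model "text-embedding-3-small" here.
    raise NotImplementedError

def embed_with_ollama(texts):
    # Stub: a real implementation would call a local Ollama server
    # with model "qwen3-embedding:0.6b" here.
    raise NotImplementedError

PROVIDERS = {
    "openai": embed_with_openai,
    "ollama": embed_with_ollama,
}

def embed(texts):
    try:
        fn = PROVIDERS[EMBEDDING_PROVIDER]
    except KeyError:
        raise ValueError(f"Unknown EMBEDDING_PROVIDER: {EMBEDDING_PROVIDER!r}")
    return fn(texts)
```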
### 3. Run the Notebook
Open `devrev_search.ipynb` in Jupyter and run cells sequentially:
```bash
jupyter notebook devrev_search.ipynb
```
## Project Structure
```
devrev-search/
├── devrev_search.ipynb # Main notebook: embed, index, search, evaluate
├── download_datasets.py # Standalone script to download datasets as parquet
├── requirements.txt # Python dependencies
├── test_queries_results.json # Search results for test queries
└── README.md
```
## What the Notebook Does
| Section | Description |
| ------- | ------------------------------------------------------------------------------------- |
| **1–4** | Load & explore the 3 dataset splits (annotated queries, test queries, knowledge base) |
| **5** | Generate embeddings (OpenAI or Ollama) and build a FAISS index |
| **6** | Interactive search — query the knowledge base |
| **7** | Run evaluation on all test queries and save results in annotated-queries format |
| **8** | Load a previously saved index (skip re-embedding) |
## Dataset
The [`devrev/search`](https://huggingface.co/datasets/devrev/search) dataset from Hugging Face contains:
- **`knowledge_base`** — ~65K article chunks from DevRev support docs
- **`annotated_queries`** — Queries paired with golden retrievals (train)
- **`test_queries`** — Held-out queries for evaluation
## Output Format
Results are saved in the same format as `annotated_queries`:
```json
{
"query_id": "a97f93d2-...",
"query": "end customer organization name not appearing...",
"retrievals": [
{
"id": "ART-1234_KNOWLEDGE_NODE-5",
"text": "...",
"title": "..."
}
]
}
```
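Field names follow the sample above. A stdlib-only sketch that assembles one result entry in this schema and serializes it (the query id and retrieval contents here are placeholders, not real dataset values):

```python
import json

def make_result(query_id, query, retrieved_docs):
    # Mirror the annotated_queries schema: one entry per test query,
    # each retrieval carrying the chunk id, text, and title.
    return {
        "query_id": query_id,
        "query": query,
        "retrievals": [
            {"id": d["id"], "text": d["text"], "title": d["title"]}
            for d in retrieved_docs
        ],
    }

entry = make_result(
    "example-query-id",  # placeholder
    "end customer organization name not appearing",
    [{"id": "ART-1234_KNOWLEDGE_NODE-5", "text": "...", "title": "..."}],
)
print(json.dumps(entry, indent=2))
```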
## Cost Estimate
If using OpenAI (`text-embedding-3-small`), embedding ~65K documents costs approximately **$0.50–$1.00** (at $0.02 per 1M tokens).
If using Ollama (`qwen3-embedding:0.6b`), there is no API cost (runs locally).
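The estimate checks out with back-of-envelope arithmetic, assuming a few hundred tokens per chunk (an assumption; the README does not state average chunk length):

```python
# Back-of-envelope cost check for text-embedding-3-small at $0.02 / 1M tokens.
# The tokens-per-chunk figures are assumptions for illustration.
docs = 65_000
price_per_million_tokens = 0.02

for avg_tokens in (300, 500, 800):
    total_tokens = docs * avg_tokens
    cost = total_tokens / 1_000_000 * price_per_million_tokens
    print(f"{avg_tokens} tokens/chunk -> {total_tokens / 1e6:.1f}M tokens, ~${cost:.2f}")
```

At 300 to 800 tokens per chunk the total lands in roughly the $0.40 to $1.05 range, consistent with the $0.50 to $1.00 estimate above.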
## License
MIT