{"id":31832118,"url":"https://github.com/codestrate/semantic_search_engine","last_synced_at":"2025-10-11T22:29:50.059Z","repository":{"id":315905915,"uuid":"1060911984","full_name":"CodeStrate/semantic_search_engine","owner":"CodeStrate","description":null,"archived":false,"fork":false,"pushed_at":"2025-09-29T18:47:01.000Z","size":52027,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-29T20:45:28.750Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CodeStrate.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-20T21:05:40.000Z","updated_at":"2025-09-29T18:47:05.000Z","dependencies_parsed_at":"2025-09-21T15:26:16.499Z","dependency_job_id":"cae5c99e-2776-440f-8339-ec599b9a297f","html_url":"https://github.com/CodeStrate/semantic_search_engine","commit_stats":null,"previous_names":["codestrate/semantic_search_engine"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/CodeStrate/semantic_search_engine","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CodeStrate%2Fsemantic_search_engine","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CodeStrate%2Fsemantic_search_engine/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CodeStrate%2Fsemantic_search_engine/releases","mani
fests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CodeStrate%2Fsemantic_search_engine/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CodeStrate","download_url":"https://codeload.github.com/CodeStrate/semantic_search_engine/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CodeStrate%2Fsemantic_search_engine/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279009070,"owners_count":26084549,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-11T02:00:06.511Z","response_time":55,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-10-11T22:29:46.021Z","updated_at":"2025-10-11T22:29:50.053Z","avatar_url":"https://github.com/CodeStrate.png","language":"Python","readme":"# Semantic Search Engine\n\n## NOTE: Sample chunked SQLite DB and Chroma DB are added so API can be directly tested.\n\nA semantic search engine for machinery safety documents that combines vector similarity search with BM25 lexical search for improved retrieval accuracy.\n\n## Project Structure\n\n```\nsemantic_search_engine/\n├── api/\n│   ├── __init__.py\n│   ├── main.py              # FastAPI application\n│   └── qna_model.py         # Pydantic models for API\n├── chunk_db/\n│   ├── __init__.py\n│   ├── chunk_data.py        # PDF processing and text chunking\n│   
├── ingest_chunks.py     # Chunk storage and vector embedding\n│   └── ingest_into_db.py    # Database ingestion pipeline\n├── sourced_data/\n│   └── *.pdf               # Source PDF documents\n├── hybrid_reranker/\n│   ├── __init__.py\n│   └── bm25_reranker.py    # BM25 + vector hybrid search\n├── utils/\n│   ├── __init__.py\n│   ├── common_utils.py          # Commonly used utilities in this app\n│   ├── download_source_data.py  # Data download script\n│   ├── normalize_scores.py      # Score normalization utilities\n│   └── retrieval_utils.py       # Answer formatting and citations\n├── vector_db/\n│   ├── __init__.py\n│   └── baseline_search.py   # Vector similarity search\n├── sources.json             # Data source configuration\n└── README.md\n```\n\n## Installation\n\n1. **Clone the repository**:\n```bash\ngit clone https://github.com/CodeStrate/semantic_search_engine.git\ncd semantic_search_engine\n```\n\n2. **Install dependencies**:\n```bash\npip install -r requirements.txt\n```\n\n## Setup and Usage\n\n### 1. Download Data\n```bash\npython utils/download_source_data.py\n```\nThis downloads the PDF documents specified in `sources.json` to the `sourced_data/` directory.\n\n### 2. Chunk and Ingest Data\n```bash\npython -m chunk_db.ingest_into_db\n```\nThis processes the PDFs, extracts text with OCR cleaning, and creates text chunks with metadata.\n\n### 3. Embed and Store in ChromaDB\n```bash\npython -m chunk_db.ingest_chunks\n```\nThis generates embeddings using `all-MiniLM-L6-v2` (downloading the model for ChromaDB first if it is not already available) and stores them in ChromaDB for vector search.\n\n### 4. 
Start the API Server\n```bash\npython api/main.py\n```\nThe FastAPI server will start on `http://localhost:8000`.\n\n## API Endpoints\n\n### POST `/ask`\n\nSubmit a query and get an answer with citations.\n\n**Request Body**:\n```json\n{\n  \"query\": \"What is OSHA?\",\n  \"k\": 5,\n  \"mode\": \"baseline\"\n}\n```\n\n**Parameters**:\n- `query` (string, required): User's question\n- `k` (integer, optional): Number of chunks to retrieve (default: 5)\n- `mode` (string, optional): Search mode: `\"baseline\"` or `\"hybrid-bm25\"` (default: `\"baseline\"`)\n\n**Response**:\n```json\n{\n  \"answer\": \"OSHA is the Occupational Safety and Health Administration...\",\n  \"contexts\": [\n    [\n      {\n        \"chunk_id\": \"123\",\n        \"src_id\": \"src01\",\n        \"title\": \"OSHA Guidelines\",\n        \"url\": \"https://example.com/osha.pdf\",\n        \"score\": 0.95\n      }\n    ],\n    [0.95, 0.87, 0.82]\n  ],\n  \"mode\": \"baseline\"\n}\n```\n\n## Curl Requests\n\n### Basic Query\n```bash\ncurl -X POST \"http://localhost:8000/ask\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"query\": \"What is OSHA?\",\n    \"k\": 5,\n    \"mode\": \"baseline\"\n  }'\n```\n\n### Hybrid Search Query (tricky)\n```bash\ncurl -X POST \"http://localhost:8000/ask\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"query\": \"machinery safety regulations and compliance requirements\",\n    \"k\": 10,\n    \"mode\": \"hybrid-bm25\"\n  }'\n```\n\n### Off-topic Query (to test abstention)\n```bash\ncurl -X POST \"http://localhost:8000/ask\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"query\": \"hello how are you today?\",\n    \"k\": 5,\n    \"mode\": \"baseline\"\n  }'\n```\n\n## Testing\n\nTest individual components, or the snippets under `debug/`:\n\n```bash\n# Test baseline search\npython -m vector_db.baseline_search\n\n# Test hybrid reranking\npython -m hybrid_reranker.bm25_reranker\n```\nEach component is independently testable and can 
be run as a module using Python's `-m` flag.\n\n## Learnings\nWhile the pipeline sounds simple (just chunk, embed, and retrieve), it is not. A lot of time went into tuning the chunking functionality, especially because the **PDFs** are **OCR-based**: text extraction is never perfect, and given those limitations I spent many `test_chunking` iterations just to find a good chunking sweet spot. For the same reason I chose `Langchain`'s chunking methodology, with separators and overlaps, so that each chunk is context-aware on its own, since we can't use a generative model (`generation`) here. While it works well, some **tricky** queries can still stump my retriever; **chunking** needs more testing and time. Some harder queries didn't return an answer at all because of my `abstain` filter, whose threshold I had to raise.\n\nIf I could use generation, then `retrieval_utils` and the **regex-based OCR cleaning would not be required** at all. This project taught me the power of generative LLMs and how they can seemingly do anything they are told to. Since this use case doesn't require a **paid** API, I am sure a `1B` Ollama or HuggingFace model would suffice. I also finally learned how to add citations to retrieval results through `metadata` (from the Chroma docs). While `rank-bm25` is the most widely used BM25 implementation, I came across `bm25s` in a HuggingFace blog and was fascinated by how lightweight it is. To keep the project lightweight I also chose not to use many NLP libraries, which may have affected my result quality. Using `Chroma` and `PyMuPDF` was a personal choice given my previous experience with them. We could certainly have shed some weight with `FAISS`, but then `SentenceTransformers` would be required, which depends on `torch` and other heavy dependencies. It didn't seem worth it.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodestrate%2Fsemantic_search_engine","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcodestrate%2Fsemantic_search_engine","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodestrate%2Fsemantic_search_engine/lists"}