{"id":50724848,"url":"https://github.com/ayushpramanik/distributed-search-engine","last_synced_at":"2026-06-10T03:03:06.189Z","repository":{"id":360249679,"uuid":"1249275916","full_name":"AyushPramanik/Distributed-Search-Engine","owner":"AyushPramanik","description":"Production-style distributed search engine built with C++, Go, gRPC, Redis, and React featuring sharded inverted indexes, distributed query aggregation, BM25 ranking, caching, observability, and real-time infrastructure dashboards.","archived":false,"fork":false,"pushed_at":"2026-05-25T15:51:08.000Z","size":60,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-25T17:31:05.577Z","etag":null,"topics":["backend","distributed-systems","infrastructure","observability","scalability","search-engine"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AyushPramanik.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-25T14:29:32.000Z","updated_at":"2026-05-25T15:51:12.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/AyushPramanik/Distributed-Search-Engine","commit_stats":null,"previous_names":["ayushpramanik/distributed-search-engine"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/AyushPramanik/Distributed-Search-Engine","repository_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/AyushPramanik%2FDistributed-Search-Engine","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AyushPramanik%2FDistributed-Search-Engine/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AyushPramanik%2FDistributed-Search-Engine/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AyushPramanik%2FDistributed-Search-Engine/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AyushPramanik","download_url":"https://codeload.github.com/AyushPramanik/Distributed-Search-Engine/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AyushPramanik%2FDistributed-Search-Engine/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34134634,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-10T02:00:07.152Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["backend","distributed-systems","infrastructure","observability","scalability","search-engine"],"created_at":"2026-06-10T03:03:03.821Z","updated_at":"2026-06-10T03:03:06.179Z","avatar_url":"https://github.com/AyushPramanik.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Distributed Search Engine\n\nA production-style distributed 
full-text search engine built from scratch. Designed to demonstrate distributed systems engineering, high-performance indexing, and infrastructure observability.\n\n## Architecture\n\n```\n┌──────────────┐     REST/JSON      ┌──────────────────┐     REST/JSON      ┌─────────────────────────────────┐\n│  Next.js UI  │ ──────────────────▶│   API Gateway    │ ──────────────────▶│     Query Coordinator (Go)      │\n│  (React/TS)  │                    │      (Go)        │                    │  fan-out · merge · rank · route  │\n└──────────────┘                    │  Redis cache     │                    └──────────────┬──────────────────┘\n                                    │  Prometheus      │                                   │\n                                    └──────────────────┘                    ┌──────────────┼──────────────┐\n                                                                            │              │              │\n                                                                   ┌────────▼────┐ ┌───────▼────┐ ┌──────▼─────┐\n                                                                   │  Shard-1   │ │  Shard-2  │ │  Shard-3  │\n                                                                   │   (C++17)  │ │  (C++17)  │ │  (C++17)  │\n                                                                   │ BM25/TF-IDF│ │BM25/TF-IDF│ │BM25/TF-IDF│\n                                                                   └────────────┘ └───────────┘ └────────────┘\n```\n\n## Features\n\n**Core search infrastructure**\n- Inverted index with Porter stemmer, stopword filtering, and positional indexing\n- BM25 and TF-IDF ranking algorithms with configurable parameters\n- Snippet extraction with query-term highlighting context\n- Thread-safe concurrent indexing and search (`std::shared_mutex`)\n\n**Distributed architecture**\n- 3-shard cluster with FNV-32a consistent hash routing for document placement\n- Parallel fan-out search: all shards 
queried simultaneously per request\n- Global result merging and re-ranking across shards\n- Graceful shard failure handling — partial results returned on timeout\n- Background health monitor with 15-second polling interval\n\n**Caching**\n- Redis query cache with 5-minute TTL and LRU eviction\n- Cache invalidation on document writes\n- Per-request cache hit/miss tracking\n\n**Observability**\n- Prometheus metrics at every service layer (p50/p95/p99 latency, QPS, error rate)\n- Grafana dashboards provisioned automatically\n- Structured JSON request logging\n- Per-shard document count and health metrics\n\n**Frontend**\n- Dark-mode search UI with real-time latency display\n- Cluster topology visualization\n- Live metrics dashboard with streaming charts (Recharts)\n- Relevance score and shard attribution per result\n\n## Quick Start\n\n**Requirements:** Docker, Docker Compose v2\n\n```bash\ngit clone https://github.com/ayushpramanik/distributed-search-engine\ncd distributed-search-engine\ndocker compose up --build -d\n```\n\n| Service     | URL                          |\n|-------------|------------------------------|\n| Frontend    | http://localhost:3000        |\n| API Gateway | http://localhost:3001        |\n| Grafana     | http://localhost:3100 (admin/admin) |\n| Prometheus  | http://localhost:9091        |\n| Shard-1     | http://localhost:18081       |\n\n**Seed with sample data:**\n\n```bash\nbash scripts/seed-data.sh\n```\n\n**Test the search API:**\n\n```bash\n# Search\ncurl \"http://localhost:3001/api/search?q=distributed+systems\u0026algorithm=bm25\"\n\n# Index a document\ncurl -X POST http://localhost:3001/api/documents \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"id\":\"doc-1\",\"title\":\"Test\",\"content\":\"Distributed search engines use inverted indexes for fast full-text retrieval.\"}'\n\n# Cluster health\ncurl http://localhost:3001/api/health\n```\n\n## Project Structure\n\n```\n├── shard-node/           C++17 search engine core\n│   
└── src/\n│       ├── index/        Inverted index, tokenizer, BM25/TF-IDF scorer\n│       ├── server/       HTTP server (cpp-httplib)\n│       └── metrics/      Prometheus text format exporter\n├── query-coordinator/    Go - distributed query orchestration\n│   └── internal/\n│       ├── coordinator/  Fan-out, health monitor, consistent hashing\n│       ├── shard/        Per-shard HTTP client\n│       └── aggregator/   Result merging and re-ranking\n├── api-gateway/          Go - public REST API with Redis caching\n├── shared/               Go - shared type definitions\n├── frontend/             Next.js 14 + TypeScript + Tailwind\n├── monitoring/           Prometheus config + Grafana dashboards\n├── load-testing/         k6 load test scripts\n├── scripts/              Setup and data seeding\n├── docs/                 Architecture, API reference\n└── shared-proto/         Protobuf service contracts\n```\n\n## Load Testing\n\n```bash\n# Install k6: https://k6.io/docs/getting-started/installation/\n\n# Search load (ramp to 500 concurrent users)\nk6 run load-testing/search-load.js\n\n# Indexing stress test\nk6 run load-testing/index-load.js\n```\n\nTarget thresholds: p95 search latency \u003c 250ms, error rate \u003c 1%.\n\n## Local Development\n\nRun each service in its own terminal, starting from the repository root.\n\n```bash\n# C++ shard (requires cmake + g++ or clang++)\ncd shard-node \u0026\u0026 cmake -B build \u0026\u0026 cmake --build build\n# Still inside shard-node/ after the cd above, so the binary is under ./build\n./build/shard_node\n\n# Go coordinator\ncd query-coordinator \u0026\u0026 go run ./cmd/coordinator\n\n# Go API gateway\ncd api-gateway \u0026\u0026 go run ./cmd/gateway\n\n# Frontend\ncd frontend \u0026\u0026 npm install \u0026\u0026 npm run dev\n```\n\n## Design Decisions\n\n**Why C++ for shard nodes?** Index operations are CPU-bound (tokenization, scoring over large posting lists). 
C++ provides deterministic memory layout, zero-overhead abstractions, and direct control over concurrency primitives.\n\n**Why HTTP/JSON for internal RPC?** It simplifies local development and debugging because no code generation step is needed. A production deployment would add gRPC (proto file included in `shared-proto/`) to reduce serialization overhead.\n\n**Why FNV hashing instead of a consistent hashing ring?** FNV mod N (the hash taken modulo the shard count) is simpler and sufficient for a static cluster. Its drawback is that changing N remaps most documents to a different shard. For elastic scaling, replace it with a virtual node ring (each physical node gets 150 virtual positions) to minimize reshuffling on membership change.\n\n**Why coarse-grained cache invalidation?** A full cache flush on any write is simple and correct. A production system would track the terms affected by each query and invalidate selectively, or use a shorter TTL for write-heavy workloads.\n\n## Tech Stack\n\n| Layer | Technology |\n|-------|-----------|\n| Search core | C++17, cpp-httplib, nlohmann/json |\n| Coordinator | Go 1.21, chi router |\n| API Gateway | Go 1.21, chi router, go-redis/v9 |\n| Frontend | Next.js 14, TypeScript, Tailwind CSS, Recharts |\n| Cache | Redis 7 |\n| Metrics | Prometheus, Grafana |\n| Load testing | k6 |\n| Containers | Docker, Docker Compose |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fayushpramanik%2Fdistributed-search-engine","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fayushpramanik%2Fdistributed-search-engine","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fayushpramanik%2Fdistributed-search-engine/lists"}