https://github.com/ayushpramanik/distributed-search-engine
Production-style distributed search engine built with C++, Go, gRPC, Redis, and React featuring sharded inverted indexes, distributed query aggregation, BM25 ranking, caching, observability, and real-time infrastructure dashboards.
backend distributed-systems infrastructure observability scalability search-engine
- Host: GitHub
- URL: https://github.com/ayushpramanik/distributed-search-engine
- Owner: AyushPramanik
- License: mit
- Created: 2026-05-25T14:29:32.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-05-25T15:51:08.000Z (about 1 month ago)
- Last Synced: 2026-05-25T17:31:05.577Z (about 1 month ago)
- Topics: backend, distributed-systems, infrastructure, observability, scalability, search-engine
- Language: C++
- Homepage:
- Size: 58.6 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# Distributed Search Engine
A production-style distributed full-text search engine built from scratch, designed to demonstrate distributed systems engineering, high-performance indexing, and infrastructure observability.
## Architecture
```
┌──────────────┐     REST/JSON      ┌──────────────────┐     REST/JSON      ┌─────────────────────────────────┐
│  Next.js UI  │ ──────────────────▶│   API Gateway    │ ──────────────────▶│     Query Coordinator (Go)      │
│  (React/TS)  │                    │       (Go)       │                    │ fan-out · merge · rank · route  │
└──────────────┘                    │   Redis cache    │                    └──────────────┬──────────────────┘
                                    │    Prometheus    │                                   │
                                    └──────────────────┘                    ┌──────────────┼──────────────┐
                                                                            │              │              │
                                                                      ┌─────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐
                                                                      │  Shard-1  │  │  Shard-2  │  │  Shard-3  │
                                                                      │  (C++17)  │  │  (C++17)  │  │  (C++17)  │
                                                                      │BM25/TF-IDF│  │BM25/TF-IDF│  │BM25/TF-IDF│
                                                                      └───────────┘  └───────────┘  └───────────┘
```
## Features
**Core search infrastructure**
- Inverted index with Porter stemmer, stopword filtering, and positional indexing
- BM25 and TF-IDF ranking algorithms with configurable parameters
- Snippet extraction with query-term highlighting context
- Thread-safe concurrent indexing and search (`std::shared_mutex`)
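The BM25 ranking mentioned above can be illustrated with a minimal sketch of the standard Okapi BM25 per-term score. It is written in Go for brevity, although the shard core itself is C++. The names `BM25Params` and `bm25Term`, the defaults `k1 = 1.2` and `b = 0.75`, and the `ln(1 + ...)` IDF variant are assumptions chosen for illustration, not taken from the repository.

```go
package main

import (
	"fmt"
	"math"
)

// BM25Params holds the two tunable BM25 constants (hypothetical struct).
type BM25Params struct {
	K1 float64 // term-frequency saturation
	B  float64 // strength of document-length normalization, in [0, 1]
}

// bm25Term scores one query term against one document.
//   tf        occurrences of the term in the document
//   docLen    document length in tokens
//   avgDocLen mean document length in the index
//   n         number of documents containing the term
//   total     number of documents in the index
func bm25Term(p BM25Params, tf, docLen, avgDocLen float64, n, total int) float64 {
	if tf == 0 || total == 0 || avgDocLen == 0 {
		return 0
	}
	// One common IDF variant; the +1 inside the log keeps it non-negative.
	idf := math.Log(1 + (float64(total)-float64(n)+0.5)/(float64(n)+0.5))
	norm := tf + p.K1*(1-p.B+p.B*docLen/avgDocLen)
	return idf * tf * (p.K1 + 1) / norm
}

func main() {
	p := BM25Params{K1: 1.2, B: 0.75}
	short := bm25Term(p, 2, 50, 100, 10, 1000)
	long := bm25Term(p, 2, 200, 100, 10, 1000)
	fmt.Printf("short doc: %.3f, long doc: %.3f\n", short, long)
}
```

A document's score for a multi-term query is the sum of `bm25Term` over the query terms. With the same term frequency, the shorter document scores higher because of length normalization.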
**Distributed architecture**
- 3-shard cluster with FNV-1a (32-bit) hash-mod-N routing for document placement
- Parallel fan-out search: all shards queried simultaneously per request
- Global result merging and re-ranking across shards
- Graceful shard failure handling — partial results returned on timeout
- Background health monitor with 15-second polling interval
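The fan-out, merge, and partial-result behaviour listed above could look roughly like the following Go sketch. It is not the coordinator's actual code: `Hit`, `SearchFunc`, `fanOut`, `demoCluster`, and the 100 ms timeout are hypothetical, and in-memory functions stand in for the per-shard HTTP client. The sketch also simply sorts by the score each shard reports.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"time"
)

// Hit is one scored result as reported by a shard (hypothetical shape).
type Hit struct {
	DocID string
	Score float64
	Shard string
}

// SearchFunc stands in for the per-shard HTTP client call.
type SearchFunc func(ctx context.Context, query string) ([]Hit, error)

// fanOut queries all shards concurrently, waits at most timeout, and returns
// the merged top-k hits plus the sorted names of shards that failed or timed out.
func fanOut(shards map[string]SearchFunc, query string, k int, timeout time.Duration) ([]Hit, []string) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	type reply struct {
		name string
		hits []Hit
		err  error
	}
	replies := make(chan reply, len(shards)) // buffered: late senders never block
	pending := make(map[string]bool, len(shards))
	for name, search := range shards {
		pending[name] = true
		go func(name string, search SearchFunc) {
			hits, err := search(ctx, query)
			replies <- reply{name, hits, err}
		}(name, search)
	}

	var merged []Hit
	var failed []string
collect:
	for len(pending) > 0 {
		select {
		case r := <-replies:
			delete(pending, r.name)
			if r.err != nil {
				failed = append(failed, r.name)
				continue
			}
			merged = append(merged, r.hits...)
		case <-ctx.Done():
			// Deadline hit: give up on whoever has not answered yet.
			for name := range pending {
				failed = append(failed, name)
			}
			break collect
		}
	}

	// Global re-rank: highest score first, doc ID as a deterministic tie-break.
	sort.Slice(merged, func(i, j int) bool {
		if merged[i].Score != merged[j].Score {
			return merged[i].Score > merged[j].Score
		}
		return merged[i].DocID < merged[j].DocID
	})
	if len(merged) > k {
		merged = merged[:k]
	}
	sort.Strings(failed)
	return merged, failed
}

const demoTimeout = 100 * time.Millisecond

// demoCluster returns two healthy in-memory shards and one that never answers in time.
func demoCluster() map[string]SearchFunc {
	static := func(hits ...Hit) SearchFunc {
		return func(context.Context, string) ([]Hit, error) { return hits, nil }
	}
	return map[string]SearchFunc{
		"shard-1": static(Hit{"a", 2.0, "shard-1"}, Hit{"b", 1.0, "shard-1"}),
		"shard-2": static(Hit{"c", 3.0, "shard-2"}),
		"shard-3": func(ctx context.Context, _ string) ([]Hit, error) {
			<-ctx.Done() // simulate a hung shard: block until the deadline
			return nil, ctx.Err()
		},
	}
}

func main() {
	hits, failed := fanOut(demoCluster(), "distributed systems", 2, demoTimeout)
	fmt.Println("top hits:", hits)
	fmt.Println("unavailable shards:", failed)
}
```

The caller still gets the best hits from the healthy shards, along with a list of shards that did not contribute, which is one way to surface partial results to the client.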
**Caching**
- Redis query cache with 5-minute TTL and LRU eviction
- Cache invalidation on document writes
- Per-request cache hit/miss tracking
**Observability**
- Prometheus metrics at every service layer (p50/p95/p99 latency, QPS, error rate)
- Grafana dashboards provisioned automatically
- Structured JSON request logging
- Per-shard document count and health metrics
**Frontend**
- Dark-mode search UI with real-time latency display
- Cluster topology visualization
- Live metrics dashboard with streaming charts (Recharts)
- Relevance score and shard attribution per result
## Quick Start
**Requirements:** Docker, Docker Compose v2
```bash
git clone https://github.com/ayushpramanik/distributed-search-engine
cd distributed-search-engine
docker compose up --build -d
```
| Service | URL |
|-------------|------------------------------|
| Frontend | http://localhost:3000 |
| API Gateway | http://localhost:3001 |
| Grafana | http://localhost:3100 (admin/admin) |
| Prometheus | http://localhost:9091 |
| Shard-1 | http://localhost:18081 |
**Seed with sample data:**
```bash
bash scripts/seed-data.sh
```
**Test the search API:**
```bash
# Search
curl "http://localhost:3001/api/search?q=distributed+systems&algorithm=bm25"
# Index a document
curl -X POST http://localhost:3001/api/documents \
-H "Content-Type: application/json" \
-d '{"id":"doc-1","title":"Test","content":"Distributed search engines use inverted indexes for fast full-text retrieval."}'
# Cluster health
curl http://localhost:3001/api/health
```
## Project Structure
```
├── shard-node/              C++17 search engine core
│   └── src/
│       ├── index/           Inverted index, tokenizer, BM25/TF-IDF scorer
│       ├── server/          HTTP server (cpp-httplib)
│       └── metrics/         Prometheus text format exporter
├── query-coordinator/       Go: distributed query orchestration
│   └── internal/
│       ├── coordinator/     Fan-out, health monitor, hash routing
│       ├── shard/           Per-shard HTTP client
│       └── aggregator/      Result merging and re-ranking
├── api-gateway/             Go: public REST API with Redis caching
├── shared/                  Go: shared type definitions
├── frontend/                Next.js 14 + TypeScript + Tailwind
├── monitoring/              Prometheus config + Grafana dashboards
├── load-testing/            k6 load test scripts
├── scripts/                 Setup and data seeding
├── docs/                    Architecture, API reference
└── shared-proto/            Protobuf service contracts
```
## Load Testing
```bash
# Install k6: https://k6.io/docs/getting-started/installation/
# Search load (ramp to 500 concurrent users)
k6 run load-testing/search-load.js
# Indexing stress test
k6 run load-testing/index-load.js
```
Target thresholds: p95 search latency < 250ms, error rate < 1%.
## Local Development
```bash
# Run each service from the repository root, in its own terminal.
# C++ shard (requires cmake + g++ or clang++)
cd shard-node && cmake -B build && cmake --build build
./build/shard_node
# Go coordinator
cd query-coordinator && go run ./cmd/coordinator
# Go API gateway
cd api-gateway && go run ./cmd/gateway
# Frontend
cd frontend && npm install && npm run dev
```
## Design Decisions
**Why C++ for shard nodes?** Index operations are CPU-bound (tokenization, scoring over large posting lists). C++ provides deterministic memory layout, zero-overhead abstractions, and direct control over concurrency primitives.
**Why HTTP/JSON for internal RPC?** Simplifies local development and debugging without a code generation step. A production deployment would add gRPC (proto file included in `shared-proto/`) for reduced serialization overhead.
**Why FNV hash-mod-N instead of a consistent hashing ring?** FNV mod N is simpler and sufficient for a static cluster. For elastic scaling, replace it with a virtual node ring (for example, 150 virtual positions per physical node) to minimize reshuffling on membership change.
**Why coarse-grained cache invalidation?** Flushing the whole cache on any write is simple and correct. A production system would track which terms each cached query depends on and invalidate selectively, or use a shorter TTL for write-heavy workloads.
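As a rough illustration of the flush-on-write policy, the Go sketch below uses an in-memory map as a stand-in for Redis. `QueryCache`, its methods, and `newDemoCache` are hypothetical names, and the real gateway talks to Redis rather than a local map.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type cacheEntry struct {
	value   string
	expires time.Time
}

// QueryCache is an in-memory stand-in for the Redis query cache (hypothetical type).
type QueryCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	items map[string]cacheEntry
}

func NewQueryCache(ttl time.Duration) *QueryCache {
	return &QueryCache{ttl: ttl, items: make(map[string]cacheEntry)}
}

// Get returns the cached response if it is present and not yet expired.
func (c *QueryCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.items[key]
	if !ok || time.Now().After(e.expires) {
		delete(c.items, key) // no-op when the key is absent
		return "", false
	}
	return e.value, true
}

// Set stores a response under the query key with the configured TTL.
func (c *QueryCache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key] = cacheEntry{value: value, expires: time.Now().Add(c.ttl)}
}

// InvalidateAll drops every cached query; meant to be called after any document write.
func (c *QueryCache) InvalidateAll() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items = make(map[string]cacheEntry)
}

// newDemoCache uses the 5-minute TTL mentioned in the feature list.
func newDemoCache() *QueryCache { return NewQueryCache(5 * time.Minute) }

func main() {
	cache := newDemoCache()
	key := "q=distributed+systems&algorithm=bm25"
	cache.Set(key, `{"hits":[]}`)
	_, hit := cache.Get(key)
	fmt.Println("before write, cache hit:", hit)

	cache.InvalidateAll() // a document was indexed
	_, hit = cache.Get(key)
	fmt.Println("after write, cache hit:", hit)
}
```

The appeal of this policy is that no cached result can be stale after a write. The cost is that every write empties the cache, including entries for queries the new document could not have matched.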
## Tech Stack
| Layer | Technology |
|-------|-----------|
| Search core | C++17, cpp-httplib, nlohmann/json |
| Coordinator | Go 1.21, chi router |
| API Gateway | Go 1.21, chi router, go-redis/v9 |
| Frontend | Next.js 14, TypeScript, Tailwind CSS, Recharts |
| Cache | Redis 7 |
| Metrics | Prometheus, Grafana |
| Load testing | k6 |
| Containers | Docker, Docker Compose |