https://github.com/loayahmed304/search-engine
- Host: GitHub
- URL: https://github.com/loayahmed304/search-engine
- Owner: LoayAhmed304
- License: mit
- Created: 2025-03-21T20:00:13.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-07-18T15:01:23.000Z (3 months ago)
- Last Synced: 2025-07-18T19:41:54.119Z (3 months ago)
- Language: Java
- Size: 3.69 MB
- Stars: 9
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
## Hola! Can you say "The best search engine ever" with me?
# 🔍 Search Engine Project
> A full-stack search engine built with Java/Spring Boot backend and Vue.js frontend
## Project Overview
This is a high-performance search engine that crawls web pages, indexes content, calculates PageRank scores, and provides a modern web interface for searching. The system is designed with scalability and performance in mind, featuring multi-threaded crawling, efficient indexing, and intelligent ranking algorithms.
## Video Preview
https://github.com/user-attachments/assets/2c2071b9-f7cc-4c47-91e2-fb824069e937
## ✨ Features
### 🕷️ Web Crawler
- **Multi-threaded Architecture**: Configurable thread pool (default: 20 threads), sketched below
- **High Performance**: Crawls 1000 documents in under 1 minute using 5 threads
- **Smart Batching**: Prioritizes popular pages using frequency-based batching
- **Robots.txt Compliance**: Respects web server policies with robust caching
- **Duplicate Detection**: Content hashing prevents redundant processing
- **URL Normalization**: Standardizes and filters invalid URLs
- **Compression**: Stores crawled content efficiently
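As a loose illustration of the multi-threading and duplicate-detection bullets above, here is a minimal Java sketch (all class and method names are hypothetical, not the project's actual code): a fixed thread pool consumes URLs from a shared frontier and skips any page whose content hash has already been seen.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Set;
import java.util.concurrent.*;

public class CrawlerSketch {
    private final ExecutorService pool = Executors.newFixedThreadPool(20); // default: 20 threads
    private final BlockingQueue<String> frontier = new LinkedBlockingQueue<>();
    private final Set<String> seenHashes = ConcurrentHashMap.newKeySet();

    void start(int workers) {
        for (int i = 0; i < workers; i++) pool.submit(this::crawlLoop);
    }

    private void crawlLoop() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                String url = frontier.take();
                String html = fetch(url); // stub for the HTTP fetch
                // Duplicate detection: only process content we have not hashed before
                if (seenHashes.add(sha256(html))) process(url, html);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static String sha256(String text) {
        try {
            StringBuilder hex = new StringBuilder();
            for (byte b : MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    private String fetch(String url) { return ""; }   // stub
    private void process(String url, String html) { } // stub: store, extract links, enqueue
}
```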
##### Optimization Techniques
- Compresses documents before storage and decompresses them on read, so the database holds far less data and operations run faster.
- The `RobotsHandler` implements a domain-based caching system: it maps hostnames to parsed robots.txt rules, ensuring each domain's rules are fetched only once regardless of how many URLs from that domain are crawled (sketched below).
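A minimal sketch of that caching idea, assuming a hypothetical `RobotsCacheSketch` class (the real `RobotsHandler` may differ): `computeIfAbsent` guarantees at most one robots.txt fetch per host, even under concurrent crawling.

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RobotsCacheSketch {
    // hostname -> parsed Disallow prefixes for that domain
    private final Map<String, List<String>> rulesByHost = new ConcurrentHashMap<>();

    boolean isAllowed(String url) {
        URI uri = URI.create(url);
        // Fetch and parse robots.txt at most once per host
        List<String> disallowed = rulesByHost.computeIfAbsent(uri.getHost(), this::fetchAndParse);
        String path = uri.getPath() == null ? "/" : uri.getPath();
        return disallowed.stream().noneMatch(path::startsWith);
    }

    private List<String> fetchAndParse(String host) {
        // Stub: download https://<host>/robots.txt and collect its Disallow prefixes
        return List.of();
    }
}
```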
### 🧾 Indexer
> Transforms HTML documents into inverted indices for fast search
- **Advanced Tokenization**: Intelligent text processing and cleanup
- **Stop Word Filtering**: Removes common words for better relevance
- **Stemming Support**: Reduces words to their root forms
- **Field Extraction**: Processes titles, headers, and content separately
- **Efficient Storage**: Optimized database operations (indexing pipeline sketched below)
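A compact sketch of the pipeline described above, with hypothetical names and a deliberately tiny stop-word list and stemmer (the project presumably uses fuller versions): tokenize, filter stop words, stem, and record term positions per document.

```java
import java.util.*;

public class IndexerSketch {
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "and", "of", "to");
    // term -> (docId -> token positions), kept to support phrase search later
    private final Map<String, Map<Integer, List<Integer>>> invertedIndex = new HashMap<>();

    void indexDocument(int docId, String text) {
        String[] tokens = text.toLowerCase().replaceAll("[^a-z0-9\\s]", " ").split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            String token = tokens[pos];
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue; // stop-word filtering
            invertedIndex
                    .computeIfAbsent(stem(token), t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    private String stem(String t) {
        // Stand-in for a real stemmer (e.g. Porter); only trims a plural 's'
        return t.length() > 3 && t.endsWith("s") ? t.substring(0, t.length() - 1) : t;
    }
}
```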
### 📊 Ranking System
> Ranks pages based on their PageRank, TF, and IDF scores
- **TF-IDF** scoring for term relevance per page
- **Normalized PageRank** influence for domain authority
- Uses the PageRank algorithm, which takes ~10 ms on 6,000 documents
- **Structural field boosts** for title and header tag matches
- **Penalty** applied for pages missing title tags
- **Score capping** to avoid overinflation
- Computes PageRank as an **offline process** for all crawled URLs
- **Optimized database operations** for fetching & saving ranks (target: <200 ms for 6,000 documents)
- Ranking logic runs in **8–50 ms** depending on the query (signal combination sketched below)
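A hedged sketch of how these signals could combine into one score; the weights, boost factors, and cap below are illustrative assumptions, not the project's actual values.

```java
public class RankerSketch {
    // All constants here are assumptions for illustration only
    double score(double tf, double idf, double normalizedPageRank,
                 boolean titleMatch, boolean headerMatch, boolean hasTitleTag) {
        double score = tf * idf;             // TF-IDF: term relevance per page
        score *= 1.0 + normalizedPageRank;   // normalized PageRank influence
        if (titleMatch)  score *= 1.5;       // structural field boosts
        if (headerMatch) score *= 1.2;
        if (!hasTitleTag) score *= 0.8;      // penalty for a missing title tag
        return Math.min(score, 100.0);       // score capping to avoid overinflation
    }
}
```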
### 🔍 Query Processing & Phrase Searching
> Processes user queries, supports phrase search, and generates result snippets.
- **Unified Tokenization**: Applies the same cleanup, stemming, and stop-word removal as the indexer to ensure consistency (sketched below)
- **Exact Phrase Search**: Supports precise phrase matching, even when stop words are present
- **Multi-threading**: Speeds up snippet generation and phrase matching
- **Fast Response Times**:
  - General query: **0.01–0.2 seconds**
  - Phrase search: **< 0.3 seconds**
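A minimal sketch of the unified tokenization idea, reusing the same (hypothetical) stop-word list and stemmer as the indexer sketch above so that query terms line up with indexed terms.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class QueryTokenizerSketch {
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "and", "of", "to");

    static List<String> normalize(String query) {
        // Identical cleanup, stop-word removal, and stemming as at index time
        return Arrays.stream(query.toLowerCase().replaceAll("[^a-z0-9\\s]", " ").split("\\s+"))
                .filter(t -> !t.isEmpty() && !STOP_WORDS.contains(t))
                .map(QueryTokenizerSketch::stem)
                .collect(Collectors.toList());
    }

    private static String stem(String t) { // same stand-in stemmer as the indexer sketch
        return t.length() > 3 && t.endsWith("s") ? t.substring(0, t.length() - 1) : t;
    }
}
```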
#### Optimization Techniques
- Snippet generation is triggered only when the corresponding result page is requested
- Uses token position lookups from the inverted index to avoid full document scans (no regex!), as sketched below
- Early filtering (before stemming) to narrow down the result set for phrase-search queries
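An illustration of position-based phrase matching under the same assumptions as the indexer sketch (term positions stored per document; names hypothetical): a phrase matches when consecutive query terms occupy consecutive positions.

```java
import java.util.List;
import java.util.Map;

public class PhraseMatchSketch {
    // positionsByTerm: term -> sorted token positions within a single document
    static boolean containsPhrase(Map<String, List<Integer>> positionsByTerm, List<String> phrase) {
        for (int start : positionsByTerm.getOrDefault(phrase.get(0), List.of())) {
            boolean match = true;
            for (int i = 1; i < phrase.size() && match; i++) {
                // The i-th term must appear exactly i positions after the first term
                match = positionsByTerm.getOrDefault(phrase.get(i), List.of()).contains(start + i);
            }
            if (match) return true; // no raw-document scan or regex needed
        }
        return false;
    }
}
```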
#### Flow
## 📝 Development Guidelines
- Follow the conventions outlined in `Project-Guidelines.md`
## 🛠️ How to Run
### Prerequisites
- **MongoDB** must be running locally
- **Maven** should be installed and available in your terminal
- Ensure your `application.properties` file is properly configured (a minimal example follows)
- Clone the repository and navigate to the project root directory
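For reference, a minimal `application.properties` sketch using standard Spring Boot keys; the URI, database name, and port are placeholders, not the project's required values.

```properties
# Placeholder values; adjust for your environment
spring.data.mongodb.uri=mongodb://localhost:27017/search-engine
server.port=8080
```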
### Backend Setup
1. Open a terminal and navigate to the backend directory:
```bash
cd engine
```
2. Run Spring Boot
```bash
mvn spring-boot:run
```
---
To start each module **in order**:
1. Run the Crawler (default thread count is 20)
```bash
make crawl THREADS=10
```
2. Run the Indexer
```bash
make index
```
3. Run the PageRank Module
```bash
make pagerank
```
4. Run the Ranker (with an optional query)
```bash
make rank QUERY="your search terms"
```
---
### Frontend Setup
1. Open a new terminal and go to the frontend directory:
```bash
cd client
```
2. Install dependencies:
```bash
npm install
```
3. Start the development server:
```bash
npm run dev
```
4. Open your browser and visit: [http://localhost:5173](http://localhost:5173/)
## 👥 Contributors