https://github.com/loayahmed304/search-engine
- Host: GitHub
- URL: https://github.com/loayahmed304/search-engine
- Owner: LoayAhmed304
- License: mit
- Created: 2025-03-21T20:00:13.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-07-18T15:01:23.000Z (3 months ago)
- Last Synced: 2025-07-18T19:41:54.119Z (3 months ago)
- Language: Java
- Size: 3.69 MB
- Stars: 9
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
## Hola! Can you say "The best search engine ever" with me?
# 🔍 Search Engine Project
> A full-stack search engine built with Java/Spring Boot backend and Vue.js frontend
## Project Overview
This is a high-performance search engine that crawls web pages, indexes content, calculates PageRank scores, and provides a modern web interface for searching. The system is designed with scalability and performance in mind, featuring multi-threaded crawling, efficient indexing, and intelligent ranking algorithms.
## Video Preview
https://github.com/user-attachments/assets/2c2071b9-f7cc-4c47-91e2-fb824069e937
## ✨ Features
### 🕷️ Web Crawler
- **Multi-threaded Architecture**: Configurable thread pool (default: 20 threads), sketched below
- **High Performance**: Crawls 1000 documents in under 1 minute using 5 threads
- **Smart Batching**: Prioritizes popular pages using frequency-based batching
- **Robots.txt Compliance**: Respects web server policies with robust caching
- **Duplicate Detection**: Content hashing prevents redundant processing
- **URL Normalization**: Standardizes and filters invalid URLs
- **Compression**: Stores crawled content efficiently
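As a loose illustration of the multi-threading and duplicate-detection bullets above, here is a minimal Java sketch (all class and method names are hypothetical, not the project's actual code): a fixed thread pool consumes URLs from a shared frontier and skips any page whose content hash has already been seen.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Set;
import java.util.concurrent.*;

public class CrawlerSketch {
    private final ExecutorService pool = Executors.newFixedThreadPool(20); // default: 20 threads
    private final BlockingQueue<String> frontier = new LinkedBlockingQueue<>();
    private final Set<String> seenHashes = ConcurrentHashMap.newKeySet();

    void start(int workers) {
        for (int i = 0; i < workers; i++) pool.submit(this::crawlLoop);
    }

    private void crawlLoop() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                String url = frontier.take();
                String html = fetch(url); // stub for the HTTP fetch
                // Duplicate detection: only process content we have not hashed before
                if (seenHashes.add(sha256(html))) process(url, html);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static String sha256(String text) {
        try {
            StringBuilder hex = new StringBuilder();
            for (byte b : MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    private String fetch(String url) { return ""; }   // stub
    private void process(String url, String html) { } // stub: store, extract links, enqueue
}
```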
##### Optimization Techniques
- Compresses documents before storage and decompresses them on read, so the database holds far less data and operations run faster.
- The `RobotsHandler` implements a domain-based caching system: it maps hostnames to parsed robots.txt rules, ensuring each domain's rules are fetched only once regardless of how many URLs from that domain are crawled (sketched below).
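A minimal sketch of that caching idea, assuming a hypothetical `RobotsCacheSketch` class (the real `RobotsHandler` may differ): `computeIfAbsent` guarantees at most one robots.txt fetch per host, even under concurrent crawling.

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RobotsCacheSketch {
    // hostname -> parsed Disallow prefixes for that domain
    private final Map<String, List<String>> rulesByHost = new ConcurrentHashMap<>();

    boolean isAllowed(String url) {
        URI uri = URI.create(url);
        // Fetch and parse robots.txt at most once per host
        List<String> disallowed = rulesByHost.computeIfAbsent(uri.getHost(), this::fetchAndParse);
        String path = uri.getPath() == null ? "/" : uri.getPath();
        return disallowed.stream().noneMatch(path::startsWith);
    }

    private List<String> fetchAndParse(String host) {
        // Stub: download https://<host>/robots.txt and collect its Disallow prefixes
        return List.of();
    }
}
```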
### 🧾 Indexer
> Transforms HTML documents into inverted indices for fast search
- **Advanced Tokenization**: Intelligent text processing and cleanup
- **Stop Word Filtering**: Removes common words for better relevance
- **Stemming Support**: Reduces words to their root forms
- **Field Extraction**: Processes titles, headers, and content separately
- **Efficient Storage**: Optimized database operations (indexing pipeline sketched below)
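A compact sketch of the pipeline described above, with hypothetical names and a deliberately tiny stop-word list and stemmer (the project presumably uses fuller versions): tokenize, filter stop words, stem, and record term positions per document.

```java
import java.util.*;

public class IndexerSketch {
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "and", "of", "to");
    // term -> (docId -> token positions), kept to support phrase search later
    private final Map<String, Map<Integer, List<Integer>>> invertedIndex = new HashMap<>();

    void indexDocument(int docId, String text) {
        String[] tokens = text.toLowerCase().replaceAll("[^a-z0-9\\s]", " ").split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            String token = tokens[pos];
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue; // stop-word filtering
            invertedIndex
                    .computeIfAbsent(stem(token), t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    private String stem(String t) {
        // Stand-in for a real stemmer (e.g. Porter); only trims a plural 's'
        return t.length() > 3 && t.endsWith("s") ? t.substring(0, t.length() - 1) : t;
    }
}
```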
### 📊 Ranking System
> Ranks pages based on their PageRank, TF, and IDF scores
- **TF-IDF** scoring for term relevance per page
- **Normalized PageRank** influence for domain authority
- Uses the PageRank algorithm, which takes ~10 ms on 6,000 documents
- **Structural field boosts** for title and header tag matches
- **Penalty** applied for pages missing title tags
- **Score capping** to avoid overinflation
- Computes PageRank as an **offline process** for all crawled URLs
- **Optimized database operations** for fetching & saving ranks (target: <200 ms for 6,000 documents)
- Ranking logic runs in **8–50 ms** depending on the query (signal combination sketched below)
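A hedged sketch of how these signals could combine into one score; the weights, boost factors, and cap below are illustrative assumptions, not the project's actual values.

```java
public class RankerSketch {
    // All constants here are assumptions for illustration only
    double score(double tf, double idf, double normalizedPageRank,
                 boolean titleMatch, boolean headerMatch, boolean hasTitleTag) {
        double score = tf * idf;             // TF-IDF: term relevance per page
        score *= 1.0 + normalizedPageRank;   // normalized PageRank influence
        if (titleMatch)  score *= 1.5;       // structural field boosts
        if (headerMatch) score *= 1.2;
        if (!hasTitleTag) score *= 0.8;      // penalty for a missing title tag
        return Math.min(score, 100.0);       // score capping to avoid overinflation
    }
}
```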
### 🔍 Query Processing & Phrase Searching
> Processes user queries, supports phrase search, and generates result snippets.
- **Unified Tokenization**: Applies the same cleanup, stemming, and stop-word removal as the indexer to ensure consistency (sketched below)
- **Exact Phrase Search**: Supports precise phrase matching, even when stop words are present
- **Multi-threading**: Speeds up snippet generation and phrase matching
- **Fast Response Times**:
  - General query: **0.01–0.2 seconds**
  - Phrase search: **< 0.3 seconds**
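A minimal sketch of the unified tokenization idea, reusing the same (hypothetical) stop-word list and stemmer as the indexer sketch above so that query terms line up with indexed terms.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class QueryTokenizerSketch {
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "and", "of", "to");

    static List<String> normalize(String query) {
        // Identical cleanup, stop-word removal, and stemming as at index time
        return Arrays.stream(query.toLowerCase().replaceAll("[^a-z0-9\\s]", " ").split("\\s+"))
                .filter(t -> !t.isEmpty() && !STOP_WORDS.contains(t))
                .map(QueryTokenizerSketch::stem)
                .collect(Collectors.toList());
    }

    private static String stem(String t) { // same stand-in stemmer as the indexer sketch
        return t.length() > 3 && t.endsWith("s") ? t.substring(0, t.length() - 1) : t;
    }
}
```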
#### Optimization Techniques
- Snippet generation is triggered only when the corresponding result page is requested
- Uses token position lookups from the inverted index to avoid full document scans (no regex!), as sketched below
- Early filtering (before stemming) to narrow down the result set for phrase-search queries
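An illustration of position-based phrase matching under the same assumptions as the indexer sketch (term positions stored per document; names hypothetical): a phrase matches when consecutive query terms occupy consecutive positions.

```java
import java.util.List;
import java.util.Map;

public class PhraseMatchSketch {
    // positionsByTerm: term -> sorted token positions within a single document
    static boolean containsPhrase(Map<String, List<Integer>> positionsByTerm, List<String> phrase) {
        for (int start : positionsByTerm.getOrDefault(phrase.get(0), List.of())) {
            boolean match = true;
            for (int i = 1; i < phrase.size() && match; i++) {
                // The i-th term must appear exactly i positions after the first term
                match = positionsByTerm.getOrDefault(phrase.get(i), List.of()).contains(start + i);
            }
            if (match) return true; // no raw-document scan or regex needed
        }
        return false;
    }
}
```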
#### Flow
## 📝 Development Guidelines
- Follow the conventions outlined in `Project-Guidelines.md`
## 🛠️ How to Run
### Prerequisites
- **MongoDB** must be running locally
- **Maven** should be installed and available in your terminal
- Ensure your `application.properties` file is properly configured (a minimal example follows)
- Clone the repository and navigate to the project root directory
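For reference, a minimal `application.properties` sketch using standard Spring Boot keys; the URI, database name, and port are placeholders, not the project's required values.

```properties
# Placeholder values; adjust for your environment
spring.data.mongodb.uri=mongodb://localhost:27017/search-engine
server.port=8080
```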
### Backend Setup
1. Open a terminal and navigate to the backend directory:
```bash
cd engine
```
2. Run Spring Boot
```bash
mvn spring-boot:run
```
---
To start each module **in order**:
1. Run the Crawler (default thread count is 20)
```bash
make crawl THREADS=10
```
2. Run the Indexer
```bash
make index
```
3. Run the PageRank Module
```bash
make pagerank
```
4. Run the Ranker (with an optional query)
```bash
make rank QUERY="your search terms"
```
---
### Frontend Setup
1. Open a new terminal and go to the frontend directory:
```bash
cd client
```
2. Install dependencies:
```bash
npm install
```
3. Start the development server:
```bash
npm run dev
```
4. Open your browser and visit: [http://localhost:5173](http://localhost:5173/)
## 👥 Contributors