https://github.com/ahmed-kamal2004/aurora_searchengine
New Search Engine Coming to the world
- Host: GitHub
- URL: https://github.com/ahmed-kamal2004/aurora_searchengine
- Owner: ahmed-kamal2004
- License: mit
- Created: 2024-04-05T03:44:50.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-06-17T20:34:58.000Z (5 months ago)
- Last Synced: 2024-06-18T23:06:18.733Z (5 months ago)
- Language: Java
- Size: 35 MB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Aurora
### Intro
## New Search Engine Coming to the World
#### Crawler
The web crawler is responsible for collecting documents from the web. Key features include:
- **Avoiding Re-visits:** Ensuring the crawler does not visit the same page more than once.
- **URL Normalization:** Checking if different URLs refer to the same page.
- **Document Type Handling:** Limiting crawling to specific document types (HTML for this project).
- **State Maintenance:** Allowing the crawler to resume from where it left off after interruptions.
- **Robots.txt Compliance:** Respecting rules set by web administrators to exclude certain pages.
- **Multithreading:** Supporting a user-defined number of threads with proper synchronization.
- **Seed Management:** Careful selection and management of seed URLs.
- **Crawl Limit:** Capable of crawling up to 6000 pages.
- **Visit Order:** Utilizing appropriate data structures to determine the order of page visits.
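The re-visit, normalization, crawl-limit, and visit-order points above can be sketched together as a small frontier. This is a minimal illustration, not the project's actual crawler: the class `CrawlFrontier`, its methods, and the specific normalization rules (lowercase scheme and host, drop fragment and default port, resolve `.` and `..`) are assumptions made for the example. It uses a FIFO queue, which gives a breadth-first visit order.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical frontier: a FIFO queue plus a set of normalized URLs already seen. */
class CrawlFrontier {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    private final Queue<String> queue = new ConcurrentLinkedQueue<>();
    private final int maxPages;

    CrawlFrontier(int maxPages) {
        this.maxPages = maxPages;
    }

    /** Maps URLs that differ only in case of scheme/host, default port, fragment, or dot segments to one key. */
    static String normalize(String url) {
        URI uri = URI.create(url.trim()).normalize(); // resolves "." and ".." in the path
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("absolute URL with a host expected: " + url);
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        int port = uri.getPort();
        boolean defaultPort = (scheme.equals("http") && port == 80)
                || (scheme.equals("https") && port == 443);
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        StringBuilder key = new StringBuilder(scheme).append("://").append(host);
        if (port != -1 && !defaultPort) {
            key.append(':').append(port);
        }
        key.append(path);
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery()); // query kept as-is, fragment dropped
        }
        return key.toString();
    }

    /** Enqueues the URL only if its normalized form is new and the crawl limit is not reached. */
    synchronized boolean offer(String url) {
        String key;
        try {
            key = normalize(url);
        } catch (IllegalArgumentException e) {
            return false; // malformed or relative URL
        }
        if (seen.size() >= maxPages || !seen.add(key)) {
            return false;
        }
        queue.add(key);
        return true;
    }

    /** Returns the next URL to visit, or null if the frontier is empty. */
    String poll() {
        return queue.poll();
    }
}
```

With `new CrawlFrontier(6000)`, offering `http://example.com/page` succeeds and a later offer of `HTTP://EXAMPLE.com:80/page#section` is rejected as a duplicate. Persisting `seen` and `queue` to disk would be one way to support resuming after an interruption.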
#### Indexer
The indexer processes the downloaded documents to facilitate fast and efficient querying. Features include:
- **Persistence:** Maintaining the index in secondary storage (file structure or database).
- **Fast Retrieval:** Optimized for quick response to queries for specific words or sets of words.
- **Incremental Updates:** Capability to update the index with new documents without rebuilding from scratch.
- **Design Consideration:** Ensuring compatibility with the ranker and search modules.
### Query Processor
This module handles user search queries with the following features:
- **Preprocessing:** Preparing search queries for efficient processing.
- **Stemming:** Matching words with the same root (e.g., "travel" matches "traveler", "traveling").
- **Phrase Searching:** Supporting phrase searches with quotation marks, ensuring precise order matching.
### Ranker 🚀
The ranker sorts search results based on relevance and popularity:
- **Relevance:** Calculated using methods like tf-idf, considering word occurrence in titles, headers, and body text.
- **Popularity:** Measured independently of the query, using algorithms like PageRank.
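To make the relevance score concrete, here is a minimal sketch of tf-idf computed over an in-memory inverted index. It is illustrative only: the class `TinyIndex` and its methods are hypothetical, the index is not persisted, and the formula used, `(count / docLength) * ln(N / df)`, is one common tf-idf variant assumed for the example. It ignores the extra weight the ranker may give to titles and headers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical in-memory inverted index: term -> (document id -> term count). */
class TinyIndex {
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();
    private final Map<String, Integer> docLengths = new HashMap<>();

    /** Adds one new document without touching the entries of existing ones. */
    void add(String docId, String text) {
        int length = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, t -> new HashMap<>()).merge(docId, 1, Integer::sum);
            length++;
        }
        docLengths.put(docId, length);
    }

    /** tf-idf of a term in a document; 0 if the document does not contain the term. */
    double tfIdf(String term, String docId) {
        Map<String, Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        if (docs == null || !docs.containsKey(docId)) {
            return 0.0;
        }
        double tf = (double) docs.get(docId) / docLengths.get(docId);
        double idf = Math.log((double) docLengths.size() / docs.size());
        return tf * idf;
    }

    /** Returns the ids of documents containing the term, highest tf-idf first. */
    List<String> search(String term) {
        Map<String, Integer> docs =
                postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyMap());
        List<String> ids = new ArrayList<>(docs.keySet());
        ids.sort((a, b) -> Double.compare(tfIdf(term, b), tfIdf(term, a)));
        return ids;
    }
}
```

A query-independent popularity score such as PageRank could then be combined with this value, for example as a weighted sum; how the two are combined in this project is not specified here.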
#### PageRank Algorithm
- Details:
```
PR(i) = d / S + (1 - d) * Σ(PR(j) / Outlinks(j)) where j points to i
```
Here d is approximately 0.15 and S is the number of URLs, with the ranks normalized to sum to 1. The same equation can be written in matrix form.
First, initialize M:
```
M = (1-d) A + dB
```
where :
> d is the damping factor,
> S is the number of URLs on the web,
> B is an S x S matrix with every entry equal to 1/S,
> A is an S x S transition matrix that describes, for every URL, the links going out from it to other URLs.

Final Equation :bulb:
```
X = M.T * X
```
The number of Iterations is determined by the degree of precision required.
Precision criteria: ⚡
```
| norm(X after multiplication operation) - norm(X before multiplication operation) | should be < Precision Factor
```
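The iteration `X = M.T * X` can be sketched as a short power-iteration loop. This is an illustration under stated assumptions, not the project's implementation: the class `PageRankSketch` and its parameters are hypothetical, pages without outlinks are assumed to spread their rank evenly over all pages, and the loop stops on the Euclidean norm of the change in X, which is a slightly stricter variant of the criterion above. The matrix M is never built explicitly; the `d * B` term is added as the constant `d / S`.

```java
import java.util.Arrays;

/** Hypothetical helper: power iteration on M = (1 - d) * A + d * B. */
class PageRankSketch {

    /**
     * links[j] lists the indices of the pages that page j points to.
     * d weights the uniform matrix B; precision is the stopping threshold.
     */
    static double[] rank(int[][] links, double d, double precision, int maxIterations) {
        int s = links.length;
        double[] x = new double[s];
        Arrays.fill(x, 1.0 / s); // start from the uniform distribution

        for (int iter = 0; iter < maxIterations; iter++) {
            double[] next = new double[s];
            Arrays.fill(next, d / s); // contribution of d * B, since x sums to 1
            for (int j = 0; j < s; j++) {
                if (links[j].length == 0) {
                    // Assumption: a page with no outlinks spreads its rank evenly.
                    for (int i = 0; i < s; i++) {
                        next[i] += (1 - d) * x[j] / s;
                    }
                } else {
                    for (int i : links[j]) {
                        next[i] += (1 - d) * x[j] / links[j].length;
                    }
                }
            }
            double change = distance(next, x);
            x = next;
            if (change < precision) {
                break; // X has stopped moving within the requested precision
            }
        }
        return x;
    }

    /** Euclidean norm of (a - b). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```

For a three-page graph where page 0 links to pages 1 and 2, page 1 links to page 2, and page 2 links to page 0, calling `PageRankSketch.rank(new int[][] {{1, 2}, {2}, {0}}, 0.15, 1e-10, 1000)` returns ranks that sum to 1, with page 2 ranked highest.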
### Web interface
The web interface provides user interaction with the search engine:
- **Query Handling:** Receives and processes user queries.
- **Result Display:** Shows search results with snippets highlighting query words.
- **Pagination:** Handles large result sets by dividing them into pages.

**Examples**
![SearchEngine](./assets/Main%20Page.png)
![image](./assets/Results%20Page.png)
![image](./assets/Voice%20Recognition.png)
## Built With
- Java
- SpringBoot
- ReactJS
## API "Spring Boot" 📖
#### POST:
- "http://localhost:8090/ranker/rank" -- for applying PageRank algorithm
##### Response
> True, or False if there is an error
#### GET:
- "http://localhost:8090/ranker/search"
##### Request Body
> { "query": "al-Khwarizmi" }
##### Response
> Ranked URLs with appropriate information.
## Database "MongoDB"
![image](https://github.com/ahmed-kamal2004/Aurora_SearchEngine/assets/98265644/5592cb3e-9f0b-47a2-9c3b-56fab912969c)
## LICENSE
[MIT](/LICENSE) © ahmed-kamal2004