https://github.com/zilean12/search-engine
Develop a custom search engine using Information Retrieval
- Host: GitHub
- URL: https://github.com/zilean12/search-engine
- Owner: Zilean12
- Created: 2024-11-11T15:37:49.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-11-12T13:39:02.000Z (11 months ago)
- Last Synced: 2025-02-09T22:42:56.330Z (8 months ago)
- Language: Python
- Homepage:
- Size: 38.1 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# 🔍 Simple Search Engine
An intelligent document search engine that leverages natural language processing techniques to provide relevant and personalized search results. Powered by Flask, TF-IDF, and cosine similarity.

## Table of Contents

- [Features](#-features)
- [Prerequisites](#-prerequisites)
- [Installation](#-installation)
- [Project Structure](#-project-structure)
- [Usage](#-usage)
- [Key Components](#-key-components)
  - [Text Preprocessing](#text-preprocessing)
  - [Inverted Index](#inverted-index)
  - [TF-IDF Calculation](#tf-idf-calculation)
  - [Cosine Similarity](#cosine-similarity)
  - [Spell Checking](#spell-checking)
- [Dependencies](#-dependencies)
## 📋 Features

- **Text Preprocessing**: Tokenization, stop word removal, and lemmatization
- **Inverted Index Construction**: Allows efficient term-based lookups
- **TF-IDF Calculation**: Measures the importance of terms in each document
- **Cosine Similarity**: Computes similarity between the query and documents for ranking
- **Spell Checking**: Automatically corrects misspelled terms in user queries
- **Web Interface**: Search through documents using a simple HTML form

## 🛠️ Prerequisites
- Python 3.10+
- Internet connection (for downloading the NLTK stopwords corpus and the spaCy model)

## 🚀 Installation
1. **Clone the Repository**
```bash
git clone https://github.com/Zilean12/Search-Engine.git
```
```bash
cd Search-Engine
```
2. **Install Required Packages**: Install the Python packages listed in `requirements.txt`:
```bash
pip install -r requirements.txt
```
3. **Download spaCy Model**
```bash
python -m spacy download en_core_web_sm
```
4. **Download NLTK Data**: Download the stopwords corpus from NLTK.
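The README does not give the exact command; one way is NLTK's standard `nltk.download` helper:
```python
# Fetch the stopwords corpus used during preprocessing
import nltk

nltk.download("stopwords")
```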
5. **Run the Application**: Start the Flask app by running:
```bash
python app.py
```
The app will be available at `http://127.0.0.1:5000`.

## 🗂️ Project Structure
1. `app.py`: Main application file with text processing, TF-IDF calculation, and Flask routes.
2. `templates/index.html`: HTML template for the search interface.
3. `static/style.css`: CSS file for styling the web interface.
4. `requirements.txt`: List of required Python packages.
## 🔍 Usage
1. Open the app in your browser (`http://127.0.0.1:5000`).
2. Enter a search query in the input box and click "Search."
3. The application will display documents ranked by relevance to the query, showing their cosine similarity scores. Misspelled terms in the query will be automatically corrected.

## 🔑 Key Components
### Text Preprocessing
The text is converted to lowercase, punctuation is removed, stop words are removed, and remaining words are stemmed.
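A minimal sketch of such a pipeline, assuming the NLTK stopword list and the spaCy `en_core_web_sm` model listed under Dependencies (the exact steps in `app.py`, including whether it stems or lemmatizes, may differ):

```python
import string

import nltk
import spacy

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(nltk.corpus.stopwords.words("english"))
nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    # Lowercase and strip punctuation before tokenizing
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Keep the lemma of every token that is not a stop word
    return [tok.lemma_ for tok in nlp(text) if tok.text not in STOP_WORDS]
```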
### Inverted Index
An inverted index is created to store document IDs for each unique term, facilitating fast lookup of terms in documents.
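As an illustration, a minimal inverted index over preprocessed documents could be built like this (a sketch; the function and variable names are hypothetical, not taken from `app.py`):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps a document ID to its list of preprocessed tokens."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            # Record that this term occurs in this document
            index[token].add(doc_id)
    return index

# Example: {"search": {0, 1}, "engine": {0}, "query": {1}}
index = build_inverted_index({0: ["search", "engine"], 1: ["search", "query"]})
```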
### TF-IDF Calculation
The TF-IDF score is calculated for each term in each document. TF (Term Frequency) and IDF (Inverse Document Frequency) scores are used to measure term importance.
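A sketch of one common TF-IDF formulation (the README does not specify which TF and IDF variants `app.py` uses, so treat the exact weighting as an assumption):

```python
import math

def tf_idf(term, doc_tokens, all_docs):
    # TF: how often the term appears in this document, relative to its length
    tf = doc_tokens.count(term) / len(doc_tokens)
    # IDF: down-weight terms that occur in many documents
    doc_freq = sum(1 for tokens in all_docs if term in tokens)
    idf = math.log(len(all_docs) / (1 + doc_freq))
    return tf * idf
```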
### Cosine Similarity
The similarity between the query and each document is calculated using cosine similarity, which helps rank documents based on relevance.
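With the query and each document represented as TF-IDF vectors, the ranking score is the cosine of the angle between those vectors. A minimal sketch using NumPy (already a listed dependency):

```python
import numpy as np

def cosine_similarity(query_vec, doc_vec):
    # Cosine of the angle between the two TF-IDF vectors; 0.0 if either is all zeros
    denom = np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)
    return float(np.dot(query_vec, doc_vec) / denom) if denom else 0.0
```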
### Spell Checking
The application uses a custom spell checker to automatically correct misspelled terms in user queries, improving the search experience.
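The README describes the spell checker as custom and lists `rapidfuzz` as a dependency; a plausible sketch maps each query term to its closest vocabulary term by fuzzy matching (the threshold and usage below are assumptions, not the project's actual code):

```python
from rapidfuzz import process

def correct_term(term, vocabulary, min_score=80):
    # Return the closest known term if it is similar enough, else keep the original
    match = process.extractOne(term, vocabulary, score_cutoff=min_score)
    return match[0] if match is not None else term
```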
## 🧰 Dependencies
- **Flask**: Web framework for Python, used for handling HTTP requests and serving the web application.
- **NLTK (Natural Language Toolkit)**: Used for text preprocessing tasks, such as removing stopwords.
- **NumPy**: Provides support for numerical operations and vector calculations, essential for data processing.
- **Tabulate**: Formats data in tables for improved readability in the console.
- **Colorama**: Cross-platform library for adding color formatting to terminal output, making console messages more intuitive.
- **spaCy**: Advanced NLP library, used with the `en_core_web_sm` model to support text processing and tokenization.
- **rapidfuzz**: Library for fuzzy string matching, enhancing search capabilities by identifying approximate matches.