https://github.com/zilean12/search-engine
Develop a custom search engine using Information Retrieval
- Host: GitHub
- URL: https://github.com/zilean12/search-engine
- Owner: Zilean12
- Created: 2024-11-11T15:37:49.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-11-12T13:39:02.000Z (11 months ago)
- Last Synced: 2025-02-09T22:42:56.330Z (8 months ago)
- Language: Python
- Homepage:
- Size: 38.1 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# 🔍 Simple Search Engine
An intelligent document search engine that leverages natural language processing techniques to provide relevant and personalized search results. Powered by Flask, TF-IDF, and cosine similarity.

## Table of Contents

- [Features](#-features)
- [Prerequisites](#-prerequisites)
- [Installation](#-installation)
- [Project Structure](#-project-structure)
- [Usage](#-usage)
- [Key Components](#-key-components)
  - [Text Preprocessing](#text-preprocessing)
  - [Inverted Index](#inverted-index)
  - [TF-IDF Calculation](#tf-idf-calculation)
  - [Cosine Similarity](#cosine-similarity)
  - [Spell Checking](#spell-checking)
- [Dependencies](#-dependencies)
## 📋 Features

- **Text Preprocessing**: Tokenization, stop word removal, and lemmatization
- **Inverted Index Construction**: Allows efficient term-based lookups
- **TF-IDF Calculation**: Measures the importance of terms in each document
- **Cosine Similarity**: Computes similarity between the query and documents for ranking
- **Spell Checking**: Automatically corrects misspelled terms in user queries
- **Web Interface**: Search through documents using a simple HTML form

## 🛠️ Prerequisites
- Python 3.10+
- Internet connection (for downloading the NLTK stopwords corpus and the spaCy model)

## 🚀 Installation
1. **Clone the Repository**
```bash
git clone https://github.com/Zilean12/Search-Engine.git
```
```bash
cd Search-Engine
```
2. **Install Required Packages**: Install the Python packages listed in `requirements.txt`:
```bash
pip install -r requirements.txt
```
3. **Download spaCy Model**
```bash
python -m spacy download en_core_web_sm
```
4. **Download NLTK Data**: Download the stopwords corpus from NLTK.
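The README does not give the exact command; one way is NLTK's standard `nltk.download` helper:
```python
# Fetch the stopwords corpus used during preprocessing
import nltk

nltk.download("stopwords")
```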
5. **Run the Application**: Start the Flask app by running:
```bash
python app.py
```
The app will be available at `http://127.0.0.1:5000`.

## 🗂️ Project Structure
1. `app.py`: Main application file with text processing, TF-IDF calculation, and Flask routes.
2. `templates/index.html`: HTML template for the search interface.
3. `static/style.css`: CSS file for styling the web interface.
4. `requirements.txt`: List of required Python packages.
## 🔍 Usage
1. Open the app in your browser (`http://127.0.0.1:5000`).
2. Enter a search query in the input box and click "Search."
3. The application will display documents ranked by relevance to the query, showing their cosine similarity scores. Misspelled terms in the query will be automatically corrected.

## 🔑 Key Components
### Text Preprocessing
The text is converted to lowercase, punctuation is removed, stop words are removed, and remaining words are stemmed.
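A minimal sketch of such a pipeline, assuming the NLTK stopword list and the spaCy `en_core_web_sm` model listed under Dependencies (the exact steps in `app.py`, including whether it stems or lemmatizes, may differ):

```python
import string

import nltk
import spacy

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(nltk.corpus.stopwords.words("english"))
nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    # Lowercase and strip punctuation before tokenizing
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Keep the lemma of every token that is not a stop word
    return [tok.lemma_ for tok in nlp(text) if tok.text not in STOP_WORDS]
```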
### Inverted Index
An inverted index is created to store document IDs for each unique term, facilitating fast lookup of terms in documents.
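As an illustration, a minimal inverted index over preprocessed documents could be built like this (a sketch; the function and variable names are hypothetical, not taken from `app.py`):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps a document ID to its list of preprocessed tokens."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            # Record that this term occurs in this document
            index[token].add(doc_id)
    return index

# Example: {"search": {0, 1}, "engine": {0}, "query": {1}}
index = build_inverted_index({0: ["search", "engine"], 1: ["search", "query"]})
```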
### TF-IDF Calculation
The TF-IDF score is calculated for each term in each document. TF (Term Frequency) and IDF (Inverse Document Frequency) scores are used to measure term importance.
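A sketch of one common TF-IDF formulation (the README does not specify which TF and IDF variants `app.py` uses, so treat the exact weighting as an assumption):

```python
import math

def tf_idf(term, doc_tokens, all_docs):
    # TF: how often the term appears in this document, relative to its length
    tf = doc_tokens.count(term) / len(doc_tokens)
    # IDF: down-weight terms that occur in many documents
    doc_freq = sum(1 for tokens in all_docs if term in tokens)
    idf = math.log(len(all_docs) / (1 + doc_freq))
    return tf * idf
```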
### Cosine Similarity
The similarity between the query and each document is calculated using cosine similarity, which helps rank documents based on relevance.
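With the query and each document represented as TF-IDF vectors, the ranking score is the cosine of the angle between those vectors. A minimal sketch using NumPy (already a listed dependency):

```python
import numpy as np

def cosine_similarity(query_vec, doc_vec):
    # Cosine of the angle between the two TF-IDF vectors; 0.0 if either is all zeros
    denom = np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)
    return float(np.dot(query_vec, doc_vec) / denom) if denom else 0.0
```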
### Spell Checking
The application uses a custom spell checker to automatically correct misspelled terms in user queries, improving the search experience.
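The README describes the spell checker as custom and lists `rapidfuzz` as a dependency; a plausible sketch maps each query term to its closest vocabulary term by fuzzy matching (the threshold and usage below are assumptions, not the project's actual code):

```python
from rapidfuzz import process

def correct_term(term, vocabulary, min_score=80):
    # Return the closest known term if it is similar enough, else keep the original
    match = process.extractOne(term, vocabulary, score_cutoff=min_score)
    return match[0] if match is not None else term
```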
## 🧰 Dependencies
- **Flask**: Web framework for Python, used for handling HTTP requests and serving the web application.
- **NLTK (Natural Language Toolkit)**: Used for text preprocessing tasks, such as removing stopwords.
- **NumPy**: Provides support for numerical operations and vector calculations, essential for data processing.
- **Tabulate**: Formats data in tables for improved readability in the console.
- **Colorama**: Cross-platform library for adding color formatting to terminal output, making console messages more intuitive.
- **spaCy**: Advanced NLP library, used with the `en_core_web_sm` model to support text processing and tokenization.
- **rapidfuzz**: Library for fuzzy string matching, enhancing search capabilities by identifying approximate matches.