https://github.com/vladimircuriel/information-retrieval-system
An NLP-based information retrieval system that indexes document collections, understands natural language queries, and returns relevance-ranked results using modern ranking algorithms.
- Host: GitHub
- URL: https://github.com/vladimircuriel/information-retrieval-system
- Owner: vladimircuriel
- License: other
- Created: 2025-09-07T20:18:54.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-09-08T05:54:04.000Z (10 months ago)
- Last Synced: 2025-09-08T07:23:56.272Z (10 months ago)
- Topics: ai, nlp, nltk, sklearn, spicy, streamlit
- Language: Python
- Homepage:
- Size: 746 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
Classes Information Retrieval System
---
**Information Retrieval System** is a Python-powered information-retrieval engine that lets you index entire document collections, interpret free-form queries, and return relevance-ranked answers—all with no external service required.
Unlike a simple CSV search that scans each row for literal query matches, the IR model builds a TF-IDF vector index and applies NLP preprocessing (tokenization, lemmatization, stop-word removal). Queries and documents are then compared in a high-dimensional vector space using cosine similarity, which yields fast, relevance-ranked results rather than unranked, exact-match hits. The same design can be extended with more advanced ranking algorithms such as BM25 or neural embeddings.
## Table of Contents
- [Features](#features)
- [Application](#application)
- [Installation](#installation)
## Features
- **Local Indexing**: Builds a TF-IDF vector index over your document collection (any pandas DataFrame column) on initialization—no external services involved.
- **Custom NLP Pipeline**: Leverages `query_processing()` for tokenization, normalization, stop-word removal and lemmatization to turn free-form queries into polished search inputs.
- **Fast Similarity Search**: Transforms queries into TF-IDF vectors and computes cosine similarity against your corpus to generate relevance scores in milliseconds.
- **Top-N Ranking**: Returns a configurable number of results, sorted by descending score, including document ID, metadata fields (e.g. Major, Course Title), cleaned description, and relevance score.
- **Persistence & Reuse**: Keeps the fitted `TfidfVectorizer` and index in memory or on disk (your choice), so subsequent searches skip re-fitting and stay consistent across runs.
- **Schema-Agnostic**: Point the system at any DataFrame and column name. No fixed schema is required, so text from PDFs, CSVs, or custom data sources can be indexed once it is loaded into a DataFrame.
- **Extensible Scoring**: Core TF-IDF + cosine similarity can be augmented with additional ranking algorithms (BM25, neural embeddings) as your needs evolve.
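
The indexing, similarity search, and Top-N ranking described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `TfidfSearcher`, `preprocess`, the tiny stop-word set, and the sample course data are all hypothetical names and values introduced here. The real `query_processing()` pipeline also lemmatizes, which this sketch leaves out.

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny illustrative stop-word list (assumption); a real pipeline would likely use NLTK's.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def preprocess(text: str) -> str:
    """Lowercase, tokenize on letters/digits, and drop stop words (no lemmatization)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


class TfidfSearcher:
    """Hypothetical wrapper: fits TF-IDF on one DataFrame column, ranks by cosine similarity."""

    def __init__(self, df: pd.DataFrame, text_column: str):
        self.df = df.reset_index(drop=True)
        self.vectorizer = TfidfVectorizer()
        cleaned = self.df[text_column].fillna("").map(preprocess)
        # The sparse document-term matrix acts as the in-memory index.
        self.matrix = self.vectorizer.fit_transform(cleaned)

    def search(self, query: str, top_n: int = 3) -> pd.DataFrame:
        # Project the cleaned query into the same TF-IDF space as the documents.
        query_vec = self.vectorizer.transform([preprocess(query)])
        scores = cosine_similarity(query_vec, self.matrix).ravel()
        ranked = scores.argsort()[::-1][:top_n]  # indices of the highest scores first
        results = self.df.iloc[ranked].copy()
        results["score"] = scores[ranked]
        return results[results["score"] > 0]  # drop documents sharing no terms


# Made-up sample data for illustration only.
courses = pd.DataFrame({
    "Course Title": ["Intro to Databases", "Machine Learning", "Organic Chemistry"],
    "Description": [
        "Relational models, SQL queries, and indexing of data.",
        "Supervised learning, neural networks, and model evaluation.",
        "Structure and reactions of carbon compounds.",
    ],
})

searcher = TfidfSearcher(courses, "Description")
hits = searcher.search("neural networks for learning")
print(hits[["Course Title", "score"]])
```

Because the vectorizer is fitted once in the constructor, each call to `search` only transforms the query and computes one similarity row, which is what keeps repeated searches cheap. Swapping in BM25 or embedding-based scoring would mean replacing the `fit_transform` and `cosine_similarity` steps while keeping the same ranking logic.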
## Application

## Installation
### Prerequisites
- **Docker**
### Steps
1. **Clone the repository**:
```bash
git clone https://github.com/vladimircuriel/information-retrieval-system
```
2. **Navigate to the project directory**:
```bash
cd information-retrieval-system
```
3. **Build and run the Docker image**:
```bash
docker build -t system:latest .
```
```bash
docker run -p 8501:8501 system:latest
```
4. **Access the application**:
Open your browser and visit `http://localhost:8501` to access the user interface.