https://github.com/vladimircuriel/information-retrieval-system

An NLP-based information retrieval system that indexes document collections, understands natural language queries, and returns relevance-ranked results using modern ranking algorithms.

ai nlp nltk sklearn spicy streamlit


Host: GitHub
URL: https://github.com/vladimircuriel/information-retrieval-system
Owner: vladimircuriel
License: other
Created: 2025-09-07T20:18:54.000Z (10 months ago)
Default Branch: main
Last Pushed: 2025-09-08T05:54:04.000Z (10 months ago)
Last Synced: 2025-09-08T07:23:56.272Z (10 months ago)
Topics: ai, nlp, nltk, sklearn, spicy, streamlit
Language: Python
Homepage:
Size: 746 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


README

Classes Information Retrieval System

---

**Information Retrieval System** is a Python information-retrieval engine that indexes entire document collections, interprets free-form queries, and returns relevance-ranked results, all locally and with no external service required.

Unlike a simple CSV search that scans each row for literal query matches, the IR model builds a TF-IDF vector index and applies NLP preprocessing (tokenization, lemmatization, stop-word removal). Queries and documents are then compared in a high-dimensional vector space using cosine similarity, which yields fast, relevance-ranked results rather than unranked, exact-match hits. The same design leaves room for more advanced ranking algorithms such as BM25 or neural embeddings.
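To make the contrast with literal matching concrete, here is a minimal from-scratch sketch of TF-IDF weighting and cosine similarity in plain Python. It is an illustration only: the project itself relies on scikit-learn's `TfidfVectorizer`, whose weighting details may differ, and the helper names (`tokenize`, `build_index`, `vectorize`, `cosine`, `search`) and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive stand-in for the real NLP pipeline: lowercase and split on whitespace.
    return text.lower().split()


def vectorize(tokens, idf):
    # Sparse TF-IDF vector as a dict; terms unknown to the index are ignored.
    counts = Counter(t for t in tokens if t in idf)
    return {term: tf * idf[term] for term, tf in counts.items()}


def build_index(docs):
    """Return (idf, doc_vectors) for a list of raw document strings."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, so terms found in every document keep a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return idf, [vectorize(tokens, idf) for tokens in tokenized]


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, idf, vectors):
    # Score every document, drop zero scores, sort by descending relevance.
    q = vectorize(tokenize(query), idf)
    scored = [(i, cosine(q, v)) for i, v in enumerate(vectors)]
    return sorted((p for p in scored if p[1] > 0), key=lambda p: p[1], reverse=True)


# Made-up sample corpus, for illustration only.
docs = [
    "introduction to machine learning",
    "machine learning for natural language processing",
    "organic chemistry laboratory",
]
query = "learning language"

# A literal substring scan finds nothing because the words are not adjacent.
literal_hits = [i for i, d in enumerate(docs) if query in d]

idf, vectors = build_index(docs)
results = search(query, idf, vectors)

print("literal hits:", literal_hits)
for doc_id, score in results:
    print(doc_id, round(score, 3), docs[doc_id])
```

The literal scan returns no rows, while the vector comparison ranks the document that shares both query terms above the one that shares only one, and omits the unrelated document entirely.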

## Table of Contents

- [Features](#features)
- [Application](#application)
- [Installation](#installation)

## Features

- **Local Indexing**: Builds a TF-IDF vector index over your document collection (any pandas DataFrame column) on initialization—no external services involved.
- **Custom NLP Pipeline**: Leverages `query_processing()` for tokenization, normalization, stop-word removal and lemmatization to turn free-form queries into polished search inputs.
- **Fast Similarity Search**: Transforms queries into TF-IDF vectors and computes cosine similarity against your corpus to generate relevance scores in milliseconds.
- **Top-N Ranking**: Returns a configurable number of results, sorted by descending score, including document ID, metadata fields (e.g. Major, Course Title), cleaned description, and relevance score.
- **Persistence & Reuse**: Keeps the fitted `TfidfVectorizer` and index in memory or on disk (your choice), so subsequent searches skip re-indexing and stay consistent across runs.
- **Schema-Agnostic**: Simply point the system at any DataFrame and column name. No fixed schema is required, which makes it easy to index text loaded from PDFs, CSVs, or custom data sources.
- **Extensible Scoring**: Core TF-IDF + cosine similarity can be augmented with additional ranking algorithms (BM25, neural embeddings) as your needs evolve.
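The features above can be sketched together in a few lines with pandas and scikit-learn. This is a hedged illustration, not the repository's code: the class name `TfidfSearcher`, its methods, and the sample course data are hypothetical, and scikit-learn's built-in English stop-word list stands in for the project's `query_processing()` pipeline, whose signature is not shown in this README.

```python
import pickle

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class TfidfSearcher:
    """Hypothetical wrapper: index one text column of any DataFrame."""

    def __init__(self, frame, text_column):
        # Reset the index so row positions line up with rows of the TF-IDF matrix.
        self.frame = frame.reset_index(drop=True)
        self.text_column = text_column
        # Assumption: built-in stop words replace the project's own preprocessing.
        self.vectorizer = TfidfVectorizer(stop_words="english")
        self.matrix = self.vectorizer.fit_transform(self.frame[text_column])

    def search(self, query, top_n=5):
        # Project the query into the fitted vocabulary and score every document.
        query_vector = self.vectorizer.transform([query])
        scores = cosine_similarity(query_vector, self.matrix).ravel()
        ranked = self.frame.assign(score=scores)
        ranked = ranked[ranked["score"] > 0]
        return ranked.sort_values("score", ascending=False).head(top_n)

    def to_bytes(self):
        # Persist the fitted vectorizer, matrix, and data as one pickle payload.
        return pickle.dumps(self.__dict__)

    @classmethod
    def from_bytes(cls, payload):
        # Rebuild a searcher without refitting the vectorizer.
        obj = cls.__new__(cls)
        obj.__dict__.update(pickle.loads(payload))
        return obj


# Made-up sample data; column names follow the metadata fields mentioned above.
courses = pd.DataFrame(
    {
        "Major": ["Computer Science", "Computer Science", "Chemistry"],
        "Course Title": [
            "Intro to Machine Learning",
            "Natural Language Processing",
            "Organic Chemistry Lab",
        ],
        "Description": [
            "Supervised learning, regression and classification models.",
            "Text processing, language models and information retrieval.",
            "Synthesis and purification of organic compounds.",
        ],
    }
)

searcher = TfidfSearcher(courses, "Description")
hits = searcher.search("language models for text retrieval", top_n=2)
print(hits[["Major", "Course Title", "score"]])

# Round-trip through bytes to show reuse without re-indexing.
restored = TfidfSearcher.from_bytes(searcher.to_bytes())
print(restored.search("organic synthesis", top_n=1)[["Course Title", "score"]])
```

Writing the bytes from `to_bytes()` to a file would give on-disk persistence. Only unpickle payloads you created yourself, since `pickle` can execute arbitrary code when loading untrusted data.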

## Application

![ScreenShot - 12AM-48M@2x](https://github.com/user-attachments/assets/1496d230-0226-4807-bb54-127721c04d10)

## Installation

### Prerequisites

- **Docker**

### Steps

1. **Clone the repository**:

```bash
git clone https://github.com/vladimircuriel/information-retrieval-system
```

2. **Navigate to the project directory**:

```bash
cd information-retrieval-system
```

3. **Build the image and start the container**:

```bash
docker build -t system:latest .
```

```bash
docker run -p 8501:8501 system:latest
```

4. **Access the application**:

Open your browser and visit `http://localhost:8501` to access the user interface.
