https://github.com/pngo1997/inverted-index

Practices on Inverted Index algorithm.
information-retrieval inverted-index natural-language-processing python query-processing text-analysis text-processing tf-idf

Host: GitHub
URL: https://github.com/pngo1997/inverted-index
Owner: pngo1997
Created: 2024-03-22T19:38:14.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2025-01-31T19:12:11.000Z (over 1 year ago)
Last Synced: 2025-02-28T14:13:30.756Z (over 1 year ago)
Topics: information-retrieval, inverted-index, natural-language-processing, python, query-processing, text-analysis, text-processing, tf-idf
Language: Jupyter Notebook
Homepage:
Size: 1.63 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# 🔍 Information Retrieval: Positional Index, TF-IDF, and Inverted Index

## 📜 Overview
This project focuses on key **Information Retrieval (IR)** tasks, including:
1. **Positional Indexing** – Phrase query searching using a positional index.
2. **TF-IDF Weighting & Cosine Similarity** – Calculating term weights and identifying relevant documents.
3. **Inverted Index Construction** – Indexing TED Talk descriptions for efficient retrieval.

📌 **Datasets Used**:
- **Positional Index Sample Data** (Manually provided).
- **Document-Term Matrix** (For TF-IDF calculations).
- **TED Talks Dataset (`ted_main.csv`)** – Extracting and processing text descriptions.

📌 **Programming Language**: `Python 3`

📌 **Libraries Used**: `NLTK`, `NumPy`, `pandas`, `math`, `csv`

## 🏷️ 1️⃣ Positional Index & Phrase Query Matching
- **Queries Evaluated**:
  - `"fools rush in"`
  - `"fools rush in" AND "angels fear to tread"`
- **Task**: Identify matching documents and positions from a given **positional index**.
- **Issue Identified**: The index may have inconsistencies affecting search accuracy.
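
As a rough illustration of the phrase-matching task above, the following minimal sketch checks for consecutive positions in a positional index. The `sample_index` data and the `phrase_match` helper are hypothetical and are not taken from the notebook; the index layout (term → document ID → sorted positions) is an assumption.

```python
from typing import Dict, List

# Assumed layout: term -> {doc_id: sorted positions}. All data below is made up.
PositionalIndex = Dict[str, Dict[int, List[int]]]


def phrase_match(index: PositionalIndex, phrase: str) -> Dict[int, List[int]]:
    """Return {doc_id: [start positions]} where the phrase terms appear consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return {}
    # Only documents containing every term can match the phrase.
    docs = set(index[terms[0]])
    for t in terms[1:]:
        docs &= set(index[t])
    matches: Dict[int, List[int]] = {}
    for doc in sorted(docs):
        # A start position p matches if the k-th following term sits at p + k.
        starts = [
            p for p in index[terms[0]][doc]
            if all((p + k) in index[t][doc] for k, t in enumerate(terms[1:], start=1))
        ]
        if starts:
            matches[doc] = starts
    return matches


sample_index: PositionalIndex = {
    "fools": {1: [1, 17], 2: [5]},
    "rush": {1: [2, 30], 2: [9]},
    "in": {1: [3, 8], 2: [10]},
    "angels": {1: [40], 3: [2]},
    "fear": {1: [41], 3: [3]},
    "to": {1: [42], 3: [4]},
    "tread": {1: [43], 3: [5]},
}

first = phrase_match(sample_index, "fools rush in")
second = phrase_match(sample_index, "angels fear to tread")
# The AND of two phrase queries is the intersection of their matching documents.
both = sorted(set(first) & set(second))
print(first, second, both)
```

In this toy data, document 2 contains all three words of the first phrase but not at adjacent positions, so it is excluded. That is the kind of case a positional index is meant to distinguish.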

## 🔢 2️⃣ TF-IDF Computation & Cosine Similarity
- **Task**: Compute **TF-IDF** weights from the document-term matrix.
- **Cosine Similarity**: Compare term-weight vectors A and B with cos(θ) = (A • B) / (||A|| ||B||) to determine the most relevant document.
- **Goal**: Rank document similarity based on weighted term importance.
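
The weighting and ranking steps above could look roughly like the sketch below. The README does not state which TF-IDF variant the notebook uses, so `tf * log2(N / df)` is an assumption here, and the `counts` matrix and `query` vector are invented for illustration.

```python
import math
from typing import List


def tfidf(counts: List[List[int]]) -> List[List[float]]:
    """Rows are terms, columns are documents. Weight = tf * log2(N / df) (assumed variant)."""
    n_docs = len(counts[0])
    weights = []
    for row in counts:
        df = sum(1 for tf in row if tf > 0)  # number of documents containing the term
        idf = math.log2(n_docs / df) if df else 0.0
        weights.append([tf * idf for tf in row])
    return weights


def cosine(a: List[float], b: List[float]) -> float:
    """cos(theta) = (A . B) / (||A|| ||B||); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical raw term counts: 3 terms x 4 documents.
counts = [
    [2, 0, 1, 0],
    [0, 3, 0, 0],
    [1, 1, 0, 0],
]
weights = tfidf(counts)
# Each document vector is one column of the weighted matrix.
doc_vectors = [[row[j] for row in weights] for j in range(len(counts[0]))]

query = [1.0, 0.0, 1.0]  # hypothetical query vector in the same term space
scores = [cosine(query, d) for d in doc_vectors]
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(ranking)
```

The zero-norm guard matters for documents that contain none of the indexed terms, which would otherwise cause a division by zero.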

## 🔎 3️⃣ Inverted Index Construction (TED Dataset)
- **Preprocessing**: Tokenization, Lowercasing, Stopword Removal, Stemming.
- **Index Output Files**:
  1. `TED_term_index.csv` – Term-to-ID mapping with document frequency.
  2. `TED_doc_index.csv` – Document-to-ID mapping (TED Talk URLs).
  3. `TED_inverted_index.csv` – Term ID → (Doc ID, Term Frequency).
- **Query Processing**: Boolean AND retrieval for:
  - `'climate' AND 'change'`
  - `'climate' AND 'fuel'`
  - `'artificial' AND 'intelligence'`
  - `'giant' AND 'troll'`
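
The index construction and Boolean AND retrieval described above can be sketched with the standard library alone. This is a simplified illustration, not the notebook's code: the `docs` texts are invented stand-ins for TED Talk descriptions, `STOPWORDS` is a tiny hand-picked set standing in for a full stopword list, and stemming is skipped so the query terms stay readable.

```python
import re
from collections import Counter
from typing import Dict, List

# Small stand-in stopword set (assumption); the project lists NLTK among its libraries.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "can", "could"}


def preprocess(text: str) -> List[str]:
    """Lowercase, tokenize on letters, and drop stopwords. Stemming is omitted in this sketch."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_index(docs: Dict[int, str]) -> Dict[str, Dict[int, int]]:
    """Map each term to {doc_id: term frequency}."""
    index: Dict[str, Dict[int, int]] = {}
    for doc_id, text in docs.items():
        for term, tf in Counter(preprocess(text)).items():
            index.setdefault(term, {})[doc_id] = tf
    return index


def boolean_and(index: Dict[str, Dict[int, int]], *terms: str) -> List[int]:
    """Return the sorted IDs of documents that contain every query term."""
    postings = [set(index.get(t.lower(), {})) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical descriptions, made up for illustration.
docs = {
    1: "Climate change is reshaping the climate of coastal cities.",
    2: "A new fuel could cut the climate impact of flying.",
    3: "Artificial intelligence can help model climate change.",
}
index = build_index(docs)

print(boolean_and(index, "climate", "change"))
print(boolean_and(index, "climate", "fuel"))
print(boolean_and(index, "artificial", "intelligence"))
print(boolean_and(index, "giant", "troll"))
```

The document frequency of a term is the length of its postings dictionary, for example `len(index["climate"])`. If stemming were added to `preprocess`, the query terms would need to pass through the same step before lookup.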
