https://github.com/pngo1997/inverted-index
Practice exercises on the inverted index algorithm.
- Host: GitHub
- URL: https://github.com/pngo1997/inverted-index
- Owner: pngo1997
- Created: 2024-03-22T19:38:14.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2025-01-31T19:12:11.000Z (over 1 year ago)
- Last Synced: 2025-02-28T14:13:30.756Z (over 1 year ago)
- Topics: information-retrieval, inverted-index, natural-language-processing, python, query-processing, text-analysis, text-processing, tf-idf
- Language: Jupyter Notebook
- Homepage:
- Size: 1.63 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# 🔍 Information Retrieval: Positional Index, TF-IDF, and Inverted Index
## 📜 Overview
This project focuses on key **Information Retrieval (IR)** tasks, including:
1. **Positional Indexing** – Phrase query searching using a positional index.
2. **TF-IDF Weighting & Cosine Similarity** – Calculating term weights and identifying relevant documents.
3. **Inverted Index Construction** – Indexing TED Talk descriptions for efficient retrieval.
📌 **Datasets Used**:
- **Positional Index Sample Data** (Manually provided).
- **Document-Term Matrix** (For TF-IDF calculations).
- **TED Talks Dataset (`ted_main.csv`)** – Extracting and processing text descriptions.
📌 **Programming Language**: `Python 3`
📌 **Libraries Used**: `NLTK`, `NumPy`, `pandas`, `math`, `csv`
## 🏷️ 1️⃣ Positional Index & Phrase Query Matching
- **Queries Evaluated**:
  - `"fools rush in"`
  - `"fools rush in" AND "angels fear to tread"`
- **Task**: Identify matching documents and positions from a given **positional index**.
- **Issue Identified**: The index may have inconsistencies affecting search accuracy.
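
The phrase matching described above can be sketched as follows. This is a minimal illustration, not the project's code: the `toy_index` data, the `term -> {doc_id: [positions]}` layout, and the function names `phrase_positions` and `and_query` are all assumptions made for the example.

```python
from typing import Dict, List

# Assumed layout: term -> {doc_id: sorted list of token positions}.
PositionalIndex = Dict[str, Dict[int, List[int]]]

# Hypothetical toy data, invented for this sketch (not the project's sample index).
toy_index: PositionalIndex = {
    "fools": {1: [1, 17], 2: [5]},
    "rush": {1: [2, 30], 2: [6]},
    "in": {1: [3, 31], 2: [9]},
    "angels": {1: [40]},
    "fear": {1: [41]},
    "to": {1: [42]},
    "tread": {1: [43]},
}


def phrase_positions(index: PositionalIndex, phrase: str) -> Dict[int, List[int]]:
    """Return {doc_id: [start positions]} where the phrase terms are adjacent."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return {}
    # Only documents containing every term can match the phrase.
    docs = set(index[terms[0]])
    for t in terms[1:]:
        docs &= set(index[t])
    matches: Dict[int, List[int]] = {}
    for doc in sorted(docs):
        later = [set(index[t][doc]) for t in terms[1:]]
        # A start position p matches if term k+1 appears at p + k + 1.
        starts = [
            p for p in index[terms[0]][doc]
            if all(p + k + 1 in later[k] for k in range(len(later)))
        ]
        if starts:
            matches[doc] = starts
    return matches


def and_query(index: PositionalIndex, phrases: List[str]) -> List[int]:
    """Return doc IDs that contain every phrase (Boolean AND of phrase queries)."""
    results = [set(phrase_positions(index, p)) for p in phrases]
    return sorted(set.intersection(*results)) if results else []


print(phrase_positions(toy_index, "fools rush in"))
print(and_query(toy_index, ["fools rush in", "angels fear to tread"]))
```

In the toy data, document 2 contains all three words of the first phrase but not at consecutive positions, so only document 1 matches. A check like this is also one way to spot inconsistent postings in a hand-built index.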
## 🔢 2️⃣ TF-IDF Computation & Cosine Similarity
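- **Task**: Compute **TF-IDF** weights from the document-term matrix.
- **Cosine Similarity**: Identify the most relevant document using $\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$, where $A$ and $B$ are TF-IDF vectors.
- **Goal**: Rank documents by similarity, based on weighted term importance.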
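
As a rough illustration of the two steps above, the sketch below weights a small term-by-document count matrix and compares document vectors. It assumes raw term frequency multiplied by `log10(N / df)`; the README does not state which TF-IDF variant the notebook uses, so treat the weighting, the `counts` matrix, and the helper names as assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]


def tfidf_matrix(counts: Matrix) -> Matrix:
    """Weight a term-by-document count matrix (rows = terms, columns = docs).

    Assumed variant: tf * log10(N / df). Other variants exist.
    """
    n_docs = len(counts[0])
    weights: Matrix = []
    for row in counts:
        df = sum(1 for c in row if c > 0)  # number of docs containing the term
        idf = math.log10(n_docs / df) if df else 0.0
        weights.append([c * idf for c in row])
    return weights


def doc_vector(weights: Matrix, j: int) -> List[float]:
    """Extract column j, i.e. the TF-IDF vector of document j."""
    return [row[j] for row in weights]


def cosine(a: List[float], b: List[float]) -> float:
    """cos(theta) = (A . B) / (||A|| ||B||); returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical counts: 3 terms (rows) by 3 documents (columns).
counts = [
    [2, 0, 1],
    [0, 3, 0],
    [1, 1, 1],
]
weights = tfidf_matrix(counts)
d0, d1, d2 = (doc_vector(weights, j) for j in range(3))
print(cosine(d0, d2), cosine(d0, d1))
```

The third term occurs in every document, so its IDF is zero and it contributes nothing to any vector. Documents 0 and 2 then point in the same direction, while documents 0 and 1 share no weighted terms.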
## 🔎 3️⃣ Inverted Index Construction (TED Dataset)
- **Preprocessing**: Tokenization, Lowercasing, Stopword Removal, Stemming.
- **Index Output Files**:
  1. `TED_term_index.csv` – Term-to-ID mapping with document frequency.
  2. `TED_doc_index.csv` – Document-to-ID mapping (TED Talk URLs).
  3. `TED_inverted_index.csv` – Term ID → (Doc ID, Term Frequency).
- **Query Processing**: Boolean AND retrieval for:
  - `'climate' AND 'change'`
  - `'climate' AND 'fuel'`
  - `'artificial' AND 'intelligence'`
  - `'giant' AND 'troll'`
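
The indexing and Boolean AND retrieval outlined in this section might look roughly like the sketch below. To stay dependency-free it uses a tiny hard-coded stopword set and skips stemming, whereas the project lists NLTK for preprocessing. The `docs` strings are made up for the example and are not taken from `ted_main.csv`, and all function names are hypothetical.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List

# Tiny stand-in stopword list (assumption); the project lists NLTK instead.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "for"}


def preprocess(text: str) -> List[str]:
    """Lowercase, tokenize on letters, and drop stopwords (no stemming here)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_index(docs: Dict[int, str]) -> Dict[str, Dict[int, int]]:
    """Build term -> {doc_id: term frequency} from {doc_id: text}."""
    index: Dict[str, Dict[int, int]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(preprocess(text)).items():
            index[term][doc_id] = tf
    return dict(index)


def boolean_and(index: Dict[str, Dict[int, int]], *terms: str) -> List[int]:
    """Return sorted doc IDs whose postings contain every query term."""
    postings = [set(index.get(t.lower(), {})) for t in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Made-up descriptions for illustration only.
docs = {
    0: "Climate change is the challenge of our time",
    1: "A giant troll and the future of fuel",
    2: "Fuel choices drive climate change",
}
index = build_index(docs)

print(boolean_and(index, "climate", "change"))
print(boolean_and(index, "climate", "fuel"))
print(boolean_and(index, "artificial", "intelligence"))
print(boolean_and(index, "giant", "troll"))
```

With this layout, a term's document frequency is `len(index[term])`, which is the quantity a term index file would store next to each term ID. If stemming is applied at index time, query terms need the same stemming before lookup.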