https://github.com/mohammad-malik/wikipedia-naive-search
This repository houses a naïve search engine utilising MapReduce, with a 5 GB CSV file as its dataset. It uses the Vector Space Model for information retrieval and was developed as part of an assignment for the course Fundamentals of Big Data Analytics (DS2004).
- Host: GitHub
- URL: https://github.com/mohammad-malik/wikipedia-naive-search
- Owner: mohammad-malik
- Created: 2024-03-16T08:53:54.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-23T10:02:31.000Z (6 months ago)
- Last Synced: 2025-01-29T08:43:20.409Z (4 months ago)
- Topics: indexing, mapreduce, python, search-engine, university-assignment, university-course
- Language: Python
- Size: 992 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
README
# Hadoop MapReduce Naive Search Engine
This repository houses a basic search engine implementation utilizing Hadoop's MapReduce framework to process an extensive text corpus efficiently.
The dataset used for this project is a subset of the English Wikipedia dump. It is 5.2 GB in total.
The project focuses on implementing a naive search algorithm to address challenges in information retrieval.

## Dataset Preparation:
We started by dividing the 5 GB Wikipedia dataset into smaller, manageable chunks to facilitate easier processing and analysis. This split was only temporary; the full dataset is used by the final search engine.
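As an illustration, a split like this can be done with pandas (one of the listed dependencies); the file name, chunk size, and output layout below are assumptions, not taken from the repository:

```python
import os

import pandas as pd

# Stream the large CSV in fixed-size chunks instead of loading 5 GB at once.
# "wiki_dump.csv", the chunk size, and the output layout are illustrative.
os.makedirs("chunks", exist_ok=True)
reader = pd.read_csv("wiki_dump.csv", chunksize=100_000)

for i, chunk in enumerate(reader):
    # Each chunk becomes its own file so it can be processed independently.
    chunk.to_csv(os.path.join("chunks", f"wiki_part_{i:04d}.csv"), index=False)
```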
## Data Preprocessing:
Our code cleans and standardizes the text data, removing stopwords and normalizing terms for consistency across the dataset.
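A minimal sketch of this kind of cleaning using NLTK (another listed dependency); the exact tokenization rules and the choice of Porter stemming for normalization are assumptions:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text: str) -> list[str]:
    # Lowercase and strip everything except letters and whitespace.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Drop stopwords, then normalize each remaining term by stemming.
    return [STEMMER.stem(tok) for tok in text.split() if tok not in STOPWORDS]

print(preprocess("The Quick brown foxes are running!"))  # ['quick', 'brown', 'fox', 'run']
```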
## TF-IDF Score Calculation:
The implementation calculates Term Frequency (TF) and Inverse Document Frequency (IDF) scores to evaluate the importance of words within documents relative to the entire dataset.
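As a concrete illustration of the standard weighting, TF counts a term's occurrences in a document and IDF is log(N / df(t)) over N documents. A toy computation follows; the corpus contents and the IDF variant are illustrative, since the README does not specify which variant the repository uses:

```python
import math
from collections import Counter

# Toy corpus standing in for preprocessed Wikipedia articles (illustrative).
docs = {
    "d1": ["big", "data", "search"],
    "d2": ["data", "engine"],
    "d3": ["search", "engine", "rank"],
}

N = len(docs)
# Document frequency: number of documents containing each term.
df = Counter(term for tokens in docs.values() for term in set(tokens))
# One common IDF variant; the repository may use another.
idf = {term: math.log(N / count) for term, count in df.items()}

def tf_idf(doc_id: str) -> dict[str, float]:
    # Weight each term in the document by raw count times its IDF.
    tf = Counter(docs[doc_id])
    return {term: count * idf[term] for term, count in tf.items()}

print(tf_idf("d1"))
```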
It then uses a Vector Space Model implementation, representing both documents and queries as vectors so that similarities between them can be measured for ranking purposes.

### Developing the Search Engine with MapReduce:
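This README does not spell out the repository's actual job layout, so the following is only a plausible sketch: a Hadoop Streaming mapper and reducer that together compute document frequency. The file names and the `doc_id<TAB>text` input record format are assumptions.

```python
#!/usr/bin/env python3
# mapper.py -- emits one (term, doc_id) pair per distinct term in a document.
# Input record format "doc_id<TAB>text" is an assumption.
import sys

for line in sys.stdin:
    doc_id, _, text = line.partition("\t")
    for term in set(text.split()):  # set(): count each document once per term
        print(f"{term}\t{doc_id}")
```

And the matching reducer, which relies on Hadoop Streaming delivering its input sorted by key:

```python
#!/usr/bin/env python3
# reducer.py -- counts distinct documents per term (document frequency).
import sys

current_term, count = None, 0
for line in sys.stdin:
    term, _, _doc_id = line.rstrip("\n").partition("\t")
    if term != current_term:
        if current_term is not None:
            print(f"{current_term}\t{count}")
        current_term, count = term, 0
    count += 1
if current_term is not None:
    print(f"{current_term}\t{count}")
```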
## Dependencies:
To run this implementation of a Hadoop-MapReduce Search Engine, you'll need the following:
- **Apache Hadoop** [(install)](https://hadoop.apache.org/releases.html)
- **Python** [(install)](https://www.python.org/downloads/)
- **NLTK** [(install)](https://www.nltk.org/)
- **pandas** [(install)](https://pandas.pydata.org/docs/getting_started/install.html)
- **numpy** [(install)](https://numpy.org/)
- **Dataset link** [Download Dataset](https://drive.google.com/file/d/1lGVGqzF5CNWaoV-zoz8_mlThvHwMgcsP/view?usp=sharing)
Ensure this software and these libraries are installed on your system before proceeding.
## Features
- Efficient Indexing: Uses MapReduce tasks to analyze the entire corpus, generate unique word IDs, calculate Inverse Document Frequency (IDF), and create a consolidated vocabulary.
- Vectorized Representation: The Indexer computes a machine-readable representation of the entire document corpus using TF-IDF weighting.
- Relevance Analysis: The Ranker Engine builds a vectorized representation of each user query and calculates a relevance function between the query and each document, returning a list of documents sorted by relevance score (see the similarity sketch below).
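The README does not name the relevance function; cosine similarity over TF-IDF vectors is the usual choice for the Vector Space Model, so a hedged sketch might look like this (the sparse-dict vector representation is an assumption):

```python
import math

def cosine_similarity(query_vec: dict[str, float], doc_vec: dict[str, float]) -> float:
    # Dot product over shared terms divided by the product of the vector norms.
    dot = sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

def rank(query_vec: dict[str, float], doc_vectors: dict[str, dict[str, float]]):
    # Score every document against the query, highest relevance first.
    scores = {d: cosine_similarity(query_vec, v) for d, v in doc_vectors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```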
## Team:
- **Manal Aamir**: [GitHub](https://github.com/manal-aamir)
- **Mohammad Malik**: [GitHub](https://github.com/mohammad-malik)
- **Aqsa Fayaz**: [GitHub](https://github.com/Aqsa-Fayaz)