Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/youssef-saaed/simple-search-engine-for-information-retrieval
This repository hosts the implementation of a Simple Search Engine designed for efficient information retrieval. The project encompasses several stages from data collection to evaluation, ensuring a comprehensive approach to search and retrieval.
information-retrieval nlp pyterrier search-engine
- Host: GitHub
- URL: https://github.com/youssef-saaed/simple-search-engine-for-information-retrieval
- Owner: youssef-saaed
- License: mit
- Created: 2024-06-09T17:27:57.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-06-09T17:31:09.000Z (8 months ago)
- Last Synced: 2024-06-09T18:55:02.430Z (8 months ago)
- Topics: information-retrieval, nlp, pyterrier, search-engine
- Language: Jupyter Notebook
- Size: 1.83 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
- License: LICENSE
# Simple Search Engine for Information Retrieval
This repository hosts the implementation of a Simple Search Engine designed for efficient information retrieval. The project encompasses several stages from data collection to evaluation, ensuring a comprehensive approach to search and retrieval.
## Data Collection
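The parsing step in this section can be sketched roughly as follows. This is a minimal illustration, not the repository's code: it assumes the usual CISI layout in which `.I` starts a record and `.T`, `.A`, `.W`, and `.X` introduce the title, author, text, and cross-reference fields. The function names, the CSV columns, and the sample record are all hypothetical.

```python
import csv
import io

FIELD_MARKERS = {".T": "title", ".W": "text"}   # fields we keep
ALL_MARKERS = (".T", ".A", ".B", ".W", ".X")    # fields we recognise

def parse_cisi(raw):
    """Parse CISI-style records into a list of dicts (doc_id, title, text)."""
    docs, current, field = [], None, None
    for line in raw.splitlines():
        if line.startswith(".I "):
            current = {"doc_id": int(line[3:].strip()), "title": "", "text": ""}
            docs.append(current)
            field = None
        elif line.strip() in ALL_MARKERS:
            # None means "ignore the lines of this field" (author, cross-refs)
            field = FIELD_MARKERS.get(line.strip())
        elif current is not None and field is not None:
            current[field] = (current[field] + " " + line.strip()).strip()
    return docs

def to_csv(docs):
    """Serialise the parsed records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["doc_id", "title", "text"])
    writer.writeheader()
    writer.writerows(docs)
    return buffer.getvalue()

# Made-up record in the assumed format, for illustration only.
sample = """.I 1
.T
A sample title
.A
Some Author
.W
First line of the abstract.
Second line.
.X
1 5 1
"""
docs = parse_cisi(sample)
csv_text = to_csv(docs)
```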
- **CISI Test Collection**: After exploring various test collections, the CISI test collection was chosen for its relevance and comprehensiveness.
- **Parsing to CSV**: Documents, qrels, and topics files were parsed and stored in CSV format for ease of handling and processing.
## Preprocessing
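As a rough sketch of the pipeline this section describes (tokenization, case folding, stopword removal, stemming), the example below uses only the standard library so that it runs without downloads. The project itself relies on `nltk`; the tiny stopword set and the suffix-stripping `crude_stem` here are simplified stand-ins for nltk's stopword list and a real stemmer, and all names are hypothetical.

```python
import re

# Tiny illustrative stopword list (assumption); nltk ships a much fuller one.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}

def crude_stem(token):
    """Strip a few common suffixes. A stand-in for a real stemmer, not Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Case-fold, tokenize, drop stopwords, then stem each remaining token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

tokens = preprocess("The engines are indexing the documents")
# tokens == ["engin", "index", "document"]
```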
- **Text Processing Function**: A specialized function was created to perform tokenization, case folding, stemming, and removing stopwords using the `nltk` library.
- **Application**: This preprocessing function was applied to all documents to standardize and prepare the data for indexing.
## Indexing
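The two dictionaries described in this section might look like the sketch below. It is an assumption about their shape based on the descriptions here, not the repository's implementation, and `build_indexes` is a hypothetical name.

```python
from collections import defaultdict

def build_indexes(docs):
    """Build both dictionaries from docs, a mapping doc_id -> preprocessed tokens."""
    postings = defaultdict(set)                          # word -> {doc_id, ...}
    frequencies = defaultdict(lambda: defaultdict(int))  # word -> {doc_id: count}
    for doc_id, tokens in docs.items():
        for token in tokens:
            postings[token].add(doc_id)
            frequencies[token][doc_id] += 1
    return postings, frequencies

docs = {1: ["search", "engin", "search"], 2: ["index", "engin"]}
postings, frequencies = build_indexes(docs)
```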
- **DFIndexer**: Used PyTerrier's `DFIndexer` to build an index over the document corpus.
- **Word-Document Dictionary**: Created a dictionary where each word is associated with a set of documents containing that word.
- **Frequency Dictionary**: Formulated a secondary dictionary mapping each word to another dictionary that records the frequency of the word in each document.
## Query Processing
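To make the TF-IDF step in this section concrete, here is a minimal sketch that scores documents against a word-to-frequency dictionary. It assumes one common variant, raw term count times `log(N / df)`; the README does not say which weighting formula the project uses, and `tfidf_search` is a hypothetical name.

```python
import math
from collections import defaultdict

def tfidf_search(query_tokens, frequencies, num_docs):
    """Rank documents by the summed TF-IDF weight of the query terms.

    frequencies maps word -> {doc_id: term count}.
    """
    scores = defaultdict(float)
    for term in query_tokens:
        doc_counts = frequencies.get(term, {})
        if not doc_counts:
            continue  # term does not occur in the corpus
        idf = math.log(num_docs / len(doc_counts))
        for doc_id, tf in doc_counts.items():
            scores[doc_id] += tf * idf
    # Highest score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

frequencies = {"search": {1: 2}, "engin": {1: 1, 2: 1}, "index": {2: 1, 3: 1}}
ranking = tfidf_search(["search", "index"], frequencies, num_docs=3)
```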
- **Preprocessing Queries**: User queries are preprocessed using the same function applied to documents.
- **TF-IDF Weighting**: Employed TF-IDF weighting to retrieve relevant results for user queries.
## Query Expansion
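The idea behind RM3-style expansion can be sketched as below: estimate a term distribution from the top-ranked feedback documents, keep the strongest terms, and interpolate them with the original query. This is a simplified illustration under stated assumptions, not the model the project runs: it uses normalized first-pass retrieval scores in place of query likelihoods, and the parameters `fb_docs`, `fb_terms`, and `lam` are hypothetical names. It expects a non-empty query and non-empty feedback documents.

```python
from collections import Counter, defaultdict

def rm3_expand(query_tokens, ranked_docs, doc_tokens, fb_docs=2, fb_terms=3, lam=0.5):
    """Return an expanded query as {term: weight}; the weights sum to 1.

    ranked_docs is a list of (doc_id, score) from a first retrieval pass;
    doc_tokens maps doc_id -> list of preprocessed tokens.
    """
    top = ranked_docs[:fb_docs]
    total_score = sum(score for _, score in top) or 1.0
    relevance = defaultdict(float)
    for doc_id, score in top:
        counts = Counter(doc_tokens[doc_id])
        length = sum(counts.values())
        for term, count in counts.items():
            # P(term | doc), weighted by the document's share of the total score
            relevance[term] += (count / length) * (score / total_score)
    best = dict(sorted(relevance.items(), key=lambda kv: kv[1], reverse=True)[:fb_terms])
    norm = sum(best.values()) or 1.0
    # Original query model: relative frequency of each query term
    expanded = {t: (1 - lam) * c / len(query_tokens)
                for t, c in Counter(query_tokens).items()}
    for term, weight in best.items():
        expanded[term] = expanded.get(term, 0.0) + lam * weight / norm
    return expanded

doc_tokens = {1: ["search", "engin"], 2: ["search", "rank"]}
expanded = rm3_expand(["search"], [(1, 2.0), (2, 1.0)], doc_tokens)
```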
- **RM3 Model**: Expanded user queries using the RM3 model to enhance the search results.
- **ELMo Re-ranking**: Re-ranked the retrieved documents using ELMo embeddings to improve their relevance ordering.
## User Interface (UI)
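The figures the results page reports (document count and elapsed time) could be gathered with a small wrapper like the one below, which a Flask view might call before rendering its template. This is only a sketch: `timed_search` and `fake_search` are hypothetical names, and the Flask wiring itself is left out.

```python
import time

def timed_search(search_fn, query):
    """Run search_fn(query) and return the results, their count, and elapsed seconds."""
    start = time.perf_counter()
    results = search_fn(query)
    elapsed = time.perf_counter() - start
    return {"results": results, "count": len(results), "seconds": elapsed}

def fake_search(query):
    """Hypothetical stand-in for the real search function."""
    return [doc_id for doc_id in range(1, 6) if doc_id % 2]

summary = timed_search(fake_search, "information retrieval")
```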
- **Flask Web App**: Developed a simple Flask web application with two pages: one featuring a search bar and button, and another to display the search results.
- **Performance Metrics**: The UI shows the number of documents retrieved and the time taken for the search process.
## Evaluation
- **Search Engine Assessment**: Evaluated the search engine's performance using the qrels and topics parsed during the data collection stage.
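
The README does not say which metrics were reported. As one plausible illustration of scoring ranked results against qrels, the sketch below computes precision at k and mean average precision. The function names and the toy `qrels` and `runs` data are hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k

def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0

# Toy data: qrels maps query id -> relevant doc ids, runs maps query id -> ranking.
qrels = {"q1": {1, 3}, "q2": {2}}
runs = {"q1": [1, 2, 3], "q2": [1, 2]}
mean_ap = sum(average_precision(runs[q], qrels[q]) for q in qrels) / len(qrels)
```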