https://github.com/alaazameldev/text-based-search-engine
Implementation of a search engine using TF-IDF and Word Embedding-based vectorization techniques for efficient document retrieval
chromadb fastapi gensim-word2vec nltk numpy precision-recall python scikit-learn tf-idf-vectorizer
- Host: GitHub
- URL: https://github.com/alaazameldev/text-based-search-engine
- Owner: alaazamelDev
- License: mit
- Created: 2024-08-12T14:04:00.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2024-09-22T16:45:10.000Z (about 1 month ago)
- Last Synced: 2024-10-10T08:01:29.311Z (27 days ago)
- Topics: chromadb, fastapi, gensim-word2vec, nltk, numpy, precision-recall, python, scikit-learn, tf-idf-vectorizer
- Language: Jupyter Notebook
- Homepage:
- Size: 1.15 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
- License: LICENSE
README
# Text-Based Search Engine Project
## Project Overview
This project, developed as an assignment for the Information Retrieval subject, demonstrates the implementation of search engines using two distinct techniques: TF-IDF based vectorization and embedding-based vectorization. Our goal is to showcase efficient and accurate document retrieval in response to user queries, highlighting the differences and advantages of each approach.
## Features
- Dual search engine implementation: TF-IDF and Word Embedding based
- Query suggestion functionality
- Document clustering and topic detection
- Similar document retrieval
- Efficient offline processing and fast online querying
## Technologies Used
- **Python**: Primary programming language
- **NumPy**: For numerical computations
- **Chroma DB**: Vector database for efficient similarity search
- **Gensim**: For Word2Vec model implementation
- **Scikit-learn**: For TF-IDF vectorization and other machine learning utilities
- **FastAPI**: For creating the web API
- **NLTK**: For text processing and tokenization
## Datasets
- **Antique**: A non-factoid question answering dataset [Link](https://ir-datasets.com/antique.html#antique/train)
- **Wikipedia**: A subset of Wikipedia articles [Link](https://ir-datasets.com/wikir.html#wikir/en1k/training)
## Process Workflow
### TF-IDF Based Search Engine
| Process | Description |
|---------|-------------|
| Offline Process | 1. Load and preprocess documents<br>2. Create vocabulary<br>3. Compute TF-IDF matrix<br>4. Store TF-IDF matrix and vocabulary |
| Online Process | 1. Receive user query<br>2. Preprocess query<br>3. Convert query to TF-IDF vector<br>4. Compute similarity with document vectors<br>5. Rank and return top results |
### Word2Vec Based Search Engine
| Process | Description |
|---------|-------------|
| Offline Process | 1. Load and preprocess documents<br>2. Train or load pre-trained Word2Vec model<br>3. Compute document embeddings<br>4. Store embeddings in Chroma DB |
| Online Process | 1. Receive user query<br>2. Preprocess query<br>3. Compute query embedding<br>4. Perform similarity search in Chroma DB<br>5. Rank and return top results |
## Implementation Details
### TF-IDF Based Vectorization
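As a concrete reference point for this section, here is a minimal, hedged sketch of TF-IDF ranking on a toy corpus. It is not the project's code: it uses only the Python standard library so the arithmetic stays visible, whereas the project lists scikit-learn for this step. The helper names (`tokenize`, `build_index`, `vectorize`, `cosine`, `search`), the whitespace tokenizer, and the smoothed IDF formula are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplistic stand-in for real preprocessing (assumption): lowercase + split.
    return text.lower().split()


def vectorize(tokens, idf):
    """Map tokens to a sparse {term: tf * idf} vector; unknown terms are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    return {t: count * idf[t] for t, count in tf.items()}


def build_index(docs):
    """Offline step: build the vocabulary, IDF weights, and document vectors."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (assumed weighting): ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [vectorize(tokens, idf) for tokens in tokenized]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, idf, vectors, top_k=3):
    """Online step: vectorize the query, score every document, return the best."""
    q = vectorize(tokenize(query), idf)
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Keep only documents that share at least one term with the query.
    return [(i, round(s, 3)) for s, i in scored[:top_k] if s > 0]


docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
]
idf, vectors = build_index(docs)
results = search("cat on a mat", idf, vectors)
print(results)
```

Only the first document is returned for this query: the second contains `cats` but not `cat`, which shows why purely lexical matching depends heavily on preprocessing such as stemming.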
The TF-IDF (Term Frequency-Inverse Document Frequency) approach involves:
- Creating a vocabulary from all documents
- Computing TF-IDF scores for each term in each document
- Representing documents and queries as TF-IDF vectors
- Using cosine similarity to find relevant documents
### Embedding-Based Vectorization
The Word Embedding approach involves:
- Using pre-trained or custom-trained Word2Vec models
- Representing words as dense vectors
- Computing document embeddings by averaging word vectors
- Using vector similarity in embedding space to find relevant documents
## Examples
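As a small code example of the embedding-based approach described under Implementation Details, the sketch below averages word vectors into document embeddings and ranks documents by cosine similarity. It is an illustration under stated assumptions, not the project's implementation: the three-dimensional vectors in `WORD_VECTORS` are hand-made stand-ins for a trained Word2Vec model, and the function names `embed`, `cosine`, and `rank` are hypothetical. Storage in Chroma DB is left out.

```python
import math

# Hand-made 3-dimensional "word vectors" for illustration only (assumption).
# A real run would look these up in a trained Word2Vec model instead.
WORD_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.0],
    "dog": [0.7, 0.3, 0.1],
    "stock": [0.0, 0.1, 0.9],
    "market": [0.1, 0.0, 0.8],
    "price": [0.0, 0.2, 0.7],
}


def embed(text):
    """Average the vectors of known words; return None if no word is known."""
    vecs = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return document indices ordered from most to least similar to the query."""
    q = embed(query)
    if q is None:
        return []
    scored = []
    for i, doc in enumerate(docs):
        d = embed(doc)
        if d is not None:
            scored.append((cosine(q, d), i))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored]


docs = ["cat dog", "stock market price"]
ranking = rank("kitten", docs)
print(ranking)
```

The query word `kitten` appears in neither document, yet the first document ranks highest because its averaged vector lies close to the query vector. A purely lexical TF-IDF match would return nothing for this query.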
| Query Suggestion | Query Result |
|------------------|--------------|
| ![Query Suggestion](query_suggestion.png) | ![Query Result](query_result.png) |

| Topic Detection | Similar Documents |
|-----------------|-------------------|
| ![Topic Detection](topic_detection.png) | ![Similar Documents](similar_documents.png) |
## Performance Comparison
| Metric | TF-IDF Based | Word Embedding Based |
|--------|--------------|----------------------|
| MAP | 54% | 70% |
| MRR | 63% | 80% |

The Word Embedding based approach shows superior performance in both Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR) metrics.
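For readers unfamiliar with the two metrics, the following sketch shows how MAP and MRR are commonly computed from ranked result lists and relevance judgments. The data is a toy example and does not reproduce the figures in the table; the function names and the choice to evaluate the full ranking without a cutoff are assumptions, since the README does not describe the evaluation setup.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k taken at each rank k where a relevant doc appears."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Normalize by the number of relevant documents for the query.
    return precision_sum / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs):
    """runs: list of (ranked_doc_ids, set_of_relevant_ids), one entry per query."""
    n = len(runs)
    map_score = sum(average_precision(r, rel) for r, rel in runs) / n
    mrr_score = sum(reciprocal_rank(r, rel) for r, rel in runs) / n
    return map_score, mrr_score


# Toy judgments (hypothetical document ids).
runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2, RR = 1
    (["d4", "d5", "d6"], {"d5"}),        # AP = 1/2, RR = 1/2
]
map_score, mrr_score = evaluate(runs)
print(f"MAP={map_score:.3f} MRR={mrr_score:.3f}")
```

MAP rewards placing all relevant documents near the top, while MRR only looks at the first relevant hit, which is why MRR is typically the higher of the two.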
## Additional Features
### Query Suggestion
![N-Grams](n_grams.png)
Our system generates query suggestions by:
1. Processing the user's input query
2. Generating word vectors using Word2Vec
3. Finding similar terms using cosine similarity
4. Ranking and presenting the top suggestions
### Document Clustering
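To make the K-Means step concrete, here is a minimal sketch that groups a handful of already vectorized documents. It is an illustration only: the two-dimensional points stand in for document vectors, the `kmeans` helper is hypothetical, and the deterministic "first k points" initialization is an assumption chosen for reproducibility (library implementations usually use smarter seeding). LDA topic modeling is not covered here.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    # Deterministic initialization (assumption): use the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # Assignments stopped changing, so the algorithm converged.
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy "document vectors": two documents per underlying topic.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Documents sharing a label fall into the same cluster. In practice the number of clusters `k` has to be chosen up front, for example by inspecting cluster quality for several values.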
We implement document clustering to group similar documents and identify topics:
- Using K-Means clustering algorithm
- Applying Latent Dirichlet Allocation (LDA) for topic modeling
## How to Use
[To be added in a future update]
## Documentation
For complete documentation of the project in Arabic, please refer to the following link:
[Arabic Documentation](https://docs.google.com/document/d/1Fool2lmw9wKLmy9dEnJKBvU3GGOd3ymOAxb2yPikYG8/edit?usp=sharing)
## Future Improvements
- Implement more advanced embedding models (e.g., BERT, GPT)
- Enhance query suggestion with user interaction data
- Improve clustering algorithms for better topic detection
- Optimize performance for larger datasets
## Contributors
- [Alaa Aldeen Zamel](https://github.com/alaazamelDev)
- Anas Rish
- Anas Durra
- Mohammed Hadi Barakat
- Mohammed Fares Dabbas
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.