{"id":15157798,"url":"https://github.com/alaazameldev/text-based-search-engine","last_synced_at":"2026-01-20T20:11:15.440Z","repository":{"id":253409833,"uuid":"841488632","full_name":"alaazamelDev/text-based-search-engine","owner":"alaazamelDev","description":"Implementation of a search engine using TF-IDF and Word Embedding-based vectorization techniques for efficient document retrieval","archived":false,"fork":false,"pushed_at":"2024-09-22T16:45:10.000Z","size":1211,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-31T02:02:42.217Z","etag":null,"topics":["chromadb","fastapi","gensim-word2vec","nltk","numpy","precision-recall","python","scikit-learn","tf-idf-vectorizer"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alaazamelDev.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-12T14:04:00.000Z","updated_at":"2024-09-22T16:45:13.000Z","dependencies_parsed_at":"2024-08-16T15:02:14.482Z","dependency_job_id":"d951857c-93f2-4037-b97b-03ef5c69b2e9","html_url":"https://github.com/alaazamelDev/text-based-search-engine","commit_stats":null,"previous_names":["alaazameldev/information-retrieval-engine"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alaazamelDev%2Ftext-based-search-engine","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alaazamelDev%2Ftext-based-search-engine
/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alaazamelDev%2Ftext-based-search-engine/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alaazamelDev%2Ftext-based-search-engine/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alaazamelDev","download_url":"https://codeload.github.com/alaazamelDev/text-based-search-engine/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":237982289,"owners_count":19397236,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chromadb","fastapi","gensim-word2vec","nltk","numpy","precision-recall","python","scikit-learn","tf-idf-vectorizer"],"created_at":"2024-09-26T20:03:44.581Z","updated_at":"2025-10-24T14:31:05.036Z","avatar_url":"https://github.com/alaazamelDev.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Text-Based Search Engine Project\n\n## Project Overview\n\nThis project, developed as an assignment for the Information Retrieval subject, demonstrates the implementation of search engines using two distinct techniques: TF-IDF based vectorization and embedding-based vectorization. 
Our goal is to showcase efficient and accurate document retrieval in response to user queries, highlighting the differences and advantages of each approach.\n\n## Features\n\n- Dual search engine implementation: TF-IDF and Word Embedding based\n- Query suggestion functionality\n- Document clustering and topic detection\n- Similar document retrieval\n- Efficient offline processing and fast online querying\n\n## Technologies Used\n\n- **Python**: Primary programming language\n- **NumPy**: For numerical computations\n- **Chroma DB**: Vector database for efficient similarity search\n- **Gensim**: For Word2Vec model implementation\n- **Scikit-learn**: For TF-IDF vectorization and other machine learning utilities\n- **FastAPI**: For creating the web API\n- **NLTK**: For text processing and tokenization\n\n## Datasets\n\n- **Antique**: A non-factoid question answering dataset [Link](https://ir-datasets.com/antique.html#antique/train)\n- **Wikipedia**: A subset of Wikipedia articles [Link](https://ir-datasets.com/wikir.html#wikir/en1k/training)\n\n## Process Workflow\n\n### TF-IDF Based Search Engine\n\n\u003ctable\u003e\u003ctr\u003e\n\u003ctd\u003e\u003cimg src=\"tf_of.png\" /\u003e\u003c/td\u003e\n\u003ctd\u003e\u003cimg src=\"tf_on.png\" /\u003e\u003c/td\u003e\n\u003c/tr\u003e\u003c/table\u003e\n\n| Process | Description |\n|---------|-------------|\n| Offline Process | 1. Load and preprocess documents\u003cbr\u003e2. Create vocabulary\u003cbr\u003e3. Compute TF-IDF matrix\u003cbr\u003e4. Store TF-IDF matrix and vocabulary |\n| Online Process | 1. Receive user query\u003cbr\u003e2. Preprocess query\u003cbr\u003e3. Convert query to TF-IDF vector\u003cbr\u003e4. Compute similarity with document vectors\u003cbr\u003e5. 
Rank and return top results |\n\n### Word2Vec Based Search Engine\n\n\u003ctable\u003e\u003ctr\u003e\n\u003ctd\u003e\u003cimg src=\"emb_of.png\" /\u003e\u003c/td\u003e\n\u003ctd\u003e\u003cimg src=\"emb_on.png\" /\u003e\u003c/td\u003e\n\u003c/tr\u003e\u003c/table\u003e\n\n| Process | Description |\n|---------|-------------|\n| Offline Process | 1. Load and preprocess documents\u003cbr\u003e2. Train or load pre-trained Word2Vec model\u003cbr\u003e3. Compute document embeddings\u003cbr\u003e4. Store embeddings in Chroma DB |\n| Online Process | 1. Receive user query\u003cbr\u003e2. Preprocess query\u003cbr\u003e3. Compute query embedding\u003cbr\u003e4. Perform similarity search in Chroma DB\u003cbr\u003e5. Rank and return top results |\n\n\n## Implementation Details\n\n### TF-IDF Based Vectorization\n\nThe TF-IDF (Term Frequency-Inverse Document Frequency) approach involves:\n- Creating a vocabulary from all documents\n- Computing TF-IDF scores for each term in each document\n- Representing documents and queries as TF-IDF vectors\n- Using cosine similarity to find relevant documents\n\n### Embedding-Based Vectorization\n\nThe Word Embedding approach involves:\n- Using pre-trained or custom-trained Word2Vec models\n- Representing words as dense vectors\n- Computing document embeddings by averaging word vectors\n- Using vector similarity in embedding space to find relevant documents\n\n## Examples\n\n| Query Suggestion | Query Result |\n|------------------|--------------|\n| ![Query Suggestion](query_suggestion.png) | ![Query Result](query_result.png) |\n\n| Topic Detection | Similar Documents |\n|-----------------|-------------------|\n| ![Topic Detection](topic_detection.png) | ![Similar Documents](similar_documents.png) |\n\n## Performance Comparison\n\n| Metric | TF-IDF Based | Word Embedding Based |\n|--------|--------------|----------------------|\n| MAP    | 54%          | 70%                  |\n| MRR    | 63%          | 80%                  |\n\nThe Word 
Embedding based approach outperforms the TF-IDF based one on both Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR).\n\n## Additional Features\n\n### Query Suggestion\n![N-Grams](n_grams.png)\nOur system generates query suggestions by:\n1. Processing the user's input query\n2. Generating word vectors using Word2Vec\n3. Finding similar terms using cosine similarity\n4. Ranking and presenting the top suggestions\n\n### Document Clustering\n\nWe implement document clustering to group similar documents and identify topics:\n- Using the K-Means clustering algorithm\n- Applying Latent Dirichlet Allocation (LDA) for topic modeling\n\n## How to Use\n\n[To be added in a future update]\n\n## Documentation\n\nFor complete documentation of the project in Arabic, please refer to the following link:\n\n[Arabic Documentation](https://docs.google.com/document/d/1Fool2lmw9wKLmy9dEnJKBvU3GGOd3ymOAxb2yPikYG8/edit?usp=sharing)\n\n## Future Improvements\n\n- Implement more advanced embedding models (e.g., BERT, GPT)\n- Enhance query suggestion with user interaction data\n- Improve clustering algorithms for better topic detection\n- Optimize performance for larger datasets\n\n## Contributors\n\n- [Alaa Aldeen Zamel](https://github.com/alaazamelDev)\n- Anas Rish\n- Anas Durra\n- Mohammed Hadi Barakat\n- Mohammed Fares Dabbas\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falaazameldev%2Ftext-based-search-engine","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falaazameldev%2Ftext-based-search-engine","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falaazameldev%2Ftext-based-search-engine/lists"}