{"id":25212872,"url":"https://github.com/ashwantmanikoth/IntellilSearch","last_synced_at":"2025-10-25T11:32:12.748Z","repository":{"id":274704382,"uuid":"923189059","full_name":"ashwantmanikoth/AIpoweredWebcrawler","owner":"ashwantmanikoth","description":"This is a AI powered crawler that can search the web for information based on your input.","archived":false,"fork":false,"pushed_at":"2025-02-09T20:34:16.000Z","size":14,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-09T21:27:37.837Z","etag":null,"topics":["crawler","deepseek","groq-api","hybrid-search","llama","llm","pydantic","python","rag","reranking","retrieval-augmented-generation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ashwantmanikoth.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-27T19:39:20.000Z","updated_at":"2025-02-09T20:34:19.000Z","dependencies_parsed_at":"2025-01-28T22:22:54.646Z","dependency_job_id":"410678b8-7971-4cea-a606-4fe32629e0d6","html_url":"https://github.com/ashwantmanikoth/AIpoweredWebcrawler","commit_stats":null,"previous_names":["ashwantmanikoth/aipoweredwebcrawler"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashwantmanikoth%2FAIpoweredWebcrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashwantmanikoth%2FAIpoweredWebcrawler/tags","releases_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/repositories/ashwantmanikoth%2FAIpoweredWebcrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashwantmanikoth%2FAIpoweredWebcrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ashwantmanikoth","download_url":"https://codeload.github.com/ashwantmanikoth/AIpoweredWebcrawler/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238133809,"owners_count":19421909,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","deepseek","groq-api","hybrid-search","llama","llm","pydantic","python","rag","reranking","retrieval-augmented-generation"],"created_at":"2025-02-10T15:17:56.978Z","updated_at":"2025-10-25T11:32:12.742Z","avatar_url":"https://github.com/ashwantmanikoth.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# IntelliSearch using Crawler and RAG\n**IntelliSearch web crawler** is an intelligent web crawler that leverages advanced AI language models (LLMs) along with modern search techniques to deliver precise, context-aware answers. 
The system employs dense vector retrieval (via Qdrant) and RAG Fusion for re-ranking, and it’s designed to be easily extended with advanced techniques such as Late Interaction and token-level refinement.\n\n## Technologies\n\u003cimg src=\"https://github.com/user-attachments/assets/09772236-6410-4b1f-bd66-66a7d8742be9\" width=\"30\" /\u003e\n\u003cimg src=\"https://github.com/user-attachments/assets/0478f3b6-a895-422c-a4da-554d26c1cfc2\" width=\"30\"/\u003e\n\u003cimg src=\"https://github.com/user-attachments/assets/4955df76-7b5f-4590-b4a2-13a5abe90b6f\" width=\"30\"/\u003e\n\u003cimg src=\"https://github.com/user-attachments/assets/5b2b0e6c-520e-426e-af81-aea0a68c2556\" width=\"30\"/\u003e\n\n## Features\n\n- **Hybrid LLM Integration:**\n  - **Local LLMs:** Run directly on your machine for enhanced data privacy and control.\n  - **Paid API LLMs:** Utilize cutting-edge models like OpenAI’s GPT-4 for superior performance and real-time capabilities.\n\n- **Efficient Vector Search with Qdrant:**\n  - **Search results:** Uses the SerpAPI free tier to fetch Google search results.\n  - **Fast \u0026 Accurate:** Qdrant efficiently stores and retrieves dense embeddings to ensure quick and precise search results at scale.\n\n- **Advanced Retrieval Techniques:**\n  - **RAG Fusion Reranking:** Merges multiple search results using reciprocal rank fusion to prioritize the most relevant documents.\n  - **Planned Enhancements:** Integrate Late Interaction techniques (e.g., ColBERT-style token-level re-ranking) and hybrid search methods (combining dense embeddings with BM25).\n\n- **Enterprise-Ready:**\n  - **Customizable for Closed Systems:** Easily tunable for internal databases and proprietary search systems, in the style of industry-leading apps such as Perplexity.\n\n- **User-Friendly Interface:**\n  - **Gradio UI:** A simple, interactive web-based interface for seamless user interactions.\n\n---\n\n## 📂 Folder Structure\n```\nWebcrawlerRAG/\n  ├── components/\n  │   ├── chat_logic.py    
      # Contains the main logic for handling chat interactions and RAG techniques\n  │   ├── ranking_modes.py       # Contains functions for different ranking modes like reciprocal rank fusion and unique union\n  ├── services/\n  │   ├── search_service.py      # Handles document search and loading\n  ├── utils/\n  │   ├── config.py              # Configuration settings for the project\n  ├── app.py                     # Main application file to launch the Gradio UI\n  ├── models.properties          # Configuration file listing available models\n  ├── requirements.txt           # List of dependencies required for the project\n  ├── README.md                  # Project documentation and instructions\n  ├── .env                       # Environment variables (e.g., API keys, database URLs)\n\n```\n\n## How It Works\n-  **Document Loading \u0026 Processing**\n    -  The system fetches documents via the search service and splits them into manageable chunks.\n  \n-  **Vector Storage \u0026 Retrieval**\n    -  Chunks are embedded using a dense embedding model and stored in Qdrant. Retrieval is performed using dense vector search.\n\n-  **RAG Fusion Re-ranking**\n    -  Multiple search queries are generated, and their results are merged using either reciprocal rank fusion, which prioritizes the most accurate matches, or unique union, which suits broader search use cases.\n\n-  **Answer Synthesis**\n    -  The retrieved context is fed into an LLM (local or API-based) to generate a final answer in Markdown format with links to sources.\n\n## License\n\nThis project is licensed under the MIT License. 
See the `LICENSE` file for details.\n\n---\n\nFeel free to star, fork and contribute to this project and share your feedback!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashwantmanikoth%2FIntellilSearch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fashwantmanikoth%2FIntellilSearch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashwantmanikoth%2FIntellilSearch/lists"}