https://github.com/ashwantmanikoth/intellilsearch
This is an AI-powered crawler that can search the web for information based on your input.
crawler deepseek groq-api hybrid-search llama llm pydantic python rag reranking retrieval-augmented-generation
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/ashwantmanikoth/intellilsearch
- Owner: ashwantmanikoth
- License: mit
- Created: 2025-01-27T19:39:20.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-03-02T21:53:01.000Z (3 months ago)
- Last Synced: 2025-03-02T22:20:57.816Z (3 months ago)
- Topics: crawler, deepseek, groq-api, hybrid-search, llama, llm, pydantic, python, rag, reranking, retrieval-augmented-generation
- Language: Python
- Homepage:
- Size: 28.3 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# IntelliSearch using Crawler and RAG
**IntelliSearch** is an intelligent web crawler that combines large language models (LLMs) with modern search techniques to deliver precise, context-aware answers. The system uses dense vector retrieval (via Qdrant) and RAG Fusion for re-ranking, and it's designed to be easily extended with advanced techniques such as Late Interaction and token-level refinement.
## Technologies
![]()
![]()
![]()
## Features
- **Hybrid LLM Integration:**
  - **Local LLMs:** Run directly on your machine for enhanced data privacy and control.
  - **Paid API LLMs:** Use cutting-edge models like OpenAI's GPT-4 for superior performance and real-time capabilities.
- **Efficient Vector Search with Qdrant:**
  - **Search Results:** Uses the SerpAPI free tier to fetch Google search results.
  - **Fast & Accurate:** Qdrant efficiently stores and retrieves dense embeddings to ensure quick and precise search results at scale.
- **Advanced Retrieval Techniques:**
  - **RAG Fusion Reranking:** Merges multiple search results using reciprocal rank fusion to prioritize the most relevant documents.
  - **Planned Enhancements:** Integrate Late Interaction techniques (e.g., ColBERT-style token-level re-ranking) and hybrid search methods (combining dense embeddings with BM25).
- **Enterprise-Ready:**
  - **Customizable for Closed Systems:** Easily tunable for internal databases and proprietary search systems, similar to industry-leading apps such as Perplexity.
- **User-Friendly Interface:**
  - **Gradio UI:** A simple, interactive web-based interface for seamless user interactions.
---
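The RAG Fusion reranking feature listed above relies on reciprocal rank fusion (RRF). The following is a minimal sketch of that technique, not the project's actual code: the function name `reciprocal_rank_fusion`, the document IDs, and the smoothing constant `k = 60` (a commonly used default) are all assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1), and documents are ordered by their summed score.
    NOTE: hypothetical helper for illustration; k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Three ranked result lists, e.g. one per generated search query (made-up IDs).
rankings = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_a"],
]
fused = reciprocal_rank_fusion(rankings)
print(fused)
```

Documents that rank well across several lists rise to the top, even if no single list places them first.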
## Folder Structure
```
WebcrawlerRAG/
├── components/
│   ├── chat_logic.py        # Contains the main logic for handling chat interactions and RAG techniques
│   └── ranking_modes.py     # Contains functions for different ranking modes like reciprocal rank fusion and unique union
├── services/
│   └── search_service.py    # Handles document search and loading
├── utils/
│   └── config.py            # Configuration settings for the project
├── app.py                   # Main application file to launch the Gradio UI
├── models.properties        # Configuration file listing available models
├── requirements.txt         # List of dependencies required for the project
├── README.md                # Project documentation and instructions
└── .env                     # Environment variables (e.g., API keys, database URLs)
```
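The tree notes that `ranking_modes.py` offers a "unique union" mode alongside reciprocal rank fusion. As a rough illustration of what such a mode might look like, the sketch below merges several result lists and drops duplicates while keeping first-seen order. The function name `unique_union` and the sample IDs are hypothetical and are not taken from the repository.

```python
def unique_union(result_lists):
    """Merge several result lists, keeping each document ID once.

    Order follows first appearance across the lists, so no re-scoring
    happens; this favors breadth over precise ranking.
    NOTE: hypothetical helper for illustration only.
    """
    seen = set()
    merged = []
    for results in result_lists:
        for doc_id in results:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Made-up result lists, e.g. one per generated search query.
merged = unique_union([["a", "b"], ["b", "c"], ["a", "d"]])
print(merged)
```

Compared with rank fusion, a union like this keeps every distinct hit, which suits broader, exploratory searches.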
## How It Works
- **Document Loading & Processing**
  - The system fetches documents via the search service and splits them into manageable chunks.
- **Vector Storage & Retrieval**
  - Chunks are embedded using a dense embedding model and stored in Qdrant. Retrieval is performed using dense vector search.
- **RAG Fusion Re-ranking**
  - Multiple search queries are generated, and their results are merged using either reciprocal rank fusion, which prioritizes the most accurate matches, or a unique union, which suits broader search use cases.
- **Answer Synthesis**
  - The retrieved context is fed into an LLM (local or API-based) to generate a final answer in Markdown format with links to sources.
## License
This project is licensed under the MIT License. See the `LICENSE` file for details.
---
Feel free to star, fork and contribute to this project and share your feedback!