https://github.com/ashwantmanikoth/intellilsearch

This is an AI-powered crawler that searches the web for information based on your input.
crawler deepseek groq-api hybrid-search llama llm pydantic python rag reranking retrieval-augmented-generation


# IntelliSearch using Crawler and RAG
**IntelliSearch** is an intelligent web crawler that combines large language models (LLMs) with modern search techniques to deliver precise, context-aware answers. The system uses dense vector retrieval (via Qdrant) and RAG Fusion for re-ranking, and it is designed to be extended with techniques such as Late Interaction and token-level refinement.

## Technologies



## Features

- **Hybrid LLM Integration:**
  - **Local LLMs:** Run directly on your machine for enhanced data privacy and control.
  - **Paid API LLMs:** Use hosted models such as OpenAI’s GPT-4 for stronger performance and real-time capabilities.

- **Efficient Vector Search with Qdrant:**
  - **Search results:** Google search results are fetched through the SerpAPI free tier.
  - **Fast & Accurate:** Qdrant stores and retrieves dense embeddings for quick, precise search at scale.

- **Advanced Retrieval Techniques:**
  - **RAG Fusion Reranking:** Merges multiple search results using reciprocal rank fusion to prioritize the most relevant documents.
  - **Planned Enhancements:** Integrate Late Interaction techniques (e.g., ColBERT-style token-level re-ranking) and hybrid search methods (combining dense embeddings with BM25).

- **Enterprise-Ready:**
  - **Customizable for Closed Systems:** Tunable for internal databases and proprietary search systems, in the style of apps such as Perplexity.

- **User-Friendly Interface:**
  - **Gradio UI:** A simple, interactive web-based interface.

---

## 📂 Folder Structure
```
WebcrawlerRAG/
├── components/
│   ├── chat_logic.py      # Main logic for handling chat interactions and RAG techniques
│   └── ranking_modes.py   # Ranking modes such as reciprocal rank fusion and unique union
├── services/
│   └── search_service.py  # Handles document search and loading
├── utils/
│   └── config.py          # Configuration settings for the project
├── app.py                 # Main application file to launch the Gradio UI
├── models.properties      # Configuration file listing available models
├── requirements.txt       # List of dependencies required for the project
├── README.md              # Project documentation and instructions
└── .env                   # Environment variables (e.g., API keys, database URLs)
```

## How It Works
- **Document Loading & Processing**
  - The system fetches documents via the search service and splits them into manageable chunks.

- **Vector Storage & Retrieval**
  - Chunks are embedded using a dense embedding model and stored in Qdrant. Retrieval is performed using dense vector search.

- **RAG Fusion Re-ranking**
  - Multiple search queries are generated from the user's input, and their results are merged in one of two ways: reciprocal rank fusion, which prioritizes the most accurate matches, or unique union, which suits broader searches.

- **Answer Synthesis**
  - The retrieved context is fed into an LLM (local or API-based) to generate a final answer in markdown format with links to sources.
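
The two merge strategies named in the RAG Fusion step can be sketched in a few lines of Python. This is a minimal illustration, not the code in `ranking_modes.py`: the function names, the document IDs, and the smoothing constant `k = 60` (a commonly used default for reciprocal rank fusion) are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over the lists containing it."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def unique_union(rankings: Sequence[Sequence[str]]) -> List[str]:
    """Concatenate ranked lists in order, keeping only the first occurrence of each document."""
    seen = set()
    merged: List[str] = []
    for ranking in rankings:
        for doc_id in ranking:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Hypothetical results for three generated queries (document IDs are placeholders).
rankings = [["a", "b", "c"], ["b", "a", "d"], ["b", "c"]]
fused = reciprocal_rank_fusion(rankings)
union = unique_union(rankings)
print(fused)
print(union)
```

Reciprocal rank fusion rewards documents that rank highly across several queries, while unique union simply keeps everything any query returned, which is why it fits broader searches.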

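The chunking mentioned in the Document Loading step could look roughly like the sketch below. The README does not say how documents are split, so the character-based windows, the overlap between neighbouring chunks, and the name `split_into_chunks` are assumptions chosen only to illustrate the idea.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # distance between the starts of consecutive chunks
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Small hypothetical input so the windows are easy to inspect.
example_chunks = split_into_chunks("abcdefghij", chunk_size=4, overlap=1)
print(example_chunks)
```

Overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.
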
## License

This project is licensed under the MIT License. See the `LICENSE` file for details.

---

Feel free to star, fork and contribute to this project and share your feedback!