https://github.com/ashwantmanikoth/intellilsearch

This is an AI-powered crawler that searches the web for information based on your input.

Host: GitHub
URL: https://github.com/ashwantmanikoth/intellilsearch
Owner: ashwantmanikoth
License: mit
Created: 2025-01-27T19:39:20.000Z (4 months ago)
Default Branch: main
Last Pushed: 2025-03-02T21:53:01.000Z (3 months ago)
Last Synced: 2025-03-02T22:20:57.816Z (3 months ago)
Topics: crawler, deepseek, groq-api, hybrid-search, llama, llm, pydantic, python, rag, reranking, retrieval-augmented-generation
Language: Python
Homepage:
Size: 28.3 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# IntelliSearch using Crawler and RAG
**IntelliSearch** is an intelligent web crawler that combines large language models (LLMs) with modern search techniques to deliver precise, context-aware answers. The system uses dense vector retrieval (via Qdrant) and RAG Fusion for re-ranking, and it is designed to be extended with techniques such as Late Interaction and token-level refinement.

## Technologies

## Features

- **Hybrid LLM Integration:**
  - **Local LLMs:** Run directly on your machine for enhanced data privacy and control.
  - **Paid API LLMs:** Use hosted models such as OpenAI’s GPT-4 for stronger performance and real-time capabilities.

- **Efficient Vector Search with Qdrant:**
  - **Search results:** Google search results are fetched through the SerpAPI free tier.
  - **Fast & Accurate:** Qdrant stores and retrieves dense embeddings efficiently, keeping search quick and precise at scale.

- **Advanced Retrieval Techniques:**
  - **RAG Fusion Reranking:** Merges multiple search results using reciprocal rank fusion to prioritize the most relevant documents.
  - **Planned Enhancements:** Integrate Late Interaction techniques (e.g., ColBERT-style token-level re-ranking) and hybrid search methods (combining dense embeddings with BM25).

- **Enterprise-Ready:**
  - **Customizable for Closed Systems:** Easily tunable for internal databases and proprietary search systems, in the style of apps such as Perplexity.

- **User-Friendly Interface:**
  - **Gradio UI:** A simple, interactive web-based interface for seamless user interactions.
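
To make the dense retrieval feature concrete, the sketch below ranks stored vectors by cosine similarity to a query vector. It is a brute-force, pure-Python stand-in for the nearest-neighbour lookup that a vector database such as Qdrant performs, and it does not use Qdrant's API. The function names, chunk ids, and toy 3-dimensional vectors are all assumptions made for illustration; they are not taken from this repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as similar to nothing.
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=2):
    """Return the ids of the k stored vectors most similar to the query.

    `stored` maps a hypothetical chunk id to its embedding vector.
    """
    scored = [
        (cosine_similarity(query_vector, vector), chunk_id)
        for chunk_id, vector in stored.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy embeddings (hypothetical); real embedding models produce far more dimensions.
stored_vectors = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.9, 0.1, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
}

nearest = top_k([0.8, 0.2, 0.0], stored_vectors, k=2)
print(nearest)
```

A vector database replaces the linear scan in `top_k` with an approximate nearest-neighbour index, which is what keeps lookups fast as the number of stored chunks grows.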

---

## 📂 Folder Structure
```
WebcrawlerRAG/
├── components/
│   ├── chat_logic.py      # Main logic for chat interactions and RAG techniques
│   └── ranking_modes.py   # Ranking modes such as reciprocal rank fusion and unique union
├── services/
│   └── search_service.py  # Handles document search and loading
├── utils/
│   └── config.py          # Configuration settings for the project
├── app.py                 # Main application file that launches the Gradio UI
├── models.properties      # Configuration file listing available models
├── requirements.txt       # Dependencies required for the project
├── README.md              # Project documentation and instructions
└── .env                   # Environment variables (e.g., API keys, database URLs)
```

## How It Works
- **Document Loading & Processing**
  - The system fetches documents via the search service and splits them into manageable chunks.

- **Vector Storage & Retrieval**
  - Chunks are embedded using a dense embedding model and stored in Qdrant. Retrieval is performed using dense vector search.

- **RAG Fusion Re-ranking**
  - Multiple search queries are generated, and their results are merged using either reciprocal rank fusion, which prioritizes the most accurate matches, or unique union, which suits broader searches.

- **Answer Synthesis**
  - The retrieved context is fed into an LLM (local or API-based) to generate a final answer in Markdown format with links to sources.
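
The two merge strategies named in the re-ranking step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in `ranking_modes.py`: the function names, the document ids, and the smoothing constant `k=60` (a commonly used default for reciprocal rank fusion) are all hypothetical choices made for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists by summing 1 / (k + rank) for each document.

    Documents that appear near the top of several lists get the highest
    fused score. `k` is an assumed smoothing constant, not a repo setting.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


def unique_union(result_lists):
    """Merge ranked lists by keeping each document once, in first-seen order."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc_id in results:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Hypothetical results for three generated queries, best match first.
per_query_results = [
    ["doc-a", "doc-b", "doc-c"],
    ["doc-b", "doc-a", "doc-d"],
    ["doc-b", "doc-e"],
]

fused = reciprocal_rank_fusion(per_query_results)
merged = unique_union(per_query_results)
print(fused[:3])
print(merged)
```

In this toy input, `doc-b` appears in all three lists and twice in first place, so reciprocal rank fusion moves it to the top, while unique union keeps every document without re-scoring and preserves the order in which documents were first seen.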

## License

This project is licensed under the MIT License. See the `LICENSE` file for details.

---

Feel free to star, fork and contribute to this project and share your feedback!
