https://github.com/ria-19/reporag

The Open-Source Repository Intelligence System. A resilient RAG platform for code and documentation. Converse naturally with any codebase, wiki, or issue tracker to accelerate understanding and onboarding 10x.

Host: GitHub
URL: https://github.com/ria-19/reporag
Owner: ria-19
Created: 2025-10-24T16:10:05.000Z (2 months ago)
Default Branch: master
Last Pushed: 2025-10-30T09:13:45.000Z (about 2 months ago)
Last Synced: 2025-10-30T11:26:01.938Z (about 2 months ago)
Topics: ai-assistant, codeanalysis, faiss-vector-database, github, langchain, llm-application, python, rag
Language: Python
Homepage:
Size: 76.2 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# RepoRAG 🚀

> Ask questions about any GitHub repository using AI

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![FastAPI](https://img.shields.io/badge/FastAPI-0.109-green.svg)](https://fastapi.tiangolo.com/)

**RepoRAG** helps developers understand unfamiliar codebases **10x faster** by enabling natural language conversations with code repositories.

---

## ✨ Features

- 🔍 **Multi-Source Ingestion**: GitHub repos, wikis, documentation URLs, YouTube tutorials
- 🧠 **Semantic Search**: Find relevant code by meaning, not just keywords
- 💬 **Conversational AI**: Ask follow-up questions with context
- 🎯 **Code-Aware Chunking**: Preserves function and class boundaries so each chunk keeps its code context
- 🔄 **Hybrid Search**: Combines semantic and keyword (BM25) search for more robust retrieval
- 📊 **Re-ranking**: Re-ranks retrieved chunks with a cross-encoder for higher accuracy
- 🚀 **Production Ready**: FastAPI backend with Docker support for easy deployment
- 💰 **100% Free**: Runs on local LLMs via **Ollama**, so there are no API costs

---

## 🎥 Demo

### Command Line Examples

```bash
# Index a repository (use the path to your cloned repo)
curl -X POST http://localhost:8000/index \
  -H "Content-Type: application/json" \
  -d '{"repo_path": "./langchain"}'

# Ask a question against the indexed repository
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "How does the RetrievalQA chain work?"}'
````
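
The same two requests can also be built from Python. This is a minimal sketch using only the standard library: the `/index` and `/query` paths and the payload keys come from the curl examples above, while `build_request` and `post_json` are hypothetical helper names and nothing is assumed about the response fields.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local server address, as in the curl examples


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a POST request carrying a JSON body (hypothetical helper)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict) -> dict:
    """Send the request and decode the JSON reply; needs a running server."""
    with urllib.request.urlopen(build_request(path, payload)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, calling `post_json("/index", {"repo_path": "./langchain"})` and then `post_json("/query", {"question": "How does the RetrievalQA chain work?"})` mirrors the two curl calls.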

-----

## 🏗️ Architecture

A high-level view of the Request-Answer flow:

```text
┌─────────────────────────────────────────────────────────┐
│                       User Query                        │
└─────────────────────┬───────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────┐
│                Query Router & Processor                 │
│  • Routes to specialized handlers                       │
│  • Optimizes query for search                           │
└─────────────────────┬───────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────┐
│              Hybrid Search (BM25 + Vector)              │
│  • Retrieves top-100 candidates                         │
│  • Combines keyword + semantic                          │
└─────────────────────┬───────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────┐
│                Cross-Encoder Re-ranking                 │
│  • Re-ranks to top-10                                   │
│  • Higher accuracy                                      │
└─────────────────────┬───────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────┐
│                 LLM Generation (Ollama)                 │
│  • Specialized prompts per query type                   │
│  • Grounded in retrieved context                        │
└─────────────────────┬───────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────┐
│                   Response Validation                   │
│  • Checks grounding in context                          │
│  • Validates relevance                                  │
└─────────────────────────────────────────────────────────┘
```
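
The retrieve-then-re-rank stages in the diagram can be sketched as follows. This is an illustration under assumptions, not RepoRAG's actual code: `retrieve_then_rerank`, `word_overlap`, and `overlap_density` are hypothetical names, and the two toy scorers merely stand in for hybrid search and a cross-encoder.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]  # (query, chunk) -> relevance score


def retrieve_then_rerank(
    query: str,
    chunks: List[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    n_candidates: int = 100,
    n_final: int = 10,
) -> List[str]:
    """Shortlist chunks with a fast scorer, then reorder them with a slower one."""
    # Stage 1: the fast scorer runs over every chunk and keeps the best candidates.
    candidates = sorted(chunks, key=lambda c: cheap_score(query, c), reverse=True)
    candidates = candidates[:n_candidates]
    # Stage 2: the expensive scorer only sees the shortlist.
    reranked = sorted(candidates, key=lambda c: precise_score(query, c), reverse=True)
    return reranked[:n_final]


def word_overlap(query: str, chunk: str) -> float:
    """Toy fast scorer: number of lowercase words shared with the query."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


def overlap_density(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: overlap divided by chunk length."""
    return word_overlap(query, chunk) / max(len(chunk.split()), 1)


docs = [
    "session login token refresh helper and many other unrelated words here",
    "login handler",
    "draw the page",
]
top = retrieve_then_rerank("login session", docs, word_overlap, overlap_density,
                           n_candidates=2, n_final=1)
```

The design point is cost: the second scorer is only applied to the shortlist, so it can afford to be much slower per chunk than the first.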

-----

## 🚀 Quick Start

### Prerequisites

- **Python 3.8+**
- **Ollama** installed ([installation guide](https://ollama.ai/))

### Installation

```bash
# Clone repository
git clone https://github.com/ria-19/reporag.git
cd reporag

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download Ollama model (e.g., llama2)
ollama pull llama2
```

### Basic Usage (Python SDK)

```python
from reporag import RepoRAGSystem

# Initialize system
system = RepoRAGSystem()

# Index a repository with optional documentation/videos
system.ingest_repository(
    repo_path='./your-repo',
    urls=['https://docs.yourproject.com'],
    videos=['https://youtube.com/watch?v=xxx']
)

# Ask questions
result = system.query("What is the main architecture?")
print(result['answer'])

# Conversational mode
system.chat("What are the main modules?")
system.chat("Tell me more about the first one")
```

### API Server

```bash
# Start server
python api.py

# Visit http://localhost:8000/docs for interactive API documentation
```

-----

## 📚 Documentation

### How It Works
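
One way to picture the code-aware chunking step is to split a Python file at top-level function and class boundaries instead of at fixed character counts. The sketch below uses the standard-library `ast` module; `chunk_python_source` is a hypothetical name, and the project's own `chunking.py` may work differently.

```python
import ast
from typing import Dict, List


def chunk_python_source(source: str) -> List[Dict[str, object]]:
    """Return one chunk per top-level function or class in `source`."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive (end_lineno needs Python 3.8+).
            start, end = node.lineno, node.end_lineno
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": start,
                "end_line": end,
                "text": "\n".join(lines[start - 1:end]),
            })
    return chunks


example = '''import os

def load(path):
    return open(path).read()

class Index:
    def add(self, doc):
        pass
'''
chunks = chunk_python_source(example)
```

Each chunk keeps a whole definition together with its line range, so a retrieved chunk can be traced back to its location in the file.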

### Performance

| Metric | Value |
|:---|:---|
| Indexing Speed | ~500 files/min |
| Query Latency | 2-5 s (including LLM) |
| Memory Usage | ~2 GB for 10K files |
| Accuracy (MRR@5) | 0.82 |

-----

## 🧪 Advanced Usage

### Hybrid Search

```python
# Enable hybrid search for better results
system.enable_hybrid_search()
# 70% semantic, 30% keyword
result = system.query("authentication function", alpha=0.7)
```
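
To make the `alpha` weighting concrete, here is a minimal sketch of score fusion. It assumes both score sets are min-max normalized before mixing; `min_max` and `fuse_scores` are hypothetical names, and the fusion method RepoRAG actually uses is not documented in this README.

```python
from typing import Dict


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant score set maps to all zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def fuse_scores(
    semantic: Dict[str, float],
    keyword: Dict[str, float],
    alpha: float = 0.7,
) -> Dict[str, float]:
    """Blend scores as alpha * semantic + (1 - alpha) * keyword per document."""
    sem, kw = min_max(semantic), min_max(keyword)
    doc_ids = set(sem) | set(kw)
    # A document missing from one result list contributes 0 for that side.
    return {d: alpha * sem.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0) for d in doc_ids}


semantic_scores = {"auth.py": 0.9, "db.py": 0.5, "utils.py": 0.1}  # e.g. cosine similarity
keyword_scores = {"auth.py": 2.0, "utils.py": 6.0}                  # e.g. BM25 scores
fused = fuse_scores(semantic_scores, keyword_scores, alpha=0.7)
ranking = sorted(fused, key=fused.get, reverse=True)
```

With `alpha=0.7` the semantic side dominates, so `auth.py` ranks first here; lowering `alpha` shifts weight toward exact keyword matches.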

### Query Routing

```python
# Automatic routing to specialized handlers
result = system.query_with_enhancements(
    "How does the login system work?",  # → Routes to 'code_search'
    use_routing=True
)
```
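
As a rough illustration of what routing could look like, the sketch below picks a handler by counting keyword matches. Everything here is an assumption for illustration: only the `'code_search'` label appears in the example above, while `route_query`, the `docs_search` and `general` routes, and the keyword lists are hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical route table: only 'code_search' appears in the README example.
ROUTES: Dict[str, List[str]] = {
    "code_search": ["function", "class", "method", "implement", "work", "code"],
    "docs_search": ["install", "setup", "configure", "tutorial", "guide"],
}
DEFAULT_ROUTE = "general"


def route_query(question: str) -> str:
    """Pick the route whose keywords match the most words in the question."""
    words = re.findall(r"[a-z]+", question.lower())
    best_route, best_hits = DEFAULT_ROUTE, 0
    for route, keywords in ROUTES.items():
        # Prefix match so that 'works' or 'implemented' still count.
        hits = sum(1 for w in words if any(w.startswith(k) for k in keywords))
        if hits > best_hits:
            best_route, best_hits = route, hits
    return best_route
```

A production router would more likely use a classifier or an LLM call, but the contract is the same: a question goes in and a handler name comes out.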

### Custom Filters

```python
# Search only Python files
result = system.query(
    "database connection",
    filters={'language': 'python'}
)
```
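
Conceptually, a filter like this restricts the candidate chunks by their metadata. The sketch below shows the idea on plain dictionaries; `apply_filters` and the chunk record layout are hypothetical, and whether RepoRAG filters before or after the vector search is not stated here.

```python
from typing import Dict, List


def apply_filters(chunks: List[Dict], filters: Dict[str, str]) -> List[Dict]:
    """Keep chunks whose metadata matches every key/value pair in `filters`."""
    return [
        chunk for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in filters.items())
    ]


# Hypothetical chunk records; the real metadata schema may differ.
chunks = [
    {"text": "def connect(): ...", "metadata": {"language": "python", "path": "db.py"}},
    {"text": "func Connect() {}", "metadata": {"language": "go", "path": "db.go"}},
    {"text": "# Database setup", "metadata": {"language": "markdown", "path": "README.md"}},
]
python_only = apply_filters(chunks, {"language": "python"})
```

An empty `filters` dictionary leaves the chunk list unchanged, which keeps unfiltered queries on the same code path.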

-----

## 🐳 Docker Deployment

### Build and Run

```bash
# Build the Docker image
docker build -t reporag .

# Run the container (mapping port 8000 and mounting a volume for persistent data)
docker run -p 8000:8000 -v "$(pwd)/data:/app/data" reporag
```

-----

## 📊 Evaluation Results

Tested on 50 open-source repositories:

| Metric | Score |
|:---|:---|
| Answer Accuracy | To Be Updated |
| Source Attribution | To Be Updated |
| Hallucination Rate | To Be Updated |
| User Satisfaction | To Be Updated |

-----

## 🛠️ Development

### Project Structure

```text
reporag/
├── src/
│   ├── document_processor.py   # Multi-source ingestion
│   ├── chunking.py             # Code-aware chunking
│   ├── vector_db.py            # FAISS vector database
│   ├── rag_system.py           # Complete RAG pipeline
│   ├── advanced_features.py    # Hybrid search, re-ranking
│   └── api.py                  # FastAPI backend
├── tests/
│   ├── test_chunking.py
│   ├── test_retrieval.py
│   └── test_generation.py
├── examples/
│   └── notebooks/
├── requirements.txt
├── README.md
└── LICENSE
```

### Running Tests

```bash
pytest tests/
```

### Contributing

We welcome contributions! Please see [`CONTRIBUTING.md`](CONTRIBUTING.md) for guidelines.

-----

## 🎯 Roadmap

- Support for more languages (**Rust, Go, Swift**)
- GitHub integration (automatic syncing)
- Multi-repo querying
- Code generation from docs
- VSCode extension
- Self-hosted web UI
- Fine-tuned models for code understanding

-----

## 📝 Citation

If you use RepoRAG in your research or project, please cite:

```bibtex
@software{reporag2025,
  author = {Riya},
  title = {RepoRAG: Question Answering for Code Repositories},
  year = {2025},
  url = {https://github.com/ria-19/reporag}
}
```

-----

## 📄 License

**MIT License** - see the [LICENSE](LICENSE) file for details.

-----

## 🙏 Acknowledgments

- Built with [LangChain](https://github.com/langchain-ai/langchain)
- Embeddings from [Sentence Transformers](https://www.sbert.net/)
- Vector search by [FAISS](https://github.com/facebookresearch/faiss)
- LLM by [Ollama](https://ollama.ai/)

-----

## 💬 Community

- **Discord**: [Join our server](https://discord.gg/reporag)
- **Twitter**: [@reporag](https://twitter.com/reporag)
- **Issues**: [GitHub Issues](https://github.com/ria-19/reporag/issues)
