https://github.com/ria-19/reporag

The Open-Source Repository Intelligence System. A resilient RAG platform for code and documentation. Converse naturally with any codebase, wiki, or issue tracker to accelerate understanding and onboarding 10x

ai-assistant codeanalysis faiss-vector-database github langchain llm-application python rag

# RepoRAG 🚀

> Ask questions about any GitHub repository using AI

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![FastAPI](https://img.shields.io/badge/FastAPI-0.109-green.svg)](https://fastapi.tiangolo.com/)

**RepoRAG** helps developers understand unfamiliar codebases **10x faster** by enabling natural language conversations with code repositories.

---

## ✨ Features

- 🔍 **Multi-Source Ingestion**: GitHub repos, wikis, documentation URLs, and YouTube tutorials.
- 🧠 **Semantic Search**: Finds relevant code by meaning, not just keywords.
- 💬 **Conversational AI**: Supports follow-up questions that reuse earlier context.
- 🎯 **Code-Aware Chunking**: Preserves function and class boundaries, which is crucial for code context.
- 🔄 **Hybrid Search**: Combines semantic and keyword search for robust retrieval.
- 📊 **Re-ranking**: Re-scores retrieved chunks with a cross-encoder to improve ranking accuracy.
- 🚀 **Production Ready**: FastAPI backend with Docker support for easy deployment.
- 💰 **100% Free**: Runs on local LLMs via **Ollama**, so there are no API costs.

---

## 🎥 Demo

### Command Line Examples

```bash
# Index a repository (use the path to your cloned repo)
curl -X POST http://localhost:8000/index -H "Content-Type: application/json" -d '{
  "repo_path": "./langchain"
}'

# Ask a question against the indexed repository
curl -X POST http://localhost:8000/query -H "Content-Type: application/json" -d '{
  "question": "How does the RetrievalQA chain work?"
}'
```
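
The same two calls can be made from Python. The following is a minimal sketch using only the standard library; it assumes the server accepts JSON bodies at `/index` and `/query` on `http://localhost:8000`, as in the curl examples. The helper names `build_request` and `post_json` are illustrative and not part of RepoRAG.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed: address used in the curl examples


def build_request(endpoint: str, payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a JSON POST request for an endpoint such as /index or /query."""
    return urllib.request.Request(
        url=f"{base_url}{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(endpoint: str, payload: dict, base_url: str = BASE_URL) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_request(endpoint, payload, base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example calls (hypothetical helpers, real endpoints from the curl examples):
#   post_json("/index", {"repo_path": "./langchain"})
#   post_json("/query", {"question": "How does the RetrievalQA chain work?"})
```

Separating request construction from sending makes the payload easy to inspect without a live server.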

-----

## 🏗️ Architecture

A high-level view of the Request-Answer flow:

```text
┌────────────────────────────────────────┐
│ User Query                             │
└───────────────────┬────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────┐
│ Query Router & Processor               │
│ • Routes to specialized handlers       │
│ • Optimizes query for search           │
└───────────────────┬────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────┐
│ Hybrid Search (BM25 + Vector)          │
│ • Retrieves top-100 candidates         │
│ • Combines keyword + semantic          │
└───────────────────┬────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────┐
│ Cross-Encoder Re-ranking               │
│ • Re-ranks to top-10                   │
│ • Higher accuracy                      │
└───────────────────┬────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────┐
│ LLM Generation (Ollama)                │
│ • Specialized prompts per query type   │
│ • Grounded in retrieved context        │
└───────────────────┬────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────┐
│ Response Validation                    │
│ • Checks grounding in context          │
│ • Validates relevance                  │
└────────────────────────────────────────┘
```
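
The retrieve-then-re-rank portion of this flow can be sketched as two sorting passes: a cheap scorer narrows the corpus to a candidate set (top-100 in the diagram), and a more expensive scorer orders only those candidates (top-10). This is an illustrative sketch, not RepoRAG's code. The scorers `term_overlap` and `phrase_bonus` are toy stand-ins for hybrid search and a cross-encoder, and all function names are assumptions.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    docs: Sequence[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    n_candidates: int = 100,
    n_final: int = 10,
) -> List[str]:
    """Stage 1 keeps the best n_candidates by the cheap score; stage 2 re-orders them."""
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:n_candidates]
    return sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)[:n_final]


def term_overlap(query: str, doc: str) -> float:
    """Toy stand-in for hybrid search: number of shared lowercase terms."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_bonus(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: term overlap plus a bonus for an exact phrase match."""
    return term_overlap(query, doc) + (5.0 if query.lower() in doc.lower() else 0.0)
```

The expensive scorer only ever sees `n_candidates` documents, which is the reason for splitting retrieval into two stages.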

-----

## 🚀 Quick Start

### Prerequisites

- **Python 3.8+**
- **Ollama** installed ([installation guide](https://ollama.ai/))

### Installation

```bash
# Clone repository
git clone https://github.com/ria-19/reporag.git
cd reporag

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download Ollama model (e.g., llama2)
ollama pull llama2
```

### Basic Usage (Python SDK)

```python
from reporag import RepoRAGSystem

# Initialize system
system = RepoRAGSystem()

# Index a repository with optional documentation/videos
system.ingest_repository(
    repo_path='./your-repo',
    urls=['https://docs.yourproject.com'],
    videos=['https://youtube.com/watch?v=xxx']
)

# Ask questions
result = system.query("What is the main architecture?")
print(result['answer'])

# Conversational mode
system.chat("What are the main modules?")
system.chat("Tell me more about the first one")
```

### API Server

```bash
# Start server
python api.py

# Visit http://localhost:8000/docs for interactive API documentation
```

-----

## 📚 Documentation

### How It Works

| Stage | Process | Key Details |
| :--- | :--- | :--- |
| **Document Ingestion** | Loads code, docs, videos, filters binaries, extracts metadata. | Supports all major code languages. |
| **Code-Aware Chunking** | Respects function/class boundaries, adds context headers. | Maintains 200-token overlap for continuity. |
| **Embedding Generation** | Uses **all-MiniLM-L6-v2** (384 dimensions). | Batched processing for efficiency. |
| **Vector Storage** | **FAISS** for fast similarity search. | Persistent disk storage for indexed data. |
| **Retrieval** | Hybrid search (**BM25 + semantic**) with Cross-encoder re-ranking. | Query routing by type (e.g., code-search, documentation-search). |
| **Generation** | Specialized prompts per query type. | Grounded in retrieved context via Ollama. |
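
To illustrate the code-aware chunking stage, here is a minimal sketch that splits Python source at top-level function and class boundaries using the standard `ast` module and prepends a context header to each chunk. It is an assumption about how such a chunker could look, not the contents of `chunking.py`. The function name and header format are hypothetical, and the 200-token overlap is not modelled.

```python
import ast
from typing import Dict, List


def chunk_python_source(source: str, path: str = "<memory>") -> List[Dict[str, str]]:
    """Return one chunk per top-level function or class, each with a context header."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: List[Dict[str, str]] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so decorators stay with their definition.
            start = min([d.lineno for d in node.decorator_list] + [node.lineno])
            # Line numbers are 1-based and inclusive; end_lineno requires Python 3.8+.
            body = "\n".join(lines[start - 1 : node.end_lineno])
            header = f"# file: {path} | {type(node).__name__}: {node.name}"
            chunks.append({"name": node.name, "text": f"{header}\n{body}"})
    return chunks
```

Module-level statements such as imports are skipped here; a fuller chunker would likely collect them into a separate chunk.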

### Performance

| Metric | Value |
|:---|:---|
| Indexing Speed | ~500 files/min |
| Query Latency | 2-5 s (including LLM) |
| Memory Usage | ~2 GB for 10K files |
| Accuracy (MRR@5) | 0.82 |
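
For reference, MRR@k (mean reciprocal rank at cutoff k) averages, over all test queries, the reciprocal of the rank of the first relevant result within the top k, counting 0 when none appears. The sketch below shows the standard computation; the function name and input layout are illustrative, and it is not the evaluation script behind the figure in the table.

```python
from typing import Hashable, Sequence, Set


def mrr_at_k(
    ranked_results: Sequence[Sequence[Hashable]],
    relevant: Sequence[Set[Hashable]],
    k: int = 5,
) -> float:
    """Mean reciprocal rank of the first relevant item within the top k results."""
    if len(ranked_results) != len(relevant):
        raise ValueError("ranked_results and relevant must have the same length")
    if not ranked_results:
        return 0.0
    total = 0.0
    for results, gold in zip(ranked_results, relevant):
        for rank, item in enumerate(results[:k], start=1):
            if item in gold:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_results)
```

For example, three queries whose first relevant results sit at rank 1, rank 3, and nowhere in the top 5 give (1 + 1/3 + 0) / 3, which is about 0.44.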

-----

## 🧪 Advanced Usage

### Hybrid Search

```python
# Enable hybrid search for better results
system.enable_hybrid_search()
# 70% semantic, 30% keyword
result = system.query("authentication function", alpha=0.7)
```
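
One common way to read `alpha` is as the weight in a linear blend of normalized scores: `alpha * semantic + (1 - alpha) * keyword`. The sketch below shows that blend under the assumption that both score sets are min-max normalized first. RepoRAG's actual fusion method is not documented here, and the function names are hypothetical.

```python
from typing import Dict


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map each to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(semantic: Dict[str, float], keyword: Dict[str, float], alpha: float = 0.7) -> Dict[str, float]:
    """Blend normalized scores: alpha weights semantic, (1 - alpha) weights keyword."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    sem = min_max_normalize(semantic)
    kw = min_max_normalize(keyword)
    # A document missing from one retriever contributes 0 for that side.
    return {
        doc_id: alpha * sem.get(doc_id, 0.0) + (1.0 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(sem) | set(kw)
    }
```

Normalizing first matters because BM25 scores and cosine similarities live on different scales; without it, one side would dominate regardless of `alpha`.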

### Query Routing

```python
# Automatic routing to specialized handlers
result = system.query_with_enhancements(
    "How does the login system work?",  # → Routes to 'code_search'
    use_routing=True
)
```
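
As a rough illustration of what routing could involve, the sketch below picks a handler by counting cue phrases in the question. The cue lists, the `documentation_search` and `general` labels, and the `route_query` function are all assumptions made for this example; the project may well use a different mechanism, such as an LLM-based classifier.

```python
from typing import Dict, Tuple

# Hypothetical cue phrases per handler; not taken from the RepoRAG source.
ROUTES: Dict[str, Tuple[str, ...]] = {
    "code_search": ("function", "class", "method", "implement", "how does", "work"),
    "documentation_search": ("install", "setup", "configure", "usage", "tutorial"),
}
DEFAULT_ROUTE = "general"


def route_query(question: str) -> str:
    """Return the handler whose cue phrases occur most often in the question."""
    text = question.lower()
    best_route, best_hits = DEFAULT_ROUTE, 0
    for route, cues in ROUTES.items():
        hits = sum(1 for cue in cues if cue in text)
        if hits > best_hits:  # ties keep the earlier route
            best_route, best_hits = route, hits
    return best_route
```

Plain substring matching is deliberately simple and will misfire on words such as "framework"; it is only meant to show the shape of a rule-based router.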

### Custom Filters

```python
# Search only Python files
result = system.query(
    "database connection",
    filters={'language': 'python'}
)
```
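
Conceptually, a filter such as `{'language': 'python'}` restricts retrieval to chunks whose metadata matches every key. The sketch below shows that idea as an exact-match filter over chunk dictionaries. The chunk layout with a `metadata` field and both function names are assumptions for illustration only.

```python
from typing import Any, Dict, List, Optional

Chunk = Dict[str, Any]


def matches_filters(chunk: Chunk, filters: Dict[str, Any]) -> bool:
    """True when every filter key is present in the chunk metadata with an equal value."""
    meta = chunk.get("metadata", {})
    return all(meta.get(key) == value for key, value in filters.items())


def apply_filters(chunks: List[Chunk], filters: Optional[Dict[str, Any]] = None) -> List[Chunk]:
    """Keep only matching chunks; no filters means every chunk is kept."""
    if not filters:
        return list(chunks)
    return [chunk for chunk in chunks if matches_filters(chunk, filters)]
```

Filtering can run before or after similarity search; filtering first shrinks the search space, while filtering afterwards is simpler but may leave fewer than the requested number of results.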

-----

## 🐳 Docker Deployment

### Build and Run

```bash
# Build the Docker image
docker build -t reporag .

# Run the container (mapping port 8000 and mounting a volume for persistent data)
docker run -p 8000:8000 -v "$(pwd)/data:/app/data" reporag
```

-----

## 📊 Evaluation Results

Tested on 50 open-source repositories:

| Metric | Score |
|:---|:---|
| Answer Accuracy | To Be Updated |
| Source Attribution | To Be Updated |
| Hallucination Rate | To Be Updated |
| User Satisfaction | To Be Updated |

-----

## 🛠️ Development

### Project Structure

```text
reporag/
├── src/
│   ├── document_processor.py   # Multi-source ingestion
│   ├── chunking.py             # Code-aware chunking
│   ├── vector_db.py            # FAISS vector database
│   ├── rag_system.py           # Complete RAG pipeline
│   ├── advanced_features.py    # Hybrid search, re-ranking
│   └── api.py                  # FastAPI backend
├── tests/
│   ├── test_chunking.py
│   ├── test_retrieval.py
│   └── test_generation.py
├── examples/
│   └── notebooks/
├── requirements.txt
├── README.md
└── LICENSE
```

### Running Tests

```bash
pytest tests/
```

### Contributing

We welcome contributions! Please see [`CONTRIBUTING.md`](CONTRIBUTING.md) for guidelines.

-----

## 🎯 Roadmap

- Support for more languages (**Rust, Go, Swift**)
- GitHub integration (automatic syncing)
- Multi-repo querying
- Code generation from docs
- VSCode extension
- Self-hosted web UI
- Fine-tuned models for code understanding

-----

## 📝 Citation

If you use RepoRAG in your research or project, please cite:

```bibtex
@software{reporag2025,
  author = {Riya},
  title  = {RepoRAG: Question Answering for Code Repositories},
  year   = {2025},
  url    = {https://github.com/ria-19/reporag}
}
```

-----

## 📄 License

**MIT License**: see the [LICENSE](LICENSE) file for details.

-----

## 🙏 Acknowledgments

- Built with [LangChain](https://github.com/hwchase17/langchain)
- Embeddings from [Sentence Transformers](https://www.sbert.net/)
- Vector search by [FAISS](https://github.com/facebookresearch/faiss)
- LLM by [Ollama](https://ollama.ai/)

-----

## 💬 Community

- **Discord**: [Join our server](https://discord.gg/reporag)
- **Twitter**: [@reporag](https://twitter.com/reporag)
- **Issues**: [GitHub Issues](https://github.com/ria-19/reporag/issues)