{"id":31644556,"url":"https://github.com/neabytelab/ai-indexing","last_synced_at":"2026-04-11T04:31:38.931Z","repository":{"id":314996454,"uuid":"1057652657","full_name":"NeaByteLab/AI-Indexing","owner":"NeaByteLab","description":"Code indexing framework for converting source code into structured repository maps optimized for semantic search LLM understanding.","archived":false,"fork":false,"pushed_at":"2025-09-16T03:47:05.000Z","size":37,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-16T05:51:59.516Z","etag":null,"topics":["ai","aider","benchmarking","code-analysis","code-indexing","developer-tools","go","indexing","javascript","llm","llm-evaluation","model-comparison","ollama","performance-testing","prompts","python","repomap","semantic-search","testing","typescript"],"latest_commit_sha":null,"homepage":"https://ollama.com","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NeaByteLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-16T03:08:33.000Z","updated_at":"2025-09-16T03:48:23.000Z","dependencies_parsed_at":"2025-09-17T15:18:05.144Z","dependency_job_id":null,"html_url":"https://github.com/NeaByteLab/AI-Indexing","commit_stats":null,"previous_names":["neabytelab/ai-indexing"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/NeaByteLab/AI-Indexing","repository_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/repositories/NeaByteLab%2FAI-Indexing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeaByteLab%2FAI-Indexing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeaByteLab%2FAI-Indexing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeaByteLab%2FAI-Indexing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NeaByteLab","download_url":"https://codeload.github.com/NeaByteLab/AI-Indexing/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeaByteLab%2FAI-Indexing/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278722768,"owners_count":26034461,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-07T02:00:06.786Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","aider","benchmarking","code-analysis","code-indexing","developer-tools","go","indexing","javascript","llm","llm-evaluation","model-comparison","ollama","performance-testing","prompts","python","repomap","semantic-search","testing","typescript"],"created_at":"2025-10-07T04:53:58.252Z","updated_at":"2025-10-07T04:54:03.213Z","avatar_url":"https://github.com/NeaByteLab.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Code Indexing Prompts \u0026 
Testing\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./screenshot/preview.png\" alt=\"Tool Calling and Semantic Search in Action\" width=\"800\" /\u003e\n\u003c/div\u003e\n\n\u003e This folder contains prompts and testing for converting source code into structured repository maps that look like Aider's repomap format. The maps contain key symbols (classes, functions, interfaces) with their signatures, optimized for semantic search and LLM understanding. [Reference: Aider Repomap](https://aider.chat/docs/repomap.html)\n\n### **What's Included**\n\n1. **System Prompts** - Structured prompts for LLM processing\n2. **Testing Data** - Performance comparison across different models\n3. **Examples** - Sample code and expected repository map outputs\n4. **Documentation** - How to use the prompts for indexing\n5. **CLI Scripts** - Ready-to-use tool calling examples and workflows ([📁 View Scripts](./scripts/))\n\n### **🚀 Quick Start with CLI Scripts**\n\nGet started immediately with our command-line interface that provides ready-to-use examples:\n\n```bash\n# Initialize the database\nnpx tsx ./scripts/index.ts init\n\n# Index your codebase\nnpx tsx ./scripts/index.ts indexing\n\n# Run tool calling example\nnpx tsx ./scripts/index.ts tool-calling\n\n# Show help\nnpx tsx ./scripts/index.ts help\n```\n\n\u003e **⚠️ Configuration Required**: Set your API key if you're using the Ollama endpoint. 
Edit the code in the `./scripts` folder to configure your credentials.\n\n**Available Commands:**\n\n- `init` - Initialize the database with proper schema\n- `indexing` - Process and index your entire codebase\n- `tool-calling` - Demonstrate semantic search and tool calling workflow\n- `help` - Display available commands and usage\n\n**Example:**\n\n```typescript\ninterface User {\n  id: number;\n  name: string;\n  email: string;\n}\n\nclass UserService {\n  private users: User[] = [];\n\n  addUser(user: User): void {\n    this.users.push(user);\n  }\n\n  getUserById(id: number): User | null {\n    return this.users.find((u) =\u003e u.id === id) || null;\n  }\n}\n```\n\n**Becomes:**\n\n```\n@@ /src/utils/helper.ts\n⋮... // User interface and service definitions\n│interface User {\n│  id: number\n│  name: string\n│  email: string\n│}\n⋮... // UserService class with CRUD operations\n│class UserService {\n│  private users: User[]\n│  addUser(user: User): void\n│  getUserById(id: number): User | null\n│}\n```\n\n---\n\n## 🔄 Workflow\n\n### Indexing Process\n\nThe indexing workflow converts raw source code into structured repository maps that are optimized for semantic search and LLM understanding. This process creates a searchable knowledge base of your codebase without requiring complex AST parsing.\n\n```mermaid\nflowchart LR\n    A[📁 Scan Workspace] --\u003e B[📄 Read Files]\n    B --\u003e C[🤖 Send to LLM]\n    C --\u003e D[🗂️ Generate Repo Map]\n    D --\u003e E[💾 Store to Database]\n```\n\n**Detailed Steps:**\n\n1. **📁 Scan Workspace** - Discover all source code files recursively\n2. **📄 Read Files** - Extract file contents with absolute paths for context\n3. **🤖 Send to LLM** - Process files with specialized prompts for code understanding\n4. **🗂️ Generate Repo Map** - Create Aider-format maps with key symbols and context\n5. 
**💾 Store to Database** - Persist maps for fast semantic search retrieval\n\n\u003e **Note**: Vector databases are not required for this indexing approach. Both **SQL** and **NoSQL** databases are suitable for storing the generated repository maps. The structured format allows for efficient text-based search and retrieval.\n\n### Retrieval Process\n\nThe retrieval workflow enables intelligent code assistance by allowing the LLM to dynamically search and read relevant code sections based on user queries. This creates a conversational coding experience where the AI can understand context and provide targeted help.\n\n**System Requirements:**\n\n- **Semantic Search Tool** (Required) - Enables finding relevant code sections based on meaning\n- **Read File Tool** (Required) - Allows reading specific files for detailed analysis\n\n**Process Flow:**\n\n- **User Query** → User asks for help (e.g., \"please help me fix this code\")\n- **LLM Analysis** → AI determines if additional context is needed\n- **Context Search** → If needed, LLM calls Semantic Search to find relevant code\n- **Detailed Analysis** → If more details required, LLM reads specific files\n- **Response Generation** → LLM processes all information and provides solution\n- **Iterative Loop** → Process repeats for follow-up questions until user is satisfied\n\n```mermaid\nsequenceDiagram\n    participant U as 👤 User\n    participant L as 🤖 LLM\n    participant S as 🔍 Semantic Search\n    participant F as 📄 Read File\n\n    U-\u003e\u003eL: \"Please help me fix this code\"\n    L-\u003e\u003eL: Analyze request\n\n    alt Need Context?\n        L-\u003e\u003eS: Search for relevant code\n        S--\u003e\u003eL: Return search results\n        L-\u003e\u003eL: Process search results\n\n        alt Need More Details?\n            L-\u003e\u003eF: Read specific file\n            F--\u003e\u003eL: Return file content\n            L-\u003e\u003eL: Analyze file content\n        end\n    end\n\n    L-\u003e\u003eL: 
Generate response\n    L--\u003e\u003eU: Send answer with solution\n\n    alt More Questions?\n        U-\u003e\u003eL: Follow-up question\n        Note over U,L: Process repeats...\n    else No More Questions\n        Note over U: ✅ End\n    end\n```\n\n---\n\n## 📝 How to Index Code\n\nThis section explains how to implement the code indexing system. The process requires minimal setup and provides structured code analysis for semantic search.\n\n### **Quick Steps**\n\n1. **Send code to LLM** with specialized system prompt that understands code structure\n2. **Get repository map** in standardized Aider format with key symbols and context\n3. **Save for semantic search** by storing the structured maps in your preferred database\n\n### **Implementation Details**\n\n**Prerequisites:**\n\n- Access to an LLM API (Ollama, OpenAI, Anthropic, etc.)\n- Database for storing repository maps (SQLite, PostgreSQL, MongoDB, etc.)\n- File system access to scan and read source code files\n\n**Key Benefits:**\n\n- **No AST Parsing Required** - Let the LLM handle code understanding\n- **Language Agnostic** - Works with any programming language the LLM understands\n- **Context Aware** - Captures relationships between symbols and their usage\n- **Search Optimized** - Output format is designed for efficient semantic search\n\n### **Example Request**\n\n````json\n{\n  \"model\": \"deepseek-v3.1:671b\",\n  \"messages\": [\n    {\n      \"role\": \"system\",\n      \"content\": \"Convert code to repository map format...\"\n    },\n    {\n      \"role\": \"user\",\n      \"content\": \"File: /src/utils/helper.ts\\n\\n```typescript\\ninte...\"\n    }\n  ],\n  \"format\": {\n    \"type\": \"object\",\n    \"properties\": {\n      \"content\": { \"type\": \"string\" },\n      \"description\": { \"type\": \"string\" },\n      \"keywords\": { \"type\": \"string\" }\n    },\n    \"required\": [\"content\", \"description\", \"keywords\"]\n  }\n}\n````\n\n### **Example Output**\n\n```json\n{\n  
\"content\": \"@@ /src/utils/helper.ts\\n⋮... // User ....\",\n  \"description\": \"User interface and service for managing users\",\n  \"keywords\": \"keyword 1, keyword 2, ...\"\n}\n```\n\n**🤖 [View System Prompt](./prompt/system.md)** - _System prompt for code indexing_  \n**👤 [View User Prompt](./prompt/user.md)** - _User prompt template for code indexing_\n\n---\n\n## 📊 Performance Comparison\n\nOur comprehensive testing across different AI models reveals significant performance variations. The benchmarks were conducted using real-world codebases with varying complexity levels, measuring both processing speed and output accuracy.\n\n**Testing Methodology:**\n\n- **Dataset**: 100+ code files across multiple languages (TypeScript, Python, JavaScript, Go)\n- **Metrics**: Processing time per file and accuracy of generated repository maps\n- **Accuracy**: Measured by manual review of symbol extraction, signature correctness, and context preservation\n- **Test Environment**: MacBook Pro (M3 Pro, 11 cores, 18GB RAM) running macOS 26.0\n\n**📋 [View Testing Documentation](./testing.md)** - _Detailed testing procedures and methodology_\n\n### 🌐 Cloud Models\n\n| Model                             | Speed      | Accuracy | Duration | Content | Think Mode | Recommendation            |\n| --------------------------------- | ---------- | -------- | -------- | ------- | ---------- | ------------------------- |\n| **deepseek-v3.1:671b** (no-think) | 50.85 t/s  | 70%      | 11.99s   | 2,117   | No         | ⭐ **Good Content**       |\n| **deepseek-v3.1:671b** (think)    | 63.58 t/s  | 65%      | 86.74s   | 1,503   | Yes        | ⚠️ **Slow but Detailed**  |\n| **gpt-oss:120b** (no-think)       | 141.99 t/s | 70%      | 7.21s    | 1,364   | No         | ⭐ **Fast \u0026 Good**        |\n| **gpt-oss:120b** (think-high)     | 185.83 t/s | 60%      | 45.66s   | 672     | High       | ⚠️ **Fast but Short**     |\n| **gpt-oss:120b** (think-low)      | 148.75 t/s | 70%      | 4.15s    | 
1,707   | Low        | ⭐ **Best Balance**       |\n| **gpt-oss:120b** (think-medium)   | 170.24 t/s | 65%      | 7.32s    | 1,521   | Medium     | ⭐ **Good Speed/Quality** |\n| **gpt-oss:20b** (no-think)        | 216.40 t/s | 70%      | 37.65s   | 1,174   | No         | ⭐ **Fast Processing**    |\n| **gpt-oss:20b** (think-high)      | 0.00 t/s   | 0%       | 0.00s    | 0       | High       | ❌ **Failed**             |\n| **gpt-oss:20b** (think-low)       | 0.00 t/s   | 0%       | 0.00s    | 0       | Low        | ❌ **Failed**             |\n| **gpt-oss:20b** (think-medium)    | 318.40 t/s | 70%      | 5.27s    | 1,223   | Medium     | ⭐ **Fastest**            |\n| **qwen3-coder:480b** (no-think)   | 72.75 t/s  | 70%      | 8.47s    | 2,223   | No         | ⭐ **Best Content**       |\n\n### 🏠 Offline/Local Models\n\n| Model                              | Speed     | Accuracy | Duration | Content | Think Mode | Recommendation           |\n| ---------------------------------- | --------- | -------- | -------- | ------- | ---------- | ------------------------ |\n| **qwen2.5-coder:1.5b** (no-think)  | 27.63 t/s | 55%      | 11.40s   | 1,055   | No         | ⚠️ **Fair Quality**      |\n| **qwen3:1.7b** (no-think)          | 23.41 t/s | 70%      | 13.54s   | 1,058   | No         | ⭐ **Good Performance**  |\n| **qwen3:1.7b** (think)             | 44.13 t/s | 70%      | 6.00s    | 779     | Yes        | ⭐ **Good Balance**      |\n| **qwen3:4b-instruct** (no-think)   | 14.30 t/s | 70%      | 46.08s   | 2,454   | No         | ⭐ **Good but Slow**     |\n| **qwen3:8b** (no-think)            | 6.57 t/s  | 70%      | 84.40s   | 2,035   | No         | ⚠️ **Very Slow**         |\n| **qwen3:8b** (think)               | 10.00 t/s | 70%      | 54.65s   | 2,040   | Yes        | ⚠️ **Slow but Detailed** |\n| **deepcoder:1.5b** (no-think)      | 39.74 t/s | 30%      | 17.19s   | 787     | No         | ❌ **Poor Performance**  |\n| **deepseek-coder:1.3b** (no-think) | 16.76 t/s | 10%      
| 8.77s    | 175     | No         | ❌ **Very Poor**         |\n\n\u003e **💻 Hardware Specifications**: All offline models were tested on a MacBook Pro (M3 Pro, 11 cores, 18GB RAM) running macOS 26.0\n\n**📁 [View Detailed Comparison Results](./comparison)** - _Raw performance data and test results_\n\n---\n\n## 💡 Pro Tips \u0026 Best Practices\n\n- **Improve Semantic Accuracy**: Filter by nearest path to enhance search precision\n- **Targeted Context**: Use specific file targeting and edit prompts to include relevant context\n- **Cloud Models**: We recommend **GPT-OSS 20B (think-medium)** for fastest processing (318.40 tokens/s)\n- **Best Balance**: Use **GPT-OSS 120B (think-low)** for optimal speed/quality balance (148.75 tokens/s, 70% accuracy)\n- **Offline Capability**: Use **Qwen3:1.7b (think)** for best offline performance (44.13 tokens/s, 70% accuracy)\n- **Content Quality**: Use **Qwen3-coder:480b (no-think)** for richest content (2,223 characters)\n- **Reliability**: Use the **DeepSeek cloud models**, which completed every test run (100% success rate)\n- **Avoid**: **deepcoder:1.5b** and **deepseek-coder:1.3b** performed poorly (30% and 10% accuracy, respectively)\n- **Result**: The LLM receives focused, relevant context without a complex pipeline behind the scenes\n\n\u003e **Implementation Example**: The [`scripts`](./scripts/) folder provides simple examples of how to implement this, using SQLite for the database. In real applications, you can also add database indexes and use more robust filtering algorithms.\n\n\u003e **Pro Tip**: Use root path detection with [`Project-Root`](https://github.com/NeaByteLab/Project-Root) to filter file paths and improve semantic search accuracy. 
This ensures the system gets proper context by matching project paths, resulting in more accurate and relevant content retrieval.\n\n---\n\n## 📄 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n---\n\n## 📚 References\n\n- [Aider Repomap Documentation](https://aider.chat/docs/repomap.html) - _Inspiration for repo map format_\n- [Ollama Tool Support](https://ollama.com/blog/tool-support) - _Tool calling capabilities for retrieval workflow_\n- [Ollama Blog](https://ollama.com/blog) - _Latest features and capabilities_\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneabytelab%2Fai-indexing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fneabytelab%2Fai-indexing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneabytelab%2Fai-indexing/lists"}