{"id":23749383,"url":"https://github.com/dewitt4/github-semantic-search","last_synced_at":"2026-03-11T02:30:18.551Z","repository":{"id":266184385,"uuid":"897637771","full_name":"dewitt4/github-semantic-search","owner":"dewitt4","description":"GitHub Semantic Search Optimizer","archived":false,"fork":false,"pushed_at":"2024-12-03T01:24:29.000Z","size":16,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-12-31T15:18:29.647Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dewitt4.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-03T01:13:43.000Z","updated_at":"2024-12-03T01:27:32.000Z","dependencies_parsed_at":"2024-12-03T02:38:52.522Z","dependency_job_id":null,"html_url":"https://github.com/dewitt4/github-semantic-search","commit_stats":null,"previous_names":["dewitt4/github-semantic-search"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dewitt4%2Fgithub-semantic-search","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dewitt4%2Fgithub-semantic-search/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dewitt4%2Fgithub-semantic-search/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dewitt4%2Fgithub-semantic-search/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/de
witt4","download_url":"https://codeload.github.com/dewitt4/github-semantic-search/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239908988,"owners_count":19716891,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-31T15:18:33.433Z","updated_at":"2025-02-20T20:26:12.998Z","avatar_url":"https://github.com/dewitt4.png","language":"Python","readme":"\n# GitHub Semantic Search Indexer\n\nA powerful tool that enables semantic search capabilities for GitHub repositories. Instead of just searching for exact matches, this tool understands the meaning behind your queries and finds relevant code and documentation across your repository.\n\n## Features\n\n- 🔍 Semantic search using OpenAI's text-embedding-3-small model\n- 💾 Efficient vector storage with ChromaDB\n- 🔄 Automatic repository cloning and updating\n- 📄 Support for multiple file types including Python, JavaScript, Java, and more\n- 📦 Intelligent text chunking for better search results\n- 🚀 Batch processing for better performance\n- 📊 Progress tracking for long-running operations\n\n## Installation\n\n1. Clone this repository:\n```bash\ngit clone https://github.com/dewitt4/github-semantic-search\ncd github-semantic-search\n```\n\n2. Install the required packages:\n```bash\npip install chromadb openai tiktoken gitpython tqdm\n```\n\n3. 
Set up your OpenAI API key:\n```bash\nexport OPENAI_API_KEY='your-api-key'\n```\n\n## Usage\n\n### Basic Usage\n\n```python\nfrom github_search import GitHubSearchIndexer\n\n# Initialize the indexer\nindexer = GitHubSearchIndexer(\n    repo_url=\"https://github.com/username/repository\",\n    target_dir=\"repo_data\"\n)\n\n# Index the repository\nindexer.index_repository()\n\n# Search the repository\nresults = indexer.search(\"database connection handling\")\n\n# Print results\nfor result in results:\n    print(f\"\\nFile: {result['path']}\")\n    print(f\"Relevance Score: {1 - result['distance']:.4f}\")\n    print(\"Content snippet:\")\n    print(result['content'][:200] + \"...\")\n```\n\n### Customizing Search\n\nYou can customize the number of results:\n\n```python\n# Get top 10 results\nresults = indexer.search(\"database connection handling\", n_results=10)\n```\n\n### Supported File Types\n\nThe indexer processes the following file extensions:\n- Python (.py)\n- JavaScript (.js, .jsx)\n- TypeScript (.tsx)\n- Java (.java)\n- C++ (.cpp, .h)\n- C# (.cs)\n- Ruby (.rb)\n- PHP (.php)\n- Go (.go)\n- Rust (.rs)\n- Documentation (.md, .rst, .txt)\n- Configuration files (.json, .yml, .yaml, .toml, .ini)\n\n## How It Works\n\n1. **Repository Cloning**: The tool clones the target repository or updates it if it already exists.\n\n2. **File Processing**: \n   - Files are read and processed based on their file extensions\n   - Large files are automatically split into smaller chunks for better search accuracy\n   - Text is encoded using the cl100k_base tokenizer\n\n3. **Embedding Generation**: \n   - Each chunk of text is converted into embeddings using OpenAI's text-embedding-3-small model\n   - Embeddings capture the semantic meaning of the code and documentation\n\n4. **Vector Storage**: \n   - Embeddings are stored efficiently using ChromaDB\n   - Enables fast similarity search and retrieval\n\n5. 
**Search**: \n   - Queries are converted to embeddings\n   - ChromaDB finds the most similar chunks based on embedding similarity\n   - Results are returned with relevance scores and file metadata\n\n## Configuration\n\nThe indexer constructor accepts two parameters:\n\n```python\nGitHubSearchIndexer(\n    repo_url=\"https://github.com/username/repository\",  # str: URL of the GitHub repository (required)\n    target_dir=\"repo_data\",  # str: directory to store repository and index (default: \"repo_data\")\n)\n```\n\nAdditional configuration during indexing:\n\n```python\nindexer.index_repository(\n    batch_size=100  # int: number of documents to process in each batch (default: 100)\n)\n```\n\n## Performance Considerations\n\n- The initial indexing process may take some time depending on:\n  - Repository size\n  - Number of files\n  - OpenAI API response time\n- Subsequent updates are faster as they only process changed files\n- Search queries are typically very fast due to ChromaDB's efficient similarity search\n\n## Dependencies\n\n- `chromadb`: Vector storage and similarity search\n- `openai`: Text embedding generation\n- `tiktoken`: Text tokenization\n- `gitpython`: Repository management\n- `tqdm`: Progress tracking\n\n## Error Handling\n\nThe tool includes robust error handling:\n- Gracefully handles different file encodings\n- Skips problematic files with error reporting\n- Validates API key availability\n- Provides clear error messages for common issues\n\n## Contributing\n\nContributions are welcome! 
Please feel free to submit a Pull Request.\n\n## License\n\nThis project is licensed under the MIT License - see the LICENSE file for details.\n\n## Acknowledgments\n\n- Uses OpenAI's text-embedding-3-small model for high-quality embeddings\n- Built with ChromaDB for efficient vector storage\n- Inspired by the need for better code search tools\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdewitt4%2Fgithub-semantic-search","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdewitt4%2Fgithub-semantic-search","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdewitt4%2Fgithub-semantic-search/lists"}