Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
GitHub Semantic Search Optimizer
https://github.com/dewitt4/github-semantic-search
- Host: GitHub
- URL: https://github.com/dewitt4/github-semantic-search
- Owner: dewitt4
- License: mit
- Created: 2024-12-03T01:13:43.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2024-12-03T01:24:29.000Z (about 1 month ago)
- Last Synced: 2024-12-03T02:26:31.868Z (about 1 month ago)
- Language: Python
- Size: 15.6 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# GitHub Semantic Search Indexer
A powerful tool that enables semantic search capabilities for GitHub repositories. Instead of just searching for exact matches, this tool understands the meaning behind your queries and finds relevant code and documentation across your repository.
## Features
- 🔍 Semantic search using OpenAI's text-embedding-3-small model
- 💾 Efficient vector storage with ChromaDB
- 🔄 Automatic repository cloning and updating
- 📄 Support for multiple file types including Python, JavaScript, Java, and more
- 📦 Intelligent text chunking for better search results
- 🚀 Batch processing for better performance
- 📊 Progress tracking for long-running operations

## Installation
1. Clone this repository:
```bash
git clone https://github.com/dewitt4/github-semantic-search
cd github-semantic-search
```

2. Install the required packages:
```bash
pip install chromadb openai tiktoken gitpython tqdm
```

3. Set up your OpenAI API key:
```bash
export OPENAI_API_KEY='your-api-key'
```

## Usage
### Basic Usage
```python
from github_search import GitHubSearchIndexer

# Initialize the indexer
indexer = GitHubSearchIndexer(
    repo_url="https://github.com/username/repository",
    target_dir="repo_data"
)

# Index the repository
indexer.index_repository()

# Search the repository
results = indexer.search("database connection handling")

# Print results
for result in results:
    print(f"\nFile: {result['path']}")
    print(f"Relevance Score: {1 - result['distance']:.4f}")
    print("Content snippet:")
    print(result['content'][:200] + "...")
```

### Customizing Search
You can customize the number of results:
```python
# Get top 10 results
results = indexer.search("database connection handling", n_results=10)
```

### Supported File Types
The indexer processes the following file extensions:
- Python (.py)
- JavaScript (.js, .jsx)
- TypeScript (.tsx)
- Java (.java)
- C++ (.cpp, .h)
- C# (.cs)
- Ruby (.rb)
- PHP (.php)
- Go (.go)
- Rust (.rs)
- Documentation (.md, .rst, .txt)
- Configuration files (.json, .yml, .yaml, .toml, .ini)
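As a rough picture of how such an allow-list might be applied while walking the cloned repository, here is an illustrative sketch (the `SUPPORTED_EXTENSIONS` set and `iter_indexable_files` helper are hypothetical, not part of the project's API):

```python
from pathlib import Path

# Hypothetical allow-list mirroring the extensions above
SUPPORTED_EXTENSIONS = {
    ".py", ".js", ".jsx", ".tsx", ".java", ".cpp", ".h", ".cs", ".rb",
    ".php", ".go", ".rs",
    ".md", ".rst", ".txt",
    ".json", ".yml", ".yaml", ".toml", ".ini",
}

def iter_indexable_files(repo_dir: str):
    """Yield files under repo_dir whose extension is in the allow-list."""
    for path in Path(repo_dir).rglob("*"):
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS:
            yield path
```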
## How It Works

1. **Repository Cloning**: The tool clones the target repository or updates it if it already exists.
2. **File Processing**:
   - Files are read and processed based on their file extensions
   - Large files are automatically split into smaller chunks for better search accuracy
   - Text is encoded using the cl100k_base tokenizer
3. **Embedding Generation**:
   - Each chunk of text is converted into embeddings using OpenAI's text-embedding-3-small model
   - Embeddings capture the semantic meaning of the code and documentation
4. **Vector Storage**:
   - Embeddings are stored efficiently using ChromaDB
   - Enables fast similarity search and retrieval
5. **Search**:
   - Queries are converted to embeddings
   - ChromaDB finds the most similar chunks based on embedding similarity
   - Results are returned with relevance scores and file metadata (see the sketch below)
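If you want a feel for what these steps look like in code, the following minimal sketch wires the same pipeline together directly with `tiktoken`, the `openai` Python client (v1+), and `chromadb`. It illustrates the approach described above, not the project's actual implementation; the chunk size, file path, and collection name are assumptions:

```python
import tiktoken
import chromadb
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
enc = tiktoken.get_encoding("cl100k_base")

# 1. Chunk a file into ~512-token pieces (chunk size is an assumption)
text = open("example.py", encoding="utf-8").read()
tokens = enc.encode(text)
chunks = [enc.decode(tokens[i:i + 512]) for i in range(0, len(tokens), 512)]

# 2. Embed each chunk with text-embedding-3-small
resp = client.embeddings.create(model="text-embedding-3-small", input=chunks)
embeddings = [item.embedding for item in resp.data]

# 3. Store chunks and embeddings in a persistent ChromaDB collection
store = chromadb.PersistentClient(path="repo_data/chroma")
collection = store.get_or_create_collection("repo_chunks")
collection.add(
    ids=[f"example.py-{i}" for i in range(len(chunks))],
    embeddings=embeddings,
    documents=chunks,
    metadatas=[{"path": "example.py"} for _ in chunks],
)

# 4. Search: embed the query and retrieve the closest chunks
query = client.embeddings.create(
    model="text-embedding-3-small", input=["database connection handling"]
)
hits = collection.query(query_embeddings=[query.data[0].embedding], n_results=5)
for meta, distance in zip(hits["metadatas"][0], hits["distances"][0]):
    print(f"{1 - distance:.4f}  {meta['path']}")  # distance converted to a relevance score
```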
## Configuration

The indexer accepts several parameters:
```python
GitHubSearchIndexer(
    repo_url: str,                  # URL of the GitHub repository
    target_dir: str = "repo_data",  # Directory to store repository and index
)
```

Additional configuration during indexing:
```python
indexer.index_repository(
    batch_size: int = 100  # Number of documents to process in each batch
)
```
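For example, both options can be combined in a single run; the repository URL and batch size below are placeholders:

```python
from github_search import GitHubSearchIndexer

indexer = GitHubSearchIndexer(
    repo_url="https://github.com/username/repository",
    target_dir="repo_data",
)
indexer.index_repository(batch_size=50)  # process 50 documents per batch
```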
## Performance Considerations

- The initial indexing process may take some time depending on:
  - Repository size
  - Number of files
  - OpenAI API response time
- Subsequent updates are faster as they only process changed files
- Search queries are typically very fast due to ChromaDB's efficient similarity search

## Dependencies
- `chromadb`: Vector storage and similarity search
- `openai`: Text embedding generation
- `tiktoken`: Text tokenization
- `gitpython`: Repository management
- `tqdm`: Progress tracking

## Error Handling
The tool includes robust error handling:
- Gracefully handles different file encodings
- Skips problematic files with error reporting
- Validates API key availability
- Provides clear error messages for common issues
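As a rough sketch of what encoding fallback, skip-on-error, and API-key validation can look like, assuming nothing about the project's internals (the helper name and the Latin-1 fallback are illustrative):

```python
import os

def read_text_safely(path: str):
    """Read a file as text, trying UTF-8 then Latin-1; skip it if both fail."""
    for encoding in ("utf-8", "latin-1"):
        try:
            with open(path, "r", encoding=encoding) as handle:
                return handle.read()
        except UnicodeDecodeError:
            continue  # try the next encoding
        except OSError as exc:
            print(f"Skipping {path}: {exc}")
            return None
    print(f"Skipping {path}: could not decode")
    return None

# Fail fast with a clear message if the API key is missing
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; see the Installation section")
```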
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- Uses OpenAI's text-embedding-3-small model for high-quality embeddings
- Built with ChromaDB for efficient vector storage
- Inspired by the need for better code search tools