Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/definitive-io/code-indexer-loop
Code Indexer Loop is a Python library for indexing and retrieving source code files through an integrated vector database that's continuously and efficiently updated.
- Host: GitHub
- URL: https://github.com/definitive-io/code-indexer-loop
- Owner: definitive-io
- License: apache-2.0
- Created: 2023-09-07T14:12:02.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-04-09T11:54:28.000Z (6 months ago)
- Last Synced: 2024-07-24T01:59:50.687Z (2 months ago)
- Topics: code-search, embeddings, managed-by-terraform, python, vector-search
- Language: Python
- Homepage:
- Size: 41 KB
- Stars: 169
- Watchers: 4
- Forks: 7
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Code Indexer Loop
[![PyPI version](https://badge.fury.io/py/code-indexer-loop.svg?v=2)](https://pypi.org/project/code-indexer-loop/)
[![License](https://img.shields.io/github/license/definitive-io/code-indexer-loop?v=2)](LICENSE)
[![Forks](https://img.shields.io/github/forks/definitive-io/code-indexer-loop?v=2)](https://github.com/definitive-io/code-indexer-loop/network)
[![Stars](https://img.shields.io/github/stars/definitive-io/code-indexer-loop?v=2)](https://github.com/definitive-io/code-indexer-loop/stargazers)
[![Twitter](https://img.shields.io/twitter/url/https/twitter.com?style=social&label=Follow%20%40DefinitiveIO)](https://twitter.com/definitiveio)
[![Discord](https://dcbadge.vercel.app/api/server/CPJJfq87Vx?compact=true&style=flat)](https://discord.gg/CPJJfq87Vx)

**Code Indexer Loop** is a Python library designed to index and retrieve code snippets.
It uses the indexing utilities of the **LlamaIndex** library and the multi-language **tree-sitter** library to parse code from many popular programming languages. **tiktoken** is used to right-size retrieval based on the number of tokens, and **LangChain** is used to obtain embeddings (defaulting to **OpenAI**'s `text-embedding-ada-002`) and store them in an embedded **ChromaDB** vector database. **watchdog** is used to continuously update the index based on file system events.
Read the [launch blog post](https://www.definitive.io/blog/open-sourcing-code-indexer-loop) for more details about why we've built this!
## Installation:
Use `pip` to install Code Indexer Loop from PyPI.
```
pip install code-indexer-loop
```

## Usage:
1. Import necessary modules:
```python
from code_indexer_loop.api import CodeIndexer
```
2. Create a CodeIndexer object and have it watch for changes:
```python
indexer = CodeIndexer(src_dir="path/to/code/", watch=True)
```
3. Use `.query` to perform a search query:
```python
query = "pandas"
print(indexer.query(query)[0:30])
```

Note: make sure the `OPENAI_API_KEY` environment variable is set. This is needed for generating the embeddings.
You can also use `indexer.query_nodes` to get the nodes of a query or `indexer.query_documents` to receive the entire source code files.
Note that if you edit any of the source code files in `src_dir`, the index will efficiently re-index those files using `watchdog` and an `md5`-based caching mechanism. This gives you up-to-date embeddings every time you query the index.
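The caching idea behind that re-indexing step can be sketched roughly as follows. This is a minimal illustration, not the library's actual implementation; `changed_files` and the in-memory `cache` dict are hypothetical names:

```python
import hashlib
from pathlib import Path

def file_md5(path: Path) -> str:
    """Hash a file's bytes; a changed digest signals the file needs re-embedding."""
    return hashlib.md5(path.read_bytes()).hexdigest()

def changed_files(src_dir: str, cache: dict) -> list:
    """Return source files whose content hash differs from the cached value,
    updating the cache so unchanged files are skipped on the next pass."""
    changed = []
    for path in sorted(Path(src_dir).rglob("*.py")):
        digest = file_md5(path)
        if cache.get(str(path)) != digest:
            cache[str(path)] = digest
            changed.append(path)
    return changed
```

In the real library, `watchdog` file system events trigger this kind of check, so only modified files are re-chunked and re-embedded.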
## Examples
Check out the [basic_usage](examples/basic_usage.ipynb) notebook for a quick overview of the API.

## Token limits
You can configure token limits for the chunks through the `CodeIndexer` constructor:

```python
indexer = CodeIndexer(
src_dir="path/to/code/", watch=True,
target_chunk_tokens = 300,
max_chunk_tokens = 1000,
enforce_max_chunk_tokens = False,
    coalesce = 50,
token_model = "gpt-4"
)
```

Note that you can choose whether `max_chunk_tokens` is enforced. If it is, an exception is raised whenever no semantic parsing can respect `max_chunk_tokens`.
The `coalesce` argument sets the threshold, also in tokens, below which smaller chunks are combined into a single chunk, avoiding many very small chunks.
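The coalescing idea can be sketched like this. It is an illustration only, not the library's implementation: the real chunker counts tokens with `tiktoken`, which is approximated here by a caller-supplied `count_tokens` function:

```python
def coalesce_chunks(chunks, count_tokens, coalesce):
    """Greedily merge adjacent chunks while the merged size stays within
    `coalesce` tokens, collapsing many tiny chunks into fewer larger ones.
    Plain concatenation preserves "".join(merged) == "".join(chunks)."""
    merged = []
    for chunk in chunks:
        if merged and count_tokens(merged[-1]) + count_tokens(chunk) <= coalesce:
            merged[-1] += chunk
        else:
            merged.append(chunk)
    return merged
```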
## tree-sitter
Using `tree-sitter` for parsing, the chunks are broken only at valid node-level string positions in the source file. This avoids breaking up e.g. function and class definitions.

### Supported languages:
C, C++, C#, Go, Haskell, Java, Julia, JavaScript, PHP, Python, Ruby, Rust, Scala, Swift, SQL, TypeScript

Note: we're mainly testing Python support. Use other languages at your own peril.
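The node-boundary idea above can be illustrated with Python's built-in `ast` module as a stand-in for tree-sitter. This is a simplified sketch: it only splits at top-level statement boundaries, whereas the real chunker recurses into the syntax tree and also enforces token limits:

```python
import ast

def chunk_at_top_level(source: str) -> list:
    """Split source at top-level statement boundaries, so no chunk cuts
    through a function or class body. Joining the chunks reconstructs
    the original source exactly."""
    lines = source.splitlines(keepends=True)
    # Line numbers where each top-level statement after the first begins.
    boundaries = [node.lineno - 1 for node in ast.parse(source).body[1:]]
    chunks, prev = [], 0
    for b in boundaries:
        chunks.append("".join(lines[prev:b]))
        prev = b
    chunks.append("".join(lines[prev:]))
    return chunks
```

Like the real chunker, this sketch keeps `"".join(chunks) == original_source_code`.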
## Contributing
Pull requests are welcome. Please make sure to update tests as appropriate. Use the tools provided in the `dev` dependencies to maintain the code standard.

### Tests
Run the unit tests by invoking `pytest` in the repository root.

## License
Please see the LICENSE file provided with the source code.

## Attribution
We'd like to thank Sweep AI for publishing their ideas about code chunking. Read their blog posts about the topic [here](https://docs.sweep.dev/blogs/chunking-2m-files) and [here](https://docs.sweep.dev/blogs/chunking-improvements). The implementation in `code_indexer_loop` is modified from their original implementation, mainly to limit chunks based on tokens instead of characters and to achieve perfect document reconstruction (`"".join(chunks) == original_source_code`).