{"id":14156547,"url":"https://github.com/definitive-io/code-indexer-loop","last_synced_at":"2025-08-06T02:33:53.694Z","repository":{"id":193369367,"uuid":"688519890","full_name":"definitive-io/code-indexer-loop","owner":"definitive-io","description":"Code Indexer Loop is a Python library for indexing and retrieving source code files through an integrated vector database that's continuously and efficiently updated.","archived":false,"fork":false,"pushed_at":"2024-04-09T11:54:28.000Z","size":42,"stargazers_count":169,"open_issues_count":0,"forks_count":7,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-07-24T01:59:50.687Z","etag":null,"topics":["code-search","embeddings","managed-by-terraform","python","vector-search"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/definitive-io.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-09-07T14:12:02.000Z","updated_at":"2024-07-15T14:07:52.000Z","dependencies_parsed_at":"2024-02-15T02:25:58.633Z","dependency_job_id":"a4e6a3e6-b01d-4ee1-86b4-391414743c1a","html_url":"https://github.com/definitive-io/code-indexer-loop","commit_stats":null,"previous_names":["definitive-io/code-indexer-loop"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/definitive-io%2Fcode-indexer-loop","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/definitive-io%2Fcode-indexer-loop/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/definitive-io%2Fcode-indexer-loop/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/definitive-io%2Fcode-indexer-loop/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/definitive-io","download_url":"https://codeload.github.com/definitive-io/code-indexer-loop/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":215735861,"owners_count":15923388,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["code-search","embeddings","managed-by-terraform","python","vector-search"],"created_at":"2024-08-17T08:06:15.964Z","updated_at":"2024-08-17T08:07:27.160Z","avatar_url":"https://github.com/definitive-io.png","language":"Python","funding_links":[],"categories":["python"],"sub_categories":[],"readme":"# Code Indexer Loop\n\n[![PyPI 
version](https://badge.fury.io/py/code-indexer-loop.svg?v=2)](https://pypi.org/project/code-indexer-loop/)\n[![License](https://img.shields.io/github/license/definitive-io/code-indexer-loop?v=2)](LICENSE)\n[![Forks](https://img.shields.io/github/forks/definitive-io/code-indexer-loop?v=2)](https://github.com/definitive-io/code-indexer-loop/network)\n[![Stars](https://img.shields.io/github/stars/definitive-io/code-indexer-loop?v=2)](https://github.com/definitive-io/code-indexer-loop/stargazers)\n[![Twitter](https://img.shields.io/twitter/url/https/twitter.com?style=social\u0026label=Follow%20%40DefinitiveIO)](https://twitter.com/definitiveio)\n[![Discord](https://dcbadge.vercel.app/api/server/CPJJfq87Vx?compact=true\u0026style=flat)](https://discord.gg/CPJJfq87Vx)\n\n\n**Code Indexer Loop** is a Python library designed to index and retrieve code snippets. \n\nIt uses the useful indexing utilities of the **LlamaIndex** library and the multi-language **tree-sitter** library to parse the code from many popular programming languages. **tiktoken** is used to right-size retrieval based on number of tokens and **LangChain** is used to obtain embeddings (defaults to **OpenAI**'s `text-embedding-ada-002`) and store them in an embedded **ChromaDB** vector database. **watchdog** is used for continuous updating of the index based on file system events.\n\nRead the [launch blog post](https://www.definitive.io/blog/open-sourcing-code-indexer-loop) for more details about why we've built this!\n\n## Installation:\nUse `pip` to install Code Indexer Loop from PyPI.\n```\npip install code-indexer-loop\n```\n\n## Usage:\n1. Import necessary modules:\n```python\nfrom code_indexer_loop.api import CodeIndexer\n```\n2. Create a CodeIndexer object and have it watch for changes:\n```python\nindexer = CodeIndexer(src_dir=\"path/to/code/\", watch=True)\n```\n3. 
Use `.query` to perform a search query:\n```python\nquery = \"pandas\"\nprint(indexer.query(query)[0:30])\n```\n\nNote: make sure the `OPENAI_API_KEY` environment variable is set. This is needed for generating the embeddings.\n\nYou can also use `indexer.query_nodes` to get the nodes of a query or `indexer.query_documents` to receive the entire source code files.\n\nNote that if you edit any of the source code files in the `src_dir` it will efficiently re-index those files using `watchdog` and an `md5`-based caching mechanism. This results in up-to-date embeddings every time you query the index.\n\n## Examples\nCheck out the [basic_usage](examples/basic_usage.ipynb) notebook for a quick overview of the API.\n\n## Token limits\nYou can configure token limits for the chunks through the CodeIndexer constructor:\n\n```python\nindexer = CodeIndexer(\n    src_dir=\"path/to/code/\", watch=True,\n    target_chunk_tokens = 300,\n    max_chunk_tokens = 1000,\n    enforce_max_chunk_tokens = False,\n    coalesce = 50,\n    token_model = \"gpt-4\"\n)\n```\n\nNote you can choose whether the `max_chunk_tokens` is enforced. If it is, an exception is raised when there is no semantic parsing that respects the `max_chunk_tokens`.\n\nThe `coalesce` argument controls the limit of combining smaller chunks into single chunks to avoid having many very small chunks. The unit for `coalesce` is also tokens.\n\n## tree-sitter\nUsing `tree-sitter` for parsing, the chunks are broken only at valid node-level string positions in the source file. This avoids breaking up e.g. function and class definitions.\n\n### Supported languages:\nC, C++, C#, Go, Haskell, Java, Julia, JavaScript, PHP, Python, Ruby, Rust, Scala, Swift, SQL, TypeScript\n\nNote, we're mainly testing Python support. Use other languages at your own peril.\n\n## Contributing\nPull requests are welcome. Please make sure to update tests as appropriate. 
Use tools provided within `dev` dependencies to maintain the code standard.\n\n### Tests\nRun the unit tests by invoking `pytest` in the repository root.\n\n## License\nPlease see the LICENSE file provided with the source code.\n\n## Attribution\nWe'd like to thank Sweep AI for publishing their ideas about code chunking. Read their blog posts about the topic [here](https://docs.sweep.dev/blogs/chunking-2m-files) and [here](https://docs.sweep.dev/blogs/chunking-improvements). The implementation in `code_indexer_loop` is modified from their original implementation mainly to limit based on tokens instead of characters and to achieve perfect document reconstruction (`\"\".join(chunks) == original_source_code`).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdefinitive-io%2Fcode-indexer-loop","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdefinitive-io%2Fcode-indexer-loop","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdefinitive-io%2Fcode-indexer-loop/lists"}