https://github.com/cocoindex-io/cocoindex-code
A super light-weight embedded code search engine CLI (AST based) that just works - saves 70% token and improves speed for coding agent π Star if you like it!
https://github.com/cocoindex-io/cocoindex-code
agents ast cocoindex code-search coding-agent context-engineering indexing mcp python tree-sitter
Last synced: 1 day ago
JSON representation
A super light-weight embedded code search engine CLI (AST based) that just works - saves 70% token and improves speed for coding agent π Star if you like it!
- Host: GitHub
- URL: https://github.com/cocoindex-io/cocoindex-code
- Owner: cocoindex-io
- License: apache-2.0
- Created: 2026-02-01T02:16:56.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-03-21T04:43:36.000Z (13 days ago)
- Last Synced: 2026-03-21T09:47:46.224Z (13 days ago)
- Topics: agents, ast, cocoindex, code-search, coding-agent, context-engineering, indexing, mcp, python, tree-sitter
- Language: Python
- Homepage: https://cocoindex.io/
- Size: 791 KB
- Stars: 1,078
- Watchers: 10
- Forks: 71
- Open Issues: 15
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
Awesome Lists containing this project
- awesome-mcp-list - cocoindex-io/cocoindex-code - io/cocoindex-code?style=social)](https://github.com/cocoindex-io/cocoindex-code): A super light-weight embedded MCP server for coding agents. Uses tree-sitter for code search and indexing, saves 70% tokens and improves speed. (Uncategorized / Uncategorized)
README
AST-based semantic code search that just works

A lightweight, effective **(AST-based)** semantic code search tool for your codebase. Built on [CocoIndex](https://github.com/cocoindex-io/cocoindex) β a Rust-based ultra performant data transformation engine. Use it from the CLI, or integrate with Claude, Codex, Cursor β any coding agent β via [Skill](#skill-recommended) or [MCP](#mcp-server).
- Instant token saving by 70%.
- **1 min setup** β install and go, zero config needed!
[](https://discord.com/invite/zpA9S2DR7s)
[](https://github.com/cocoindex-io/cocoindex)
[](https://cocoindex.io/docs/getting_started/quickstart)
[](https://opensource.org/licenses/Apache-2.0)
[](https://pepy.tech/projects/cocoindex)
[](https://github.com/cocoindex-io/cocoindex/actions/workflows/CI.yml)
[](https://github.com/cocoindex-io/cocoindex/actions/workflows/release.yml)
π Please help star [CocoIndex](https://github.com/cocoindex-io/cocoindex) if you like this project!
[Deutsch](https://readme-i18n.com/cocoindex-io/cocoindex-code?lang=de) |
[English](https://readme-i18n.com/cocoindex-io/cocoindex-code?lang=en) |
[EspaΓ±ol](https://readme-i18n.com/cocoindex-io/cocoindex-code?lang=es) |
[franΓ§ais](https://readme-i18n.com/cocoindex-io/cocoindex-code?lang=fr) |
[ζ₯ζ¬θͺ](https://readme-i18n.com/cocoindex-io/cocoindex-code?lang=ja) |
[νκ΅μ΄](https://readme-i18n.com/cocoindex-io/cocoindex-code?lang=ko) |
[PortuguΓͺs](https://readme-i18n.com/cocoindex-io/cocoindex-code?lang=pt) |
[Π ΡΡΡΠΊΠΈΠΉ](https://readme-i18n.com/cocoindex-io/cocoindex-code?lang=ru) |
[δΈζ](https://readme-i18n.com/cocoindex-io/cocoindex-code?lang=zh)
## Get Started β zero config, let's go!
### Install
Using [pipx](https://pipx.pypa.io/stable/installation/):
```bash
pipx install cocoindex-code # first install
pipx upgrade cocoindex-code # upgrade
```
Using [uv](https://docs.astral.sh/uv/getting-started/installation/):
```bash
uv tool install --upgrade cocoindex-code --prerelease explicit --with "cocoindex>=1.0.0a24"
```
The default embedding model runs locally ([sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)) β no API key required, completely free.
Next, set up your [coding agent integration](#coding-agent-integration) β or jump to [Manual CLI Usage](#manual-cli-usage) if you prefer direct control.
## Coding Agent Integration
### Skill (Recommended)
Install the `ccc` skill so your coding agent automatically uses semantic search when needed:
```bash
npx skills add cocoindex-io/cocoindex-code
```
That's it β no `ccc init` or `ccc index` needed. The skill teaches the agent to handle initialization, indexing, and searching on its own. It will automatically keep the index up to date as you work.
The agent uses semantic search automatically when it would be helpful. You can also nudge it explicitly β just ask it to search the codebase, e.g. *"find how user sessions are managed"*, or type `/ccc` to invoke the skill directly.
Works with [Claude Code](https://docs.anthropic.com/en/docs/claude-code) and other skill-compatible agents.
### MCP Server
Alternatively, use `ccc mcp` to run as an MCP server:
Claude Code
```bash
claude mcp add cocoindex-code -- ccc mcp
```
Codex
```bash
codex mcp add cocoindex-code -- ccc mcp
```
OpenCode
```bash
opencode mcp add
```
Enter MCP server name: `cocoindex-code`
Select MCP server type: `local`
Enter command to run: `ccc mcp`
Or use opencode.json:
```json
{
"$schema": "https://opencode.ai/config.json",
"mcp": {
"cocoindex-code": {
"type": "local",
"command": [
"ccc", "mcp"
]
}
}
}
```
Once configured, the agent automatically decides when semantic code search is helpful β finding code by description, exploring unfamiliar codebases, fuzzy/conceptual matches, or locating implementations without knowing exact names.
> **Note:** The `cocoindex-code` command (without subcommand) still works as an MCP server for backward compatibility. It auto-creates settings from environment variables on first run.
MCP Tool Reference
When running as an MCP server (`ccc mcp`), the following tool is exposed:
**`search`** β Search the codebase using semantic similarity.
```
search(
query: str, # Natural language query or code snippet
limit: int = 5, # Maximum results (1-100)
offset: int = 0, # Pagination offset
refresh_index: bool = True, # Refresh index before querying
languages: list[str] | None = None, # Filter by language (e.g. ["python", "typescript"])
paths: list[str] | None = None, # Filter by path glob (e.g. ["src/utils/*"])
)
```
Returns matching code chunks with file path, language, code content, line numbers, and similarity score.
## Manual CLI Usage
You can also use the CLI directly β useful for manual control, running indexing after changing settings, checking status, or searching outside an agent.
```bash
ccc init # initialize project (creates settings)
ccc index # build the index
ccc search "authentication logic" # search!
```
The background daemon starts automatically on first use.
> **Tip:** `ccc index` auto-initializes if you haven't run `ccc init` yet, so you can skip straight to indexing.
### CLI Reference
| Command | Description |
|---------|-------------|
| `ccc init` | Initialize a project β creates settings files, adds `.cocoindex_code/` to `.gitignore` |
| `ccc index` | Build or update the index (auto-inits if needed). Shows streaming progress. |
| `ccc search ` | Semantic search across the codebase |
| `ccc status` | Show index stats (chunk count, file count, language breakdown) |
| `ccc mcp` | Run as MCP server in stdio mode |
| `ccc doctor` | Run diagnostics β checks settings, daemon, model, file matching, and index health |
| `ccc reset` | Delete index databases. `--all` also removes settings. `-f` skips confirmation. |
| `ccc daemon status` | Show daemon version, uptime, and loaded projects |
| `ccc daemon restart` | Restart the background daemon |
| `ccc daemon stop` | Stop the daemon |
### Search Options
```bash
ccc search database schema # basic search
ccc search --lang python --lang markdown schema # filter by language
ccc search --path 'src/utils/*' query handler # filter by path
ccc search --offset 10 --limit 5 database schema # pagination
ccc search --refresh database schema # update index first, then search
```
By default, `ccc search` scopes results to your current working directory (relative to the project root). Use `--path` to override.
## Docker
A Docker image is available for teams who want a reproducible, dependency-free
setup β no Python, `uv`, or system dependencies required on the host.
The recommended approach is a **persistent container**: start it once, and use
`docker exec` to run CLI commands or connect MCP sessions to it. The daemon
inside stays warm across sessions, so the embedding model is loaded only once.
### Step 1 β Start the container
```bash
docker run -d --name cocoindex-code \
--volume "$(pwd):/workspace" \
--volume cocoindex-db:/db \
--volume cocoindex-model-cache:/root/.cache \
ghcr.io/cocoindex-io/cocoindex-code:latest
```
- `/workspace` β mount your project root here
- `cocoindex-db` β index databases live inside the container (fast native I/O, no cross-OS volume issues)
- `cocoindex-model-cache` β persists the embedding model across image upgrades
### Step 2 β Index your codebase
```bash
docker exec -it cocoindex-code ccc index
```
### Step 3 β Connect your coding agent
Claude Code
```bash
claude mcp add cocoindex-code -- docker exec -i cocoindex-code ccc mcp
```
Or via `.mcp.json`:
```json
{
"mcpServers": {
"cocoindex-code": {
"type": "stdio",
"command": "docker",
"args": ["exec", "-i", "cocoindex-code", "ccc", "mcp"]
}
}
}
```
Codex
```bash
codex mcp add cocoindex-code -- docker exec -i cocoindex-code ccc mcp
```
### CLI usage inside the container
All `ccc` commands work via `docker exec`:
```bash
docker exec -it cocoindex-code ccc index
docker exec -it cocoindex-code ccc search "authentication logic"
docker exec -it cocoindex-code ccc status
```
Or set an alias on your host so it feels native:
```bash
alias ccc='docker exec -it cocoindex-code ccc'
```
### Configuration via environment variables
Pass configuration to `docker run` with `-e`:
```bash
# Extra extensions (e.g. Typesafe Config, SBT build files)
-e COCOINDEX_CODE_EXTRA_EXTENSIONS="conf,sbt"
# Exclude build artefacts (Scala/SBT example)
-e COCOINDEX_CODE_EXCLUDE_PATTERNS='["**/target/**","**/.bloop/**","**/.metals/**"]'
# Swap in a code-optimised embedding model
-e COCOINDEX_CODE_EMBEDDING_MODEL=voyage/voyage-code-3
-e VOYAGE_API_KEY=your-key
```
### Build the image locally
```bash
docker build -t cocoindex-code:local -f docker/Dockerfile .
```
## Features
- **Semantic Code Search**: Find relevant code using natural language queries when grep doesn't work well, and save tokens immediately.
- **Ultra Performant**: β‘ Built on top of ultra performant [Rust indexing engine](https://github.com/cocoindex-io/cocoindex). Only re-indexes changed files for fast updates.
- **Multi-Language Support**: Python, JavaScript/TypeScript, Rust, Go, Java, C/C++, C#, SQL, Shell, and more.
- **Embedded**: Portable and just works, no database setup required!
- **Flexible Embeddings**: Local SentenceTransformers by default (free!) or 100+ cloud providers via LiteLLM.
## Configuration
Configuration lives in two YAML files, both created automatically by `ccc init`.
### User Settings (`~/.cocoindex_code/global_settings.yml`)
Shared across all projects. Controls the embedding model and environment variables for the daemon.
```yaml
embedding:
provider: sentence-transformers # or "litellm"
model: sentence-transformers/all-MiniLM-L6-v2
device: mps # optional: cpu, cuda, mps (auto-detected if omitted)
envs: # extra environment variables for the daemon
OPENAI_API_KEY: your-key # only needed if not already in your shell environment
```
> **Note:** The daemon inherits your shell environment. If an API key (e.g. `OPENAI_API_KEY`) is already set as an environment variable, you don't need to duplicate it in `envs`. The `envs` field is only for values that aren't in your environment.
### Project Settings (`/.cocoindex_code/settings.yml`)
Per-project. Controls which files to index.
```yaml
include_patterns:
- "**/*.py"
- "**/*.js"
- "**/*.ts"
- "**/*.rs"
- "**/*.go"
# ... (sensible defaults for 28+ file types)
exclude_patterns:
- "**/.*" # hidden directories
- "**/__pycache__"
- "**/node_modules"
- "**/dist"
# ...
language_overrides:
- ext: inc # treat .inc files as PHP
lang: php
chunkers:
- ext: toml # use a custom chunker for .toml files
module: example_toml_chunker:toml_chunker
```
> `.cocoindex_code/` is automatically added to `.gitignore` during init.
Use `chunkers` when you want to control how a file type is split into chunks before indexing.
`module: example_toml_chunker:toml_chunker` means:
- `example_toml_chunker` is a local Python module
- `toml_chunker` is the function inside that module
In practice, this usually means:
- you create a Python file in your project, for example `example_toml_chunker.py`
- you add a function in that file
- you point `settings.yml` at it with `module.path:function_name`
The function should use this signature:
```python
from pathlib import Path
from cocoindex_code.chunking import Chunk
def my_chunker(path: Path, content: str) -> tuple[str | None, list[Chunk]]:
...
```
- `path` is the file being indexed
- `content` is the full text of that file
- return `language_override` as a string like `"toml"` if you want to override language detection
- return `None` as `language_override` if you want to keep the detected language
- return a `list[Chunk]` with the chunks you want stored in the index
See [`src/cocoindex_code/chunking.py`](./src/cocoindex_code/chunking.py) for the public types and [`tests/example_toml_chunker.py`](./tests/example_toml_chunker.py) for a complete example.
## Embedding Models
By default, a local SentenceTransformers model ([sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)) is used β no API key required. To use a different model, edit `~/.cocoindex_code/global_settings.yml`.
> The `envs` entries below are only needed if the key isn't already in your shell environment β the daemon inherits your environment automatically.
Ollama (Local)
```yaml
embedding:
model: ollama/nomic-embed-text
```
Set `OLLAMA_API_BASE` in `envs:` if your Ollama server is not at `http://localhost:11434`.
OpenAI
```yaml
embedding:
model: text-embedding-3-small
envs:
OPENAI_API_KEY: your-api-key
```
Azure OpenAI
```yaml
embedding:
model: azure/your-deployment-name
envs:
AZURE_API_KEY: your-api-key
AZURE_API_BASE: https://your-resource.openai.azure.com
AZURE_API_VERSION: "2024-06-01"
```
Gemini
```yaml
embedding:
model: gemini/gemini-embedding-001
envs:
GEMINI_API_KEY: your-api-key
```
Mistral
```yaml
embedding:
model: mistral/mistral-embed
envs:
MISTRAL_API_KEY: your-api-key
```
Voyage (Code-Optimized)
```yaml
embedding:
model: voyage/voyage-code-3
envs:
VOYAGE_API_KEY: your-api-key
```
Cohere
```yaml
embedding:
model: cohere/embed-v4.0
envs:
COHERE_API_KEY: your-api-key
```
AWS Bedrock
```yaml
embedding:
model: bedrock/amazon.titan-embed-text-v2:0
envs:
AWS_ACCESS_KEY_ID: your-access-key
AWS_SECRET_ACCESS_KEY: your-secret-key
AWS_REGION_NAME: us-east-1
```
Nebius
```yaml
embedding:
model: nebius/BAAI/bge-en-icl
envs:
NEBIUS_API_KEY: your-api-key
```
Any [LiteLLM-supported model](https://docs.litellm.ai/docs/embedding/supported_embedding) works. When using a LiteLLM model, set `provider: litellm` (or omit `provider` β LiteLLM is the default for non-`sentence-transformers` models).
### Local SentenceTransformers Models
Set `provider: sentence-transformers` and use any [SentenceTransformers](https://www.sbert.net/) model (no API key required).
**Example β general purpose text model:**
```yaml
embedding:
provider: sentence-transformers
model: nomic-ai/nomic-embed-text-v1.5
```
**GPU-optimised code retrieval:**
[`nomic-ai/CodeRankEmbed`](https://huggingface.co/nomic-ai/CodeRankEmbed) delivers significantly better code retrieval than the default model. It is 137M parameters, requires ~1 GB VRAM, and has an 8192-token context window.
```yaml
embedding:
provider: sentence-transformers
model: nomic-ai/CodeRankEmbed
```
**Note:** Switching models requires re-indexing your codebase (`ccc reset && ccc index`) since the vector dimensions differ.
## Supported Languages
| Language | Aliases | File Extensions |
|----------|---------|-----------------|
| c | | `.c` |
| cpp | c++ | `.cpp`, `.cc`, `.cxx`, `.h`, `.hpp` |
| csharp | csharp, cs | `.cs` |
| css | | `.css`, `.scss` |
| dtd | | `.dtd` |
| fortran | f, f90, f95, f03 | `.f`, `.f90`, `.f95`, `.f03` |
| go | golang | `.go` |
| html | | `.html`, `.htm` |
| java | | `.java` |
| javascript | js | `.js` |
| json | | `.json` |
| kotlin | | `.kt`, `.kts` |
| lua | | `.lua` |
| markdown | md | `.md`, `.mdx` |
| pascal | pas, dpr, delphi | `.pas`, `.dpr` |
| php | | `.php` |
| python | | `.py` |
| r | | `.r` |
| ruby | | `.rb` |
| rust | rs | `.rs` |
| scala | | `.scala` |
| solidity | | `.sol` |
| sql | | `.sql` |
| swift | | `.swift` |
| toml | | `.toml` |
| tsx | | `.tsx` |
| typescript | ts | `.ts` |
| xml | | `.xml` |
| yaml | | `.yaml`, `.yml` |
### Custom Database Location
By default, index databases (`cocoindex.db` and `target_sqlite.db`) live alongside settings in `/.cocoindex_code/`. When running in Docker, you may want the databases on the container's native filesystem for performance (LMDB doesn't work well on mounted volumes) while keeping the source code and settings on a mounted volume.
Set `COCOINDEX_CODE_DB_PATH_MAPPING` to remap database locations by path prefix:
```bash
COCOINDEX_CODE_DB_PATH_MAPPING=/workspace=/db-files
```
With this mapping, a project at `/workspace/myrepo` stores its databases in `/db-files/myrepo/` instead of `/workspace/myrepo/.cocoindex_code/`. Settings files remain in the original location.
Multiple mappings are comma-separated and resolved in order (first match wins):
```bash
COCOINDEX_CODE_DB_PATH_MAPPING=/workspace=/db-files,/workspace2=/db-files2
```
Both source and target must be absolute paths. If no mapping matches, the default location is used.
## Troubleshooting
Run `ccc doctor` to diagnose common issues. It checks your settings, daemon health, embedding model, file matching, and index status β all in one command.
### `sqlite3.Connection object has no attribute enable_load_extension`
Some Python installations (e.g. the one pre-installed on macOS) ship with a SQLite library that doesn't enable extensions.
**macOS fix:** Install Python through [Homebrew](https://brew.sh/):
```bash
brew install python3
```
Then re-install cocoindex-code (see [Get Started](#get-started--zero-config-lets-go) for install options):
Using pipx:
```bash
pipx install cocoindex-code # first install
pipx upgrade cocoindex-code # upgrade
```
Using uv (install or upgrade):
```bash
uv tool install --upgrade cocoindex-code --prerelease explicit --with "cocoindex>=1.0.0a24"
```
## Legacy: Environment Variables
If you previously configured `cocoindex-code` via environment variables, the `cocoindex-code` MCP command still reads them and auto-migrates to YAML settings on first run. We recommend switching to the YAML settings for new setups.
| Environment Variable | YAML Equivalent |
|---------------------|-----------------|
| `COCOINDEX_CODE_EMBEDDING_MODEL` | `embedding.model` in `global_settings.yml` |
| `COCOINDEX_CODE_DEVICE` | `embedding.device` in `global_settings.yml` |
| `COCOINDEX_CODE_ROOT_PATH` | Run `ccc init` in your project root instead |
| `COCOINDEX_CODE_EXCLUDED_PATTERNS` | `exclude_patterns` in project `settings.yml` |
| `COCOINDEX_CODE_EXTRA_EXTENSIONS` | `include_patterns` + `language_overrides` in project `settings.yml` |
## Large codebase / Enterprise
[CocoIndex](https://github.com/cocoindex-io/cocoindex) is an ultra efficient indexing engine that also works on large codebases at scale for enterprises. In enterprise scenarios it is a lot more efficient to share indexes with teammates when there are large or many repos. We also have advanced features like branch dedupe etc designed for enterprise users.
If you need help with remote setup, please email our maintainer linghua@cocoindex.io, happy to help!
## Contributing
We welcome contributions! Before you start, please install the [pre-commit](https://pre-commit.com/) hooks so that linting, formatting, type checking, and tests run automatically before each commit:
```bash
pip install pre-commit
pre-commit install
```
This catches common issues β trailing whitespace, lint errors (Ruff), type errors (mypy), and test failures β before they reach CI.
For more details, see our [contributing guide](https://cocoindex.io/docs/contributing/guide).
## License
Apache-2.0