{"id":25595800,"url":"https://github.com/duriantaco/pykomodo","last_synced_at":"2025-04-13T01:08:32.550Z","repository":{"id":277700941,"uuid":"924135356","full_name":"duriantaco/pykomodo","owner":"duriantaco","description":"A Python-based parallel file chunking system designed for processing large codebases into LLM-friendly chunks.","archived":false,"fork":false,"pushed_at":"2025-03-23T16:16:42.000Z","size":8978,"stargazers_count":24,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-13T01:08:24.264Z","etag":null,"topics":["chunking","llm","python","python3"],"latest_commit_sha":null,"homepage":"https://pykomodo.readthedocs.io/en/latest/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/duriantaco.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"License","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-29T13:34:17.000Z","updated_at":"2025-04-11T00:00:30.000Z","dependencies_parsed_at":"2025-03-23T15:36:49.135Z","dependency_job_id":null,"html_url":"https://github.com/duriantaco/pykomodo","commit_stats":null,"previous_names":["duriantaco/pykomodo"],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/duriantaco%2Fpykomodo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/duriantaco%2Fpykomodo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/duriantaco%2Fpykomodo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/duriantaco
%2Fpykomodo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/duriantaco","download_url":"https://codeload.github.com/duriantaco/pykomodo/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248650762,"owners_count":21139681,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chunking","llm","python","python3"],"created_at":"2025-02-21T11:34:43.042Z","updated_at":"2025-04-13T01:08:32.545Z","avatar_url":"https://github.com/duriantaco.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/KOMODO.png\" alt=\"KOMODO Logo\" width=\"200\"\u003e\n\u003c/p\u003e\n\nA Python-based parallel file chunking system designed for processing large codebases into LLM-friendly chunks. 
The tool provides intelligent file filtering, multi-threaded processing, and advanced chunking capabilities optimized for machine learning contexts.\n\n## Core Features\n\n* Parallel Processing: Multi-threaded file reading with configurable thread pools\n\n* Smart File Filtering:\n    * Built-in patterns for common excludes (.git, node_modules, __pycache__, etc.)\n    * Customizable ignore/unignore patterns\n    * Intelligent binary file detection\n\n* Flexible Chunking:\n    * Equal-parts chunking: Split content into N equal chunks\n    * Size-based chunking: Split by maximum chunk size\n    * Semantic (AST-based) chunking for Python files\n    * Dry-run mode: List which files **would** be chunked without writing any output\n    * **NEW** Token-based chunking: Split by tokens for LLMs\n\n* LLM Optimizations:\n    * Metadata extraction (functions, classes, imports, docstrings)\n    * Content relevance scoring\n    * Redundancy removal across chunks\n    * Configurable context window sizes\n\n* Chunking PDF Files:\n    * Split PDF content by pages and paragraphs (rather than lines)\n    * Perform basic text cleanup to handle multi-column layouts, or text from HTML-like elements if present\n    * Create multiple chunks for large PDFs while preserving some logical structure\n\n* Secret Redaction: Komodo scans your repos for API keys and automatically redacts them. 
`.env` files are also ignored.\n\n## Installation\n\n```bash\npip install pykomodo==0.1.5\n```\n\nLink to PyPI: https://pypi.org/project/pykomodo/\n\n## Quick Start\n\n### Command Line Usage\n\nHere’s a complete list of available command-line options for the `komodo` tool:\n\n| Option                | Description                                                                                   | Default Value      |\n|-----------------------|-----------------------------------------------------------------------------------------------|--------------------|\n| `--version`           | Show the version of komodo.                                                                   | N/A                |\n| `dirs`                | Directories to process (space-separated; e.g., `komodo dir1/ dir2/`).                         | Current directory (`.`) |\n| `--equal-chunks N`    | Split content into N equal chunks. Mutually exclusive with `--max-chunk-size`.                | None               |\n| `--max-chunk-size M`  | Maximum size per chunk (tokens without `--semantic-chunks`; lines for `.py` with it).         | None               |\n| `--max-tokens N`      | Maximum tokens per chunk (uses token-based chunking).                                         | None               |\n| `--output-dir DIR`    | Directory where chunk files are saved.                                                        | `\"chunks\"`         |\n| `--ignore PATTERN`    | Add a pattern to ignore (repeatable, e.g., `--ignore \"*.log\"`).                               | None               |\n| `--unignore PATTERN`  | Add a pattern to unignore (repeatable, overrides ignores).                                    | None               |\n| `--dry-run`           | List files that would be processed without creating chunks.                                   | False              |\n| `--priority PATTERN,SCORE` | Set priority for file patterns (repeatable, e.g., `--priority \"*.py,10\"`).                
| None               |\n| `--num-threads N`     | Number of threads for parallel processing.                                                    | 4                  |\n| `--enhanced`          | Use `EnhancedParallelChunker` for LLM optimizations.                                          | False              |\n| `--semantic-chunks`   | Enable AST-based chunking for `.py` files (splits by functions/classes).                      | False              |\n| `--context-window N`  | Target LLM context window size in bytes (used with `--enhanced`).                             | 4096               |\n| `--min-relevance F`   | Minimum relevance score for chunks (0.0-1.0, used with `--enhanced`).                         | 0.3                |\n| `--no-metadata`       | Disable metadata extraction (used with `--enhanced`).                                         | False (metadata enabled) |\n| `--keep-redundant`    | Keep redundant content across chunks (used with `--enhanced`).                                | False (removes redundancy) |\n| `--no-summaries`      | Disable summary generation (used with `--enhanced`; currently a placeholder in code).          | False (summaries enabled) |\n| `--file-type TYPE`    | Only process files of this extension (e.g., `pdf`, `py`).                                     | None               |\n\n**Notes:**\n- Options like `--equal-chunks` and `--max-chunk-size` cannot be used together (enforced by the CLI).\n- Use `--dry-run` to test your ignore/unignore patterns or priority rules without generating output.\n\n#### Basic usage \n\n```bash\n# Split into 5 equal chunks\nkomodo . 
--equal-chunks 5\n\n# Process multiple directories\nkomodo path1/ path2/ --max-chunk-size 1000\n```\n\n#### Chunking Modes\n\nKomodo offers flexible chunking strategies, with behavior varying based on options and the chunker type (`ParallelChunker` or `EnhancedParallelChunker` with `--enhanced`).\n\n- **Fixed Number of Chunks (`--equal-chunks N`)**:\n  - **Base Chunker**: Keeps files whole, distributing them into N chunks with approximately equal total character counts. For example, `--equal-chunks 5` produces 5 chunk files.\n    ```bash\n    komodo . --equal-chunks 5 --output-dir chunks\n    ```\n\n  - **Enhanced Chunker**: Combines all file contents into one text blob, then splits into N chunks of roughly equal byte size, potentially splitting files mid-content.\n    ```bash\n    komodo . --equal-chunks 5 --enhanced\n    ```\n\n- **Fixed Size Chunks (`--max-chunk-size M`)**:\n\n  Without `--semantic-chunks`, each file is split into chunks of at most M tokens (words), keeping lines whole. For example, `--max-chunk-size 2000` produces as many chunks as needed, each holding at most 2000 tokens.\n\n  ```bash\n  komodo . --max-chunk-size 2000\n  ```\n\n  **Important: You must specify either `--equal-chunks` or `--max-chunk-size`, but not both.**\n\n\n- **With `--semantic-chunks`**:\n\n  * For `.py` files: Aims for chunks of M lines, grouping top-level functions/classes as atomic units. If a function exceeds M lines, it becomes a single chunk.\n  * For non-`.py` files: Still splits by M tokens.\n\n  ```bash\n  komodo . --max-chunk-size 200 --semantic-chunks\n  ```\n\n- **NEW** **With `--max-tokens`**:\n\n  ```bash\n  komodo . 
--max-tokens 1000 --output-dir chunks\n  ```\n\n  * Precise token limits: Chunks content based on token counts rather than line counts\n  * Tiktoken integration: Uses OpenAI's tiktoken library when available for accurate LLM token counting\n  * Fallback tokenization: Falls back to word-based splitting when tiktoken is unavailable\n\n- **PDF Chunking**:\n\n  Uses PyMuPDF to split PDFs by pages and paragraphs, respecting `--max-chunk-size` in tokens.\n\n    ```bash\n    komodo . --max-chunk-size 500 --output-dir /path/to/output --file-type pdf\n    ```\n\n    or\n\n    ```bash\n    komodo . --equal-chunks 10 --output-dir /path/to/output --file-type pdf\n    ```\n\n    **IMPORTANT: This PDF chunker will NOT WORK for PDFs that consist mostly of images. It currently cannot chunk formulas or images.**\n\n#### Ignoring \u0026 Unignoring Files\n\n* Add ignore patterns with `--ignore`.\n* Unignore specific patterns with `--unignore`.\n* Komodo also has built-in ignores like `.git`, `__pycache__`, `node_modules`, etc.\n\n  ```bash\n  # Skip everything in \"results/\" (relative) and \"docs/\" (relative)\n  komodo . --equal-chunks 5 \\\n    --ignore \"results/**\" \\\n    --ignore \"docs/**\"\n\n  # Skip an absolute path\n  komodo . --equal-chunks 5 \\\n    --ignore \"/Users/oha/komodo/results/**\"\n\n  # Skip all .rst files, but unignore README.rst\n  komodo . --equal-chunks 5 \\\n    --ignore \"*.rst\" \\\n    --unignore \"README.rst\"\n  ```\n\n  **Note: If `node_modules` fails to be ignored, run this command instead: `komodo . --equal-chunks 5 --file-type js --ignore \"**/node_modules/**\"`. The key here is that you are specifying the file type.**\n\n  ##### Safest (Recursive) Ignoring\n\n  If you want to ensure that Komodo skips all files inside a particular directory (including all subfolders), you can use the `**` wildcard before and after the folder name:\n\n    ```bash\n    # safest mode: skip everything in \"results/\" and \"docs/\" recursively\n    komodo . 
--equal-chunks 5 \\\n      --ignore \"**/results/**\" \\\n      --ignore \"**/docs/**\"\n    ```\n\n  **Pro Tip:** If in doubt, use `**/folder/**` to recursively ignore that folder and everything beneath it. This is the most reliable way to avoid processing unwanted files in subdirectories.\n\n  ##### Fixed Number of Chunks with ignore mode\n\n  * `--ignore \"/Users/oha/treeline/results/**\"` tells the chunker to skip any files in that absolute directory path.\n  * `--ignore \"docs/*\"` tells it to skip files directly inside a relative folder named `docs/` (use `docs/**` to also cover deeper subdirectories).\n\n    ```bash\n    komodo . --equal-chunks 5 --ignore \"/Users/oha/treeline/results/**\" --ignore \"docs/*\"\n    ```\n\n  ##### Priority Rules\n\n  Priority rules determine which files are processed first or given more importance. Files with higher priority scores are processed first.\n\n  ```bash\n  # With equal chunks: *.py files (priority 10) are processed before *.md files (priority 5)\n  komodo . \\\n    --equal-chunks 5 \\\n    --priority \"*.py,10\" \\\n    --priority \"*.md,5\" \\\n    --output-dir chunks\n\n  # Or with max chunk size\n  komodo . \\\n    --max-chunk-size 1000 \\\n    --priority \"*.py,10\" \\\n    --priority \"*.md,5\" \\\n    --output-dir chunks\n  ```\n\n#### LLM Optimization Options\nEnable metadata extraction and content optimization:\n\n```bash\nkomodo . \\\n  --equal-chunks 5 \\\n  --enhanced \\\n  --context-window 4096 \\\n  --min-relevance 0.3\n```\n\n```bash\nkomodo . \\\n  --equal-chunks 5 \\\n  --enhanced \\\n  --keep-redundant \\\n  --min-relevance 0.5\n```\n\n```bash\nkomodo . \\\n  --equal-chunks 5 \\\n  --enhanced \\\n  --no-metadata \\\n  --context-window 8192\n```\n\n### Dry Run\n\nIf you only want to see which files **would** be chunked (and in what priority order), without actually writing any output chunks, you can specify `--dry-run`. This is especially helpful if you’re testing new ignore/unignore patterns or priority rules. 
Note that **no chunking** is performed in this mode; it only lists the files that would be chunked.\n\nExample:\n\n```bash\n# Plain dry run\nkomodo . --equal-chunks 5 --dry-run\n\n# With priorities: .py files are listed first. This is still a dry run, so nothing is written\nkomodo . --equal-chunks 5 --dry-run \\\n    --priority \"*.py,10\" \\\n    --priority \"*.md,5\"\n```\n\nNo chunks are created. Komodo simply prints the files that would be processed, sorted by priority. This is an easy way to confirm your ignore patterns and see exactly which files the chunker will pick up.\n\n### Python API Usage\n\nBasic usage:\n\n```python\nfrom pykomodo.multi_dirs_chunker import ParallelChunker\n\n# Split into 5 equal chunks\nchunker = ParallelChunker(\n    equal_chunks=5,\n    output_dir=\"chunks\"\n)\nchunker.process_directory(\"path/to/code\")\n```\n\nAdvanced configuration:\n\n```python\nfrom pykomodo.multi_dirs_chunker import ParallelChunker\n\nchunker = ParallelChunker(\n    equal_chunks=5,  # or max_chunk_size=1000\n\n    # Ignore/unignore patterns and extensions treated as binary\n    user_ignore=[\"*.log\", \"node_modules/**\"],\n    user_unignore=[\"important.log\"],\n    binary_extensions=[\"exe\", \"dll\", \"so\", \"bin\"],\n\n    # Higher scores are processed first\n    priority_rules=[\n        (\"*.py\", 10),\n        (\"*.md\", 5),\n        (\"*.txt\", 1)\n    ],\n\n    output_dir=\"chunks\",\n    num_threads=4\n)\n\nchunker.process_directories([\"src/\", \"docs/\", \"tests/\"])\n```\n\nBasic configuration with `file_type`:\n```python\nimport os\nfrom pykomodo.multi_dirs_chunker import ParallelChunker\n\n# Make sure the output directory exists before chunking\noutput_dir = \"/Users/test/komodo/pdf\"\nos.makedirs(output_dir, exist_ok=True)\n\nchunker = ParallelChunker(\n    max_chunk_size=1000,\n    output_dir=output_dir,\n    file_type=\"pdf\"\n)\n\nchunker.process_directory(\"/Users/test/komodo/\")\n\nprint(\"PDF processing completed successfully!\")\n```\n\n## Advanced LLM Features\n\n### Metadata Extraction\nThe following metadata is automatically extracted and included with each chunk:\n- Function definitions\n- Class declarations\n- Import statements\n- Docstrings\n\n### Relevance 
Scoring\nChunks are scored based on:\n- Code/comment ratio\n- Function/class density\n- Documentation quality\n- Import significance\n\n### Redundancy Removal\nAutomatically removes duplicate content across chunks while preserving unique context.\n\nExample with LLM optimizations:\n\n```python\nfrom pykomodo.enhanced_chunker import EnhancedParallelChunker\n\n# The LLM optimization options belong to the enhanced chunker (the CLI's --enhanced mode)\nchunker = EnhancedParallelChunker(\n    equal_chunks=5,\n    extract_metadata=True,\n    remove_redundancy=True,\n    context_window=4096,\n    min_relevance_score=0.3\n)\n```\n\n### File Type Restriction\n\nThe `file_type` parameter of the `ParallelChunker` constructor restricts processing to files with the given extension.\n\n```python\nimport os\nfrom pykomodo.multi_dirs_chunker import ParallelChunker\n\n# Make sure the output directory exists before chunking\noutput_dir = \"/path/to/dir\"\nos.makedirs(output_dir, exist_ok=True)\n\nchunker = ParallelChunker(\n    max_chunk_size=1000,\n    output_dir=output_dir,\n    file_type=\"pdf\"\n)\n\nchunker.process_directory(\"/path/to/dir\")\n\nprint(\"PDF processing completed successfully!\")\n```\n\n### Typed Classes \u0026 Pydantic-Based Configuration\n\nKomodo’s main classes (`ParallelChunker`, `EnhancedParallelChunker`, etc.) now include **type hints**. Nothing changes at runtime, but if you’re using an IDE or a type checker like `mypy`, you should get improved error checking and auto-completion.\n\nYou can also use **Pydantic** to configure Komodo with strongly typed settings. 
For instance:\n\n```python\nfrom pydantic import BaseModel, Field\nfrom typing import List, Optional\nfrom pykomodo.multi_dirs_chunker import ParallelChunker\nfrom pykomodo.enhanced_chunker import EnhancedParallelChunker\n\nclass KomodoConfig(BaseModel):\n    directories: List[str] = Field(default_factory=lambda: [\".\"], description=\"Directories to process.\")\n    equal_chunks: Optional[int] = None\n    max_chunk_size: Optional[int] = None\n    output_dir: str = \"chunks\"\n    semantic_chunking: bool = False\n    enhanced: bool = False\n    context_window: int = 4096\n    min_relevance_score: float = 0.3\n    remove_redundancy: bool = True\n    extract_metadata: bool = True\n\ndef run_chunker_with_config(config: KomodoConfig):\n    ChunkerClass = EnhancedParallelChunker if config.enhanced else ParallelChunker\n\n    chunker = ChunkerClass(\n        equal_chunks=config.equal_chunks,\n        max_chunk_size=config.max_chunk_size,\n        output_dir=config.output_dir,\n        semantic_chunking=config.semantic_chunking,\n        context_window=config.context_window if config.enhanced else None,\n        min_relevance_score=config.min_relevance_score if config.enhanced else None,\n        remove_redundancy=config.remove_redundancy if config.enhanced else None,\n        extract_metadata=config.extract_metadata if config.enhanced else None,\n    )\n\n    chunker.process_directories(config.directories)\n    chunker.close()\n\nif __name__ == \"__main__\":\n    # example use with typed + validated config\n    cfg = KomodoConfig(directories=[\"src/\", \"docs/\"], equal_chunks=5, enhanced=True)\n    run_chunker_with_config(cfg)\n```\n\n## Common Use Cases\n\n### 1. 
Preparing Context for LLMs\n\nSplit a large codebase into equal chunks suitable for LLM context windows:\n\n```python\nfrom pykomodo.multi_dirs_chunker import ParallelChunker\n\nchunker = ParallelChunker(\n    equal_chunks=5,\n    priority_rules=[\n        (\"*.py\", 10),\n        (\"README*\", 8),\n    ],\n    user_ignore=[\"tests/**\", \"**/__pycache__/**\"],\n    output_dir=\"llm_chunks\"\n)\nchunker.process_directory(\"my_project\")\n```\n\n## Built-in Ignore Patterns\n\nThe chunker automatically ignores common non-text and build-related files:\n\n- `**/.git/**`\n- `**/.idea/**`\n- `__pycache__`\n- `*.pyc`\n- `*.pyo`\n- `**/node_modules/**`\n- `target`\n- `venv`\n\n## Common Gotchas\n\n1. Leading Slash for Absolute Paths\n\n   * If you omit the leading `/` in a pattern like `/Users/oha/...`, Komodo treats it as relative and won’t match your actual absolute path.\n\n2. `/**` vs. `/*`\n\n   * `folder/**` matches all files and subfolders under `folder`.\n   * `folder/*` only matches the immediate contents of `folder`, not deeper subdirectories.\n   * Overwriting Multiple `--ignore` Flags\n\n3. Folder Name vs. Actual Path\n\n   * If your path is really `src/komodo/content/results`, but you only wrote `results/**`, you may need the double-star form (`**/results/**`) to cover deeper paths.\n\n## Acknowledgments\nThis project was inspired by [repomix](https://github.com/yamadashy/repomix), a repository content chunking tool.\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n## License\n\nApache 2.0","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fduriantaco%2Fpykomodo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fduriantaco%2Fpykomodo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fduriantaco%2Fpykomodo/lists"}